BIM-Edit: Benchmarking Large Language Models for IF...

Key Takeaways

Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions.
In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations.
However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness.
We introduce BIM-Edit, a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) represented in the Industry Foundation Classes (IFC) format.

Paper Abstract

Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions. In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations. However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness. We introduce BIM-Edit, a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) represented in the Industry Foundation Classes (IFC) format. BIM provides a challenging testbed because building models encode geometry together with semantic and relational structure. BIM-Edit contains 324 editing tasks spanning 11 realistic building models and 36 synthetic scenes. Tasks are expressed using three instruction categories (direct, spatial, and topological), covering both explicit and scene-grounded edits. We evaluate outputs along three dimensions: geometric accuracy, semantic validity, and topological consistency. Across evaluated LLMs, the best-performing model achieves only 49.5% average score across the three metrics, and no model fully solves more than 3.4% of tasks. These results demonstrate a substantial gap between current LLM capabilities and the requirements of structured engineering design workflows.

BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling introduces a benchmark that tests how well large language models (LLMs) can modify complex building designs. While many AI tools can generate simple 3D shapes from scratch, real-world engineering requires modifying existing, highly structured models without breaking their internal logic. The benchmark evaluates whether LLMs can perform precise, context-aware edits on Industry Foundation Classes (IFC) files, the standard format used in architecture and construction, while maintaining the integrity of the building's geometry, semantic properties, and structural relationships.

Testing Real-World Engineering Tasks

The benchmark consists of 324 distinct editing tasks spanning 36 synthetic scenes and 11 realistic building models. These tasks are designed to mimic actual engineering workflows, where an AI must handle "create," "update," and "delete" operations. Crucially, the instructions fall into three categories that differ in how the target element is identified: direct instructions name it explicitly, spatial instructions require the model to infer it from position or distance, and topological instructions require the model to infer it from relationships such as connectivity or containment. This allows researchers to see whether an AI can truly "understand" a building model rather than just follow explicit, step-by-step commands.
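
To make the three instruction categories concrete, the following is a minimal sketch of how a target element might be resolved in each case. It is not the benchmark's implementation: the `Element` record, the toy `SCENE`, and the three `resolve_*` helpers are all hypothetical names introduced for illustration, and a real IFC model stores placement and containment through dedicated entities rather than flat fields.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Element:
    """Hypothetical, simplified stand-in for an IFC building element."""
    guid: str        # stand-in for an IFC GlobalId
    ifc_class: str   # e.g. "IfcDoor"
    position: tuple  # (x, y, z) in metres
    space: str       # name of the containing space


# Toy scene, invented for this example.
SCENE = [
    Element("d1", "IfcDoor", (0.0, 0.0, 0.0), "Kitchen"),
    Element("d2", "IfcDoor", (5.0, 0.0, 0.0), "Office"),
    Element("w1", "IfcWindow", (5.0, 3.0, 1.0), "Office"),
]


def resolve_direct(scene, guid):
    """Direct: the instruction names the element explicitly."""
    return next(e for e in scene if e.guid == guid)


def resolve_spatial(scene, ifc_class, near):
    """Spatial: pick the element of a class closest to a reference point."""
    candidates = [e for e in scene if e.ifc_class == ifc_class]
    return min(candidates, key=lambda e: dist(e.position, near))


def resolve_topological(scene, ifc_class, space):
    """Topological: pick elements of a class contained in a named space."""
    return [e for e in scene if e.ifc_class == ifc_class and e.space == space]
```

For example, an instruction such as "delete the door nearest to (4, 0, 0)" would correspond to `resolve_spatial(SCENE, "IfcDoor", (4.0, 0.0, 0.0))`, whereas "update the windows in the Office" would correspond to `resolve_topological(SCENE, "IfcWindow", "Office")`. In the benchmark the model must perform this grounding itself from the IFC content.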

A Three-Dimensional Evaluation

Because a building model is more than just a visual shape, the researchers developed a custom evaluation system that looks beyond simple geometry. Each edit is scored across three critical dimensions:

Geometric Accuracy: Does the change produce the correct physical shape and position?
Semantic Validity: Does the edited object retain the correct classification (e.g., is a door still a door, not a window) and appropriate properties?
Topological Consistency: Does the edit preserve the necessary connections and relationships between building components, such as ensuring a wall opening correctly hosts a window?

By combining these metrics, the benchmark ensures that an AI’s output is not just visually plausible, but also structurally and functionally sound for engineering purposes.
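
The three dimensions can be illustrated with a small scoring sketch. The summary does not give the paper's actual metric definitions, so everything below is an assumption made for illustration: the dictionary layout, the tolerance-based geometric check, the field-matching semantic check, and the Jaccard overlap used for topology. The relation labels are informal; in IFC, a window filling an opening is expressed through `IfcRelFillsElement`.

```python
from math import dist


def geometric_score(pred, ref, tol=0.01):
    """1.0 if the predicted position is within `tol` metres of the reference."""
    return 1.0 if dist(pred["position"], ref["position"]) <= tol else 0.0


def semantic_score(pred, ref):
    """Fraction of reference fields (class plus properties) reproduced exactly."""
    expected = {"ifc_class": ref["ifc_class"], **ref["properties"]}
    actual = {"ifc_class": pred["ifc_class"], **pred["properties"]}
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)


def topological_score(pred, ref):
    """Jaccard overlap between predicted and reference relation sets."""
    a, b = set(pred["relations"]), set(ref["relations"])
    return len(a & b) / len(a | b) if a | b else 1.0


# Reference result of a hypothetical "move the window" task.
reference = {
    "ifc_class": "IfcWindow",
    "position": (5.0, 3.0, 1.0),
    "properties": {"FireRating": "EI30"},
    "relations": [("fills", "opening-1"), ("contained_in", "Office")],
}

# Hypothetical model output: right place and class, but the window
# is no longer attached to its opening.
prediction = {
    "ifc_class": "IfcWindow",
    "position": (5.0, 3.0, 1.0),
    "properties": {"FireRating": "EI30"},
    "relations": [("contained_in", "Office")],
}

scores = (
    geometric_score(prediction, reference),
    semantic_score(prediction, reference),
    topological_score(prediction, reference),
)
print(scores)  # (1.0, 1.0, 0.5)
```

The example shows the kind of failure a geometry-only metric would miss: the window sits in the right place with the right class, yet it has lost the relation to its host opening.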

Current Limitations of AI

The results of the study reveal a significant gap between current AI capabilities and the requirements of professional engineering. Across seven leading LLMs tested, the best-performing model achieved an average score of only 49.5%. Furthermore, no model was able to fully solve more than 3.4% of the tasks. The data suggests that while AI models are becoming better at approximating the geometry of a design, they frequently struggle to maintain the complex semantic and relational rules required for valid Building Information Modeling. These findings highlight that current LLMs still require substantial improvements before they can be reliably used in structured, collaborative engineering design workflows.
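
The distance between a 49.5% average score and a 3.4% fully-solved rate is easier to see with a small aggregation sketch. It assumes, as an illustration only, that the average score is the mean of the three per-task metrics and that a task counts as fully solved when all three metrics are perfect; the `summarize` helper and the scores below are invented and are not results from the paper.

```python
from statistics import mean


def summarize(results):
    """Return (average score, fully-solved rate) for per-task score triples.

    Each entry is (geometric, semantic, topological), each in [0, 1].
    """
    average = mean(mean(task) for task in results)
    solved = sum(1 for task in results if all(s == 1.0 for s in task))
    return average, solved / len(results)


# Invented scores for four tasks; only the first is perfect on every metric.
toy_results = [
    (1.0, 1.0, 1.0),
    (1.0, 0.5, 0.0),
    (0.5, 0.5, 0.5),
    (1.0, 0.0, 0.5),
]

average, solved_rate = summarize(toy_results)
print(average, solved_rate)  # 0.625 0.25
```

Partial credit on individual metrics raises the average, while the fully-solved rate only counts tasks with no error in any dimension, which is why the two figures can diverge so sharply.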
