BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling introduces a new way to test how well artificial intelligence can modify complex building designs. While many AI tools can generate simple 3D shapes from scratch, real-world engineering requires modifying existing, highly structured models without breaking their internal logic. This benchmark evaluates whether Large Language Models (LLMs) can perform precise, context-aware edits on Industry Foundation Classes (IFC) files—the standard format used in architecture and construction—while maintaining the integrity of the building's geometry, semantic properties, and structural relationships.
Testing Real-World Engineering Tasks
The benchmark consists of 324 distinct editing tasks, ranging from simple synthetic scenes to complex, realistic building models. These tasks are designed to mimic actual engineering workflows, where an AI must handle "create," "update," and "delete" operations. Crucially, the instructions provided to the AI vary in their specificity: some are direct, while others require the model to infer the target element through spatial context (such as position or distance) or topological relationships (such as connectivity or containment). This allows researchers to see if an AI can truly "understand" a building model rather than just following explicit, step-by-step commands.
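The task taxonomy described above can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, and example instructions are hypothetical and are not taken from the benchmark's actual task format.

```python
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"


class Reference(Enum):
    DIRECT = "direct"            # target element is named explicitly
    SPATIAL = "spatial"          # target inferred from position or distance
    TOPOLOGICAL = "topological"  # target inferred from connectivity or containment


@dataclass(frozen=True)
class EditTask:
    # Hypothetical task record; the real benchmark schema may differ.
    task_id: str
    operation: Operation
    reference: Reference
    instruction: str


def group_by_category(tasks):
    """Count tasks per (operation, reference) pair."""
    counts = {}
    for task in tasks:
        key = (task.operation, task.reference)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Invented example instructions, one per reference style.
tasks = [
    EditTask("t1", Operation.UPDATE, Reference.DIRECT,
             "Set the height of wall W-01 to 3.0 m."),
    EditTask("t2", Operation.DELETE, Reference.SPATIAL,
             "Remove the window nearest the north-east corner."),
    EditTask("t3", Operation.CREATE, Reference.TOPOLOGICAL,
             "Add a door to the wall that separates the kitchen from the hall."),
]
counts = group_by_category(tasks)
```

Crossing the operation with the reference style makes it possible to report results per category, for example to see whether a model handles direct updates well but fails on topologically specified deletions.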
A Three-Dimensional Evaluation
Because a building model is more than just a visual shape, the researchers developed a custom evaluation system that looks beyond simple geometry. Each edit is scored across three critical dimensions:
Geometric Accuracy: Does the change produce the correct physical shape and position?
Semantic Validity: Does the edited object retain the correct classification (e.g., is a door still a door, not a window) and appropriate properties?
Topological Consistency: Does the edit preserve the necessary connections and relationships between building components, such as ensuring a wall opening correctly hosts a window?
By combining these metrics, the benchmark ensures that an AI’s output is not just visually plausible, but also structurally and functionally sound for engineering purposes.
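To make the three dimensions concrete, here is a simplified sketch of how one edited element might be checked against a reference answer. It is an illustration under stated assumptions, not the benchmark's scoring code: the `Element` record, the pass/fail checks, the 1 cm tolerance, and the unweighted mean are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Element:
    # Hypothetical, heavily simplified stand-in for an IFC element.
    ifc_class: str                        # e.g. "IfcWindow"
    position: Tuple[float, float, float]  # placement origin in metres
    host_id: Optional[str] = None         # id of the element that hosts this one


def score_edit(pred: Element, ref: Element, tol: float = 0.01) -> dict:
    """Score one predicted element against the reference on three dimensions.

    Each dimension is a pass/fail check here (an assumption); the combined
    score is their unweighted mean (also an assumption).
    """
    geometry = 1.0 if math.dist(pred.position, ref.position) <= tol else 0.0
    semantic = 1.0 if pred.ifc_class == ref.ifc_class else 0.0
    topology = 1.0 if pred.host_id == ref.host_id else 0.0
    return {
        "geometry": geometry,
        "semantic": semantic,
        "topology": topology,
        "mean": (geometry + semantic + topology) / 3,
    }


# A window placed correctly and typed correctly, but no longer hosted
# by its wall opening.
reference = Element("IfcWindow", (2.0, 0.0, 1.0), host_id="opening-1")
predicted = Element("IfcWindow", (2.0, 0.0, 1.0), host_id=None)
result = score_edit(predicted, reference)
```

In this example the window would look correct in a viewer, yet it fails the topological check because the hosting relationship was lost, which is the kind of error a geometry-only metric would miss.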
Current Limitations of AI
The results of the study reveal a significant gap between current AI capabilities and the requirements of professional engineering. Across the seven leading LLMs tested, the best-performing model achieved an average score of only 49.5%, and no model fully solved more than 3.4% of the tasks. The data suggests that while AI models are becoming better at approximating the geometry of a design, they frequently struggle to maintain the complex semantic and relational rules required for valid Building Information Modeling. These findings highlight that current LLMs still require substantial improvements before they can be reliably used in structured, collaborative engineering design workflows.
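The gap between an average score near 50% and a full-solve rate of a few percent follows from how the two numbers are computed. The sketch below uses invented per-task scores, not data from the study, and assumes that "fully solved" means a task score of exactly 1.0.

```python
def summarize(task_scores):
    """Return (average score, fraction of tasks fully solved).

    Assumes each score lies in [0, 1] and that a task counts as fully
    solved only when its score is exactly 1.0.
    """
    n = len(task_scores)
    average = sum(task_scores) / n
    solved = sum(1 for s in task_scores if s == 1.0) / n
    return average, solved


# Invented scores: partial credit on most tasks, one task fully solved.
scores = [1.0, 0.5, 0.5, 0.25, 0.25]
average, solved = summarize(scores)
```

Partial credit on many tasks raises the average, while the full-solve rate only counts tasks where every check passes, so the second number can be far lower than the first.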