BenchCAD: A Comprehensive, Industry-Standard Benchm...

Key Takeaways

BenchCAD is a new benchmark designed to evaluate how well artificial intelligence models can generate and edit industrial Computer-Aided Design (CAD) program...
Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs.
Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured.
Despite the promise of Multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings.
We present BenchCAD, a unified benchmark for industrial CAD reasoning.

Paper Abstract

Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of Multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.

BenchCAD is a new benchmark designed to evaluate how well artificial intelligence models can generate and edit industrial Computer-Aided Design (CAD) programs. While many AI models can create simple 3D shapes, they often struggle to produce the precise, parametric code required for real-world manufacturing. BenchCAD provides a standardized way to test whether models truly understand the engineering logic, design parameters, and specific CAD operations—such as helical sweeps or lofts—that are essential for creating functional, editable industrial parts.

A Standardized Test for Industrial CAD

The benchmark consists of 17,900 execution-verified CadQuery programs covering 106 industrial part families, including gears, springs, and fasteners. Nearly half of these designs are anchored to official engineering standards (such as ISO, DIN, and ASME), so models are tested against real-world requirements rather than arbitrary shapes. Because the parts are organized into a clear taxonomy, the researchers can measure a model's performance at different levels of complexity, from basic visual recognition to advanced spatial and code-based reasoning.
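Organizing programs by part family also makes it possible to test generalization to families a model has never seen. The following is a minimal sketch of one way to build such a split, not the benchmark's own procedure; the record layout (a dict with a `family` key), the function name `split_by_family`, and the 20% hold-out fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_family(records, holdout_fraction=0.2, seed=0):
    """Split records so that held-out families never appear in training.

    `records` is a list of dicts with a hypothetical "family" key.
    Returns (train_records, heldout_records, heldout_family_names).
    """
    by_family = defaultdict(list)
    for rec in records:
        by_family[rec["family"]].append(rec)

    # Sort before shuffling so the split is reproducible for a given seed.
    families = sorted(by_family)
    random.Random(seed).shuffle(families)

    n_holdout = max(1, round(len(families) * holdout_fraction))
    heldout = families[:n_holdout]
    kept = families[n_holdout:]

    train = [rec for fam in kept for rec in by_family[fam]]
    test = [rec for fam in heldout for rec in by_family[fam]]
    return train, test, set(heldout)


# Toy data: five families with two programs each (names are illustrative).
toy = [
    {"family": fam, "program_id": f"{fam}-{i}"}
    for fam in ["bevel_gear", "compression_spring", "twist_drill", "flange", "bracket"]
    for i in range(2)
]
train, test, heldout = split_by_family(toy)
print(len(train), len(test), len(heldout))  # 8 2 1
```

Splitting at the family level, rather than at the program level, keeps near-duplicate parametric variants of the same part out of the evaluation set.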

Evaluating Four Key CAD Capabilities

BenchCAD assesses models through four distinct tasks that isolate specific skills:

Vision2Code: Generating executable CAD code from multi-view images of a part.
Code Edit: Modifying an existing CAD program based on natural language instructions.
Vision QA: Answering design-related questions by looking at rendered images.
Code QA: Answering design-related questions by analyzing the underlying CAD code.
By comparing performance across these tasks, the researchers can pinpoint exactly where a model fails. For example, if a model performs well on Code QA but poorly on Vision QA, it suggests the model struggles with visual perception rather than the logic of the CAD operations themselves.
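The cross-task comparison described above can be sketched as a small rule-based check. This is an illustration under stated assumptions, not the paper's analysis method: the score keys, the function name `diagnose`, the 0.15 gap threshold, and the rule that flags program synthesis are all hypothetical.

```python
def diagnose(scores, gap=0.15):
    """Suggest likely weak capabilities from per-task scores in [0, 1].

    `scores` uses hypothetical keys: "vision_qa", "code_qa", "vision2code".
    `gap` is an arbitrary threshold chosen for illustration only.
    """
    findings = []

    # Good at reading code but poor at reading images: perception is suspect.
    if scores["code_qa"] - scores["vision_qa"] >= gap:
        findings.append("visual perception")

    # Good at reading images but poor at reading code: CAD logic is suspect.
    if scores["vision_qa"] - scores["code_qa"] >= gap:
        findings.append("CAD code understanding")

    # Assumption: if both QA tasks clearly beat generation, the model
    # understands parts better than it can write programs for them.
    if min(scores["vision_qa"], scores["code_qa"]) - scores["vision2code"] >= gap:
        findings.append("program synthesis")

    return findings


example = {"vision_qa": 0.5, "code_qa": 0.8, "vision2code": 0.3}
print(diagnose(example))  # ['visual perception', 'program synthesis']
```

A fixed threshold is crude; in practice the gap would need to be judged against the score variance of each task.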

Current Limitations in AI CAD Generation

Testing more than 10 frontier models revealed a significant gap between "looking" like a CAD part and "being" a functional one. Many systems recover the coarse outer shape of an object but fail to use the correct engineering operations: instead of sweeps, lofts, or twist-extrudes, they fall back on simpler sketch-and-extrude patterns. They also miss fine 3D structure and misinterpret industrial design parameters. Fine-tuning and reinforcement learning improve performance on part families seen during training, but generalization to unseen part families remains limited, indicating that current systems have not yet mastered the underlying principles of parametric engineering.
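One way to detect the operation substitution described above is to compare which CadQuery methods a reference program and a generated program call. The sketch below does this statically with Python's `ast` module, so the programs are parsed but never executed. It is not the benchmark's scoring code: the helper names and the choice of which methods count as "advanced" are assumptions, while `sweep`, `loft`, `twistExtrude`, and `extrude` are existing CadQuery `Workplane` methods.

```python
import ast

# Assumption: these Workplane methods stand in for the "essential operations".
ADVANCED_OPS = {"sweep", "loft", "twistExtrude"}


def called_methods(source):
    """Return the names of all methods called as `obj.method(...)` in `source`."""
    tree = ast.parse(source)
    return {
        node.func.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    }


def missing_advanced_ops(reference_src, generated_src):
    """Advanced operations the reference uses but the generated program lacks."""
    required = called_methods(reference_src) & ADVANCED_OPS
    return required - called_methods(generated_src)


# Toy programs: the reference twists the profile, the generated one does not.
reference = '''
import cadquery as cq
result = cq.Workplane("XY").rect(4, 1).twistExtrude(20, 180)
'''
generated = '''
import cadquery as cq
result = cq.Workplane("XY").rect(4, 1).extrude(20)
'''

print(missing_advanced_ops(reference, generated))  # {'twistExtrude'}
```

A check like this only flags missing operations. It says nothing about whether the resulting geometry matches, so it would complement a geometric comparison rather than replace one.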

Why This Matters for Automation

The results suggest that current AI models are not yet ready for professional industrial use. The benchmark highlights that even the most advanced models often miss fine 3D structural details and struggle to maintain consistency when editing existing designs. By providing a rigorous, capability-decomposed evaluation framework, BenchCAD serves as a roadmap for researchers to improve the industrial readiness of AI, moving the field toward tools that can reliably assist in the complex, constraint-heavy world of mechanical engineering.