J. Jang, H. Jiang
StoryGold,
United States
Keywords: Computer-Aided Design (CAD); design-to-manufacturing; multimodal AI; open-source
Summary:
Computer-Aided Design (CAD) underpins nearly all modern product development, yet it remains a slow, expert-dependent, and costly bottleneck in the design-to-manufacturing process. Automating the translation of design intent into CAD with Artificial Intelligence (AI) has the potential to significantly increase industrial and manufacturing productivity and market participation (particularly for specialized and customized products) while reducing overall costs. However, current AI-driven CAD approaches lack effective representations for connecting multimodal design intent to the underlying logic of parametric CAD, limiting the utility of existing models in real-world design workflows. StoryGold is developing BELLA (Blueprint Extraction Large Learning Agent), an open-source, foundation-scale vision-language model capable of generating fully editable, parametric 3D CAD from multimodal inputs.
Most existing AI-driven CAD systems collapse the parametric structure, constraints, and feature relationships that define actual 3D CAD models into static geometric representations such as meshes or voxels. While these geometric outputs can be visually plausible, they lack the scalability (e.g., resolution) and editability of parametric CAD, and they cannot support real-world manufacturing workflows that require design iteration and fine-scale geometric precision. In contrast, BELLA outputs executable Python code (using the CadQuery library) that produces parametric 3D CAD models. Our code-based approach preserves the full feature trees and parametric histories that typify professional CAD workflows, allowing design outputs to be precisely edited through simple parameter changes rather than complete regeneration. Toward advancing BELLA's development, we report here the outcomes of building a pipeline that replicates and extends cadrille, the current state-of-the-art CadQuery-generating model.
cadrille is a multimodal vision-language model built on Qwen2-VL-2B that processes three input modalities: point clouds (256 points, projected through a single linear layer), multi-view images (four views arranged in a 2x2 grid, processed by the native vision encoder), and text descriptions. It outputs executable Python code (CadQuery) for parametric CAD generation. Our pipeline followed a three-stage training paradigm: (1) unsupervised pre-training (starting from Qwen2-VL-2B), (2) supervised fine-tuning (SFT) on over one million synthetically generated samples, and (3) online reinforcement learning (RL) fine-tuning on mesh (STL) data without CadQuery or any other parametric CAD labels; this online RL approach outperforms offline RL alternatives.
Building on this baseline, StoryGold extended the input space with two additional modalities, depth maps and normal maps, for a total of five. Under SFT, StoryGold's extended model outperformed the cadrille SFT baseline in reconstruction quality. In addition, StoryGold improved the RL objective by introducing a weighted Intersection-over-Union (IoU) score that weights points in high-curvature regions more heavily, rather than weighting every point in the object equally as in standard IoU scoring. As a next step, StoryGold plans to apply online RL to the extended five-modality model. Overall, BELLA represents a promising bridge between underlying design intent and real-world, manufacturable CAD models, with the potential to increase manufacturing productivity, lower design-to-manufacturing workflow costs, and expand access to production-ready design capabilities across diverse industries.
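The 2x2 multi-view packing mentioned above can be sketched as a simple image-tiling step. The view ordering and any padding conventions are not specified in this summary, so the row-major layout below is an assumption for illustration only.

```python
import numpy as np

def tile_views_2x2(views):
    """Pack four H x W x C view renders into one 2H x 2W x C grid image.

    Assumption (not from the summary): views are placed row-major,
    views[0..1] on the top row and views[2..3] on the bottom row.
    """
    assert len(views) == 4, "expected exactly four views"
    top = np.concatenate([views[0], views[1]], axis=1)
    bottom = np.concatenate([views[2], views[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)

# Toy example: four constant-valued 64x64 RGB "renders"
views = [np.full((64, 64, 3), i, dtype=np.uint8) for i in range(4)]
grid = tile_views_2x2(views)
print(grid.shape)  # -> (128, 128, 3)
```

Packing the views into one grid lets a standard vision encoder ingest all four renders as a single image, which is how the summary describes cadrille's image pathway.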
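The curvature-weighted IoU objective can be sketched as follows. The summary does not give the exact weighting function, so the form `1 + alpha * curvature` and the `alpha` knob are assumptions; the key property it illustrates is that disagreements at high-curvature sample points cost more than disagreements on flat regions.

```python
import numpy as np

def weighted_iou(pred_occ, gt_occ, curvature, alpha=1.0):
    """Curvature-weighted IoU over a shared set of sample points.

    pred_occ, gt_occ: boolean occupancy per sample point
    curvature: nonnegative per-point curvature estimate
    alpha: assumed knob for how strongly curvature boosts a point's weight
    """
    weights = 1.0 + alpha * curvature          # high-curvature points count more
    inter = np.sum(weights * (pred_occ & gt_occ))
    union = np.sum(weights * (pred_occ | gt_occ))
    return inter / union if union > 0 else 1.0

# Toy example: four sample points, disagreement at indices 1 and 2
pred = np.array([True, True, False, True])
gt   = np.array([True, False, True, True])
flat = np.zeros(4)  # zero curvature everywhere reduces to standard IoU
print(weighted_iou(pred, gt, flat))  # -> 0.5 (intersection 2, union 4)
```

With uniform (zero) curvature the score equals standard IoU; assigning high curvature to a disagreeing point lowers the score, giving the RL objective a stronger gradient toward matching fine geometric detail.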