3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Published 9 Apr 2026 in cs.CV and cs.AI | (2604.08042v1)

Abstract: Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages LLMs to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.


Authors (5)

Summary

The paper introduces a training-free framework enabling LLMs to generate 3D wireframe sketches via contrastive experience optimization.
It leverages a novel pairwise CLIP-guided reward mechanism and self-supervised critique to refine spatial reasoning.
The approach achieves comparable semantic and geometric performance to trained models with significantly reduced inference cost.

3DrawAgent: Training-Free Language-Driven 3D Sketch Generation via Contrastive Experience

Introduction

The paper "3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience" (2604.08042) introduces a training-free framework that extends the zero-shot reasoning and drawing capabilities of LLMs from 2D vector graphics to 3D wireframe generation. The central contribution is a mechanism by which a frozen LLM, guided solely by prompt engineering and iterative, self-supervised critique, refines its spatial reasoning and expresses it as 3D Bezier curve sketches. The system requires neither ground-truth 3D data nor gradient-based optimization; it relies instead on a hybrid reward structure and an adaptation of training-free Group Relative Policy Optimization (GRPO) to extract 3D spatial knowledge.

Methodology

Language-Driven 3D Sketch Representation

The agent operates over a formalized language for representing 3D Bezier curves. Each model rollout expresses a 3D sketch as a list of Bezier curves, where each curve is a sequence of four 3D control points. The framework employs structured in-context prompts to communicate drawing conventions, specify output formats, and instruct the LLM to generate spatially consistent and renderable geometry. Crucially, prompt construction enforces strict type safety and explicit coordinate system conventions, ensuring that the LLM outputs are directly parseable and suitable for evaluation.
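As an illustration of the representation described above, the following is a minimal sketch of how a sketch made of cubic Bezier curves (four 3D control points each) could be validated and sampled. The nested-list input format and the names `parse_sketch`, `bezier_point`, and `sample_polyline` are assumptions made for this example; the paper's actual output grammar is not reproduced here.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]
Curve = Tuple[Point3, Point3, Point3, Point3]  # four control points of a cubic Bezier


def parse_sketch(raw: List[List[List[float]]]) -> List[Curve]:
    """Validate a nested-list sketch (hypothetical format): 4 points per curve, 3 coords per point."""
    curves = []
    for i, curve in enumerate(raw):
        if len(curve) != 4 or any(len(p) != 3 for p in curve):
            raise ValueError(f"curve {i} is not four 3D control points")
        curves.append(tuple(tuple(float(v) for v in p) for p in curve))
    return curves


def bezier_point(curve: Curve, t: float) -> Point3:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    p0, p1, p2, p3 = curve
    s = 1.0 - t
    # Bernstein basis weights for a cubic curve.
    w = (s ** 3, 3 * s ** 2 * t, 3 * s * t ** 2, t ** 3)
    return tuple(
        w[0] * a + w[1] * b + w[2] * c + w[3] * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_polyline(curve: Curve, n: int = 16) -> List[Point3]:
    """Approximate a curve by n evenly spaced parameter samples (n >= 2)."""
    return [bezier_point(curve, k / (n - 1)) for k in range(n)]
```

Rejecting malformed curves at parse time is one way to make the LLM output "directly parseable": an invalid rollout can be discarded or regenerated before any rendering or scoring takes place.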

Contrastive Experience Optimization

Instead of traditional scalar rewards or supervised 3D targets, the framework uses a pairwise contrastive mechanism. Candidates generated by the LLM are evaluated by a two-stage reward pipeline. First, a pre-trained CLIP model scores the perceptual similarity between multi-view renders of each candidate and the input prompt or reference image. Second, the LLM itself is prompted as a judge to provide qualitative comparative assessments that emphasize aspects such as topology, symmetry, and spatial arrangement.
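To make the multi-view step concrete, here is a rough sketch of how sampled 3D points could be turned into several 2D views and how per-view scores could be combined. Everything in it is an assumption for illustration: the summary does not state the camera model, the number of views, or how view scores are aggregated, and `score_view` is only a placeholder for a CLIP similarity call, not a real API.

```python
import math
from typing import Callable, List, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def rotate_about_y(p: Point3, angle_deg: float) -> Point3:
    """Rotate a point about the vertical (y) axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    x, y, z = p
    return (x * math.cos(a) + z * math.sin(a), y, -x * math.sin(a) + z * math.cos(a))


def project_views(points: List[Point3], azimuths_deg: List[float]) -> List[List[Point2]]:
    """Produce one orthographic 2D view per azimuth (assumed camera model)."""
    views = []
    for az in azimuths_deg:
        rotated = [rotate_about_y(p, az) for p in points]
        views.append([(x, y) for x, y, _ in rotated])  # orthographic: drop depth
    return views


def mean_view_score(views: List[List[Point2]],
                    score_view: Callable[[List[Point2]], float]) -> float:
    """Average a per-view score; score_view stands in for a CLIP similarity (assumption)."""
    return sum(score_view(v) for v in views) / len(views)
```

Averaging over views is only one plausible aggregation; taking the minimum across views would instead penalize sketches that look right from a single angle only.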

Contrastive pairs are constructed by sampling candidate generations and selecting those with non-trivially different CLIP scores. The LLM, acting as a semantic critic, articulates why one candidate exhibits superior spatial or geometric properties over the other. These comparative insights are accumulated in an experience library and subsequently injected into future prompt contexts.
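The pair-selection step described above can be sketched as follows, assuming each candidate already has a scalar CLIP score. The threshold `min_gap`, its default value, and the function name `build_contrastive_pairs` are hypothetical; the summary says only that the scores must differ non-trivially.

```python
from itertools import combinations
from typing import List, Tuple


def build_contrastive_pairs(scores: List[float],
                            min_gap: float = 0.02) -> List[Tuple[int, int]]:
    """Return (better_idx, worse_idx) pairs whose score gap is at least min_gap.

    min_gap is an assumed threshold and should be positive, so that ties
    never produce a pair.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        gap = scores[i] - scores[j]
        if abs(gap) >= min_gap:
            # Order each pair so the higher-scoring candidate comes first.
            pairs.append((i, j) if gap > 0 else (j, i))
    return pairs
```

Each returned pair would then be handed to the LLM critic, which explains why the first candidate is better than the second; near-ties are skipped because they give the critic little to contrast.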

Training-Free Reinforcement Prompt Tuning

This iterative accumulation of contrastive experience achieves a form of RL-like adaptation, termed "black-box reinforcement prompt tuning", without any model parameter updates. The LLM's spatial reasoning is refined by the evolving, in-context experience library, which encodes transferable geometric principles and constraints distilled from prior self-critiques. The agent's generative process is thus conditioned on both the original user prompt and these extracted experiences.
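A minimal sketch of such an experience library is shown below, assuming that insights are stored as short text strings and prepended to the user prompt. The class name, the prompt wording, the de-duplication, and the `max_items` cap are all assumptions for illustration; the paper's actual prompt template is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperienceLibrary:
    """Bounded store of textual insights from earlier critiques (hypothetical design)."""
    max_items: int = 20
    items: List[str] = field(default_factory=list)

    def add(self, insight: str) -> None:
        """Store a new insight, skipping blanks and exact duplicates."""
        insight = insight.strip()
        if insight and insight not in self.items:
            self.items.append(insight)
            # Drop the oldest entry so the prompt context stays bounded.
            if len(self.items) > self.max_items:
                self.items.pop(0)

    def build_prompt(self, user_prompt: str) -> str:
        """Condition generation on accumulated experience plus the user prompt."""
        lines = ["Lessons from earlier drawing attempts:"]
        lines += [f"- {item}" for item in self.items]
        lines += ["", f"Task: {user_prompt}"]
        return "\n".join(lines)
```

Because the model weights stay frozen, the only state that changes between iterations in this sketch is the list of insights, which is what makes the adaptation "black-box".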

Experimental Results

Semantic and Geometric Performance

The experiments compare 3DrawAgent with optimization- and training-based baselines such as Diff3DS, 3Doodle, and Dream3DVG. Using only prompt guidance and training-free adaptation, 3DrawAgent achieves CLIP-based semantic alignment and aesthetic quality scores (CLIP-ST and AES) comparable to those of the baselines. For example, the CLIP-ST score for 3DrawAgent (Gemini-2.5Pro) reaches 0.649 on category-level prompts, marginally above Diff3DS (0.648) and slightly below Dream3DVG (0.660).

The agent produces topologically faithful, clean 3D sketches even for category-rich prompts (e.g., furniture, vehicles, freehand forms) and demonstrates robust generalization to fine-grained descriptions and image-conditioned sketching. Qualitative results emphasize that 3DrawAgent recovers canonical structural features and maintains spatial coherence, outperforming baselines particularly on semantic plausibility and geometric consistency.

Efficiency and Practical Implications

3DrawAgent's inference cost is substantially lower than that of optimization-based methods: it produces a sketch in about 2 minutes per instance at marginal API cost, compared to 60–120 minutes and over 10x the expense for the baselines. The method also offers controllable abstraction, in which output complexity is regulated by constraining the number of Bezier curves. Results under this constraint suggest that the LLM prioritizes semantically important structure, allocating its representational budget in a human-like, hierarchical manner.

Ablations and User Studies

Ablation studies attribute the core performance gains to Contrastive Knowledge Extraction (CKE): pairwise CLIP-guided critique outperforms random selection. A group size of K=5 gave the best balance between comparative diversity and learning stability. The results also indicate that removing ground-truth supervision does not degrade convergence, which suggests that CLIP-based perceptual rewards are sufficient.

In user studies, 3DrawAgent outputs were preferred by 46.66% of human raters over the alternatives, which corroborates the advantage in both semantic and geometric quality.

Limitations and Future Directions

Despite strong performance as a training-free, prompt-based system, 3DrawAgent inherits unavoidable limitations from its lack of explicit geometric supervision. It is susceptible to failure modes such as imperfect curve connectivity, floating primitives, and semantic ambiguity in non-canonical or highly complex input prompts. These issues arise from the holistic nature of CLIP-based rewards (which overlook local geometric errors) and the absence of dense spatial priors. Future work should consider incorporating explicit geometric losses, structure-aware reward models, or hybrid pipelines combining self-supervised experience with geometric regularizers to address these challenges.

Furthermore, the accumulation of overly specific or task-biased experiences can induce over-reasoning and reduced generalizability, highlighting the need for automated distillation or pruning of the experience library.

Conclusion

3DrawAgent establishes a new paradigm for training-free, language-driven 3D sketch generation, demonstrating that LLMs can acquire robust spatial priors and geometric planning ability solely through prompt engineering and self-supervised contrastive experience. The results provide compelling evidence that generic, frozen LLMs can be adapted as spatial planners for 3D wireframes with no parameter updates, positioning prompt-based reinforcement as a promising direction for interactive 3D reasoning, design, and structural abstraction in future foundation model systems.
