From Text to DSL: Evaluating Grammar-Based Model Generation Using Open LLMs

Published 15 May 2026 in cs.SE | (2605.15865v1)

Abstract: LLMs have shown increasing potential in automating model-driven software engineering tasks, particularly in generating models conforming to Domain Specific Languages (DSLs) from natural language. While most existing approaches rely on large proprietary models, their high cost and limited deployability hinder broader adoption. In this paper, we evaluate whether open-source LLMs of varying sizes (0.5B to 32B parameters) can generate DSL-conformant models using only few-shot prompting, without any fine-tuning. Our evaluation focuses on key model-driven engineering (MDE) requirements, including syntactic validity, semantic completeness, and inter-model reference consistency. We extend our prior work by moving from generating user interface models (referred to as "UI models" in this paper) over fixed, predefined data schemas ("data models") to generating both the UI and data models entirely from scratch. This shift serves two purposes: first, it highlights the LLM's ability to infer domain-specific relationships and maintain consistency across multiple interconnected models; second, it allows us to generalize earlier findings by testing DSL generation across models of different natures and structural roles. Our structured evaluation combines automatic parsing and expert feedback across 39 LLMs, revealing that several compact models (e.g., gemma3:12b, mistral:7b-instruct) approach or match the quality of much larger models. These findings demonstrate the feasibility of using smaller, open-source LLMs for grammar-conformant DSL generation in MDE workflows, offering a cost-effective and deployable alternative to closed LLMs.


Authors (5)

Summary

The paper demonstrates that prompt-driven DSL generation using open-source LLMs yields both syntactically valid and semantically rich models without fine-tuning.
The methodology employs structured prompts, iterative temperature reduction, and dual evaluation (automated parsing and expert review) to ensure DSL accuracy.
The evaluation reveals that compact, instruction-tuned LLMs can achieve grammar conformity and domain completeness comparable to those of larger models for low-code applications.

Evaluating Grammar-Conformant DSL Generation Using Open LLMs

Introduction

The paper "From Text to DSL: Evaluating Grammar-Based Model Generation Using Open LLMs" (2605.15865) systematically examines the DSL model generation capabilities of open-source LLMs, specifically focusing on syntactic validity, semantic completeness, and reference consistency. The study leverages few-shot prompting without model fine-tuning and extends prior work by not only generating UI models from natural language but also inferring and constructing data models, thereby increasing task complexity. The evaluation encompasses 39 LLMs ranging from 0.5B to 32B parameters, spanning multiple architectures, and emphasizes prompt engineering as the sole mechanism to guide model outputs.

Model-Based Approach and Evaluation Pipeline

The methodology is built on a modular pipeline that takes a user's natural language application specification (e.g., “I want to create the website for an online ice cream parlour”) and sequentially generates the conceptual data model and then the UI model via LLM inference. Both the grammar definition and the domain semantics are embedded in structured prompts. All generated outputs are validated through automatic parsing (using the Lark framework) and human expert evaluation.

Figure 1: Pipeline overview illustrating prompt-driven DSL generation and multi-stage syntactic and semantic evaluation.

The process comprises:

Concept extraction: LLMs are prompted to identify key entities, attributes, and relationships.
DSL synthesis: Strict grammar conformance is enforced via prompt constraints covering cardinality, enums, subset-of relations, references, etc.
Syntax validation: Outputs are parsed to detect syntax errors; when parsing fails, generation is retried with a reduced temperature.
Semantic review: Expert evaluation addresses domain correctness, covering advanced features such as ratings, promotions, and delivery tracking.
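
The two automated checks in this process, syntactic validity and inter-model reference consistency, can be illustrated with a minimal Python sketch. The toy line-based DSL used here (`entity Name { field: Type }` and `page Name shows Entity`) is invented for illustration and is not the paper's grammar. The paper validates with Lark; this sketch substitutes regular expressions as a dependency-free stand-in, and all function names are hypothetical.

```python
import re

# Hypothetical toy DSL, NOT the paper's grammar:
#   entity <Name> { <field>: <Type>, ... }
#   page <Name> shows <Entity>
ENTITY_RE = re.compile(r"^entity\s+(\w+)\s*\{\s*(.*?)\s*\}$")
FIELD_RE = re.compile(r"^(\w+)\s*:\s*(\w+)$")
PAGE_RE = re.compile(r"^page\s+(\w+)\s+shows\s+(\w+)$")


def parse_data_model(text):
    """Return {entity: {field: type}} or raise ValueError on a syntax error."""
    entities = {}
    for lineno, line in enumerate(text.strip().splitlines(), start=1):
        match = ENTITY_RE.match(line.strip())
        if not match:
            raise ValueError(f"line {lineno}: expected an entity declaration")
        name, body = match.groups()
        fields = {}
        # Split the brace body on commas and skip empty fragments.
        for part in filter(None, (p.strip() for p in body.split(","))):
            field = FIELD_RE.match(part)
            if not field:
                raise ValueError(f"line {lineno}: malformed field {part!r}")
            fields[field.group(1)] = field.group(2)
        entities[name] = fields
    return entities


def dangling_references(ui_text, entities):
    """Return (page, entity) pairs whose entity is missing from the data model."""
    dangling = []
    for line in ui_text.strip().splitlines():
        match = PAGE_RE.match(line.strip())
        if match is None:
            raise ValueError(f"malformed page declaration: {line!r}")
        page, entity = match.groups()
        if entity not in entities:
            dangling.append((page, entity))
    return dangling


data_model = """
entity Flavor { name: String, price: Float }
entity Order { total: Float }
"""
ui_model = """
page Menu shows Flavor
page Tracking shows Delivery
"""

entities = parse_data_model(data_model)
print(dangling_references(ui_model, entities))  # [('Tracking', 'Delivery')]
```

In this example the UI model parses correctly but refers to a `Delivery` entity that the data model never declares, which is the kind of cross-model inconsistency a purely syntactic check would miss.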

LLM Selection and Prompt Engineering

The study targets a heterogeneous sample of open LLMs: LLaMA, Qwen2, Phi, Gemma, GraniteMoE, StableLM, and others. Model variants include general-purpose, instruction-tuned, and code-centric flavors. The unified prompt template incorporates the DSL grammar, sample models, the inferred data model, and the user’s original specification. Strict directives are applied to enforce grammar conformity and encourage semantic expansion.
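
As a rough illustration of how such a unified template might be assembled, the Python sketch below concatenates the four ingredients named above (grammar, sample models, inferred data model, user specification) into one few-shot prompt. The section headings, the directive wording, and the `build_prompt` function are assumptions made for this example; the summary does not reproduce the paper's actual template.

```python
def build_prompt(grammar, examples, data_model, user_spec):
    """Assemble one few-shot prompt from the four ingredients.

    All headings and directive sentences are illustrative assumptions,
    not the wording used in the paper.
    """
    shots = "\n\n".join(
        f"### Example {i}\n{example.strip()}"
        for i, example in enumerate(examples, start=1)
    )
    sections = [
        "You generate models in the DSL defined below.",
        "Output ONLY text that conforms to the grammar; no explanations.",
        f"## Grammar\n{grammar.strip()}",
        f"## Sample models\n{shots}",
        f"## Data model\n{data_model.strip()}",
        f"## User specification\n{user_spec.strip()}",
        "## UI model",  # the LLM is expected to continue from here
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    grammar="page: 'page' NAME 'shows' NAME",  # placeholder grammar rule
    examples=["page Catalog shows Product", "page Cart shows Order"],
    data_model="entity Flavor { name: String }",
    user_spec="I want to create the website for an online ice cream parlour",
)
print(prompt)
```

Placing the user specification last and ending on an open `## UI model` heading is one common few-shot layout; other orderings are equally plausible.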

The reliance on prompt engineering, without fine-tuning or retraining, is central to the study: it addresses the scalability and deployability constraints that hinder adoption of large closed models. When an output fails to parse, a systematic retry loop regenerates it at a progressively lower temperature. This raises the chance of obtaining a syntactically valid output by exploiting the stochastic nature of LLM sampling, although it does not guarantee success for every model.
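
A retry loop of this kind could look like the following Python sketch. The starting temperature, the step size, and the attempt limit are assumed values (the summary does not report the paper's settings), and `generate` and `validate` are hypothetical callables standing in for the LLM call and the parser.

```python
def generate_with_retries(generate, validate, prompt,
                          start_temperature=0.8, step=0.2, max_attempts=4):
    """Call generate(prompt, temperature) until validate() accepts the output.

    Returns (output, attempts_used), or (None, max_attempts) when every
    attempt fails validation. Default values are illustrative assumptions.
    """
    temperature = start_temperature
    for attempt in range(1, max_attempts + 1):
        candidate = generate(prompt, temperature)
        try:
            validate(candidate)  # expected to raise ValueError on bad syntax
            return candidate, attempt
        except ValueError:
            # Lower the temperature so the next sample is less random.
            temperature = max(0.0, round(temperature - step, 2))
    return None, max_attempts


# Stand-ins for the LLM and the parser, for demonstration only.
seen_temperatures = []


def fake_generate(prompt, temperature):
    seen_temperatures.append(temperature)
    # Pretend that only low-temperature samples come out well-formed.
    if temperature <= 0.5:
        return "entity Order { total: Float }"
    return "entity Order {"


def fake_validate(text):
    if not text.rstrip().endswith("}"):
        raise ValueError("unbalanced braces")


output, attempts = generate_with_retries(fake_generate, fake_validate, "spec")
print(output, attempts, seen_temperatures)
```

Returning `None` after the final attempt, rather than raising, makes it straightforward to count how many models never produce a valid output within the retry budget.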

Human-Centered Semantic Evaluation

In addition to automatic parsing, DSL models are assessed by domain experts across four axes: Semantic Correctness, Concept Identification, Completeness, and Advanced Feature Coverage. Each axis is scored on a 5-point Likert scale (1 to 5), and a web-based interface standardizes participant input, demographic tracking, and criteria display.

Figure 2: Interactive expert evaluation platform capturing experiment selection, demographic input, and detailed model review with syntactic and semantic highlighting.

This two-pronged approach, structural and semantic, enables fine-grained discrimination between models that merely satisfy the grammar and those that generate domain-complete, semantically rich outputs.
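
To make the scoring scheme concrete, the Python sketch below averages expert ratings per axis for one generated model. Averaging Likert scores is an assumption, since the summary does not state how the paper aggregates them, and the ratings shown are invented placeholders rather than results from the study.

```python
from statistics import mean

# The four evaluation axes named in the text.
AXES = (
    "semantic_correctness",
    "concept_identification",
    "completeness",
    "advanced_feature_coverage",
)


def aggregate(ratings):
    """Average 1-5 Likert ratings per axis across experts for one model."""
    for rating in ratings:
        for axis in AXES:
            if not 1 <= rating[axis] <= 5:
                raise ValueError(f"{axis} must be between 1 and 5")
    per_axis = {axis: mean(r[axis] for r in ratings) for axis in AXES}
    # Unweighted mean over the four axes; the weighting is an assumption.
    per_axis["overall"] = mean(per_axis[axis] for axis in AXES)
    return per_axis


# Invented placeholder ratings from two hypothetical experts.
ratings = [
    {"semantic_correctness": 4, "concept_identification": 5,
     "completeness": 3, "advanced_feature_coverage": 2},
    {"semantic_correctness": 5, "concept_identification": 4,
     "completeness": 4, "advanced_feature_coverage": 3},
]

summary = aggregate(ratings)
print(summary)
```

Keeping the per-axis means alongside the overall score preserves the distinction between a model that is correct but minimal and one that also covers advanced features.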

Numerical Results and Comparative Performance

Out of 39 evaluated LLMs, 26 (about 67%) produced at least one syntactically valid DSL model within the retry limit. Among these, several compact models (e.g., gemma3:12b, mistral:7b-instruct) achieved semantic scores comparable to those of larger models (e.g., phi4:latest). Top-performing models consistently incorporated advanced semantics, including enums and constrained relationships.

Key findings include:

Models ≤8B (e.g., notus:latest, codellama:latest, mistral:7b-instruct) matched or outperformed several larger variants.
Semantic evaluation scores demonstrated that parameter size did not reliably predict semantic completeness; smaller models performed robustly when prompted with well-structured templates and grammar cues.
Instruction-tuned models yielded higher consistency and correctness, indicating that alignment training benefits DSL synthesis tasks.

Figure 3: Semantic evaluation scores for DSL models generated by LLMs with ≤8B parameters reveal that compact models can achieve high completeness and advanced feature coverage.

These findings directly address RQ1 (small/mid-size LLM DSL generation capability vs large models) and RQ2 (efficacy of prompt engineering without fine-tuning).

Implications and Future Directions

The demonstrated viability of prompt-driven DSL generation using small and mid-sized open-source LLMs has substantial implications for MDE and low-code/no-code platforms:

Cost and deployment efficiency: Models with modest compute footprints can produce grammar-conformant, semantically rich outputs, enabling private, edge, or enterprise-controlled deployments over proprietary SaaS LLMs.
Model-driven application synthesis: Reliable structural inference and inter-model reference resolution by LLMs support the automation and democratization of software modeling, reducing manual overhead.
Prompt-centric workflows: Structured prompts and retry-based syntactic validation provide practical pathways to harness LLM capabilities for DSL tasks, circumventing training data sparsity and fine-tuning obstacles.

Further research trajectories include fine-tuning compact LLMs for DSL-specific tasks, multi-turn interactive refinement workflows, and tighter integration of automated verification/repair tools for correctness and semantic alignment.

Conclusion

The systematic evaluation establishes that compact, open-source LLMs, properly guided by grammar-aware prompt engineering, can reliably synthesize DSL-conformant models directly from text inputs. Instruction-tuned models yield consistently superior outputs, and prompt design significantly mitigates dependencies on model scale. The results substantiate the practical feasibility of deploying lightweight LLMs for grammar-based modeling in low-code and MDE platforms, suggesting fertile ground for future investigation into prompt optimization, refinement pipelines, and domain-specialized LLM customization.
