- The paper introduces a novel architecture that separates LLM-driven blueprint extraction from rule-based UVM code generation to ensure synthesis reliability.
- It achieves significant coverage improvements, raising average code coverage from 84.6% to 90.6% and functional coverage from 79.8% to 87.9%, with no manual or per-design tuning.
- The approach costs $0.37 per design and compiles successfully on all evaluated designs across the Direct, Wishbone, and AXI4-Lite protocols.
HAVEN: Hybrid Automated Verification Engine for LLM-Assisted UVM Testbench Synthesis
Motivation and Background
Verification remains a principal bottleneck in IC development, accounting for up to 70% of project effort because generating and maintaining protocol-correct UVM testbenches is labor-intensive. Recent attempts to automate this process with LLMs have exposed severe limitations: even with moderate prompt engineering or fine-tuning, LLMs do not reliably generate syntactically and semantically valid HDL/UVM code. The main causes are the low representation of HDLs in training corpora and the difficulty LLMs have with SystemVerilog-specific constructs such as correct clocking, bus protocols, and assignment modalities. Prior LLM-based testbench generation frameworks (AutoBench, UVM2, ConfiBench) largely depend on LLMs emitting large fragments of Verilog/SystemVerilog, which leads to frequent syntax errors and requires expensive iterative self-debugging.
HAVEN Architecture
HAVEN makes a decisive architectural shift: LLMs are never used to generate HDL/UVM code directly. Instead, the pipeline uses LLM agents solely for structured information extraction from design specifications and for coverage gap analysis. All testbench code is emitted by a rule-based generator that renders protocol-correct Jinja2 templates, parameterized by the structured outputs from the LLMs.
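The separation described above can be illustrated with a minimal sketch. It is not HAVEN's implementation: Python's standard-library `string.Template` stands in for Jinja2 so the example has no dependencies, and the blueprint field names, the template text, and the `render_drivers` helper are all assumptions made for illustration.

```python
import json
from string import Template

# Hypothetical blueprint, as an LLM extraction step might return it.
# The field names are assumptions, not HAVEN's actual schema.
BLUEPRINT_JSON = """
{
  "dut": "simple_timer",
  "protocol": "wishbone",
  "agents": [
    {"name": "wb", "data_width": 32, "addr_width": 8}
  ]
}
"""

# Fixed, human-written driver skeleton. string.Template stands in for Jinja2.
DRIVER_TEMPLATE = Template("""\
class ${agent}_driver extends uvm_driver #(${agent}_item);
  `uvm_component_utils(${agent}_driver)
  virtual ${agent}_if #(.DW(${data_width}), .AW(${addr_width})) vif;
  // protocol-specific run_phase body would live here, in the template
endclass
""")

ALLOWED_PROTOCOLS = {"direct", "wishbone", "axi4lite"}


def render_drivers(blueprint_text):
    """Render one driver per agent; the LLM output only fills in parameters."""
    blueprint = json.loads(blueprint_text)
    if blueprint["protocol"] not in ALLOWED_PROTOCOLS:
        raise ValueError("unsupported protocol: " + str(blueprint["protocol"]))
    rendered = {}
    for agent in blueprint["agents"]:
        rendered[agent["name"]] = DRIVER_TEMPLATE.substitute(
            agent=agent["name"],
            data_width=int(agent["data_width"]),
            addr_width=int(agent["addr_width"]),
        )
    return rendered


if __name__ == "__main__":
    print(render_drivers(BLUEPRINT_JSON)["wb"])
```

In this sketch the extracted blueprint can only influence names and integer parameters. Every SystemVerilog keyword and construct comes from the fixed template, which is the property the architecture relies on.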
Stage 1: UVM Testbench Synthesis via Templates
- LLM-driven Blueprint Extraction: LLMs parse design specs into a precise, hierarchical JSON 'Blueprint' encoding agent topologies, interface definitions, and protocol types.
- Stimulus Strategy Inference: A rule-based system, independent from LLMs, infers stimulus generation strategies tailored to signal semantics, width, and protocol.
- Template Rendering: All testbench components (drivers, monitors, scoreboards, BFMs, coverage subscribers) are instantiated from validated Jinja2 templates. Templates embed protocol-specific knowledge—e.g., handshake sequences, non-blocking assignment use, and bounded timeout mechanisms—eliminating sources of incorrect HDL emission.
- Predefined Sequence Generation: Six sequence patterns are generated by rule through a Protocol-Aware Sequence DSL, covering typical behaviors (constrained-random verification (CRV), enumeration, toggling, FIFOs) without risk of syntax errors.
- Compile-Fix Loop: The generated testbench goes through a bounded iterative repair process. Only LLM-generated sequence or scoreboard code is ever modified; template-rendered components are immutable, which preserves their protocol and syntax correctness.
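Two of the Stage 1 steps, rule-based stimulus inference and the bounded compile-fix loop, can be sketched as follows. This is an illustration under stated assumptions: the `Signal` fields, the role labels, the width thresholds, and the `compile_fn` / `repair_fn` callbacks are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Signal:
    name: str
    width: int
    role: str  # assumed labels: "data", "control", "select"


def infer_strategy(sig: Signal) -> str:
    """Pick a stimulus strategy from width and role; thresholds are illustrative."""
    if sig.width == 1:
        return "toggle"              # single-bit signals: drive both values
    if sig.role == "select" or sig.width <= 4:
        return "enumerate"           # small value spaces: walk every value
    return "constrained_random"      # wide data paths: constrained-random stimulus


def compile_fix_loop(
    mutable: Dict[str, str],
    compile_fn: Callable[[Dict[str, str]], List[str]],
    repair_fn: Callable[[str, str], str],
    max_iters: int = 3,
) -> bool:
    """Repair failing files for at most max_iters rounds.

    Only files in `mutable` can change. Template-rendered files are never
    passed in, so the loop has no way to touch them.
    compile_fn returns the names of files with errors (empty list = success).
    """
    for _ in range(max_iters):
        failing = compile_fn(mutable)
        if not failing:
            return True
        for name in failing:
            if name in mutable:
                mutable[name] = repair_fn(name, mutable[name])
    # One last check after the final repair round.
    return not compile_fn(mutable)


if __name__ == "__main__":
    for s in (Signal("irq_en", 1, "control"),
              Signal("mode", 3, "select"),
              Signal("wdata", 32, "data")):
        print(s.name, "->", infer_strategy(s))
```

Keeping the immutable components out of the repair function's reach, instead of trusting it to leave them alone, is one simple way to make the immutability structural.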
Stage 2: Iterative Coverage-Guided Generation
- Protocol-Aware Sequence DSL: Both the rule-based strategies and the LLM gap analysis emit test sequences as structured JSON built from a fixed set of atomic step types (register_write, poll, randomize_send, value_sweep, etc.).
- Targeted Gap Closure: Coverage reports are parsed and presented to LLMs in natural language and structured formats. LLMs generate additional DSL sequences addressing uncovered states, conditions, or FSM transitions.
- CodeGen with Safety Filters: All LLM-generated DSL is post-processed through auto-fix, validation, and enforced translation to syntactically correct, protocol-compliant UVM sequences via rule-based codegen. Sequence accumulation ensures previously achieved coverage is preserved.
- Convergence Control: The coverage closure process proceeds iteratively, terminating either by achieving convergence (coverage improvement < 0.1 pp) or reaching a preset iteration budget (typically K=3).
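The Stage 2 loop can be sketched as below, assuming a simplified DSL. The step names come from the text above; the per-step field names, the emitted helper task names (`write_reg`, `poll_reg`, `send_random`), and the `measure` / `propose` callbacks are hypothetical.

```python
# Required integer fields per step type. Field names are assumptions.
STEP_FIELDS = {
    "register_write": ("addr", "data"),
    "poll": ("addr", "expect", "timeout"),
    "randomize_send": ("count",),
    "value_sweep": ("addr", "start", "stop"),
}


def validate(steps):
    """Keep only steps with a known type and all required integer fields."""
    kept = []
    for step in steps:
        fields = STEP_FIELDS.get(step.get("op"))
        if fields and all(isinstance(step.get(f), int) for f in fields):
            kept.append(step)
    return kept


def emit(step):
    """Translate one validated step into a line of sequence-body code."""
    op = step["op"]
    if op == "register_write":
        return f"write_reg('h{step['addr']:x}, 'h{step['data']:x});"
    if op == "poll":
        return (f"poll_reg('h{step['addr']:x}, 'h{step['expect']:x}, "
                f"{step['timeout']});")
    if op == "randomize_send":
        return f"repeat ({step['count']}) send_random();"
    return (f"for (int v = {step['start']}; v <= {step['stop']}; v++) "
            f"write_reg('h{step['addr']:x}, v);")


def close_coverage(measure, propose, max_iters=3, min_gain_pp=0.1):
    """Accumulate sequences until the gain drops below min_gain_pp or the budget ends.

    measure(sequences) -> coverage in percent
    propose(coverage)  -> list of raw DSL steps (e.g. from an LLM)
    """
    sequences = []
    coverage = measure(sequences)
    for _ in range(max_iters):
        candidate = validate(propose(coverage))
        if not candidate:
            break
        sequences.append(candidate)      # earlier sequences are always kept
        new_coverage = measure(sequences)
        gain = new_coverage - coverage
        coverage = new_coverage
        if gain < min_gain_pp:
            break
    return sequences, coverage
```

Because `emit` builds every line from integers and fixed text, a malformed LLM proposal is dropped by `validate` and never reaches the generated testbench.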
Experimental Results
HAVEN is evaluated on 19 open-source IP cores (180 to 11k LOC) spanning the Direct, Wishbone, and AXI4-Lite interface protocols, including the most comprehensive benchmark overlap with UVM2. The methodology requires no manual or per-design tuning.
- Compile Reliability: 100% compilation success on all designs. In ablations, LLM-generated components without templates universally failed to compile.
- Coverage: The predefined-sequence-only pipeline achieves 84.6% code and 79.8% functional coverage on average. Stage 2 iterative gap closure raises the final averages to 90.6% code and 87.9% functional coverage. On the designs that overlap with UVM2, HAVEN improves coverage by +3.6 pp code and +1.1 pp functional.
- Cost Efficiency: Achieves full pipeline execution at $0.37 per design (averaging six LLM calls, ~46k tokens) on GPT-5.2. Open-source LLMs (e.g., Qwen3.5-27B) are competitive but underperform by 4–8 pp on complex peripherals.
- Robustness: The structure of HAVEN is largely LLM-agnostic. While stronger open-source models provide competitive coverage, design scaling and protocol complexity can stress model context and capacity limits.
Discussion
Although HAVEN represents a significant improvement over prior LLM-driven approaches, remaining coverage gaps persist in designs requiring multi-phase protocol handling, non-linear state machine traversal, or multi-agent coordination—limitations inherent to the linear, fixed-step DSL. Run-to-run coverage variance is non-negligible due to LLM non-determinism in gap-filling, although compile success remains strictly deterministic due to the hard separation between LLM and code generation responsibilities.
Template engineering emerges as a one-time, protocol-specific effort: the marginal cost of supporting new protocols, once a single driver/monitor template is written, is low. The system's utility scales with protocol coverage. Future enhancements would require extending the DSL to express conditional branching and multi-agent behaviors, as well as LLM-aided synthesis of new protocol templates from natural language specifications.
Implications and Future Directions
HAVEN's approach validates that LLMs, when used for information extraction and design intent understanding, significantly amplify IC verification automation—as long as code emission is rigorously separated and controlled. This blueprint-codegen-template paradigm is generalizable to other code generation domains characterized by low-data, high-semantic-density programming languages.
Potential improvements include:
- Extension of the DSL to support programmatic control flow, multi-phase and concurrent protocol behaviors.
- LLM-powered natural language to template synthesis, reducing the template engineering burden for new protocols.
- Enhanced integration with static, semantics-aware analyzers to dynamically expand DSL expressiveness and coverage-directed refinement.
Conclusion
HAVEN establishes a robust paradigm for scalable, LLM-assisted UVM testbench generation, achieving state-of-the-art coverage and reliability through strict modularization of information extraction and code generation. This hybrid approach obviates the primary failure mode of LLM-driven code synthesis in HDL-centric domains and offers a scalable framework for future advances in automated IC verification.