HAVEN: Hybrid Automated Verification ENgine for UVM Testbench Synthesis with LLMs

Published 30 Apr 2026 in cs.AR and cs.AI | (2604.27643v1)

Abstract: Integrated Circuit (IC) verification consumes nearly 70% of the IC development cycle, and recent research leverages LLMs to automatically generate testbenches and reduce verification overhead. However, LLMs have difficulty generating testbenches correctly. Unlike high-level programming languages, Hardware Description Languages (HDLs) are extremely rare in LLM training data, leading LLMs to produce incorrect code. To overcome challenges when using LLMs to generate Universal Verification Methodology (UVM) testbenches and sequences, we propose HAVEN (Hybrid Automated Verification ENgine) to prevent LLMs from writing HDL directly. For UVM testbench generation, HAVEN utilizes LLM agents to analyze design specifications to produce a structured architectural plan. The HAVEN Template Engine then combines this plan with predefined, protocol-specific templates to generate all UVM components with correct bus-handshake timing. For UVM sequence generation, HAVEN introduces a Protocol-Aware Sequence Domain-Specific Language (DSL) that decomposes sequences into fine-grained step types. A set of predefined DSL patterns first establishes sequences that achieve a high coverage rate without LLM involvement. HAVEN continues to improve the coverage rate by iteratively leveraging LLM agents to analyze coverage gap reports and compose additional targeted DSL sequences. Unlike previous works, HAVEN is the first system that utilizes predefined, protocol-specific Jinja2 templates to generate all UVM components and UVM sequences using our proposed Protocol-Aware DSL and rule-based code generator. Our experimental results on 19 open-source IP designs spanning three interface protocols (Direct, Wishbone, AXI4-Lite) show that HAVEN achieves 100% compilation success, 90.6% code coverage, and 87.9% functional coverage on average, and is SOTA among LLM-assisted testbench generation systems.

Authors (6)

Summary

The paper introduces a novel architecture that separates LLM-driven blueprint extraction from rule-based UVM code generation to ensure synthesis reliability.
Iterative gap closure raises average code coverage from 84.6% to 90.6% and average functional coverage from 79.8% to 87.9%, with no manual or per-design tuning.
The approach is cost-efficient ($0.37 per design) and works across the Direct, Wishbone, and AXI4-Lite protocols, with 100% compilation success on all evaluated designs.

HAVEN: Hybrid Automated Verification Engine for LLM-Assisted UVM Testbench Synthesis

Motivation and Background

Verification remains a principal bottleneck in IC development, accounting for nearly 70% of the development cycle, largely because generating and maintaining protocol-correct UVM testbenches is labor-intensive. Recent attempts to automate this work with LLMs have exposed severe limitations: even with prompt engineering or fine-tuning, LLMs do not reliably generate syntactically and semantically valid HDL/UVM code. The main causes are the low representation of HDLs in training corpora and the difficulty LLMs have with SystemVerilog-specific constructs such as clocking, bus protocols, and blocking versus non-blocking assignments. Prior LLM-based testbench generation frameworks (AutoBench, UVM2, ConfiBench) largely depend on LLMs emitting large fragments of Verilog/SystemVerilog, which leads to frequent syntax errors and requires expensive iterative self-debugging.

HAVEN Architecture

HAVEN makes a decisive architectural shift: LLMs never generate HDL/UVM code directly. Instead, the pipeline uses LLM agents solely for structured information extraction from design specifications and for coverage gap analysis. All testbench code is emitted by a rule-based generator that instantiates protocol-correct Jinja2 templates, parameterized by the structured outputs from the LLMs.

Stage 1: UVM Testbench Synthesis via Templates

LLM-driven Blueprint Extraction: LLMs parse design specs into a precise, hierarchical JSON 'Blueprint' encoding agent topologies, interface definitions, and protocol types.
Stimulus Strategy Inference: A rule-based system, independent from LLMs, infers stimulus generation strategies tailored to signal semantics, width, and protocol.
Template Rendering: All testbench components (drivers, monitors, scoreboards, BFMs, coverage subscribers) are instantiated from validated Jinja2 templates. Templates embed protocol-specific knowledge—e.g., handshake sequences, non-blocking assignment use, and bounded timeout mechanisms—eliminating sources of incorrect HDL emission.
Predefined Sequence Generation: Rule-based generation of six patterns via a Protocol-Aware Sequence DSL—providing coverage for typical behaviors (CRV, enumeration, toggling, FIFOs) without risk of syntax errors.
Compile-Fix Loop: The generated testbench is subjected to a bounded iterative repair process. Only LLM-generated sequence or scoreboard code is ever modified; template-rendered components are immutable, guaranteeing correctness against protocol and syntax.
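
The first three Stage 1 steps can be illustrated with a minimal sketch. It is not HAVEN's implementation: the Blueprint field names, the strategy rules, and the driver skeleton are all assumptions made for illustration, and Python's standard-library string.Template stands in for Jinja2 so the example runs without third-party packages.

```python
import json
from string import Template

# Hypothetical Blueprint as an LLM agent might emit it (field names are assumptions).
BLUEPRINT_JSON = """
{
  "dut": "simple_timer",
  "protocol": "wishbone",
  "agents": [{"name": "wb_agent", "active": true}],
  "signals": [
    {"name": "wb_dat_i", "width": 32, "role": "data"},
    {"name": "mode_sel", "width": 2, "role": "control"},
    {"name": "irq_en", "width": 1, "role": "control"}
  ]
}
"""

SUPPORTED_PROTOCOLS = {"direct", "wishbone", "axi4lite"}


def load_blueprint(text):
    """Parse and sanity-check the Blueprint before any code is emitted."""
    bp = json.loads(text)
    for key in ("dut", "protocol", "agents", "signals"):
        if key not in bp:
            raise ValueError(f"blueprint missing field: {key}")
    if bp["protocol"] not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {bp['protocol']}")
    return bp


def infer_strategy(signal):
    """Rule-based stimulus choice from width and role; no LLM is involved."""
    if signal["width"] == 1:
        return "toggle"
    if signal["role"] == "control" and signal["width"] <= 4:
        return "enumerate"
    return "constrained_random"


# Stand-in for a protocol-specific driver template (illustrative skeleton only).
DRIVER_TEMPLATE = Template("""\
class ${agent}_driver extends uvm_driver #(${agent}_item);
  `uvm_component_utils(${agent}_driver)
  // protocol: ${protocol}, handshake timeout: ${timeout} cycles
endclass
""")


def render_driver(bp, timeout=100):
    """Fill the fixed template; the LLM output only supplies parameter values."""
    agent = bp["agents"][0]["name"]
    return DRIVER_TEMPLATE.substitute(
        agent=agent, protocol=bp["protocol"], timeout=timeout
    )


if __name__ == "__main__":
    blueprint = load_blueprint(BLUEPRINT_JSON)
    for sig in blueprint["signals"]:
        print(sig["name"], "->", infer_strategy(sig))
    print(render_driver(blueprint))
```

The design point the sketch tries to capture is that the LLM contributes only values that fill a validated template, so a malformed Blueprint is rejected before any SystemVerilog text is produced.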

Stage 2: Iterative Coverage-Guided Generation

Protocol-Aware Sequence DSL: Both rule-based strategies and LLM gap analysis emit test sequences as structured JSON with a fixed set of atomic step types (register_write, poll, randomize_send, value_sweep, etc.).
Targeted Gap Closure: Coverage reports are parsed and presented to LLMs in natural language and structured formats. LLMs generate additional DSL sequences addressing uncovered states, conditions, or FSM transitions.
CodeGen with Safety Filters: All LLM-generated DSL is post-processed through auto-fix, validation, and enforced translation to syntactically correct, protocol-compliant UVM sequences via rule-based codegen. Sequence accumulation ensures previously achieved coverage is preserved.
Convergence Control: The coverage closure process proceeds iteratively, terminating either by achieving convergence (coverage improvement < 0.1 pp) or reaching a preset iteration budget (typically K=3).
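
The Stage 2 loop can be sketched as follows, under stated assumptions. The step schema, the helper task names in the emitted text (write_reg, poll_reg, send_random), and the measure/propose callbacks are hypothetical; only the step type names, the 0.1 pp threshold, and the K=3 budget come from the description above.

```python
# Hypothetical step schema: step type -> required fields (field names are assumptions).
STEP_FIELDS = {
    "register_write": ("addr", "data"),
    "poll": ("addr", "expect", "max_cycles"),
    "randomize_send": ("count",),
}


def validate(seq):
    """Return a list of error strings; an empty list means the sequence is well-formed."""
    errors = []
    for i, step in enumerate(seq.get("steps", [])):
        required = STEP_FIELDS.get(step.get("type"))
        if required is None:
            errors.append(f"step {i}: unknown type {step.get('type')!r}")
            continue
        errors += [f"step {i}: missing {f!r}" for f in required if f not in step]
    return errors


def emit_body(seq):
    """Translate validated steps into calls to hypothetical sequence helper tasks."""
    lines = []
    for step in seq["steps"]:
        if step["type"] == "register_write":
            lines.append(f"write_reg('h{step['addr']:x}, 'h{step['data']:x});")
        elif step["type"] == "poll":
            lines.append(
                f"poll_reg('h{step['addr']:x}, 'h{step['expect']:x}, {step['max_cycles']});"
            )
        else:  # randomize_send
            lines.append(f"repeat ({step['count']}) send_random();")
    return lines


def close_coverage(measure, propose, budget=3, epsilon=0.1):
    """Accumulate sequences until the gain falls below epsilon pp or the budget is spent.

    measure(seqs) -> coverage in percent; propose(coverage) -> candidate DSL sequences.
    Both callbacks are placeholders for the simulator and the LLM gap analysis.
    """
    accepted = []
    coverage = measure(accepted)
    for _ in range(budget):
        # Invalid candidates are dropped here rather than auto-fixed, for brevity.
        candidates = [s for s in propose(coverage) if not validate(s)]
        new_coverage = measure(accepted + candidates)
        if new_coverage - coverage < epsilon:
            break  # converged: keep what was already accepted
        accepted += candidates
        coverage = new_coverage
    return accepted, coverage


if __name__ == "__main__":
    seq = {
        "name": "enable_then_wait",
        "steps": [
            {"type": "register_write", "addr": 0x10, "data": 1},
            {"type": "poll", "addr": 0x14, "expect": 1, "max_cycles": 200},
        ],
    }
    print(validate(seq))
    print("\n".join(emit_body(seq)))
```

Because the emitter only accepts step types it knows, an LLM mistake shows up as a validation error on a JSON object rather than as a SystemVerilog syntax error.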

Experimental Results

HAVEN is evaluated across 19 open-source IP cores spanning Direct, Wishbone, and AXI4-Lite interface protocols (180–11k LOC), including the most comprehensive benchmark overlap with UVM2. The methodology requires no manual or per-design tuning.

Compile Reliability: 100% compilation success on all designs. In ablations, LLM-generated components without templates universally failed to compile.
Coverage: Predefined-sequence-only pipeline achieves 84.6% code and 79.8% functional coverage (average). Stage 2 iterative gap closure yields final averages of 90.6% code and 87.9% functional coverage. On UVM2-overlapping designs, HAVEN produces a coverage improvement of +3.6 pp code and +1.1 pp functional.
Cost Efficiency: Achieves full pipeline execution at $0.37 per design (averaging six LLM calls, ~46k tokens) on GPT-5.2. Open-source LLMs (e.g., Qwen3.5-27B) are competitive but underperform by 4–8 pp on complex peripherals.
Robustness: The structure of HAVEN is largely LLM-agnostic. While stronger open-source models provide competitive coverage, design scaling and protocol complexity can stress model context and capacity limits.
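
As a quick check on the figures above, the Stage 2 gains and per-run costs can be derived from the reported averages. The derived values below are simple arithmetic on those averages, not results reported by the paper, and the benchmark total assumes one pipeline run per design.

```python
# Averages reported in the results above.
stage1 = {"code": 84.6, "functional": 79.8}   # predefined sequences only
final = {"code": 90.6, "functional": 87.9}    # after iterative gap closure
cost_per_design_usd = 0.37
designs = 19
llm_calls = 6
tokens = 46_000  # approximate, as reported (~46k)

# Derived values (assumption: one pipeline run per design).
gains_pp = {k: round(final[k] - stage1[k], 1) for k in stage1}
total_cost_usd = round(cost_per_design_usd * designs, 2)
tokens_per_call = round(tokens / llm_calls)

if __name__ == "__main__":
    print("Stage 2 gain (pp):", gains_pp)
    print("Approximate cost for all designs (USD):", total_cost_usd)
    print("Approximate tokens per LLM call:", tokens_per_call)
```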

Discussion

Although HAVEN represents a significant improvement over prior LLM-driven approaches, remaining coverage gaps persist in designs requiring multi-phase protocol handling, non-linear state machine traversal, or multi-agent coordination—limitations inherent to the linear, fixed-step DSL. Run-to-run coverage variance is non-negligible due to LLM non-determinism in gap-filling, although compile success remains strictly deterministic due to the hard separation between LLM and code generation responsibilities.

Template engineering emerges as a one-time, protocol-specific effort: the marginal cost of supporting new protocols, once a single driver/monitor template is written, is low. The system's utility scales with protocol coverage. Future enhancements would require extending the DSL to express conditional branching and multi-agent behaviors, as well as LLM-aided synthesis of new protocol templates from natural language specifications.

Implications and Future Directions

HAVEN's approach validates that LLMs, when used for information extraction and design intent understanding, significantly amplify IC verification automation—as long as code emission is rigorously separated and controlled. This blueprint-codegen-template paradigm is generalizable to other code generation domains characterized by low-data, high-semantic-density programming languages.

Potential improvements include:

Extension of the DSL to support programmatic control flow, multi-phase and concurrent protocol behaviors.
LLM-powered natural language to template synthesis, reducing the template engineering burden for new protocols.
Tighter integration with static, semantics-aware analyzers to expand DSL expressiveness and support coverage-directed refinement.

Conclusion

HAVEN establishes a robust paradigm for scalable, LLM-assisted UVM testbench generation, achieving state-of-the-art coverage and reliability through strict modularization of information extraction and code generation. This hybrid approach obviates the primary failure mode of LLM-driven code synthesis in HDL-centric domains and offers a scalable framework for future advances in automated IC verification.
