- The paper presents a novel LLM-guided fuzzing framework, ClozeMaster, which infills masked Rust code using empirically mined bug-triggering patterns.
- It combines historical bug report analysis, bracket masking, and fine-tuned LLM infilling to generate test cases that achieved up to 65% code coverage and found 37 new bugs.
- The approach offers a continuous, automated pipeline for enhancing compiler resilience by systematically targeting under-tested, complex language constructs.
ClozeMaster: LLM-Guided Fuzzing for the Rust Compiler
Motivation and Empirical Foundations
The increasing adoption of Rust in security-critical software infrastructure amplifies the need for robust compiler validation, especially given Rust's intricate semantics involving ownership, lifetimes, and a strict type system. Despite advances in compiler fuzzing, random or templated program generation for Rust typically yields low validity rates because of the language's complex syntax, and naïve LLM-based approaches share this weakness because Rust is a long-tailed, sparsely represented language in their pretraining corpora.
Motivated by the empirical observation that code snippets which historically triggered bugs tend to stress fragile, poorly tested compiler paths, the authors conducted a large-scale analysis of bug reports (over 7,000 issues, with more than 6,000 extracted program snippets) for the official Rust compiler (rustc) and mrustc. The study shows that a significant fraction of bug-triggering test cases concentrate on specialized features (e.g., feature gates, unsafe blocks, extern blocks, inline assembly) and rely heavily on bracketed syntactic structures, which suggests that these constructs are under-tested and disproportionately likely to expose compiler bugs.
Figure 1: Heatmap analysis of bug reports across rustc internal components, highlighting persistent vulnerability regions over time.
Figure 2: Example Rust test code (from rustc) illustrating dense usage of feature declarations and bracketed constructs associated with prior compiler failures.
ClozeMaster System Architecture
ClozeMaster implements a new paradigm for Rust compiler fuzzing by using LLMs for "cloze-style" completion over contextually rich, bug-inducing code skeletons. The pipeline proceeds as follows:
- Corpus Construction and Historical Mining: A dedicated data scraper harvests closed "C-bug" and "T-compiler" issues from the Rust compiler bug tracker, as well as the official test suite and the Glacier database. All candidate code snippets that previously triggered bugs are collected, and seeds that are redundant because they trigger the same failure mode are de-duplicated.
- Data Augmentation and Model Tuning: Since open LLMs have insufficient Rust representation, the authors augment the corpora via probabilistic token deletion and statement reordering that preserve Rust’s syntactic validity. This is used to fine-tune the Incoder LLM for improved Rust-specific completion performance, particularly for code infilling tasks.
- clozeMask Mutation Strategy: Rather than full-program generation or random line removal, ClozeMaster masks the contents within bracketed structures (such as {}, (), [], <>) at varying nesting depths within trigger-rich seed programs. The LLM is then prompted to infill the masked locations, thereby synthesizing new test cases that retain contextually realistic, bug-inducing scaffolds.
Figure 3: Schematic depiction of the ClozeMaster pipeline: code mining, masking, LLM-based infilling, and bug deduplication for continuous corpus enrichment.
- Fuzzing, Oracles, and Bug Deduplication: Synthesized programs are batch-fuzzed against rustc and mrustc under strict timeouts. Compiler behavior is monitored for internal compiler errors (ICEs) and hangs; duplicate failures are filtered using call stack and hang signature matching.
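As a rough illustration of the oracle and deduplication step, the sketch below classifies one compiler run as an ICE, a hang, or uninteresting, and derives a coarse signature from backtrace frames. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`classify`, `signature`, `is_new`, `run_once`), the stderr markers, the `N: symbol` frame pattern, the top-5 frame cutoff, and the 10-second timeout are all illustrative choices. Hang signatures are not modeled because the summary does not describe how they are computed.

```python
import hashlib
import re
import subprocess
from typing import Set, Tuple

# Assumed stderr markers for a compiler panic; not taken from the paper.
ICE_MARKERS = ("internal compiler error", "thread 'rustc' panicked")
# Assumed shape of a backtrace frame line, e.g. "   3: rustc_middle::foo".
FRAME_RE = re.compile(r"^\s*\d+:\s+(\S.*)$")


def classify(stderr: str, timed_out: bool) -> str:
    """Map one compiler run to 'hang', 'ice', or 'ok'."""
    if timed_out:
        return "hang"
    if any(marker in stderr for marker in ICE_MARKERS):
        return "ice"
    return "ok"  # ordinary diagnostics are not treated as bugs


def signature(stderr: str, top_frames: int = 5) -> str:
    """Hash the first few backtrace frames as a crude duplicate key."""
    frames = []
    for line in stderr.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(1).strip())
    if frames:
        key = "\n".join(frames[:top_frames])
    else:
        # No backtrace available: fall back to the first ICE-looking line.
        key = next(
            (l.strip() for l in stderr.splitlines()
             if any(marker in l for marker in ICE_MARKERS)),
            "",
        )
    return hashlib.sha256(key.encode()).hexdigest()[:16]


def is_new(sig: str, seen: Set[str]) -> bool:
    """Record the signature and report whether it was unseen."""
    if sig in seen:
        return False
    seen.add(sig)
    return True


def run_once(path: str, compiler: str = "rustc",
             timeout_s: float = 10.0) -> Tuple[str, str]:
    """Compile one test file under a timeout; return (verdict, stderr)."""
    try:
        proc = subprocess.run([compiler, path], capture_output=True,
                              text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang", ""
    return classify(proc.stderr, False), proc.stderr
```

Keeping `classify` and `signature` as pure functions over captured output means the heuristics can be checked without a compiler installed; only `run_once` touches the toolchain.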
This process is iterated, allowing continuous compounding of seed diversity with empirical feedback from newly discovered bugs.
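The bracket-masking step in the list above can be sketched in a few lines. This is a simplified illustration, not the paper's code: the `<MASK>` placeholder and the helper names are hypothetical, angle brackets are left out because `<` and `>` also appear as operators and in `->`, and brackets inside string literals or comments are not treated specially.

```python
from typing import List, Tuple

OPEN_TO_CLOSE = {"{": "}", "(": ")", "[": "]"}
CLOSERS = set(OPEN_TO_CLOSE.values())
MASK = "<MASK>"  # hypothetical placeholder the infilling model would replace


def bracket_spans(src: str) -> List[Tuple[int, int, int]]:
    """Return (open_index, close_index, depth) for each matched pair.

    Depth 1 is the outermost level. Unmatched closers are ignored.
    """
    stack: List[Tuple[str, int]] = []
    spans: List[Tuple[int, int, int]] = []
    for i, ch in enumerate(src):
        if ch in OPEN_TO_CLOSE:
            stack.append((ch, i))
        elif ch in CLOSERS and stack and OPEN_TO_CLOSE[stack[-1][0]] == ch:
            _, start = stack.pop()
            spans.append((start, i, len(stack) + 1))
    return spans


def mask_span(src: str, span: Tuple[int, int, int]) -> Tuple[str, str]:
    """Replace the contents of one bracket pair with MASK.

    Returns (masked_program, original_contents).
    """
    start, end, _ = span
    original = src[start + 1:end]
    masked = src[:start + 1] + MASK + src[end:]
    return masked, original


def cloze_variants(src: str, depth: int) -> List[Tuple[str, str]]:
    """One masked variant per non-empty bracket pair at the given depth."""
    return [
        mask_span(src, span)
        for span in bracket_spans(src)
        if span[2] == depth and span[1] - span[0] > 1  # skip empty pairs
    ]


SEED = "fn main() { let v = vec![1, 2]; foo(v[0]); }"

if __name__ == "__main__":
    for masked, original in cloze_variants(SEED, depth=2):
        print(masked, "| masked out:", repr(original))
```

Each masked variant keeps the surrounding seed intact, so the infilling model sees the same context that originally triggered a bug and only has to propose new contents for one bracketed region.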
Empirical Evaluation and Comparative Analysis
ClozeMaster’s effectiveness was systematically evaluated on multiple stable releases of rustc and mrustc, with explicit comparisons to the strongest existing generative and mutational Rust fuzzers (RustSmith, Rustlantis, SPE).
Ablation Experiments: Removing historical bug-triggering seeds or substituting random line masking for bracket masking caused severe performance degradation, affirming the critical contribution of both empirical bug knowledge and bracket-structure targeting. Likewise, direct program synthesis from the LLM without seed context (i.e., zero-shot generation) produced an overwhelming majority of syntactically invalid, non-triggering tests.
LLM Selection: The authors conducted a cross-model evaluation of Incoder, StarCoder, and CodeShell on cloze-filling benchmarks. The fine-tuned Incoder attained the highest rates of valid code infilling and successful test compilation, whereas non-adapted models exhibited high error rates and often mishandled Rust syntax.
Complexity and Nature of Synthesized Bugs
ClozeMaster can generate highly nested, semantically intricate test cases that combine advanced features, macro recursion, and deep trait/lifetime constraints. These test cases not only increase code coverage depth but also routinely stress interactions between language features that other fuzzers exercise poorly.
Figure 5: Archetypal code skeletons (feature gates, unsafe, extern, asm) that have repeatedly triggered ICEs or hangs.
Generated test cases have been confirmed to trigger hangs in type inference and recursive macro expansion, as well as novel ICEs in under-tested const-generic and async/trait interactions.
Theoretical and Practical Implications
ClozeMaster exemplifies a successful instantiation of "historical-bug-guided, LLM-infilled mutation," a paradigm suited to compilers for languages with long-tailed constructs and insufficient training corpora. The approach closes the gap between generic LLM code generation, which lacks Rust-specific depth, and rule-based program fuzzers, which fail to capture the bug-prone combinatorial space characteristic of Rust.
Practical implications: Compiler developers gain a continuous, automated bug-mining pipeline that produces both regression tests and novel tests. This can improve compiler soundness and resilience as the language evolves.
Theoretical implications: The combination of structure-aware masking with empirical seed mining facilitates coverage of rare syntactic and semantic code regions. This may serve as a template for similar fuzzing in other emerging languages lacking differential testing or mature toolchains.
Figure 6: Longitudinal distribution of ClozeMaster-discovered bugs across multiple stable Rust compiler versions, showing persistent, undetected failures over several years.
Conclusion
ClozeMaster demonstrates that using LLMs for cloze-style completion over empirically selected bug-triggering programs, guided by syntax-driven masking and targeted fine-tuning, provides state-of-the-art effectiveness in Rust compiler fuzzing. The critical innovation lies in exploiting the contextual code knowledge captured by mining real bug reports and in deploying LLM infilling not as an indiscriminate generator but as a domain-adapted, context-guided mutator. The approach is extensible and provides a solid foundation for next-generation compiler fuzzing frameworks for new or complex languages (2605.00413).