ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs

Published 1 May 2026 in cs.SE | (2605.00413v1)

Abstract: Ensuring the reliability of the Rust compiler is of paramount importance, given increasing adoption of Rust for critical systems development, due to its emphasis on memory and thread safety. However, generating valid test programs for the Rust compiler poses significant challenges, given Rust's complex syntax and strict requirements. With the growing popularity of LLMs, much research in software testing has explored using LLMs to generate test cases. Still, directly using LLMs to generate Rust programs often results in a large number of invalid test cases. Existing studies have indicated that test cases triggering historical compiler bugs can assist in software testing. Our investigation into Rust compiler bug issues supports this observation. Inspired by existing work and our empirical research, we introduce a bracket-based masking and filling strategy called clozeMask. The clozeMask strategy involves extracting test code from historical issue reports, identifying and masking code snippets with specific structures, and using an LLM to fill in the masked portions for synthesizing new test programs. This approach harnesses the generative capabilities of LLMs while retaining the ability to trigger Rust compiler bugs. It enables comprehensive testing of the compiler's behavior, particularly exploring edge cases. We implemented our approach as a prototype CLOZEMASTER. CLOZEMASTER has identified 27 confirmed bugs for rustc and mrustc, of which 10 have been fixed by developers. Furthermore, our experimental results indicate that CLOZEMASTER outperforms existing fuzzers in terms of code coverage and effectiveness.


Authors (6)

Summary

The paper presents a novel LLM-guided fuzzing framework, ClozeMaster, which infills masked Rust code using empirically mined bug-triggering patterns.
It combines historical bug report analysis, bracket masking, and fine-tuned LLM infilling to generate test cases that reached up to 65% code coverage and found 37 new bugs, 27 of which were confirmed by developers.
The approach offers a continuous, automated pipeline for enhancing compiler resilience by systematically targeting under-tested, complex language constructs.

ClozeMaster: LLM-Guided Fuzzing for the Rust Compiler

Motivation and Empirical Foundations

The increasing adoption of Rust in security-critical software infrastructure amplifies the necessity for robust compiler validation, especially given Rust's intricate semantics involving ownership, lifetimes, and a strict type system. Despite advances in compiler fuzzing, random or templated program generation for Rust typically yields low validity rates due to the language’s complex syntax, and naïve LLM-based approaches inherit this defect due to the long-tailed, limited Rust corpora in their pretraining datasets.

Motivated by the empirical finding that code snippets that historically triggered bugs tend to stress fragile, poorly tested compiler paths, the authors conducted a large-scale analysis of bug reports (over 7,000 issues, with more than 6,000 extracted program snippets) for the rustc and mrustc compilers. The study shows that a significant fraction of bug-triggering test cases concentrate on specialized features (e.g., feature gates, unsafe blocks, extern blocks, inline assembly) and rely heavily on bracketed syntactic structures, suggesting that these constructs are under-tested and disproportionately likely to expose compiler bugs.
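
As a rough illustration of this kind of corpus profiling, the sketch below counts how many snippets use each of the constructs named above and how many bracket characters they contain. The regular expressions and the `profile_snippets` name are assumptions made for illustration; the authors' actual extraction rules are not described here.

```python
import re
from collections import Counter

# Hypothetical detectors for the constructs named in the text.
# They are plain regex heuristics, not a Rust parser.
PATTERNS = {
    "feature_gate": re.compile(r"#!\[feature\("),
    "unsafe_block": re.compile(r"\bunsafe\b"),
    "extern_block": re.compile(r"\bextern\b"),
    "inline_asm": re.compile(r"\basm!\s*\("),
}
BRACKETS = "{}()[]<>"


def profile_snippets(snippets):
    """Return (per-construct snippet counts, total bracket characters)."""
    usage = Counter()
    bracket_chars = 0
    for src in snippets:
        for name, pattern in PATTERNS.items():
            if pattern.search(src):
                usage[name] += 1
        bracket_chars += sum(src.count(ch) for ch in BRACKETS)
    return usage, bracket_chars
```

Because the detectors are heuristics, keywords that appear inside comments or string literals are counted as well; a real study would more likely work from a parsed syntax tree.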

Figure 1: Heatmap analysis of bug reports across rustc internal components, highlighting persistent vulnerability regions over time.

Figure 2: Example Rust test code (from rustc) illustrating dense usage of feature declarations and bracketed constructs associated with prior compiler failures.

ClozeMaster System Architecture

ClozeMaster approaches Rust compiler fuzzing by using LLMs for "cloze-style" completion over contextually rich, bug-inducing code skeletons. The pipeline proceeds in four stages:

Corpus Construction and Historical Mining: A dedicated scraper harvests closed "C-bug" and "T-compiler" issues from the Rust compiler bug tracker, along with the official test suite and the Glacier database. All code snippets that previously triggered bugs are collected, and seeds that reproduce the same failure mode are de-duplicated.
Data Augmentation and Model Tuning: Because open LLMs see relatively little Rust during pretraining, the authors augment the corpus via probabilistic token deletion and statement reordering that preserve Rust’s syntactic validity. The augmented corpus is used to fine-tune the InCoder LLM for Rust-specific completion, particularly code infilling.
clozeMask Mutation Strategy: Rather than generating full programs or removing random lines, ClozeMaster masks the contents of bracketed structures (such as {}, (), [], <>) at varying nesting depths within trigger-rich seed programs. The LLM is then prompted to infill the masked locations, synthesizing new test cases that retain contextually realistic, bug-inducing scaffolds.
Fuzzing, Oracles, and Bug Deduplication: Synthesized programs are batch-compiled with rustc and mrustc under strict timeouts. Compiler behavior is monitored for internal compiler errors (ICEs) and hangs; duplicate failures are filtered using call stack and hang signature matching.

Figure 3: Schematic depiction of the ClozeMaster pipeline: code mining, masking, LLM-based infilling, and bug deduplication for continuous corpus enrichment.
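
The bracket-masking step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `<MASK>` placeholder and the function names are hypothetical, angle brackets are left out because a character scan cannot tell generics from comparison operators or `->`, and brackets inside string literals or comments are not treated specially.

```python
PAIRS = {"{": "}", "(": ")", "[": "]"}
MASK = "<MASK>"  # hypothetical placeholder; the real infilling token depends on the model


def bracket_spans(src):
    """Return sorted (open_idx, close_idx, depth) for every matched bracket pair."""
    stack, spans = [], []
    for i, ch in enumerate(src):
        if ch in PAIRS:
            stack.append((ch, i))
        elif ch in PAIRS.values():
            # Only close a pair when the closer matches the innermost opener;
            # stray closers are ignored in this sketch.
            if stack and PAIRS[stack[-1][0]] == ch:
                _, start = stack.pop()
                spans.append((start, i, len(stack)))  # depth 0 = outermost
    return sorted(spans)


def mask_span(src, span):
    """Replace the contents between one bracket pair with the mask token."""
    start, end, _ = span
    return src[:start + 1] + MASK + src[end:]


def mask_variants(src, depth=None):
    """Build one masked program per bracket pair, optionally at a single depth."""
    return [
        mask_span(src, span)
        for span in bracket_spans(src)
        if depth is None or span[2] == depth
    ]
```

For the seed `fn main() { let v = [1, 2]; foo(v); }`, masking at depth 1 yields one variant with `[<MASK>]` and one with `foo(<MASK>)`, while the surrounding function body stays intact. Each variant would then be handed to the infilling model, which is outside the scope of this sketch.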

This process is iterated, allowing continuous compounding of seed diversity with empirical feedback from newly discovered bugs.
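
One way to picture the oracle and deduplication step is the sketch below. It is an assumption-laden outline: the compiler command line is supplied by the caller, the ICE check is a substring match on the "internal compiler error" text that rustc normally prints when it crashes, and the signature scheme (panic line with digits normalized for ICEs, seed identifier for hangs) is a hypothetical stand-in for the call stack and hang signature matching mentioned above.

```python
import re
import subprocess

ICE_MARKER = "internal compiler error"  # assumed marker text in compiler stderr


def classify(stderr, timed_out):
    """Map one compiler run to 'hang', 'ice', or 'ok'."""
    if timed_out:
        return "hang"
    if ICE_MARKER in stderr:
        return "ice"
    return "ok"  # ordinary success or ordinary compile error


def run_case(cmd, timeout_s):
    """Run a caller-supplied compiler command (a list of args) under a timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang", ""
    return classify(proc.stderr, timed_out=False), proc.stderr


def signature(kind, stderr, seed_id):
    """Hypothetical dedup key: normalized panic line for ICEs, seed id for hangs."""
    if kind == "ice":
        for line in stderr.splitlines():
            if "panicked at" in line:
                return ("ice", re.sub(r"\d+", "N", line.strip()))
        return ("ice", "unknown")
    return (kind, seed_id)


def record_new(findings, kind, stderr, seed_id):
    """Store a failure if its signature is unseen; return True when it is new."""
    if kind == "ok":
        return False
    sig = signature(kind, stderr, seed_id)
    if sig in findings:
        return False
    findings[sig] = seed_id
    return True
```

In a loop, every program for which `record_new` returns True would be reported and could also be fed back into the seed pool, which is the compounding effect the paragraph above describes.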

Empirical Evaluation and Comparative Analysis

ClozeMaster’s effectiveness was systematically evaluated on multiple stable releases of rustc and mrustc, with explicit comparisons to the strongest existing generative and mutational Rust fuzzers (RustSmith, Rustlantis, SPE).

Bug-Finding Performance: Across the test campaigns, ClozeMaster identified 37 new crashing or hanging bugs, 27 of which were confirmed by compiler developers. Of these, 10 have been fixed, and many of the others had persisted undetected over multiple compiler releases.
Code Coverage: ClozeMaster achieved higher coverage than all tested baselines (up to 65%, compared with 33% for RustSmith and 62% for SPE; see Table 1), indicating that it exercises a broader range of rarely reached compiler paths.
Figure 4: Venn diagram of unique bugs detected by ClozeMaster versus RustSmith, Rustlantis, and SPE on rustc 1.73, showing minimal overlap and high complementarity.

Ablation Experiments: Removing historical bug-triggering seeds or substituting random line masking for bracket masking caused severe performance degradation, indicating that both empirical bug knowledge and bracket-structure targeting contribute substantially. Likewise, direct program synthesis from the LLM without seed context (i.e., zero-shot generation) produced an overwhelming majority of syntactically invalid, non-triggering tests.
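
For contrast, the random line masking baseline from the ablation might look like the sketch below. The function name, the `<MASK>` placeholder, and the choice to mask exactly one non-blank line are assumptions for illustration; the source does not specify how the baseline was implemented.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder token


def mask_random_line(src, rng):
    """Replace one randomly chosen non-blank line with the mask, keeping its indent."""
    lines = src.splitlines()
    candidates = [i for i, line in enumerate(lines) if line.strip()]
    if not candidates:
        return src  # nothing to mask
    i = rng.choice(candidates)
    indent = lines[i][: len(lines[i]) - len(lines[i].lstrip())]
    lines[i] = indent + MASK
    return "\n".join(lines)
```

Unlike bracket masking, this baseline ignores syntactic structure: it can remove a line such as `fn main() {` and leave the closing brace unmatched. That is one plausible reason for the degradation reported above, though the source does not state the cause.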

LLM Selection: The authors conducted a cross-model evaluation on cloze-filling benchmarks (InCoder, StarCoder, CodeShell). InCoder, after fine-tuning, attained the highest rates of valid infilling and successful compilation, whereas the non-adapted models exhibited high error rates and often produced incorrect Rust syntax.

Complexity and Nature of Synthesized Bugs

ClozeMaster can generate highly nested, semantically intricate test cases that combine advanced features, macro recursion, and deep trait/lifetime constraints. These test cases not only deepen code coverage but also stress interactions between language features that other fuzzers exercise poorly.

Figure 5: Archetypal code skeletons (feature gates, unsafe, extern, asm) that have repeatedly triggered ICEs or hangs.

Generated test cases have been confirmed to trigger hangs in type inference and recursive macro expansion, as well as new ICEs in under-tested const-generic and async/trait interactions.

Theoretical and Practical Implications

ClozeMaster exemplifies the paradigm of "historical-bug-guided LLM-infilled mutation," suited to compilers with long-tailed language constructs and insufficient training corpora. The approach narrows the gap between generic LLM code generation, which lacks Rust-specific depth, and rule-based program fuzzers, which fail to capture the bug-prone combinatorial space characteristic of Rust.

Practical implications: Compiler developers gain a continuous, automated bug-mining pipeline that produces both regression tests and novel tests. This should help improve compiler soundness and resilience as the language evolves.

Theoretical implications: The combination of structure-aware masking with empirical seed mining facilitates coverage of rare syntactic and semantic code regions. This may serve as a template for similar fuzzing in other emerging languages lacking differential testing or mature toolchains.

Figure 6: Longitudinal distribution of ClozeMaster-discovered bugs across multiple stable Rust compiler versions, showing persistent, undetected failures over several years.

Conclusion

ClozeMaster demonstrates that harnessing LLMs for cloze-style completion over empirically selected bug-triggering programs, guided by syntax-driven masking and targeted fine-tuning, provides state-of-the-art effectiveness in Rust compiler fuzzing. The critical innovation lies in leveraging contextual code knowledge mined from real bug reports and deploying LLMs not as indiscriminate generators but as domain-adapted, context-guided mutators. The approach is extensible and offers a foundation for compiler fuzzing frameworks targeting new or complex languages (2605.00413).
