An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations

Published 9 Apr 2026 in cs.CL and cs.SE | (2604.07755v2)

Abstract: Despite extensive research, LLMs continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential of 48.5% to 77%. Overall, we show that static analysis methods are a cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.


Authors (5)

Summary

The paper quantifies how well static analysis methods detect LLM-induced library hallucinations, establishing an empirical upper bound on recall of 48.5–77%.
It compares grammar-based techniques, off-the-shelf analyzers, and an LLM-as-a-judge baseline, revealing distinct precision and recall rates across benchmarks.
Integrating static analysis into code generation pipelines improves execution metrics but cannot fully eliminate non-existent library features.

Empirical Assessment of Static Analysis Methods for Library Hallucination in Code Generation

Introduction

LLMs exhibit persistent deficiencies in faithfully generating code, often introducing hallucinated library features—functionality non-existent in target APIs or libraries but fabricated by model inference. Code library hallucinations impact both the reliability and security of LLM-facilitated software engineering. "An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations" (2604.07755) systematically investigates the effectiveness of static analysis techniques for detecting and mitigating such hallucinations, quantifying the achievable upper bounds of these approaches across diverse models and benchmarks.

Problem Formulation and Methodology

The paper operationalizes library hallucinations as deviations from ground-truth library, API, or user-specified context. These errors can manifest as syntactically valid code invoking non-existent functions, misusing argument types, or importing phantom modules. The authors focus on static, post hoc, and in-generation detection, which contrasts with prior research that predominantly tackles hallucination reduction during model training or via retrieval-augmentation (Eghbali et al., 2024, Liu et al., 2024, Agarwal et al., 2024, Tian et al., 2024).

Three approaches are compared:

Grammar-based Analysis: Construction of a GBNF grammar from library docstrings and common aliases enables detection of non-conforming code either post-generation or during constrained decoding.
Off-the-shelf Static Analysis: Tools such as Mypy and Pyright, with auto-generated type stubs, are leveraged to identify inconsistencies in code against library contracts.
LLM-as-a-judge Baseline: A compact LLM (o3-mini) is prompted to predict code executability without code execution, representing a heuristic alternative.
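As a rough illustration of what detecting a non-existent library feature involves, the sketch below parses code with Python's ast module and checks every attribute chain rooted at an imported module against the installed library by introspection. It is not the paper's grammar construction or its Mypy/Pyright pipeline. The function name find_missing_attributes is hypothetical, and the sketch assumes the target libraries can be imported in the checking environment (importing runs their top-level code). Like the grammar-based approach, it performs no scope or data-flow analysis, so rebound names and `from x import y` forms go unchecked.

```python
import ast
import importlib


def _dotted_chain(node):
    """Turn an attribute expression like np.linalg.norm into ['np', 'linalg', 'norm']."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return parts[::-1]
    return None  # rooted at a call, subscript, etc.; not handled in this sketch


def find_missing_attributes(source: str) -> list[str]:
    """Return sorted dotted names that do not resolve on an imported module."""
    tree = ast.parse(source)
    modules = {}    # local alias -> imported module object
    missing = set()

    # Pass 1: resolve plain `import x` / `import x as y` statements.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                local = alias.asname or alias.name.split(".")[0]
                target = alias.name if alias.asname else local
                try:
                    modules[local] = importlib.import_module(target)
                except ImportError:
                    missing.add(alias.name)  # phantom module

    # Pass 2: walk each attribute chain and report the first prefix that fails.
    for node in ast.walk(tree):
        if not isinstance(node, ast.Attribute):
            continue
        chain = _dotted_chain(node)
        if not chain or chain[0] not in modules:
            continue
        obj = modules[chain[0]]
        for depth, attr in enumerate(chain[1:], start=2):
            if not hasattr(obj, attr):
                missing.add(".".join(chain[:depth]))
                break
            obj = getattr(obj, attr)

    return sorted(missing)
```

For a snippet that imports math and json (as js) and then calls math.sqrt, math.sqroot, and js.dumpz, this sketch reports js.dumpz and math.sqroot and leaves math.sqrt alone.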

Detection and mitigation are evaluated across leading LLMs (Claude-3, GPT-4, GPT-3.5, IBM-Granite) and three natural-language-to-code benchmarks (DS-1000, Odex, BigCodeBench) with explicit library usage requirements.

Results

Library Hallucination Prevalence

Empirical evaluation reveals that LLMs generate code containing non-existent library features in 8.1–40% of outputs, underscoring the extent of unresolved hallucination issues in current SOTA models.

Detection Capabilities

Static Analysis (Mypy/Pyright): Achieves recall in the range of 16–70% for general errors and 14–85% for library hallucinations, with performance contingent on dataset and code context complexity.
Grammar-based Method: Detects up to 15% of library hallucinations; effectiveness is limited by the expressiveness of the extracted grammar and lack of scope/data-flow analysis.
LLM-as-a-judge Baseline: While yielding high precision, recall is consistently lower and less reliable than off-the-shelf analyzers.

Manual error annotation establishes an upper bound of 48.5–77% on the fraction of hallucinations theoretically capturable by static analysis, demonstrating the inherent limitation of static, non-execution-based approaches for semantic error types.
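To make the relationship between recall and the annotated upper bound concrete, the following sketch computes precision and recall for a detector and the recall ceiling implied by a manually annotated "statically detectable" subset. The sample IDs and set names are hypothetical and are not data from the paper.

```python
def precision_recall(flagged: set, hallucinated: set) -> tuple[float, float]:
    """Precision and recall of a detector, given flagged and truly hallucinated sample IDs."""
    true_positives = len(flagged & hallucinated)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(hallucinated) if hallucinated else 0.0
    return precision, recall


def recall_upper_bound(statically_detectable: set, hallucinated: set) -> float:
    """Best recall any static method could reach if only the annotated subset is catchable."""
    if not hallucinated:
        return 0.0
    return len(statically_detectable & hallucinated) / len(hallucinated)


# Hypothetical sample IDs, for illustration only.
hallucinated = {2, 3, 5, 6, 7}   # responses that truly contain a library hallucination
flagged = {1, 2, 3, 4}           # responses the analyzer flagged
detectable = {2, 3, 5}           # hallucinations annotated as statically catchable

precision, recall = precision_recall(flagged, hallucinated)
ceiling = recall_upper_bound(detectable, hallucinated)
```

In this toy case the detector reaches a precision of 0.5 and a recall of 0.4, while the ceiling is 0.6: no static method could exceed that recall on these samples, however precise it is.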

Mitigation Effects

Error repair strategies that use static analysis (prompting LLMs to correct flagged errors) improve both execution (pass rates of up to ~85.3% on DS-1000) and correctness metrics, though the gains over pure self-repair are marginal. However, not all hallucinations are recoverable via these toolchains. Constrained decoding with the extracted grammars reduces the emission of imaginary library features (RIF metric), particularly on the data-science-centric DS-1000 benchmark, but does not universally improve functional correctness because the grammars are incomplete and docstrings are noisy. Grammar constraints also increase sampling cost and decoding latency.
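A repair loop of the kind described here might be sketched as follows. This is a minimal outline under stated assumptions, not the authors' implementation: generate stands in for any LLM call, check stands in for any static analyzer that returns a list of diagnostic strings, and the prompt wording and max_rounds default are hypothetical.

```python
from typing import Callable


def repair_with_feedback(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical LLM call: prompt -> code
    check: Callable[[str], list[str]],   # hypothetical analyzer: code -> diagnostics
    max_rounds: int = 2,
) -> str:
    """Generate code, then re-prompt with analyzer diagnostics until clean or out of rounds."""
    code = generate(prompt)
    for _ in range(max_rounds):
        errors = check(code)
        if not errors:
            break  # nothing flagged; keep the current candidate
        feedback = (
            f"{prompt}\n\nPrevious attempt:\n{code}\n\n"
            "Static analysis reported:\n" + "\n".join(errors) +
            "\n\nReturn a corrected version."
        )
        code = generate(feedback)
    return code
```

The loop returns the last candidate even if diagnostics remain, which mirrors the observation that some hallucinations are not recoverable this way. Pure self-repair corresponds to re-prompting without the analyzer's diagnostics.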

Analytical Insights

Blind spots for static analysers are systematically annotated. Notably, ambiguous prompt specifications and test case logic that require dynamic context undermine static detection. Areas such as control/data flow within DataFrame pipelines, function-local scoping, or implicit library aliasing evade complete static validation. Benchmark design weakness—insufficiently specified prompts or test cases—further confounds evaluation, highlighting that true progress in hallucination mitigation necessitates improved benchmarks as well as detection techniques.

Theoretical and Practical Implications

The work demonstrates that static analysis, while cheap and able to detect a substantial subset of hallucinations, cannot solve the hallucination problem in its entirety. Two causes are identified: (1) errors that arise only at run time, or from semantically ambiguous specifications, cannot be identified without executing the code; (2) library documentation (docstrings) is insufficiently expressive to support high-coverage grammar extraction for non-trivial APIs.

Constrained decoding, long hypothesized as a means of eliminating invalid code at generation time (Fu et al., 2024, Eghbali et al., 2024, Agarwal et al., 2024), proves to offer at best partial mitigation due to incomplete semantic knowledge in grammars. Even the best conventional static tools (such as Pyright and Mypy), though effective for type and syntax errors, lack the capacity to bridge mismatches between runtime behavior and high-level intent.
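The idea behind constrained decoding, and why it depends on how complete the constraint set is, can be shown with a toy prefix filter. This is a simplification for illustration only: it is not GBNF and not the paper's decoder, the function name allowed_next_tokens is hypothetical, and the set of valid names is taken from Python's math module purely as an example.

```python
import math


def allowed_next_tokens(partial: str, candidates: list[str], valid_names: set[str]) -> list[str]:
    """Keep candidate tokens for which partial + token is still a prefix of some valid name."""
    return [
        token for token in candidates
        if any(name.startswith(partial + token) for name in valid_names)
    ]


# Public names exposed by the math module serve as the (assumed complete) constraint set.
valid = {name for name in dir(math) if not name.startswith("_")}

# After the model has emitted "math.sq", only continuations toward real members survive.
surviving = allowed_next_tokens("sq", ["rt", "root", "uare"], valid)
```

Here only "rt" survives, since math.sqrt exists while math.sqroot and math.square do not. If valid were missing real members or contained spurious ones, as can happen when constraints are mined from noisy docstrings, the same filter would block valid code or admit invalid code.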

For LLM-based programming, integration of static analyzers into post-processing pipelines can reduce developer burden by flagging up to three-quarters of hallucination cases. To fully address the hallucination problem, complementary approaches—augmented retrieval (RAG), program analysis via partial execution or symbolic reasoning (Chen et al., 8 May 2025), more expressive benchmark/task design, and possibly deeper integration of static constraints into model pre-training—are required.

Future Directions

Improved Benchmarks: The lack of sufficient NL-to-code benchmarks with robust library coverage and explicit context is a recurring limitation for evaluating hallucination mitigation (2604.07755, Tian et al., 2024, Liu et al., 2024). Future benchmark design must prioritize test suite diversity and minimize prompt ambiguity.
Enhanced Grammar Extraction: Mining higher fidelity semantic constraints from library documentation, possibly via formal specification or API introspection, may boost grammar-constrained detection rates.
Hybrid Analysis: Fusion of static and dynamic/symbolic execution methods may overcome the current static analysis detection cap, especially for data-flow and higher-order logic errors.
Model-Side Solutions: LLMs with integrated awareness of external tool outputs or memory/state tracking could further reduce hallucination rates, as suggested in recent program reasoning research (Liu et al., 2024, Chen et al., 8 May 2025).
Generalization Beyond Python: While the present study is Python-centric, the findings are likely extensible to other dynamically typed languages, provided suitable benchmarks and toolchains are developed.

Conclusion

"An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations" (2604.07755) delivers a comprehensive quantification of how static analysis can and cannot address the hallucination problem in LLM-generated code. The results clarify the current limits (with an empirical upper bound of ∼77% recall on library hallucinations) and the sources of these limits. The advocated static techniques are expedient and provide clear practical value, but the persistence of undetectable/unrepairable hallucinations highlights the necessity of continued research into complementary AI reasoning, dynamic analysis, improved code benchmarks, and tighter integration of contextual and tool knowledge for trustworthy code generation.
