
AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents

Published 26 Dec 2025 in cs.SE, cs.AI, and cs.MA | (2512.22387v1)

Abstract: The rise of LLMs as coding agents promises to accelerate software development, but their impact on generated code reproducibility remains largely unexplored. This paper presents an empirical study investigating whether LLM-generated code can be executed successfully in a clean environment with only OS packages and using only the dependencies that the model specifies. We evaluate three state-of-the-art LLM coding agents (Claude Code, OpenAI Codex, and Gemini) across 300 projects generated from 100 standardized prompts in Python, JavaScript, and Java. We introduce a three-layer dependency framework (distinguishing between claimed, working, and runtime dependencies) to quantify execution reproducibility. Our results show that only 68.3% of projects execute out-of-the-box, with substantial variation across languages (Python 89.2%, Java 44.0%). We also find a 13.5 times average expansion from declared to actual runtime dependencies, revealing significant hidden dependencies.

Summary

  • The paper introduces a three-layered dependency evaluation framework that quantifies a 13.5× gap between declared and runtime dependencies in AI-generated code.
  • It assesses 300 projects across Python, JavaScript, and Java, with out-of-the-box success rates of 89.2%, 61.9%, and 44.0% respectively, revealing substantial language-specific challenges.
  • The study shows that only 68.3% of projects run out-of-the-box, indicating significant debugging overhead and persistent code generation errors.

Evaluation of AI-Generated Code Reproducibility

Introduction

Code reproducibility is essential to computational science: it underpins verification and collaborative progress. As developers increasingly rely on LLM-based coding agents such as Claude Code, OpenAI Codex, and Gemini, the reproducibility of the code they generate has become an important question. This essay examines those challenges as presented in the publication "AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents" (2512.22387).

Methodological Framework

The study evaluates 300 projects generated by three coding agents from 100 standardized prompts spanning Python, JavaScript, and Java. Each project is executed in a clean environment using only the dependencies the model specifies. A three-layer dependency framework quantifies execution reproducibility by distinguishing claimed dependencies (declared by the model), working dependencies (required to reproduce a successful run), and runtime dependencies (actually loaded during execution, including transitive ones). Declared dependencies often fall short: on average, runtime dependencies outnumber declared ones by a factor of 13.5.
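The three layers and the two derived metrics named in the glossary (completeness gap and runtime multiplier) can be sketched with plain sets. This is a minimal illustration, not the paper's implementation: the exact forms of Equations 6 and 7 are not reproduced in this summary, so the code assumes the gap is the count of working dependencies missing from the claimed set and the multiplier is the ratio of runtime to claimed counts. The class name `DependencyLayers` and the package names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyLayers:
    claimed: frozenset[str]   # D_c: packages the model declared
    working: frozenset[str]   # D_w: packages needed to get a successful run
    runtime: frozenset[str]   # D_r: everything loaded during execution

    def completeness_gap(self) -> int:
        """Packages needed for reproduction but never declared (assumed: D_w minus D_c)."""
        return len(self.working - self.claimed)

    def runtime_multiplier(self) -> float:
        """Ratio of runtime to claimed dependency counts (assumed: |D_r| / |D_c|)."""
        if not self.claimed:
            raise ValueError("runtime multiplier is undefined with no claimed dependencies")
        return len(self.runtime) / len(self.claimed)


# Hypothetical project: two declared packages, one forgotten, eight loaded at runtime.
example = DependencyLayers(
    claimed=frozenset({"flask", "requests"}),
    working=frozenset({"flask", "requests", "python-dotenv"}),
    runtime=frozenset({
        "flask", "requests", "python-dotenv", "werkzeug",
        "jinja2", "click", "urllib3", "certifi",
    }),
)
print(example.completeness_gap())    # 1
print(example.runtime_multiplier())  # 4.0
```

Under these assumed definitions, a project can have a completeness gap of zero and still show a large runtime multiplier, because transitive packages are installed automatically without ever being declared.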

Language-Specific Reproducibility

The results show stark variation in reproducibility success rates across the programming languages evaluated:

  • Python: Demonstrates the highest reproducibility at 89.2%, attributable to its simpler dependency structure and robust error messaging.
  • JavaScript: Exhibits a moderate reproducibility rate of 61.9%, hindered by complex nested dependencies.
  • Java: Fares the worst at 44.0%, reflecting challenges stemming from its intricate dependency configuration and transitive dependency management.

    Figure 1: Language-specific success rates reveal ecosystem complexity impacts.

Agent Capabilities and Specialization

A detailed analysis of execution success by agent and language underlines specialized competencies among the LLMs:

  • Claude excels in Java with an 80% success rate, highlighting its ability to handle more complex enterprise environments.
  • Gemini achieves a 100% success rate in Python, which may indicate optimization for data-science contexts.
  • Codex shows a consistent preference for Python over Java, suggesting a bias towards languages with simpler dependency structures.

These specializations, visualized in Figure 2, suggest that differences in training focus may give each agent distinct strengths and limitations.

Figure 2: Success rate heatmap reveals agent specializations. Claude excels at Java (80%), Gemini achieves perfect Python (100%), and all agents except Claude struggle with Java.

Dependency Complexity and Error Analysis

The study finds that only 68.3% of evaluated projects execute out-of-the-box. Among those that fail, the majority (52.6%) suffer from code generation errors rather than missing dependencies. Only 10.5% of execution failures are directly attributed to missing dependencies, which underscores the need for models to generate syntactically and logically correct code in the first place.

Figure 3: Error type distribution by agent among failed projects. Code bugs dominate overall (50 of 95), with Codex showing the highest count (24). Not Processed errors appear only in Codex and Gemini (8 each), while Dependency errors are most prevalent in Claude (7).
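The percentages in this section follow from the counts in the Figure 3 caption, and the arithmetic can be checked directly. The sketch below uses only figures quoted in this summary. The total number of dependency-related failures is not stated here, so the value 10 is an inference from 10.5% of 95 failures, not a reported count.

```python
TOTAL_PROJECTS = 300
SUCCESS_RATE = 0.683        # reported out-of-the-box execution rate
FAILED_PROJECTS = 95        # denominator quoted in the Figure 3 caption
CODE_BUG_FAILURES = 50      # "50 of 95" in the Figure 3 caption

# 68.3% of 300 projects is 205 successes, leaving 95 failures.
succeeded = round(TOTAL_PROJECTS * SUCCESS_RATE)
failed = TOTAL_PROJECTS - succeeded

# Share of failures caused by code bugs.
code_bug_share = CODE_BUG_FAILURES / FAILED_PROJECTS

# Inferred (not reported): 10.5% of 95 failures is roughly 10 projects.
dependency_failures_estimate = round(0.105 * FAILED_PROJECTS)

print(succeeded, failed)                 # 205 95
print(f"{code_bug_share:.1%}")           # 52.6%
print(dependency_failures_estimate)      # 10
```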

Implications and Future Directions

These findings suggest that the reproducibility problem in AI-generated code extends beyond missing dependency declarations. Developers face substantial debugging overhead, approximately 15 minutes per failed project, which translates into significant productivity costs at scale. The web of transitive dependencies that current LLMs under-recognize complicates matters further.
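To make the cost at scale concrete, the per-project figure can be multiplied out. This is a back-of-the-envelope sketch that assumes the 15-minute estimate applies uniformly to every failed project; the function name `debugging_hours` and the 10,000-project scenario are hypothetical.

```python
def debugging_hours(total_projects: int, success_rate: float,
                    minutes_per_failure: float = 15.0) -> float:
    """Estimated hours spent debugging projects that fail out-of-the-box."""
    failed = round(total_projects * (1.0 - success_rate))
    return failed * minutes_per_failure / 60.0


# The study's own dataset: 95 failed projects out of 300.
print(debugging_hours(300, 0.683))     # 23.75

# Hypothetical larger workload with the same failure rate.
print(debugging_hours(10_000, 0.683))  # 792.5
```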

For LLMs to become reliable coding agents, they must improve not only at producing functional logic but also at providing complete dependency specifications. This requires expanding training datasets to include complete projects with their dependency chains, and testing AI-generated outputs in controlled environments to catch reproduction failures early.

Conclusion

The study "AI-Generated Code Is Not Reproducible (Yet)" highlights substantial deficiencies in the ability of current LLMs to generate reproducible code. Given incomplete dependency handling and frequent code logic errors, advances in model training and more thorough evaluation are needed before AI-generated code can be relied on to run as delivered.


Glossary

  • Agent-Language Specialization Matrix: A comparative grid showing each agent’s success across different programming languages. "Agent-Language Specialization Matrix"
  • AWS EC2: Amazon’s elastic compute cloud service used to provision standardized virtual machines for experiments. "We deployed AWS EC2 instances (t2.large with 4 vCPUs and 16GB RAM running Ubuntu 22.04 LTS)"
  • Claimed Dependencies: The set of packages explicitly specified by the LLM as required for a project. "The first layer consists of Claimed Dependencies (D_c), which the LLM explicitly tells us we need."
  • Completeness gap: The difference between declared and actually needed dependencies discovered during reproduction. "The completeness gap (Equation 6) tells us how many dependencies the LLM forgot to mention:"
  • Dependency closure: The full set of direct and indirect packages required to run code. "fail to specify the dependency closure required for reproduction."
  • Dependency scopes: Maven’s classification of dependencies by purpose (e.g., compile, runtime, test). "multiple dependency scopes (compile, runtime, test, provided)"
  • Dependency tree: A hierarchical representation of direct and transitive dependencies for a project. "And for Java, we extracted Maven's dependency tree to see how one library pulls in dozens of others"
  • devDependencies: JavaScript packages needed for development but typically not for production runtime. "complicated by the distinction between dependencies and devDependencies"
  • Executable Reliability: The probability that a project runs successfully in a clean environment using only the LLM-provided specs. "we introduce Executable Reliability: the likelihood that a project executes successfully in a clean environment using only the dependencies and instructions the AI provides."
  • HumanEval: A benchmark assessing code generation functional correctness. "Current benchmarks like HumanEval and MBPP evaluate functional correctness assuming reproducible environments exist."
  • Iterative Dependency Resolution: A stepwise procedure to identify and install missing packages until execution succeeds. "Iterative Dependency Resolution"
  • Iterative Resolution Protocol: The evaluation workflow that emulates developer debugging after initial execution failure. "Iterative Resolution Protocol"
  • JUnit: A widely used Java testing framework. "Java projects most commonly lack test framework specifications (particularly JUnit)"
  • LiveCodeBench: A benchmark that mitigates memorization by providing complete environments. "LiveCodeBench prevents memorization but still provides complete environments"
  • Maven: Java’s build and dependency management tool. "Maven's complex transitive dependency resolution"
  • MBPP: A benchmark for evaluating program synthesis on multiple tasks. "Current benchmarks like HumanEval and MBPP evaluate functional correctness assuming reproducible environments exist."
  • npm: JavaScript’s package manager used to resolve and inspect dependency trees. "For JavaScript, we parsed npm's dependency tree to understand the full cascade of package requirements"
  • pip resolver: Python’s dependency resolution mechanism used by pip to manage package installations. "mature pip resolver with clear error messages"
  • pom.xml: Maven’s project configuration file specifying dependencies, plugins, and build settings. "complex XML configuration in pom.xml"
  • Provenance: Recorded metadata describing what occurs during execution (e.g., loaded dependencies). "SciUnit Provenance Analysis"
  • ReproZip: A tool that captures execution environments to enable reproducibility. "While tools like ReproZip and SciUnit capture execution environments for reproducibility"
  • requirements.lock: A locked dependency file documenting exact versions required for reproducible runs. "publishing requirements.lock files with exact versions of all 37+ packages actually needed"
  • requirements.txt: A Python file listing required packages (and versions) for a project. "requirements.txt listing dependencies."
  • Runtime Dependencies: All packages actually loaded during execution, including transitive ones. "The third and deepest layer exposes Runtime Dependencies (D_r) - everything that actually gets loaded when the code runs, including all transitive dependencies."
  • Runtime multiplier: The ratio of runtime dependencies to claimed dependencies, quantifying hidden complexity. "the runtime multiplier (Equation 7) reveals the hidden complexity beneath the surface:"
  • SciUnit: A Python tool that captures imports and runtime package usage for provenance. "For Python, we used SciUnit, which hooks into Python's import system to capture every package that gets loaded (Equation 11):"
  • SOTA agents: State-of-the-art LLM-based coding systems compared in the study. "To ensure a fair comparison across SOTA agents, we created a dataset of 300 projects"
  • Transitive closure: The complete set of dependencies reachable through repeated transitive relations. "runtime dependencies after executing (the complete transitive closure)."
  • Transitive dependencies: Packages indirectly required by direct dependencies and loaded at runtime. "including all transitive dependencies."
