Sound Agentic Science Requires Adversarial Experiments

Published 23 Apr 2026 in cs.AI | (2604.22080v1)

Abstract: LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.


Authors (2)

Summary

The paper argues that adversarial experiments are needed to expose and close the verification gap created by agentic data analysis.
The authors use NHANES data to show how small specification shifts can produce conflicting yet statistically credible results.
The work advocates a falsification-first standard under which agent outputs are treated as hypotheses until they are confirmed by experimental validation.

Agentic Science and the Necessity of Adversarial Experimentation

Context and Motivation

The paper "Sound Agentic Science Requires Adversarial Experiments" (2604.22080) addresses a critical challenge introduced by the proliferation of LLM-based agents in scientific research, particularly in data analysis. Faster agent-driven hypothesis generation and testing fundamentally alters the epistemic landscape, amplifying the traditional failure modes of observational science. Whereas coding agents in software achieve verification through targeted specification testing, scientific agents deployed for data analysis operate in a setting where iteration does not necessarily contract the hypothesis space and can instead expand it through increased analytic flexibility.

The Verification Gap: Science vs. Software

The authors draw a stark distinction between verification in software engineering and in empirical science. In software, agentic code generation is immediately grounded via a suite of tests, specification checks, and user feedback loops, contracting the hypothesis space at every iteration. In contrast, empirical science’s true verifier is nature, accessible only through properly designed experiments, controls, perturbations, and independent replications. LLM-driven data analysis agents lack direct access to new experiments, and their analytic iterations merely generate plausible narratives supported by existing data, exploiting degrees of freedom in modeling choices.

The paper formalizes the “verification gap,” warning that agentic acceleration of analysis and hypothesis generation can overwhelm the scientific community’s capacity for verification and further dilute the already weak signal in scientific publishing. The problem is particularly acute in biomedical and biological domains.

Adversarial Agentic Analysis: Demonstration and Risk

A toy experiment using NHANES data illustrates how easily agents prompted for opposing outcomes can produce conflicting but statistically plausible claims from the same dataset. Agent A reports a statistically significant association between higher serum vitamin D and lower depression symptom scores (PHQ-9: $-0.045$ points per $10\,\mathrm{nmol/L}$ increase in vitamin D, $p = 0.0006$). Agent B, using minimal adjustment and specification shifts, finds no evidence of an association (PHQ-9: $+0.0005$ points per $1\,\mathrm{nmol/L}$, $p = 0.855$; equivalently $+0.005$ points per $10\,\mathrm{nmol/L}$). Both analyses adhere to defensible epidemiological standards. The experiment underscores that agentic flexibility systematically expands the set of publishable claims, creating an environment optimized for positive results and publishable narratives rather than truth.
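How a single specification choice can flip a conclusion is easy to see on synthetic data. The sketch below is not the authors' NHANES analysis: the variables `x`, `y`, `z`, the effect sizes, and the helper names are all hypothetical, chosen only to show that "adjust for `z`" versus "do not adjust" yields two different, individually defensible slopes from one dataset.

```python
import random


def ols_slope(x, y):
    """Slope from a least-squares fit of y on x (intercept included)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residualize(v, z):
    """Part of v that a linear fit on z (intercept included) leaves unexplained."""
    slope = ols_slope(z, v)
    mean_v = sum(v) / len(v)
    mean_z = sum(z) / len(z)
    return [vi - mean_v - slope * (zi - mean_z) for vi, zi in zip(v, z)]


def adjusted_slope(x, y, z):
    """Coefficient on x in y ~ x + z, via the Frisch-Waugh-Lovell theorem."""
    return ols_slope(residualize(x, z), residualize(y, z))


def simulate(n=5000, seed=0):
    """Synthetic data (assumption): z drives both x and y; x has no direct effect on y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]        # hypothetical confounder
    x = [zi + rng.gauss(0, 1) for zi in z]         # hypothetical exposure
    y = [-0.5 * zi + rng.gauss(0, 1) for zi in z]  # hypothetical outcome
    return x, y, z


x, y, z = simulate()
unadjusted = ols_slope(x, y)        # specification 1: y ~ x
adjusted = adjusted_slope(x, y, z)  # specification 2: y ~ x + z
print(f"unadjusted slope: {unadjusted:+.3f}")
print(f"adjusted slope:   {adjusted:+.3f}")
```

Under this data-generating process the unadjusted slope is $-0.25$ in expectation (covariance $-0.5$ divided by variance $2$), while the adjusted slope is $0$ in expectation. An analyst, or an agent, free to pick either specification can therefore report either an association or a null.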

Scientific Foundations: Falsification as the Gold Standard

Anchoring its argument in classical epistemology, the paper invokes Popper’s falsifiability criterion and Fisher’s principles of experimental design, emphasizing the centrality of adversarial experimentation for scientific inference. Neither compelling narratives nor statistically significant results on a single dataset suffice for verification. Pearl’s causal analysis is cited to reinforce that claims about mechanisms require interventions, not mere observational correlations.

A key claim articulated is:

Without experimental evaluation and confirmation, agent outputs in empirical sciences should be treated as hypotheses rather than publishable conclusions.

The urgency of this distinction grows with agentic capacity, as the rate of plausible but potentially spurious analyses outpaces the rate of verification.

Agentic Falsification and Review Standards

To mitigate these risks, the authors propose a falsification-first standard: agentic outputs should be used not to construct convincing narratives and publishable positives, but to map the “negative space” around hypotheses by actively searching for failure and refutation. The agent that generates an analysis should double as an adversarial critic, proposing alternative explanations, targeted checks, and refutation attempts for every claim.
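As a rough illustration of what "targeted checks and refutation attempts" might look like in code, the sketch below subjects a claimed linear association to two simple attacks: a permutation test (does the slope beat shuffled labels?) and a split-half check (does the sign hold in random halves of the data?). The check names, thresholds, and the `falsification_report` helper are assumptions made for this example, not a protocol taken from the paper.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x (intercept included)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def permutation_p_value(x, y, rng, n_perm=500):
    """Share of label-shuffled datasets with |slope| >= the observed |slope|."""
    observed = abs(slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero


def split_half_agreement(x, y, rng, n_splits=50):
    """Fraction of random half-splits where both halves keep the full-sample sign."""
    full_sign = slope(x, y) > 0
    idx = list(range(len(x)))
    half = len(idx) // 2
    agree = 0
    for _ in range(n_splits):
        rng.shuffle(idx)
        halves = (idx[:half], idx[half:])
        signs = [slope([x[i] for i in h], [y[i] for i in h]) > 0 for h in halves]
        if all(s == full_sign for s in signs):
            agree += 1
    return agree / n_splits


def falsification_report(x, y, seed=0, alpha=0.05):
    """Run both attacks; the pass thresholds are illustrative choices only."""
    rng = random.Random(seed)
    p = permutation_p_value(x, y, rng)
    agreement = split_half_agreement(x, y, rng)
    return {
        "permutation_p": p,
        "split_half_agreement": agreement,
        "survived": p < alpha and agreement == 1.0,
    }


# Synthetic inputs (assumption): one outcome with a real effect, one pure noise.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(400)]
y_effect = [0.5 * xi + data_rng.gauss(0, 1) for xi in x]
y_noise = [data_rng.gauss(0, 1) for _ in x]

report_effect = falsification_report(x, y_effect)
report_noise = falsification_report(x, y_noise)
print("real effect:", report_effect)
print("pure noise: ", report_noise)
```

The design choice here is that a claim is reported as having survived only when every attack fails to break it. Checks of this kind still operate on existing data, so they narrow the verification gap without replacing new experiments.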

Recent frameworks, such as POPPER [Huang et al., 2025], demonstrate the feasibility of agentic sequential falsification within static data, achieving validation performance comparable to domain experts. However, even such systems currently fall short in executing physical experiments, leaving the verification gap unresolved. The authors advocate for extending agentic logic not only to reanalysis but to the design and execution of discriminating experiments, especially as automated laboratories become increasingly viable [Smith et al., 2025].
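One generic way to aggregate a sequence of falsification tests while keeping the overall error rate under control is to convert each p-value into an e-value and multiply. The sketch below illustrates that idea under stated assumptions. It is not presented as POPPER's implementation, and the function names, the calibrator parameter `kappa`, and the example p-values are hypothetical. It also assumes each test's null is that a measurable implication of the claim is absent, and that the tests are independent.

```python
def p_to_e(p, kappa=0.5):
    """Calibrate a p-value into an e-value: e = kappa * p**(kappa - 1), 0 < kappa < 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie in (0, 1)")
    return kappa * p ** (kappa - 1.0)


def sequential_falsification(p_values, alpha=0.1, kappa=0.5):
    """Multiply e-values test by test; stop once the product reaches 1 / alpha."""
    evidence = 1.0
    for tests_used, p in enumerate(p_values, start=1):
        evidence *= p_to_e(p, kappa)
        if evidence >= 1.0 / alpha:
            return {"null_rejected": True, "tests_used": tests_used, "evidence": evidence}
    return {"null_rejected": False, "tests_used": len(p_values), "evidence": evidence}


# Hypothetical p-values from successive falsification attempts.
strong = sequential_falsification([0.01, 0.04, 0.2])
weak = sequential_falsification([0.5, 0.6, 0.4])
print("strong:", strong)
print("weak:  ", weak)
```

With `kappa = 0.5` the calibrator reduces to $e = 0.5/\sqrt{p}$, so the first two hypothetical p-values give e-values of $5$ and $2.5$, whose product $12.5$ crosses the $1/\alpha = 10$ threshold at the second test. The weak sequence never accumulates enough evidence and stops without a rejection.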

Furthermore, the paper recommends that publishers enforce falsification-based peer review, in which submissions include runnable analytic packages and agents are used to adversarially probe the claims.

Implications and Future Directions

The adoption of a falsification-first agentic workflow carries significant implications:

Practical: Short-term mitigation is feasible by incorporating adversarial checks into agentic analyses, shifting review standards from narrative plausibility to adversarial robustness. This approach becomes more scalable and cost-effective as agentic automation lowers the marginal cost for both positive and negative analysis.
Theoretical: The paper asserts that agentic augmentation fundamentally alters the dynamics of scientific discovery, raising the possibility of end-to-end automated workflows in empirical science that integrate experiment design, execution, and inference updates, ultimately closing the verification gap.

In the longer term, the authors speculate that increased automation will support agents in running actual experiments, updating hypotheses based on experimental outcomes, and iterating until robust claims are attained.

Conclusion

"Sound Agentic Science Requires Adversarial Experiments" delivers a formal warning regarding the epistemic risks of rapid agentic analysis in empirical science. By exposing the verification gap and proposing a falsification-first standard, the authors argue for the rebalancing of agentic workflows toward adversarial trial and robust validation. The central thesis is that, for agentic science to remain sound and informative—particularly in domains where experimental access is nontrivial—agents must not merely accelerate discovery but must be actively applied to falsification and adversarial scrutiny. As agentic capability grows, so too must the pressure to break weak claims, with publishers and the broader scientific process adapting to uphold this standard.