CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis

Published 23 Apr 2026 in cs.CR and cs.SE | (2604.21917v1)

Abstract: We present CrossCommitVuln-Bench, a curated benchmark of 15 real-world Python vulnerabilities (CVEs) in which the exploitable condition was introduced across multiple commits - each individually benign to per-commit static analysis - but collectively critical. We manually annotate each CVE with its contributing commit chain, a structured rationale for why each commit evades per-commit analysis, and baseline evaluations using Semgrep and Bandit in both per-commit and cumulative scanning modes. Our central finding: the per-commit detection rate (CCDR) is 13% across all 15 vulnerabilities - 87% of chains are invisible to per-commit SAST. Critically, both per-commit detections are qualitatively poor: one occurs on commits framed as security fixes (where developers suppress the alert), and the other detects only the minor hardcoded-key component while completely missing the primary vulnerability (200+ unprotected API endpoints). Even in cumulative mode (full codebase present), the detection rate is only 27%, confirming that snapshot-based SAST tools often miss vulnerabilities whose introduction spans multiple commits. The dataset, annotation schema, evaluation scripts, and reproducible baselines are released under open-source licenses to support research on cross-commit vulnerability detection.


Authors (1)

Arunabh Majumdar

Summary

The paper shows that 87% of the 15 curated multi-commit Python CVEs go undetected by per-commit SAST, and 73% remain undetected even when the full pre-fix codebase is scanned.
The methodology curates 15 high-severity vulnerabilities through rigorous manual forensic analysis of multi-commit chains and commit history.
The findings imply a need for cross-commit stateful analysis and advanced vulnerability detection methods in modern CI/CD pipelines.

CrossCommitVuln-Bench: Multi-Commit Vulnerability Benchmarking in Python

Motivation and Problem Statement

Static Application Security Testing (SAST) is widely adopted in modern CI/CD pipelines, where it typically analyzes code as single-commit snapshots or pull request diffs. The implicit assumption is that vulnerabilities are introduced atomically and are therefore detectable in isolation. "CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis" (2604.21917) challenges this assumption by documenting severe real-world Python CVEs that emerged across multiple individually benign commits. The dataset targets cases where each commit looks innocuous but the cumulative effect is an exploitable vulnerability, which per-commit SAST is poorly placed to detect.

Dataset Construction and Annotation

The author curates 15 real-world Python CVEs of high or critical severity, each satisfying five inclusion criteria:

Multi-commit inception: At least two distinct commits are collectively responsible.
Individual benignity: Each commit appears plausible and non-malicious in isolation.
Collective exploitability: The vulnerability exists only once all commits are combined; severity is confirmed by a CVSS score ≥ 7.0.
Open-source provenance: The complete commit history and repo are public.
Reproducibility: The vulnerable state is reconstructable.

The mining pipeline draws candidates from the GitHub Security Advisory Database and the OSV API, then applies manual forensic analysis to each candidate CVE using git blame and commit diff inspection. Negative examples (single-commit flaws and non-benign introducing diffs) are retained as calibration cases for annotation reliability.

Annotations are structured as JSON objects encoding CVE details, commit chain rationale, isolated/cumulative SAST findings, and explicit explanations for why per-commit detection fails. Qualitative roles (SOURCE, SINK, GUARD_REMOVAL) are assigned to contributing commits, supporting nuanced taxonomy and downstream ML tasks.
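To make the shape of such a record concrete, the following is a minimal sketch of one annotation as Python dataclasses serialized to JSON. Every field name here (cve_id, chain, evasion_rationale, and so on), the placeholder CVE identifier, and the placeholder commit hashes are assumptions for illustration; the summary does not specify the released schema, so the actual keys may differ. Only the three role names come from the text above.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List


class Role(str, Enum):
    """Commit roles named in the text."""
    SOURCE = "SOURCE"
    SINK = "SINK"
    GUARD_REMOVAL = "GUARD_REMOVAL"


@dataclass
class ChainCommit:
    sha: str                            # placeholder hash, not a real commit
    role: Role
    evasion_rationale: str              # why per-commit analysis misses it
    flagged_in_isolation: bool = False  # per-commit SAST finding


@dataclass
class Annotation:
    cve_id: str                         # hypothetical field name
    cwe: str
    cvss: float
    chain: List[ChainCommit] = field(default_factory=list)
    flagged_cumulative: bool = False    # full pre-fix codebase finding

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


record = Annotation(
    cve_id="CVE-0000-00000",            # placeholder, not a real CVE
    cwe="CWE-78",
    cvss=7.5,
    chain=[
        ChainCommit("aaaaaaa", Role.SINK, "helper wraps the shell call"),
        ChainCommit("bbbbbbb", Role.SOURCE, "new endpoint calls the helper"),
    ],
)
print(record.to_json())
```

Keeping the per-commit and cumulative flags on the same record makes it straightforward to derive both detection rates from one file.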

Baseline SAST Evaluation: Numerical Results

Two representative SAST tools (Semgrep v1.154.0 and Bandit v1.9.4) are systematically evaluated in two modes:

Per-commit (CCDR): Each contributing commit analyzed independently for relevant findings.
Cumulative (CDR): Full pre-fix codebase scanned.
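The difference between the two modes can be sketched as a difference in scan targets. The helper names below are hypothetical, and the choices to scan whole files touched by a commit and to use Semgrep's "auto" configuration are assumptions; the summary does not state the exact invocation used in the paper's evaluation scripts.

```python
import subprocess
from typing import List, Sequence


def changed_python_files(repo: str, sha: str) -> List[str]:
    """Python files added or modified by one commit (requires git on PATH).

    Paths are relative to the repository root, so the scan commands
    below should be run with the repository as working directory.
    """
    out = subprocess.run(
        ["git", "-C", repo, "diff-tree", "--no-commit-id", "--name-only",
         "--diff-filter=AM", "-r", sha],
        capture_output=True, text=True, check=True,
    ).stdout
    return [path for path in out.splitlines() if path.endswith(".py")]


def bandit_cmd(targets: Sequence[str], recursive: bool = False) -> List[str]:
    """Bandit command line with JSON output for the given targets."""
    cmd = ["bandit", "-f", "json"]
    if recursive:
        cmd.append("-r")
    return cmd + list(targets)


def semgrep_cmd(targets: Sequence[str]) -> List[str]:
    """Semgrep command line; the 'auto' ruleset is an assumption."""
    return ["semgrep", "scan", "--config", "auto", "--json", *targets]


# Per-commit mode: check out the contributing commit, scan only what it touched.
per_commit_targets = ["app/helpers.py"]       # e.g. from changed_python_files()
print(bandit_cmd(per_commit_targets))

# Cumulative mode: check out the last pre-fix commit, scan the whole tree.
print(bandit_cmd(["."], recursive=True))
print(semgrep_cmd(["."]))
```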

Detection Rates

CCDR (per-commit): 2 out of 15 CVEs detected (13%).
- Both detections are qualitatively weak: one occurs on a commit framed as a security fix (where developers suppress the alert); the other flags only a minor hardcoded-key issue and misses the primary vulnerability (200+ unprotected API endpoints).
CDR (cumulative): 4 out of 15 CVEs detected (27%).
Detection gap: Even with the full codebase, 73% of chains remain invisible to snapshot-based tools.
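The percentages above follow directly from the reported counts. As a quick arithmetic check (the variable names are illustrative only):

```python
from fractions import Fraction

TOTAL_CVES = 15
DETECTED_PER_COMMIT = 2   # CCDR numerator reported above
DETECTED_CUMULATIVE = 4   # CDR numerator reported above

ccdr = Fraction(DETECTED_PER_COMMIT, TOTAL_CVES)
cdr = Fraction(DETECTED_CUMULATIVE, TOTAL_CVES)

print(f"CCDR: {float(ccdr):.0%}, missed per-commit: {float(1 - ccdr):.0%}")
print(f"CDR:  {float(cdr):.0%}, missed cumulative: {float(1 - cdr):.0%}")
```

With only 15 cases, each CVE moves a rate by roughly 6.7 percentage points, so small differences between tools should be read cautiously.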

The dataset spans six vulnerability classes over five primary CWEs (CWE-94, CWE-22, CWE-78, CWE-306, CWE-943). Custom wrappers, absent guard patterns, and temporal source-sink separation are systematically missed by the evaluated SAST tools because they analyze neither the commit history nor the program as a whole.

Failure Mode Analysis

Three primary root causes are identified:

Custom wrapper opacity: Dangerous operations are abstracted in internal helper functions (e.g., exec_cmd(), session.run()), eluding name-based pattern-matching.
Absent guard invisibility: Endpoints lacking authentication/validation decorators go undetected; pattern-based rules match code that is present and cannot fire on a guard that is missing.
Temporal source-sink separation: Taint sources and sinks are often introduced weeks to years apart in different commits, outside the scope of intra-snapshot analysis.
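The absent-guard case can be illustrated with a small check built on Python's ast module. Detecting a missing guard requires a rule that states what should be present, which is a project-specific policy rather than a generic dangerous-call pattern. The decorator names route and login_required are assumptions chosen for illustration, and this sketch is not the approach evaluated in the paper.

```python
import ast
from typing import List

ROUTE_DECORATORS = {"route"}            # hypothetical: marks an endpoint
GUARD_DECORATORS = {"login_required"}   # hypothetical: marks an auth guard


def decorator_name(node: ast.expr) -> str:
    """Return the trailing name of @name, @obj.name, or either with a call."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return ""


def unguarded_routes(source: str) -> List[str]:
    """Names of functions that are routes but carry no guard decorator."""
    hits = []
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names = {decorator_name(d) for d in fn.decorator_list}
            if names & ROUTE_DECORATORS and not names & GUARD_DECORATORS:
                hits.append(fn.name)
    return hits


SNAPSHOT = '''
@app.route("/health")
@login_required
def health():
    return "ok"

@app.route("/admin/export")
def export():
    return dump_all()
'''

print(unguarded_routes(SNAPSHOT))  # ['export']
```

A check like this only works once the guard policy is known, which is one reason generic rule packs tend not to include it.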

These failure modes cannot be fixed by adding more rules alone; they require fundamentally new approaches such as cross-commit stateful analysis and dataflow tracking.
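A minimal sketch of what "stateful" means here: roles observed in earlier commits are remembered, and an alert is raised at the commit that completes a chain, even though that commit alone looks harmless. The event tuples, the flow identifiers, and the guarded_flows parameter are hypothetical simplifications; a real system would have to derive them from the code, which is the hard part this sketch leaves out.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Event = Tuple[str, str, str]  # (commit id, flow id, role): all hypothetical


def first_exploitable_commit(
    events: Iterable[Event],
    guarded_flows: FrozenSet[str] = frozenset(),
) -> Dict[str, str]:
    """Map each flow to the commit at which it first becomes exploitable.

    A flow is treated as exploitable once both a SOURCE and a SINK have
    been seen and it is unguarded: either it never had a guard, or a
    GUARD_REMOVAL commit has been seen for it.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    alerts: Dict[str, str] = {}
    for commit, flow, role in events:       # events in commit order
        seen[flow].add(role)
        roles = seen[flow]
        unguarded = flow not in guarded_flows or "GUARD_REMOVAL" in roles
        if {"SOURCE", "SINK"} <= roles and unguarded and flow not in alerts:
            alerts[flow] = commit
    return alerts


HISTORY = [
    ("c1", "upload_path", "SINK"),
    ("c2", "upload_path", "SOURCE"),
    ("c3", "report_cmd", "SINK"),
    ("c4", "report_cmd", "SOURCE"),
    ("c5", "report_cmd", "GUARD_REMOVAL"),
]

print(first_exploitable_commit(HISTORY, guarded_flows=frozenset({"report_cmd"})))
# {'upload_path': 'c2', 'report_cmd': 'c5'}
```

No single event triggers an alert by itself; the alert depends on state accumulated from earlier commits, which is exactly what a per-commit scan discards.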

Practical and Theoretical Implications

The finding that 87% of the benchmark's multi-commit Python vulnerabilities escape per-commit SAST calls into question the sufficiency of current CI/CD guardrails. The dataset exposes an adversarial blind spot: attackers (or developers acting inadvertently) can introduce vulnerabilities incrementally, sidestepping snapshot-centric detectors.

Practically, CrossCommitVuln-Bench delivers reproducible baseline metrics, a structured annotation schema, and evaluation scripts, catalyzing research in:

Commit sequence anomaly modeling: For ML/NLP-based detection systems.
Security stateful CI/CD: Integrating cross-commit analysis into pipelines.
Tool improvement and benchmarking: Robust evaluation of new vulnerability detection methods.

As a proof-of-concept, the graph-based POSTURA system detected 3/5 spike CVEs from the dataset (60%) by maintaining a persistent Neo4j threat graph across commit history, outperforming snapshot SAST baselines.

Theoretically, this benchmark motivates formalization of cross-commit vulnerability patterns and development of historical state tracking algorithms in both static and dynamic analysis, potentially leveraging graph neural networks, temporal logic, and persistent taint analysis.

Limitations and Future Directions

Single annotator: Initial dataset annotations produced by the author; inter-annotator reliability yet to be empirically quantified.
Tool coverage: Evaluation restricted to Semgrep and Bandit; deeper inter-procedural tools (e.g., CodeQL) may close some gaps.
Language scope: Benchmarked on Python. Expansion to JavaScript/TypeScript and Java is anticipated.
CVE scope: Only publicly available advisories; older vulnerabilities with fragmented histories are excluded.

Future research directions include extending the benchmark across ecosystems, formalizing annotation schemas for cross-commit chains, and developing scalable, stateful, SAST-informed CI/CD systems.

Conclusion

CrossCommitVuln-Bench (2604.21917) provides curated empirical evidence that snapshot-based SAST tools often miss vulnerabilities introduced across multiple individually benign commits. The dataset offers a foundation for evaluation, model development, and tool improvement targeting cross-commit security analysis. Its release is intended to support progress in persistent security state tracking, historical dataflow modeling, and vulnerability detection approaches in both academia and industry.
