- The paper introduces a two-stage LLM-based static verification workflow that systematically maps natural language requirements to code implementations.
- The methodology, featuring separate rule mining and code auditing agents, effectively identifies ambiguities and business-logic errors without relying on dynamic testing.
- Empirical results from an industrial case study demonstrate improved detection of requirement issues and reduced dependence on runtime tests in high-assurance domains.
LLM-Based Static Verification of Code Against Natural-Language Requirements
The paper "LLM-Based Static Verification of Code Against Natural-Language Requirements: An Industrial Experience Report" (2605.17926) addresses a core gap in the application of LLMs for software engineering. While LLMs are extensively utilized for code generation, requirements drafting, and test case construction, their systematic application to the verification of whether implemented code satisfies natural-language (NL) requirements remains underexplored. This is a critical issue in domains—particularly industrial cybersecurity—where program correctness hinges on business logic rather than syntactic or low-level semantic features.
Traditional static analysis tools like Coverity and SonarQube can identify memory-safety issues and certain vulnerability patterns but are fundamentally limited: they do not verify alignment between implemented code and intended business logic specified in NL. Manual testing and dynamic analysis partially address this mapping but are bounded by test coverage, oracle availability, runtime dependencies, and cost. Many business-logic flaws, especially those dependent on rare triggers or subtle requirement misinterpretations, escape this net.
Two-Stage LLM-Based Verification Workflow
The core contribution is an explicit two-stage workflow utilizing LLMs, instantiated as AI agents ("ruleMiner" and "codeAuditor"), to statically verify code against NL requirements. This design is intentionally modular, separating requirement analysis from code auditing.
- Stage 1: Requirement Rule Mining via LLMs
The agent ruleMiner parses requirements text, extracting only logic that is both normative and falsifiable, ignoring explanatory or illustrative content. It systematically detects and records ambiguity, contradiction, and non-verifiable statements, populating a dedicated requirements_specs_issues structure. This step enforces that only enforceable, clean rules influence subsequent verification, avoiding the propagation of requirement defects downstream.
To enhance output completeness and consistency—despite empirically observed nondeterminism in LLM outputs even at temperature zero—a pooling and merging strategy is employed. Multiple runs are consolidated based on the metamorphic relation MR1, mitigating missed rules due to stochastic LLM behavior.
- Stage 2: Code Auditing Against Cleaned Rules
The codeAuditor agent evaluates each sanitized rule against source code, seeking concrete evidence for compliance or violation. The process is semantics-aware: it leverages semantic triggers, operational constraints, and state conditions, eschewing superficial identifier or keyword matching. Evidence is sought across code, configs, value checks, and inter-file relationships. This approach enables discovery of semantic misalignments even in the absence of syntactic issues or conventional vulnerabilities.
The explicit separation of requirement mining from implementation auditing directly addresses several unsolved LLM issues: hallucination, output variability, context loss, and limited explainability.
Industrial Study and Empirical Results
A detailed case study focuses on an intelligent-vehicle WiFi security subsystem. Noteworthy findings include:
- Extraction of 75 requirement issues (ambiguity/contradiction/non-verifiability) from 222 analyzed requirements, underlining the pervasive quality problems in industrial NL requirements texts.
- In WiFi security policies, ruleMiner successfully flagged a requirement with an internal contradiction that had gone undetected for over a year.
- For requirements with undefined triggers (e.g., "not in use"), ruleMiner assigned low-confidence status, while codeAuditor discerned mismatches in trigger definition at the implementation level, surfacing requirement defects rather than simply code bugs.
- The workflow statically verified over 50% of requirements that had previously required dynamic testing, a significant workload shift.
- Code analysis surfaced 8 Fail/Unknown findings, with zero false positives among manual inspections and one high-priority business-logic error in mature code, subsequently fixed.
Importantly, the few ambiguous findings often stemmed from insufficient requirement-implementation traceability rather than technical limits of the methodology.
Methodological Insights and Implications
Several substantive lessons and implications emerge:
- Requirement Quality Modeling: The explicit recording of requirement ambiguities and contradictions (rather than naive normalization) is essential for trustworthy automation. This structural hygiene prevents the propagation of false security and aligns with best practices in requirements engineering.
- Semantics-Driven, Requirement-Aware Auditing: Superficial code scans miss business-logic flaws. Reasoning about triggers, operational states, and behavior enables the identification of complex mismatches inherent to cybersecurity and high-assurance domains. This further integrates LLM-based approaches with the semantics-aware static analysis literature.
- Augmenting the Oracle Problem: LLM-based static analysis does not require explicit oracles or test outputs; instead, it operates at the level of semantics correspondence. This partially addresses the well-known oracle problem in software testing.
- End-to-End Traceability: The efficacy of the method is fundamentally dependent on high-quality requirement and code traceability. Defects in alignment, change-history, or inter-departmental mappings limit the resolvability of findings, which has substantial implications for requirements management and industrial process integration.
Future Directions
Potential extensions include the deepening of metamorphic testing strategies at both rule mining and code auditing stages, such as paraphrase injection, addition/deletion of requirement clauses, and automated robustness checks. Improvements in requirement-to-code mapping, multi-agent orchestration, and integration with industrial ALM tools will be essential for further adoption. A critical direction is scaling completeness verification and investigating recall on larger, more varied requirements sets.
Conclusion
This paper makes a significant technical contribution by operationalizing a two-stage, LLM-based, static verification workflow that bridges the gap between NL requirements and code implementations in an industrial context. By structurally separating enforceable rules from ambiguous requirement text, utilizing LLM-powered semantics-driven auditing, and applying metamorphic relations to control output variability, the approach surfaces previously undetectable business-logic errors systematically and without code execution.
The methodology demonstrably supports shift-left assurance activity, reduces dependence on runtime testing for certain defect classes, and enhances coverage of business logic in domains where test oracles and exhaustive testing are infeasible. The results also illustrate the critical—not auxiliary—role of high-quality traceability and requirement hygiene in enabling AI-assisted verification pipelines.
The work paves the way for future research in robust, explainable, and scalable LLM-based requirement analysis and semantics-driven code audit, with immediate practical relevance for high-assurance software engineering, especially in safety- and security-critical domains.