Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery

Published 27 Apr 2026 in cs.SE and cs.AI | (2604.23940v1)

Abstract: Decompilation -- recovering source code from compiled binaries -- is essential for security analysis, malware reverse engineering, and legacy software maintenance. However, existing decompilers produce code that often fails to compile or execute correctly, limiting their practical utility. We present a multi-agent framework that transforms decompiled code into re-executable source through Multi-level Constraint-Guided Decompilation (MCGD). Our approach employs a hierarchical validation pipeline with three constraint levels: (1) syntactic correctness via parsing, (2) compilability via GCC, and (3) behavioral equivalence via LLM-generated test cases. When validation fails, specialized LLM agents iteratively refine the code using structured error feedback. We evaluate our framework on 1,641 real-world binaries from ExeBench across three decompilers (RetDec, Ghidra, and Angr). Our framework achieves 84-97% re-executability, improving baseline decompiler output by 28-89 percentage points. In comparison with state-of-the-art LLM-based decompilation methods using the same GPT-4o backbone, our approach (84.1%) outperforms LLM4Decompile (80.3%), SK2Decompile (73.9%), and SALT4Decompile (61.8%). Our ablation study reveals that execution-based validation is critical: compile-only approaches achieve 0% behavioral correctness despite 91-99% compilation rates. The system converges efficiently, with 90%+ binaries reaching correctness within 2 iterations at an average cost of $0.03-0.05 per binary. Our results demonstrate that constraint-guided agentic refinement can bridge the gap between raw decompiler output and practically useful source code.


Authors (4)

Summary

The paper demonstrates that constraint-guided iterative refinement with specialized LLM agents nearly doubles the re-executability of decompiled binaries.
It employs a three-level validation approach—syntax, compilation, and execution—to systematically repair errors and ensure behaviorally correct C code.
The framework improves on raw decompiler output and preserves vulnerability patterns, leaving the recovered code better prepared for patching and security analysis.


Problem Statement and Motivation

Decompiling binary executables into high-level source code that recompiles and behaves correctly is central to reverse engineering, program analysis, and maintenance. Existing decompilers, including rule-based (Ghidra), lifting-based (Angr), and LLVM-based (RetDec) tools, are proficient at recovering readable code, but their output is frequently riddled with syntax, compilation, and semantic errors that prevent re-compilation and re-execution. This limitation impedes downstream tasks such as patching, binary rewriting, and vulnerability analysis.

LLMs show promise in code generation and translation, but typical approaches operate in a single pass, with no iterative feedback mechanism and no precise validation signal. The inherent lossiness of compilation, coupled with the diversity of binary patterns, motivates a robust, multi-level, agent-driven refinement strategy.

Agent4Decompile System: Architecture and Workflow

Agent4Decompile is a multi-agent framework designed to transform initial decompiler output into re-executable C code using constraint-guided iterative refinement. The system employs a hierarchy of three constraints:

L1 (Syntax): Parses the C code to check syntactic correctness.
L2 (Compilation): Compiles the code with GCC to validate type correctness and symbol resolution.
L3 (Execution): Checks behavioral equivalence by running LLM-generated test cases and comparing the output against that of the original binary.
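As a rough illustration of the L2 check, the sketch below shells out to GCC and turns its stderr into structured diagnostics that a repair agent could consume. The flags, the `Diagnostic` type, and the function names are assumptions made for this example; the source states only that compilability is checked with GCC. Note that `-fsyntax-only` stops before linking, so this variant would not catch unresolved external symbols.

```python
import re
import subprocess
from dataclasses import dataclass

# GCC's usual diagnostic shape: "<file>:<line>:<col>: error: <message>".
# Notes and warnings are deliberately ignored in this sketch.
_DIAG = re.compile(
    r"^[^:\n]+:(?P<line>\d+):(?P<col>\d+): (?:fatal )?error: (?P<msg>.*)$"
)


@dataclass
class Diagnostic:  # hypothetical container, not from the paper
    line: int
    col: int
    message: str


def parse_gcc_errors(stderr: str) -> list[Diagnostic]:
    """Extract error diagnostics from GCC's stderr text."""
    found = []
    for raw in stderr.splitlines():
        m = _DIAG.match(raw)
        if m:
            found.append(Diagnostic(int(m["line"]), int(m["col"]), m["msg"]))
    return found


def check_compiles(c_source: str, gcc: str = "gcc") -> tuple[bool, list[Diagnostic]]:
    """Feed C source to GCC on stdin; return (ok, errors).

    Raises FileNotFoundError if the gcc executable is not on PATH.
    """
    proc = subprocess.run(
        [gcc, "-x", "c", "-fsyntax-only", "-"],
        input=c_source,
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, parse_gcc_errors(proc.stderr)
```

Keeping the parser separate from the subprocess call makes the diagnostic extraction easy to test without a compiler installed.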

Failures at any level trigger specialized LLM agents, each receiving targeted error feedback and tasked with repairing the code. The iterative process continues until all constraints are satisfied, yielding behaviorally correct, compilable C code.
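The control flow described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: validators and repair agents are passed in as plain callables, a validator returns `None` on success or a feedback string on failure, and the names `refine`, `Result`, and `max_iters` are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed interfaces (hypothetical):
#   validator(code) -> None if the constraint holds, else a feedback string
#   agent(code, feedback) -> a complete repaired candidate
Validator = Callable[[str], Optional[str]]
Agent = Callable[[str, str], str]


@dataclass
class Result:
    code: str
    levels_passed: int  # number of constraint levels satisfied, in order
    repairs: int        # number of repair calls made


def first_failure(code: str, validators: list[Validator]):
    """Return (level_index, feedback) for the lowest failing level, or None."""
    for level, validate in enumerate(validators):
        feedback = validate(code)
        if feedback is not None:
            return level, feedback
    return None


def refine(code: str, validators: list[Validator], agents: list[Agent],
           max_iters: int = 5) -> Result:
    """Re-validate from the lowest level after every repair until all pass."""
    repairs = 0
    while True:
        failure = first_failure(code, validators)
        if failure is None:
            return Result(code, len(validators), repairs)
        level, feedback = failure
        if repairs == max_iters:
            return Result(code, level, repairs)  # budget exhausted
        # Dispatch to the agent that specializes in the failing level.
        code = agents[level](code, feedback)
        repairs += 1
```

Because every iteration re-checks from the lowest level, a repair that breaks an earlier constraint is caught on the next pass rather than silently carried forward.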

Figure 1: Agent4Decompile overview: code passes through syntax, compilation, and execution constraints, with LLM-guided repair at each stage.

Constraint Hierarchy and Repair Agents

The formalization defines operators for parsing, compiling, and executing code over a restricted domain of candidate programs. Each constraint level defines a feasible subset of all possible C programs, nested inside the previous one, which progressively narrows both the search space and the class of errors left to fix. A specialized LLM agent for each constraint receives diagnostic feedback (parser errors, compiler errors, or failing test cases), uses a stateless prompt, and returns fully corrected code without explanations or intermediate steps.
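One way to write the nesting down is shown here. The notation is assumed for illustration and is not taken from the paper: $\mathcal{P}$ is the set of candidate C programs, $b$ the original binary, $T$ the set of test inputs, and $\mathrm{parse}$, $\mathrm{compile}$, $\mathrm{exec}$ the three operators.

```latex
\[
\begin{aligned}
\mathcal{F}_1 &= \{\, p \in \mathcal{P} : \mathrm{parse}(p) \text{ succeeds} \,\},\\
\mathcal{F}_2 &= \{\, p \in \mathcal{F}_1 : \mathrm{compile}(p) \text{ succeeds} \,\},\\
\mathcal{F}_3 &= \{\, p \in \mathcal{F}_2 : \forall t \in T,\ 
                 \mathrm{exec}(\mathrm{compile}(p), t) = \mathrm{exec}(b, t) \,\},
\end{aligned}
\qquad
\mathcal{P} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \mathcal{F}_3 .
\]
```

Under this reading, a repair at level $k$ only needs to search within $\mathcal{F}_{k-1}$, and membership in $\mathcal{F}_3$ is relative to the chosen test set $T$ rather than a proof of full equivalence.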

The syntax agent uses parser errors for localized repairs; the compilation agent resolves type and symbol issues using compiler diagnostics; the execution agent leverages structured I/O discrepancies and a diagnostic checklist to remedy behavioral faults.
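To make "structured I/O discrepancies" concrete, the sketch below compares a reference against a candidate on a list of inputs and formats the mismatches as feedback text. It is only an illustration: Python callables stand in for running the original binary and the recompiled candidate, and `Discrepancy`, `diff_behaviour`, and `execution_feedback` are hypothetical names. The paper's actual feedback format and diagnostic checklist are not reproduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Discrepancy:  # hypothetical record of one failing test case
    inputs: tuple
    expected: Any
    actual: Any


def diff_behaviour(reference: Callable[..., Any],
                   candidate: Callable[..., Any],
                   test_inputs: Iterable[tuple]) -> list[Discrepancy]:
    """Run both implementations on each input and collect mismatches."""
    mismatches = []
    for args in test_inputs:
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception as exc:  # a crash counts as a behavioral difference
            actual = f"raised {type(exc).__name__}"
        if actual != expected:
            mismatches.append(Discrepancy(tuple(args), expected, actual))
    return mismatches


def execution_feedback(mismatches: list[Discrepancy]) -> str:
    """Render mismatches as plain text suitable for a stateless repair prompt."""
    lines = [f"{len(mismatches)} test case(s) disagree with the original binary:"]
    for d in mismatches:
        lines.append(
            f"- input={d.inputs!r} expected={d.expected!r} actual={d.actual!r}"
        )
    return "\n".join(lines)
```

Since the prompts are stateless, everything the execution agent needs has to be in this text plus the current code; nothing is remembered from earlier iterations.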

Each iteration of the refinement loop tries to move the code to a higher constraint level. Regression between levels is possible, since a repair at one level can reintroduce a failure at an earlier one, but empirical results show rapid convergence: most binaries are resolved within two iterations, and later iterations yield minimal improvement.

Figure 2: Iterative refinement example: execution feedback enables progressive correction, culminating in behaviorally accurate code.

Figure 3: Convergence analysis: re-executability gains saturate after 2–4 iterations across decompilers.

Empirical Results and Ablations

Agent4Decompile was evaluated on the ExeBench benchmark (1,641 binaries), spanning varying optimization levels and function categories, using output from Ghidra, Angr, and RetDec. The framework achieves 40–46% re-executability, an 18–28 percentage point improvement over baseline decompiler output. Notably, compilation-only approaches reach near-perfect compilation rates (99–100%) but only 32–42% re-executability, exposing a 57–68 percentage point semantic gap between code that compiles and code that behaves correctly. Execution-based validation proves essential: it adds 8–13 points of improvement and catches semantic errors that compilation alone leaves undetected.
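The 57–68 point gap is consistent with simple interval arithmetic on the two ranges quoted above. The helper below is an assumption about how the figures relate, not a computation taken from the paper, which may have derived the gap per decompiler.

```python
def gap_range(compile_rate: tuple[int, int], reexec_rate: tuple[int, int]) -> tuple[int, int]:
    """Bounds on (compile rate - re-executability), in percentage points.

    Each argument is a (low, high) range. The smallest gap pairs the lowest
    compile rate with the highest re-executability, and vice versa.
    """
    c_lo, c_hi = compile_rate
    r_lo, r_hi = reexec_rate
    return (c_lo - r_hi, c_hi - r_lo)


# Compilation 99-100%, re-executability 32-42% (figures from the text above).
print(gap_range((99, 100), (32, 42)))  # (57, 68)
```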

The system also outperforms prior refinement baselines: single-pass LLM refinement (35.2%), fine-tuned LLM4Decompile (12.1%), and two-phase readability-centric pipelines (21.0%). The choice of behavioral equivalence as an optimization objective is decisive—readability-focused transformations yield negligible benefits for functional correctness.

Failure Mode Analysis and Security Implications

Residual failures break down into four root causes: function inlining (39%), logic errors (31%), signature mismatches (26%), and syntax artifacts (4%). These stem from irreversible information loss, optimization artifacts, or insufficient semantic recovery.

Figure 4: Failure root cause breakdown: function inlining and logic errors are chief sources of residual failures.

Agent4Decompile preserves vulnerability patterns, even in stripped binaries, and enables practical security tasks: 100% pattern preservation (43% immediate exploitability) and a 6.6× increase in fuzz-readiness. The framework bridges the gap between binary analysis and source-level tools by producing analyzable, modifiable outputs compatible with modern security pipelines.

Practical Implications, Theoretical Insights, and Future Directions

Agent4Decompile demonstrates that re-executable decompilation is tractable via constraint-guided, multi-agent LLM workflows. The layered validation architecture exposes critical gaps missed by compilation-centric approaches, and execution feedback is indispensable for semantic recovery. The system is robust to compiler transformations and architectural variation, converges efficiently, and scales to large binary corpora.

There are limitations. The current focus is on single-function binaries; extension to whole-program analysis will require inter-procedural modeling and richer semantic or formal specifications. Some failures are irrecoverable due to compiler-level optimizations. The LLM repair agents occasionally overfit to test cases, suggesting that richer verification signals and hybrid static/dynamic analysis may offer further improvements.

The framework generates valuable training data for LLM fine-tuning, suggesting a synergistic cycle between iterative refinement and model improvement. Open-source model integration and broader architectural compatibility remain priorities. The research establishes a new practical upper bound for automated, agent-driven decompilation.

Conclusion

Agent4Decompile offers a significant advance in constraint-guided decompilation: multi-agent, feedback-driven LLMs nearly double the re-executability of code recovered from binaries. Compilation alone is an insufficient metric; execution validation is imperative for functional correctness. The approach demonstrates strong scalability, rapid convergence, and utility for downstream analysis, patching, and vulnerability discovery. Execution-guided iterative refinement should become an integral component of future decompilation pipelines, with implications for binary analysis, legacy maintenance, and secure software engineering.

