Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

Published 7 Apr 2026 in cs.CR, cs.AI, and cs.SE | (2604.05719v1)

Abstract: The rapid advancement of LLMs has created new opportunities for Automated Penetration Testing (AutoPT), spawning numerous frameworks aimed at achieving end-to-end autonomous attacks. However, despite the proliferation of related studies, existing research generally lacks systematic architectural analysis and large-scale empirical comparisons under a unified benchmark. Therefore, this paper presents the first Systematization of Knowledge (SoK) focusing on the architectural design and comprehensive empirical evaluation of current LLM-based AutoPT frameworks. At systematization level, we comprehensively review existing framework designs across six dimensions: agent architecture, agent plan, agent memory, agent execution, external knowledge, and benchmarks. At empirical level, we conduct large-scale experiments on 13 representative open-source AutoPT frameworks and 2 baseline frameworks utilizing a unified benchmark. The experiments consumed over 10 billion tokens in total and generated more than 1,500 execution logs, which were manually reviewed and analyzed over four months by a panel of more than 15 researchers with expertise in cybersecurity. By investigating the latest progress in this rapidly developing field, we provide researchers with a structured taxonomy to understand existing LLM-based AutoPT frameworks and a large-scale empirical benchmark, along with promising directions for future research.

Authors (20)

Summary

The paper presents a comprehensive systematization and empirical evaluation of LLM-based AutoPT frameworks across six architectural dimensions.
It highlights significant findings, including the superior performance of single-agent designs and the negative impact of integrating uncurated external knowledge.
Benchmarking 15 frameworks (13 open-source and 2 baselines) on 22 challenges underscores the need for adaptive memory management and targeted tool integration to enhance penetration testing efficacy.

Comprehensive Systematization and Empirical Benchmarking of LLM-Based Automated Penetration Testing

Introduction and Motivation

This paper presents a systematic analysis and large-scale empirical benchmark of frameworks that use LLMs for Automated Penetration Testing (AutoPT) (2604.05719). The principal motivation is the rapid proliferation of LLM-driven AutoPT solutions, spurred by improvements in LLM reasoning, planning, and tool-use capabilities, together with the acute need for scalable, cost-effective, and continuous PT amid global security talent shortages and shifting enterprise risk profiles. Despite extensive research and industrial interest, architectural development has been unsystematic, and the field lacks unified, large-scale benchmarks that enable rigorous, reproducible comparison across frameworks.

Multi-Dimensional Systematization of AutoPT Frameworks

The authors propose a comprehensive systematization framework for LLM-based AutoPT, structured along six core architectural and evaluative dimensions: Agent Architecture, Agent Plan, Agent Memory, Agent Execution, External Knowledge, and Benchmarks.

Figure 1: The systematization framework of AutoPT, mapping traditional PT lifecycle stages to six structured design and analysis dimensions.

Agent Architecture

The systematization distinguishes single-agent from multi-agent designs and examines agent role definitions, collaboration patterns, and division of labor in PT workflows. The analysis covers the functional sub-modules (planning, execution, summarization, reconnaissance, retrieval, orchestration, feedback) and the pitfalls of conventional multi-agent paradigms: role ambiguity, memory fragmentation, communication overhead, and increased synchronization complexity. These are contrasted with the simplicity and sometimes surprising efficacy of single-agent ReAct-based architectures in strongly coupled CTF-like scenarios.

Agent Plan

A detailed taxonomy of planning structures is provided, distinguishing between linear (pipeline/FSM), tree (task/attack trees), and graph (dependency or causality structures) models.

Figure 2: Taxonomy differentiates planning strategies by their data structure—linear, tree, and graph—in AutoPT frameworks.

The paper notes that while linear plans are intuitive and easy to implement, they lack the adaptability needed for dynamic backtracking and multipath exploration in real-world tasks; tree and graph strategies, in contrast, enable more sophisticated dynamic task allocation, state-dependent adaptation, and efficient pruning/backtracking.
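
To make the contrast with linear plans concrete, the following is a minimal sketch of a tree-structured plan with pruning and backtracking. It is not taken from any evaluated framework; `PlanNode`, `next_task`, and the task names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlanNode:
    """One task in a hypothetical task/attack tree."""
    name: str
    status: str = "pending"  # one of: pending, done, failed
    children: list["PlanNode"] = field(default_factory=list)


def next_task(node: PlanNode) -> Optional[PlanNode]:
    """Depth-first search for the next pending leaf, skipping failed branches."""
    if node.status == "failed":
        return None  # prune: never descend into a branch marked as failed
    if not node.children:
        return node if node.status == "pending" else None
    for child in node.children:
        found = next_task(child)
        if found is not None:
            return found
    return None


root = PlanNode("compromise web app", children=[
    PlanNode("recon", status="done"),
    PlanNode("exploit", children=[
        PlanNode("try SQL injection"),
        PlanNode("try SSTI"),
    ]),
])

first = next_task(root)    # first pending leaf under "exploit"
first.status = "failed"    # suppose the attempt did not work
second = next_task(root)   # backtracks to the sibling branch
print(first.name, "->", second.name)
```

A linear pipeline would have no sibling to fall back on once a step fails; here, marking a node as failed is enough to redirect the next selection to an alternative path.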

Agent Memory

The memory module is recognized as a linchpin for maintaining cross-timestep dependencies and persistent experiential context, and for overcoming LLM context window limitations. The authors provide a nuanced review of compression strategies (immediate, periodic, dynamic, or hard truncation) and memory organization (in-context, external/vector, structure-bound to plan graphs/trees), emphasizing the high impact of memory management on the successful execution of long-horizon, chained-attack scenarios.
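
The difference between hard truncation and periodic compression can be illustrated with a small sketch. The function names are hypothetical, and `toy_summarize` is only a stand-in for what would normally be an LLM summarization call.

```python
from typing import Callable


def hard_truncate(history: list[str], max_items: int) -> list[str]:
    """Keep only the most recent entries; older findings are simply lost."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    return history[-max_items:]


def periodic_compress(history: list[str], keep_recent: int,
                      summarize: Callable[[list[str]], str]) -> list[str]:
    """Fold everything except the most recent entries into one summary entry."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(history) <= keep_recent:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent


def toy_summarize(entries: list[str]) -> str:
    # Stand-in for an LLM call: keep only entries tagged as findings.
    findings = [e for e in entries if e.startswith("FINDING")]
    return "SUMMARY: " + "; ".join(findings)


history = [
    "FINDING: admin panel at /admin",
    "ran nmap, 3 ports open",
    "FINDING: session cookie is unsigned",
    "tried default creds, failed",
    "fetched /robots.txt",
]

truncated = hard_truncate(history, 2)
compressed = periodic_compress(history, 2, toy_summarize)
print(truncated)
print(compressed)
```

In this toy run, truncation drops the unsigned-cookie finding entirely, while compression carries it forward in the summary entry, which is the kind of cross-timestep information a chained attack would need later.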

Agent Execution and Tooling

Tool use is dissected at both the execution/decision interface—centralized vs. specialized agents—and the operational level, covering general (Python/shell), security-specific, and complex/interactive (GUI, persistent session) tools.

Figure 3: Distributions of tool usage across frameworks and difficulty levels, revealing trends in tool preference and scale of execution.

Notably, the paper details the failure of expanding tool pool size as a route to improved performance and identifies the critical importance of robust execution interfaces to avoid process blocking, context explosion, and command mishandling.
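
As an illustration of what a robust execution interface might guard against, the sketch below wraps command execution with a timeout (against process blocking) and an output cap (against context explosion). The `run_tool` name, its defaults, and the return shape are assumptions for this example, not an interface described in the paper.

```python
import subprocess
import sys


def run_tool(argv: list[str], timeout_s: float = 30.0, max_chars: int = 4000) -> dict:
    """Run a command without blocking forever and cap what is returned to the agent."""
    try:
        proc = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            stdin=subprocess.DEVNULL,  # interactive prompts cannot hang on input
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "output": f"timed out after {timeout_s}s", "truncated": False}
    except FileNotFoundError:
        return {"ok": False, "output": f"command not found: {argv[0]}", "truncated": False}

    output = proc.stdout + proc.stderr
    truncated = len(output) > max_chars
    if truncated:
        # Keep the head and the tail, where banners and errors usually appear.
        half = max(1, max_chars // 2)
        output = output[:half] + "\n...[output truncated]...\n" + output[-half:]
    return {"ok": proc.returncode == 0, "output": output, "truncated": truncated}


result = run_tool([sys.executable, "-c", "print('hello from tool')"])
print(result["ok"], result["output"].strip())
```

Passing an argument list rather than a shell string also avoids one class of command mishandling, since no shell is involved in parsing the command.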

External Knowledge and Retrieval-Augmented Generation

The integration of external KBs (payloads, write-ups, SSKs) is analyzed across knowledge sourcing, indexing (vector, symbolic), retrieval (dense, sparse, tool-based; see Figure 4), and response (reranking, targeted prompt injection).

Figure 4: Depiction of three retrieval paradigms: dense vector search, sparse entity keyword matching, and tool-based, LLM-autonomous method selection.

Empirical findings—discussed below—contradict prevailing assumptions regarding the efficacy of RAG: external KBs often degrade, rather than enhance, task success rates due to scenario mismatches and retrieval noise.
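
One way to picture the retrieval-noise problem is a sparse keyword retriever with a relevance gate, sketched below. The Jaccard scoring, the `min_score` threshold, and the two knowledge-base entries are illustrative assumptions; the paper does not prescribe this mechanism.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def sparse_score(query: str, doc: str) -> float:
    """Jaccard overlap between the query and document token sets."""
    q, d = tokens(query), tokens(doc)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def retrieve(query: str, docs: list[str], min_score: float = 0.2) -> list[str]:
    """Rank documents by score and drop anything below the relevance threshold."""
    scored = sorted(((sparse_score(query, d), d) for d in docs), reverse=True)
    return [d for score, d in scored if score >= min_score]


kb = [
    "SQL injection payloads for MySQL login forms",
    "Kerberoasting write-up for Active Directory",
]

hits = retrieve("MySQL login form SQL injection", kb)
misses = retrieve("JWT algorithm confusion", kb)
print(hits)
print(misses)
```

Returning an empty result for the unrelated query, instead of the least-bad document, keeps a mismatched write-up out of the agent's prompt.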

Benchmarking and Evaluation

The authors present a meticulous classification of benchmark types: CTF-style, single-host E2E, multi-host network, CVE-exploitation, and phase-specific, with explicit concerns about data contamination, reproducibility, and relevance to real attack chains.

Large-Scale Unified Benchmark and Experimental Results

The empirical core of the study is a unified evaluation of 13 open-source and 2 baseline frameworks across 22 curated XBOW challenges (CTF-style, broad vulnerability coverage, canary-protected against pretraining leakage), involving more than 10 billion tokens and 1,500+ execution logs that were reviewed over four months.

Key Empirical Findings

Superior Competitiveness of Single-Agent Architectures: Contrary to common assumptions, several single-agent frameworks (e.g., Tinyctfer, XBow-Comp, CyberStrike), built on robust ReAct or general-purpose AI coding agent backends, outperformed or matched their multi-agent counterparts on Easy and Medium challenges, even though the latter often utilize elaborate collaborative architectures. The efficiency is attributed to compact context, direct feedback loops, and the absence of communication/memory overhead typically found in multi-agent orchestration.
Negative Returns from External Knowledge: Removal of KB modules (ablation) led to substantial improvements in frameworks such as Cruiser and LuaN1ao, with increases of up to 15 points. The prevalence of misleading, irrelevantly retrieved, or insufficiently granular documents often redirected agents toward unproductive attack avenues, demonstrating that KB quality and retrieval precision are more critical than breadth or integration effort.
Lack of Monotonic Gains from Tool Pool Expansion: Tool ablation studies reveal that the expansion of the tool pool (up to 115 tools in some configurations) did not deliver performance improvements. When domain-relevant tools were missing, frameworks simply shifted to fallback mechanisms (typically Python execution), which plateaued in expressivity and coverage—especially in Hard challenges.
Resource Utilization: Single-Agent vs. Multi-Agent: While single-agent frameworks tended to execute fewer LLM calls per task, token consumption did not necessarily decrease, due to accumulating context during deep/planned operations. Well-designed multi-agent concurrency and structured memory (e.g., CTFSOLVER parallel subagents, LuaN1ao's causal graph) offset the typical communication overhead, resulting in comparable or better resource profiles.
Figure 5: Comparison of LLM call counts and token usage across frameworks, segmented by challenge complexity.
Model-Framework Adaptation and LLM-Dependent Tool Selection: Cross-evaluation with five backbone LLMs (including DeepSeek, Opus-4.6, GPT-5.2, Gemini-Pro-3.1) highlights significant non-monotonic shifts in task completion, execution behavior, tool preference, and resource usage—indicating that framework internal strategies and tool invocation patterns must be tuned to the native strengths and priors of the deployed LLM. Notably, models with superior general leaderboard ranking (e.g., GPT-5.2) did not guarantee leading performance in AutoPT tasks.
Figure 6: Variation in tool call frequencies across backbone LLMs and frameworks, highlighting the interaction between model-specific behavioral priors and agent architecture/tooling.
Prevalence of Hallucination and Premature Termination: Widespread flag hallucination, base64/hash misinterpretation, and framework-level misjudgment emerged as systematic errors, due not only to model inference but also to brittle pipeline logic.
Memory Structures as Enablers of Chained Exploitation: Explicit reasoning graphs (e.g., LuaN1ao's causal graphs) and persistent memory mechanisms substantially improved performance on multi-vulnerability and chained-attack challenges, precisely by surfacing cross-timestep findings necessary for multi-stage exploitation.
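
For the flag-hallucination finding above, a framework-level check could refuse any claimed flag that never appeared in real tool output. The sketch below assumes a `FLAG{...}` format and a list of captured tool outputs; both are assumptions for illustration rather than details confirmed by the paper.

```python
import re

# Assumed flag format; adjust to the benchmark actually in use.
FLAG_PATTERN = re.compile(r"FLAG\{[^}]+\}")


def verify_flag(claimed: str, tool_outputs: list[str]) -> bool:
    """Accept a flag only if it is well-formed and appears verbatim in tool output."""
    if not FLAG_PATTERN.fullmatch(claimed):
        return False
    return any(claimed in out for out in tool_outputs)


logs = [
    "HTTP/1.1 200 OK\n\nWelcome admin! FLAG{3f2a9c}",
    "ls: cannot access /root",
]

print(verify_flag("FLAG{3f2a9c}", logs))        # seen in a real response
print(verify_flag("FLAG{example_flag}", logs))  # well-formed but never observed
```

A check of this kind addresses invented flags only; misread base64 or hash values would need separate decoding and validation steps.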

Qualitative Challenge-Specific Insights

In chained exploitation scenarios, only 16.67% of runs successfully closed multi-vuln chains, with the majority stalling in discovery or in intermediate composition stages—highlighting the necessity of high-fidelity, explicit memory and flexible plan adaptation.
For public CVE environments, only 26.67% mapped all the way from version identification to correct payload composition and exploitation; the most consistent success was found in frameworks with dynamically maintained, high-quality PoC knowledge bases (e.g., CTFSOLVER).

AI Coding Agents as Baselines

Frameworks built as thin layers over commercial AI coding agents (minimal prompt, terminal tools) delivered strong scores (72, 69), often outperforming elaborate, research-centric frameworks—demonstrating the immense value of leveraging robust, general-purpose LLMs for tool use, provided that memory management and tool integration do not inadvertently cripple model flexibility or induce context loss.

Theoretical and Practical Implications

System Design Implication: Focus must shift from agent role proliferation and toolset enumeration to adaptive memory management, fine-grained task planning, and explicit, structured feedback channels.
Knowledge Integration: External knowledge must be highly curated, scenario-aligned, and equipped with retrieval/validation logic to avoid misleading agent reasoning—contrary to the generic RAG-setting assumptions imported from broader NLP.
Tool Ecosystem Management: Scaling tool pools requires accompanying advances in context-sensitive tool recommendation, usage justification, and skill abstraction, else tool selection degenerates into inefficient trial-and-error or capability underutilization.
Model-Framework Co-design: Frameworks require explicit evaluation and adaptation to the behavioral priors and intrinsic workflows of the backbone LLM, with empirical evidence demonstrating that cross-model performance consistency cannot be assumed.
Security Considerations: Execution safeguards (least-privilege, sandboxing, audit hooks) are essential, as LLM-driven agents have high privileges and can cause significant harm through both intended and hallucinated actions.
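
The safeguards named under Security Considerations can be sketched as a pre-execution guard that combines a binary allowlist with an audit record. The allowlist contents, the rejected metacharacters, and the `guard` function are hypothetical; a real deployment would pair this with OS-level sandboxing.

```python
import shlex

ALLOWED_BINARIES = {"curl", "nmap", "python3"}  # hypothetical allowlist
SHELL_METACHARS = set(";|&$`<>")                # rejected to block command chaining

audit_log: list[tuple[str, bool, str]] = []     # (command, allowed, reason)


def check_command(command: str) -> tuple[bool, str]:
    """Decide whether an agent-proposed command may run, with a reason."""
    if any(ch in SHELL_METACHARS for ch in command):
        return False, "shell metacharacter present"
    try:
        argv = shlex.split(command)
    except ValueError as exc:
        return False, f"unparseable command: {exc}"
    if not argv:
        return False, "empty command"
    if argv[0] not in ALLOWED_BINARIES:
        return False, f"binary not allowed: {argv[0]}"
    return True, "ok"


def guard(command: str) -> bool:
    """Audit hook: record every decision before anything is executed."""
    allowed, reason = check_command(command)
    audit_log.append((command, allowed, reason))
    return allowed


for cmd in ["nmap -sV 10.0.0.5", "rm -rf /", "curl http://10.0.0.5/; rm -rf /"]:
    print(cmd, "->", guard(cmd))
```

Because every decision is logged before execution, hallucinated or out-of-policy commands remain visible in the audit trail even when they are blocked.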

Prospective Research Directions

Automated Log Auditing: The scale and heterogeneity of experiment logs present severe bottlenecks for scalable evaluation; domain-specific LLM-aided summarization and event extraction pipelines are needed to enable framework comparison and error attribution at scale.
Long-Term Adaptive Benchmarking: As new LLMs, frameworks, and attack surfaces emerge, the paper’s open-source evaluation infrastructure serves as a foundation for continuous, reproducible benchmarking and risk discovery.

Conclusion

This work provides the first rigorous comparative analysis over a broad, up-to-date set of LLM-based AutoPT frameworks, demonstrating that several prevailing beliefs (multi-agent superiority, KB necessity, tool pool effectiveness) are unsupported or directly contradicted by empirical evidence. The results underscore the need for adaptive memory, explicit feedback, highly scenario-aligned knowledge retrieval, and model-framework co-design. By open-sourcing both benchmarks and tooling, the authors lay the groundwork for a continuously evolving, reproducible, and scientifically rigorous AutoPT research ecosystem.
