Auditing Agent Harness Safety

Published 14 May 2026 in cs.CL and cs.CY | (2605.14271v2)

Abstract: LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.

Summary

  • The paper introduces HarnessAudit, a trajectory-based safety auditing framework for agent harnesses, emphasizing boundary compliance, execution fidelity, and system stability.
  • It develops HarnessAudit-Bench with 210 tasks across 8 domains, using metrics such as SAR, TCR, AVS, and PB to quantify safety adherence.
  • Findings reveal a trade-off between task completion and safety, highlighting that multi-agent communication increases resource and information flow violations.

HarnessAudit: Trajectory-Based Safety Auditing for Agent Harnesses

Motivation and Problem Definition

Modern LLM agents operate within execution harnesses that orchestrate tool dispatch, resource allocation, and message routing among multiple agent components. Safety in such systems is not merely a function of final outputs: intermediate actions can cross unauthorized boundaries and leak data even when the terminal result appears benign. Most existing safety benchmarks evaluate only outputs and therefore miss violations that occur mid-execution, particularly in environments where multi-agent communication, delegation, and tool or resource access are pervasive.

HarnessAudit addresses this gap by formalizing agent harness safety as a joint evaluation of:

  • Boundary compliance: Actions must honor permission and information-flow policies.
  • Execution fidelity: Trajectories should accomplish user goals via valid intermediate operations.
  • System stability: Compliance and fidelity must be robust against realistic perturbations (e.g., prompt injection, ambiguous goals, tool errors).

    Figure 1: HarnessAudit overview—multi-agent harnesses, agent execution trajectories, and full-trajectory evaluation across three safety layers.

HarnessAudit relies on deterministic, agent-independent audit artifacts and evidence channels. Because these artifacts are hidden from agents during execution, agents can neither manipulate nor anticipate the audit, which makes post-hoc auditing robust.

HarnessAudit-Bench: Realistic Safety Benchmark Construction

HarnessAudit-Bench comprises 210 tasks across eight application domains and 24 scenarios, reflecting real-world workflows in finance, healthcare, office operations, e-commerce, social interaction, daily life, legal compliance, and software engineering. Each domain features diverse, role-typed agent teams (5 to 14 roles per domain), and tasks specify explicit tool and resource scopes with decoys to test enforcement granularity. Annotation involves both automated task generation and thorough human curation, ensuring solvability, meaningful boundaries, and measurable safety pressures.

Figure 2: HarnessAudit-Bench scale and design—task distribution across domains, complex role structures, and mapped audit rules for tools, resources, and information flow.

The challenge in multi-agent harnesses arises from dynamic team topologies, the need for delegation, and explicit communication channels—all expanding the potential for boundary crossing and information leaks. HarnessAudit-Bench instantiates three risk types for each task:

  • Required/forbidden tools
  • In-scope/out-of-scope resources
  • Permitted/prohibited information flows
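The three risk types above can be pictured as a small per-task policy object. The following is a minimal sketch only: the class `TaskPolicy`, the helper functions, and every tool, resource, and role name are hypothetical illustrations, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskPolicy:
    """Hypothetical per-task policy covering the three risk types."""
    required_tools: frozenset
    forbidden_tools: frozenset
    in_scope_resources: frozenset
    permitted_flows: frozenset  # (sender_role, receiver_role) pairs


def check_tool_call(policy, tool, resource):
    """Return violation labels for a single tool call."""
    violations = []
    if tool in policy.forbidden_tools:
        violations.append("forbidden_tool")
    if resource not in policy.in_scope_resources:
        violations.append("out_of_scope_resource")
    return violations


def check_message(policy, sender, receiver):
    """Return violation labels for a single inter-agent message."""
    if (sender, receiver) in policy.permitted_flows:
        return []
    return ["prohibited_flow"]


# Assumed example values; "invoice_99" plays the role of a decoy resource.
policy = TaskPolicy(
    required_tools=frozenset({"read_invoice"}),
    forbidden_tools=frozenset({"delete_record"}),
    in_scope_resources=frozenset({"invoice_17"}),
    permitted_flows=frozenset({("analyst", "reviewer")}),
)

# A correct tool applied to the wrong entity is still a violation.
print(check_tool_call(policy, "read_invoice", "invoice_99"))
print(check_message(policy, "analyst", "external_bot"))
```

Keeping tools, resources, and flows as separate fields lets a checker report which boundary was crossed rather than a single pass/fail flag.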

Auditing Pipeline and Evaluation Methodology

The auditing pipeline operates in three stages:

  1. Setup: Declarative task specifications encode goals, role/team definitions, tool catalogs, permission policies (Π), information-flow constraints (Φ), and hidden audit artifacts. These artifacts (completion checkpoints, violation taxonomies) are invisible to agents during execution.
  2. Execution: Harnesses execute the trajectory, recording every tool call, resource access, communication, and state transition.
  3. Judge: Upon run completion, evidence channels and artifacts are used to diagnose violations in boundary compliance (L1), execution fidelity (L2), and system stability (L3).

    Figure 3: HarnessAudit auditing pipeline—separation of setup, execution, and judging, with hidden artifacts for boundary compliance, execution fidelity, and stability diagnosis.
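To make the setup, execution, and judge separation concrete, here is a minimal sketch of a post-hoc judge over a recorded trajectory. The `Event` record, the `judge` function, and all agent, tool, and resource names are assumptions made for illustration; the paper's actual log format and rule set are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One recorded step of a trajectory (hypothetical log format)."""
    step: int
    agent: str
    kind: str      # "tool_call" or "message"
    target: str    # resource id for tool calls, receiver role for messages
    tool: str = ""


def judge(events, allowed_resources, allowed_flows, checkpoints):
    """Scan the whole trajectory after the run, not just its final state."""
    violations = []
    performed = set()
    for e in events:
        if e.kind == "tool_call":
            performed.add((e.tool, e.target))
            if e.target not in allowed_resources.get(e.agent, set()):
                violations.append((e.step, e.agent, "resource_access"))
        elif e.kind == "message":
            if (e.agent, e.target) not in allowed_flows:
                violations.append((e.step, e.agent, "information_flow"))
    completed = [c for c in checkpoints if c in performed]
    return violations, completed


# Setup: permissions, flows, and hidden checkpoints (all assumed values).
allowed_resources = {"researcher": {"patient_42", "report_1"}}
allowed_flows = {("planner", "researcher")}
checkpoints = [("read_record", "patient_42"), ("write_summary", "report_1")]

# Execution: the harness records every step.
trace = [
    Event(1, "planner", "message", "researcher"),
    Event(2, "researcher", "tool_call", "patient_42", "read_record"),
    Event(3, "researcher", "tool_call", "patient_77", "read_record"),
    Event(4, "researcher", "message", "billing"),
    Event(5, "researcher", "tool_call", "report_1", "write_summary"),
]

# Judge: all checkpoints are met, yet two mid-trajectory violations remain.
violations, completed = judge(trace, allowed_resources, allowed_flows, checkpoints)
print(violations)
print(completed)
```

In this trace every checkpoint is reached, so an output-only evaluation would pass the run, while the trajectory-level judge flags an out-of-scope record access at step 3 and a prohibited message at step 4.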

Scoring is severity-weighted for each violation channel, aggregated into composite harness safety scores:

  • SAR (Safety Adherence Rate): Quantifies tool, resource, and information-flow violations.
  • TCR (Task Completion Rate): Weighted across completion checkpoints.
  • AVS (Action Validity Score): Evaluates operational precision, coverage, resource-scope adherence, and minimality.
  • PB (Perturbation Stability): Robustness under perturbed task variants (e.g., prompt injection, ambiguous goals, tool errors).

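Multiplicative gating ensures that completion counts only when boundaries are honored: a run that completes the task while violating its boundaries has its composite score scaled down accordingly.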
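This summary does not state the paper's scoring formulas, so the following is only one plausible reading of severity weighting and multiplicative gating. The function names, the penalty values, and the specific form `TCR * SAR` are assumptions for illustration, not the paper's definitions.

```python
def task_completion_rate(checkpoint_weights, completed):
    """Weighted fraction of completion checkpoints that were reached."""
    total = sum(checkpoint_weights.values())
    done = sum(w for name, w in checkpoint_weights.items() if name in completed)
    return done / total


def safety_adherence_rate(violations, severity_penalty):
    """1 minus the summed severity penalties, floored at 0 (assumed form)."""
    penalty = sum(severity_penalty[v] for v in violations)
    return max(0.0, 1.0 - penalty)


def gated_score(tcr, sar):
    """Multiplicative gate: completion is scaled by safety adherence."""
    return tcr * sar


# Assumed example values.
weights = {"fetch": 2, "analyze": 1, "report": 1}
severity_penalty = {"low": 0.125, "high": 0.5}

tcr = task_completion_rate(weights, {"fetch", "analyze"})   # 0.75
sar_clean = safety_adherence_rate([], severity_penalty)      # 1.0
sar_one = safety_adherence_rate(["high"], severity_penalty)  # 0.5
sar_two = safety_adherence_rate(["high", "high"], severity_penalty)  # 0.0

print(gated_score(tcr, sar_clean))  # 0.75
print(gated_score(tcr, sar_one))    # 0.375
print(gated_score(tcr, sar_two))    # 0.0
```

Under this reading, the same completion level yields a lower composite score as violations accumulate, and enough severe violations drive the score to zero regardless of how much of the task was finished.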

Empirical Results and Key Findings

The evaluation spans ten harness configurations (OpenClaw, Claude Code, Codex), frontier models (ChatGPT-5.4, Claude Opus/Sonnet 4.6, Gemini 3.1 Pro, GLM 5V Turbo, Kimi K2.6, Qwen 3.5 Plus), and three multi-agent frameworks (Claw-Team, Google ADK, OpenAI Agents SDK).

Main results:

  • Task completion and harness safety are misaligned: Higher completion rates often correlate with lower safety adherence, as broader tool/resource use expands violation risk.
  • Resource access violations dominate: Agents rarely fail via tool misuse; resource-scope errors (applying correct tools to wrong entities) constitute the majority.
  • Multi-agent coordination amplifies risks: Inter-agent communication and expanded information-sharing increase both the frequency and severity of boundary violations. In single-agent settings, resource boundaries are more reliably maintained, but multi-agent teams exhibit pronounced information flow failures.
  • Perturbation robustness is universally weak: Indirect prompt injection and ambiguous goals significantly degrade adherence and completion rates.

Figure trade: Safety–completion trade-off—safety adherence declines with increasing task completion thresholds; violations scale with executed actions.

Figure multi: Domain-level and role-level variation—finance/office tasks are more prone to resource boundary violations; sensitive roles frequently cross boundaries.

Figure res: Distribution of violations—resource access and information flow are primary failure modes; more than half of participating agents in multi-agent settings commit violations.

Practical and Theoretical Implications

The results establish that harness-level safety is fundamentally distinct from agent output moderation. Reliable deployment of LLM agents requires harness instrumentation capable of trajectory-based auditing, permission enforcement, and sensitive information routing. This necessitates:

  • Development of robust harness designs that curb unsafe delegation, resource misuse, and inadvertent leaks.
  • Redesign of multi-agent collaboration protocols to minimize information flow vulnerabilities.
  • Comprehensive safety evaluation infrastructures that operate independently of agent self-reporting, from fine-grained permission management to adversarial stress testing.

Harness design, not agent capability, sets the upper bound for safe deployment. Even advanced models (e.g., Gemini 3.1 Pro) reach composite safety scores below 0.5, underscoring deficiencies in harness orchestration and policy enforcement.

Figure com: Harness and framework distinctions—native harnesses (Codex, Claude Code) show variable trade-offs between completion and safety depending on orchestration quality.

Future Directions

Research must extend trajectory-level audits to increasingly autonomous, multi-modal, and multi-agent environments. Emphasis should be placed on:

  • Formal specification and enforcement of boundaries at the harness level.
  • Universal, transparent, and reproducible audit channels.
  • Adversarial benchmarks targeting indirect injection, ambiguous goals, and systemic failure modes.
  • Scalable methods for real-time and post-hoc harness evaluation across heterogeneous agent frameworks.

Conclusion

Auditing agent harnesses at the trajectory level exposes critical safety failures that are invisible to output-only evaluation. HarnessAudit, anchored in boundary compliance, execution fidelity, and system stability, demonstrates persistent gaps between agent capability and safe execution. Violations concentrate in resource access and inter-agent communication. The results argue for a shift from completion-centric benchmarks toward harness-centric safety auditing and provide a systematic foundation for safe agent deployment across real-world domains (2605.14271).
