General Agent Evaluation

Published 26 Feb 2026 in cs.AI | (2602.22953v1)

Abstract: The promise of general-purpose agents - systems that perform tasks in unfamiliar environments without domain-specific engineering - remains largely unrealized. Existing agents are predominantly specialized, and while emerging implementations like OpenAI SDK Agent and Claude Code hint at broader capabilities, no systematic evaluation of their general performance has been pursued. Current agentic benchmarks assume domain-specific integration, encoding task information in ways that preclude fair evaluation of general agents. This paper frames general-agent evaluation as a first-class research objective. We propose conceptual principles for such evaluation, a Unified Protocol enabling agent-benchmark integration, and Exgentic - a practical framework for general agent evaluation. We benchmark five prominent agent implementations across six environments as the first Open General Agent Leaderboard. Our experiments show that general agents generalize across diverse environments, achieving performance comparable to domain-specific agents without any environment-specific tuning. We release our evaluation protocol, framework, and leaderboard to establish a foundation for systematic research on general-purpose agents.


Authors (15)

Summary

The paper introduces the Unified Protocol and Exgentic framework to decouple agent-benchmark integration for fair cross-domain assessment.
The evaluation of five agent architectures across 90 setups identifies model quality as the dominant factor, with Claude Opus 4.5 achieving the highest mean success rate (0.66).
The study outlines cost-efficiency tradeoffs, schema guard benefits, and tool shortlisting as key insights for enhancing general-purpose AI agent deployment.

Systematic Evaluation of General-Purpose AI Agents

Motivation and Contribution

The proliferation of domain-specific agents in AI has yielded substantial progress, yet these systems remain limited by manual customization and specialized engineering. Existing benchmarks and evaluation frameworks are designed for domain-specific integration, implicitly encoding task and environment semantics that undermine the fair assessment of general-purpose agents. This paper reframes general-agent evaluation as a principal research objective, introducing conceptual principles, the Unified Protocol—a canonical mediation layer between agents and benchmarks—and the Exgentic framework. The authors implement the first Open General Agent Leaderboard, assessing five agent architectures across six diverse environments, demonstrating general agents' ability to match or exceed domain-specialized baselines without environment-specific tuning.

Unified Protocol: Decoupling Agents and Benchmarks

The Unified Protocol is a minimal abstraction layer for agent-benchmark integration. It standardizes communication through three fields (task, context, and actions), so evaluation does not depend on any one interaction protocol. Agents adapt these fields to their native APIs, which preserves existing communication paradigms (CLI, tool-calling, MCP) and supports common interaction patterns such as messaging and final-answer submission. The adaptation methodology for existing benchmarks and agents makes every assumption explicit and agent-visible, and omits implementation-induced artifacts and redundant signals that would confound performance assessment. Because each agent or benchmark integrates once with the protocol rather than once per counterpart, evaluation scales without costly pairwise agent-benchmark adaptation.
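As a rough illustration of the three-field interface described above, the following Python sketch models a task, a context, and a list of actions, and shows one way an agent-side adaptor might render the actions as tool-calling specs. The class names, field types, and the `to_tool_specs` helper are assumptions made for illustration; they are not taken from the Exgentic codebase.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Action:
    """One operation the benchmark allows (hypothetical shape)."""
    name: str
    description: str
    parameters: dict[str, str] = field(default_factory=dict)  # param name -> JSON type


@dataclass(frozen=True)
class Episode:
    """The three protocol fields: task, context, actions."""
    task: str
    context: str
    actions: tuple[Action, ...]


def to_tool_specs(episode: Episode) -> list[dict[str, Any]]:
    """Render actions as JSON-schema style specs, as a tool-calling agent might expect."""
    return [
        {
            "name": a.name,
            "description": a.description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": t} for p, t in a.parameters.items()},
                "required": sorted(a.parameters),
            },
        }
        for a in episode.actions
    ]


# Illustrative episode; the task text and action names are invented.
episode = Episode(
    task="Rebook the customer on the next available flight.",
    context="Policy: basic economy tickets cannot be changed.",
    actions=(
        Action("search_flights", "Find flights", {"origin": "string", "date": "string"}),
        Action("message", "Send a message to the user", {"text": "string"}),
        Action("final_answer", "Submit the final answer", {"answer": "string"}),
    ),
)
specs = to_tool_specs(episode)
print([s["name"] for s in specs])
```

A CLI-oriented agent could consume the same `Episode` through a different adaptor, which is the point of keeping the shared interface this small.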

Exgentic Framework: Scalable, Reproducible Agent Evaluation

Exgentic leverages the Unified Protocol to orchestrate agent-benchmark sessions. Each session initializes an agent with standardized task, context, and available actions, mediating environment observations and agent actions until task completion or termination constraints are met. The framework supports parallelism, reproducibility, and native operation of third-party agents and benchmarks, utilizing external adaptors for synchronization and protocol translation. Benchmark, agent, and session results, including interaction trajectories and cost reports, are output in a unified format, facilitating component-level ablation and comparative analysis.
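The session loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions: `CountdownEnv`, `ScriptedAgent`, `SessionResult`, and `run_session` are hypothetical names, the action strings are invented, and the default step cap of 100 is only an example of a termination constraint.

```python
from dataclasses import dataclass, field


@dataclass
class CountdownEnv:
    """Toy benchmark: the task is solved once the agent submits 'done'."""
    solved: bool = False

    def step(self, action: str) -> str:
        if action == "final_answer:done":
            self.solved = True
            return "accepted"
        return "keep going"


@dataclass
class ScriptedAgent:
    """Toy agent that replays a fixed list of actions, repeating the last one."""
    script: list[str]
    cursor: int = 0

    def act(self, observation: str) -> str:
        action = self.script[min(self.cursor, len(self.script) - 1)]
        self.cursor += 1
        return action


@dataclass
class SessionResult:
    success: bool
    steps: int
    trajectory: list[tuple[str, str]] = field(default_factory=list)


def run_session(agent, env, first_observation: str, max_steps: int = 100) -> SessionResult:
    """Pass observations to the agent and actions to the environment until done or out of budget."""
    result = SessionResult(success=False, steps=0)
    observation = first_observation
    while result.steps < max_steps and not env.solved:
        action = agent.act(observation)
        observation = env.step(action)
        result.trajectory.append((action, observation))  # kept for later analysis
        result.steps += 1
    result.success = env.solved
    return result


result = run_session(ScriptedAgent(["look", "look", "final_answer:done"]), CountdownEnv(), "start")
print(result.success, result.steps)  # True 3
```

Recording the trajectory alongside the step count is what makes later cost and failure analysis possible without rerunning the session.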

Benchmarking: Configuration Space and Metrics

The experimental setup evaluates five agent architectures (ReAct, ReAct Short, Smolagent, OpenAI Solo, and Claude Code) with three frontier LLMs (GPT 5.2, Claude Opus 4.5, Gemini 3 Pro) on six benchmark environments: BrowseComp+, SWE-Bench Verified, AppWorld, and the three τ²-Bench customer-support domains (airline, retail, telecom). Crossing agents, models, and environments yields 5 × 3 × 6 = 90 distinct setups. Each benchmark is adapted to the Unified Protocol to keep comparisons reproducible and fair. Three metrics are applied consistently: success rate, average inference cost per task, and average number of steps.
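The count of 90 setups follows from crossing five agents, three models, and six environments. The short Python sketch below enumerates that grid. It assumes the six environments are the three customer-support domains plus BrowseComp+, SWE-Bench Verified, and AppWorld, as the environment descriptions elsewhere on this page suggest; the environment labels are shorthand chosen here.

```python
from itertools import product

agents = ["ReAct", "ReAct Short", "Smolagent", "OpenAI Solo", "Claude Code"]
models = ["GPT 5.2", "Claude Opus 4.5", "Gemini 3 Pro"]
# Shorthand labels (assumed): three customer-support domains plus three other benchmarks.
environments = [
    "support-airline", "support-retail", "support-telecom",
    "BrowseComp+", "SWE-Bench Verified", "AppWorld",
]

# One setup = one (agent, model, environment) combination.
setups = list(product(agents, models, environments))
print(len(setups))  # 90
```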

Empirical Findings

Model Dominance and Variance Decomposition

The success rates reveal strong model effects: Claude Opus 4.5 leads (mean 0.66), Gemini 3 Pro follows (mean 0.60), and GPT 5.2 is substantially lower (mean 0.40). Variance decomposition attributes 28.2% of success-rate variance to model choice and only 0.6% to agent architecture. Tool-rich environments expose a GPT 5.2 limitation: its hard limit of 128 tools caused zero scores on AppWorld for configurations that exceeded it. Agent performance is also highly model-dependent: OpenAI Solo excels with Claude Opus 4.5 but underperforms with GPT 5.2.
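To make the idea of a variance decomposition concrete, the sketch below computes the share of total variance explained by a single factor (eta squared: between-group sum of squares over total sum of squares). This page does not state which decomposition method the authors used, so treat the formula as one common choice and not as their procedure. The scores are made-up toy numbers, not results from the paper.

```python
from collections import defaultdict
from statistics import fmean


def variance_share(rows, factor):
    """Fraction of total variance in 'score' explained by grouping on `factor` (eta squared)."""
    scores = [r["score"] for r in rows]
    grand = fmean(scores)
    total_ss = sum((s - grand) ** 2 for s in scores)
    groups = defaultdict(list)
    for r in rows:
        groups[r[factor]].append(r["score"])
    # Each group contributes its size times the squared gap between its mean and the grand mean.
    between_ss = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups.values())
    return between_ss / total_ss


# Made-up toy scores, NOT results from the paper.
rows = [
    {"model": "A", "agent": "x", "score": 0.70},
    {"model": "A", "agent": "y", "score": 0.60},
    {"model": "B", "agent": "x", "score": 0.40},
    {"model": "B", "agent": "y", "score": 0.30},
]
model_share = variance_share(rows, "model")
agent_share = variance_share(rows, "agent")
print(round(model_share, 3), round(agent_share, 3))
```

In this toy grid the model factor explains most of the variance and the agent factor a small remainder, which mirrors the qualitative pattern reported above without reproducing its numbers.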

Agent Architecture and Component-Level Insights

ReAct Short, OpenAI Solo, ReAct, Claude Code, and Smolagent perform comparably (mean success ~0.53-0.57), and inter-agent differences are not statistically significant when averaged across models and benchmarks. Schema guards (invalid-action detection with self-correction) appear in the top-performing architectures, which suggests they are useful. Tool shortlisting is crucial on tool-rich benchmarks, where it markedly improves GPT 5.2 performance and cost-efficiency.
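The two components named above can be sketched as small functions. This is a minimal illustration, not the mechanism used by any of the evaluated agents: the page does not specify how shortlisting is done, so the word-overlap ranking below is a stand-in, and `check_call`, `shortlist`, and the tool names are hypothetical.

```python
def check_call(call: dict, tools: dict[str, dict[str, type]]) -> list[str]:
    """Schema guard: list the problems with a proposed tool call (empty list if valid)."""
    name, args = call.get("name"), call.get("arguments", {})
    if name not in tools:
        return [f"unknown tool: {name!r}"]
    schema = tools[name]
    problems = [f"missing argument: {p}" for p in schema if p not in args]
    problems += [f"unexpected argument: {a}" for a in args if a not in schema]
    problems += [
        f"argument {p} should be {t.__name__}"
        for p, t in schema.items()
        if p in args and not isinstance(args[p], t)
    ]
    return problems


def shortlist(task: str, descriptions: dict[str, str], k: int) -> list[str]:
    """Tool shortlisting: keep the k tools whose descriptions share the most words with the task."""
    task_words = set(task.lower().split())

    def overlap(name: str) -> int:
        return len(task_words & set(descriptions[name].lower().split()))

    # Highest overlap first; ties broken alphabetically for determinism.
    return sorted(descriptions, key=lambda n: (-overlap(n), n))[:k]


tools = {"search_flights": {"origin": str, "date": str}, "refund": {"order_id": int}}
descriptions = {
    "search_flights": "search flights by origin and date",
    "refund": "refund an order by id",
    "play_music": "play a song",
}
bad_call = {"name": "refund", "arguments": {"order_id": "A17"}}
print(check_call(bad_call, tools))  # ['argument order_id should be int']
print(shortlist("refund the order for this customer", descriptions, k=2))
```

In an agent loop, the problem list would typically be returned to the model as an observation so it can retry with a corrected call, and the shortlist would be applied before the tool list is sent to a model with a tool-count limit.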

Cross-Benchmark Consistency and Cost Tradeoffs

Strong positive Spearman correlations (0.75-0.85) in benchmark scores across configurations indicate systematic model effects but not robust agent-level generalization. No single agent dominates all domains, and cross-benchmark rankings vary within models. Efficiency analysis delineates a Pareto frontier—GPT 5.2 is highly cost-efficient ($0.17–0.38 per task) but achieves substantially lower success rates; Claude Opus 4.5 configurations perform best but incur 3–33x higher costs.
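A Pareto frontier over cost and success can be computed directly from per-configuration averages. The sketch below keeps every configuration that no other configuration beats on both axes at once. The configuration names and numbers are invented for illustration and are not the paper's measurements.

```python
def pareto_frontier(configs):
    """Return configs that no other config dominates (cheaper-or-equal and at least as successful, with one strict)."""
    frontier = []
    for c in configs:
        dominated = any(
            o["cost"] <= c["cost"]
            and o["success"] >= c["success"]
            and (o["cost"] < c["cost"] or o["success"] > c["success"])
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c["cost"])


# Invented numbers: cost in dollars per task, success as a rate in [0, 1].
configs = [
    {"name": "cheap", "cost": 0.20, "success": 0.40},
    {"name": "mid", "cost": 1.00, "success": 0.55},
    {"name": "wasteful", "cost": 2.00, "success": 0.50},
    {"name": "best", "cost": 4.00, "success": 0.66},
]
frontier_names = [c["name"] for c in pareto_frontier(configs)]
print(frontier_names)  # ['cheap', 'mid', 'best']
```

Here "wasteful" drops out because "mid" is both cheaper and more successful; the remaining points trade cost against success, which is the shape of tradeoff described above.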

Behavioral Analysis

Failure analysis shows that unsuccessful runs take 20–54% more interactions than successful ones, depending on the agent, so unreliability carries a direct inference-cost penalty. The patterns also suggest that agents allocate their interaction budgets differently, which affects practical deployment outcomes.

Implications and Future Directions

The results indicate that general-purpose agents, evaluated through protocol-agnostic infrastructure, generalize across heterogeneous environments and match or exceed domain-optimized baselines. Because model quality dominates agent-architecture effects, the findings point future research toward model improvements, architectural modularity, and explicit generalization mechanisms.

Practical deployment necessitates attention to cost-efficiency, tool scalability, and component complexity. Tool shortlisting, schema guards, explicit memory, and planning modules are promising directions for enhancing robustness and efficiency. The Open General Agent Leaderboard and Exgentic provide foundational platforms for iterative benchmark expansion and comparative agent architecture studies.

Future developments may involve multimodal interfaces, visual and web-based interaction protocols, and intelligent sampling strategies to mitigate evaluation costs. Safety-critical applications and cross-domain planning represent prominent opportunities.

Conclusion

The systematic evaluation of general-purpose agents through the Unified Protocol and Exgentic framework demonstrates that model quality is the primary determinant of agent performance, with architectural choices exerting secondary effects. General agents match or exceed domain-specialized systems given proper evaluation infrastructure, highlighting the feasibility of scalable, cross-domain agent deployment and research. The practical, theoretical, and methodological implications suggest that advancing agent generality and abstraction is pivotal for future AI systems, with the provided infrastructure catalyzing ongoing research beyond domain silos (2602.22953).

Explain it Like I'm 14

What is this paper about?

This paper is about testing “general-purpose AI agents.” Think of an agent like a smart helper that can plan, click buttons, run tools, and talk to people to get tasks done. A general-purpose agent should work in lots of different places—like helping with customer support, browsing the web, or fixing computer code—without being customized for each one.

The problem is: most tests today are built for one specific type of agent and one specific type of environment. That makes it hard to tell if an agent is truly general. This paper introduces a fair way to test any agent across many kinds of tasks and shows results on a public scoreboard.

What questions were they trying to answer?

Can one agent handle many different types of tasks without special tweaks for each one?
How should we fairly test these agents so they’re not forced into someone else’s rules?
What matters more for success: the agent’s design or the underlying LLM it uses?
How do cost and performance trade off when these agents work in the real world?

How did they study it?

A “universal translator” for testing: the Unified Protocol

Different test environments and agents “speak” different technical languages. The authors built a simple, shared way for them to talk, called the Unified Protocol. You can think of it like a universal translator that turns any agent’s actions into a format any test can understand, and vice versa. It focuses on three things:

Task: what the agent should do (the instructions).
Context: what the agent should know (helpful information or rules).
Actions: what the agent is allowed to do (like “search flights” or “run a command”).

This avoids rewriting every agent for every test, like using a plug adapter instead of rewiring the house.

The Exgentic framework: a test rig

They built a tool called Exgentic that:

Connects any supported agent to any supported test using adapters (like plug converters).
Runs tasks one by one in isolated “sessions,” sends observations to the agent, gets the agent’s actions back, and keeps looping until done.
Records success, how many steps it took, and how much it cost in API dollars.

What did they test?

They evaluated five well-known agent setups using three different LLMs across six different environments, including:

Customer support simulations (airline, retail, telecom)
Deep web research
Everyday app tasks (like a digital assistant)
Real software bug fixing using code repositories

They also built a public scoreboard: the Open General Agent Leaderboard.

What did they find, and why does it matter?

Here are the main results, summarized in plain language:

General agents can generalize: Without custom setups for each environment, these agents performed about as well as specialized agents that were hand-tuned for specific tasks. That’s a big deal—it means a single agent can be widely useful.
The LLM matters most: The “brain” behind the agent (the LLM it uses) has a much bigger impact on success than the agent’s outer scaffolding. Better models generally led to better scores across many tasks.
No one agent wins everywhere: Different agent designs did better or worse depending on the task and the model they used. There wasn’t one “super agent” that beat all others on everything.
Cost versus performance is a real tradeoff: Some model–agent combos were cheap but less accurate; others were more accurate but cost much more per task. Depending on your needs, the “best” choice might be the cheapest that’s good enough, or the most accurate even if it costs more.
Simple components help a lot:
- Tool shortlisting: narrowing down which tools to consider so the agent doesn’t get overwhelmed.
- Schema guards: checking whether the agent called a tool correctly and letting it fix mistakes.
Failures often burn more time and money: When agents failed, they usually took more steps (and cost more) than when they succeeded. That means reliability isn’t just about getting it right—it also saves money.

Why is this important?

Big picture impact

Fair testing for everyone: The Unified Protocol and Exgentic make it possible to test any agent in many environments without breaking or rewriting things. That’s more fair and much easier for researchers and developers.
A shared scoreboard: The open leaderboard encourages healthy competition and faster progress toward useful general-purpose agents.
Practical guidance: The results show that choosing the right LLM is often the most important decision, then picking an agent design that fits your budget and task type.

What this could change

Companies and developers can more easily pick the right agent–model combo for their needs, whether that’s cutting costs or getting top performance.
Researchers can focus on what really improves generalization, not just tricks that only work in one test.
Over time, agents that work well “anywhere” may replace many single-purpose systems.

In short: this paper offers a fair, simple way to test general AI agents across many tasks, shows that general agents are already competitive with specialized ones, and highlights that the LLM you choose is usually the biggest factor in how well your agent performs.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single list of concrete gaps and unresolved questions that future work can address.

Protocol fidelity: Quantitatively measure how the Unified Protocol (and its adaptor layer) alters agent behavior versus native, benchmark-specific integrations by running the same agent on the same tasks under both interfaces and reporting performance, latency, and error-rate deltas.
Benchmark adaptation bias: The adaptation methodology derives task semantics and allowed actions from a single “reference agent” per benchmark (e.g., mini-swe-agent for SWE-Bench Verified). Assess sensitivity to these design choices by implementing and comparing multiple plausible adaptor designs (e.g., different bash action scopes, patch-submission mechanisms) and reporting how results vary.
Task definition integrity: Validate that Unified Protocol task/context fields neither leak extra information nor remove essential signals by auditing prompts and performing controlled ablations that reintroduce/exclude instructions originally embedded in reference prompts (e.g., “message-or-tool-only per turn” rules).
Coverage breadth: Expand evaluation beyond the six environments used (τ²-Bench subdomains, SWE-Bench Verified, AppWorld, BrowseComp+) to include CLI/terminal benchmarks (e.g., Terminal-Bench), real web environments (WebArena/BrowserGym), OS-level multi-app suites (OSWorld), multimodal tasks, robotics, and safety-critical domains.
Real-user evaluation: Replace or complement LLM-simulated users in τ²-Bench with human participants or high-fidelity user simulators to quantify distribution shift, safety, and usability under realistic interaction patterns.
Model diversity and tuning: Evaluate more models and families (e.g., Llama, Qwen, Mistral, DeepSeek) and systematically vary decoding/tool-calling hyperparameters (temperature, top_p, tool JSON schema strictness) to establish robust, model-agnostic conclusions rather than relying on “default parameters.”
Run-to-run variance and reproducibility: Perform repeated trials per (agent, model, task) and report confidence intervals, bootstrapped error bars, and model version pinning to account for LLM nondeterminism and vendor updates/drift.
Sample representativeness: Current results use 100 randomly sampled tasks per environment; quantify representativeness by stratifying samples across difficulty tiers and compare against full-benchmark runs to assess sampling bias.
Unified metrics: Success rate alone fails to capture partial progress, safety/compliance, and side effects. Define and report unified, cross-benchmark metrics (e.g., partial credit, rule violations, harmful actions, recovery ability, calibration/uncertainty) to improve comparability.
Latency and throughput: Include wall-clock time, step time, and throughput under load as first-class metrics alongside token-cost to reflect operational constraints for deployment.
Cost accounting completeness: Current cost estimates are based on API list prices and LLM tokens; include non-LLM costs (benchmark environment execution, sandboxes, adaptor overhead), amortized setup costs, and potential enterprise pricing to improve economic realism.
Step-budget sensitivity: Success/failure and cost are capped by max-turn limits (e.g., 100 turns; 50-turn cap in failure analysis). Report performance and cost as functions of step budgets and study dynamic budgeting policies.
Tool-scale limits: GPT 5.2’s hard limit of 128 tools caused zero scores in AppWorld. Investigate tool-scaling strategies (e.g., hierarchical tool ontologies, dynamic tool loading, compositional tools) and measure their impact across models with differing tool limits.
Shortlisting algorithm details: Specify and ablate tool shortlisting methods (e.g., retrieval scoring, top-k, learned selectors), their parameters, and failure modes to enable reproducible improvements and fair comparisons.
Component-level ablations: Beyond schema guards and shortlisting, systematically ablate memory, planning, reflection/self-correction, caching, and error-recovery components across agents and models, reporting their individual and interaction effects on performance, cost, and reliability.
Model–agent interaction modeling: The variance decomposition indicates model effects dominate, but the agent set is small. Use hierarchical statistical models with more agents to better quantify main effects and interactions, and test whether conclusions hold under expanded design spaces.
Protocol expressivity limits: The “message” and “final-answer” action design may not capture multi-phase tasks, asynchronous events, concurrent actions, streaming outputs, or partial observability. Extend and test protocol primitives to cover these cases and evaluate agents under those conditions.
Context handling strategies: The paper notes alternative context usage (e.g., storing in MCP resources) but does not implement or compare them. Evaluate different context-ingestion strategies (prompt concatenation vs. retrievable memory stores vs. tool-accessible resources) and quantify their effects.
Adaptor overhead and failure modes: Measure adaptor-induced latency, serialization/deserialization errors, schema mismatches, and cross-process communication failures, and report how they impact task success and cost.
Security and safety: Assess whether agents can perform unauthorized or unsafe actions through adaptor mappings and sandbox configurations (e.g., SWE-Bench bash actions), and establish security hardening and auditing protocols for general-agent evaluation.
Retrieval component isolation: In BrowseComp+, the retriever is fixed (dense Qwen3 embeddings). Evaluate sensitivity to retriever choice/quality and agent–retriever co-design, including robustness to noisy/biased retrieval and ablations on query planning.
Head-to-head with specialized agents: Current comparisons to domain leaderboards are indirect (different samples, setups). Run specialized, benchmark-optimized agents inside Exgentic under the same conditions to directly quantify general-agents’ competitiveness.
Governance and standardization: The Unified Protocol is not a community standard. Establish a formal governance process (versioning, extension proposals, compliance suites) and community validation to prevent protocol drift and benchmark-specific overfitting.
Dataset and environment versioning: Benchmarks (web tasks, repos, APIs) evolve. Implement strict environment snapshotting, dependency pinning, and change logs, and report sensitivity to version changes.
Failure analytics depth: The paper shows failed runs generally consume more steps; extend analysis to categorize failure types (tool-selection errors, schema violations, recovery failures, looping) and connect them to actionable component fixes.
Generalization beyond static tasks: Assess agents on temporally dynamic environments (websites that change, evolving repositories) to test adaptability, monitoring, and recovery under real-world drift.
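Several of the gaps above call for confidence intervals around success rates. As one possible starting point, the sketch below computes a percentile bootstrap interval over per-task outcomes for a single configuration. The outcome data are hypothetical (60 successes in 100 sampled tasks), and the resample count, interval method, and `bootstrap_ci` name are choices made here, not taken from the paper.

```python
import random
from statistics import fmean


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 task outcomes."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample tasks with replacement and record the success rate of each resample.
    means = sorted(fmean(rng.choices(outcomes, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


outcomes = [1] * 60 + [0] * 40  # hypothetical: 60 successes in 100 sampled tasks
lo, hi = bootstrap_ci(outcomes)
print(f"success rate {fmean(outcomes):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

This captures only task-sampling uncertainty; run-to-run nondeterminism of the LLM would need repeated trials per task on top of it.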

Practical Applications

Practical, real-world applications of the paper’s findings

The paper introduces the Unified Protocol, the Exgentic evaluation framework, and the Open General Agent Leaderboard to systematically assess general-purpose agents across heterogeneous environments. Below are applications that leverage these contributions and the empirical insights (e.g., cost-performance tradeoffs, component effects, and cross-benchmark generalization patterns).

Immediate Applications

The following applications can be deployed now using the paper’s released protocol, framework, and results.

Enterprise agent procurement and benchmarking (software, customer service, IT operations)
- Use Exgentic to run internal “agent bake-offs” across representative tasks, comparing agent-model pairs on success rate, cost per task, and interaction steps.
- Potential tools/workflows: internal Open General Agent Leaderboard clone, standardized “agent readiness” reports, reproducible traces and cost logs.
- Assumptions/dependencies: access to LLM APIs (pricing variability), adaptor creation to map internal tasks to the Unified Protocol, data privacy and sandboxing for evaluations.
Cost-aware agent routing and workload segmentation (operations, finance, customer-facing support)
- Operationalize the Pareto frontier: route low-stakes/high-volume tasks to GPT 5.2 configurations (efficiency), and high-stakes tasks to Claude Opus 4.5 configurations (performance).
- Potential tools/products: “Cost-aware Agent Router,” policy rules for automatic model-agent selection per task type.
- Assumptions/dependencies: accurate task classification, budget enforcement, consistent availability of target models; enterprise pricing may differ from public list prices.
Protocol bridges for legacy environments (telecom, retail, airline, web apps)
- Wrap existing internal APIs, CLIs, or MCP tools as Unified Protocol actions to test and deploy general agents without rewriting systems.
- Potential tools/workflows: “Unified Protocol Adapter SDK,” action schema libraries, message/final-answer action mapping.
- Assumptions/dependencies: clear action semantics and parameter schemas; secure sandboxing for bash/CLI interactions; adaptor maintenance as systems evolve.
AgentOps observability and reliability tuning (software/SRE, MLOps)
- Instrument agents with step-count tracking, early termination thresholds, and failure diagnostics—leveraging the finding that failures tend to consume more steps (and cost).
- Potential tools/products: “AgentOps Dashboard,” “Agent Trace Explorer,” budget alarms tied to interaction counts.
- Assumptions/dependencies: standardized logging across agents; consistent orchestration of sessions; adherence to reproducible runs.
Component-level upgrades: schema guards and tool shortlisting (software engineering agents, customer support)
- Adopt schema guards to catch invalid tool calls and enable self-correction; implement tool shortlisting in tool-rich environments to unlock otherwise unusable model configurations (e.g., GPT 5.2 with >128 tools).
- Potential tools/workflows: reusable shortlisting modules, validation middleware for tool schemas.
- Assumptions/dependencies: enumerated action spaces, ability to intercept/validate tool calls; performance depends on underlying model behavior.
Academic reproducibility and ablation labs (AI research and education)
- Use Exgentic for method courses and lab exercises: component ablations (planning, memory, schema guards), cross-benchmark generalization studies, and model-agent pairing analyses.
- Potential tools/workflows: coursework bundles, caching and parallelism for class-scale runs, public benchmark contributions via Unified Protocol.
- Assumptions/dependencies: compute availability; benchmark licensing; standardized reporting of success/cost/steps.
Internal governance and procurement policy (public sector, regulated industries)
- Formalize evaluation criteria: require cross-domain success/cost reporting and trace preservation before purchasing agent solutions.
- Potential tools/workflows: procurement checklists anchored to the Leaderboard metrics; independent verification via Exgentic.
- Assumptions/dependencies: acceptance of common metrics; data protection during evaluations; clarity on domain boundaries in “general-purpose” claims.
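The cost-aware routing idea from the list above can be sketched as a simple selection rule: pick the cheapest configuration whose measured success rate clears a floor set by the task's stakes. All names, thresholds, and numbers below are hypothetical; a real router would draw them from an organization's own evaluation runs.

```python
from typing import Optional


def pick_config(configs: list[dict], min_success: float, budget: float) -> Optional[dict]:
    """Cheapest configuration that meets the success floor and stays within budget."""
    eligible = [c for c in configs if c["success"] >= min_success and c["cost"] <= budget]
    return min(eligible, key=lambda c: c["cost"], default=None)


# Hypothetical per-task cost (dollars) and success rate from an internal evaluation.
configs = [
    {"name": "low-cost", "cost": 0.25, "success": 0.40},
    {"name": "high-accuracy", "cost": 3.00, "success": 0.66},
]
# Hypothetical policy: minimum acceptable success rate per stakes level.
STAKES_TO_FLOOR = {"low": 0.35, "high": 0.60}


def route(stakes: str, budget: float) -> str:
    """Return the chosen configuration name, or a fallback when nothing qualifies."""
    choice = pick_config(configs, STAKES_TO_FLOOR[stakes], budget)
    return choice["name"] if choice else "escalate-to-human"


print(route("low", 1.00), route("high", 5.00), route("high", 1.00))
```

The explicit fallback matters: when no configuration satisfies both the success floor and the budget, the router should surface that conflict and not silently pick the cheapest option.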

Long-Term Applications

These applications require further research, scaling, community consensus, or domain-specific development.

Industry-wide interoperability standard for agents (standardization bodies, multi-vendor ecosystems)
- Promote the Unified Protocol as a foundation for a universal agent-benchmark interface, enabling plug-and-play evaluation and deployment across vendors and domains.
- Potential tools/products: formal specification, compliance test suites, reference adaptors across major protocols (CLI, tool-calling APIs, MCP).
- Assumptions/dependencies: community buy-in, governance process for versioning and extensions, robust security guidelines.
Enterprise-wide general-purpose agent deployments without per-domain tuning (cross-department automation)
- Deploy a single agent scaffold across customer service, IT helpdesk, and software ops by exposing departmental systems through Unified Protocol actions.
- Potential tools/workflows: “Unified Agent Gateway,” centralized action catalogs, dynamic model-agent pair optimization pipelines.
- Assumptions/dependencies: consistent reliability across domains; scalable adaptor maintenance; strong data governance and access controls.
Safety-critical certification frameworks (healthcare, finance, energy)
- Extend Exgentic to include safety, robustness, and policy-compliance metrics; enable certification that requires cross-benchmark stability and failure-mode audits.
- Potential tools/products: “Agent Certification & Compliance Toolkit,” domain-specific evaluation suites with human-in-the-loop gates.
- Assumptions/dependencies: domain-specific benchmarks (clinical/EHR, trading/compliance), rigorous sandboxing, legal/regulatory acceptance; enhanced reliability beyond current success-rate levels.
Multimodal and embodied agent evaluation (robotics, IoT, OS agents)
- Adapt the Unified Protocol to sensor/action interfaces for physical devices and real-computer environments (e.g., OSWorld-like tasks), enabling general agent evaluation in complex contexts.
- Potential tools/workflows: robotics adaptors for perception/action loops, multimodal action schemas, high-fidelity simulators.
- Assumptions/dependencies: safe control frameworks, latency constraints, robust error handling; expanded agent components (planning/memory) for temporal tasks.
Agent marketplaces with verified leaderboard-backed listings (software platforms)
- Curate agent packages with transparent, reproducible scores and cost profiles across standardized benchmarks to support enterprise and consumer selection.
- Potential tools/products: marketplace metadata schemas, automated verification harnesses, “trust badges” tied to Exgentic-style evaluations.
- Assumptions/dependencies: standardized reporting; verifiable, tamper-resistant runs; continuous re-evaluation as models/agents evolve.
Automated agent-model pairing and architecture search (MLOps, AutoML for agents)
- Leverage Exgentic instrumentation to build meta-optimizers that search over planning/memory/shortlisting components and model choices to hit target cost-performance envelopes.
- Potential tools/workflows: “Agent Architecture Search” services, constrained optimizers that respect tool limits, budget-aware deployment policies.
- Assumptions/dependencies: rich configuration spaces; reliable performance predictors; stable access to multiple backbone models.
Education and workforce upskilling in general-agent development (education, professional training)
- Develop curricula focused on cross-domain generalization, adaptor building, and agent component design informed by Leaderboard evidence.
- Potential tools/workflows: modular teaching kits, capstone projects on benchmark contributions and protocol design.
- Assumptions/dependencies: sustained community datasets; accessible cloud credits; evolving best practices for safe agent deployment.

Glossary

A2A: A shorthand for agent-to-agent interaction protocols used in some evaluation frameworks to model communication pathways. "AgentBeats models agents and benchmarks as interacting via A2A/MCP subsets, standardizing evaluation lifecycle components but leaving task semantics to individual benchmarks."
ablation studies: Experiments that remove or alter components of a system to quantify their effect on performance. "conducting ablation studies as a primary means of advancing general agent development."
adaptor: An external translation layer that maps between differing protocols or APIs without modifying the original agent or benchmark. "we use external adaptor code that handles synchronization and protocol translation."
agentic benchmarks: Evaluation suites designed specifically to assess AI agents performing tasks within structured environments. "Existing agentic benchmarks like SWE-Bench Verified (Jimenez et al., 2023) and τ²-Bench (Yao et al., 2024) provide valuable assessments of domain-specific agents."
agentic scaffolds: Structural patterns or templates that organize an agent’s reasoning, memory, and tool usage. "different agentic scaffolds exhibit comparable performance, despite substantial variance in cost."
backbone model: The underlying LLM that powers an agent, often the primary determinant of its capabilities and performance. "Performance is strongly influenced by backbone model choice."
BM25: A classic sparse retrieval ranking function used to score document relevance to a query. "We use the authors' provided retriever with either BM25 (Robertson et al., 1994) or Qwen3. Embedder-based dense retrieval (Zhang et al., 2025), and report results using the latter."
BrowserGym: A standardized ecosystem for evaluating web-based agents through browser interactions. "Recent consolidation efforts like BrowserGym (Chezelles et al., 2025) and Harbor (Shaw, 2025) have integrated multiple benchmarks within single domains, by exposing to the agent the current goals and environment semantics (Fig. 2(B))."
CLI (Command Line Interface): A text-based interface for issuing commands; often used as a fixed interaction protocol in certain benchmarks. "these frameworks still enforce a single protocol (web-based for BrowserGym, CLI-based for Harbor), preventing agents from using their native integration mechanisms and effectively evaluating a diminished version of the agent (Yehudai et al., 2025)."
dense retrieval: A retrieval method that uses vector embeddings to find semantically relevant documents. "Qwen3. Embedder-based dense retrieval (Zhang et al., 2025)"
Exgentic: A protocol-preserving evaluation framework introduced to systematically assess general agents across heterogeneous benchmarks. "Based on the Unified Protocol, we release Exgentic, an evaluation harness for general agents that supports modular insights - comparing architectures, analyzing LLM impact, and optimizing agent-model pairings."
git diff: A Git command that generates a patch showing differences between file versions, often used to submit code changes. "generating patches via git diff for evaluation."
HAL (Holistic Agent Leaderboard): An infrastructure that unifies multiple benchmarks but requires per-benchmark agent adaptation. "HAL (Kapoor et al., 2025) unifies infrastructure across benchmarks but requires per-benchmark agent adaptation."
Harbor: A framework that consolidates benchmarks under a fixed protocol (e.g., CLI), enabling standardized interactions. "Recent consolidation efforts like BrowserGym (Chezelles et al., 2025) and Harbor (Shaw, 2025) have integrated multiple benchmarks within single domains"
McNemar test: A statistical test for paired nominal data used to compare success rates across configurations. "We assessed statistical significance using a pooled McNemar test."
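As a rough illustration of the test named above, the sketch below computes an exact two-sided McNemar p-value from paired per-task outcomes of two configurations. The function name `mcnemar_exact` and the toy outcome lists are assumptions made for illustration; how the paper pools outcomes across benchmarks is not reproduced here.

```python
from math import comb


def mcnemar_exact(a_success, b_success):
    """Exact two-sided McNemar test on paired boolean outcomes.

    Returns (b, c, p_value), where b counts tasks only A solved and
    c counts tasks only B solved. Concordant pairs are ignored.
    """
    if len(a_success) != len(b_success):
        raise ValueError("outcome lists must be paired")
    b = sum(1 for x, y in zip(a_success, b_success) if x and not y)
    c = sum(1 for x, y in zip(a_success, b_success) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes on 15 shared tasks (not data from the paper).
config_a = [True] * 8 + [False] * 2 + [True] * 5
config_b = [False] * 8 + [True] * 2 + [True] * 5
print(mcnemar_exact(config_a, config_b))
```

Only the discordant pairs carry information here, which is why the test suits comparisons where both configurations are run on the same tasks.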
MCP (Model Context Protocol): A protocol for exposing tools and resources to LLMs in a standardized way. "agent interfaces (e.g., CLI, tool-calling APIs, MCP)"
narrow waist: An architectural design principle where a minimal common interface mediates between diverse systems, reducing integration complexity. "The Unified Protocol serves as a "narrow waist", adding a new agent (or benchmark) only needs adhering to it rather than to all benchmarks (agents)."
orchestrator: The coordination component that mediates actions and observations between agents and benchmarks during evaluation. "all communication between them is mediated by the orchestrator and the corresponding adaptor components."
Pareto frontier: The set of optimal configurations where improving one metric (e.g., performance) necessarily worsens another (e.g., cost). "The Pareto frontier (red dashed line) shows optimal tradeoffs: GPT 5.2 configurations offer the best cost-efficiency while Claude Opus 4.5 achieve the highest performance at 3-33× higher cost."
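To make the definition concrete, here is a minimal sketch that extracts the non-dominated configurations from a list of (name, cost, score) triples. The function name `pareto_frontier` and all numbers are hypothetical placeholders, not results from the paper.

```python
def pareto_frontier(configs):
    """Return names of non-dominated configs, ordered by increasing cost.

    configs: list of (name, cost, score); lower cost and higher score are better.
    A config is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, cost, score in configs:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for _, c, s in configs
        )
        if not dominated:
            frontier.append((cost, name))
    return [name for _, name in sorted(frontier)]


# Hypothetical agent-model configurations (cost in dollars per task, success rate).
configs = [
    ("cheap", 1.0, 0.40),
    ("mid", 3.0, 0.55),
    ("wasteful", 4.0, 0.50),  # costs more than "mid" and scores lower
    ("best", 10.0, 0.70),
]
print(pareto_frontier(configs))
```

In this toy input, "wasteful" drops out because "mid" is both cheaper and more accurate; the remaining three each represent a different cost-performance tradeoff.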
reference implementation: A baseline agent or system used to derive integration assumptions and interfaces for a benchmark. "we examined MINI-SWE AGENT2 as the reference implementation."
retriever: A component that fetches relevant documents or information to support an agent’s reasoning. "we fix the retriever to isolate agent reasoning and decision-making."
sandboxed environment: An isolated execution context that ensures safety and reproducibility of interactions. "Following mini-swe-agent, we expose a single bash action for repository interaction in a sandboxed environment, generating patches via git diff for evaluation."
schema guard: A mechanism that detects invalid tool invocation schemas and enables self-correction by the agent. "employ a schema guard component: a mechanism that detects when an action with an invalid schema is invoked and allows the agent to correct itself."
Spearman rank correlations: A nonparametric statistic measuring the monotonic association between ranked variables. "we computed Spearman rank correlations between benchmark scores across all agent-model configurations."
tool-calling: A pattern in which agents invoke external tools via structured API calls provided by the platform. "LiteLLM's tool-calling interface"
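One way such a guard could look is sketched below: an action is checked against a per-tool argument schema, and any errors are fed back to the agent so it can retry. This is an illustrative assumption, not Exgentic's implementation; `check_action`, `guarded_step`, the action layout, and the schema format are all hypothetical.

```python
def check_action(action, tool_schemas):
    """Return a list of error strings; an empty list means the action is valid.

    tool_schemas maps tool name -> {argument name: expected Python type}.
    """
    name = action.get("tool")
    if name not in tool_schemas:
        return [f"unknown tool: {name!r}"]
    schema = tool_schemas[name]
    args = action.get("args", {})
    errors = []
    for param, expected_type in schema.items():
        if param not in args:
            errors.append(f"missing argument: {param}")
        elif not isinstance(args[param], expected_type):
            errors.append(f"argument {param} should be {expected_type.__name__}")
    for param in args:
        if param not in schema:
            errors.append(f"unexpected argument: {param}")
    return errors


def guarded_step(propose, execute, tool_schemas, max_retries=2):
    """Ask the agent for an action, re-prompting with errors until it validates."""
    feedback = None
    for _ in range(max_retries + 1):
        action = propose(feedback)  # agent sees the previous errors, if any
        errors = check_action(action, tool_schemas)
        if not errors:
            return execute(action)  # only valid actions reach the environment
        feedback = "Invalid action: " + "; ".join(errors)
    raise ValueError(feedback)
```

The design choice illustrated is that malformed calls never reach the benchmark; the agent spends a retry rather than a failed environment step.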
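For readers unfamiliar with the statistic, the following sketch computes Spearman's rho as the Pearson correlation of average ranks (ties share the mean of their positions). The helper names and the example score lists are illustrative assumptions, not values from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation between two equally long score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical scores of five configurations on two benchmarks.
bench_a = [1, 2, 3, 4, 5]
bench_b = [2, 1, 4, 3, 5]
print(spearman(bench_a, bench_b))
```

A value near 1 would indicate that configurations ranking well on one benchmark tend to rank well on the other, regardless of the absolute score scales.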
tool shortlisting: Reducing or preselecting the available tools to make selection more efficient and scalable, especially in tool-rich settings. "Tool shortlisting, when added to a simple ReAct agent with tool calling, improves performance across all models in tool-rich environments."
Unified Protocol: A canonical mediation protocol that decouples agents and benchmarks by standardizing task, context, and actions. "We present the Unified Protocol, a benchmark-agent mediation protocol (Fig. 2(C))."
variance decomposition: A statistical method that partitions the variance of outcomes across contributing factors (e.g., model vs. agent). "We performed variance decomposition to isolate the relative contributions of model choice versus agent architecture."
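A simple form of this idea can be sketched as a two-way sum-of-squares split over a balanced model-by-agent grid of scores. The paper's exact procedure is not specified in this glossary, so treat the function `variance_shares` and the toy scores below as assumptions for illustration only.

```python
def variance_shares(scores):
    """Split score variance into model, agent, and residual shares.

    scores: dict mapping (model, agent) -> score, with exactly one score
    for every model-agent pair (a balanced grid).
    """
    models = sorted({m for m, _ in scores})
    agents = sorted({a for _, a in scores})
    grand = sum(scores.values()) / len(scores)
    model_mean = {m: sum(scores[(m, a)] for a in agents) / len(agents) for m in models}
    agent_mean = {a: sum(scores[(m, a)] for m in models) / len(models) for a in agents}
    ss_total = sum((x - grand) ** 2 for x in scores.values())
    if ss_total == 0:
        raise ValueError("all scores are identical")
    # Main-effect sums of squares: spread of row/column means around the grand mean.
    ss_model = len(agents) * sum((v - grand) ** 2 for v in model_mean.values())
    ss_agent = len(models) * sum((v - grand) ** 2 for v in agent_mean.values())
    return {
        "model": ss_model / ss_total,
        "agent": ss_agent / ss_total,
        "residual": (ss_total - ss_model - ss_agent) / ss_total,
    }


# Hypothetical success rates for two models and two agent architectures.
toy = {("m1", "a1"): 0.6, ("m1", "a2"): 0.5, ("m2", "a1"): 0.3, ("m2", "a2"): 0.2}
print(variance_shares(toy))
```

In the toy grid, switching models moves the score far more than switching agents, so most of the variance lands in the "model" share.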
