General Agent Evaluation
Abstract: The promise of general-purpose agents - systems that perform tasks in unfamiliar environments without domain-specific engineering - remains largely unrealized. Existing agents are predominantly specialized, and while emerging implementations like OpenAI SDK Agent and Claude Code hint at broader capabilities, no systematic evaluation of their general performance has been pursued. Current agentic benchmarks assume domain-specific integration, encoding task information in ways that preclude fair evaluation of general agents. This paper frames general-agent evaluation as a first-class research objective. We propose conceptual principles for such evaluation, a Unified Protocol enabling agent-benchmark integration, and Exgentic - a practical framework for general agent evaluation. We benchmark five prominent agent implementations across six environments as the first Open General Agent Leaderboard. Our experiments show that general agents generalize across diverse environments, achieving performance comparable to domain-specific agents without any environment-specific tuning. We release our evaluation protocol, framework, and leaderboard to establish a foundation for systematic research on general-purpose agents.
Explain it Like I'm 14
What is this paper about?
This paper is about testing “general-purpose AI agents.” Think of an agent like a smart helper that can plan, click buttons, run tools, and talk to people to get tasks done. A general-purpose agent should work in lots of different places—like helping with customer support, browsing the web, or fixing computer code—without being customized for each one.
The problem is: most tests today are built for one specific type of agent and one specific type of environment. That makes it hard to tell if an agent is truly general. This paper introduces a fair way to test any agent across many kinds of tasks and shows results on a public scoreboard.
What questions were they trying to answer?
- Can one agent handle many different types of tasks without special tweaks for each one?
- How should we fairly test these agents so they’re not forced into someone else’s rules?
- What matters more for success: the agent’s design or the underlying LLM it uses?
- How do cost and performance trade off when these agents work in the real world?
How did they study it?
A “universal translator” for testing: the Unified Protocol
Different test environments and agents “speak” different technical languages. The authors built a simple, shared way for them to talk, called the Unified Protocol. You can think of it like a universal translator that turns any agent’s actions into a format any test can understand, and vice versa. It focuses on three things:
- Task: what the agent should do (the instructions).
- Context: what the agent should know (helpful information or rules).
- Actions: what the agent is allowed to do (like “search flights” or “run a command”).
This avoids rewriting every agent for every test, like using a plug adapter instead of rewiring the house.
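To make the three fields concrete, here is a minimal Python sketch of what a Unified Protocol session could look like. The class names (`Action`, `Session`), the `invoke` method, and the flight-search example are all illustrative assumptions; the summary above does not give the protocol's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Action:
    """One thing the agent is allowed to do (hypothetical structure)."""
    name: str
    description: str
    parameters: dict[str, str]   # parameter name -> type name, e.g. {"origin": "str"}
    handler: Callable[..., Any]  # benchmark-side function that executes the action


@dataclass
class Session:
    """The three fields the protocol exposes to an agent for one task."""
    task: str                                                 # what to do
    context: dict[str, Any]                                   # what to know
    actions: dict[str, Action] = field(default_factory=dict)  # what it may do

    def invoke(self, name: str, **kwargs: Any) -> Any:
        """Run an allowed action; reject anything outside the action set."""
        if name not in self.actions:
            raise KeyError(f"unknown action: {name}")
        return self.actions[name].handler(**kwargs)


# Hypothetical benchmark-side tool, used only for this example.
def search_flights(origin: str, destination: str) -> list[str]:
    return [f"{origin}->{destination} 09:00", f"{origin}->{destination} 17:30"]


session = Session(
    task="Rebook the customer on the earliest flight to Boston.",
    context={"policy": "Changes are free within 24 hours of booking."},
    actions={
        "search_flights": Action(
            name="search_flights",
            description="List flights between two airports.",
            parameters={"origin": "str", "destination": "str"},
            handler=search_flights,
        )
    },
)

result = session.invoke("search_flights", origin="JFK", destination="BOS")
print(result)
```

Because the agent only ever sees `task`, `context`, and `actions`, an adapter for a new benchmark only has to fill in these three fields rather than match each agent's native interface.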
The Exgentic framework: a test rig
They built a tool called Exgentic that:
- Connects any supported agent to any supported test using adapters (like plug converters).
- Runs tasks one by one in isolated “sessions,” sends observations to the agent, gets the agent’s actions back, and keeps looping until done.
- Records success, how many steps it took, and how much it cost in API dollars.
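The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Exgentic's code: `run_session`, `RunRecord`, `CountdownEnv`, and the flat per-step cost are hypothetical names and simplifications (real cost would come from token usage).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunRecord:
    success: bool
    steps: int
    cost_usd: float


class CountdownEnv:
    """Toy environment: the task is solved after `ticks_needed` 'tick' actions."""

    def __init__(self, ticks_needed: int = 3) -> None:
        self.ticks_needed = ticks_needed
        self.remaining = ticks_needed

    def reset(self) -> str:
        self.remaining = self.ticks_needed
        return f"remaining={self.remaining}"

    def step(self, action: str) -> tuple[str, bool, bool]:
        """Apply one action; return (observation, done, success)."""
        if action == "tick":
            self.remaining -= 1
        done = self.remaining == 0
        return f"remaining={self.remaining}", done, done


def run_session(env, agent: Callable[[str], str],
                cost_per_step_usd: float, max_steps: int = 100) -> RunRecord:
    """Send observations to the agent and actions to the environment until done."""
    observation = env.reset()
    for step in range(1, max_steps + 1):
        action = agent(observation)
        observation, done, success = env.step(action)
        if done:
            return RunRecord(success, step, step * cost_per_step_usd)
    # Step budget exhausted: count the run as a failure at full cost.
    return RunRecord(False, max_steps, max_steps * cost_per_step_usd)


record = run_session(CountdownEnv(), lambda obs: "tick", cost_per_step_usd=0.01)
stuck = run_session(CountdownEnv(), lambda obs: "wait", cost_per_step_usd=0.01, max_steps=5)
print(record)
print(stuck)
```

The second run shows why a step cap matters: an agent that never makes progress uses the whole budget before it is recorded as a failure.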
What did they test?
They evaluated five well-known agent setups using three different LLMs across six different environments, including:
- Customer support simulations (airline, retail, telecom)
- Deep web research
- Everyday app tasks (like a digital assistant)
- Real software bug fixing using code repositories
They also built a public scoreboard: the Open General Agent Leaderboard.
What did they find, and why does it matter?
Here are the main results, summarized in plain language:
- General agents can generalize: Without custom setups for each environment, these agents performed about as well as specialized agents that were hand-tuned for specific tasks. That’s a big deal—it means a single agent can be widely useful.
- The LLM matters most: The “brain” behind the agent (the LLM it uses) has a much bigger impact on success than the agent’s outer scaffolding. Better models generally led to better scores across many tasks.
- No one agent wins everywhere: Different agent designs did better or worse depending on the task and the model they used. There wasn’t one “super agent” that beat all others on everything.
- Cost versus performance is a real tradeoff: Some model–agent combos were cheap but less accurate; others were more accurate but cost much more per task. Depending on your needs, the “best” choice might be the cheapest that’s good enough, or the most accurate even if it costs more.
- Simple components help a lot:
- Tool shortlisting: narrowing down which tools to consider so the agent doesn’t get overwhelmed.
- Schema guards: checking whether the agent called a tool correctly and letting it fix mistakes.
- Failures often burn more time and money: When agents failed, they usually took more steps (and cost more) than when they succeeded. That means reliability isn’t just about getting it right—it also saves money.
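The two "simple components" from the list above can be illustrated with a short Python sketch. The paper summary does not say how shortlisting is scored or how schemas are checked, so the word-overlap ranking, the `TOOLS` table, and the function names `shortlist_tools` and `check_call` are assumptions made for this example.

```python
from typing import Any

# Hypothetical tool catalog: description plus expected parameter types.
TOOLS: dict[str, dict[str, Any]] = {
    "search_flights": {
        "description": "search flights between two airports",
        "params": {"origin": str, "destination": str},
    },
    "cancel_booking": {
        "description": "cancel an existing booking",
        "params": {"booking_id": str},
    },
    "get_weather": {
        "description": "get the weather forecast for a city",
        "params": {"city": str},
    },
}


def shortlist_tools(task: str, tools: dict[str, dict[str, Any]], k: int) -> list[str]:
    """Tool shortlisting: keep the k tools whose descriptions share most words with the task."""
    task_words = set(task.lower().split())

    def score(name: str) -> int:
        return len(task_words & set(tools[name]["description"].lower().split()))

    # Highest overlap first; ties broken alphabetically for determinism.
    return sorted(tools, key=lambda name: (-score(name), name))[:k]


def check_call(tool_name: str, args: dict[str, Any],
               tools: dict[str, dict[str, Any]]) -> list[str]:
    """Schema guard: return the problems with a tool call (empty list means valid)."""
    if tool_name not in tools:
        return [f"unknown tool '{tool_name}'"]
    params = tools[tool_name]["params"]
    problems = [f"missing parameter '{p}'" for p in params if p not in args]
    problems += [f"unexpected parameter '{a}'" for a in args if a not in params]
    problems += [
        f"parameter '{p}' should be {t.__name__}"
        for p, t in params.items()
        if p in args and not isinstance(args[p], t)
    ]
    return problems


print(shortlist_tools("cancel my booking for tomorrow", TOOLS, k=1))
print(check_call("cancel_booking", {"id": "ABC123"}, TOOLS))
print(check_call("cancel_booking", {"booking_id": "ABC123"}, TOOLS))
```

In a real agent loop, the list returned by `check_call` would be sent back to the model as an observation so it can retry with a corrected call instead of failing the task outright.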
Why is this important?
Big picture impact
- Fair testing for everyone: The Unified Protocol and Exgentic make it possible to test any agent in many environments without breaking or rewriting things. That’s more fair and much easier for researchers and developers.
- A shared scoreboard: The open leaderboard encourages healthy competition and faster progress toward useful general-purpose agents.
- Practical guidance: The results show that choosing the right LLM is often the most important decision, then picking an agent design that fits your budget and task type.
What this could change
- Companies and developers can more easily pick the right agent–model combo for their needs, whether that’s cutting costs or getting top performance.
- Researchers can focus on what really improves generalization, not just tricks that only work in one test.
- Over time, agents that work well “anywhere” may replace many single-purpose systems.
In short: this paper offers a fair, simple way to test general AI agents across many tasks, shows that general agents are already competitive with specialized ones, and highlights that the LLM you choose is usually the biggest factor in how well your agent performs.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single list of concrete gaps and unresolved questions that future work can address.
- Protocol fidelity: Quantitatively measure how the Unified Protocol (and its adaptor layer) alters agent behavior versus native, benchmark-specific integrations by running the same agent on the same tasks under both interfaces and reporting performance, latency, and error-rate deltas.
- Benchmark adaptation bias: The adaptation methodology derives task semantics and allowed actions from a single “reference agent” per benchmark (e.g., mini-swe-agent for SWE-Bench Verified). Assess sensitivity to these design choices by implementing and comparing multiple plausible adaptor designs (e.g., different bash action scopes, patch-submission mechanisms) and reporting how results vary.
- Task definition integrity: Validate that Unified Protocol task/context fields neither leak extra information nor remove essential signals by auditing prompts and performing controlled ablations that reintroduce/exclude instructions originally embedded in reference prompts (e.g., “message-or-tool-only per turn” rules).
- Coverage breadth: Expand evaluation beyond the six environments used (τ²-Bench subdomains, SWE-Bench Verified, AppWorld, BrowseComp+) to include CLI/terminal benchmarks (e.g., Terminal-Bench), real web environments (WebArena/BrowserGym), OS-level multi-app suites (OSWorld), multimodal tasks, robotics, and safety-critical domains.
- Real-user evaluation: Replace or complement LLM-simulated users in τ²-Bench with human participants or high-fidelity user simulators to quantify distribution shift, safety, and usability under realistic interaction patterns.
- Model diversity and tuning: Evaluate more models and families (e.g., Llama, Qwen, Mistral, DeepSeek) and systematically vary decoding/tool-calling hyperparameters (temperature, top_p, tool JSON schema strictness) to establish robust, model-agnostic conclusions rather than relying on “default parameters.”
- Run-to-run variance and reproducibility: Perform repeated trials per (agent, model, task) and report confidence intervals, bootstrapped error bars, and model version pinning to account for LLM nondeterminism and vendor updates/drift.
- Sample representativeness: Current results use 100 randomly sampled tasks per environment; quantify representativeness by stratifying samples across difficulty tiers and compare against full-benchmark runs to assess sampling bias.
- Unified metrics: Success-rate alone fails to capture partial progress, safety/compliance, and side-effects. Define and report unified, cross-benchmark metrics (e.g., partial credit, rule violations, harmful actions, recovery ability, calibration/uncertainty) to improve comparability.
- Latency and throughput: Include wall-clock time, step time, and throughput under load as first-class metrics alongside token-cost to reflect operational constraints for deployment.
- Cost accounting completeness: Current cost estimates are based on API list prices and LLM tokens; include non-LLM costs (benchmark environment execution, sandboxes, adaptor overhead), amortized setup costs, and potential enterprise pricing to improve economic realism.
- Step-budget sensitivity: Success/failure and cost are capped by max-turn limits (e.g., 100 turns; 50-turn cap in failure analysis). Report performance and cost as functions of step budgets and study dynamic budgeting policies.
- Tool-scale limits: GPT 5.2’s hard limit of 128 tools caused zero scores in AppWorld. Investigate tool-scaling strategies (e.g., hierarchical tool ontologies, dynamic tool loading, compositional tools) and measure their impact across models with differing tool limits.
- Shortlisting algorithm details: Specify and ablate tool shortlisting methods (e.g., retrieval scoring, top-k, learned selectors), their parameters, and failure modes to enable reproducible improvements and fair comparisons.
- Component-level ablations: Beyond schema guards and shortlisting, systematically ablate memory, planning, reflection/self-correction, caching, and error-recovery components across agents and models, reporting their individual and interaction effects on performance, cost, and reliability.
- Model–agent interaction modeling: The variance decomposition indicates model effects dominate, but the agent set is small. Use hierarchical statistical models with more agents to better quantify main effects and interactions, and test whether conclusions hold under expanded design spaces.
- Protocol expressivity limits: The “message” and “final-answer” action design may not capture multi-phase tasks, asynchronous events, concurrent actions, streaming outputs, or partial observability. Extend and test protocol primitives to cover these cases and evaluate agents under those conditions.
- Context handling strategies: The paper notes alternative context usage (e.g., storing in MCP resources) but does not implement or compare them. Evaluate different context-ingestion strategies (prompt concatenation vs. retrievable memory stores vs. tool-accessible resources) and quantify their effects.
- Adaptor overhead and failure modes: Measure adaptor-induced latency, serialization/deserialization errors, schema mismatches, and cross-process communication failures, and report how they impact task success and cost.
- Security and safety: Assess whether agents can perform unauthorized or unsafe actions through adaptor mappings and sandbox configurations (e.g., SWE-Bench bash actions), and establish security hardening and auditing protocols for general-agent evaluation.
- Retrieval component isolation: In BrowseComp+, the retriever is fixed (dense Qwen3 embeddings). Evaluate sensitivity to retriever choice/quality and agent–retriever co-design, including robustness to noisy/biased retrieval and ablations on query planning.
- Head-to-head with specialized agents: Current comparisons to domain leaderboards are indirect (different samples, setups). Run specialized, benchmark-optimized agents inside Exgentic under the same conditions to directly quantify general-agents’ competitiveness.
- Governance and standardization: The Unified Protocol is not a community standard. Establish a formal governance process (versioning, extension proposals, compliance suites) and community validation to prevent protocol drift and benchmark-specific overfitting.
- Dataset and environment versioning: Benchmarks (web tasks, repos, APIs) evolve. Implement strict environment snapshotting, dependency pinning, and change logs, and report sensitivity to version changes.
- Failure analytics depth: The paper shows failed runs generally consume more steps; extend analysis to categorize failure types (tool-selection errors, schema violations, recovery failures, looping) and connect them to actionable component fixes.
- Generalization beyond static tasks: Assess agents on temporally dynamic environments (websites that change, evolving repositories) to test adaptability, monitoring, and recovery under real-world drift.
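As an illustration of the "bootstrapped error bars" suggested in the run-to-run variance item above, the sketch below computes a percentile bootstrap confidence interval for a success rate over 100 tasks. The outcome vector (62 successes out of 100) is made up for the example, and `bootstrap_ci` is a hypothetical helper, not something the paper provides.

```python
import random


def bootstrap_ci(outcomes: list[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample tasks with replacement and record the success rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical results: 62 successes on 100 sampled tasks.
outcomes = [1] * 62 + [0] * 38
lo, hi = bootstrap_ci(outcomes)
print(f"success rate 0.62, 95% CI roughly [{lo:.2f}, {hi:.2f}]")
```

With only 100 tasks per environment the interval spans many percentage points, which is why small gaps between leaderboard entries should be read with caution.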
Practical Applications
Practical, real-world applications of the paper’s findings
The paper introduces the Unified Protocol, the Exgentic evaluation framework, and the Open General Agent Leaderboard to systematically assess general-purpose agents across heterogeneous environments. Below are applications that leverage these contributions and the empirical insights (e.g., cost-performance tradeoffs, component effects, and cross-benchmark generalization patterns).
Immediate Applications
The following applications can be deployed now using the paper’s released protocol, framework, and results.
- Enterprise agent procurement and benchmarking (software, customer service, IT operations)
- Use Exgentic to run internal “agent bake-offs” across representative tasks, comparing agent-model pairs on success rate, cost per task, and interaction steps.
- Potential tools/workflows: internal Open General Agent Leaderboard clone, standardized “agent readiness” reports, reproducible traces and cost logs.
- Assumptions/dependencies: access to LLM APIs (pricing variability), adaptor creation to map internal tasks to the Unified Protocol, data privacy and sandboxing for evaluations.
- Cost-aware agent routing and workload segmentation (operations, finance, customer-facing support)
- Operationalize the Pareto frontier: route low-stakes/high-volume tasks to GPT 5.2 configurations (efficiency), and high-stakes tasks to Claude Opus 4.5 configurations (performance).
- Potential tools/products: “Cost-aware Agent Router,” policy rules for automatic model-agent selection per task type.
- Assumptions/dependencies: accurate task classification, budget enforcement, consistent availability of target models; enterprise pricing may differ from public list prices.
- Protocol bridges for legacy environments (telecom, retail, airline, web apps)
- Wrap existing internal APIs, CLIs, or MCP tools as Unified Protocol actions to test and deploy general agents without rewriting systems.
- Potential tools/workflows: “Unified Protocol Adapter SDK,” action schema libraries, message/final-answer action mapping.
- Assumptions/dependencies: clear action semantics and parameter schemas; secure sandboxing for bash/CLI interactions; adaptor maintenance as systems evolve.
- AgentOps observability and reliability tuning (software/SRE, MLOps)
- Instrument agents with step-count tracking, early termination thresholds, and failure diagnostics—leveraging the finding that failures tend to consume more steps (and cost).
- Potential tools/products: “AgentOps Dashboard,” “Agent Trace Explorer,” budget alarms tied to interaction counts.
- Assumptions/dependencies: standardized logging across agents; consistent orchestration of sessions; adherence to reproducible runs.
- Component-level upgrades: schema guards and tool shortlisting (software engineering agents, customer support)
- Adopt schema guards to catch invalid tool calls and enable self-correction; implement tool shortlisting in tool-rich environments to unlock otherwise unusable model configurations (e.g., GPT 5.2 with >128 tools).
- Potential tools/workflows: reusable shortlisting modules, validation middleware for tool schemas.
- Assumptions/dependencies: enumerated action spaces, ability to intercept/validate tool calls; performance depends on underlying model behavior.
- Academic reproducibility and ablation labs (AI research and education)
- Use Exgentic for method courses and lab exercises: component ablations (planning, memory, schema guards), cross-benchmark generalization studies, and model-agent pairing analyses.
- Potential tools/workflows: coursework bundles, caching and parallelism for class-scale runs, public benchmark contributions via Unified Protocol.
- Assumptions/dependencies: compute availability; benchmark licensing; standardized reporting of success/cost/steps.
- Internal governance and procurement policy (public sector, regulated industries)
- Formalize evaluation criteria: require cross-domain success/cost reporting and trace preservation before purchasing agent solutions.
- Potential tools/workflows: procurement checklists anchored to the Leaderboard metrics; independent verification via Exgentic.
- Assumptions/dependencies: acceptance of common metrics; data protection during evaluations; clarity on domain boundaries in “general-purpose” claims.
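The cost-aware routing idea in the list above rests on computing a Pareto frontier over (cost, success rate) pairs and then picking the cheapest configuration that clears a quality bar. The sketch below shows one way to do that; the configuration names and numbers are invented placeholders, not results from the leaderboard, and `pareto_frontier` and `cheapest_meeting` are hypothetical helpers.

```python
from typing import Optional

Config = tuple[str, float, float]  # (name, cost per task in USD, success rate)


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configurations not beaten on both cost and success, cheapest first."""
    frontier: list[Config] = []
    best_success = float("-inf")
    # Scan from cheapest to most expensive; on equal cost, higher success comes first.
    for name, cost, success in sorted(configs, key=lambda c: (c[1], -c[2])):
        # Keep a config only if it beats every cheaper (or equally priced) one.
        if success > best_success:
            frontier.append((name, cost, success))
            best_success = success
    return frontier


def cheapest_meeting(frontier: list[Config], min_success: float) -> Optional[Config]:
    """Routing rule: the cheapest frontier configuration that meets the quality bar."""
    return next((c for c in frontier if c[2] >= min_success), None)


# Placeholder numbers for illustration only.
configs: list[Config] = [
    ("config-A", 0.10, 0.40),
    ("config-B", 0.25, 0.38),  # dominated by config-A (costlier and worse)
    ("config-C", 0.60, 0.55),
    ("config-D", 1.50, 0.52),  # dominated by config-C
    ("config-E", 2.00, 0.61),
]

frontier = pareto_frontier(configs)
print([name for name, _, _ in frontier])
print(cheapest_meeting(frontier, min_success=0.50))
```

A low-stakes workload would set a low `min_success` and land on the cheap end of the frontier, while a high-stakes workload would raise the bar and accept the higher cost.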
Long-Term Applications
These applications require further research, scaling, community consensus, or domain-specific development.
- Industry-wide interoperability standard for agents (standardization bodies, multi-vendor ecosystems)
- Promote the Unified Protocol as a foundation for a universal agent-benchmark interface, enabling plug-and-play evaluation and deployment across vendors and domains.
- Potential tools/products: formal specification, compliance test suites, reference adaptors across major protocols (CLI, tool-calling APIs, MCP).
- Assumptions/dependencies: community buy-in, governance process for versioning and extensions, robust security guidelines.
- Enterprise-wide general-purpose agent deployments without per-domain tuning (cross-department automation)
- Deploy a single agent scaffold across customer service, IT helpdesk, and software ops by exposing departmental systems through Unified Protocol actions.
- Potential tools/workflows: “Unified Agent Gateway,” centralized action catalogs, dynamic model-agent pair optimization pipelines.
- Assumptions/dependencies: consistent reliability across domains; scalable adaptor maintenance; strong data governance and access controls.
- Safety-critical certification frameworks (healthcare, finance, energy)
- Extend Exgentic to include safety, robustness, and policy-compliance metrics; enable certification that requires cross-benchmark stability and failure-mode audits.
- Potential tools/products: “Agent Certification & Compliance Toolkit,” domain-specific evaluation suites with human-in-the-loop gates.
- Assumptions/dependencies: domain-specific benchmarks (clinical/EHR, trading/compliance), rigorous sandboxing, legal/regulatory acceptance; enhanced reliability beyond current success-rate levels.
- Multimodal and embodied agent evaluation (robotics, IoT, OS agents)
- Adapt the Unified Protocol to sensor/action interfaces for physical devices and real-computer environments (e.g., OSWorld-like tasks), enabling general agent evaluation in complex contexts.
- Potential tools/workflows: robotics adaptors for perception/action loops, multimodal action schemas, high-fidelity simulators.
- Assumptions/dependencies: safe control frameworks, latency constraints, robust error handling; expanded agent components (planning/memory) for temporal tasks.
- Agent marketplaces with verified leaderboard-backed listings (software platforms)
- Curate agent packages with transparent, reproducible scores and cost profiles across standardized benchmarks to support enterprise and consumer selection.
- Potential tools/products: marketplace metadata schemas, automated verification harnesses, “trust badges” tied to Exgentic-style evaluations.
- Assumptions/dependencies: standardized reporting; verifiable, tamper-resistant runs; continuous re-evaluation as models/agents evolve.
- Automated agent-model pairing and architecture search (MLOps, AutoML for agents)
- Leverage Exgentic instrumentation to build meta-optimizers that search over planning/memory/shortlisting components and model choices to hit target cost-performance envelopes.
- Potential tools/workflows: “Agent Architecture Search” services, constrained optimizers that respect tool limits, budget-aware deployment policies.
- Assumptions/dependencies: rich configuration spaces; reliable performance predictors; stable access to multiple backbone models.
- Education and workforce upskilling in general-agent development (education, professional training)
- Develop curricula focused on cross-domain generalization, adaptor building, and agent component design informed by Leaderboard evidence.
- Potential tools/workflows: modular teaching kits, capstone projects on benchmark contributions and protocol design.
- Assumptions/dependencies: sustained community datasets; accessible cloud credits; evolving best practices for safe agent deployment.
Glossary
- A2A: A shorthand for agent-to-agent interaction protocols used in some evaluation frameworks to model communication pathways. "AgentBeats models agents and benchmarks as interacting via A2A/MCP subsets, standardizing evaluation lifecycle components but leaving task semantics to individual benchmarks."
- ablation studies: Experiments that remove or alter components of a system to quantify their effect on performance. "conducting ablation studies as a primary means of advancing general agent development."
- adaptor: An external translation layer that maps between differing protocols or APIs without modifying the original agent or benchmark. "we use external adaptor code that handles synchronization and protocol translation."
- agentic benchmarks: Evaluation suites designed specifically to assess AI agents performing tasks within structured environments. "Existing agentic benchmarks like SWE-Bench Verified (Jimenez et al., 2023) and τ²-Bench (Yao et al., 2024) provide valuable assessments of domain-specific agents."
- agentic scaffolds: Structural patterns or templates that organize an agent’s reasoning, memory, and tool usage. "different agentic scaffolds exhibit comparable performance, despite substantial variance in cost."
- backbone model: The underlying LLM that powers an agent, often the primary determinant of its capabilities and performance. "Performance is strongly influenced by backbone model choice."
- BM25: A classic sparse retrieval ranking function used to score document relevance to a query. "We use the authors' provided retriever with either BM25 (Robertson et al., 1994) or Qwen3. Embedder-based dense retrieval (Zhang et al., 2025), and report results using the latter."
- BrowserGym: A standardized ecosystem for evaluating web-based agents through browser interactions. "Recent consolidation efforts like BrowserGym (Chezelles et al., 2025) and Harbor (Shaw, 2025) have integrated multiple benchmarks within single domains, by exposing to the agent the current goals and environment semantics (Fig. 2(B))."
- CLI (Command Line Interface): A text-based interface for issuing commands; often used as a fixed interaction protocol in certain benchmarks. "these frameworks still enforce a single protocol (web-based for BrowserGym, CLI-based for Harbor), preventing agents from using their native integration mechanisms and effectively evaluating a diminished version of the agent (Yehudai et al., 2025)."
- dense retrieval: A retrieval method that uses vector embeddings to find semantically relevant documents. "Qwen3. Embedder-based dense retrieval (Zhang et al., 2025)"
- Exgentic: A protocol-preserving evaluation framework introduced to systematically assess general agents across heterogeneous benchmarks. "Based on the Unified Protocol, we release Exgentic an evaluation harness for general agents that supports modular insights-comparing architectures, analyzing LLM impact, and optimizing agent-model pairings."
- git diff: A Git command that generates a patch showing differences between file versions, often used to submit code changes. "generating patches via git diff for evaluation."
- HAL (Holistic Agent Leaderboard): An infrastructure that unifies multiple benchmarks but requires per-benchmark agent adaptation. "HAL (Kapoor et al., 2025) unifies infrastructure across benchmarks but requires per-benchmark agent adaptation."
- Harbor: A framework that consolidates benchmarks under a fixed protocol (e.g., CLI), enabling standardized interactions. "Recent consolidation efforts like BrowserGym (Chezelles et al., 2025) and Harbor (Shaw, 2025) have integrated multiple benchmarks within single domains"
- McNemar test: A statistical test for paired nominal data used to compare success rates across configurations. "We assessed statistical significance using a pooled McNemar test."
- MCP (Model Context Protocol): A protocol for exposing tools and resources to LLMs in a standardized way. "agent interfaces (e.g., CLI, tool-calling APIs, MCP)"
- narrow waist: An architectural design principle where a minimal common interface mediates between diverse systems, reducing integration complexity. "The Unified Protocol serves as a "narrow waist", adding a new agent (or benchmark) only needs adhering to it rather than to all benchmarks (agents)."
- orchestrator: The coordination component that mediates actions and observations between agents and benchmarks during evaluation. "all communication between them is mediated by the orchestrator and the corresponding adaptor components."
- Pareto frontier: The set of optimal configurations where improving one metric (e.g., performance) necessarily worsens another (e.g., cost). "The Pareto frontier (red dashed line) shows optimal tradeoffs: GPT 5.2 configurations offer the best cost-efficiency while Claude Opus 4.5 achieve the highest performance at 3-33x higher cost."
- reference implementation: A baseline agent or system used to derive integration assumptions and interfaces for a benchmark. "we examined mini-swe-agent as the reference implementation."
- retriever: A component that fetches relevant documents or information to support an agent’s reasoning. "we fix the retriever to isolate agent reasoning and decision-making."
- sandboxed environment: An isolated execution context that ensures safety and reproducibility of interactions. "Following mini-swe-agent, we expose a single bash action for repository interaction in a sandboxed environment, generating patches via git diff for evaluation."
- schema guard: A mechanism that detects invalid tool invocation schemas and enables self-correction by the agent. "employ a schema guard component: a mechanism that detects when an action with an invalid schema is invoked and allows the agent to correct itself."
- Spearman rank correlations: A nonparametric statistic measuring the monotonic association between ranked variables. "we computed Spearman rank correlations between benchmark scores across all agent-model configurations."
- tool-calling: A pattern in which agents invoke external tools via structured API calls provided by the platform. "LiteLLM's tool-calling interface"
- tool shortlisting: Reducing or preselecting the available tools to make selection more efficient and scalable, especially in tool-rich settings. "Tool shortlisting, when added to a simple ReAct agent with tool calling, improves performance across all models in tool-rich environments."
- Unified Protocol: A canonical mediation protocol that decouples agents and benchmarks by standardizing task, context, and actions. "We present the Unified Protocol, a benchmark-agent mediation protocol (Fig. 2(C))."
- variance decomposition: A statistical method that partitions the variance of outcomes across contributing factors (e.g., model vs. agent). "We performed variance decomposition to isolate the relative contributions of model choice versus agent architecture."
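To illustrate the McNemar test from the glossary, the sketch below runs the standard exact (binomial) version on paired success/failure outcomes for two configurations evaluated on the same tasks. It is a textbook form of the test on a single pair; how the paper pools results across benchmarks is not described here, and the outcome vectors are invented for the example.

```python
from math import comb


def mcnemar_exact(a_success: list[int], b_success: list[int]) -> float:
    """Two-sided exact McNemar p-value for paired 0/1 outcomes on the same tasks."""
    # Only discordant pairs carry information about a difference.
    a_only = sum(1 for a, b in zip(a_success, b_success) if a and not b)
    b_only = sum(1 for a, b in zip(a_success, b_success) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the two configurations never disagree
    # Under the null hypothesis, discordant pairs split like fair coin flips.
    tail = sum(comb(n, i) for i in range(min(a_only, b_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented outcomes on 20 shared tasks: 8 tasks only A solves, 1 task only B solves,
# 5 tasks both solve, 6 tasks neither solves.
config_a = [1] * 8 + [0] * 1 + [1] * 5 + [0] * 6
config_b = [0] * 8 + [1] * 1 + [1] * 5 + [0] * 6

p_value = mcnemar_exact(config_a, config_b)
print(p_value)  # 0.0390625
```

The test ignores tasks where both configurations agree, which makes it a natural fit for comparing two agent-model pairs on an identical task sample.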