AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking

Published 26 Apr 2026 in cs.SE and cs.CL | (2604.23581v1)

Abstract: Agentic systems that chain reasoning, tool use, and synthesis into multi-step workflows are entering production, yet prevailing evaluation practices like end-to-end outcome checks and ad-hoc trace inspection systematically mask the intermediate failures that dominate real-world error budgets. We present AgentEval, a framework that formalizes agent executions as evaluation directed acyclic graphs (DAGs), where each node carries typed quality metrics assessed by a calibrated LLM judge (GPT-4o), classified through a hierarchical failure taxonomy (3 levels, 21 subcategories), and linked to upstream dependencies for automated root cause attribution. An ablation study isolates the impact of DAG-based dependency modeling: it alone contributes +22 percentage points to failure detection recall and +34 pp to root cause accuracy over flat step-level evaluation with identical judges and rubrics. Across three production workflows (450 test cases, two agent model families, predominantly sequential architectures with a 12% non-DAG trace rate), AgentEval achieves 2.17x higher failure detection recall than end-to-end evaluation (0.89 vs. 0.41), Cohen's kappa = 0.84 agreement with human experts, and 72% root cause accuracy against an 81% human ceiling. Cross-system evaluation on tau-bench and SWE-bench traces confirms transferability (failure detection recall >= 0.78) without taxonomy or rubric modification. A 4-month pilot with 18 engineers detected 23 pre-release regressions through CI/CD-integrated regression testing, reducing median root-cause identification time from 4.2 hours to 22 minutes and driving measurable failure rate reductions in two workflows.


Authors (3)

Summary

The paper introduces a DAG-based, step-level framework that enhances failure detection and root cause analysis using calibrated LLM-as-judge techniques.
It combines a hierarchical failure taxonomy with a CI/CD regression suite; in a 4-month pilot, median root-cause identification time fell from 4.2 hours to 22 minutes.
Experimental results show 2.17× higher failure detection recall (FDRec) than end-to-end evaluation (0.89 vs. 0.41) and strong agreement with human experts (Cohen's κ = 0.84).

AgentEval: DAG-Based Step-Level Evaluation with Error Propagation Tracking for Agentic Workflows

Overview

"AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking" (2604.23581) introduces a formal evaluation framework for autonomous AI agents executing multi-step workflows. Unlike conventional evaluation paradigms that focus on end-to-end outcomes, AgentEval explicitly models agent actions as directed acyclic graphs (DAGs) where each node (step) is independently evaluated according to its type, and the system automatically tracks how errors propagate through dependency chains. The framework incorporates a hierarchical failure taxonomy and leverages calibrated LLM-as-judge protocols for scalable, human-aligned assessment. Its integration with CI/CD pipelines enables continuous, regression-aware quality monitoring.

Motivation and Challenges

Evaluating agentic AI systems, in which LLM-powered agents interact with multiple tools, construct plans, and synthesize outputs, presents unique difficulties. Traditional end-to-end (E2E) metrics obscure intermediate failures, hindering both root cause identification and systematic quality improvement. Prior work demonstrates that most real-world agent errors are either propagated or compounded across steps [cemri2025multiagent, zhu2025agents]. Manual trace inspection does not scale, and static benchmarks neglect deployment-specific constraints. AgentEval addresses these gaps by (a) capturing workflow structure in the form of an evaluation DAG, (b) evaluating each step with type-specific metrics, and (c) attributing failures to upstream root causes using automated heuristics.

Framework Design

AgentEval is built around four key modules:

Evaluation DAG Formalism: Each agent workflow execution is represented as a DAG, with nodes corresponding to individual steps (e.g., planning, tool selection, parameter generation, execution, synthesis). Edges encode dependencies and provide explicit context for error propagation analysis.
Step-Level Quality Metrics and Judging: Each node type has a dedicated set of quality metrics, evaluated by an LLM-as-judge (primary: GPT-4o, with fallback judges for scalability). Calibration is performed with stratified few-shot examples per metric to ensure high fidelity to human annotation, with separate absolute or relative framing depending on whether the step is a root or interior node. The systematic use of cross-family LLMs (agent and judge are different model families) mitigates evaluation circularity.
Hierarchical Failure Taxonomy: A comprehensive three-level taxonomy (21 subcategories) annotates failure types, derived from large-scale empirical trace analysis. This taxonomy supports granular diagnostic tasks and quantifies error propagation rates by failure class.
CI/CD Regression Suite: Automated regression detection using paired bootstrap hypothesis testing is integrated tightly with development pipelines, enabling pre-release gating and progressive, cost-aware evaluation strategies.
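The DAG formalism and propagation tracking described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` class, the 0.5 pass threshold, the score scale, and the rule "blame the earliest failing ancestors" are all assumptions made for illustration. The source only states that failures are linked to upstream dependencies and attributed with automated heuristics.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node of a hypothetical evaluation DAG."""
    name: str
    kind: str                    # e.g. "planning", "tool_selection", "synthesis"
    score: float                 # judge score in [0, 1] (assumed scale)
    parents: list = field(default_factory=list)  # names of upstream steps


def failed(step, threshold=0.5):
    # Assumed pass/fail rule: a step fails when its score is below the threshold.
    return step.score < threshold


def root_causes(dag, name, threshold=0.5):
    """Return the earliest failing steps upstream of (or equal to) `name`.

    Greedy rule (an assumption): if any parent failed, blame the parents'
    root causes; otherwise the failing step is itself the root cause.
    """
    step = dag[name]
    if not failed(step, threshold):
        return set()
    upstream = set()
    for parent in step.parents:
        upstream |= root_causes(dag, parent, threshold)
    return upstream or {name}


# Toy trace (made up): a bad parameter poisons execution and synthesis.
steps = [
    Step("plan", "planning", 0.9),
    Step("select_tool", "tool_selection", 0.8, ["plan"]),
    Step("gen_params", "parameter_generation", 0.2, ["select_tool"]),
    Step("execute", "execution", 0.3, ["gen_params"]),
    Step("synthesize", "synthesis", 0.4, ["execute", "plan"]),
]
dag = {s.name: s for s in steps}

print(root_causes(dag, "synthesize"))  # {'gen_params'}
```

In this toy trace, three steps score below the threshold, but only `gen_params` has no failing parent, so the downstream failures are treated as propagated rather than independent. A flat step-level evaluator would report all three as separate failures.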

Experimental Validation

Datasets and Evaluation Protocol

Three production workflow domains were studied: customer service, data analysis, and document processing, each instrumented with realistic tool usage, step variety, and LLM agents (Claude 3.5 Sonnet, Llama 3 70B).
Evaluation separates taxonomy development (523 traces) from primary assessment (450 traces, 987 step annotations, independent human expert labels).
Performance was assessed against E2E, flat stepwise, and rule-based baselines. Metrics include failure detection recall (FDRec), root cause accuracy (RCA), and Cohen's $\kappa$ for human alignment.
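For readers unfamiliar with the two headline metrics, the sketch below computes them on toy labels. The function names and the data are illustrative assumptions; FDRec is taken here to be ordinary recall over human-labelled failures, and Cohen's $\kappa$ uses the standard definition $(p_o - p_e)/(1 - p_e)$.

```python
from collections import Counter


def failure_detection_recall(human_failed, predicted_failed):
    """Fraction of human-labelled failures that the evaluator also flagged."""
    flagged = [p for h, p in zip(human_failed, predicted_failed) if h]
    if not flagged:
        raise ValueError("no human-labelled failures to recall")
    return sum(flagged) / len(flagged)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters.

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy labels (made up for illustration): True means "step failed".
human = [True, True, True, True, False, False, False, False, False, False]
judge = [True, True, True, False, False, False, False, False, False, True]

print(failure_detection_recall(human, judge))  # 3 of the 4 failures are found
print(round(cohens_kappa(human, judge), 3))
```

Raw agreement in this example is 80%, yet $\kappa$ is noticeably lower because it discounts the agreement two raters would reach by chance given their label frequencies.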

Main Results

AgentEval achieves:

Failure Detection Recall (FDRec) of 0.89, a factor of $2.17\times$ higher than E2E (0.41) and +22 percentage points over flat step evaluation (0.67), demonstrating the necessity of dependency modeling for surfacing latent failures.
Root Cause Accuracy of 72%, closely approaching the 81% human ceiling.
Cohen's $\kappa = 0.84$ with human experts, confirming near-perfect agreement in step-level judgments.
In practical deployment, median time to root cause identification was reduced from 4.2 hours to 22 minutes.
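The relative and absolute gains quoted above follow directly from the reported values; the last ratio is a derived figure, not one stated in the source:

$$\frac{0.89}{0.41} \approx 2.17, \qquad 0.89 - 0.67 = 0.22 \;(+22\ \text{pp}), \qquad \frac{72\%}{81\%} \approx 0.89$$

The first figure is a ratio against E2E evaluation, the second an absolute difference against flat step-level evaluation, and the third indicates that automated attribution reaches roughly 89% of the human ceiling.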

Ablation confirms that DAG structure is the dominant driver of performance: removing it degrades FDRec by 22 percentage points and RCA by 34 percentage points. Removing the taxonomy primarily harms root cause analysis, and removing judge calibration most affects human alignment.

Generalization and Limitations

Cross-system evaluation on external benchmarks ($\tau$-bench and SWE-bench) without taxonomy adaptation yielded robust detection ($\text{FDRec} \geq 0.78$), though RCA degrades when failure modes fall outside the predefined taxonomy.
DAG modeling retains its advantage until approximately 60% of execution traces are non-DAG (e.g., because of loops or dynamic branching); beyond this, gains diminish, indicating a current limitation for highly dynamic or cyclic agent architectures.
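One way to measure a non-DAG trace rate is sketched below. It assumes a trace counts as non-DAG exactly when its dependency edges contain a cycle; the source does not give the authors' criterion, and the trace format and function names are hypothetical. The check uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(dependencies):
    """`dependencies` maps each step name to the set of steps it depends on."""
    try:
        # prepare() raises CycleError if the graph contains a cycle.
        TopologicalSorter(dependencies).prepare()
    except CycleError:
        return False
    return True


def non_dag_rate(traces):
    """Fraction of traces whose dependency graph is not acyclic."""
    if not traces:
        raise ValueError("no traces given")
    return sum(not is_dag(t) for t in traces) / len(traces)


# Toy traces (made up): the second contains a plan -> act -> reflect -> plan loop.
traces = [
    {"plan": set(), "act": {"plan"}, "answer": {"act"}},
    {"plan": {"reflect"}, "act": {"plan"}, "reflect": {"act"}},
]

print(non_dag_rate(traces))  # 0.5
```

In practice a looping agent could also be unrolled into one node per iteration, which would make the trace acyclic again; whether to unroll or to count the trace as non-DAG is a design choice this sketch leaves open.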

Practical Implications

AgentEval establishes a rigorous, scalable evaluation paradigm that is directly useful for production engineering teams:

Fine-grained error localization exposes dominant failure propagation chains, allowing engineering interventions to target high-leverage failure modes (e.g., context loss, parameter hallucination).
Automated regression detection in CI/CD pipelines enables prompt identification of silent performance regressions, mitigating risk of quality drift during frequent model or prompt updates.
Strong judge-agent decoupling ensures reliability of LLM-as-judge protocols for ongoing evaluation.
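The regression gate mentioned above relies on paired bootstrap hypothesis testing. The following is a minimal sketch of that general technique, not the authors' code: the function name, the one-sided formulation, the number of resamples, and the 0.05 significance level are assumptions.

```python
import random


def paired_bootstrap_pvalue(baseline, candidate, n_resamples=2000, seed=0):
    """One-sided paired bootstrap test for a drop in mean per-case score.

    `baseline[i]` and `candidate[i]` are scores for the same test case under
    the old and new agent version. Test cases are resampled with replacement;
    the return value is the fraction of resamples in which the candidate's
    mean score is not worse than the baseline's. A small value suggests a
    regression.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    not_worse = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n >= 0:
            not_worse += 1
    return not_worse / n_resamples


# Toy per-case scores (made up): the candidate is worse on every case.
baseline = [0.9, 0.8, 0.85, 0.9, 0.95, 0.8, 0.9, 0.85]
candidate = [0.6, 0.7, 0.65, 0.7, 0.75, 0.6, 0.7, 0.8]

p = paired_bootstrap_pvalue(baseline, candidate)
gate_failed = p < 0.05  # assumed significance level for the CI gate
print(p, gate_failed)
```

Pairing matters because both versions are scored on the same test cases: resampling per-case differences removes case-to-case difficulty variance, so a consistent drop can be detected with fewer cases than an unpaired comparison would need.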

From a research perspective, the results underscore the inadequacy of flat or outcome-only evaluation in agentic settings and motivate broader adoption of structural, propagation-aware metrics for both benchmarking and training.

Future Directions

The authors highlight several lines for advancing the framework:

Extension to multi-agent and cyclic architectures, including explicit support for cycles and richer dependency graphs;
Automated taxonomy evolution, utilizing clustering over failure traces to adapt diagnostic categories as system complexity grows or domains shift;
Integration with agent training, using structured failure signals as auxiliary rewards or as part of process supervision loops [uesato2022process];
Enhanced root cause attribution, possibly leveraging formal causal inference methodologies rather than greedy heuristics.

Conclusion

"AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking" presents a systematic, interpretable, and production-grounded approach to agent evaluation that combines DAG-based modeling of workflow structure, a targeted failure taxonomy, and LLM-as-judge assessment (2604.23581). The framework detects and attributes failures substantially better than E2E or flat approaches (FDRec of 0.89 vs. 0.41 and 0.67, respectively), transfers to external benchmarks without taxonomy or rubric changes, and showed tangible deployment impact in the pilot. Its utility is highest for sequential and moderately branching architectures, and it offers a foundation on which more sophisticated, cycle-aware, and adaptive evaluation systems can be developed as AI agents and workflows continue to grow in complexity.