- The paper introduces ClawTrace, an open tracing platform that refines LLM agent skill distillation by integrating precise per-step cost attribution.
- It presents TraceCards, a compact YAML-based representation that captures token- and cost-level annotations to facilitate preserve, prune, and repair interventions.
- Empirical results show that cost-aware pruning notably reduces regressions and improves efficiency, with transferable effects across diverse benchmark tasks.
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
Problem Motivation and Context
The skill distillation paradigm for LLM-based agents aims to automatically mine reusable abstracted “skills” from raw agent trajectories, facilitating knowledge transfer across tasks without updating model parameters. Prior pipelines primarily leverage trajectory structure and high-level success/failure signals, but lack granular, instrumented cost attribution. This omission is consequential: the inability to trace per-step cost impairs the pipeline's capacity to distinguish between (a) corrective additions—steps required to repair failed outcomes—and (b) efficiency-driven subtractions—removal of wasteful steps that incur disproportionate cost without contributing to task success.
Observability platforms (e.g., LangSmith, Langfuse, Phoenix) offer per-span metrics, but only expose these through dashboards designed for human-in-the-loop oversight, lacking concise, machine-native representations for downstream automated pipelines. This disconnect hinders principled cost-aware skill distillation and the development of skills that robustly improve both agent reliability and cost efficiency.
ClawTrace Architecture and Instrumentation
ClawTrace is introduced as an open, highly instrumented agent tracing platform with dense event coverage across the LLM agent session lifecycle. It registers eight event hooks targeting key moments (session boundaries, LLM I/O, tool invocation, sub-agent spawning/termination) and captures a full span tree with precise token- and cost-level annotations for all subcalls and tool-based delegations. Multi-agent systems, such as OpenClaw-based orchestrations, often launch nested sub-agents whose state would be difficult to reconstruct from flat traces; ClawTrace instead links each sub-agent session to its parent call, so that descendant costs and actions can be consolidated into a unified analysis.
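The parent linking described above can be illustrated with a minimal sketch of cost roll-up over a span tree. This is not ClawTrace's actual data model: the `Span` class, its field names, and the `kind` labels are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the session root
    kind: str                 # hypothetical labels: "session", "llm_call", "tool_call", "subagent"
    own_cost_usd: float       # cost billed directly to this span


def rollup_costs(spans: list[Span]) -> dict[str, float]:
    """Return total cost (own cost plus all descendants) for every span."""
    children: dict[Optional[str], list[Span]] = defaultdict(list)
    for s in spans:
        children[s.parent_id].append(s)

    totals: dict[str, float] = {}

    def visit(span: Span) -> float:
        # A sub-agent span accumulates the cost of every call made beneath it.
        total = span.own_cost_usd + sum(visit(c) for c in children[span.span_id])
        totals[span.span_id] = total
        return total

    for root in children[None]:
        visit(root)
    return totals


# Hypothetical session: one direct LLM call plus a sub-agent with two nested calls.
spans = [
    Span("root", None, "session", 0.0),
    Span("llm-1", "root", "llm_call", 0.02),
    Span("sub-1", "root", "subagent", 0.0),
    Span("llm-2", "sub-1", "llm_call", 0.05),
    Span("tool-1", "sub-1", "tool_call", 0.01),
]
totals = rollup_costs(spans)
```

With a flat trace, the sub-agent span `sub-1` would appear to cost nothing; once its children are linked, its rolled-up total reflects the nested LLM and tool calls.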
Figure 1: End-to-end architecture of ClawTrace, detailing event instrumentation, TraceCard compilation, and skill distillation via CostCraft.
A deterministic compilation stage coalesces the raw event stream into compact, structured session artifacts termed TraceCards. Each TraceCard is a YAML summary containing per-step USD cost, input/output/cache token breakdowns (provider-billing aware), high-cost span ranking, redundancy detection, and basic sub-agent output provenance. This representation eschews proprietary linkage to any specific agent framework, accepting generic JSON for wide compatibility.
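As a rough illustration of the compilation step, the sketch below turns a list of step events into a card-like dictionary with per-step cost, token breakdowns, a high-cost ranking, and a naive redundancy check. The event keys, the output field names, and the duplicate-detection rule are all assumptions; the real TraceCard schema is not specified here. Serializing the result to YAML would require a YAML library and is omitted.

```python
from collections import Counter


def compile_trace_card(events: list[dict], top_k: int = 2) -> dict:
    """Deterministically summarize step events into a card-like dict (hypothetical schema)."""
    steps = []
    for i, ev in enumerate(events):
        steps.append({
            "step": i,
            "name": ev["name"],
            "cost_usd": round(ev["cost_usd"], 6),
            "tokens": {
                "input": ev.get("input_tokens", 0),
                "output": ev.get("output_tokens", 0),
                "cache": ev.get("cache_tokens", 0),
            },
        })

    # Rank by cost; ties are broken by step index so the output is deterministic.
    ranked = sorted(steps, key=lambda s: (-s["cost_usd"], s["step"]))
    high_cost = [s["step"] for s in ranked[:top_k]]

    # Naive redundancy check: identical (name, args) pairs seen more than once.
    seen = Counter((ev["name"], ev.get("args", "")) for ev in events)
    redundant = sorted(f"{name}({args})" for (name, args), n in seen.items() if n > 1)

    return {
        "total_cost_usd": round(sum(s["cost_usd"] for s in steps), 6),
        "steps": steps,
        "high_cost_steps": high_cost,
        "redundant_calls": redundant,
    }


# Hypothetical events: the same file is read twice around one LLM call.
events = [
    {"name": "read_file", "args": "data.xlsx", "cost_usd": 0.004, "input_tokens": 900},
    {"name": "llm_call", "cost_usd": 0.031, "input_tokens": 5200, "output_tokens": 410},
    {"name": "read_file", "args": "data.xlsx", "cost_usd": 0.004, "input_tokens": 900},
]
card = compile_trace_card(events)
```

Because the function has no randomness and breaks ties explicitly, the same event stream always yields the same card, which is the property a deterministic compilation stage needs.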
CostCraft: Three-Way Patch Typology in Skill Distillation
CostCraft uses the TraceCard as an intermediate representation to instantiate a three-action patch taxonomy for skill evolution: (1) preserve patches encode valuable behaviors from successful traces, (2) prune patches target high-cost, non-essential steps with explicit counterfactual justifications (ensuring removal does not regress quality), and (3) repair patches correct failures by referencing causal evidence extracted via oracle-based diagnosis. This separation operationalizes a correctness–efficiency dichotomy that replaces the simple success/error analysis of earlier pipelines ([ni2026trace2skilldistilltrajectorylocallessons]).
The pipeline enforces strict admission for prune patches: targeting must reference the highest-cost spans identified in the TraceCard, and all prunes must be justified via model-generated, natural-language counterfactuals. The downstream merge algorithm, ranking repair > prune > duplicate preserve patches, ensures that supported, causally precise interventions are prioritized.
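The admission rule and merge ordering can be sketched as follows. This is an illustrative reading, not CostCraft's implementation: the `Patch` fields, the `admit` and `merge` helpers, and the choice to drop exact duplicate rules before ranking are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Lower number = higher priority, following the repair > prune > preserve order.
PRIORITY = {"repair": 0, "prune": 1, "preserve": 2}


@dataclass(frozen=True)
class Patch:
    kind: str                          # "repair", "prune", or "preserve"
    rule: str                          # natural-language skill rule
    target_step: Optional[int] = None  # step the patch refers to, if any
    counterfactual: str = ""           # justification text, required for prunes


def admit(patch: Patch, high_cost_steps: set[int]) -> bool:
    """Prunes must target a top-cost step and carry a non-empty counterfactual."""
    if patch.kind != "prune":
        return True
    return patch.target_step in high_cost_steps and bool(patch.counterfactual.strip())


def merge(patches: list[Patch], high_cost_steps: set[int]) -> list[Patch]:
    seen, unique = set(), []
    for p in patches:
        if not admit(p, high_cost_steps):
            continue
        # Assumption: exact duplicate rules are collapsed, keeping the first one.
        if (p.kind, p.rule) not in seen:
            seen.add((p.kind, p.rule))
            unique.append(p)
    # sorted() is stable, so patches of the same kind keep their original order.
    return sorted(unique, key=lambda p: PRIORITY[p.kind])


# Hypothetical patches; steps 1 and 2 are the TraceCard's highest-cost spans.
patches = [
    Patch("preserve", "Validate the output sheet before saving."),
    Patch("prune", "Skip re-reading the workbook.", 2, "The file is unchanged between reads."),
    Patch("prune", "Skip the summary call.", 5, "Output is unaffected."),  # not a top-cost step
    Patch("prune", "Skip logging.", 2, ""),                               # no counterfactual
    Patch("repair", "Recompute formulas after inserting rows."),
    Patch("preserve", "Validate the output sheet before saving."),        # duplicate
]
merged = merge(patches, high_cost_steps={1, 2})
```

In this example two of the three prunes are rejected at admission, the duplicate preserve is collapsed, and the surviving patches come out in repair, prune, preserve order.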
Figure 2: The ClawTrace execution-path view visualizes per-span cost attribution, tool-call payloads, and sub-agent nesting, clarifying cost drivers within agent decision trees.
Experimental Methodology and Results
Setup
Empirical studies focus on two prominent agent benchmarks: SpreadsheetBench ([ma2024spreadsheetbenchchallengingrealworld]) and SkillsBench ([li2026skillsbenchbenchmarkingagentskills]). An initial 50-task SpreadsheetBench sample is partitioned into 10-task “evolve” (training) and 30-task held-out (test) splits. CostCraft is applied under various ablation conditions: full cost-aware distillation, removal of cost info, omission of prunes, and disabling counterfactual gating, allowing separation of signal contributions.
Main Results
Ablation studies demonstrate several high-impact findings:
- Cost Attribution Is Essential: Removing per-step cost from TraceCards substantially increases regression incidence (from 13% to 20%), with most regressions manifesting as catastrophic (Q=0) failures.
- Prune Rules Protect Quality: Disabling prunes triples quality regressions, while not significantly affecting median cost. This contradicts a naive assumption that prunes only compress cost; in fact, they act as critical “guardrails,” preventing non-intervention skills from breaking already-successful tasks.
- Effects Are Regime-Specific: Aggregate statistics mask relevant patterns. Full CostCraft achieves perfect recoveries on previously failed seeds, whereas ablating key signals produces catastrophic failures.
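The outcome categories used in these findings can be made concrete with a small sketch that compares per-task quality with and without a distilled skill. The function name, the category definitions (a regression is any drop in quality, and a catastrophic regression is a drop to Q=0), and the sample scores are assumptions for illustration; they are not the paper's data or exact metric definitions.

```python
def outcome_rates(baseline: list[float], treated: list[float]) -> dict[str, float]:
    """Fraction of tasks that improve, regress, or regress to Q=0 under a skill."""
    if not baseline or len(baseline) != len(treated):
        raise ValueError("need two non-empty score lists of equal length")
    n = len(baseline)
    pairs = list(zip(baseline, treated))
    wins = sum(t > b for b, t in pairs)
    regressions = sum(t < b for b, t in pairs)
    catastrophic = sum(t == 0 and b > 0 for b, t in pairs)
    return {
        "win": wins / n,
        "regression": regressions / n,
        "catastrophic": catastrophic / n,  # subset of regressions ending at Q=0
        "unchanged": (n - wins - regressions) / n,
    }


# Hypothetical quality scores for five held-out tasks.
baseline_q = [1.0, 1.0, 0.0, 0.5, 1.0]
with_skill_q = [1.0, 0.0, 1.0, 0.5, 1.0]
rates = outcome_rates(baseline_q, with_skill_q)
```

Here one recovery and one catastrophic failure cancel in the mean quality, which is the kind of pattern that aggregate statistics hide.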

Figure 3: Quality outcome rate comparison under different ablation signals on held-out SpreadsheetBench tasks. Full CostCraft minimizes regressions and uniquely delivers net quality wins.
Figure 4: Per-task cost breakdown on SkillsBench, comparing baseline and CostCraft-instrumented runs. Notable cost reductions are highlighted when prune rules match observed waste patterns.
Skill transfer experiments on cross-domain SkillsBench tasks reveal an important asymmetry: common “prune” rules, which abstract over universal inefficiencies (such as redundant file reads), generalize effectively and cut median cost by 32%. In contrast, “preserve” rules often encode benchmark-specific conventions and, when transferred, can degrade performance—an effect visible in observed regressions on document analysis and code-generation tasks.
Failure Taxonomy and Operational Implications
Extensive annotation of failure cases reveals that only repairs and preserves are actionable on failed trajectories; prune patches emerge exclusively from successful trajectories with observable inefficiency. This structurally restricts prune coverage: unless the training set is large enough to exhibit a diverse range of waste patterns, the number of transferable prunes remains low. The limitation is an artifact of pipeline design rather than model scale, because quality-protective prune rules can only be mined from runs that are both successful and inefficient.
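The structural restriction on prune coverage can be expressed as a simple eligibility rule. The sketch below is an interpretation of the text: the function name, the boolean inputs, and the sample evolve set are hypothetical.

```python
def eligible_patch_kinds(succeeded: bool, has_observed_waste: bool) -> set[str]:
    """Which patch kinds a single trajectory can contribute, per the taxonomy above."""
    if not succeeded:
        # Failed runs can motivate repairs and preserves, but never prunes.
        return {"repair", "preserve"}
    kinds = {"preserve"}
    if has_observed_waste:
        # Prunes require a run that both succeeded and showed inefficiency.
        kinds.add("prune")
    return kinds


# Hypothetical evolve set of three trajectories.
evolve_set = [
    {"task": "t1", "succeeded": True, "has_observed_waste": True},
    {"task": "t2", "succeeded": True, "has_observed_waste": False},
    {"task": "t3", "succeeded": False, "has_observed_waste": True},
]
prune_sources = [
    t["task"]
    for t in evolve_set
    if "prune" in eligible_patch_kinds(t["succeeded"], t["has_observed_waste"])
]
```

Only one of the three trajectories can yield a prune, even though two of them contain waste, which shows why a small evolve set limits the catalog of prune rules.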
Practical and Theoretical Implications
ClawTrace and CostCraft deliver a reproducible, open infrastructure for cost-aware agent analysis. The experimental results reveal immediate practical benefits: instrumented cost signals enable pipelines to reliably isolate and excise non-essential agent actions, while robust separation of patch types allows for more stable skill evolution. Theoretically, the work suggests a new framing for skill distillation, one centered on explicit cost-grounded causal attribution, and highlights the need for specialized intermediate representations (TraceCards) for downstream automation. The generality and simple format of TraceCards make them candidates for future adoption in RL-based, memory-augmented, or multi-agent distillation scenarios.
The observed cross-benchmark asymmetry—prune rules being more general than preserve/repair—suggests that efficiency-oriented skill learning has higher transfer potential than previously recognized. Conversely, quality-centric skill abstractions may require dataset or domain-specific adaptation layers to avoid regressive generalization.
Visualization and Operability
ClawTrace's dashboard and timeline visualizations exemplify its utility for both real-time and retrospective cost/action root-cause diagnosis.
Figure 5: Interactive trajectory dashboard enables high-level cost, token, and outcome monitoring across large agent batches.
Figure 6: Per-trajectory Gantt visualizations expose span-level parallelism, redundant tool use, and fine-grained duration profiling independent of agent stack.
These interfaces, combined with TraceCard’s concise schema, allow both human and programmatic agents to quickly isolate provenance and cause for regressions or inefficiencies.
Conclusion
Cost-aware tracing changes how automated LLM agent skill distillation can be carried out. By introducing a pipeline with fine-grained cost attribution and three-action patch separation, the work demonstrates that skill distillation can be robust against regressions and capable of transferable efficiency gains. While the current dataset size limits the catalog of reusable prunes, scaling the evolve set is expected to unlock further cost compression. The modular design of TraceCards is intended to keep them compatible with future non-GPT backbones, RL-based skill selection, or memory-oriented agent stacks.
Further work should explore:
- Expansion of training/evolve sets for diverse prune rule mining
- Multi-seed robustness studies to quantify protection and transfer repeatability
- Integration with automated closed-loop self-evolving agent frameworks
ClawTrace and CostCraft provide a technical base for explicit, reproducible, and generalizable cost-aware agent skill research, raising the standard for what constitutes actionable signal in the agent observability and distillation domain.