MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction

Published 9 May 2026 in cs.AI, cs.CL, and cs.MA | (2605.08670v1)

Abstract: LLM-powered AI agents have emerged as a promising paradigm for autonomous problem-solving, yet they continue to struggle with complex, multi-step real-world tasks that demand domain-specific procedural knowledge. Reusable agent skills, which encapsulate successful problem-solving strategies, offer a natural remedy by enabling agents to build on prior experience. However, curating such skills has largely remained a manual endeavor, requiring human experts to distill rich domain knowledge into actionable guidelines. In this work, we present $\textbf{M}$ulti-agent $\textbf{IN}$duction and $\textbf{D}$eduction for $\textbf{Skill}$s ($\textbf{MIND-Skill}$), a framework that automatically induces generalizable skills from successful trajectories with robust quality guarantees. MIND-Skill consists of an induction agent, which is tasked with abstracting reusable skills from successful trajectories, and a deduction agent, which aims to reconstruct trajectories by following the induced skills. To guarantee the quality of the generated skills, we introduce a reconstruction loss that compares input and reconstructed trajectories, an outcome loss that enforces the correctness of the reconstructed trajectories, and a rubric loss that assesses the documentation quality and regularizes the abstraction level of the generated skills according to predefined criteria. These textual losses are jointly optimized with TextGrad, and the resulting skills are evaluated on held-out tasks unseen during optimization. Experiments on AppWorld and BFCL-v3 show that MIND-Skill consistently outperforms concurrent skill generation methods.

Authors (6)

Summary

The paper introduces a closed-loop framework that combines induction and deduction agents to automatically generate and verify reusable procedural skills.
It implements three textual loss functions—reconstruction, outcome, and rubric—to ensure skills meet standards for fidelity, actionability, and abstraction.
Empirical results show MIND-Skill outperforms baseline methods in task completion and efficiency, affirming its potential for scalable skill library construction.

MIND-Skill: Automatic, Quality-Assured Skill Generation via Multi-Agent Induction and Deduction

Problem Statement and Context

The paper "MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction" (2605.08670) addresses a crucial bottleneck in the construction of AI agents capable of solving complex, real-world, multi-step tasks: scalable, high-quality skill acquisition. LLM-powered agents, while displaying strong declarative knowledge due to large-scale pretraining, frequently underperform on tasks requiring domain-specific procedural knowledge. Existing solutions—most notably, agent skill libraries—have largely depended on human experts to manually curate reusable procedural guides. Prior attempts to automate skill discovery suffer from: (1) lack of closed-loop quality guarantees, (2) poor control over documentation structure and abstraction level, and (3) inability to validate whether the abstracted skills preserve the necessary procedural fidelity for successful task execution.

MIND-Skill Framework

MIND-Skill proposes a closed-loop, multi-agent framework featuring two principal components: an induction agent and a deduction agent, designed for robust skill abstraction and procedural verification, respectively. The induction agent abstracts reusable skill documentation from reference trajectories, leveraging a taxonomy-based prompt structure optimized for extracting non-obvious, generalizable procedural patterns while suppressing both instruction-inferable and instance-specific details. The deduction agent, with a frozen prompt, attempts to reconstruct the original trajectory given only the induced skill and the task specification, thereby isolating the skill’s actual utility from the underlying agent’s reasoning prowess.

The framework introduces three complementary, text-based loss functions to guarantee skill quality:

Reconstruction Loss: Quantifies procedural alignment between the reconstructed and reference trajectories, using LLM judgment for tactic-level equivalence rather than literal step identity.
Outcome Loss: Directly assesses the execution correctness of the reconstructed trajectory in the environment, providing a ground-truth performance signal.
Rubric Loss: Independently evaluates the quality of the generated skill documentation along axes such as ground-truth independence, actionability, completeness, transferability, and conciseness. This regularizes abstraction and documentation standards, counteracting the agent’s tendency toward overfitting or boilerplate inclusion.
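To make the three signals concrete, the following is a minimal sketch of how one skill's feedback could be bundled before it is handed to a prompt optimizer. It is an illustration only, not the paper's implementation: the `SkillFeedback` class, its field names, and the 0 to 1 score scale are all assumptions; only the five rubric axes are taken from the description above.

```python
from dataclasses import dataclass, field

# Rubric axes named in the text above; the 0-1 score scale is an assumption.
RUBRIC_AXES = (
    "ground_truth_independence",
    "actionability",
    "completeness",
    "transferability",
    "conciseness",
)


@dataclass
class SkillFeedback:
    """Hypothetical container for the three textual losses of one skill."""

    reconstruction_score: float   # judge's tactic-level alignment (assumed 0-1)
    reconstruction_critique: str  # judge's free-text explanation
    outcome_passed: bool          # did the reconstructed trajectory succeed?
    outcome_critique: str
    rubric_scores: dict = field(default_factory=dict)  # axis -> score (assumed 0-1)
    rubric_critique: str = ""

    def rubric_mean(self) -> float:
        """Average over the known axes; a missing axis counts as 0."""
        total = sum(self.rubric_scores.get(axis, 0.0) for axis in RUBRIC_AXES)
        return total / len(RUBRIC_AXES)

    def as_text(self) -> str:
        """Join the three critiques into one feedback string."""
        outcome = "pass" if self.outcome_passed else "fail"
        return "\n".join([
            f"[reconstruction {self.reconstruction_score:.2f}] {self.reconstruction_critique}",
            f"[outcome {outcome}] {self.outcome_critique}",
            f"[rubric {self.rubric_mean():.2f}] {self.rubric_critique}",
        ])
```

Keeping the critiques as text alongside the numeric scores reflects the point that these are textual losses: the optimizer consumes the explanations, while the numbers are only convenient for comparing candidate skills.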

TextGrad iteratively optimizes the induction agent’s prompt, using the deduction agent and the three textual losses as diagnostic feedback. Lexicographic selection ensures that the best skill encountered in any optimization round is the one retained.
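The retain-the-best step can be sketched with plain tuple comparison, which Python evaluates lexicographically. This is a sketch under stated assumptions, not the authors' code: the priority order (outcome first, then reconstruction, then rubric) is a guess, and `induce`, `evaluate`, and `revise_prompt` are hypothetical caller-supplied functions standing in for the induction agent, the deduction-plus-losses pipeline, and the TextGrad update.

```python
from typing import Callable, Optional, Tuple

# (outcome_passed, reconstruction, rubric): this priority order is an assumption.
Score = Tuple[int, float, float]


def optimize_skill(
    prompt: str,
    induce: Callable[[str], str],                  # induction prompt -> skill text
    evaluate: Callable[[str], Tuple[Score, str]],  # skill -> (score, textual feedback)
    revise_prompt: Callable[[str, str], str],      # (prompt, feedback) -> new prompt
    rounds: int = 3,
) -> Tuple[str, Score]:
    """Run the closed loop and return the best skill seen in any round."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    best_skill: Optional[str] = None
    best_score: Optional[Score] = None
    for _ in range(rounds):
        skill = induce(prompt)
        score, feedback = evaluate(skill)
        # Tuples compare element by element, so a later component only
        # matters when all earlier components are tied.
        if best_score is None or score > best_score:
            best_skill, best_score = skill, score
        prompt = revise_prompt(prompt, feedback)
    assert best_skill is not None and best_score is not None
    return best_skill, best_score
```

Because the best candidate is tracked separately from the current prompt, a round that degrades the skill cannot overwrite an earlier, better one.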

Empirical Evaluation

Experimental Protocol

MIND-Skill is evaluated on AppWorld—a complex, multi-app, API-rich agent environment—and BFCL-v3, a function-calling benchmark. Baselines include standard ReAct, in-context learning (ICL), skill extraction methods (Skill-extract), and two state-of-the-art skill-generation frameworks: ACE (lifelong playbook-based) and Trace2Skill (parallel hierarchical trajectory distillation).

Performance is measured on task goal completion (TGC), scenario goal completion (SGC), and aggregate accuracy. Skills are extracted from successful trajectories with both Qwen3.5-122B-A10B and GPT-5.4 base models.
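The two AppWorld metrics differ only in their unit of aggregation. The sketch below assumes the usual AppWorld definitions (TGC is the percentage of tasks that pass, and SGC is the percentage of scenarios in which every task passes); the function name and the input format are illustrative, not part of any benchmark API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def goal_completion(results: Iterable[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Compute TGC and SGC from (scenario_id, task_id, passed) triples."""
    by_scenario = defaultdict(list)
    for scenario_id, _task_id, passed in results:
        by_scenario[scenario_id].append(passed)

    n_tasks = sum(len(flags) for flags in by_scenario.values())
    if n_tasks == 0:
        raise ValueError("no results given")

    # TGC: share of individual tasks that passed.
    tgc = 100.0 * sum(sum(flags) for flags in by_scenario.values()) / n_tasks
    # SGC: share of scenarios in which all tasks passed.
    sgc = 100.0 * sum(all(flags) for flags in by_scenario.values()) / len(by_scenario)
    return {"TGC": tgc, "SGC": sgc}
```

Under these definitions SGC is the stricter metric: a single failed task leaves TGC mostly intact but zeroes out its whole scenario, which is why SGC is often read as a consistency signal.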

Results

MIND-Skill obtains the highest aggregate scores: 71.4 TGC and 55.4 SGC on AppWorld-Normal, 51.8 TGC and 39.6 SGC on AppWorld-Challenge, and 77.3 accuracy on BFCL-v3. It substantially outperforms ACE and Trace2Skill on both source and held-out tasks. Notably, MIND-Skill is the only method that leads consistently on both the normal and challenge splits, which suggests genuine procedural generalization rather than overfitting to superficial task patterns.

Closed-loop optimization confers a marked advantage: MIND-Skill consistently outperforms its one-shot induction ablation (Skill-extract) by 7–8 percentage points, confirming the necessity of iterative textual loss-driven refinement for producing skills that are simultaneously abstract, complete, and well-documented.

Ablations show that each loss matters: dropping the reconstruction loss almost entirely erases the gains over skill-extraction methods, removing the rubric loss impairs generalization (especially SGC), and removing the outcome loss mainly affects scenario-level robustness. Additionally, skill quality, as measured by both trajectory alignment and rubric scores, improves monotonically across optimization steps.

MIND-Skill achieves high efficiency: skill libraries remain 3–6× more concise (measured as total injected context tokens) than those produced by baselines, due to aggressive rubric-driven pruning of redundant or instance-specific documentation.
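The conciseness comparison reduces to a ratio of injected token counts. The following sketch shows the arithmetic only; the whitespace word count is a crude stand-in for whichever tokenizer the paper actually uses, and all function names are hypothetical.

```python
from typing import Callable, Iterable


def whitespace_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def injected_tokens(
    skills: Iterable[str],
    count: Callable[[str], int] = whitespace_tokens,
) -> int:
    """Total tokens added to the agent's context by a skill library."""
    return sum(count(skill) for skill in skills)


def conciseness_ratio(
    baseline: Iterable[str],
    candidate: Iterable[str],
    count: Callable[[str], int] = whitespace_tokens,
) -> float:
    """How many times larger the baseline library is than the candidate."""
    candidate_total = injected_tokens(candidate, count)
    if candidate_total == 0:
        raise ValueError("candidate library is empty")
    return injected_tokens(baseline, count) / candidate_total
```

A ratio of 3.0 would correspond to the lower end of the reported 3–6× range; swapping in a model-specific tokenizer through the `count` argument changes the absolute totals but not the structure of the comparison.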

Finally, a grounded case analysis demonstrates that, after MIND-Skill optimization, skills generated with a "weaker" base model (Qwen3.5-122B-A10B) contribute as much at test time as those from the more advanced GPT-5.4, and sometimes more. This is attributed to closer alignment with the inference distribution, which further underscores the importance of self-consistency between the induction and deduction agents.

Discussion and Implications

MIND-Skill advances the state of the art in automatic skill acquisition for LLM-based agents by providing explicit quality assurance mechanisms spanning procedural fidelity and documentation quality. By freezing the deduction agent and relying on tripartite textual feedback, the method decouples skill abstraction from agent reasoning and enforces a separation between reusable procedural logic and brittle implementation details. The result is a set of skills optimized to be interpretable, transferable, and actionable across diverse task variations.

Practically, MIND-Skill substantially reduces the manual burden of skill curation, democratizes access to performant autonomous agent capabilities, and supports the scalable construction of skill libraries where provenance and abstraction quality are both auditable. The design of textual loss functions and the use of prompt-based differentiable optimization (TextGrad) provide a blueprint for further developments in closed-loop, text-based program synthesis and skill induction.

Theoretically, the work highlights the centrality of closed-loop verification and abstraction regularization for reliable skill extraction from experience—and exposes limitations of prior approaches reliant on raw distillation or continual, unregularized playbook growth. It also demonstrates that feedback from frozen deduction models enables disentangled optimization of skills and agents, a direction with implications for robust compositional and modular agent design.

Speculation on Future Directions

Future work may extend MIND-Skill’s principles to multi-agent collaborative skill induction, more expressive forms of abstraction regularization, and incorporation of richer environment signals (e.g., human-in-the-loop feedback or adversarial robustness). Application to broader agent domains—including robotics, interactive web agents, or scientific procedure discovery—appears tractable. Furthermore, the separation of induction and deduction, combined with adaptive rubric frameworks, is likely to inform next-generation agent ecosystems emphasizing safety, verifiability, and compositional generalization.

Conclusion

MIND-Skill introduces a rigorously controlled paradigm for autonomous skill generation, integrating multi-agent abstraction, closed-loop verification, and multi-faceted textual evaluation. Its quality-assured, documentation-aware skill libraries yield state-of-the-art agent performance on challenging benchmarks, and its design provides a generalizable framework for robust skill induction and reusable procedural knowledge extraction.
