- The paper introduces SkillSynth, a framework that constructs scenario-mediated skill graphs to generate diverse and compositional terminal tasks.
- It employs controlled path sampling with inverse-frequency weighting and a multi-agent harness to ensure robust, verifiable task generation.
- Empirical results show 95.7% task verification and significant improvements in model performance and sample efficiency.
Introduction
Terminal agents, leveraging command-line interfaces (CLIs), have emerged as effective tools for LLMs to automate realistic system-level tasks. However, conventional training paradigms are heavily bottlenecked by the scarcity of high-quality, diverse execution trajectories. Manual curation is intractable at scale, and existing synthetic data generation approaches primarily focus on sheer task volume rather than diverse, compositional agentic experience.
The paper "Toward Scalable Terminal Task Synthesis via Skill Graphs" (2604.25727) proposes SkillSynth, a comprehensive framework for generating terminal tasks with controllable diversity by synthesizing workflows over a scenario-mediated skill graph. This essay provides a technical overview and critical analysis of this method, discussing its design, implementation details, empirical results, and the implications for scalable agentic training.
The core principle is the abstraction of agentic trajectories into sequences of scenarios (semantically meaningful decision states) and skills (coherent action subsequences that effect scenario transitions). Formally, each trajectory is lifted from raw token-level steps to a path $(\sigma_0, \kappa_1, \sigma_1, \ldots, \kappa_L, \sigma_L)$, where each $\sigma_i$ is a scenario and each $\kappa_i$ is the skill that moves the agent from $\sigma_{i-1}$ to $\sigma_i$. This separates semantic states from the transformations between them, decouples task diversity from task quantity, and makes the coverage requirement concrete: what matters is which scenario-skill pairs actually appear in the training data.
SkillSynth constructs a scenario-mediated skill graph where:
- Nodes correspond to deduplicated scenario states (pre-/post-conditions of skills).
- Directed edges correspond to executable skills, mediating transitions between compatible scenarios.
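The node and edge definitions above can be pictured as a small multigraph data structure. The following is a minimal sketch, not the paper's implementation: the class names `Skill` and `SkillGraph`, their methods, and the example scenario and skill names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Skill:
    """A directed edge: an executable skill that moves between two scenarios."""
    name: str
    source: str  # scenario that must hold before the skill runs (pre-condition)
    target: str  # scenario that holds after the skill runs (post-condition)


@dataclass
class SkillGraph:
    """Multigraph over scenarios: several skills may connect the same node pair."""
    out_edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_skill(self, skill: Skill) -> None:
        self.out_edges[skill.source].append(skill)

    def skills_from(self, scenario: str) -> list:
        # .get avoids creating an empty entry for unknown scenarios
        return list(self.out_edges.get(scenario, []))

    def scenarios(self) -> set:
        nodes = set(self.out_edges)
        for skills in self.out_edges.values():
            nodes.update(s.target for s in skills)
        return nodes


# Hypothetical example data, not taken from the paper
graph = SkillGraph()
graph.add_skill(Skill("extract_audio", "raw_video_on_disk", "audio_track_extracted"))
graph.add_skill(Skill("transcode_h264", "raw_video_on_disk", "video_transcoded"))
graph.add_skill(Skill("transcribe_audio", "audio_track_extracted", "transcript_written"))
```

Storing skills as edge objects rather than as node labels is what lets two different skills share the same pair of scenarios, which is the property the multigraph formulation relies on.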
Skill extraction is performed from large, real-world sources such as ClawHub and public GitHub repositories, with filtering to retain only executable, structured (non-prompt-only), safe (non-adversarial), and objectively verifiable skills.
Scenario inference for each skill uses LLM-based prompts to produce candidate scenario nodes. Agglomerative clustering with Louvain community detection then deduplicates these scenarios semantically, which is essential for merging equivalent but lexically divergent descriptions.
Cross-skill alignment uses embedding similarity to shortlist plausible scenario bridges, followed by LLM-based semantic compatibility checks, so that only well-aligned skill transitions are retained. The result is a multigraph supporting compositional workflow induction.
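The two-stage alignment described above (a cheap similarity shortlist, then a stricter semantic check) might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names `cosine` and `align`, the `min_similarity` threshold, the three-dimensional toy vectors, and the `is_compatible` callback, which stands in for the LLM-based compatibility check.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(post_scenarios, pre_scenarios, is_compatible, min_similarity=0.8):
    """Return (post, pre) scenario bridges that pass both stages.

    Stage 1 shortlists pairs by embedding similarity; stage 2 calls
    is_compatible, a placeholder for an LLM-based semantic check.
    """
    bridges = []
    for post_name, post_vec in post_scenarios.items():
        for pre_name, pre_vec in pre_scenarios.items():
            if cosine(post_vec, pre_vec) < min_similarity:
                continue  # not similar enough to be worth the expensive check
            if is_compatible(post_name, pre_name):
                bridges.append((post_name, pre_name))
    return bridges


# Toy embeddings, invented for illustration only
post = {"audio_track_extracted": [0.9, 0.1, 0.0]}
pre = {
    "wav_file_available": [0.8, 0.2, 0.1],
    "postgres_running": [0.0, 0.1, 0.9],
}
bridges = align(post, pre, is_compatible=lambda a, b: True)
```

Running the similarity filter first keeps the number of expensive compatibility checks small, since most scenario pairs are unrelated.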
Figure 1: The skill graph construction pipeline encompasses skill extraction/filtering, scenario inference and deduplication, and cross-skill alignment to form a navigable graph of realistic terminal workflows.
Controlled Path Sampling over the Skill Graph
Task synthesis samples compositional paths from the graph. Each path represents a high-level workflow that requires the agent to interleave multiple compatible skills across diverse scenarios. Naïve random walks concentrate on frequent or generic skills and scenarios, so SkillSynth introduces inverse-frequency weighting with monotone progression: newly sampled paths avoid recently traversed subgraphs, which broadens support over the scenario-skill product space.
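One way to picture inverse-frequency weighting is the sketch below. It is an illustration under stated assumptions rather than the paper's sampler: the weight `1 / (1 + count)` is one simple choice of inverse-frequency function, "monotone progression" is approximated here by never revisiting a scenario within a path, and `sample_path`, the edge layout, and the example names are hypothetical.

```python
import random
from collections import Counter


def sample_path(edges, start, max_len, counts, rng):
    """Sample one path; edges maps scenario -> list of (skill, next_scenario).

    Skills that were sampled often in earlier paths receive lower weight.
    """
    path, current, visited = [], start, {start}
    for _ in range(max_len):
        # Assumption: progression never returns to a scenario already on the path
        candidates = [(s, t) for s, t in edges.get(current, []) if t not in visited]
        if not candidates:
            break
        weights = [1.0 / (1 + counts[s]) for s, _ in candidates]
        skill, nxt = rng.choices(candidates, weights=weights, k=1)[0]
        counts[skill] += 1  # global usage count shared across all sampled paths
        path.append((current, skill, nxt))
        visited.add(nxt)
        current = nxt
    return path


# Toy graph, invented for illustration only
edges = {
    "raw_video_on_disk": [
        ("extract_audio", "audio_track_extracted"),
        ("transcode_h264", "video_transcoded"),
    ],
    "audio_track_extracted": [("transcribe_audio", "transcript_written")],
    "video_transcoded": [("generate_thumbnail", "thumbnail_written")],
}
counts = Counter()
rng = random.Random(0)  # fixed seed so repeated runs sample the same paths
paths = [sample_path(edges, "raw_video_on_disk", 3, counts, rng) for _ in range(100)]
```

Because the counts persist across calls, a skill chosen several times in a row becomes progressively less likely, which pushes later paths toward less-visited parts of the graph.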
Figure 2: SkillSynth overview: (a) Path sampling in the scenario-mediated skill graph, (b) multi-agent harness converts workflow abstraction to executable tasks, and (c) output is a verified, containerized task.
Multi-Agent Task Synthesis Harness
Given a sampled scenario-skill path, SkillSynth employs a multi-agent harness to realize the core task assets: natural language instruction, initial filesystem/environment, verification scripts, and canonical oracle solutions.
Direct end-to-end generation is empirically unstable, so task construction is decomposed into planning (translating the scenario-skill path into objectives) and implementation (realizing the task, its assets, and its environment).
Dual-axis verification ensures task quality:
- Oracle-based: Automated solvability checks via execution.
- Rubric-based: LLM-based assessment of specification fidelity and natural language instruction alignment.
Iterative repair with bounded retries makes task generation robust, and tasks whose failures cannot be repaired are discarded. Within a single synthesis pass, this yields thousands of automatically verified, executable tasks with minimal manual overhead.
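The interaction between dual-axis verification and bounded repair can be sketched as a simple control loop. This is a minimal illustration, not the harness itself: `synthesize_with_repair`, the `max_retries` default, and the toy checker and repair functions are hypothetical stand-ins for the execution-based oracle check, the LLM rubric check, and the repair agent.

```python
def synthesize_with_repair(draft, oracle_ok, rubric_ok, repair, max_retries=3):
    """Return a task that passes both checks, or None if retries run out."""
    task = draft
    for attempt in range(max_retries + 1):
        oracle_pass = oracle_ok(task)  # does the oracle solution pass the tests?
        rubric_pass = rubric_ok(task)  # does the instruction match the spec?
        if oracle_pass and rubric_pass:
            return task
        if attempt < max_retries:
            task = repair(task, oracle_pass, rubric_pass)
    return None  # non-recoverable: the caller discards this task


# Toy stand-ins, invented for illustration only
def toy_oracle(task):
    return task["tests_pass"]


def toy_rubric(task):
    return task["instruction_matches"]


def toy_repair(task, oracle_pass, rubric_pass):
    fixed = dict(task)  # copy so the original draft is left untouched
    if not oracle_pass:
        fixed["tests_pass"] = True
    elif not rubric_pass:
        fixed["instruction_matches"] = True
    return fixed


draft = {"tests_pass": False, "instruction_matches": False}
result = synthesize_with_repair(draft, toy_oracle, toy_rubric, toy_repair)
```

In this toy setup the repair step fixes one axis per call, so the draft needs two repairs to pass; with `max_retries=1` the same draft would be discarded.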
Figure 3: An example of a compositional skill-path in the video domain, illustrating the mapping from high-level workflow to a synthesized, multi-step task instruction.
Empirical Evaluation and Analysis
Task Yield and Difficulty: From 3,721 sampled paths, 3,560 (95.7%) are fully verified. A significant portion (38%) are hard tasks that remain unsolved by state-of-the-art agentic models, underscoring the richness and challenge of the workflows.
Data Efficiency and Model Performance: SkillSynth-derived training data demonstrably improve model performance and sample efficiency across every Qwen3 model size tested, with Qwen3-32B + SkillSynth outperforming larger, domain-specialized open-source models on Terminal-Bench 2.0.
Diversity: Trajectories from SkillSynth tasks exhibit up to 31% more unique scenario-skill pairs than standard LLM-generated or software-engineering-only baselines, confirming the effectiveness of graph-guided sampling for diversity amplification.
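A diversity measure of this kind can be computed by counting distinct scenario-skill pairs across a set of trajectories. The sketch below assumes each trajectory is a list of `(scenario, skill, next_scenario)` steps; the function names and the toy trajectories are hypothetical, and the resulting numbers are not the paper's results.

```python
def unique_pairs(paths):
    """Set of distinct (scenario, skill) pairs appearing in any path."""
    return {(scenario, skill) for path in paths for scenario, skill, _ in path}


def relative_gain(candidate_paths, baseline_paths):
    """Fractional increase in unique pairs of candidate over baseline."""
    base = len(unique_pairs(baseline_paths))
    return (len(unique_pairs(candidate_paths)) - base) / base


# Toy trajectories, invented for illustration only
baseline_paths = [
    [("s0", "k1", "s1")],
    [("s0", "k1", "s1"), ("s1", "k2", "s2")],
]
guided_paths = [
    [("s0", "k1", "s1")],
    [("s0", "k3", "s3"), ("s3", "k4", "s4")],
]
gain = relative_gain(guided_paths, baseline_paths)
```

Here the baseline repeats the pair `("s0", "k1")`, so it contributes only two unique pairs despite containing three steps, which is the sense in which volume and diversity can diverge.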
Ablation: Both single-skill and naive multi-skill (random combination) baselines produce tasks that are easier, less coherent, and less varied than SkillSynth tasks. Notably, randomly composed skill sequences lack the workflow requirements extractable only from structured, scenario-aware sampling.
Error Modes: Evaluation reveals persistent agentic problems. Partial implementation, over-reliance on inline self-testing, and hallucinated API/flag usage dominate failure cases even for advanced models, highlighting the need for further research into instruction compliance and robust verification.
Figure 4: Distribution of skill categories within the constructed graph reveals extensive coverage across canonical, specialized, and long-tail CLI domains.
Theoretical and Practical Implications
SkillSynth makes explicit the statistical prerequisites for performant policy learning in agentic environments: only trajectories that densely cover the joint space of semantically meaningful scenarios and actionable skills maximize learning capacity. Simply scaling instance count without structural consideration induces diminishing returns due to redundancy and lack of compositional exposure.
On the practical side, SkillSynth's pipeline offloads the vast majority of data engineering burden to scalable automation, conditional on the continued community expansion of curated skill libraries (e.g., ClawHub). It provides an extensible, domain-agnostic template applicable directly to other agentic task spaces requiring compositional, verifiable abstraction.
Future Directions
The study identifies multiple axes along which synthesis can be further advanced:
- Graph expansion: Scaling beyond chain-like paths to subgraphs, supporting parallel workflows and more intricate compositional dependencies.
- Harness robustness: Improved repair and task synthesis modules to further reduce non-recoverable failures and enable synthesis with weaker or more cost-effective generative models.
- Agent alignment: Addressing the algorithmic limitations behind prevalent error modes, especially instruction-grounded evaluation and adaptive workflow execution.
- Application generalization: Extending the scenario-mediated synthesis paradigm to cross-domain, multi-modal, or hardware-integrated agentic ecosystems.
Conclusion
SkillSynth provides a principled approach for scalable terminal task synthesis with explicit control over compositional diversity, enabled by a scenario-mediated skill graph. The empirical results indicate that trajectory diversity, rather than mere volume, is critical for advancing terminal agentic capabilities. The modular, data-driven pipeline offers high-yield, low-cost synthesis and gives both model training and research a foundation for continuous agentic improvement as community skill registries evolve.
References
- Zhiyuan Fan, et al., "Toward Scalable Terminal Task Synthesis via Skill Graphs" (2604.25727).