SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

Published 10 Apr 2026 in cs.SE and cs.AI | (2604.09297v1)

Abstract: Agent skills provide modular, task-specific guidance for LLM-based coding agents, but manually tuning skill bundles to balance success rate, cost, and runtime is expensive and fragile. We present SkillMOO, a multi-objective optimization framework that automatically evolves skill bundles using LLM-proposed edits and NSGA-II survivor selection: a solver agent evaluates candidate skill bundles on coding tasks and an optimizer agent proposes bundle edits based on failure analysis. On three SkillsBench software engineering tasks, SkillMOO improves pass rate by up to 131% while reducing cost by up to 32% relative to the best baseline per task at low optimization overhead. Pattern analysis reveals pruning and substitution as primary drivers of improvement, suggesting effective bundles favor minimal, focused content over accumulated instructions.

Authors (10)

Summary

The paper presents an automated multi-objective optimization framework that significantly improves agent test pass rates and reduces inference cost in software engineering tasks.
It employs a two-agent evolutionary loop with NSGA-II selection to iteratively refine skill bundles via pruning and substitution, ensuring minimal performance regression.
Empirical evaluations showed up to +131.2% pass rate improvement and substantial cost reductions, demonstrating the framework’s superiority over static, manual skill tuning.

Motivation and Problem Statement

The proliferation of LLM-based coding agents for software engineering tasks has spurred interest in modular agent skill systems, such as those popularized by Anthropic's skills API. These skills are packaged instruction sets or scripts that, when incorporated into an agent's workflow, modulate its behavior for domain-specific scenarios. Although SkillsBench evaluations show that skill bundles improve performance across tasks, manual skill tuning remains costly, brittle, and reliant on non-transferable expertise. In SE tasks specifically, benchmarking has revealed only modest performance gains from skills ($+4.5\%$) (Li et al., 13 Feb 2026).

SkillMOO addresses this deficit with an automated, multi-objective optimization (MOO) framework. The framework orchestrates an interplay between a skill optimizer (driven by LLM-based proposals) and a task solver, using population-based NSGA-II selection across pass rate and cost. The explicit objective is to simultaneously enhance the effectiveness of coding agents (as measured by test pass rate) and minimize their operational costs (inference budget), while reducing the manual overhead of bundle refinement.

Technical Approach

SkillMOO implements a two-agent evolutionary optimization loop, commencing from a seed skill bundle. The process can be summarized as follows:

Solver-Agent Evaluation: At each generation, the task solver evaluates the candidate skill bundle on an SE task, producing pass rate, cost, runtime, and error traces.
Optimizer-Agent Proposal: Using the current optimizer-skill prompts and the observed failure evidence, the optimizer agent proposes an edited child skill bundle. Edits encompass pruning, substitution, reordering, and selective rewriting.
NSGA-II Pareto Selection: All bundle candidates are ranked by NSGA-II non-dominated sorting on the objective vector $[\text{pass\_rate}(b), \text{cost}(b)]$. A guard threshold preserves pass rate by rejecting any child whose pass rate regresses by more than 0.05 compared with its parent, and ties are resolved lexicographically (preference: pass rate $\to$ cost $\to$ runtime).
Termination: The loop runs for a fixed generation budget; the best candidate is then selected from the Pareto front.
Figure 1: SkillMOO workflow: solver-optimizer loop with evolving skill bundles.

This design facilitates discovery of skill bundles that are both effective (higher test pass rates) and efficient (lower LLM inference costs), while tracking optimization-induced runtime impacts.
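The selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Candidate` class, the function names, and the example values are hypothetical, the guard is assumed to be an absolute pass-rate difference, and only the first non-dominated front is computed (full NSGA-II also ranks later fronts and uses crowding distance). The sketch further assumes that the final pick from the front uses the same pass rate, cost, runtime ordering as the tie-break.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record of one evaluated skill bundle."""
    name: str
    pass_rate: float  # higher is better
    cost: float       # lower is better
    runtime: float    # lower is better; used only to break ties


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.pass_rate >= b.pass_rate and a.cost <= b.cost
    better = a.pass_rate > b.pass_rate or a.cost < b.cost
    return no_worse and better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates)]


def passes_guard(child, parent, max_regression=0.05):
    """Assumed guard: reject a child that loses more than 0.05 pass rate."""
    return parent.pass_rate - child.pass_rate <= max_regression


def select_best(candidates):
    """Lexicographic pick from the front: pass rate, then cost, then runtime."""
    front = pareto_front(candidates)
    return min(front, key=lambda c: (-c.pass_rate, c.cost, c.runtime))


# Illustrative values only; they are not results from the paper.
seed = Candidate("seed", 0.20, 1.00, 100.0)
children = [
    Candidate("pruned", 0.40, 0.70, 80.0),
    Candidate("cheap", 0.30, 0.50, 60.0),
    Candidate("expanded", 0.10, 1.20, 110.0),
]
survivors = [seed] + [c for c in children if passes_guard(c, seed)]
best = select_best(survivors)
print([c.name for c in pareto_front(survivors)], best.name)
```

In this toy run the "expanded" child fails the guard, the seed is dominated by "pruned", and the front keeps both "pruned" and "cheap" because neither is better on both objectives.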

Experimental Design

Three complex software engineering tasks were selected from the SkillsBench suite, each embedding the largest available skill bundles. To ensure robust evaluation, the default verifiers for each task were augmented to 40 tests using GPT-5.4, thus increasing test coverage and the diagnostic power of performance metrics. The tasks evaluated were:

Task 1: Python build-failure repair,
Task 2: Python-to-Scala pipeline logic translation,
Task 3: Spring Boot to Jakarta API migration.

GLM-5 was selected as the LLM backbone for both the task-solving and skill-optimization roles, and the evolutionary search ran with a population size of 1 and 5 generations per task. Each experiment was repeated 10 times, and Scott-Knott ESD was used to rank pass rates and estimate effect sizes.

Comparative baselines included the original skill bundles and a no-skill configuration.

Empirical Results

Pass Rate and Cost Improvement

SkillMOO demonstrated consistent, statistically significant improvement in agent performance over static bundles:

Task 1: Pass rate increased from 0.16 to 0.37 (+131.2%), with cost reduced by 31.7% and runtime reduced by 23.6%.
Task 2: Pass rate increased from 0.39 to 0.51 (+30.8%), cost dropped by 5.4%, and runtime by 8.0%.
Task 3: Pass rate improved from 0.97 to 0.99 (+2.1%), with cost and runtime reduced by 19.4% and 29.8%, respectively.

SkillMOO always delivered higher pass rates than both original and no-skill baselines, and also yielded superior cost efficiency relative to the original bundles.
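The relative pass-rate gains quoted above follow from the before and after values as (after - before) / before. The short Python check below recomputes them from the rounded pass rates listed in this section; the helper name `relative_change_pct` is introduced here only for illustration. Because the inputs are rounded to two decimals, the recomputed figures can differ from the reported ones in the last digit.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (task, baseline pass rate, SkillMOO pass rate) as listed in this section.
reported = [
    ("Task 1", 0.16, 0.37),
    ("Task 2", 0.39, 0.51),
    ("Task 3", 0.97, 0.99),
]

gains = {task: relative_change_pct(b, a) for task, b, a in reported}
for task, gain in gains.items():
    print(f"{task}: {gain:+.2f}%")
```

The recomputed values are approximately 131.25%, 30.77%, and 2.06%, which agree with the reported +131.2%, +30.8%, and +2.1% up to rounding.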

Optimization Overhead and Efficiency

Multi-objective improvement, measured as Pareto hypervolume (HV) gain in pass-vs-cost space, was substantial for all tasks:

HV gains ranged from 301% (Task 3) to 2110% (Task 1) over baseline.
Optimization cost per percentage point of HV gain was extremely low, e.g., $0.0011 for Task 1.

These results highlight the economic feasibility of SkillMOO's evolutionary optimization: the optimization spend is small relative to the gains it produces.
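For readers unfamiliar with hypervolume, the sketch below shows one common way to compute it for two objectives where pass rate is maximized and cost is minimized. It is an assumption-laden illustration rather than the paper's procedure: the reference point (pass rate 0, cost 1.0), the cost normalization, the function names, and all numbers are hypothetical.

```python
def hypervolume_2d(points, ref_pass=0.0, ref_cost=1.0):
    """Area dominated by (pass_rate, cost) points, bounded by the reference.

    Sweeps over cost in ascending order and integrates the best pass rate
    reachable at or below each cost level.
    """
    pts = sorted((c, p) for p, c in points
                 if p >= ref_pass and c <= ref_cost)
    area = 0.0
    best_pass = ref_pass
    for i, (cost, pass_rate) in enumerate(pts):
        best_pass = max(best_pass, pass_rate)
        next_cost = pts[i + 1][0] if i + 1 < len(pts) else ref_cost
        area += (best_pass - ref_pass) * (next_cost - cost)
    return area


def hv_gain_pct(baseline_hv, optimized_hv):
    """Relative hypervolume gain over the baseline, in percent."""
    return (optimized_hv - baseline_hv) / baseline_hv * 100.0


def cost_per_hv_point(optimization_cost, gain_pct):
    """Optimization spend divided by percentage points of HV gain."""
    return optimization_cost / gain_pct


# Hypothetical points as (pass_rate, normalized cost); not paper data.
baseline_hv = hypervolume_2d([(0.2, 0.8)])
optimized_hv = hypervolume_2d([(0.4, 0.6), (0.3, 0.5)])
gain = hv_gain_pct(baseline_hv, optimized_hv)
print(round(baseline_hv, 4), round(optimized_hv, 4), round(gain, 1))
print(round(cost_per_hv_point(1.50, gain), 4))  # 1.50 is a made-up spend
```

With these toy inputs the baseline area is 0.2 x 0.2 = 0.04 and the optimized front covers 0.3 x 0.1 + 0.4 x 0.4 = 0.19, a gain of 375%. The choice of reference point strongly affects the percentage, so gains computed this way are only comparable under the same reference.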

Edit Pattern Analysis

A systematic analysis of optimizer-agent edit logs demonstrated the dominance of pruning and substitution operations:

Pruning (removal of irrelevant skill blocks) and substitution (replacement with alternate guidance) occurred most frequently and consistently yielded cost reductions and, in many cases, performance improvement.
Expansion (additive guidance) rarely led to pass rate improvement.
Task-specific targeted removals (such as RestTemplate code in Task 3) directly mitigated overfitting to obsolete cues and improved pass rates when original bundle instructions conflicted with task constraints.

Discussion and Implications

This study establishes several key findings for SE agent optimization:

Automated, LLM-driven skill bundle optimization is both feasible and effective for SE tasks, even under realistic time and resource constraints.
Minimalist and focused skill bundles, achieved via systematic pruning and targeted substitution, are more effective than accumulating elaborate or generic guidance.
Manual tuning is inefficient: The automated approach outperforms the traditional, expert-driven process and is far less labor-intensive.
Task specificity matters: Effective bundle edits correct mismatches between the original bundle's assumptions and the actual task requirements, underscoring the adaptability of SkillMOO.

These results are consistent with recent observations from other skill-evolution approaches such as EvoSkill (Alzubi et al., 3 Mar 2026), Meta Context Engineering (Ye et al., 29 Jan 2026), and EvoSkills (Zhang et al., 2 Apr 2026), but SkillMOO extends the methodology to multi-objective search and explicitly weighs cost/benefit tradeoffs.

Future Directions

Given these findings, future research should address the following:

Generalizability: Expanding evaluation beyond the three largest SkillsBench tasks to smaller or cross-domain tasks.
Other LLMs and Meta-Agents: Adapting SkillMOO to a range of underlying LLMs (beyond GLM-5) and scaling the optimizer-solver scheme.
Causal Attribution: More granular studies of which skill modifications cause the observed improvements, deconfounded from underlying LLM idiosyncrasies.
Integration in Continuous Agent Learning: Leveraging SkillMOO's skill evolution in ongoing agent self-improvement and live deployment scenarios.

Conclusion

SkillMOO delivers a principled, efficient approach for multi-objective skill bundle optimization in LLM-based software engineering agents. The empirical evidence establishes robust improvements in both agent pass rate and operational cost, further strengthened by consistent findings that skill pruning and selective substitution are more beneficial than accumulating additional instruction. These insights are valuable for both researchers and practitioners seeking to deploy scalable, cost-efficient, and high-performing agent systems in evolving SE environments.
