- The paper presents an automated multi-objective optimization framework that significantly improves agent test pass rates and reduces inference cost in software engineering tasks.
- It employs a two-agent evolutionary loop with NSGA-II selection to iteratively refine skill bundles via pruning and substitution, ensuring minimal performance regression.
- Empirical evaluations showed pass rate improvements of up to +131.2% and cost reductions of up to 31.7% relative to the original, static skill bundles.
SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering
Motivation and Problem Statement
The proliferation of LLM-based coding agents for software engineering tasks has spurred interest in modular agent skill systems, such as those popularized by Anthropic's skills API. These skills are packaged instruction sets or scripts that, when incorporated into an agent's workflow, modulate its behavior for domain-specific scenarios. While these skill bundles have been shown to improve performance across tasks per SkillsBench evaluations, manual skill tuning remains cost-inefficient, brittle, and reliant on non-transferable expertise. Benchmarking has revealed only modest performance gains (+4.5%) from skills in SE tasks (Li et al., 13 Feb 2026).
SkillMOO addresses this gap with an automated, multi-objective optimization (MOO) framework. The framework pairs a skill optimizer (driven by LLM-based proposals) with a task solver, and applies population-based NSGA-II selection over two objectives: pass rate and cost. The explicit goal is to raise the effectiveness of coding agents (measured by test pass rate) and lower their operational cost (inference budget) at the same time, while reducing the manual overhead of bundle refinement.
Technical Approach
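SkillMOO implements a two-agent evolutionary optimization loop that starts from a seed skill bundle. In each generation, the skill optimizer proposes edited bundles, the task solver runs the task with each candidate, and NSGA-II selection retains candidates based on pass rate and cost.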
This design facilitates discovery of skill bundles that are both effective (higher test pass rates) and efficient (lower LLM inference costs), while tracking optimization-induced runtime impacts.
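To make the selection step concrete, the following is a minimal sketch of non-dominated sorting over the two objectives named above (pass rate, maximized; cost, minimized). It is not the authors' implementation: the `Candidate` class, the candidate names, and all numbers are hypothetical, and the crowding-distance step that full NSGA-II uses to break ties within a front is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical evaluated skill bundle."""
    name: str
    pass_rate: float  # objective 1: higher is better
    cost: float       # objective 2: lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.pass_rate >= b.pass_rate and a.cost <= b.cost
    strictly_better = a.pass_rate > b.pass_rate or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Candidates that no other candidate in the list dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def non_dominated_sort(cands: list[Candidate]) -> list[list[Candidate]]:
    """Peel off successive Pareto fronts, best front first."""
    remaining = list(cands)
    fronts = []
    while remaining:
        front = pareto_front(remaining)
        fronts.append(front)
        remaining = [c for c in remaining if c not in front]
    return fronts


# Illustrative values only; these are not results from the paper.
candidates = [
    Candidate("seed", pass_rate=0.20, cost=1.00),
    Candidate("pruned", pass_rate=0.40, cost=0.70),
    Candidate("expanded", pass_rate=0.20, cost=1.20),
    Candidate("substituted", pass_rate=0.35, cost=0.60),
]
fronts = non_dominated_sort(candidates)
for rank, front in enumerate(fronts):
    print(rank, [c.name for c in front])
```

In this toy population, the pruned and substituted bundles are mutually non-dominated (one passes more tests, the other costs less), so both land in the first front, while the seed and the expanded bundle fall into later fronts.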
Experimental Design
Three complex software engineering tasks were selected from the SkillsBench suite, each embedding the largest available skill bundles. To ensure robust evaluation, the default verifiers for each task were augmented to 40 tests using GPT-5.4, thus increasing test coverage and the diagnostic power of performance metrics. The tasks evaluated were:
- Task 1: Python build-failure repair,
- Task 2: Python-to-Scala pipeline logic translation,
- Task 3: Spring Boot to Jakarta API migration.
GLM-5 was selected as the LLM backbone for both task-solving and skill-optimization roles, and evolutionary search was operationalized with a population size of 1 and 5 generations per task. Each experiment was repeated 10 times, using Scott-Knott ESD to establish pass rate ranks and effect sizes.
Comparative baselines included the original skill bundles and a no-skill configuration.
Empirical Results
Pass Rate and Cost Improvement
SkillMOO demonstrated consistent, statistically significant improvement in agent performance over static bundles:
- Task 1: Pass rate increased from 0.16 to 0.37 (+131.2%), with cost reduced by 31.7% and runtime reduced by 23.6%.
- Task 2: Pass rate increased from 0.39 to 0.51 (+30.8%), cost dropped by 5.4%, and runtime by 8.0%.
- Task 3: Pass rate improved from 0.97 to 0.99 (+2.1%), with cost and runtime reduced by 19.4% and 29.8%, respectively.
In all three tasks, SkillMOO achieved higher pass rates than both the original-bundle and no-skill baselines, and lower cost than the original bundles.
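The percentage gains above are relative changes in pass rate, not percentage-point differences. A short check, using only the before and after pass rates quoted in the list (the helper name `relative_change_pct` is ours), reproduces them to within rounding:

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pass rates as reported for each task.
pass_rates = {
    "Task 1": (0.16, 0.37),
    "Task 2": (0.39, 0.51),
    "Task 3": (0.97, 0.99),
}

for task, (before, after) in pass_rates.items():
    points = (after - before) * 100.0
    print(f"{task}: {relative_change_pct(before, after):.2f}% relative, "
          f"{points:.0f} percentage points absolute")
```

Task 1 illustrates the difference: an absolute gain of 21 percentage points corresponds to a relative gain of roughly 131% because the starting pass rate is low.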
Optimization Overhead and Efficiency
Multi-objective improvement, measured as Pareto hypervolume (HV) gain in pass-vs-cost space, was substantial for all tasks:
- HV gains ranged from 301% (Task 3) to 2110% (Task 1) over baseline.
- Optimization cost per percentage point of HV gain was extremely low, e.g., $0.0011 for Task 1.
These results highlight the economic feasibility of SkillMOO's evolutionary optimization: the cost of running the optimizer is small relative to the gains it produces.
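For readers unfamiliar with the hypervolume indicator, the sketch below computes it for a two-objective pass-vs-cost space. It rests on assumptions that the summary does not state: cost is normalized to [0, 1], the reference point is (pass rate 0, cost 1), and the function name and sample points are hypothetical. The paper's exact reference point and normalization may differ.

```python
def hypervolume_2d(points, ref_pass=0.0, ref_cost=1.0):
    """Area dominated by `points` relative to a reference point.

    Each point is (pass_rate, cost); pass rate is maximized, cost minimized.
    A point (p, c) dominates the rectangle [ref_pass, p] x [c, ref_cost].
    """
    # Keep points that improve on the reference; sort by cost ascending.
    pts = sorted((c, p) for p, c in points if p > ref_pass and c < ref_cost)
    area = 0.0
    best_pass = ref_pass
    for i, (cost, pass_rate) in enumerate(pts):
        # Best pass rate achievable at this cost level or cheaper.
        best_pass = max(best_pass, pass_rate)
        next_cost = pts[i + 1][0] if i + 1 < len(pts) else ref_cost
        area += (next_cost - cost) * (best_pass - ref_pass)
    return area


def hv_gain_pct(before_points, after_points):
    """Relative hypervolume gain of `after_points` over `before_points`."""
    hv_before = hypervolume_2d(before_points)
    hv_after = hypervolume_2d(after_points)
    return (hv_after - hv_before) / hv_before * 100.0


# Illustrative values only; these are not results from the paper.
baseline = [(0.5, 0.5)]
optimized = [(0.5, 0.5), (0.8, 0.75)]
print(hypervolume_2d(baseline), hypervolume_2d(optimized))
print(hv_gain_pct(baseline, optimized))
```

Because hypervolume measures dominated area, a small dominated region at baseline makes even modest absolute improvements look large in relative terms, which is worth keeping in mind when reading gains of several hundred percent.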
Edit Pattern Analysis
A systematic analysis of optimizer-agent edit logs demonstrated the dominance of pruning and substitution operations:
- Pruning (removal of irrelevant skill blocks) and substitution (replacement with alternate guidance) occurred most frequently and consistently yielded cost reductions and, in many cases, performance improvement.
- Expansion (additive guidance) rarely led to pass rate improvement.
- Task-specific targeted removals (such as RestTemplate code in Task 3) directly mitigated overfitting to obsolete cues and improved pass rates when original bundle instructions conflicted with task constraints.
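The three edit types can be pictured as simple operations on a bundle of named instruction blocks. The sketch below is purely illustrative: the dictionary representation, the block names, the instruction text, and the character-count size proxy are all assumptions, not the format used by SkillMOO or SkillsBench.

```python
def prune(bundle: dict[str, str], block: str) -> dict[str, str]:
    """Return a copy of the bundle without the named block."""
    return {name: text for name, text in bundle.items() if name != block}


def substitute(bundle: dict[str, str], block: str, new_text: str) -> dict[str, str]:
    """Return a copy with an existing block's guidance replaced."""
    if block not in bundle:
        raise KeyError(block)
    return {**bundle, block: new_text}


def expand(bundle: dict[str, str], block: str, text: str) -> dict[str, str]:
    """Return a copy with a new block of guidance added."""
    if block in bundle:
        raise ValueError(f"block already exists: {block}")
    return {**bundle, block: text}


def prompt_size(bundle: dict[str, str]) -> int:
    """Crude cost proxy: total characters of instruction text."""
    return sum(len(text) for text in bundle.values())


# Hypothetical seed bundle for a migration task.
seed = {
    "overview": "Migrate the service to the target API.",
    "resttemplate_usage": "Use RestTemplate for all outbound HTTP calls.",
    "testing": "Run the full test suite after each change.",
}

pruned = prune(seed, "resttemplate_usage")
swapped = substitute(seed, "testing", "Run only the affected tests.")
grown = expand(seed, "logging", "Add debug logging around each migrated call.")

print(prompt_size(seed), prompt_size(pruned), prompt_size(swapped), prompt_size(grown))
```

Under this proxy, pruning always shrinks the bundle and expansion always grows it, which is consistent with the observation that pruning reliably reduces cost; whether pass rate improves still has to be measured by running the task solver.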
Discussion and Implications
This study establishes several key findings for SE agent optimization:
- Automated, LLM-driven skill bundle optimization is both feasible and effective for SE tasks, even under realistic time and resource constraints.
- Minimal, focused skill bundles, achieved through systematic pruning and targeted substitution, are more effective than bundles that accumulate elaborate or generic guidance.
- Manual tuning is inefficient: The automated approach outperforms the traditional, expert-driven process and is far less labor-intensive.
- Task specificity matters: The most effective edits corrected mismatches between what the original bundle assumed and what the task actually required, underscoring the adaptability of SkillMOO.
These results are consistent with recent observations from other skill-evolution approaches such as EvoSkill (Alzubi et al., 3 Mar 2026), Meta Context Engineering (Ye et al., 29 Jan 2026), and EvoSkills (Zhang et al., 2 Apr 2026), but SkillMOO extends the methodology to multi-objective search and explicitly weighs cost/benefit tradeoffs.
Future Directions
Given these findings, future research should address the following:
- Generalizability: Expanding evaluation beyond the three largest SkillsBench tasks to smaller or cross-domain tasks.
- Other LLMs and Meta-Agents: Adapting SkillMOO to a wider range of underlying LLMs (beyond GLM-5) and scaling the optimizer-solver scheme.
- Causal Attribution: More granular studies into which skill modifications drive causal improvements, deconfounded from underlying LLM idiosyncrasies.
- Integration in Continuous Agent Learning: Leveraging SkillMOO's skill evolution in ongoing agent self-improvement and live deployment scenarios.
Conclusion
SkillMOO delivers a principled, efficient approach for multi-objective skill bundle optimization in LLM-based software engineering agents. The empirical evidence establishes robust improvements in both agent pass rate and operational cost, further strengthened by consistent findings that skill pruning and selective substitution are more beneficial than accumulating additional instruction. These insights are valuable for both researchers and practitioners seeking to deploy scalable, cost-efficient, and high-performing agent systems in evolving SE environments.