Gym-Anything: Turn any Software into an Agent Environment
This presentation introduces Gym-Anything, a groundbreaking framework that automatically converts arbitrary software into interactive agent environments for training and evaluating computer-use agents. The work addresses fundamental limitations in existing agent benchmarks by creating CUA-World, a massive benchmark spanning over 10,000 tasks across 200+ industrially relevant software applications selected based on economic impact. Through a novel multi-agent creation-audit loop and GDP-grounded software selection, this research demonstrates how to build scalable, realistic environments that cover the full spectrum of digital occupations, while revealing critical insights about agent generalization, scaling laws, and the substantial gap between current capabilities and practical automation.
Script
Building agents that can automate real digital work sounds simple until you try to test them. Existing benchmarks rely on toy tasks in narrow domains: short workflows that take seconds, not the complex multi-hour processes that define actual jobs. The researchers behind Gym-Anything asked a harder question: what if we could turn any software, selected by economic impact, into a realistic training ground for agents?
The problem runs deeper than missing a few domains. Prior work evaluates agents on tasks that finish in minutes, using software configurations a college student could set up in an afternoon. Real jobs—clinical decision support, financial reconciliation, scientific analysis—demand hours of interaction across specialized tools. Manual curation of such environments is economically infeasible at the scale needed to train generalist agents.
The authors designed a system that treats environment creation itself as an agent task.
Naive agent-generated environments fail silently, producing plausible but broken setups. Gym-Anything enforces a creation-audit loop where one agent builds environments and documents every claim with screenshots and logs, while an adversarial auditor inspects only that evidence—not the creator's assertions—against rigorous checklists. A third agent distills lessons into shared memory, so each new environment benefits from all prior failures. This adversarial accountability makes automation reliable.
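To make the loop concrete, here is a minimal sketch of a creation-audit cycle under stated assumptions. The names `Evidence`, `creation_audit_loop`, `toy_creator`, and `needs_screenshot` are hypothetical and not taken from Gym-Anything; the real system uses agents where this sketch uses plain functions, and the "memory" here is just a list of failure messages standing in for the distilled lessons.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Evidence:
    """One creator claim plus the artifacts backing it (hypothetical structure)."""
    claim: str
    screenshots: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)


# Hypothetical signatures: the creator sees the lessons learned so far; each
# checklist item sees only the evidence and returns a failure message or None.
Creator = Callable[[List[str]], List[Evidence]]
Check = Callable[[List[Evidence]], Optional[str]]


def creation_audit_loop(create: Creator, checks: List[Check],
                        memory: List[str], max_rounds: int = 3) -> Tuple[bool, int]:
    """Alternate creation and audit until every check passes or rounds run out."""
    for round_no in range(1, max_rounds + 1):
        evidence = create(list(memory))
        # The auditor judges the evidence alone, never the creator's word.
        failures = [msg for msg in (check(evidence) for check in checks) if msg]
        if not failures:
            return True, round_no
        # Stand-in for the third agent: keep failures as lessons for later runs.
        memory.extend(failures)
    return False, max_rounds


def toy_creator(lessons: List[str]) -> List[Evidence]:
    """Toy creator that only attaches a screenshot once a lesson asks for one."""
    ev = Evidence(claim="app launches", logs=["launched ok"])
    if any("screenshot" in lesson for lesson in lessons):
        ev.screenshots.append("launch.png")
    return [ev]


def needs_screenshot(evidence: List[Evidence]) -> Optional[str]:
    """Checklist item: every claim must come with at least one screenshot."""
    missing = [e.claim for e in evidence if not e.screenshots]
    return f"attach a screenshot for: {missing}" if missing else None


memory: List[str] = []
passed, rounds = creation_audit_loop(toy_creator, [needs_screenshot], memory)
```

In this toy run the first round fails the audit, the failure is stored in `memory`, and the second round passes because the creator reads that lesson. Passing the same `memory` list into the next environment's loop is how later environments would benefit from earlier failures in this sketch.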
Software selection isn't arbitrary. The authors built a multi-stage pipeline that allocates GDP to specific applications, filtering for those that are free, sandboxable, and GUI-driven. Tiered algorithms balance coverage across all occupation groups while prioritizing high-impact domains. The result is CUA-World: 200 environments and over 10,000 tasks that systematically span the economic landscape of digital work, from engineering simulation to legal case management.
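The selection step can be sketched roughly as a filter followed by a two-tier pick. This is an illustration only: the `App` fields, the `select_apps` function, the tier logic, and every value in `CANDIDATES` are assumptions made up for the example, not the authors' pipeline or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class App:
    name: str
    occupation_group: str
    gdp_share: float   # GDP attributed to the app, arbitrary illustrative units
    free: bool
    sandboxable: bool
    gui: bool


def is_eligible(app: App) -> bool:
    """Keep only applications that are free, sandboxable, and GUI-driven."""
    return app.free and app.sandboxable and app.gui


def select_apps(apps: List[App], budget: int) -> List[App]:
    """Pick up to `budget` apps: coverage first, then economic impact."""
    pool = sorted((a for a in apps if is_eligible(a)),
                  key=lambda a: a.gdp_share, reverse=True)
    chosen: List[App] = []
    covered = set()
    # Tier 1: the highest-impact eligible app in each occupation group.
    for app in pool:
        if len(chosen) < budget and app.occupation_group not in covered:
            chosen.append(app)
            covered.add(app.occupation_group)
    # Tier 2: fill the remaining slots purely by attributed GDP.
    for app in pool:
        if len(chosen) >= budget:
            break
        if app not in chosen:
            chosen.append(app)
    return chosen


# Made-up candidates, for illustration only.
CANDIDATES = [
    App("cad_tool", "engineering", 9.0, True, True, True),
    App("fem_solver", "engineering", 7.0, True, True, True),
    App("case_manager", "legal", 3.0, True, True, True),
    App("paid_suite", "finance", 20.0, False, True, True),   # not free
    App("cli_ledger", "finance", 5.0, True, True, False),    # not GUI-driven
]

selected = [a.name for a in select_apps(CANDIDATES, budget=3)]
```

With these made-up inputs, the legal application is selected ahead of the second engineering tool even though it carries less attributed GDP, which is the coverage-versus-impact trade-off the tiers are meant to express.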
This figure reveals a fundamental limitation in agent generalization. When the researchers trained models on only a quarter or half of the 200 software applications, performance on those training environments recovered most of the possible improvement. But on held-out software, the gain collapsed to barely a fifth of what full coverage achieved. The gap between the solid and hatched bars shows that agents don't automatically transfer skills across applications. Coverage diversity isn't optional; it's the only path to generalist competence. Claims of general-purpose agents must now be tested against this standard.
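One plausible way to read "recovered most of the possible improvement" is as a ratio of gains over an untrained baseline. The sketch below assumes that definition; the function name `recovered_fraction` and all scores are placeholders chosen for easy arithmetic, not numbers from the paper.

```python
def recovered_fraction(base: float, subset: float, full: float) -> float:
    """Share of the full-coverage gain that subset training achieves.

    base:   score before any training
    subset: score after training on a subset of the applications
    full:   score after training on all applications
    """
    full_gain = full - base
    if full_gain == 0:
        raise ValueError("full-coverage training produced no gain to compare against")
    return (subset - base) / full_gain


# Placeholder scores for illustration only.
in_distribution = recovered_fraction(base=10.0, subset=18.0, full=20.0)
held_out = recovered_fraction(base=10.0, subset=12.0, full=20.0)
```

Under this reading, a value near 1 on the training applications alongside a value near 0.2 on held-out applications would correspond to the pattern described above.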
Gym-Anything shifts the bottleneck from environment curation to the algorithms themselves, proving that realistic, large-scale agent evaluation is no longer a manual luxury but an automated necessity. The framework and CUA-World benchmark are now available for the research community. Visit EmergentMind.com to explore this work further and create your own research videos.