- The paper introduces OS-SPEAR, a unified toolkit that evaluates OS agents across safety, performance, efficiency, and robustness dimensions.
- It employs specialized evaluation subsets and an automated analysis tool to deliver granular metrics and comprehensive diagnostic reports.
- Empirical results on 22 OS agents reveal trade-offs between efficiency and safety, underscoring the need for specialized design in real-world scenarios.
Introduction
The paper "OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents" (2604.24348) presents OS-SPEAR, a unified evaluation toolkit addressing significant deficiencies in current OS agent benchmarking. While OS agents powered by MLLMs have evolved from language output systems to behavioral executors interacting with GUIs, most prior benchmarks focus narrowly on task completion. This paper asserts that reliable OS agents must also be evaluated for safety, efficiency, and robustness, especially given their deployment in real-world settings where environmental hazards and adversarial disturbances are prevalent. OS-SPEAR systematically covers four dimensionsโSafety (S), Performance (P), Efficiency (E), and Robustness (R)โand introduces specialized evaluation subsets and an automated analysis tool for comprehensive diagnostic reporting.
Figure 1: OS-SPEAR provides integrated benchmarking of OS agents across safety, performance, efficiency, and robustness with dimension-specific metrics and reporting.
OS agents are autonomous systems for GUI-based task automation, commonly leveraging MLLMs to translate visual and textual input into actionable steps (clicks, text input, navigation, etc.). The paper formalizes execution flow and threat modeling: a policy maps screenshots, histories, and prompts to actions, which are executed repeatedly until termination. Threats to reliability are modeled as stochastic environmental noise or adversarial perturbations, affecting both visual and textual modalities. Robustness is defined as resistance to perturbation-induced divergence from optimal trajectories, with formal metrics quantifying robustness degradation and safety violations.
OS-SPEAR Subset Design and Construction
OS-SPEAR deconstructs agent benchmarking into four rigorously constructed subsets:
S-subset: Safety
The S-subset systematically evaluates OS agents' resilience against unsafe factors from environmental distractions, real-world anomalies, and targeted adversarial misleading. Unsafe factors include intentionally crafted distractions (e.g., pop-ups designed to manipulate agent behavior), naturally occurring anomalies, and adversarial content intended to induce detrimental actions.
Figure 2: S-subset scenario taxonomy: environmental distractions, real-world anomalies, and adversarial misleading events.
Performance evaluation is addressed by curating trajectories of varying complexity and annotation quality through agent voting and stratified sampling, ensuring only high-value test cases are retained. The subset is constructed in two stages: small-scale agent-based filtering of trivial cases, followed by large-scale agents voting trajectories by difficulty (Easy, Medium, Hard), discarding ambiguous or overly challenging cases.
Figure 3: P-subset construction: trajectory curation via value estimation and stratified sampling for balanced difficulty distribution.
E-subset: Efficiency
OS-SPEAR quantitatively assesses efficiency via both temporal latency and token consumption (the latter weighted by output tokens, reflecting real-world economic deployment constraints). Unlike prior benchmarks relying solely on step counts, this approach integrates inference cost metrics aligned with commercial billing standards, offering a more actionable perspective on agent scalability and deployment.
Figure 4: E-subset token evaluation: inference token cost (input + 3x output) across all agents with intra-family comparisons.
Figure 5: E-subset time evaluation: inference latency (log scale) per step, highlighting scale-related efficiency trade-offs.
R-subset: Robustness
Robustness is investigated by subjecting agents to cross-modal disturbancesโfive visual (mask, zoom-in, varying Gaussian noise) and five textual (state conflict, bad memory, bad knowledge, irrelevant memory, irrelevant knowledge). The subset enables granular analysis of modality-dependent vulnerabilities and resilience mechanisms.
The toolkit integrates a multi-agent analysis tool for automated reporting. Dimension-specific expert agents parse evaluation logs, generate specific scores, and identify finer diagnostic characteristics. An integrated agent produces an overall report featuring composite scoring, risk summaries, and actionable insights.
Figure 6: Analysis tool pipeline: expert agents synthesize subset-specific reports; integrated agent produces comprehensive summary and risk stratification.
Experimental Study and Numerical Results
OS-SPEAR was applied to 22 representative OS agents, covering both specialized and general-purpose models across various scale regimes. Dimension-specific and overall rankings were computed from task completion rate, safety under perturbations, efficiency metrics, and robustness degradation.
Key Numerical Findings
- UI-Venus-1.5-8B ranked first overall, demonstrating balanced excellence in task completion, safety, efficiency, and robustness.
- Specialized OS agents systematically outperformed general-purpose models across aggregate and dimensional metrics.
- Trade-offs were observed: models engineered for efficiency (e.g., AgentCPM-GUI-8B, OS-Atlas-Pro-7B) suffered in safety and robustness due to reduced reasoning and context utilization.
- Performance scaling within model families yielded diminishing returns; parameter increases did not linearly translate to overall ranking improvements due to associated efficiency penalties.
- Robustness vulnerabilities were modality-specific: some agents were disproportionately affected by visual perturbations, whereas others were susceptible to textual conflicts and irrelevant injected content.
Practical and Theoretical Implications
OS-SPEAR enables nuanced and actionable evaluation of OS agent reliability, moving beyond simplistic completion metrics. Multidimensional benchmarking is essential for safe deployment in high-stakes real-world scenarios, such as medical, financial, or social automation where failures can result in catastrophic outcomes. The methodology surfaces critical trade-offs and guides future design: efficiency optimization must not neglect safety; domain adaptation and specialized model design are mandatory for robust operation in realistic environments. The results highlight the need for innovations beyond parameter scalingโaddressing architectural, algorithmic, and data-centric advances to achieve both scalability and reliability.
Future Directions
The synergy between safety, robustness, and efficiency remains an open area, with substantial theoretical interest in causal reasoning, explicit risk modeling, and adversarial mitigation. The cross-modal perturbation protocol in OS-SPEAR also motivates future research in joint vision-language threat assessment. Automated diagnostic reporting could be extended to real-time monitoring and continual learning systems for ongoing reliability assurance. OS-SPEAR establishes a standardized foundation for future agent research, facilitating reproducible benchmarking and rigorous comparative studies.
Conclusion
OS-SPEAR introduces a rigorous, multidimensional evaluation toolkit addressing safety, performance, efficiency, and robustness for OS agents interacting with GUIs. It provides significant advances in methodology, dataset curation, and automated reporting. Empirical results reveal trade-offs, modality-specific vulnerabilities, and scaling limitations, underscoring the necessity of comprehensive benchmarks for reliable agent deployment. OS-SPEAR constitutes a foundational resource for the development and assessment of next-generation OS agents, and facilitates the advancement of agent research across practical and theoretical domains.