OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents

Published 27 Apr 2026 in cs.CL | (2604.24348v1)

Abstract: The evolution of Multimodal LLMs (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) a R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and codes are available at https://github.com/Wuzheng02/OS-SPEAR.

Abstract PDF Upgrade to Chat

Authors (11)

Summary

The paper introduces OS-SPEAR, a unified toolkit that evaluates OS agents across safety, performance, efficiency, and robustness dimensions.
It employs specialized evaluation subsets and an automated analysis tool to deliver granular metrics and comprehensive diagnostic reports.
Empirical results on 22 OS agents reveal trade-offs between efficiency and safety, underscoring the need for specialized design in real-world scenarios.

OS-SPEAR: A Comprehensive Toolkit for Multi-Dimensional Evaluation of OS Agents

Introduction

The paper "OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents" (2604.24348) presents OS-SPEAR, a unified evaluation toolkit addressing significant deficiencies in current OS agent benchmarking. While OS agents powered by MLLMs have evolved from language output systems to behavioral executors interacting with GUIs, most prior benchmarks focus narrowly on task completion. This paper asserts that reliable OS agents must also be evaluated for safety, efficiency, and robustness, especially given their deployment in real-world settings where environmental hazards and adversarial disturbances are prevalent. OS-SPEAR systematically covers four dimensions—Safety (S), Performance (P), Efficiency (E), and Robustness (R)—and introduces specialized evaluation subsets and an automated analysis tool for comprehensive diagnostic reporting.

Figure 1: OS-SPEAR provides integrated benchmarking of OS agents across safety, performance, efficiency, and robustness with dimension-specific metrics and reporting.

OS Agent Evaluation: Dimension Formalization

OS agents are autonomous systems for GUI-based task automation, commonly leveraging MLLMs to translate visual and textual input into actionable steps (clicks, text input, navigation, etc.). The paper formalizes execution flow and threat modeling: a policy maps screenshots, histories, and prompts to actions, which are executed repeatedly until termination. Threats to reliability are modeled as stochastic environmental noise or adversarial perturbations, affecting both visual and textual modalities. Robustness is defined as resistance to perturbation-induced divergence from optimal trajectories, with formal metrics quantifying robustness degradation and safety violations.

OS-SPEAR Subset Design and Construction

OS-SPEAR deconstructs agent benchmarking into four rigorously constructed subsets:

S-subset: Safety

The S-subset systematically evaluates OS agents' resilience against unsafe factors from environmental distractions, real-world anomalies, and targeted adversarial misleading. Unsafe factors include intentionally crafted distractions (e.g., pop-ups designed to manipulate agent behavior), naturally occurring anomalies, and adversarial content intended to induce detrimental actions.

Figure 2: S-subset scenario taxonomy: environmental distractions, real-world anomalies, and adversarial misleading events.

P-subset: Performance

Performance evaluation is addressed by curating trajectories of varying complexity and annotation quality through agent voting and stratified sampling, ensuring only high-value test cases are retained. The subset is constructed in two stages: small-scale agent-based filtering of trivial cases, followed by large-scale agents voting trajectories by difficulty (Easy, Medium, Hard), discarding ambiguous or overly challenging cases.

Figure 3: P-subset construction: trajectory curation via value estimation and stratified sampling for balanced difficulty distribution.

E-subset: Efficiency

OS-SPEAR quantitatively assesses efficiency via both temporal latency and token consumption (the latter weighted by output tokens, reflecting real-world economic deployment constraints). Unlike prior benchmarks relying solely on step counts, this approach integrates inference cost metrics aligned with commercial billing standards, offering a more actionable perspective on agent scalability and deployment.

Figure 4: E-subset token evaluation: inference token cost (input + 3x output) across all agents with intra-family comparisons.

Figure 5: E-subset time evaluation: inference latency (log scale) per step, highlighting scale-related efficiency trade-offs.

R-subset: Robustness

Robustness is investigated by subjecting agents to cross-modal disturbances—five visual (mask, zoom-in, varying Gaussian noise) and five textual (state conflict, bad memory, bad knowledge, irrelevant memory, irrelevant knowledge). The subset enables granular analysis of modality-dependent vulnerabilities and resilience mechanisms.

Automated Analysis Tool

The toolkit integrates a multi-agent analysis tool for automated reporting. Dimension-specific expert agents parse evaluation logs, generate specific scores, and identify finer diagnostic characteristics. An integrated agent produces an overall report featuring composite scoring, risk summaries, and actionable insights.

Figure 6: Analysis tool pipeline: expert agents synthesize subset-specific reports; integrated agent produces comprehensive summary and risk stratification.

Experimental Study and Numerical Results

OS-SPEAR was applied to 22 representative OS agents, covering both specialized and general-purpose models across various scale regimes. Dimension-specific and overall rankings were computed from task completion rate, safety under perturbations, efficiency metrics, and robustness degradation.

Key Numerical Findings

UI-Venus-1.5-8B ranked first overall, demonstrating balanced excellence in task completion, safety, efficiency, and robustness.
Specialized OS agents systematically outperformed general-purpose models across aggregate and dimensional metrics.
Trade-offs were observed: models engineered for efficiency (e.g., AgentCPM-GUI-8B, OS-Atlas-Pro-7B) suffered in safety and robustness due to reduced reasoning and context utilization.
Performance scaling within model families yielded diminishing returns; parameter increases did not linearly translate to overall ranking improvements due to associated efficiency penalties.
Robustness vulnerabilities were modality-specific: some agents were disproportionately affected by visual perturbations, whereas others were susceptible to textual conflicts and irrelevant injected content.

Practical and Theoretical Implications

OS-SPEAR enables nuanced and actionable evaluation of OS agent reliability, moving beyond simplistic completion metrics. Multidimensional benchmarking is essential for safe deployment in high-stakes real-world scenarios, such as medical, financial, or social automation where failures can result in catastrophic outcomes. The methodology surfaces critical trade-offs and guides future design: efficiency optimization must not neglect safety; domain adaptation and specialized model design are mandatory for robust operation in realistic environments. The results highlight the need for innovations beyond parameter scaling—addressing architectural, algorithmic, and data-centric advances to achieve both scalability and reliability.

Future Directions

The synergy between safety, robustness, and efficiency remains an open area, with substantial theoretical interest in causal reasoning, explicit risk modeling, and adversarial mitigation. The cross-modal perturbation protocol in OS-SPEAR also motivates future research in joint vision-language threat assessment. Automated diagnostic reporting could be extended to real-time monitoring and continual learning systems for ongoing reliability assurance. OS-SPEAR establishes a standardized foundation for future agent research, facilitating reproducible benchmarking and rigorous comparative studies.

Conclusion

OS-SPEAR introduces a rigorous, multidimensional evaluation toolkit addressing safety, performance, efficiency, and robustness for OS agents interacting with GUIs. It provides significant advances in methodology, dataset curation, and automated reporting. Empirical results reveal trade-offs, modality-specific vulnerabilities, and scaling limitations, underscoring the necessity of comprehensive benchmarks for reliable agent deployment. OS-SPEAR constitutes a foundational resource for the development and assessment of next-generation OS agents, and facilitates the advancement of agent research across practical and theoretical domains.

Markdown Report Issue