GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Published 3 Apr 2026 in cs.SE and cs.AI | (2604.02648v1)

Abstract: The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for LLMs. In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.

Summary

  • The paper introduces GBQA as a benchmark that challenges LLMs to autonomously discover bugs in game environments.
  • It uses a hierarchical multi-agent pipeline to build 30 games containing 124 bugs, and provides a baseline agent with a dual-level memory system for evaluating LLM performance on them.
  • Experimental results reveal low recall rates in autonomous bug discovery compared to conventional code repair benchmarks, highlighting critical research gaps.

GBQA: A Benchmark for Evaluating Autonomous Bug Discovery in Game Environments

Overview and Motivation

This paper introduces GBQA, a Game Benchmark for Quality Assurance, which directly evaluates the capacity of LLMs to function as fully autonomous QA engineers in interactive software environments, with a focus on game development (2604.02648). While prior research has primarily targeted code generation and repair conditioned on explicit issue reports, GBQA addresses the upstream challenge of autonomous bug discovery, in which LLM agents must detect specification-violating anomalies without human-supplied bug descriptions.

GBQA targets agentic evaluation through a suite of 30 systematically constructed game environments containing 124 human-verified bugs across three difficulty levels. It provides a quantitative evaluation protocol that emphasizes recall in bug discovery, that is, the fraction of latent faults an agent uncovers, which remains a critical bottleneck in fully autonomous software engineering.

Figure 1: Evolution of the software development paradigm in the LLM era: from human-driven workflows (a), to human–LLM collaborative coding (b), to the goal of fully autonomous development, including bug detection and QA (c), which is the principal focus of GBQA.

Benchmark Construction and Task Formulation

The GBQA benchmark leverages a hierarchical multi-agent pipeline for scalable environment and bug generation. A producer agent orchestrates three expert teams (Design, Programming, Art), each managed by leaders, and operationalizes a modular, multi-workspace development process. The system iteratively escalates environment complexity, injecting diverse and challenging bugs until a minimum discovery threshold is met, ensuring non-trivial state and interaction spaces.

Each environment $\mathcal{E}$ is defined as a tuple comprising a state space, an action space, transition dynamics, and an initial state. Agents interact with $\mathcal{E}$ through RESTful backend APIs, collecting observations, emitting actions, and generating multiple ReAct-based exploration trajectories. Bugs are divided into three categories (easy, medium, hard) using cognitive and temporal complexity criteria, ranging from perceptual inconsistencies to long-horizon state-tracking challenges.

Figure 2: GBQA dataset and evaluation loop. A multi-agent system generates 30 games with 124 annotated bugs. During evaluation, an LLM-based QA agent autonomously interacts with each game and submits bug reports, which a critic agent matches against the ground truth.
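The environment tuple described above can be made concrete with a small sketch. Everything here is an illustrative assumption rather than GBQA's actual interface: the toy shop game, the names `Environment`, `rollout`, and `first_violation`, and the choice to call the transition function in-process instead of through a RESTful backend. The state space is represented by a membership test, so a specification violation shows up as a transition into an invalid state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

State = Tuple[str, int]   # hypothetical toy state: (room, gold)
Action = str
Step = Tuple[State, Action, State]


@dataclass(frozen=True)
class Environment:
    """E = (S, A, T, s0): state space, action space, transition dynamics, initial state."""
    is_valid_state: Callable[[State], bool]        # membership test for S
    actions: Tuple[Action, ...]                    # A
    transition: Callable[[State, Action], State]   # T
    initial_state: State                           # s0


def toy_transition(state: State, action: Action) -> State:
    room, gold = state
    if action == "go_shop":
        return ("shop", gold)
    if action == "go_home":
        return ("home", gold)
    if action == "buy_potion" and room == "shop":
        # Injected bug (illustrative): the purchase never checks that gold >= 5.
        return (room, gold - 5)
    return state  # any other action is a no-op


TOY_ENV = Environment(
    is_valid_state=lambda s: s[0] in {"home", "shop"} and s[1] >= 0,
    actions=("go_shop", "go_home", "buy_potion"),
    transition=toy_transition,
    initial_state=("home", 5),
)


def rollout(env: Environment, plan: List[Action]) -> List[Step]:
    """Apply the actions in order and record (state, action, next_state) steps."""
    state = env.initial_state
    trajectory: List[Step] = []
    for action in plan:
        next_state = env.transition(state, action)
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


def first_violation(env: Environment, trajectory: List[Step]) -> Optional[int]:
    """Return the index of the first step that lands outside S, or None."""
    for index, (_, _, next_state) in enumerate(trajectory):
        if not env.is_valid_state(next_state):
            return index
    return None


if __name__ == "__main__":
    steps = rollout(TOY_ENV, ["go_shop", "buy_potion", "buy_potion"])
    print(first_violation(TOY_ENV, steps))  # 2: the second purchase drives gold below zero
```

In this toy case the bug only surfaces after a specific action sequence, which is the kind of state-dependent fault the difficulty levels are meant to grade.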

Baseline Agents: ReAct and Memory Architectures

A baseline QA agent is provided, formalizing the agentic bug discovery loop. Unlike task-completion or repair agents, this agent interleaves ReAct-style planning and acting with explicit reflection on expectation violations at every step. Critically, when it suspects an anomaly, the agent invokes a local verification phase so that the reports it files are reproducible and robust to false positives.
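The loop described above can be sketched without an LLM by substituting a hand-written expectation function for the model's reasoning. This is a minimal sketch under stated assumptions, not the paper's agent: the names `explore_with_verification`, `replay`, `buggy_step`, and `expected_step` are hypothetical, and "local verification" is approximated here by replaying the same action sequence from a fresh reset and filing a report only if the anomaly reproduces every time.

```python
from typing import Callable, Dict, List

State = int
Action = str


def replay(reset: Callable[[], State],
           step: Callable[[State, Action], State],
           actions: List[Action]) -> State:
    """Re-run an action sequence from a fresh reset and return the final state."""
    state = reset()
    for action in actions:
        state = step(state, action)
    return state


def explore_with_verification(reset: Callable[[], State],
                              step: Callable[[State, Action], State],
                              expect: Callable[[State, Action], State],
                              plan: List[Action],
                              retries: int = 3) -> List[Dict]:
    """Act, compare observation with expectation, and verify before reporting."""
    reports: List[Dict] = []
    history: List[Action] = []
    state = reset()
    for action in plan:
        expected = expect(state, action)   # reason: what the spec says should happen
        observed = step(state, action)     # act and observe
        history.append(action)
        if observed != expected:           # reflect: expectation violated
            # Local verification (assumed form): the anomaly must reproduce on replay.
            reproduced = all(
                replay(reset, step, history) == observed for _ in range(retries)
            )
            if reproduced:
                reports.append({
                    "steps": list(history),
                    "expected": expected,
                    "observed": observed,
                })
        state = observed
    return reports


def buggy_step(state: State, action: Action) -> State:
    """Toy counter game with an injected bug: incrementing 3 yields 5."""
    if action == "inc":
        return 5 if state == 3 else state + 1
    return state


def expected_step(state: State, action: Action) -> State:
    """Toy specification: 'inc' adds exactly one."""
    return state + 1 if action == "inc" else state


if __name__ == "__main__":
    for report in explore_with_verification(lambda: 0, buggy_step, expected_step, ["inc"] * 4):
        print(report)
```

Storing the full action prefix in each report is a deliberate choice: it doubles as the reproduction recipe a human or a repair agent would need.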

To address context constraints in long-horizon interactive testing, a dual-level hierarchical memory system is introduced: an in-session short-term memory that summarizes recent trajectories, and a cross-session memory that abstracts experience across playthroughs. This enables systematic rather than stochastic exploration, informative verification, and sustained reasoning about stateful, temporally delayed inconsistencies.

Figure 3: Ablation study of the memory module, demonstrating that combining in-session and cross-session memory avoids redundant exploration and supports better long-horizon discovery. Each session cluster shows gains across step budgets as memory sophistication increases.
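One way to picture the two memory levels is the sketch below. It is an assumption-laden illustration, not the paper's implementation: the class names `SessionMemory` and `CrossSessionMemory`, the fixed-size window, and the counter-based "summary" are all hypothetical, and a real agent would presumably call an LLM to summarize evicted steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionMemory:
    """In-session short-term memory: recent steps verbatim, older ones compressed."""
    window: int = 3
    recent: List[str] = field(default_factory=list)
    summarized_count: int = 0
    last_summarized: str = ""

    def add(self, step: str) -> None:
        self.recent.append(step)
        if len(self.recent) > self.window:
            # Placeholder for summarization: evict the oldest step and count it.
            self.last_summarized = self.recent.pop(0)
            self.summarized_count += 1

    def summary(self) -> str:
        if self.summarized_count == 0:
            return "no earlier steps"
        return (f"{self.summarized_count} earlier step(s) compressed; "
                f"latest of them: {self.last_summarized}")


@dataclass
class CrossSessionMemory:
    """Cross-session memory: deduplicated lessons carried across playthroughs."""
    lessons: List[str] = field(default_factory=list)

    def add_lesson(self, lesson: str) -> None:
        if lesson not in self.lessons:
            self.lessons.append(lesson)


def build_context(session: SessionMemory, cross: CrossSessionMemory) -> str:
    """Assemble the bounded context that would be placed in the agent's prompt."""
    lines = ["Lessons from earlier sessions:"]
    lines.extend(f"- {lesson}" for lesson in cross.lessons)
    lines.append(f"Summary of this session: {session.summary()}")
    lines.append("Recent steps:")
    lines.extend(f"- {step}" for step in session.recent)
    return "\n".join(lines)


if __name__ == "__main__":
    cross = CrossSessionMemory()
    cross.add_lesson("shop purchases were already tested with low gold")
    session = SessionMemory(window=2)
    for step in ["go_shop", "buy_potion", "go_home"]:
        session.add(step)
    print(build_context(session, cross))
```

The context size stays bounded by the window plus the lesson list regardless of how many steps a session runs, which is the property that matters for long-horizon exploration.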

Experimental Results and Analysis

Empirical evaluations cover a representative suite of contemporary LLMs, namely Claude-4.6-Opus(-Thinking), GPT-5.2, Gemini-3.1-Pro, DeepSeek-R1, and Qwen3(-Coder-Next), tested under player-only and QA-informed regimes across four step budgets ($T \in \{50, 100, 200, 500\}$). Their recall on GBQA is substantially lower than their scores on recent code repair leaderboards (e.g., SWE-bench Verified): the best result, from Claude-4.6-Opus-Thinking, detects only 48.39% of annotated defects at a 500-step budget.
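The recall figure can be checked with a short calculation. The sketch below assumes recall is simply matched bugs divided by total annotated bugs; the count of 60 is an inference from the reported 48.39% of 124, not a number stated in the source, and the bug IDs and helper names are hypothetical.

```python
from typing import Dict, Set


def recall_percent(found: int, total: int) -> float:
    """Recall as a percentage: discovered bugs over all annotated bugs."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * found / total


def recall_by_difficulty(ground_truth: Dict[str, str],
                         discovered: Set[str]) -> Dict[str, float]:
    """Per-difficulty recall; discovered IDs absent from the ground truth are ignored."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for bug_id, level in ground_truth.items():
        totals[level] = totals.get(level, 0) + 1
        if bug_id in discovered:
            hits[level] = hits.get(level, 0) + 1
    return {level: recall_percent(hits.get(level, 0), count)
            for level, count in totals.items()}


if __name__ == "__main__":
    # 60 of 124 is the count consistent with the reported 48.39% (inferred).
    print(f"{recall_percent(60, 124):.2f}%")  # 48.39%

    # Hypothetical toy ground truth, not GBQA data.
    truth = {"b1": "easy", "b2": "easy", "b3": "medium", "b4": "hard"}
    print(recall_by_difficulty(truth, {"b1", "b2", "b3", "unmatched"}))
```

Because the metric is recall only, unmatched reports do not lower the score in this sketch; how the benchmark treats false positives is not specified in the text above.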

Key findings include:

  • Bug discovery remains unsolved: Even with comprehensive source access and extended exploration, every agent misses more than half of the bugs present under realistic constraints, which highlights how much harder autonomous bug discovery is than guided code repair.
  • Scaling law for reasoning: Incremental improvements are observed with both model scaling and more sophisticated inference-time capabilities ("thinking" variants), but gains from larger models alone are smaller than gains from deliberative reasoning.
  • Significant difficulty gap: Easy bugs, those evident from direct observation, are largely exhausted within a few hundred steps, while hard bugs, which depend on aggregating long-term interaction context, require orders of magnitude more exploration and remain mostly undiscovered.
  • QA mode advantage: Providing game specifications and code consistently boosts recall, but the effect is bounded by reasoning deficits, particularly around persistent state, planning, and systematic hypothesis testing.

    Figure 4: Percentage of bug discovery by difficulty level and step count, showing plateauing for easy bugs and an unsaturated, near-linear regime for hard bugs, affirming the heightened long-horizon challenge.

Comparison with Conventional Code Benchmarks

Frontier LLMs now routinely exceed 70–80% on code repair datasets such as SWE-bench Verified; however, their recall on GBQA is routinely less than half of that. This supports the paper’s claim that "bug discovery presents a fundamentally harder problem than issue-driven repair," as it involves unscripted state exploration, implicit anomaly recognition, and real-time hypothesis refinement rather than localization and patching of a known defect.

Benchmark Reliability and Evaluation Protocol

GBQA’s annotation and evaluation pipeline is robustly validated: inter-annotator agreement achieves Krippendorff's $\alpha = 0.901$, and the automated critic agent’s decisions exhibit high Pearson correlation with human raters (e.g., GPT-5.2: $\rho = 0.903$, $p \ll 0.0001$), ensuring consistent and reliable assessment of discovered bugs.
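For readers who want to reproduce this kind of critic-versus-human check on their own data, a Pearson correlation can be computed as in the sketch below. The scores are made-up placeholders, not GBQA ratings, and the function name `pearson` is hypothetical; the source does not say how the critic's decisions were encoded numerically.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder scores: 1 means "report matches a ground-truth bug", 0 means it does not.
    critic_scores = [1, 0, 1, 1, 0, 0, 1, 0]
    human_scores = [1, 0, 1, 0, 0, 0, 1, 0]
    print(round(pearson(critic_scores, human_scores), 3))
```

On Python 3.10 and later, `statistics.correlation` from the standard library computes the same coefficient.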

Case Study: Autonomous Closed-Loop Development

A closed-loop case study demonstrates the feasibility of integrating GBQA’s discovery agent with a code repair agent (Claude Code), iteratively achieving 100% discovery and fix rates across three sessions on a representative environment. This highlights the viability of fully autonomous agentic development cycles, albeit contingent on more robust upstream discovery.

Implications and Future Directions

The introduction of GBQA exposes key research gaps in autonomous system-level QA. Practically, it provides a rigorous environment for experimentation on agentic strategies for systematic exploration, expectation inference, and specification-based anomaly detection. Theoretically, it sets an agenda for the development of memory-augmented agents, efficient credit assignment across interactions, and RL approaches attuned to QA task structure.

Looking forward, clear next steps are scaling GBQA beyond games to broader, multimodal real-world applications and augmenting LLMs with specialized QA-oriented RL or structured hypothesis-testing priors. The observed performance ceilings suggest that meaningful advances will require not just larger LLMs but also new agentic architectures and training regimes that specifically target open-ended bug discovery and long-horizon inference.

Figure 5: The Game Environment Builder’s architecture, showing the Producer Agent orchestrating modular, multi-agent development, enabling rapid, scalable benchmark generation with controllable bug complexity.

Conclusion

GBQA advances the evaluation of LLMs from guided code synthesis and repair toward genuinely autonomous software quality assurance. Substantial performance gaps with repair tasks underscore the open challenges in agentic bug discovery, particularly for dynamic, state-rich, and temporally entangled errors typical of complex software systems. The benchmark’s design, results, and accompanying analysis provide a principled foundation for the next generation of QA-centric reasoning agents and their integration into autonomous development pipelines.
