Measuring Progress Toward AGI: A Cognitive Framework

Published 27 May 2026 in cs.AI | (2605.28405v1)

Abstract: Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims, makes it difficult to track progress, and risks hindering responsible governance. As a starting point to address this gap, we present a framework for understanding system capabilities in relation to human cognitive abilities. Drawing from decades of research in psychology, neuroscience, and cognitive science, we introduce a Cognitive Taxonomy that deconstructs general intelligence into 10 key cognitive faculties. We then propose a rigorous evaluation protocol in which a system's performance is measured across a suite of targeted, held-out cognitive tasks, generating a 'cognitive profile' that can be used to understand a system's strengths and weaknesses. We hope this framework will provide a practical roadmap and an initial step toward more rigorous, empirical evaluation of AGI.

Authors (13)

Summary

The paper proposes a novel multidimensional evaluation protocol that uses ten cognitive faculties to benchmark AI against human performance.
It employs systematic cognitive tasks, human baseline comparisons, and integration strategies to ensure robust scoring and uncertainty management.
The study highlights the need for evolving taxonomies and high-quality, private evaluation suites to effectively address current benchmarking gaps.

Motivation and Background

The paper "Measuring Progress Toward AGI: A Cognitive Framework" (2605.28405) addresses the persistent ambiguity in defining and measuring progress toward artificial general intelligence. Rather than relying on ad hoc or task-specific benchmarks, the authors advocate for an empirically grounded, multidimensional evaluation protocol rooted in established cognitive science, psychology, and neuroscience principles. The impetus is to provide clarity for researchers and policymakers, enabling nuanced assessment and robust governance of ever-advancing AI systems by mapping their capabilities to human cognition.

The Cognitive Taxonomy: Structure and Rationale

The authors introduce a cognitive taxonomy comprising ten faculties, each demarcating a core component of general intelligence as observed in humans:

Perception: Extraction and processing of sensory information across modalities.
Generation: Production of outputs, from language to motor actions.
Attention: Allocation of cognitive resources for relevant stimuli or tasks.
Learning: Acquisition of novel knowledge, skills, or behaviors.
Memory: Persistent storage and retrieval of information.
Reasoning: Inference and logical processing culminating in valid conclusions.
Metacognition: Self-monitoring and regulation of cognitive processes.
Executive Functions: Goal-directed organization, planning, inhibition, flexibility.
Problem Solving (Composite): Efficient application and integration of faculties in diverse domain-specific contexts.
Social Cognition (Composite): Processing and interpretation of social cues, norms, and interactions.

Each faculty is rigorously specified with subordinate abilities and modal variations (e.g., low/high-level perception, domain knowledge, various reasoning styles), ensuring comprehensive coverage and fine-grained diagnostic utility. Importantly, the taxonomy is implementation-agnostic, focusing on observable behavior and task competence rather than internal mechanisms or architectural traditions.

Figure 1: Overview of the 10 cognitive faculties. Faculties outlined in orange represent composite faculties.

Evaluation Protocol: Methodology for Cognitive Benchmarking

To operationalize AGI assessment, the paper prescribes a three-stage evaluation protocol:

Systematic Cognitive Task Suite: AI systems are measured on held-out, independently verified tasks designed to probe each cognitive faculty distinctly. Tasks exhibit structural diversity and span a wide range of human difficulty. Contamination resistance is prioritized to avoid training/test leakage.
Human Baseline Construction: A demographically representative adult cohort completes the identical suite of tasks. Baselines control for external resource access and instructional parity, encompassing the full spectrum of human cognitive ability.
Cognitive Profiling: AI and human results are mapped onto a faculty-wise performance distribution, enabling identification of strengths, weaknesses, and outlier profiles relative to human benchmarks. Multiple integration strategies (e.g., item aggregation, IRT models) are discussed for robust scoring. Uncertainties stemming from task quality, construct validity, and stochasticity are explicitly considered.

Figure 2: Cognitive profiles for three hypothetical systems, representing various achievement levels relative to the human sample median and maximum.
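
The profiling stage can be illustrated with a minimal sketch. It assumes the simplest integration strategy mentioned above (item aggregation by averaging) and places the system's score within the human sample as a percentile. The function names, the data layout, and the toy scores are hypothetical and do not come from the paper; an IRT-based scoring model would replace `faculty_score`.

```python
from bisect import bisect_right
from statistics import mean


def faculty_score(item_scores):
    """Item aggregation (assumed here): mean of per-task scores in [0, 1]."""
    return mean(item_scores)


def human_percentile(system_score, human_scores):
    """Percentage of the human sample scoring at or below the system."""
    ranked = sorted(human_scores)
    return 100.0 * bisect_right(ranked, system_score) / len(ranked)


def cognitive_profile(system_items, human_items):
    """Map each faculty to (system score, percentile within the human sample).

    system_items: {faculty: [item score, ...]}
    human_items:  {faculty: [[item score, ...] for each participant]}
    """
    profile = {}
    for faculty, items in system_items.items():
        score = faculty_score(items)
        humans = [faculty_score(p) for p in human_items[faculty]]
        profile[faculty] = (score, human_percentile(score, humans))
    return profile


# Toy data, invented for illustration only.
system_items = {"Reasoning": [1, 1, 0, 1], "Attention": [0, 1, 0, 0]}
human_items = {
    "Reasoning": [[1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]],
    "Attention": [[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 0], [1, 0, 0, 0]],
}
profile = cognitive_profile(system_items, human_items)
for faculty, (score, pct) in profile.items():
    print(f"{faculty}: score={score:.2f}, human percentile={pct:.0f}")
```

Reporting one (score, percentile) pair per faculty, rather than a single aggregate, keeps the jagged shape of the profile visible.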

Analysis and Discussion

Benchmarking Gaps and Future Requirements

While partial benchmark coverage exists (notably in perception, problem solving, and world knowledge), significant lacunae remain for metacognition, attention, learning, and social cognition. Many existing datasets are public, which increasingly undermines their relevance due to contamination. The framework’s practical realization thus hinges on continued creation of high-quality, private, and independently audited evaluation suites.

Beyond Faculties: Supplementary Metrics

The authors note that cognitive benchmarking is necessary but not sufficient for full AGI characterization. Complementary considerations include:

Processing and Response Speed: Timeliness as a critical axis for real-world utility, decoupled from correctness.
System Propensities: Behavioral tendencies (risk, alignment, strategy, communication) impacting safety and reliability.
Creativity: Although difficult to isolate, it can be evaluated by proxy through related aspects such as cognitive flexibility, world knowledge, and problem solving.
Deployment Evaluations: End-to-end empirical studies remain essential for domain-specific utility and impact forecasting.

Model vs. System Evaluation

The framework advises evaluating entire AI systems—including tools, modules, and environmental interfaces—rather than just core models, aligning with real-world deployment realities. Modularity is recognized as intrinsic to both biological and artificial intelligence. However, this raises methodological challenges regarding tool-access parity and interpretability for cognitive task construction.

Taxonomy Iteration and Emergent Capabilities

The taxonomy is neither prescriptive nor exhaustive; the anticipated emergence of novel AI faculties will require iterative refinement. The practical relevance of each faculty for real-world tasks is not yet fully established, which motivates future empirical validation work. The framework is posited as a foundation for a rigorous, evolving science of AGI.

Numerical Results and Claims

The paper’s methodology enables empirical differentiation of systems that:

Score below the human median in select faculties (signaling real-world limitations).
Exceed the human median across all faculties (performing at least as well as half of the sampled humans on each faculty).
Approach or exceed the human maximum for all faculties (indicative of, but not definitive evidence for, “superhuman” cognitive generality).

These faculty-by-faculty profiles afford nuanced, multidimensional system characterization rather than binary or monolithic “AGI/not-AGI” status.
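
The three cases above can be expressed as a small classification sketch. It assumes one aggregate score per faculty for the system and a list of per-participant scores for the human sample; the tier labels, the tie handling, and the toy numbers are illustrative assumptions, not definitions from the paper.

```python
from statistics import median


def classify_profile(system_scores, human_scores):
    """Return (tier, lagging faculties) for a faculty-wise profile.

    system_scores: {faculty: score}
    human_scores:  {faculty: [score for each human participant]}
    A score equal to the median is treated as not exceeding it (an assumption).
    """
    lagging = sorted(
        f for f, s in system_scores.items() if s <= median(human_scores[f])
    )
    at_max = all(s >= max(human_scores[f]) for f, s in system_scores.items())
    if at_max:
        tier = "at or above human maximum on all faculties"
    elif not lagging:
        tier = "above human median on all faculties"
    else:
        tier = "at or below human median on some faculties"
    return tier, lagging


# Toy data, invented for illustration only.
human = {"Memory": [0.4, 0.6, 0.8], "Learning": [0.3, 0.5, 0.9]}
system_a = {"Memory": 0.5, "Learning": 0.7}
system_b = {"Memory": 0.7, "Learning": 0.6}
system_c = {"Memory": 0.8, "Learning": 0.95}

for name, scores in [("A", system_a), ("B", system_b), ("C", system_c)]:
    print(name, classify_profile(scores, human))
```

Returning the lagging faculties alongside the tier preserves the diagnostic detail that a single label would hide.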

Implications and Future Directions

Practically, the proposed framework sets the stage for transparent progress tracking and comparative evaluation of AGI candidates. It facilitates objective communication between technical stakeholders and policymakers, which is essential for responsible governance and deployment. Theoretically, mapping the jagged profile of AI cognition to the human spectrum stimulates both foundational research and ongoing taxonomy expansion.

Future developments will likely involve:

Closing benchmark gaps, particularly for social cognition and metacognition.
Enhanced statistical modeling for performance integration and uncertainty quantification.
Incorporation of emergent, non-human faculties and hybrid intelligence paradigms.
Refined benchmarks for creativity, behavioral propensities, and deployment-specific workflows.
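
As one possible starting point for the uncertainty quantification mentioned above, the following sketch computes a percentile bootstrap confidence interval for a faculty score by resampling task items. The paper does not prescribe this method; the function name, parameters, and toy scores are assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(item_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean item score.

    Resamples items with replacement, so it reflects uncertainty from the
    finite task sample only, not from task quality or construct validity.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(item_scores)
    means = sorted(
        mean(rng.choices(item_scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data: 14 of 20 hypothetical tasks solved.
scores = [1] * 14 + [0] * 6
point = mean(scores)
lo, hi = bootstrap_ci(scores)
print(f"score={point:.2f}, 95% interval=({lo:.2f}, {hi:.2f})")
```

Run-to-run stochasticity of the system itself could be handled in the same way by resampling over repeated runs instead of items.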

Conclusion

"Measuring Progress Toward AGI: A Cognitive Framework" lays out an empirically motivated roadmap for AGI evaluation, grounded in human cognitive science. By decomposing intelligence into ten faculties and prescribing a three-stage evaluation protocol, the framework offers actionable tools for benchmarking AI progress, contextualizing claims, and guiding responsible advancement. The approach is extensible and anchored to human baselines, providing a scaffold for scientific inquiry and technological oversight in advanced AI.
