CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge

Published 20 Apr 2025 in cs.CL | (2504.14462v1)

Abstract: The rise of LLMs has redefined the AI landscape, particularly due to their ability to encode factual and commonsense knowledge, and their outstanding performance in tasks requiring reasoning. Despite these advances, hallucinations and reasoning errors remain a significant barrier to their deployment in high-stakes settings. In this work, we observe that even the most prominent LLMs, such as OpenAI-o1, suffer from high rates of reasoning errors and hallucinations on tasks requiring commonsense reasoning over obscure, long-tail entities. To investigate this limitation, we present a new dataset for Commonsense reasoning over Long-Tail entities (CoLoTa), that consists of 3,300 queries from question answering and claim verification tasks and covers a diverse range of commonsense reasoning skills. We remark that CoLoTa can also serve as a Knowledge Graph Question Answering (KGQA) dataset since the support of knowledge required to answer its queries is present in the Wikidata knowledge graph. However, as opposed to existing KGQA benchmarks that merely focus on factoid questions, our CoLoTa queries also require commonsense reasoning. Our experiments with strong LLM-based KGQA methodologies indicate their severe inability to answer queries involving commonsense reasoning. Hence, we propose CoLoTa as a novel benchmark for assessing both (i) LLM commonsense reasoning capabilities and their robustness to hallucinations on long-tail entities and (ii) the commonsense reasoning capabilities of KGQA methods.



Summary

Analysis of CoLoTa: A Dataset for Entity-Based Commonsense Reasoning over Long-Tail Knowledge

The paper, "CoLoTa: A Dataset for Entity-Based Commonsense Reasoning over Long-Tail Knowledge," addresses a significant challenge in the domain of LLMs: their limited capability to perform commonsense reasoning over obscure, long-tail entities. This research introduces the CoLoTa dataset, designed to evaluate the reasoning abilities of LLMs and their resilience to hallucinations in long-tail contexts.

Objectives and Contributions

CoLoTa is designed to assess two things: (i) the ability of LLMs to perform commonsense reasoning over long-tail knowledge, along with their robustness to hallucinations, and (ii) the ability of Knowledge Graph Question Answering (KGQA) methods to combine factual retrieval with commonsense reasoning. The dataset comprises 3,300 queries, split evenly between question answering and claim verification tasks, and covers a diverse range of commonsense reasoning skills.

CoLoTa differs from existing benchmarks in its focus on long-tail entities rather than well-documented subjects. The dataset is built by replacing popular entities in existing queries with obscure counterparts while ensuring that the facts needed to answer each query are present in Wikidata, which lets it double as a KGQA dataset.

Methodology

The construction of CoLoTa has three main aspects:

Query Selection and Rewriting: Queries were sourced from two existing datasets, StrategyQA and CREAK, keeping only those whose supporting facts are available in Wikidata. Each selected query was then rewritten by replacing its popular entities with lesser-known ones, which makes accurate reasoning harder for LLMs.
Annotation and Representation: Each query in CoLoTa is annotated with relevant Wikidata entities, inference rules, and reasoning steps. This framework ensures a structured approach to evaluate both factual accuracy and logical coherence in response generation.
Reasoning Skills Distribution: The dataset is diverse, encompassing broad reasoning skills such as temporal reasoning, numeric comparison, and domain-specific knowledge (e.g., historical, geographical).
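
To make the annotation scheme described above more concrete, the following is a minimal sketch of how one annotated query might be represented in Python. The class name, field names, entity labels, and QID placeholders are all assumptions chosen for illustration; they are not taken from the CoLoTa release, and the example query is invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List

VALID_TASKS = {"question_answering", "claim_verification"}


@dataclass
class AnnotatedQuery:
    """Hypothetical record layout; not the actual CoLoTa schema."""

    text: str                      # the query or claim in natural language
    task: str                      # one of VALID_TASKS
    answer: bool                   # gold label (yes/no or true/false)
    entities: Dict[str, str]       # entity mention -> Wikidata QID
    inference_rules: List[str] = field(default_factory=list)
    reasoning_steps: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the record is structurally incomplete."""
        if self.task not in VALID_TASKS:
            raise ValueError(f"unknown task: {self.task!r}")
        if not self.entities:
            raise ValueError("a query must mention at least one entity")
        if not self.reasoning_steps:
            raise ValueError("a query must list its reasoning steps")


# Invented example with placeholder QIDs (not real Wikidata identifiers).
example = AnnotatedQuery(
    text="Person A witnessed Event B.",
    task="claim_verification",
    answer=False,
    entities={"Person A": "Q_PLACEHOLDER_1", "Event B": "Q_PLACEHOLDER_2"},
    inference_rules=["A person cannot witness an event that occurs after their death."],
    reasoning_steps=[
        "Look up the date of death of Person A.",
        "Look up the date of Event B.",
        "Compare the two dates and apply the inference rule.",
    ],
)
example.validate()
```

Keeping the facts (entities), the commonsense knowledge (inference rules), and the reasoning steps in separate fields mirrors the summary's point that factual accuracy and logical coherence are evaluated separately.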

Findings and Implications

Experiments with state-of-the-art LLMs, including OpenAI models such as OpenAI-o1, and with LLM-based KGQA methods show the following:

Performance Gaps: Accuracy declines significantly when models move from the original queries to CoLoTa's long-tail equivalents. The decline is pronounced in claim verification tasks, underscoring how much harder reasoning becomes for less-documented entities.
Limitations of KGQA Methods: Current KGQA techniques underperform on CoLoTa's queries, indicating that these methods need to better integrate commonsense reasoning with KG-based factual retrieval.
Hallucination and Reasoning Errors: LLMs show higher hallucination rates and more reasoning errors in long-tail contexts, which points to the need for models that can identify relevant facts and logical relationships even when an entity is sparsely documented.
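
The performance gap described above can be measured as the difference in accuracy between the original queries and their long-tail rewrites. The sketch below assumes boolean predictions and gold labels; the function names are hypothetical and the toy values are for illustration only, not results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def accuracy_drop(
    original_preds: Sequence[bool],
    original_gold: Sequence[bool],
    long_tail_preds: Sequence[bool],
    long_tail_gold: Sequence[bool],
) -> float:
    """Accuracy on the original queries minus accuracy on the long-tail rewrites."""
    return accuracy(original_preds, original_gold) - accuracy(long_tail_preds, long_tail_gold)


# Toy values only; these are not numbers reported for CoLoTa.
original_gold = [True, False, True, True]
original_preds = [True, False, True, False]    # 3 of 4 correct
long_tail_gold = [True, False, False, True]
long_tail_preds = [False, False, True, True]   # 2 of 4 correct

drop = accuracy_drop(original_preds, original_gold, long_tail_preds, long_tail_gold)
print(f"accuracy drop: {drop:.2f}")  # accuracy drop: 0.25
```

A positive value means the model does worse on the long-tail version of the queries than on the originals.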

Conclusion and Future Directions

The CoLoTa dataset is a benchmark that exposes the weaknesses of both LLMs and KGQA methods when reasoning over long-tail knowledge. Its coverage of diverse reasoning skills and obscure entities gives future research a testbed for making LLM reasoning more robust across contexts.

Further work could explore whether few-shot or zero-shot techniques improve model performance on long-tail knowledge. Developing KGQA methods that combine factual retrieval with commonsense reasoning is another promising direction.
