UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval

Published 17 Apr 2026 in cs.IR and cs.CL | (2604.15827v2)

Abstract: Conventional information retrieval is concerned with identifying the relevance of texts for a given query. Yet, the conventional definition of relevance is dominated by aspects of similarity in texts, leaving unobserved whether the text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity), but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts labeling whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. However, UsefulBench presents a dataset challenge for targeted information retrieval systems.


Authors (6)

Summary

The paper introduces the UsefulBench benchmark, which annotates texts with both relevance and decision usefulness scores to guide IR evaluation.
It reveals that classical IR models excel in retrieving relevant content, while LLM-driven ranking offers improvements in usefulness with notable limitations.
The analysis highlights failure modes and demonstrates that targeted interventions like few-shot prompting and fine-tuning partially bridge the expert knowledge gap.

Decision-Useful Information as a Target for Information Retrieval: An Expert Analysis of the UsefulBench Benchmark

Introduction

"UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval" (2604.15827) presents a pivotal advancement for the evaluation and development of information retrieval (IR) systems—particularly in knowledge-intensive domains such as sustainability analysis. While conventional IR primarily rewards retrieval based on lexical and semantic relevance, this work delineates and operationalizes the distinction between relevance (thematic/contextual matching) and usefulness (direct practical value for decision-making or question answering). The authors formalize this dichotomy via the UsefulBench benchmark, which captures expert annotations for both labels in a fine-grained graded manner.

Dataset Construction and Annotation Schema

The UsefulBench dataset is the result of an expert-driven labeling process involving three domain professionals in sustainability analysis. For each of 64 queries across 15 real-world sustainability reports, analysts annotate text passages for both their relevance (to the query) and their usefulness (for supporting concrete answers or decisions). Crucially, labels are assigned at three levels each (0: not relevant/useful, 1: partially, 2: fully), and annotators consolidate their judgments through consensus discussion.
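As a rough illustration of this schema, one labeled pair could be represented as below. This is a minimal sketch, not the released data format: the class name `LabeledPassage` and all field names are hypothetical, and the example grades are chosen only to mirror the Paris/Berlin case from the abstract.

```python
from dataclasses import dataclass

# Three-level grading scale described in the text (applies to both labels).
GRADES = {0: "not", 1: "partially", 2: "fully"}


@dataclass(frozen=True)
class LabeledPassage:
    """One annotated (report, query, passage) record.

    Hypothetical layout for illustration; the field names are assumptions,
    not the schema of the released dataset.
    """
    report_id: str
    query_id: str
    passage: str
    relevance: int   # 0, 1, or 2
    usefulness: int  # 0, 1, or 2

    def __post_init__(self):
        # Reject grades outside the three-level scale.
        for name in ("relevance", "usefulness"):
            value = getattr(self, name)
            if value not in GRADES:
                raise ValueError(f"{name} must be 0, 1, or 2, got {value!r}")


# Illustrative grades only: a passage can match the query topic
# without helping to answer it.
example = LabeledPassage(
    report_id="report-01",
    query_id="q-paris-vs-berlin",
    passage="Paris is the capital of France.",
    relevance=2,
    usefulness=0,
)
print(GRADES[example.relevance], "relevant,", GRADES[example.usefulness], "useful")
```

Keeping the two grades as separate fields, rather than one combined score, is what allows the same passages to be evaluated against either target.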

Figure 1: The UsefulBench creation pipeline, depicting the iterative, consensus-based expert annotation workflow for labeling relevance and usefulness.

Two main datasets are released: a gold set (UsefulBench-gold) with 1,061 fine-grained human-annotated (query, document) pairs, and a substantially larger set (UsefulBench-full) with 53,000+ triplets at the (report, query, document) level for retrieval evaluation.

Empirical Distinction Between Relevance and Usefulness

Analysis of the label distribution shows that relevance and usefulness are strongly correlated but also diverge frequently and consequentially. Approximately 22% of the data are maximally relevant but only partially useful: these passages often consist of contextual or explanatory content that lacks actionable details or answers. Purely relevance-driven retrieval signals are therefore inadequate in settings where decision utility is critical.
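A divergence share of this kind can be read off the joint distribution of the two labels. The sketch below assumes each pair is available as a `(relevance, usefulness)` tuple; the function name `joint_distribution` is hypothetical and the toy labels are invented for illustration, so the resulting share is not the paper's figure.

```python
from collections import Counter


def joint_distribution(pairs):
    """Return the share of each (relevance, usefulness) combination."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    counts = Counter(pairs)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}


# Toy labels, invented for illustration only.
toy_pairs = [
    (2, 2), (2, 1), (2, 1), (1, 0), (0, 0),
    (1, 1), (2, 2), (0, 0), (1, 1), (2, 1),
]

shares = joint_distribution(toy_pairs)

# Share of pairs that are fully relevant (2) but only partially useful (1).
divergent = shares.get((2, 1), 0.0)
print(f"fully relevant, partially useful: {divergent:.0%}")
```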

Benchmarking Classical and LLM-based IR Models

The authors benchmark both classical (BM25, dense embedding models such as BGE-M3, and hybrid reranking approaches) and LLM-based document ranking systems on UsefulBench. Key findings include:

Classical IR models (BM25, BGE-M3): These excel at retrieving relevant content, and their scores on relevance are substantially higher than their scores on usefulness, confirming that similarity metrics are a poor proxy for actionable usefulness.
LLM-based ranking (GPT-4.1 models): When prompted specifically to predict usefulness, LLMs surpass classical models on usefulness ranking but still lag on relevance labeling. However, the usefulness gains plateau quickly as model size and capacity increase.
Figure 2: Comparative nDCG@10 results for embedding-based vs. LLM-based document rankings; the former excels at relevance, the latter at usefulness.

The analysis shows that while scaling LLM parameters improves calibration and fine-grained discrimination, utility-based retrieval faces an early upper bound, indicating inherent limitations of current models in expert-level domain understanding.
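To make the nDCG@10 comparison concrete, the sketch below scores one fixed ranking against two label sets. It assumes the linear-gain form of DCG (the exponential-gain variant is also common, and which one the paper uses is not stated here), and the toy grades are invented for illustration.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# One ranking of five passages, graded twice (toy labels, not paper data).
# Position i in both lists refers to the same retrieved passage.
relevance_grades = [2, 2, 1, 1, 0]
usefulness_grades = [1, 0, 2, 0, 0]

ndcg_relevance = ndcg_at_k(relevance_grades)
ndcg_usefulness = ndcg_at_k(usefulness_grades)

print(f"nDCG@10 against relevance:  {ndcg_relevance:.3f}")
print(f"nDCG@10 against usefulness: {ndcg_usefulness:.3f}")
```

In this toy case the ranking is ideal with respect to relevance but places the only fully useful passage third, so the same output scores lower once usefulness is the target.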

Failure Modes and Error Analysis

A qualitative misclassification study reveals that a majority of LLM errors are true model limitations—cases where expert human interpretation is required that current LLMs fail to emulate. Approximately a third of disagreements stem from ambiguous or under-specified query descriptions that require context-specific expertise, and less than 10% result from annotation errors. This substantiates that domain knowledge remains the primary bottleneck in decision-useful IR.

Strategies to Overcome Expert Knowledge Integration Gap

Extensive ablation studies examine interventions to close the gap:

Joint multitask prompting (predicting both relevance and usefulness in a shared prompt) yields minor efficiency and performance gains.
Specialized descriptions (annotating with explicit reference to actions, targets, solutions) and keyword suppression (removing lexical cues from prompts) do not robustly improve utility ranking.
Figure 3: Variation in F1 scores across prompt ablations, with modest differences between multitask, specialized, and keyword-free formulations.
Few-shot prompting, especially with in-domain similar examples, slightly enhances F1 at the expense of calibration error, primarily benefiting usefulness prediction.
Supervised fine-tuning of smaller (3b/8b/14b) open LLMs on gold labels produces the largest gains in usefulness F1 (e.g., ministral-8b from 0.434 to 0.639), but again at some cost to probability calibration.
Figure 4: F1 score improvements for Ministral models (3b/8b/14b) after few-epoch fine-tuning, with strongest gains for usefulness classification.

These results emphasize that scalable approaches to imbue decision-useful expertise—such as fine-tuning and example-driven prompting—offer only partial remediation. Theoretical limitations in aligning generic LLMs to nuanced expert judgments persist.
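The trade-off between F1 and calibration can be tracked with two small metrics. The sketch below is written under assumptions: it uses binary F1 for a "useful vs. not useful" decision and expected calibration error (ECE) with equal-width confidence bins, which may differ from the exact metric definitions in the paper. All inputs are toy values.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Toy predictions (1 = useful, 0 = not useful), invented for illustration.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
f1 = f1_score(y_true, y_pred)

# A model that reports 0.9 confidence but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])

print(f"F1 = {f1:.3f}, ECE = {ece:.3f}")
```

Reporting both numbers side by side makes it visible when an intervention raises F1 while the model's stated confidence drifts away from its actual accuracy.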

Results Synthesis and Implications

From a system design perspective, the results strongly caution against using relevance labels as a stand-in for document usefulness in knowledge-intensive retrieval settings including, but not limited to, sustainability reporting, policy analysis, or high-stakes QA. LLM-based classifiers, though better than similarity-based models, are not sufficient without explicit exposure to expert rationales, granular examples, or domain adaptation.

Theoretically, the findings affirm the results of recent literature on the limitations of embedding-based retrieval for utility-focused applications, and reinforce concerns about the LLM-as-judge paradigm in expert evaluation settings [Szymanski2025, zhang-etal-2025-utility]. This underscores the necessity of further research into in-context knowledge injection, integration of retrieval and reasoning over structured expert knowledge, and improved annotation schemes for utility.

Directions for Future Research

Potential future progress includes:

Domain transfer and expansion: The usefulness/relevance dichotomy must be studied in other verticals (e.g., law, medicine, finance) to assess generality.
Active and adaptive annotation: Semi-automated selection and labeling pipelines, supported by active query generation and expert-in-the-loop designs, are prime candidates for increasing the annotation throughput and diversity.
Utility-oriented retrieval architectures: Incorporating structured and contextual signals, logical entailment, and causal/event-chaining into rankers can more directly target decision utility.
Hybrid symbolic-neural models: The integration of explicit, programmatic domain ontologies with generative/neural architectures could address some of the interpretational ambiguity and experiential knowledge missing from current LLMs.
Downstream QA/RAG evaluation: Extending the evaluation from ranking static passages to measuring actual answer accuracy, faithfulness, and calibration in generative scenarios will provide a more direct assessment of practical benefit.

Conclusion

"UsefulBench" provides a rigorous, expert-annotated benchmark separating relevance and decision-useful information in the IR pipeline. The work demonstrates that, while modern LLMs partially bridge the gap beyond classical IR systems, high-stakes utility and actionable retrieval remain a frontier for both annotation schemes and model architectures. Addressing the expert knowledge deficit in IR and downstream generative tasks is a foundational challenge for information-reliant AI systems. The dataset, methodology, and findings of this paper constitute a substantive resource for future research in utility-centric retrieval.
