Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models

Published 28 Apr 2026 in cs.AI | (2604.25149v1)

Abstract: LLMs deployed for natural-language querying of analytical databases suffer from two intertwined failures - incorrect answers and confident hallucinations - both rooted in the same cause: the model is forced to infer business semantics that the schema does not encode. We test whether supplying those semantics as context closes the gap. We benchmark three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 natural-language questions over the Cleaned Contoso Retail Dataset in ClickHouse, using a paired single-shot protocol. Each model is evaluated twice: once given only the warehouse schema, and once given the schema plus a 4 KB hand-authored markdown document describing the dataset's measures, conventions, and disambiguation rules. Adding the document improves accuracy by +17 to +23 percentage points across all three models. With it, the three models are statistically indistinguishable (67.7-68.7%); without it, they are also indistinguishable (45.5-50.5%). Every cross-cluster comparison is significant at p < 0.01. The presence of the semantic-layer document accounts for essentially all of the significant variance; model choice within tier does not. We interpret this as a structural result: explicit business semantics suppress the dominant class of text-to-SQL errors not by making the model more capable, but by changing what the model is being asked to do.


Authors (2)

Summary

The paper demonstrates that providing curated semantic-layer documentation boosts first-shot accuracy by 17 to 23 percentage points across all three LLMs tested.
The study employs a controlled paired comparison on a retail dataset, highlighting the impact of explicit business semantics on analytic reliability.
The paper shows that explicit semantic context decisively reduces schema-linking errors and hallucinations compared to raw schema inputs.

Semantic Layers for Reliable LLM-Powered Data Analytics: An Expert Analysis

Problem Overview and Motivation

LLM-backed natural-language database interfaces exhibit two intertwined failure modes: incorrect answers (accuracy failures) and confident hallucinations, that is, plausible but wrong outputs presented without any signal of uncertainty. Both arise from the same structural source: the LLM must infer essential business semantics from a schema that carries no explicit information about business conventions, metric definitions, or dataset idiosyncrasies. This study quantifies how far semantic-layer grounding mitigates these failures, benchmarking three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on a realistic analytics dataset under a controlled paired-comparison protocol.

Experimental Framework

The experimental setup uses the Cleaned Contoso Retail Dataset (25 tables, multiple fact tables, non-trivial conventions and domains) in ClickHouse. A set of 100 natural-language questions of systematically varied complexity targets scenarios that induce real-world failures: ambiguous metric definitions, ambiguous dimensional references, snapshot-versus-flow disambiguation, and other practical challenges. Each model is evaluated under two conditions: (1) with only the warehouse schema (raw), and (2) with the schema plus a 4 KB hand-curated semantic-layer document in markdown covering fact-table selection criteria, measure formulae, conventions, and disambiguation rules. Answers are graded by a blinded LLM judge that is held constant across conditions and applies strict and relaxed pass/fail rubrics. Statistical significance is assessed with exact McNemar tests.
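The exact McNemar test used for the paired comparison can be sketched in a few lines. This is a minimal illustration, not the authors' harness: the function names are hypothetical, and the pass/fail vectors in the usage lines are toy values, not data from the paper. The test looks only at discordant questions (those that pass in one condition and fail in the other) and asks how surprising their split would be under a fair coin.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(raw: Sequence[bool], sem: Sequence[bool]) -> Tuple[int, int]:
    """Count questions passed in only one of the two paired conditions."""
    b = sum(1 for r, s in zip(raw, sem) if r and not s)  # raw only
    c = sum(1 for r, s in zip(raw, sem) if s and not r)  # semantic layer only
    return b, c


def exact_mcnemar(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant pair falls on either
    side with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy pass/fail outcomes for six questions (illustrative only).
raw_pass = [True, False, False, True, False, False]
sem_pass = [True, True, True, False, True, True]
b, c = discordant_counts(raw_pass, sem_pass)
print(b, c, exact_mcnemar(b, c))
```

Questions that pass or fail in both conditions do not enter the statistic, which is why the paired design stays informative even at 100 questions.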

Key Results

The principal finding is a consistent gain of +17 to +23 percentage points in first-shot analytical accuracy across all models when the semantic-layer document is provided: Opus 4.7 rises from 50.5% to 67.7%, Sonnet 4.6 from 46.5% to 68.7%, and GPT-5.4 from 45.5% to 68.7%. These gains are statistically robust (p ≤ 0.0015, paired), and every cross-cluster comparison (raw vs. semantic-layer) is significant at p < 0.01. Notably, within each condition the three models are statistically indistinguishable (differences of at most 5 pp, not significant), which indicates that semantic context, not model selection, is the dominant driver of performance variance.
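The arithmetic behind these figures is worth making explicit, because percentage-point gains and relative gains are easy to conflate. The sketch below uses only the accuracies reported above; the helper name `gains` is an assumption introduced for illustration.

```python
# Reported first-shot accuracies in percent: (raw schema, with semantic layer).
reported = {
    "Claude Opus 4.7": (50.5, 67.7),
    "Claude Sonnet 4.6": (46.5, 68.7),
    "GPT-5.4": (45.5, 68.7),
}


def gains(raw_pct: float, sem_pct: float) -> tuple:
    """Return (percentage-point gain, relative gain in percent)."""
    pp = sem_pct - raw_pct
    rel = 100 * pp / raw_pct
    return round(pp, 1), round(rel, 1)


for model, (raw, sem) in reported.items():
    pp, rel = gains(raw, sem)
    print(f"{model}: +{pp} pp ({rel}% relative)")

# Spread between best and worst model within each condition, in pp.
raw_spread = round(max(r for r, _ in reported.values())
                   - min(r for r, _ in reported.values()), 1)
sem_spread = round(max(s for _, s in reported.values())
                   - min(s for _, s in reported.values()), 1)
print(raw_spread, sem_spread)
```

The within-condition spreads (5.0 pp raw, 1.0 pp with the document) are small next to the 17.2 to 23.2 pp gains between conditions, which is the comparison the paragraph above relies on.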

This structural finding, that supplying business semantics as context yields larger gains than changing LLM family or model, aligns with independent paired-benchmark results from BIRD, data.world, BEAVER, AtScale, dbt Labs, Databricks Genie, and Snowflake Cortex, as well as ontology-based approaches (Sequeda et al., 2023, Allemang et al., 2024). The effect is observed consistently across diverse datasets and domains, and is robust both to the provenance of the context (hand-authored, automatically generated, ontology-grounded) and to the choice of underlying LLM.

Mechanistic Analysis

Semantic layers function by converting ill-posed inference into constrained lookup:

Schema linking is replaced by explicit mapping. Without context, LLMs frequently mis-select fact tables and dimension keys, collapse to incorrect or heuristic formulas, and fail at non-obvious conventions (e.g., string-encoded booleans, sentinel keys).
Business logic errors are pre-empted. Metrics calculated with dataset-specific exclusions or non-standard aggregations (inventory, gross margin, return rate) are reproducibly correct only when the computation pathway is provided.
Ambiguity is deterministically resolved. Disambiguation rules (e.g., time anchoring against the latest available data, region disambiguation) eliminate silent hallucinations—the class of errors most dangerous in business practice due to invisible propagation.
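The two evaluation conditions differ only in whether the semantic-layer document is placed in the prompt, which can be sketched as follows. Everything concrete here is an assumption made for illustration: the excerpt is not the paper's document, the table and column names are invented, and `build_prompt` is a hypothetical helper rather than the authors' harness.

```python
from typing import Optional

# Hypothetical excerpt in the spirit of the mechanisms listed above.
# Measure, table, and column names are invented for illustration.
SEMANTIC_DOC = """\
## Measures
- gross_margin = sum(sales_amount) - sum(total_cost), from the sales fact table.

## Conventions
- Boolean flags are stored as the strings 'Yes' and 'No', not 0/1.
- Key -1 is a sentinel for "unknown"; exclude it from rankings.

## Disambiguation
- "Last month" means the latest month present in the data.
"""


def build_prompt(schema_ddl: str, question: str,
                 semantic_doc: Optional[str] = None) -> str:
    """Assemble a single-shot prompt; the document is the only variable."""
    parts = ["Write one ClickHouse SQL query that answers the question.",
             "# Schema", schema_ddl.strip()]
    if semantic_doc:
        parts += ["# Semantic layer", semantic_doc.strip()]
    parts += ["# Question", question.strip()]
    return "\n\n".join(parts)


schema = "CREATE TABLE sales (store_key Int32, sales_amount Float64, total_cost Float64)"
question = "What was gross margin last month?"
raw_prompt = build_prompt(schema, question)
sem_prompt = build_prompt(schema, question, SEMANTIC_DOC)
```

With the document present, the model can look up the margin formula and the meaning of "last month" instead of guessing them from column names.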

Empirical error analysis reveals that semantic-layer grounding primarily suppresses schema-linking and business-logic misinterpretation: more than 80% of execution failures in raw-schema mode stem from these two categories, consistent with large-scale error analyses (Shen et al., 16 Jan 2025, Liu et al., 15 Mar 2025). The residual failures with semantic-layer context fall predominantly outside the document's coverage, involving either complex analytic patterns (e.g., percentiles, recursion, multiple CTEs) or conventions the document does not encode. This suggests that the remaining error rate reflects how much the document covers rather than an intrinsic ceiling of the method.

Implications for Methodological and Practical Design

The study’s rigorous paired-comparison framework (identical harness, judge, question set, model settings) isolates the semantic-context variable, establishing the measured effects as minimally confounded. The result has substantial implications:

Once sufficient semantic context is present, model selection can be driven by cost, latency, and engineering criteria, because accuracy within the tier saturates at the level the available context supports.
Practitioners can obtain large, robust accuracy gains and hallucination reductions by encoding and supplying explicit business semantics. This lever operates independently of fine-tuning, inference-time retrieval augmentation, prompt engineering, and output-constrained generation, and by magnitude it deserves priority over them. The gain is available to any sufficiently competent LLM, and its size approaches that of the most ambitious domain-adaptation techniques without requiring any modification of the model.
The runtime (programmatic) form of semantic layers likely offers even stricter guarantees, converting the advisory effect observed here into hard enforcement (i.e., a compile-time error on out-of-model queries instead of plausible hallucinated SQL). This remains an open empirical question, but the findings here plausibly serve as a lower bound for runtime-system effects.
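The difference between advisory context and hard enforcement can be illustrated with a toy compiler. This is a sketch under stated assumptions, not a description of any product or of the paper's setup: the measure registry, the table and column names, and `compile_query` are all hypothetical. The model would request a named measure and dimensions, and the layer either emits SQL or refuses.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


class SemanticLayerError(ValueError):
    """Raised when a request falls outside the modelled semantics."""


@dataclass(frozen=True)
class Measure:
    expression: str  # SQL aggregate defining the measure
    table: str       # fact table the measure is computed from


# Hypothetical registry; names are invented for illustration.
MEASURES: Dict[str, Measure] = {
    "net_sales": Measure("sum(sales_amount)", "sales"),
}
DIMENSIONS: Dict[str, Set[str]] = {"sales": {"store_key", "date_key"}}


def compile_query(measure: str, group_by: List[str]) -> str:
    """Compile a measure request to SQL, failing on anything unmodelled."""
    if measure not in MEASURES:
        raise SemanticLayerError(f"unknown measure: {measure}")
    m = MEASURES[measure]
    unknown = [d for d in group_by if d not in DIMENSIONS[m.table]]
    if unknown:
        raise SemanticLayerError(f"unknown dimensions for {m.table}: {unknown}")
    cols = ", ".join(group_by)
    select = f"{cols}, " if cols else ""
    sql = f"SELECT {select}{m.expression} AS {measure} FROM {m.table}"
    if cols:
        sql += f" GROUP BY {cols}"
    return sql


print(compile_query("net_sales", ["store_key"]))
```

A request for an undefined measure raises `SemanticLayerError` instead of yielding plausible but invented SQL, which is the failure mode the advisory document can only discourage.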

Relation to Adjacent Literature

Benchmark results and mechanistic inferences are corroborated by ontology-oriented approaches, which formalize business semantics in OWL/RDFS and demonstrably triple end-to-end QA accuracy while eliminating hallucinated entities (Sequeda et al., 2023, Allemang et al., 2024). Semantic layers implement the same principle, using natural-language context and optionally enforceable constraints. The strong generalization across domains—including bioinformatics (GraphRAG) and clinical QA—indicates that this architectural intervention is not domain-specific but foundational.

Limitations and Directions for Future Research

The evaluation is limited by its focus on a single retail-analytics dataset and warehouse engine; broader validation over additional industries (finance, healthcare, SaaS telemetry) and execution platforms is warranted. The relationship between semantic-layer document coverage, authoring effort, and marginal utility is non-trivial and merits systematic exploration. The results also need to be extended beyond single-shot scenarios to agentic, task-oriented pipelines with iterative or adaptive querying, where compounding errors are likely to magnify the studied effect. Additionally, a more diverse adjudication process (human graders, cross-family LLM judging) could strengthen external validity.

Conclusion

This study establishes that authoritative semantic context, encoded as semantic-layer documentation, is the principal structural determinant of LLM-powered analytics reliability with respect to analytical accuracy and hallucination minimization. Across state-of-the-art models, a paired-comparison protocol demonstrates a consistent, significant 17–23 percentage-point improvement in first-shot accuracy when context is augmented with curated business semantics. The effect size subsumes inter-model differences, is robust to model vendor or generation, and is mirrored in contemporary literature across industrial and academic settings. Semantic-layering reframes the text-to-SQL task from open-ended semantic inference to constrained retrieval, suppressing dominant LLM error classes and enabling practical reliability unattainable via model improvements alone. Future research should address runtime-enforcement architectures, broader domain generalization, and agentic settings to comprehensively characterize the utility and limitations of semantic-layer grounding.