Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling

Published 15 Apr 2026 in cs.LG | (2604.13386v1)

Abstract: Linear probes can detect when LLMs produce outputs they "know" are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge. Across 12 models (0.5B--176B parameters), we find probe accuracy improves with scale: ~5% AUROC per 10x parameters (R=0.81). Geometrically, deception directions rotate gradually across layers rather than appearing at one location, explaining both why single-layer probes are brittle and why multi-layer ensembles succeed.


Authors (3)

Summary

The paper shows that linear probe accuracy scales logarithmically with model size, achieving around a 5% AUROC improvement per 10× increase in parameters.
It demonstrates that single-layer probes are brittle due to non-universal layer localization, while multi-layer ensembling via stacking logistic regression significantly improves detection performance.
The geometric analysis reveals that deception representations gradually rotate across layers, validating the use of ensembling as a robust mechanism for detecting explicit deception.

Linear Probe Accuracy Scaling and Multi-Layer Ensembling for Deception Detection

Overview

The paper "Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling" (2604.13386) investigates how linear probes can detect deception-related activations in LLMs, with an emphasis on scaling trends and methodological robustness. The authors systematically analyze probe performance across twelve open-weight models (spanning Llama, Qwen, Mistral/Mixtral) and five deception tasks drawn from the Liars' Bench benchmark. The study demonstrates that (i) probe accuracy scales strongly with model size; (ii) single-layer probes are brittle because the layer at which deceptive directions are best localized is not universal; and (iii) multi-layer ensembling, especially stacking over 5 layers using logistic regression, yields strong improvements (up to +78% AUROC) in scenarios where single-layer methods fail. A geometric analysis shows that deception is encoded as a gradually rotating direction in activation space, which supports ensembling as a principled solution.

Probe Performance: Scaling with Model Size

The central result is a robust log-linear scaling relationship between model parameter count and probe AUROC: approximately $5\%$ AUROC improvement per $10\times$ increase in parameters ($R=0.81$, $p<0.001$, $n=60$ model-task pairs). Larger models (e.g., Llama-70B, Qwen-72B) clearly develop more linearly accessible deception representations. Task-specific scaling is pronounced: explicit deception tasks (Convincing Game, Instructed Deception) scale most rapidly, reaching near-ceiling AUROC in frontier models, while implicit deception tasks (Harm-Pressure Choice/Knowledge) remain challenging.

Figure 1: Log-linear scaling of probe AUROC: deception representations become more linearly accessible with increased model size, showing approximately $5\%$ AUROC gain per $10\times$ parameters.
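The log-linear relationship can be made concrete with a small least-squares fit of AUROC against $\log_{10}$ of the parameter count. The sketch below is illustrative only: `fit_log_linear` is a hypothetical helper, the data points are synthetic and constructed to lie exactly on a line, and the slope of 0.05 reads "5% AUROC" as 0.05 AUROC units, which is an assumption about the paper's convention. None of it reproduces the paper's data or code.

```python
import math

def fit_log_linear(params, aurocs):
    """Ordinary least squares fit of auroc = intercept + slope * log10(params).

    Returns (intercept, slope, pearson_r).
    """
    xs = [math.log10(p) for p in params]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(aurocs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in aurocs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, aurocs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, pearson_r

# Synthetic points (NOT from the paper): AUROC rises by 0.05 per 10x parameters.
toy_params = [0.5e9, 7e9, 70e9]
toy_aurocs = [0.70 + 0.05 * math.log10(p / 0.5e9) for p in toy_params]

intercept, slope, pearson_r = fit_log_linear(toy_params, toy_aurocs)
print(f"AUROC gain per 10x parameters: {slope:.3f} (r = {pearson_r:.2f})")
```

On real measurements the points would scatter around the line, and the correlation coefficient would fall below 1, as with the reported $R=0.81$.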

Model family effects are evident: Llama outperforms on Insider Trading and Convincing Game, while Qwen exhibits less variance in optimal layer localization. These empirical results confirm and extend prior theory from interpretability literature regarding linear superposition and scaling-induced semantics [burger2024truth].

Layer Localization and Task Dependency

Optimal probe layer positions vary strongly across families, models, and deception types. The best-performing layer is most often situated in the final two-thirds of the network (mean position $\sim65\%$) but with substantial deviation, especially in Llama models.

Figure 2: Best layer position vs model size; Llama exhibits high variance in optimal layer localization, unlike Qwen which stabilizes in the 60--80% range.

Probe efficacy is highly task-dependent. Explicit deception is reliably detected (AUROC $>0.9$ across nearly all models), while implicit deception yields near-chance performance in smaller models and remains suboptimal even in larger ones.

Figure 3: Probe performance stratified by deception type. Explicit deception tasks are consistently detectable, whereas implicit deception remains difficult.

This hierarchy suggests that probe methods primarily react to activation patterns induced by direct lying or explicit instructions, rather than to subtler deception or strategic manipulation.

Layer-Wise Probe Robustness and Ensembling

Analysis reveals that deception-related activation directions rotate gradually across layers, leading to representation drift. Probes at nearby layers detect highly correlated features, and geometric similarity decays as layer distance increases: the cosine similarity between two layers' probe weight vectors correlates negatively with the difference in their AUROC ($R=-0.435$).

Figure 4: Cosine similarity between probe weight vectors correlates negatively with AUROC difference; adjacent layers encode similar deception features, supporting representational continuity.
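To illustrate what a gradually rotating direction implies for probe similarity, the following sketch compares toy probe weight vectors across layers. It assumes, purely for illustration, that the direction rotates by a fixed 15 degrees per layer inside a 2-D plane; real probe weights are high-dimensional and the paper does not state a rotation rate. `cosine_similarity` is a hypothetical helper written for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy assumption: the probe direction rotates 15 degrees per layer.
STEP = math.radians(15)
layer_directions = [(math.cos(k * STEP), math.sin(k * STEP)) for k in range(6)]

# Similarity of each layer's direction to the layer-0 direction.
similarity_to_layer0 = [
    cosine_similarity(layer_directions[0], direction)
    for direction in layer_directions
]

for distance, sim in enumerate(similarity_to_layer0):
    print(f"layer distance {distance}: cosine similarity {sim:.3f}")
```

Under this toy model, adjacent layers stay highly similar while distant layers diverge, so a probe trained at one depth transfers poorly to a far-away depth.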

Layer-wise AUROC patterns further validate this: smaller models exhibit erratic, unstable probe performance; larger ones show smooth, differentiated, and task-focal representations.

Figure 5: Layer-wise AUROC for Qwen 0.5B vs Qwen 72B; representation quality dramatically improves with model scale.

Single-layer probes are brittle and fail on challenging deception tasks, and fixed-depth heuristics do not generalize. Double-fault analysis shows that early and late layers have complementary failure modes, which makes them well suited to being combined in an ensemble.

Figure 6: Double fault matrix; early and late-layer probes exhibit complementary errors, motivating cross-depth ensembling.
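The double-fault measure for a pair of classifiers is the fraction of samples that both get wrong; a low value means their errors rarely coincide. The sketch below computes it for three toy probes. The layer names, predictions, and labels are made up for illustration, and `double_fault` and `double_fault_matrix` are hypothetical helpers, not the paper's code.

```python
def double_fault(preds_a, preds_b, labels):
    """Fraction of samples misclassified by BOTH classifiers."""
    both_wrong = sum(
        1 for a, b, y in zip(preds_a, preds_b, labels) if a != y and b != y
    )
    return both_wrong / len(labels)

def double_fault_matrix(preds_by_layer, labels):
    """Double-fault value for every ordered pair of layers."""
    layers = sorted(preds_by_layer)
    return {
        (i, j): double_fault(preds_by_layer[i], preds_by_layer[j], labels)
        for i in layers
        for j in layers
    }

# Made-up binary labels (1 = deceptive) and per-layer probe predictions.
labels = [1, 1, 0, 0, 1, 0]
preds_by_layer = {
    "early": [0, 0, 0, 0, 1, 0],  # wrong on samples 0 and 1
    "mid":   [0, 1, 0, 0, 0, 0],  # wrong on samples 0 and 4
    "late":  [1, 1, 0, 0, 0, 1],  # wrong on samples 4 and 5
}

matrix = double_fault_matrix(preds_by_layer, labels)

# Pick the pair of distinct layers whose errors overlap the least.
layers = sorted(preds_by_layer)
best_pair = min(
    ((a, b) for a in layers for b in layers if a < b),
    key=lambda pair: matrix[pair],
)
print("most complementary pair:", best_pair)
```

In this toy data the early and late probes never fail on the same sample, so they are the natural pair to combine.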

Multi-layer Ensembling: Methodology and Results

Multi-layer ensembling is operationalized via stacking logistic regression over probe scores from selected layers, determined by double-fault complementarity. The 5-layer ensemble achieves significant AUROC improvements: +78% on Harm-Pressure Knowledge, +29% on Insider Trading (mean +13% across tasks). Notably, ensemble gains are largest on tasks where single-layer probes fail.

Figure 7: 5-layer ensemble separates deceptive from honest activations in Llama-70B, achieving substantial AUROC improvement; misclassified samples cluster near the decision boundary.
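A minimal sketch of stacking, assuming each sample is represented by one probe score per selected layer and that a logistic regression is trained on those scores. Everything here is illustrative: the example uses two layers rather than five, a plain gradient-descent fit written from scratch, and a synthetic dataset in which the label depends on the sum of both scores. It also evaluates on the training data for brevity, whereas a real evaluation would use held-out samples. `fit_stacker` and `auroc` are hypothetical helpers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stacker(score_rows, labels, lr=0.5, steps=500):
    """Logistic regression over per-layer probe scores via gradient descent."""
    n, n_layers = len(labels), len(score_rows[0])
    weights, bias = [0.0] * n_layers, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_layers, 0.0
        for row, y in zip(score_rows, labels):
            err = sigmoid(bias + sum(w * s for w, s in zip(weights, row))) - y
            for k, s in enumerate(row):
                grad_w[k] += err * s
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def auroc(scores, labels):
    """Probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Synthetic data: label is 1 when the two layer scores sum to a positive value.
grid = [-1.5, -0.5, 0.5, 1.5]
score_rows = [(a, b) for a in grid for b in grid if a + b != 0]
labels = [1 if a + b > 0 else 0 for a, b in score_rows]

weights, bias = fit_stacker(score_rows, labels)
ensemble_scores = [bias + sum(w * s for w, s in zip(weights, row)) for row in score_rows]

single_auroc = auroc([row[0] for row in score_rows], labels)
ensemble_auroc = auroc(ensemble_scores, labels)
print(f"single layer: {single_auroc:.3f}, stacked: {ensemble_auroc:.3f}")
```

Because neither layer's score alone determines the label in this toy setup, the learned combination ranks samples better than either layer individually.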

Per-dataset improvement heatmaps confirm that ensembling is maximally beneficial on tasks with distributed, weak deception signals.

Figure 8: 5-layer ensemble attains maximal AUROC improvements on Insider Trading and Harm-Pressure Knowledge; slight degradation is observed on already-easy tasks.

Visualizing the prediction manifolds with PCA and t-SNE shows that ensemble errors occur only on genuinely ambiguous cases near the decision boundary, which indicates that the ensemble's uncertainty is reasonably well calibrated.

Figure 9: Ensemble predictions (3 vs 5 layers): misclassified samples cluster near the decision boundary; increased layering improves decision separation.

Geometric Interpretation and Theoretical Implications

The layered rotation of deception directions in activation space hints at unified, distributed semantic encoding. The representational continuity across the network, evident in cosine similarity trends and layer-wise AUROC smoothness, supports mechanistic interpretability frameworks [olah2020zoom, elhage2021mathematical]. Ensembling exploits this structure, aggregating 'rotating' weak signals, rather than relying on a single arbitrarily localized direction.

Limitations

Probe Validity: Training on REPE (contrastive pairs) and evaluating on Liars' Bench (role-play and strategic deception) introduces a domain gap: probes may detect prompt artifacts rather than genuine deception. Probe outputs should not be interpreted as definitive measures of deceptive intent.
Transferability: Family-specific patterns suggest limited transferability across architectures and releases.
Evaluation Scope: Experiments are limited to short-context, static, non-adversarial settings; prior work shows probe failures for long-context or adversarially-designed inputs.
Ensembling Methodology: Comparison between grid-searched two-layer averages and logistic regression stacking implicates both layering and weight-learning in performance gains; current results cannot fully disentangle their contributions.

Practical and Theoretical Implications

For AI safety practitioners, increased probe accuracy in frontier models suggests a scalable pathway for oversight. Explicit deception (role-playing, instruction-following) is reliably detectable, but implicit deception (reward hacking, subtle manipulation) remains elusive. Multi-layer ensembling provides robustness against layer localization brittleness and may be extensible to other behaviors (sycophancy, reward hacking). The geometric structure uncovered in this work points toward interpretable, global deception representations in LLMs, a crucial property for mechanistic transparency and targeted intervention strategies.

Future Directions

Ensembling of linear and non-linear probes (MLPs, attention-based classifiers) may capture aspects missed by linear-only approaches.
Task-specific probe ensembling can potentially improve performance in challenging scenarios.
Expansion to adversarial and long-context settings is necessary for production applicability.
Transferability analysis across model families and probe modalities remains an open area for research.

Conclusion

The paper establishes that single-layer linear probe detection of deception in LLMs is fundamentally brittle due to gradually rotating deception directions in activation space. Multi-layer ensembling, especially with stacked logistic regression, robustly recovers performance and achieves strong AUROC improvements on the most challenging deception tasks. Probe accuracy scales with model size and is highest for explicit deception types. The geometric analysis reveals a coherence underlying deception encoding, implying promising avenues for mechanistic interpretability and safety monitoring. Practical detection remains limited by domain gap and adversarial robustness, motivating further hybrid ensemble and contextual evaluation research.
