- The paper demonstrates that LLMs can robustly detect, localize, and verbally distinguish between dropout masking and additive Gaussian noise applied to their activations.
- The methodology involves token-level perturbations across multiple transformer layers in models like Llama, Olmo, and Qwen, achieving accuracies over 99% in some cases.
- The findings imply that LLMs possess a form of training awareness that raises important questions about model safety, evaluation protocols, and potential vulnerabilities.
LLMs' Introspective Recognition of Dropout and Gaussian Noise in Their Own Activations
Overview and Motivation
This work asks whether modern LLMs are merely sensitive to activation-space perturbations applied during inference or can also explicitly recognize them. Specifically, the paper evaluates whether models from leading open-weight families (Llama, Olmo, Qwen; 8B to 32B parameters) can detect, localize, and verbally distinguish between (i) dropout-style masking of activations and (ii) additive Gaussian noise, two structurally different interventions commonly used for regularization and robustness.
Recent interpretability literature has explored model introspection using semantic steering vectors, raising the question of whether LLMs can report on internal concept manipulations tied to their inferred knowledge structures [lindsey2025introspection] [pearsonvogel2026latentintrospectionmodelsdetect]. However, those interventions often confound semantic bias with low-level perturbational awareness. This paper isolates the effect by using strictly non-semantic, stochastic perturbations and testing the model's metacognitive ("introspective") capacity to recognize such interventions.
Methodology
Experiments are conducted on several open-weight LLMs, notably Llama3.1-8B, Olmo3.1-32B, Qwen3-14B, and Qwen3-32B, focusing on token-level activation perturbations at various transformer layers:
- Dropout: Activations are masked independently per-entry with probability p and rescaled, analogous to standard transformer regularization.
- Gaussian Noise: i.i.d. zero-mean Gaussian noise with standard deviation σ is injected additively into activations.
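The two perturbations can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names `apply_dropout` and `apply_gaussian_noise`, the list-of-floats representation of one token's activation vector, and the 1/(1-p) rescaling convention (inverted dropout) are assumptions made here.

```python
import random


def apply_dropout(activations, p, rng):
    """Zero each entry independently with probability p; rescale survivors by 1/(1-p).

    The 1/(1-p) factor is the usual inverted-dropout convention (an assumption here).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * scale for a in activations]


def apply_gaussian_noise(activations, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise with standard deviation sigma to each entry."""
    if sigma < 0.0:
        raise ValueError("sigma must be non-negative")
    return [a + rng.gauss(0.0, sigma) for a in activations]


rng = random.Random(0)
hidden = [0.5, -1.2, 0.3, 2.0]  # toy activation vector for a single token
dropped = apply_dropout(hidden, p=0.25, rng=rng)
noised = apply_gaussian_noise(hidden, sigma=0.1, rng=rng)
```

Setting `p=0.0` or `sigma=0.0` leaves the activations unchanged, which corresponds to a null (no modification) condition.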
Prompt variants are constructed so that, after processing perturbed and unperturbed sentences, the model is queried for detection ("Which sentence was perturbed?"), identification ("Which perturbation was applied?"), or discrimination (forced binary classification between perturbation types). Large pools of prompts and sentences (sourced from WikiText-103 and synthetic datasets) are used to limit prompt memorization and accidental bias.
Multiple experiment arms are designed:
- Localization: Given two sentences in a prompt, only one is perturbed; the model must localize the intervention.
- Zero-shot Discrimination: A single sentence is perturbed with either dropout or noise; the model is asked to identify which perturbation occurred, without in-context labelled exemplars.
- Few-shot In-Context Learning: The model is provided with labelled in-context examples associating sentences with perturbation types, followed by a test query.
- Control Experiments: Critical nulls include using non-semantic control labels, flipping label assignments, and contextually simple questions unrelated to introspective detection, to rule out trivial answer biases.
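The few-shot and control arms differ mainly in how perturbation types are mapped to answer labels. The sketch below shows one hypothetical way to build such prompts; the template wording, the `LABEL_SCHEMES` dictionary, the "A"/"B" non-semantic labels, and `build_few_shot_prompt` are illustrative assumptions, not the paper's actual prompts. Only the prompt text is constructed here; perturbing the activations of each sentence would happen separately during the forward pass.

```python
# Maps the true perturbation type to the label string shown to the model.
LABEL_SCHEMES = {
    "canonical": {"dropout": "dropout", "noise": "noise"},
    "synonym": {"dropout": "masking", "noise": "jitter"},
    "flipped": {"dropout": "noise", "noise": "dropout"},
    "non_semantic": {"dropout": "A", "noise": "B"},  # hypothetical control labels
}


def build_few_shot_prompt(exemplars, query_sentence, scheme):
    """Format labelled exemplars followed by an unlabelled query.

    exemplars: list of (sentence, perturbation) pairs,
               with perturbation in {"dropout", "noise"}.
    An empty exemplar list yields a zero-shot prompt.
    """
    labels = LABEL_SCHEMES[scheme]
    lines = []
    for sentence, perturbation in exemplars:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Perturbation: {labels[perturbation]}")
    lines.append(f"Sentence: {query_sentence}")
    lines.append("Perturbation:")
    return "\n".join(lines)


exemplars = [
    ("The river bends north.", "dropout"),
    ("Rain fell all night.", "noise"),
]
prompt = build_few_shot_prompt(exemplars, "The bridge opened in 1932.", "flipped")
```

Keeping the label scheme separate from the perturbation type makes it easy to run the same trials under canonical, synonymous, flipped, and non-semantic labels and compare accuracies.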
Success is measured primarily by next-token accuracy, that is, whether the model's next-token prediction corresponds to the correct answer. Tokenization variants and label bias are handled explicitly by aggregating over equivalent answer strings.
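One way to read this metric: for each trial, sum the next-token probability over every surface form of each answer (for example with and without a leading space or capital letter), then count the trial as correct if the correct answer's total is the largest. The sketch below assumes the next-token distribution is already available as a token-to-probability dictionary; the variant lists, toy probabilities, and function names are hypothetical, and the paper's exact aggregation rule may differ.

```python
def aggregate_answer_mass(next_token_probs, answer_variants):
    """Sum next-token probability over all surface forms of each answer."""
    return {
        answer: sum(next_token_probs.get(tok, 0.0) for tok in variants)
        for answer, variants in answer_variants.items()
    }


def is_correct(next_token_probs, answer_variants, correct_answer):
    """A trial is correct when the correct answer has the largest aggregated mass."""
    mass = aggregate_answer_mass(next_token_probs, answer_variants)
    return max(mass, key=mass.get) == correct_answer


def accuracy(trials, answer_variants):
    """Fraction of (next_token_probs, correct_answer) trials scored as correct."""
    hits = sum(is_correct(probs, answer_variants, gold) for probs, gold in trials)
    return hits / len(trials)


variants = {
    "dropout": ["dropout", " dropout", "Dropout", " Dropout"],
    "noise": ["noise", " noise", "Noise", " Noise"],
}
# Toy distributions, made up for illustration.
trials = [
    ({" dropout": 0.40, "Dropout": 0.15, " noise": 0.45}, "dropout"),
    ({" noise": 0.70, " dropout": 0.20}, "noise"),
    ({" noise": 0.30, " dropout": 0.60}, "noise"),
]
print(accuracy(trials, variants))
```

In the first toy trial no single "dropout" token outranks " noise", yet the aggregated mass does, which is the kind of tokenization effect the aggregation is meant to absorb.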
Empirical Results
Localization Sensitivity
All tested LLMs robustly detect and localize both activation dropout and Gaussian noise when applied at non-trivial (but not extreme) magnitudes, with Qwen3-32B reaching almost perfect accuracy and Llama3.1-8B displaying sensitivity down to extremely small perturbations. Importantly, accuracy falls to chance under null perturbation (i.e., no modification), establishing that the detected effect is not a side-effect of prompt structure or label bias.
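Whether an accuracy is "at chance" depends on the number of trials. As a rough illustration, the sketch below computes a Wilson score interval for a two-alternative task and checks whether it contains 0.5. The trial counts are made-up numbers, not results from the paper, and the paper may use a different statistical procedure.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half, center + half


def consistent_with_chance(successes, trials, chance=0.5):
    """True if the chance level lies inside the Wilson interval."""
    low, high = wilson_interval(successes, trials)
    return low <= chance <= high


# Made-up counts for illustration only.
null_at_chance = consistent_with_chance(252, 500)       # null condition, 50.4% correct
perturbed_at_chance = consistent_with_chance(480, 500)  # perturbed condition, 96% correct
```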
Control experiments show that performance on standard semantic comprehension questions remains unimpaired at moderate perturbation strengths and degrades only once perturbations cross a substantial threshold (the paper tabulates this impairment boundary empirically). Localization therefore succeeds in a regime where general comprehension is intact, and the localization capacity is not an artifact of a response bias toward selecting the perturbed sentence.
Zero-Shot and Few-Shot Perturbation Discrimination
Among the evaluated models, Qwen3-32B exhibits notably strong zero-shot discrimination: its accuracy in classifying the perturbation type increases monotonically with perturbation magnitude, exceeding 99% with both canonical labels and synonyms ("masking" for dropout, "jitter" for noise). The model also exhibits a non-trivial prior for the canonical mapping of perturbation names: accuracy decreases sharply if the semantic labels are deliberately swapped, even when synonyms are used.
By contrast, Olmo3.1-32B and Llama3.1-8B fail to reveal a clear zero-shot discrimination signal, suggesting the effect is not universally present across architectures and training schemes.
In the few-shot regime, all models demonstrate the capacity to learn the mapping between perturbations and labels when given explicit in-context supervision, with the Qwen models again achieving the highest accuracy. Learning dynamics indicate that Qwen3-32B requires very few in-context pairs (as few as one or two) to approach strong discrimination, whereas the other models improve more slowly and plateau at lower accuracy.
Control and Label Assignment Experiments
Critical controls, including the use of arbitrarily mapped or flipped labels and non-semantic answer pairs, demonstrate that the models' performance is not a spurious consequence of choice bias or label-token-specific effects. With Qwen3-32B, performance is maximized when correct semantics are used, depressed with flipped labels, and neutral for non-semantic label pairs, confirming the presence of a prior associating each perturbation with its semantically correct name.
Strong numerical results include:
- Qwen3-32B achieves >99% zero-shot accuracy in discriminating dropout from noise with canonical and synonymous labels as perturbation strength increases.
- For both detection and localization, all tested models maintain >80% accuracy well within realistic perturbation magnitude ranges, with no degradation of semantic tasks until extreme perturbation.
- In few-shot in-context learning, Qwen-family models reach >70% accuracy with as few as 7-9 exemplars; other models show more modest improvement.
Implications and Future Directions
From a theoretical standpoint, the existence of this introspective capacity demonstrates that LLMs encode an activation-level signal, even for stochastic, non-semantic interventions, and that this signal is aligned (sometimes by prior) with human-understood conceptual differences between regularization techniques. This alignment is notable given that the training details for these open models specifically indicate that neither standard dropout nor additive noise was applied during pretraining; thus, the models' semantic knowledge and experiential "introspection" appear to have fused through exposure to the concepts in data, not via gradient-based association with the actual activations.
One significant implication is the emergence of "training awareness", in parallel to evaluation awareness [chaudhary2025evaluation] [abdelnabi2025hawthorne] [needham2025large] [nguyen2025probing] [xiong2025probe] [hua2025steering] [bengio2025international] [bengio2026international]. If models can distinguish training-time regularization from inference-time interventions (despite never experiencing those interventions directly in training), this has ramifications for safety and trustworthiness: a model could condition its behavior on the detection of such signals, potentially undermining AI evaluation protocols and safety guarantees.
Furthermore, such findings open the possibility that models can be systematically probed, controlled, or even attacked by direct, non-semantic manipulation of their latent state, without requiring semantic content at all. This raises both opportunities for transparency tools and challenges for alignment and robustness.
Open questions remain:
- Mechanistic Origin: How does the association between semantic knowledge (of dropout, noise) and introspective experience arise?
- Broader Generalization: Will this capacity extend to other perturbations (e.g., uniform noise, quantization) or to more subtle or complex intervention classes?
- Interaction with Model Size: Is there a scaling law for introspective capacity versus parameter count or architecture variant?
- Mitigation and Safety: Can (and should) this kind of "signal" be masked or randomized to prevent undesirable meta-cognitive inference in safety-critical usage scenarios?
Conclusion
This paper provides strong evidence that LLMs, at least in several leading open architectures, possess an introspective signal for structurally non-semantic perturbations (dropout and Gaussian noise) applied to their own activations. This capacity is both detectable and verbally accessible, with model performance tied to perturbation strength, label semantics, and degree of in-context supervision. These findings heighten concerns about training and inference awareness in current models and highlight the need for deliberate safety and evaluation strategies that account for activation-level meta-detection capabilities.
References
- Lindsey, J. "Emergent Introspective Awareness in LLMs" [lindsey2025introspection]
- Pearson-Vogel, T. et al., "Latent Introspection: Models Can Detect Prior Concept Injections" (Pearson-Vogel et al., 23 Feb 2026)
- Chaudhary, M. et al., "Evaluation Awareness Scales Predictably in Open-Weights LLMs" (Chaudhary et al., 10 Sep 2025)
- Bengio, Y. et al., "International AI Safety Report 2026" (Bengio et al., 24 Feb 2026)
- Fornasiere, D. et al., "LLMs Recognize Dropout and Gaussian Noise Applied to Their Activations" (2604.17465)