Voice Under Revision: Large Language Models and the Normalization of Personal Narrative

Published 24 Apr 2026 in cs.CL and cs.CY | (2604.22142v1)

Abstract: This study examines how LLM rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to "improve" the text or simply to "rewrite" it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.


Authors (1)

Tom van Nuenen

Summary

The paper reveals that LLM rewriting compresses stylistic features by reducing markers like first-person pronoun and contraction usage, diminishing narrative voice.
It employs 13 stylometric markers across function words, vocabulary, syntax, and register, using paired comparisons and perplexity analyses to quantify normalization effects.
The study highlights model-specific prompt impacts, showing that voice-preserving prompts can mitigate normalization in some LLMs while intensifying it in others.

LLMs and the Stylometric Normalization of Personal Narrative

Methodological Framework

The study analyzes the implications of LLM-mediated rewriting on personal narrative by operationalizing stylistic "voice" using 13 stylometric markers across four categories: function words, vocabulary richness, syntax/punctuation, and register. These markers are grounded in computational stylistics and digital humanities (Burrows, Eder, and others), and include classic metrics such as MTLD, Yule's K, and Honoré's R for lexical diversity; function word ratio and MFW coverage for topic-insensitive stylistic patterning; sentence length and punctuation frequency for syntactic structure; and density of contractions, first-person pronouns, and emotion words for register and affect. Additional narrative stance markers extend the analysis to eventive clause density, causal connectives, abstraction, and retrospective framing.
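A few of these markers can be sketched in plain Python to make the definitions concrete. This is an illustrative sketch, not the paper's implementation: the tokenizer, the first-person word list, and the apostrophe-based contraction test are all assumptions, and the apostrophe test will also count possessives such as "mom's". Yule's K and Honoré's R follow their standard textbook formulas.

```python
import math
import re
from collections import Counter

# Assumed word list; the paper's exact inventory is not given in this summary.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase, normalize curly apostrophes, and split into word tokens."""
    return TOKEN_RE.findall(text.lower().replace("\u2019", "'"))


def first_person_density(tokens):
    """Share of tokens that are first-person pronouns."""
    return sum(t in FIRST_PERSON for t in tokens) / len(tokens) if tokens else 0.0


def contraction_density(tokens):
    """Crude proxy: share of tokens containing an apostrophe."""
    return sum("'" in t for t in tokens) / len(tokens) if tokens else 0.0


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2 (higher = more repetition)."""
    n = len(tokens)
    if n == 0:
        return 0.0
    spectrum = Counter(Counter(tokens).values())  # V_i: types occurring i times
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def honore_r(tokens):
    """Honoré's R = 100 * ln(N) / (1 - V1 / V); undefined if every type is a hapax."""
    freqs = Counter(tokens)
    n, v = len(tokens), len(freqs)
    v1 = sum(1 for c in freqs.values() if c == 1)
    if n == 0 or v1 == v:
        return float("nan")
    return 100.0 * math.log(n) / (1.0 - v1 / v)
```

Computing each function on an original and on its rewrite, then taking the difference, gives the kind of per-marker shift the study reports.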

Paired comparisons were conducted between original texts and LLM rewrites under the generic-improvement, rewrite-only, and voice-preserving prompt conditions. The narratives are EmpathicStories corpus data drawn from Reddit, Hippocorpus, and oral history sources. Statistical evaluation relies primarily on Cohen's $d$ and rank-biserial correlation, with robustness checks on the effect size estimates. Perplexity analyses supplement the stylometric findings, using GPT-2 as an independent reference model to measure convergence toward model-typical language.
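The two effect sizes can be illustrated for a single marker measured on matched original/rewrite pairs. The summary does not say which variants the paper uses, so the sketch below assumes the paired form of Cohen's $d$ (mean difference divided by the standard deviation of the differences) and the matched-pairs rank-biserial correlation computed from Wilcoxon signed ranks. The function names are hypothetical.

```python
from statistics import mean, stdev


def cohens_d_paired(before, after):
    """Paired Cohen's d: mean(after - before) / sd(after - before)."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / stdev(diffs)


def rank_biserial_paired(before, after):
    """Matched-pairs rank-biserial r = (W+ - W-) / (W+ + W-).

    W+ and W- are the sums of ranks of |difference| for positive and
    negative differences; zero differences are dropped, ties get average ranks.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    total = w_pos + w_neg
    return (w_pos - w_neg) / total if total else 0.0
```

A negative value from either function means the marker fell after rewriting; the rank-based measure is less sensitive to a few extreme narratives than $d$.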

Stylometric Compression and Register Normalization

The primary finding is that LLM rewriting consistently induces stylometric compression: voice-linked features (e.g., first-person pronoun density, contraction usage) are reduced, while lexical diversity and abstraction increase. Notably, markers typically associated with informal or personal speech become less prevalent, while features indexing formality, diversity, and sentence complexity rise. This normalization effect is robust across source categories: narratives originating from informal (Reddit), intermediate (oral history), and formal (Hippocorpus) registers show consistent directional change on 12 of 13 core markers, with effect magnitude modulated by baseline distance from model-preferred register.

Analysis suggests that this pattern is best characterized as register compression rather than linear translation between genres. For instance, informal Reddit posts undergo larger normalization shifts because they start further from the model-preferred register, while formal recollections exhibit ceiling effects. The only directional exception observed is contraction density, apparently because rewriting pulls both high and low baseline frequencies toward a common level rather than shifting them in a single direction.

The impact of normalization extends beyond surface-level register. Narrative stance markers indicate a decrease in eventive positioning and agency (eventive clause density, first-person eventive density), increased retrospective framing, and abstraction. These shifts imply a move away from situated, experience-driven narration toward more interpretive and generic prose.

Prompt Effects and Model-Specific Anomalies

Voice-preserving prompts attenuate normalization magnitude by an estimated 32%, but efficacy varies markedly by model. GPT-5.4 and Gemini 3.1 Pro exhibit modest reductions in stylometric compression under such prompting, whereas Claude Sonnet shows a paradoxical intensification: variance compression increases, and multivariate dispersion reduction rises from a small to a medium effect size ($d = 0.36 \rightarrow 0.65$). This suggests that, for certain architectures or training paradigms, explicit instructions to maintain stylistic voice can activate normalization processes that overcorrect stylistic individuality, undermining user control over narrative preservation.
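One way to picture "multivariate dispersion" is as the average distance of each text's feature vector from the group centroid, compared between originals and rewrites. The summary does not specify the paper's dispersion measure, so the following is a minimal sketch under that assumption; it also assumes the features have already been placed on a shared scale (for example by z-scoring across all texts).

```python
import math
from statistics import mean


def centroid_distances(rows):
    """Euclidean distance of each feature vector to the group centroid."""
    dims = len(rows[0])
    centroid = [mean(r[k] for r in rows) for k in range(dims)]
    return [math.dist(r, centroid) for r in rows]


def mean_dispersion(rows):
    """Average distance to centroid; smaller means the texts sit closer together."""
    return mean(centroid_distances(rows))


# Toy, made-up feature vectors (two markers per text), for illustration only.
originals = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
rewrites = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
dispersion_drop = mean_dispersion(originals) - mean_dispersion(rewrites)
```

A positive `dispersion_drop` corresponds to convergence in feature space; the per-text distances from `centroid_distances` are what an effect size such as Cohen's $d$ would then be computed on.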

Such model-specific anomalies underscore the unpredictability of prompt-based mitigation strategies. The explicit instruction appears to prime attention to stylistic features, which are then systematically homogenized, possibly reflecting implicit distributional preferences encoded in the model’s training corpus or alignment objectives.

Perplexity Reduction and Distributional Convergence

All three LLMs produce rewrites with reduced perplexity relative to human originals when evaluated with GPT-2. In particular, GPT-5.4 achieves a substantial perplexity reduction ($-22.6\%$, $d = 1.18$), with Claude Sonnet and Gemini 3.1 Pro exhibiting smaller but statistically significant declines. This indicates that AI-rewritten texts are more predictable to an independent reference model and thus closer to model-typical language. Notably, increases in vocabulary diversity (e.g., MTLD, Honoré's R) are accompanied by greater distributional conformity: the prose becomes simultaneously more lexically elaborate and more generically model-like, consistent with the hypothesis that LLM rewriting enforces standardized, polished registers irrespective of individual narrative voice.
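For reference, perplexity is the exponential of the average negative log-probability a model assigns to each token, $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$, and the reported reduction is a relative change between original and rewrite. The sketch below shows only this arithmetic; obtaining the per-token log-probabilities from GPT-2 is out of scope here, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def percent_change(original, rewritten):
    """Relative change in percent; negative means the rewrite is more predictable."""
    return 100.0 * (rewritten - original) / original
```

For example, a text whose tokens each receive probability 0.25 has perplexity 4, and a drop in perplexity from 200 to 150 is a change of -25%.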

Practical and Theoretical Implications

The normalization and compression effects identified have significant implications for both applied and theoretical dimensions of AI-assisted writing and digital humanities. Practically, rewriting with LLMs risks attenuation of individual and experiential voice, regardless of prompt strategy, potentially undermining authenticity in domains reliant on personal narrative (e.g., therapy, oral history, creative writing). The model-specific anomaly observed with Claude Sonnet highlights the importance of model selection and prompt engineering, yet cautions against overreliance on prompt-based voice preservation solutions. Stylometric recoverability assessments further reveal that GPT-5.4 rewrites remain more source-identifiable, while Claude and Gemini produce near-chance attribution, suggesting variable degrees of transformation across current generative systems.
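Stylometric recoverability can be pictured as a nearest-neighbor attribution task: each rewrite is matched to the closest original in feature space, and accuracy is compared with the chance rate of $1/n$. The summary does not describe the paper's attribution procedure, so this sketch assumes a Burrows-style Delta (mean absolute difference of z-scored features); all names are hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature column; constant columns are left unscaled."""
    dims = len(rows[0])
    mus = [mean(r[k] for r in rows) for k in range(dims)]
    sds = [pstdev([r[k] for r in rows]) or 1.0 for k in range(dims)]
    return [[(r[k] - mus[k]) / sds[k] for k in range(dims)] for r in rows]


def delta(a, b):
    """Burrows-style Delta: mean absolute difference of z-scores."""
    return mean(abs(x - y) for x, y in zip(a, b))


def recoverability(originals, rewrites):
    """Share of rewrites whose nearest original is their true source.

    Assumes rewrites[i] was produced from originals[i].
    """
    z = zscore_columns(list(originals) + list(rewrites))
    z_orig, z_rew = z[:len(originals)], z[len(originals):]
    hits = 0
    for i, rw in enumerate(z_rew):
        nearest = min(range(len(z_orig)), key=lambda j: delta(rw, z_orig[j]))
        hits += nearest == i
    return hits / len(rewrites)
```

If every rewrite collapses onto the same point in feature space, at most one of them can be matched to its source, and accuracy falls to the chance level, which is the "near-chance attribution" pattern described above.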

Theoretically, the findings support a view of LLMs as agents of stylistic compression, generating texts that converge toward a "distributional mean" determined by training data and alignment protocols. This raises questions about the limits of user control, the nature of AI-mediated authorship, and the replication of literary diversity in generative outputs. Future research is needed to explore whether architectural or alignment innovations can yield greater stylistic elasticity, or whether inherent tendencies toward normalization persist across model generations.

Conclusion

This paper provides comprehensive evidence that LLM-mediated rewriting induces robust and systematic stylistic normalization of personal narrative, compressing register and reducing voice-linked features across diverse sources. While prompt engineering can mitigate these effects to some extent, its efficacy is model-dependent and may occasionally amplify normalization. Perplexity analyses confirm that rewritten texts become more generically predictable to LLMs, reinforcing the trend toward standardized prose. The results have critical implications for personal narrative authenticity, authorship, and the design of future generative systems, underscoring the need for ongoing investigation into the tradeoffs between stylistic individuality and model-driven convention in AI-assisted writing (2604.22142).
