Measuring and Mitigating Persona Distortions from AI Writing Assistance

Published 24 Apr 2026 in cs.CL | (2604.22503v1)

Abstract: Hundreds of millions of people use AI for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas - their perceived beliefs, personality, and identity. In three large-scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI-assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.

Authors (4)

Summary

The paper demonstrates that AI writing assistance systematically alters 29 persona attributes, including political, emotional, and demographic inferences.
It employs controlled experiments with nearly 3,000 UK-based writers and more than 11,000 blinded readers to quantify these distortions.
Interventions such as prompting and reward-model reranking reveal a trade-off between reducing distortion and preserving user acceptance.

Measuring and Mitigating Persona Distortions from AI Writing Assistance

Introduction

"Measuring and Mitigating Persona Distortions from AI Writing Assistance" (2604.22503) investigates the systematic distortions that AI writing assistants induce in the perceived personas of human writers. Employing large-scale, controlled experiments with a census-representative sample, the work quantitatively assesses how AI-mediated text alters third-party inferences regarding writers' political views, personality, emotional tone, and demographic attributes. The paper further probes writers' subjective tolerance for these distortions and experimentally evaluates both user-facing and model-level interventions aimed at mitigating unwelcome shifts without undermining user acceptance.

Figure 1: Experimental protocol for assessing persona distortions from AI assistance, including writer-AI interaction, human editorial oversight, preference evaluation, and blinded reader ratings across diverse social attributes.

Experimental Framework

Across three experiments, nearly 3,000 UK-based writers composed political opinion paragraphs, with and without AI assistance from leading platforms (Claude, DeepSeek, and ChatGPT). Writers provided stance ratings, reasoning bullet points, and full paragraphs, so that AI assistance could be evaluated at different levels of input detail. For each response, AI-generated paragraphs (potentially minimally edited by writers) were compared with wholly human-authored counterparts. More than 11,000 blinded readers rated these texts across 29 perceptual dimensions, covering not only stance and quality but also inferences about personality, emotion, and demographic characteristics.

This protocol supports ecological validity by emulating realistic conditions: writers had full opportunity to modify AI outputs before endorsing them, mirroring adoption scenarios in professional and personal communication.

Prevalence and Structure of Persona Distortions

A fundamental finding is that writers strictly preferred their AI-assisted paragraphs to their original prose in 63.0% of cases, often explicitly asserting that the AI version better represented their views. However, systematic analysis of reader perceptions revealed that AI writing assistance induces statistically significant distortions on all 29 measured attributes (Bonferroni-corrected $p<.001$).
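
To make the Bonferroni correction concrete, the sketch below divides a family-wise threshold by the number of simultaneous tests (29 here) and checks each p-value against it. The p-values and the function name are invented for illustration and are not taken from the paper.

```python
def bonferroni_significant(p_values, alpha=0.001):
    """Return the per-test threshold and, per test, whether p clears it."""
    m = len(p_values)
    threshold = alpha / m  # each test must clear alpha divided by the number of tests
    return threshold, [p < threshold for p in p_values]

# Hypothetical p-values for 29 attributes (illustrative only, not the paper's).
example_p = [1e-8] * 28 + [2e-5]
threshold, flags = bonferroni_significant(example_p, alpha=0.001)
print(f"per-test threshold: {threshold:.2e}")
print(f"significant: {sum(flags)} of {len(flags)}")

# A p-value of 5e-4 would pass an uncorrected .001 cutoff but not the corrected one.
print(5e-4 < threshold)
```

The correction is deliberately conservative: with 29 tests, an individual p-value has to fall below roughly 0.0000345 to count as significant at the .001 level.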

Figure 2: Systematic shifts in 29 reader-rated attributes between human and AI-assisted writing, highlighting consistent directionality and scale across political, linguistic, emotional, and demographic dimensions.

Among the strongest effects:

Political stance extremity: AI assistance increased perceived opinion extremity ($+4.3$ average marginal effect, AME), decreased openness ($-0.7$), and elevated confidence ($+7.4$).
Perceived competence and quality: AI text appeared clearer ($+9.0$), more informative ($+22.7$), and more relevant ($+8.3$), independent of model or input format.
Emotional expression: AI-mediated paragraphs were rated as friendlier, more optimistic, and less angry or disgusted, compressing emotional variance into a more agreeable register.
Demographic inference: AI text raised readers' attributions of education ($\times 5.3$ odds ratio), income ($\times 4.4$), "Whiteness" ($\times 1.1$), and native-like English.
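
The two effect-size types in the list above can be illustrated with a minimal sketch. It assumes the simplest case: a binary condition (AI-assisted vs. human-only) with no covariates, where an average marginal effect reduces to a difference in mean ratings, and a 2x2 table of counts for the odds ratio. The paper's actual regression specification may differ, and all numbers and function names below are hypothetical.

```python
from statistics import mean

def difference_in_means(ai_ratings, human_ratings):
    """With a binary condition and no covariates, the AME reduces to this difference."""
    return mean(ai_ratings) - mean(human_ratings)

def odds_ratio(ai_yes, ai_no, human_yes, human_no):
    """Odds of a 'yes' judgment for AI-assisted text relative to human-only text."""
    return (ai_yes / ai_no) / (human_yes / human_no)

# Hypothetical ratings of perceived confidence on an assumed 0-100 scale.
ai = [72, 68, 80, 75]
human = [65, 60, 71, 70]
print(difference_in_means(ai, human))  # 7.25

# Hypothetical counts of readers guessing "university educated" vs. not.
print(odds_ratio(ai_yes=80, ai_no=20, human_yes=50, human_no=50))  # 4.0
```

An odds ratio of 4.0 in this toy table means the odds of the "educated" guess are four times higher for the AI-assisted text, which is how a figure such as $\times 5.3$ should be read.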

These effects were consistent across all evaluated model architectures and persisted even when AI output was derived from the writer's full original paragraph, indicative of intrinsic model tendencies rather than idiosyncratic effects of API or prompt variation.

Homogenization appeared as reduced variance in persona attributions, with AI outputs converging toward a privileged and confident cluster. This extends earlier lexical-level homogenization findings [sourati2026homogenizing] to reader perceptions.
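
One simple way to quantify this kind of homogenization is a variance ratio between conditions. The sketch below is an assumption about how such a metric could look, not the paper's measure, and the scores are made up.

```python
from statistics import pvariance

def variance_ratio(ai_scores, human_scores):
    """Ratio below 1 means AI-assisted texts are perceived as more alike."""
    return pvariance(ai_scores) / pvariance(human_scores)

# Hypothetical reader ratings of one attribute for four writers per condition.
human = [20, 45, 60, 85]  # widely spread perceptions
ai = [55, 60, 65, 70]     # perceptions bunched together
ratio = variance_ratio(ai, human)
print(round(ratio, 3))
```

A ratio well below 1, as in this toy data, would indicate that readers perceive AI-assisted writers as more similar to one another than the same writers unassisted.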

Writer Tolerance and Agency

Whether these distortions are normatively problematic depends on writers' preferences. The authors directly measured stated tolerance for specific distortions and found marked asymmetry (Figure 3):

Figure 3: Distribution of writer tolerance for specific persona distortions, plotted against the observed average marginal effect for each attribute.

Most accepted: Writers welcomed enhancements to clarity, informativeness, and relevance, all attributes that AI shifted strongly.
Least accepted: Distortions in political stance, emotion, and demographic projection were substantially less tolerated (e.g., a mean tolerance of 35.0 for perceived stance extremity).

Despite clear objection to many distortions, writers continued to prefer and endorse the AI versions, even when explicitly warned about such effects via targeted disclaimers. No significant reduction in AI preference or increased editorial intervention resulted from any variant of persona-distortion notification.

Intervention Strategies: Limits and Trade-Offs

The paper implements and experimentally evaluates two major classes of interventions:

Prompting: Augmenting the AI's generation prompt with explicit instructions to preserve authorial stance. This strategy did not reliably mitigate polarizing distortions relative to control.
Model-level reranking: Fine-tuned reward models (RMs), trained on empirical reader ratings, rerank multiple AI generations and select those that minimize stance distortion. RM-based reranking achieved a 54.6% reduction in targeted stance distortion (AME of +3.4 with reranking vs. +7.4 without) but also decreased user acceptance of AI text (48.8% strict preference with reranking vs. 58.9% without).

Figure 4: Overview and evaluation of reranking intervention for distortion mitigation, showing intervention workflow, efficacy in reducing targeted distortion, and attendant reduction in user preference and non-targeted (often desirable) distortions.
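
The reranking idea can be sketched as best-of-N selection: generate several candidates, score each with a model of perceived stance, and keep the one closest to the writer's stated stance. In the sketch below, `toy_stance_model` is a crude keyword counter standing in for a trained reward model, and the 0-100 stance scale is an assumption; neither comes from the paper.

```python
def rerank_best_of_n(candidates, predict_stance, writer_stance):
    """Pick the candidate whose predicted perceived stance is closest to the writer's."""
    return min(candidates, key=lambda text: abs(predict_stance(text) - writer_stance))

def toy_stance_model(text):
    """Hypothetical stand-in for a reward model: more intensifiers, more extreme."""
    intensifiers = ("absolutely", "must", "completely", "never")
    hits = sum(text.lower().count(word) for word in intensifiers)
    return 50 + 15 * hits  # assumed 0-100 perceived-extremity scale

candidates = [
    "We absolutely must ban this; it is completely unacceptable.",
    "I think we should probably restrict this in most cases.",
    "We must restrict this, though some exceptions make sense.",
]
chosen = rerank_best_of_n(candidates, toy_stance_model, writer_stance=60)
print(chosen)
```

Selection happens entirely after generation, so this approach needs no change to the underlying generator, at the price of producing and scoring N candidates per request.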

The key mechanism underlying this trade-off is the perceptual entanglement of persona attributes: dimensions such as confidence, clarity, and extremity are correlated in readers’ interpretations. Thus, targeted mitigation degrades both unwanted and wanted effects, paralleling alignment trade-offs in reward optimization [gao2023scaling].
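
A toy example shows why entanglement produces this trade-off. If perceived extremity and confidence rise together across candidate drafts, then choosing the least extreme draft also selects a less confident-sounding one. The candidate scores below are invented, and the attribute names are used only for illustration.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Hypothetical perceived scores for four candidate drafts of the same paragraph.
candidates = [
    {"extremity": 30, "confidence": 55},
    {"extremity": 45, "confidence": 65},
    {"extremity": 60, "confidence": 78},
    {"extremity": 75, "confidence": 88},
]
r = pearson([c["extremity"] for c in candidates], [c["confidence"] for c in candidates])

# Select the draft closest to a (hypothetical) writer stance target of 35.
chosen = min(candidates, key=lambda c: abs(c["extremity"] - 35))
mean_confidence = mean(c["confidence"] for c in candidates)
print(round(r, 3), chosen["confidence"], mean_confidence)
```

Because the two attributes are almost perfectly correlated in this toy set, the stance-faithful pick lands well below the average perceived confidence of the candidates, mirroring the side effect described above.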

Implications and Theoretical Considerations

The empirical evidence indicates that AI writing assistance consistently alters the informational and affective signals readers extract from text, even when writers review and can edit the output. This has several implications:

Signal reliability erosion: As AI assistance becomes endemic, written artifacts lose their traditional role as reliable indicators of the communicator's identity, ideology, or authenticity.
Socioeconomic stratification: If AI writing systems systematically project authority, proficiency, and privileged group membership, differential access to these tools risks amplifying pre-existing inequalities.
Political discourse and polarization: The observed increase in perceived stance extremity with AI use has theoretical potential to exacerbate societal division, especially in contexts involving policy communication or mass media.
Alignment and trade-offs: The entanglement of desired and undesired persona shifts, and the resistance of these effects to prompting and notification, challenge prevailing approaches to subjective model alignment and risk management in LLM deployment [ouyang2022training, kirk2024prism].

Conclusion

This study provides compelling, multi-level evidence that persona distortions from AI writing assistance are pervasive, persistent under user review, and not easily decoupled from the properties users prefer. While sophisticated interventions can attenuate unwelcome distortions, practical mitigation faces intrinsic trade-offs between fidelity and user acceptance due to perceptual entanglement. These findings underscore the importance of model and dataset transparency, user education, and potentially regulatory guidance as AI-mediated writing becomes embedded in consequential social, political, and economic decision-making. Further research should explore cross-cultural generalizability, broader communicative domains, and advanced alignment paradigms capable of disambiguating and disentangling complex subjective objectives in natural language generation.

Explain it Like I'm 14

Plain-language summary of “Measuring and mitigating persona distortions from AI writing assistance”

What this paper is about (overview)

This paper looks at what happens to a person’s “persona” — the picture readers build in their heads about who you are and what you believe — when you use an AI tool (like ChatGPT) to help write. The authors ask: does AI quietly change how people see you based on your writing, and can we fix that if it’s a problem?

The main questions (in simple terms)

The researchers focused on three big questions:

Does AI writing help change how readers see the writer (for example, as more extreme, smarter, or from a different background)?
Do writers like or dislike these changes?
Can we adjust AI systems to reduce the unwanted changes without making people stop wanting to use AI?

How the researchers tested this (methods explained clearly)

Think of the study like testing a “writing filter,” similar to a photo filter that can change how you look. Here’s what they did:

About 3,000 UK adults wrote short opinion paragraphs on political topics (like health care, immigration, climate, and civil liberties).
For each topic, an AI tool (such as ChatGPT, Claude, or DeepSeek) created its own version of the writer’s opinion based on the writer’s notes or draft.
Writers could edit the AI version until they felt it matched their view and then chose which version (their own or the AI one) they preferred to share.
A separate group of about 11,000 readers then rated the paragraphs (without knowing which were AI-assisted) on 29 things, like:
- Political stance (how extreme or moderate it seems, how confident it sounds)
- Writing quality (clarity, relevance, informativeness)
- Personality and emotion (friendly, hopeful, angry, etc.)
- Demographics they guessed from the writing (education level, income, race, native English speaker)

To see if warnings or fixes would help:

The team tried pop-up warnings telling writers that AI can change how they come across (for example, “AI can make your stance seem more extreme”).
They also tried “model-level” fixes: training extra AI components (think: coaching systems) that score AI drafts and pick the one that stays closest to the writer’s own stance. This “pick-the-best” step is called reranking, and the scorers are called reward models (like judges with a rubric).

What they found (main results and why they matter)

Here are the important takeaways:

Writers usually preferred the AI version:
- Even after writing their own paragraph, writers picked the AI-assisted version about 63% of the time.
- They rarely edited the AI draft (about 23% of the time), and edits were usually small.
- Many said they preferred the AI version because it expressed their opinion better.
AI changed how writers were perceived across the board:
- More opinionated: Readers saw writers using AI as more extreme and more confident in their views, and less open to changing their minds.
- Better writing quality: AI paragraphs seemed clearer, more informative, and more on-topic.
- More positive tone: AI made the writing feel friendlier and more upbeat (more hope/excitement, less anger/disgust/fear).
- More “privileged” demographics: Readers guessed that AI-assisted writers were more educated, higher-income, more likely to be White, and more likely to be native English speakers.
- More “samey”: AI made different writers sound more alike, shrinking the variety of voices and styles.
Writers didn’t like many of these shifts — but used AI anyway:
- Writers liked sounding clearer and more relevant.
- They disliked coming across as more extreme, having their emotions shifted, or being misread on demographics.
- Warnings didn’t help: Pop-up notices about distortions didn’t change how often people edited the AI text or chose the AI version.
A targeted fix worked — but had trade-offs:
- The team trained “reward models” to score how closely an AI draft matched the writer’s stance and then picked the best draft (reranking).
- This cut the “more extreme” distortion by about half.
- But writers liked these reranked AI drafts less and chose them less often.
- Why? Fixing the “too extreme” problem also toned down other changes people actually liked (like sounding clearer and more confident). In other words, the good and bad effects were tangled together.

Why this matters (implications and impact)

Everyday impact: If millions of people use AI to write, small shifts in how they come across can add up. People might seem more extreme, more polished, and more similar than they really are.
Trust and signals: Readers (including teachers, employers, and admissions officers) may stop trusting writing as a reliable window into who someone is. That could push people to rely on other, sometimes less fair, signals.
Inequality: If AI makes writers seem more educated and confident, people with better access to AI may gain extra advantages, widening existing gaps.
Democracy and public debate: More “opinionated-sounding” writing could worsen polarization and make it harder to understand what people truly believe.
Design challenge: It’s possible to reduce some distortions (like sounding more extreme), but doing so can also remove qualities writers like (like clarity and confidence). This trade-off means building “faithful” AI writing tools that people still want to use won’t be easy.

Bottom line

AI writing tools don’t just help with grammar — they can subtly change how others see you. Writers often accept those changes because the results sound clearer and more polished, even if they also make them seem more extreme or different in identity. Technical fixes can reduce certain problems, but they can also take away benefits people value. As AI writing spreads, we’ll need careful design, strong transparency, and ongoing testing to keep people’s voices authentic while still giving them helpful tools.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of unresolved issues that future research could address to strengthen, generalize, and operationalize the paper’s findings.

External validity under high stakes: Do writers scrutinize and edit AI outputs more in consequential contexts (e.g., job applications, legal filings), and does this reduce distortions?
Reader context and cues: How do persona distortions change when readers have additional information (author name, role, history), repeated interactions, or stronger incentives for accuracy?
Domain generalization: Do effects replicate beyond political opinion writing (e.g., professional emails, résumés, academic prose, creative writing, social media posts)?
Length and format sensitivity: How do distortions scale across shorter (posts/tweets) and longer (essays/reports) texts and across formats (argumentative vs narrative, formal vs informal)?
Cross-lingual and cross-cultural generalization: Do patterns hold across languages, dialects, cultural contexts, and in low-resource language settings?
Ground-truth fidelity vs relative distortion: To what extent do AI-assisted texts misrepresent true writer attributes (stance, personality, demographics) relative to writer self-reports or validated measures, not just relative to the writer’s own text?
Mechanisms and mediators: Which textual features (lexical choices, hedging/intensifiers, structure, sentiment, politeness, concreteness) causally drive perception shifts? Use controlled edits/ablation to identify mediators.
Topic-level heterogeneity: Which issue types (e.g., moralized, identity-laden, technical) amplify or dampen distortions and homogenization?
Writer-level moderators: How do effects vary by writer proficiency, editing diligence, political ideology, demographics, and native/non-native language status?
Reader-level moderators: Do reader ideology, affect, literacy, demographics, and prior exposure to AI writing moderate perceived distortions?
Assistance type differentiation: How do grammar-only correction, paraphrasing, summarization, “improve,” and “rewrite” modes differentially affect persona distortion?
Decoding and parameter sensitivity: How do temperature, nucleus sampling, length penalties, and system prompts influence distortion magnitude and homogenization?
Model provenance and training predictors: Which training choices (RLHF data, constitutional objectives, safety policies) and model families (open-source, smaller, pre-/post-RLHF) most strongly predict distortions?
Editing affordances and workflows: Can interface-level interventions (side-by-side diffs, constraint sliders for tone/extremity, “persona meter” with real-time predicted perceptions, targeted revision prompts) reduce distortions more effectively than generic disclaimers?
Personalized writer-side feedback: Do draft-specific alerts (e.g., “your paragraph appears more extreme than your stated stance by X”) change acceptance and editing behavior?
Reader-side labeling and disclosure: How does disclosing AI assistance (or uncertainty thereof) to readers alter their inferences, trust, and downstream decisions?
Longitudinal adaptation: Do writers and readers adapt to AI styles over time (attenuating or compounding distortions), and does exposure shift social norms for acceptable tone and confidence?
Downstream real-world impacts: Field and audit studies to measure consequences for hiring, admissions, peer review, moderation, and credibility allocation when perceived demographics and competence shift.
Homogenization consequences: Does reduced variance in perceived personas suppress diversity of viewpoints or improve civility and coordination? Test in group deliberation and collaborative settings.
Persuasion and polarization outcomes: Do AI-induced persona shifts increase persuasive impact, affect cross-partisan engagement, or intensify affective polarization in individuals and groups?
Robust hierarchical inference: Re-estimate effects with models that include writer and proposition random effects (e.g., Bayesian hierarchical models) to test robustness of significance and effect sizes.
Edit dose–response and strategy effectiveness: Which types and magnitudes of human edits most reduce distortions without sacrificing perceived quality?
Non-native speakers and minority dialects: Does AI assistance mask identity-linked linguistic signals (AAVE, regional dialects, ESL cues) with equity implications; what are harms/benefits?
Mitigation beyond stance: Can model-level interventions preserve demographic, emotional, and identity signals while maintaining desirable qualities (clarity, informativeness)?
Multi-objective optimization: Develop and evaluate training methods (constrained RLHF, Pareto-frontier optimization, penalty terms for persona drift) that jointly optimize fidelity and quality without eroding user acceptance.
Attribute disentanglement: Can controllable generation or representation learning decouple stance extremity from confidence/clarity to avoid the correlated side effects observed with reranking?
Mitigation generalization and overfitting: Do reranking and other mitigations generalize across input types (ratings, bullets, full paragraphs), topics, and models without reward hacking or distribution shift failures?
Detectability side-effects: Do mitigations change how detectable AI assistance is, and does detectability alter acceptance or downstream judgments?
Strategic and adversarial use: How do distortions behave when users intentionally manipulate persona (e.g., coordinated campaigns, astroturfing), and what defenses work?
Policy and governance levers: What disclosure, auditing, or certification regimes reduce harmful persona distortions without unduly penalizing legitimate assistance?
Benchmarking and metrics: Develop standardized, cross-domain multilingual benchmarks and metrics for persona fidelity, distortion, homogenization, and disentanglement to enable reproducible evaluation.
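
The multi-objective item in the list above mentions Pareto-frontier optimization. As a minimal illustration of the underlying concept (a filter over already-scored candidates, not a training method), the sketch below keeps only the (fidelity, quality) pairs that no other pair beats on both axes. Scores and names are hypothetical.

```python
def pareto_front(points):
    """Keep (fidelity, quality) pairs that no other pair dominates."""
    front = []
    for i, (f, q) in enumerate(points):
        dominated = any(
            (f2 >= f and q2 >= q) and (f2 > f or q2 > q)
            for j, (f2, q2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((f, q))
    return front

# Hypothetical (stance fidelity, perceived quality) scores for five drafts.
points = [(0.9, 0.4), (0.7, 0.7), (0.5, 0.9), (0.6, 0.6), (0.4, 0.5)]
front = pareto_front(points)
print(front)  # [(0.9, 0.4), (0.7, 0.7), (0.5, 0.9)]
```

Every draft on the front represents a different fidelity-quality compromise; which one to show a writer remains a design decision that the filter itself does not make.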

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

The paper’s findings and methods enable several deployable applications across industry, academia, policy, and daily life. The items below include likely sectors, potential tools/workflows, and key assumptions or dependencies.

Persona-fidelity reranking in writing assistants (software, enterprise productivity)
- What: Add a “persona fidelity” mode that uses best-of-N generation with reward-model-based reranking to minimize stance distortion relative to user-provided inputs (e.g., bullets, outline).
- Tools/workflows: Reward models trained to predict perceived stance; verbalized sampling; candidate reranking; UI toggle to balance “fidelity vs. polish.”
- Assumptions/dependencies: Requires labeled data or proxy signals to train reward models; slightly higher latency and cost from best-of-N; may reduce user preference/acceptance as shown in the paper.
Persona shift meter for end-users (daily life, enterprise writing tools, education)
- What: Real-time indicators showing how the AI-assisted text changes perceived stance, confidence, emotional tone, and inferred demographics relative to the user’s draft.
- Tools/workflows: Inference-time classifiers/regressors predicting perception scores; delta visualizations; inline feedback before acceptance.
- Assumptions/dependencies: Prediction generalizes across topics and populations; explainability that users can act on; small UX burden.
Developer and auditor benchmarks for persona distortion (AI model evaluation, academia, regulators)
- What: Adopt a 29-dimension reader-perception evaluation protocol as a benchmark for pre-release model audits and red-teaming.
- Tools/workflows: Human-rating pipelines; stratified reader panels; regression-based distortion metrics (AMEs, odds ratios); homogenization/entropy tracking.
- Assumptions/dependencies: Availability of representative raters; cost/time for human evaluation; reusability of open replication materials.
Risk controls for political and public-interest writing (policy, media, civic tech)
- What: Integrate persona-fidelity checks in workflows for op-eds, campaign messages, legislative drafts, and civic communications to avoid inadvertent polarization.
- Tools/workflows: Mandatory “stance deviation” report before publication; editorial sign-off; tuned reranking profiles for political content.
- Assumptions/dependencies: Institutional buy-in; clear thresholds for acceptable deviation; throughput compatible with newsroom/campaign timelines.
Hiring/admissions safeguards (HR, education)
- What: Adjust evaluation processes to account for AI-induced boosts in perceived competence and positivity and shifts toward privileged demographic signals.
- Tools/workflows: Proctored or in-person writing samples; oral defenses; portfolio attestations of AI usage; criteria that de-emphasize surface fluency.
- Assumptions/dependencies: Policy updates; fairness/legal review; stakeholder acceptance.
Editorial and compliance standards for AI-assisted documents (law, journalism, enterprise communications)
- What: Internal policies requiring persona-fidelity review for legal briefs, PR statements, and sensitive communications where misrepresentation is risky.
- Tools/workflows: Two-person review for AI-assisted drafts; persona-fidelity logs; use of reranking in high-stakes contexts.
- Assumptions/dependencies: Organizational appetite for process overhead; security/privacy constraints on document analysis.
Style-diversity controls to counter homogenization (social platforms, content tools, marketing)
- What: Encourage stylistic variation to mitigate convergence toward a narrow, agreeable register (e.g., configurable tone templates, dialect-preserving modes).
- Tools/workflows: Prompt libraries representing diverse styles; candidate selection penalizing homogenized phrasing.
- Assumptions/dependencies: Business KPIs tolerate some variance in “clarity”/“polish”; careful design to avoid stereotyping.
Education practices addressing persona distortion (education, academic integrity)
- What: Explicit instruction on AI’s effects on perceived stance, emotions, and identity; reflective assignments comparing human vs. AI-assisted drafts.
- Tools/workflows: Classroom use of perception meters; rubrics that grade idea quality and self-representation; submission of pre-AI outlines/bullets alongside final drafts.
- Assumptions/dependencies: Faculty training; student access to tools; alignment with academic policies.
Product setting: “Stance-preserving grammar” mode (software, accessibility)
- What: A constrained assistant that targets copy-editing with strict stance- and tone-preservation thresholds and flags if meaning shifts.
- Tools/workflows: Minimal-edit algorithms; meaning-preservation tests; diff views with risk flags.
- Assumptions/dependencies: Even minimal edits can shift meaning; thresholds must be empirically tuned; potential reduction in perceived quality.
Procurement and policy contracting clauses (government, enterprise)
- What: Require vendors to report persona distortion metrics for AI writing features and to provide mitigations (e.g., reranking capability).
- Tools/workflows: Contractual KPIs on stance fidelity; audit trails; periodic evaluation using the benchmark protocol.
- Assumptions/dependencies: Market maturity to supply metrics; harmonization with procurement regulations.
Equity-aware writing assistance for non-native speakers (education, global workforce)
- What: Provide AI assistance that improves clarity while preserving identity cues users want to retain, reducing effacement of minority voice.
- Tools/workflows: User-selectable identity/voice preservation; evaluation against demographic inference shifts; user-in-the-loop calibration.
- Assumptions/dependencies: Consent and control over identity signals; careful UX to avoid reinforcing stereotypes.
Platform-level treatments for AI-generated posts (social media, community platforms)
- What: For built-in writing helpers, apply reranking to limit polarization and homogenization of user posts without generic warnings (shown ineffective).
- Tools/workflows: In-app best-of-N generation for composer; policy toggles for political topics; A/B testing for engagement trade-offs.
- Assumptions/dependencies: Compute costs at platform scale; monitoring of unintended impacts on expression and engagement.
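
Several of the applications above (the persona shift meter, the "stance deviation" report) reduce to the same computation: compare predicted perception scores for the writer's draft and the AI-assisted text, then flag attributes that moved more than an allowed amount. The sketch below assumes such predicted scores already exist; the attribute names, scale, and thresholds are hypothetical.

```python
def persona_shift_report(draft_scores, assisted_scores, thresholds):
    """Per attribute, report the shift and whether it exceeds its threshold."""
    report = {}
    for attr, limit in thresholds.items():
        delta = assisted_scores[attr] - draft_scores[attr]
        report[attr] = {"delta": delta, "flag": abs(delta) > limit}
    return report

# Hypothetical predicted perception scores on an assumed 0-100 scale.
draft = {"extremity": 40, "confidence": 55, "clarity": 50}
assisted = {"extremity": 52, "confidence": 63, "clarity": 70}
# Tighter limit for stance than for quality, reflecting writers' stated tolerances.
limits = {"extremity": 5, "confidence": 10, "clarity": 25}

report = persona_shift_report(draft, assisted, limits)
for attr, row in report.items():
    print(attr, row["delta"], "FLAG" if row["flag"] else "ok")
```

Per-attribute thresholds let a tool stay quiet about shifts writers welcome, such as clarity, while surfacing the ones they object to, such as stance extremity.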

Long-Term Applications

The following opportunities require further research, scaling, or development to address technical, social, or regulatory dependencies.

Personalized, privacy-preserving persona fidelity (software, consumer productivity)
- What: On-device or federated reward models tailored to each user’s baseline style and stance to minimize distortion while keeping benefits users value.
- Dependencies: Per-user data collection with privacy guarantees; continual learning; multi-objective optimization to manage trade-offs.
Multi-objective alignment to decouple entangled perceptions (AI research, model training)
- What: New training paradigms (e.g., multi-objective RLHF, representation disentanglement) that preserve confidence/clarity without amplifying perceived extremity.
- Dependencies: Better measurement models of human perception; scalable training pipelines; robust generalization across topics and cultures.
Standards and certification for “representation fidelity” (policy, standards bodies, regulators)
- What: ISO-like standards and audit frameworks to certify AI writing tools on persona distortion metrics across protected attributes and opinion dimensions.
- Dependencies: Consensus on metrics; third-party auditors; legal/regulatory alignment across jurisdictions.
Election and political communications policy updates (public policy, electoral commissions)
- What: Rules for AI-assisted political advertising, petitions, and legislative submissions requiring stance-fidelity checks and disclosures enforceable at scale.
- Dependencies: Regulatory mandate; detection and audit infrastructure; coordination with platforms and parties.
Longitudinal monitoring of public discourse effects (academia, civil society, regulators)
- What: Dashboards that track shifts in aggregate perceived stance, sentiment, and demographic signals in AI-mediated communications over time.
- Dependencies: Access to representative corpora; ethical data use; robust perception models that avoid bias.
Identity-preserving language technologies for marginalized dialects (education, DEI tooling)
- What: Assistants that respect linguistic identity (dialects, sociolects) while improving readability, with guardrails against “standardized voice” assimilation.
- Dependencies: High-quality dialectal datasets; participatory design with affected communities; evaluation beyond fluency metrics.
Liability and accountability frameworks for misrepresentation (law, compliance)
- What: Legal doctrines or contractual norms assigning responsibility when AI assistance materially misrepresents author stance or identity in high-stakes contexts.
- Dependencies: Case law development; industry norms; mechanisms for provenance and audit trails.
Reimagined assessment and hiring practices (education, HR)
- What: System-level shifts toward multimodal or interactive evaluations (oral defenses, live problem-solving) less sensitive to AI-enhanced surface fluency.
- Dependencies: Institutional change management; cost and scalability; validity studies.
Platform policies to manage homogenization at scale (social platforms)
- What: Algorithmic incentives that promote stylistic diversity and reduce “one-voice” convergence from AI assistance without penalizing clarity.
- Dependencies: Reliable homogenization metrics; robust causal evaluation; balancing user satisfaction and platform safety.
Domain-specific assistants with calibrated persona (healthcare, finance, government)
- What: Task-tuned writing assistants that preserve clinician/patient tone in health records or maintain neutral/accurate stance in regulatory and financial communications.
- Dependencies: Sector-specific datasets; compliance approval; human-in-the-loop verification.
Cross-lingual and cross-cultural generalization (global software, research)
- What: Extending measurement and mitigation to other languages/cultures where perception norms differ, enabling fair, localized tuning.
- Dependencies: Diverse reader panels; localization of perception attributes; culturally aware training data.
Provenance and transparency infrastructure (software ecosystems, policy)
- What: Cryptographic content provenance plus structured disclosure of AI assistance and applied mitigations (e.g., “reranking applied: stance fidelity 0.8”).
- Dependencies: Standards (e.g., C2PA extensions); interoperable metadata; adoption by authoring tools and platforms.

Notes on Feasibility and Dependencies

Trade-off entanglement: The paper shows that reducing perceived extremity via reranking also attenuates perceived confidence/clarity and lowers user preference. Any deployment must expose user controls, set context-specific defaults, and monitor for acceptance and effectiveness.
Data and generalization: Reward models hinge on high-quality human ratings. Portability to new domains, topics, and cultures must be validated; otherwise, tools risk miscalibration.
Cost/latency: Best-of-N generation and scoring add inference cost; on-device or cached approaches may be needed for scale.
Privacy and consent: Persona-fidelity features should avoid inferring protected attributes without explicit consent and must provide transparent controls over identity signals.
Organizational incentives: Vendors may resist mitigations that reduce perceived “quality.” Adoption will benefit from standards, procurement requirements, or regulatory incentives.
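
The reranking trade-offs above are easier to reason about with a concrete sketch. The following is a minimal illustration of best-of-N selection, not the paper's implementation: `rerank_by_stance_fidelity` and `toy_stance_model` are hypothetical names, and the keyword-based scorer is a stand-in assumption for a trained reward model. Each call scores every candidate once, which is where the extra inference cost for N candidates comes from.

```python
from typing import Callable, Sequence


def rerank_by_stance_fidelity(
    candidates: Sequence[str],
    predict_stance: Callable[[str], float],
    target_stance: float,
) -> str:
    """Return the candidate whose predicted stance is closest to the target.

    `predict_stance` stands in for a reward model; `target_stance` stands in
    for the stance predicted for the writer's own text (both hypothetical).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # One scoring call per candidate: cost grows linearly with N.
    return min(
        candidates,
        key=lambda text: abs(predict_stance(text) - target_stance),
    )


def toy_stance_model(text: str) -> float:
    """Toy scorer on a 0-100 scale (an assumption, not a real reward model)."""
    lowered = text.lower()
    score = 50.0 + 20.0 * lowered.count("strongly") + 10.0 * lowered.count("support")
    return min(score, 100.0)


candidates = [
    "I strongly, strongly support this policy.",
    "I support this policy, with some reservations.",
    "I have no view on this policy.",
]
best = rerank_by_stance_fidelity(candidates, toy_stance_model, target_stance=62.0)
print(best)
```

Because the selection criterion only looks at stance distance, nothing in this sketch protects other properties such as perceived confidence or clarity, which is consistent with the entanglement the notes describe.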


Glossary

Average marginal effect (AME): A summary measure of how much the predicted outcome changes, on average, when a predictor (e.g., AI vs. human text) changes, holding other factors constant. "Pts are average marginal effects from beta regressions, on the original 0-100 rating scale (see Methods)."
Beta regression: A regression model tailored for outcomes that are proportions or lie strictly between 0 and 1, often using a logit link. "For the 20 rating attributes measured on a 0-100 scale (e.g. paragraph relevance, writer confidence) we fit beta regressions using the logit link function, after mapping each outcome to the (0,1) interval."
Best-of-N selection: A generation strategy where multiple candidate outputs are produced and the best is chosen according to a scoring function (e.g., a reward model). "This approach draws on best-of-N selection methods used in AI training, where reward models steer outputs toward a specified objective by scoring and selecting among multiple output candidates"
Bonferroni correction: A multiple-comparisons adjustment that tightens significance thresholds by dividing alpha by the number of tests. "*** indicates significance at p<.001 after Bonferroni correction across all 29 rating attributes."
Bootstrap confidence intervals (bootstrap CIs): Uncertainty intervals computed by resampling the data many times and estimating variability across resamples. "C for nominal rating attributes: Cramer's V from comparing rating distributions for human- and AI-written paragraphs, with 95% bootstrap CIs (see Methods)"
Census-representative: A sample constructed to match key demographic distributions of a national census (e.g., age, gender, race). "In our main study, writers were UK adults (N=1,501, census-representative on age, gender, race) who expressed their opinion on three political propositions drawn randomly from a pool of 100 (see Methods)."
Cramer's V: A measure of association between two nominal variables, bounded between 0 (no association) and 1 (perfect association). "C for nominal rating attributes: Cramer's V from comparing rating distributions for human- and AI-written paragraphs, with 95% bootstrap CIs (see Methods)"
Cumulative logit link function: The link function used in ordinal logistic regression that models cumulative probabilities across ordered categories. "For the 5 ordinal rating attributes (e.g. writer income, education) we fit ordinal logistic regressions using the cumulative logit link function."
Delta-method confidence intervals (delta-method CIs): Approximate CIs derived by linearizing a function of estimates using a first-order Taylor expansion. "A for rating attributes measured on a 0-100 scale: Average marginal effect (AME) from mixed-effect beta regressions, with 95% delta-method CIs."
Entropy (categorical attributes): A measure of distributional uncertainty or dispersion across categories; lower entropy indicates more concentration/homogeneity. "Across most dimensions, AI-assisted paragraphs were rated significantly more similarly to each other than their human-written counterparts (significant reduction in standard deviation [scale attributes] or entropy [categorical attributes] for 22 of 29 rating attributes at p<.001 after Bonferroni correction; see SI)."
Generalised linear mixed-effects regression: A regression framework that models non-Gaussian outcomes with link functions while including random effects to account for grouped/clustered data. "In our main and mitigation studies, we assessed whether AI writing assistance caused a significant change in third-party perceptions of writers and their opinions. For this purpose, we fit generalised linear mixed-effects regression models for each of the 29 rating attributes, where the single regressor was binary paragraph type (human- or AI-written)."
Homogenisation: The process by which outputs become more similar or uniform, reducing diversity in style or perceived attributes. "AI writing assistance also homogenised perceived writer personas."
Levenshtein ratio: A normalized edit-distance metric that measures textual similarity based on the minimal number of edits needed; values closer to 1 indicate higher similarity. "Users edited the AI-generated paragraphs only 23% of the time (<30% across all models and input types at p<.01; see SI), and most edits were minor (median Levenshtein ratio = 0.96)."
Logit link function: A function that maps probabilities in (0,1) to the real line, commonly used in logistic-type models. "For the 20 rating attributes measured on a 0-100 scale (e.g. paragraph relevance, writer confidence) we fit beta regressions using the logit link function, after mapping each outcome to the (0,1) interval."
Multinomial logit link function: The link used in multinomial logistic regression for modeling outcomes with more than two nominal categories. "For the 4 nominal rating attributes (e.g. writer race, gender) we fit multinomial logistic regressions using the multinomial logit link function."
Odds ratio: A multiplicative measure comparing the odds of an outcome between groups; values above 1 indicate higher odds in the numerator group. "Odds ratios are from ordinal logistic regression for ordinal attributes (e.g. writer age, income) and one-vs-all logistic regression for nominal attributes (e.g. writer gender, race)."
One-vs-all logistic regression: A strategy for multi-class classification where a separate binary model is trained for each class against all others. "Odds ratios are from ordinal logistic regression for ordinal attributes (e.g. writer age, income) and one-vs-all logistic regression for nominal attributes (e.g. writer gender, race)."
Ordinal logistic regression: A regression model for ordered categorical outcomes, typically using a cumulative logit link. "Odds ratios are from ordinal logistic regression for ordinal attributes (e.g. writer age, income) and one-vs-all logistic regression for nominal attributes (e.g. writer gender, race)."
Pearson's r: A correlation coefficient that quantifies linear association between two continuous variables, ranging from -1 to 1. "AME is positively correlated with writer tolerance (Pearson's r = 0.67, p=.002)."
Persona distortion: A systematic misrepresentation of an author’s perceived beliefs, personality, or identity caused by AI assistance. "Persona distortions from AI writing assistance are measured as systematic differences in reader perception between human-written and AI-generated paragraphs (with human edits) from the same writers."
Preregistration: The practice of specifying research hypotheses and analysis plans before observing the data to reduce bias and p-hacking. "All studies were preregistered (main study: https://osf.io/6exf2/overview, follow-up studies: https://osf.io/sf7rg/overview)."
Prompting: Supplying explicit natural-language instructions in the input to guide an AI model’s output. "As a first intervention, we tried Prompting: appending a short instruction (113 words; see SI) to the AI model's generation prompt, explicitly directing it to preserve the writer's issue stance."
Random effects: Model components that capture variability attributable to clustering (e.g., differences among readers or writers), assumed to be drawn from a distribution. "In all cases we included reader random effects only, in line with our pre-registration, as models with additional writer and proposition random effects did not converge reliably."
Reranking: Selecting the best output from multiple AI-generated candidates using a learned scoring model to optimize a target criterion. "Given that neither user-facing disclaimers nor direct prompting mitigated the polarising distortion, we turned to a more sophisticated model-level intervention: Reranking (Figure 4A)."
Reward model: A learned model that scores outputs according to a specified objective, guiding selection or training of AI-generated responses. "We fine-tuned two reward models (RMs) using the 10,008 annotated paragraph ratings collected in our main study: one to predict perceived issue stance from AI-written paragraphs, and another to predict the perceived stance of the writer's own text from their bullet point inputs."
Stance polarity: The perceived position of an author’s opinion along a spectrum (e.g., from moderate to extreme) on an issue. "D: While Reranking succeeded in reducing distortion in writer stance polarity, it also attenuated a range of non-targeted distortions that writers in our main study found acceptable (green) and unacceptable (red)."
Verbalised sampling: A prompting-based sampling approach in which the model is asked to verbalise several candidate responses, with associated probabilities, in a single output, increasing the diversity of candidates available for selection. "At generation time, we used verbalised sampling (Zhang et al., 2025) to generate multiple AI paragraph candidates, and selected the candidate predicted by our RMs to minimise stance distortion relative to the writer's own paragraph (see Methods for details)."
Violin plot: A visualization combining a box plot with a kernel density estimate to show distribution shape and summary statistics. "Violin plots show rating distributions for human- and AI-written paragraphs, with vertical lines for per-group means."
Within-subjects design: An experimental design where each participant experiences multiple conditions, enabling comparisons within the same individuals. "In our main and disclaimer studies, we randomly paired the three AI models with the three input formats, so that each writer engaged with content generated by each model and based on each input type, in a within-subjects design."
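
Several of the glossary quantities can be computed in a few lines. The sketch below is illustrative only and uses the Python standard library: the function names are hypothetical, the entropy uses base-2 logarithms, and Cramer's V is the plain uncorrected form, all of which are assumptions rather than details confirmed by the paper. The Cramer's V helper also assumes every row and column of the table has a non-zero total.

```python
import math
from collections import Counter
from typing import Sequence


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def shannon_entropy(labels: Sequence[str]) -> float:
    """Entropy in bits of a categorical rating distribution."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cramers_v(table: Sequence[Sequence[int]]) -> float:
    """Uncorrected Cramer's V for a contingency table of counts.

    Rows could be paragraph type (human, AI); columns the rated category.
    Assumes no row or column sums to zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Threshold for p<.001 spread across the 29 rating attributes.
print(bonferroni_threshold(0.001, 29))
# More concentrated ratings give lower entropy (more homogeneous).
print(shannon_entropy(["a", "a", "b", "b"]), shannon_entropy(["a", "a", "a", "b"]))
# Identical distributions give V = 0; fully separated ones give V = 1.
print(cramers_v([[5, 5], [5, 5]]), cramers_v([[10, 0], [0, 10]]))
```

In the homogenisation analysis described in the Entropy entry, the relevant comparison would be the entropy of ratings for AI-assisted paragraphs against that for human-written ones; the toy labels here only show the direction of the effect.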
