
StoryScope: Investigating idiosyncrasies in AI fiction

Published 3 Apr 2026 in cs.CL | (2604.03136v2)

Abstract: As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions. We apply StoryScope to a parallel corpus of 10,272 writing prompts, each written by a human author and five LLMs, yielding 61,608 stories, each ~5,000 words, and 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots while human stories frame protagonists' choices as more morally ambiguous and have increased temporal complexity. Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity. More broadly, these results suggest that differences in underlying narrative construction, not just writing style, can be used to separate human-written original works from AI-generated fiction.

Summary

  • The paper presents a novel narrative forensics pipeline that leverages discourse-level features to distinguish between human and AI-generated fiction.
  • It systematically extracts 304 interpretable narrative features from over 61,000 stories, achieving up to 93.2% macro-F1 in authorship discrimination.
  • It reveals that AI narratives tend to be over-explicit, linearly structured, and converge towards homogeneity, highlighting limits in creative variability.

Investigating Discourse-Level Idiosyncrasies in AI-Generated Fiction

Introduction

The proliferation of LLM-generated fiction has heightened the need for robust methods of authorship discrimination and originality assessment, especially as stylistic artifacts become less diagnostic due to model improvement and post-processing. "StoryScope: Investigating idiosyncrasies in AI fiction" (2604.03136) advances narrative forensics by systematically extracting and evaluating interpretable, discourse-level features from long-form narratives, thereby shifting the focus of AI authorship attribution from shallow stylistic cues to structural aspects of storytelling. This work operationalizes a large-scale pipeline that captures deep narrative decisions, quantifies their discriminative power, and analyzes the implications for both detection robustness and the broader question of AI creativity.

Methodology

Corpus Construction

The study deploys a parallel corpus comprising 10,272 writing prompts, each linked to one human-written story (from Books3) and five LLM-generated mirrors (Claude Sonnet 4.6, DeepSeek V3.2, Gemini 3 Flash, GPT-5.4, and Kimi K2.5), yielding 61,608 narratives with mean length ~5,000 words. Prompts are reverse-engineered using Gemini to preserve source plot and character structure, enabling direct comparison between human and AI outputs at the narrative level.

Narrative Feature Induction Pipeline

The novel StoryScope pipeline executes a three-stage process:

  1. Structured Narrative Representation: Each story is abstracted into a structured template, grounded in the NarraBench taxonomy, spanning ten core narrative dimensions (e.g., agent, event, plot, structure, time, revelation, and style).
  2. Cross-Source Comparative Analysis: Pairwise and pooled analyses of structured representations are conducted with GPT-5.1 to identify recurrent divergences across sources given the same prompt.
  3. Feature Extraction and Assignment: From ~600 comparative analyses, the pipeline induces 408 candidate features, pruned and deduplicated to 304 interpretable narrative features via clustering over embedding space. Gemini 3 Flash assigns these to all 61K stories with high annotator agreement (Cohen’s κ = 0.84).
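
This summary does not say which embedding model, similarity measure, or clustering algorithm performs the deduplication in step 3. As a rough illustration only, the sketch below greedily drops any candidate whose cosine similarity to an already-kept feature reaches a threshold. The feature names, the toy 3-dimensional vectors, and the 0.9 threshold are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe(candidates: List[Tuple[str, Sequence[float]]],
           threshold: float = 0.9) -> List[str]:
    """Keep a candidate only if it is not too similar to any kept one."""
    kept: List[Tuple[str, Sequence[float]]] = []
    for name, vec in candidates:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((name, vec))
    return [name for name, _ in kept]


# Hypothetical candidate features with made-up embedding vectors.
candidates = [
    ("uses_flashbacks", [1.0, 0.0, 0.1]),
    ("contains_flashback_scenes", [0.98, 0.05, 0.12]),  # near-duplicate
    ("moral_stated_explicitly", [0.0, 1.0, 0.0]),
]
kept_names = dedupe(candidates)
print(kept_names)
```

A greedy pass like this depends on candidate order; a real pipeline might instead cluster first and pick one representative per cluster.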

Feature encoding is categorical, ordinal, scale, binary, or multi-select, yielding a high-dimensional, content-grounded narrative feature space for downstream classification.
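
To make the five encoding types concrete, here is a minimal sketch of how one story's feature assignments might be flattened into a numeric vector for a classifier. The schema below is hypothetical: the feature names, levels, and value sets are invented for illustration and are not taken from the paper's 304-feature taxonomy, and the paper's actual encoding may differ.

```python
from typing import Any, Dict, List

# Hypothetical schema covering the five encoding types named above.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "has_flashback": {"type": "binary"},
    "theme_explicitness": {"type": "ordinal",
                           "levels": ["implicit", "mixed", "explicit"]},
    "subplot_count": {"type": "scale"},
    "ending_driver": {"type": "categorical",
                      "values": ["protagonist", "external", "mixed"]},
    "nonlinear_devices": {"type": "multi-select",
                          "values": ["flashback", "time_jump", "frame_story"]},
}


def encode_story(features: Dict[str, Any]) -> List[float]:
    """Flatten one story's feature assignments into a numeric vector."""
    vec: List[float] = []
    for name, spec in SCHEMA.items():
        value = features[name]
        kind = spec["type"]
        if kind == "binary":
            vec.append(1.0 if value else 0.0)
        elif kind == "ordinal":
            # Rank of the level within its ordered list.
            vec.append(float(spec["levels"].index(value)))
        elif kind == "scale":
            vec.append(float(value))
        elif kind == "categorical":
            # One-hot: exactly one slot is set.
            vec.extend(1.0 if value == v else 0.0 for v in spec["values"])
        elif kind == "multi-select":
            # Multi-hot: any number of slots may be set.
            vec.extend(1.0 if v in value else 0.0 for v in spec["values"])
        else:
            raise ValueError(f"unknown feature type: {kind}")
    return vec


story = {
    "has_flashback": True,
    "theme_explicitness": "explicit",
    "subplot_count": 2,
    "ending_driver": "external",
    "nonlinear_devices": ["flashback", "time_jump"],
}
encoded = encode_story(story)
print(encoded)  # [1.0, 2.0, 2.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```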

Experimental Results

Human vs. AI Detection

Gradient-boosted classifiers (XGBoost) trained solely on narrative features yield a macro-F1 of 93.2% for human vs. AI discrimination—retaining 97% of the performance of combined style and narrative models (macro-F1 96.0%). A compact subset of 30 core narrative features achieves 84.8% macro-F1. Performance is robust to story length and to post-hoc stylistic editing (drop of only 1.6 points after artifact removal via LAMP rewriting).
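
For readers unfamiliar with the metric, macro-F1 is the unweighted mean of per-class F1 scores, so the minority class counts as much as the majority class. The sketch below computes it on toy labels (hypothetical, not the paper's data) and then reproduces the retention figure from the two reported scores.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: one human story is misclassified as AI.
y_true = ["human", "human", "ai", "ai", "ai", "ai"]
y_pred = ["human", "ai", "ai", "ai", "ai", "ai"]
toy_score = macro_f1(y_true, y_pred)  # human F1 = 2/3, ai F1 = 8/9

# Retention: narrative-only macro-F1 divided by combined macro-F1.
retained = 93.2 / 96.0

print(f"toy macro-F1: {toy_score:.3f}")       # 0.778
print(f"retained fraction: {retained:.3f}")   # 0.971
```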

Discriminative Narrative Idiosyncrasies

Key narrative distinctions identified include:

  • AI Over-Explicitness and Moralizing: AI consistently overstates themes, central morals, and philosophical debates, with narrated lessons, overdetermined unity, and reliance on embodied sensorial expression far exceeding human baselines.
  • Linear, Causally Tidy AI Plots: AI stories preferentially adhere to single-track, causally continuous, temporally linear structures, with minimal subplotting and rare employment of nonlinear techniques (e.g., flashbacks, time jumps).
  • Restricted Intertextuality and Reader Address: Explicit references, direct audience engagement, and complex narrative devices (e.g., fourth-wall breaks) are more prevalent in human writing; AI defaults to vague allusions and impersonal perspectives.
  • Homogeneity and Structural Convergence: Vector-space analysis reveals all five LLMs clustering tightly and distinctly from dispersed, more "original" human stories.

Six-Way Authorship Attribution

For six-way attribution, the narrative model achieves macro-F1 68.4%, with combined style and narrative features reaching 77.3%. Human writing is the most separable class (F1 93.0%), with Claude and GPT showing the most distinct LLM-specific fingerprints. DeepSeek, Gemini, and Kimi exhibit substantial narrative overlap, supporting claims of increasing LLM convergence.

Feature Robustness and Rarity

Narrative features significantly outperform length or purely stylistic controls, demonstrate high reliability across story genres, and remain substantially effective after editing or length norming. Human-authored stories are more structurally "rare" (mean percentile 0.71 vs. 0.49 for AI), an interpretable proxy for originality under regulatory and legal frameworks.
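
The exact rarity definition is not reproduced in this summary; elsewhere on this page it is described only as using Euclidean distance on z-scored encodings. Under that reading, one plausible sketch is to z-score each feature column, measure each story's distance from the pooled centroid (the origin after z-scoring), and convert distances to percentile ranks. The choice of centroid distance, the tie handling, and the toy data are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def zscore_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Standardize each column to mean 0 and (population) std 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def rarity_percentiles(rows: Sequence[Sequence[float]]) -> List[float]:
    """Percentile rank of each row's distance from the pooled centroid."""
    z = zscore_columns(rows)
    dists = [math.sqrt(sum(v * v for v in row)) for row in z]
    n = len(dists)
    # Fraction of stories at least as close to the centroid as this one.
    return [sum(other <= d for other in dists) / n for d in dists]


# Toy encoded stories: four identical "typical" rows and one outlier.
rows = [[0, 0], [0, 0], [0, 0], [0, 0], [10, 10]]
percentiles = rarity_percentiles(rows)
print(percentiles)  # [0.8, 0.8, 0.8, 0.8, 1.0]
```

Averaging such percentiles separately over human and AI stories would give group-level figures comparable in form to the 0.71 vs. 0.49 reported above.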

Implications

Practical

  • Detection Robustness: Discourse-level narrative features are demonstrably more resilient to paraphrasing, stylistic mimicry, and post-hoc edits than conventional lexical or syntactic signals.
  • Fingerprinting and Auditing: Narrative feature taxonomies enable model-specific fingerprinting, supporting detailed audits, source attribution, and forensic analysis as the number of deployed LLMs grows.
  • Dataset and Tooling Release: The public release of prompts, AI stories, and feature extraction code supports reproducibility, benchmarking, and extension to new generative models.

Theoretical

  • Narrative Homogenization: Findings reinforce accumulating evidence that LLMs, despite stylistic tuning or data augmentation, converge toward a low-diversity, overdetermined narrative regime [jiang2025artificial, xu2025echoes].
  • Creativity Assessment: The quantification of narrative rarity operationalizes a content-based measure of originality, facilitating empirical studies of creativity, legal discussions on copyright, and algorithmic approaches for improving LLM generativity.
  • Limits of Stylistic Imitation: As LLMs approach perfect stylistic mimicry, the inability to simulate diverse, structurally novel human storytelling decisions becomes the critical frontier for detection and authorship assessment.

Future Directions

  • Model Development: Incorporating explicit narrative planning or strongly supervised narrative structures may bridge some of the diversity and complexity gaps, although the tendency toward convergence observed across SOTA LLMs suggests deep-seated architectural/inference constraints.
  • Advancing Narrative Evaluation: The deployment of similar pipelines for long-form improvisational generation, hybrid human-AI authorship, and multilingual narrative domains is warranted.
  • Benchmarking and Policy: Ongoing work to formalize narrative-space rarity as a criterion for creativity and originality will impact regulatory guidance and copyright jurisprudence.

Conclusion

StoryScope demonstrates that discourse-level narrative features provide a highly discriminative and robust axis for separating human- from LLM-generated fiction, outperforming surface-level stylistics as models advance and editing pipelines mature. The evidence of AI narrative convergence, along with persistent gaps in structural diversity and originality, highlights enduring limitations in current LLM generative capacities. The StoryScope pipeline and associated resources represent a significant technical foundation for systematic narrative analysis, AI authorship detection, and the empirical study of literary creativity in the age of generative models (2604.03136).

Explain it Like I'm 14

StoryScope: A simple guide to what this paper is about

What is this paper about?

This paper looks at how to tell whether a story was written by a human or by an AI, not by looking at the words and style on the surface, but by looking at the deeper “bones” of the story—things like who makes the choices, how the plot is structured, and how time jumps around. The authors build a tool, called StoryScope, to measure these deeper choices and show that AI and humans tend to build stories differently.

What questions did the researchers ask?

  • Can we spot AI-written stories by looking at story structure (the plot, characters, and timeline), even if we ignore writing style and word choice?
  • Do different AI models leave their own “fingerprints” in how they build stories?
  • Are human stories more original or varied in their narrative choices than AI stories?

How did they study this?

Big idea in everyday terms

Imagine every story as a house. Style is the paint and decorations. Narrative is the blueprint: where the rooms go, how you move through them, and what happens in each space. This paper asks: if we ignore the paint, can we tell who built the house just by studying the blueprint?

What they built and used

  • A huge collection: Over 10,000 writing prompts where each prompt has six versions of a story—one written by a human and five by different AI models—making over 60,000 stories in total (each about the length of a short story).
  • StoryScope: a pipeline (step‑by‑step process) that answers questions about each story's structure, such as:
    • Who drives the ending—the hero’s choices or outside fate?
    • Are themes explained clearly, or left for the reader to infer?
    • Is the timeline straight or full of flashbacks?
  • Simple, explainable models: They train a straightforward classifier (like a deck of “if–then” rules) on these features. Because the features are clear and human-readable, you can see which choices matter most.

What did they find, and why does it matter?

Main results in plain language

  • Narrative alone is powerful: Even when they throw away style clues, StoryScope can tell human vs. AI stories correctly about 93% of the time. That’s very close to models that also use style.
  • Fingerprints of different AIs: The system could often tell which AI wrote a story (6-way guessing was about 68% correct using only narrative structure). Each AI model has habits:
    • Claude: “Flatter” build‑ups (less escalation) and calm endings.
    • GPT: Uses dream sequences more often and leans into social dynamics.
    • Gemini: Describes characters from the outside more; tidy endings.
  • Shared AI zone vs. human variety: When you map stories by their narrative features, AI stories cluster together in one area, while human stories are spread out more. In other words, AI tends to make similar kinds of narrative choices; humans are more diverse.
  • Human stories feel “rarer”: Measured as how unusual a story’s combination of features is, human stories are more often in the “rare” zone. This lines up with a common idea of originality—humans try more unusual narrative paths.
  • Edits to style don’t fool it: Even when AI stories were rewritten to clean up their prose (removing clichés and “purple” language), the narrative-based detector still worked almost as well. That means it’s the deeper structure—not just the wording—that gives AI away.

What are the biggest human–AI differences?

  • AI explains themes more: AI stories often spell out the moral or lesson clearly, while human stories more often let readers figure it out.
  • AI plots are tidier and straighter: Fewer subplots, more single‑track stories, and endings driven by the main character’s choices. Human stories use more flashbacks, time jumps, and messy or ambiguous endings.
  • Different ways of showing emotion: AI leans hard on bodily sensations and sensory detail (e.g., “her chest tightened”), while humans more often mix in direct emotion words or different techniques.
  • Humans connect more to the outside world: More specific references to books, authors, or real things; more playful “breaking the fourth wall.”

Why is this important?

What this could change

  • Better, fairer detection: As surface style becomes easy for AI to imitate or edit, looking at story structure offers a more durable way to tell AI and human writing apart.
  • Copyright and originality: Laws and publishers care about whether work is genuinely original and human-made. Measuring “rarity” in narrative choices could be a useful signal for originality.
  • Understanding AI storytelling: Knowing where AI defaults to neat, linear plots can help developers and writers push models toward more varied, human‑like storytelling—or encourage new forms that are creative in their own way.
  • Tools for writers and editors: StoryScope-like analyses could help writers see their own narrative habits and experiment with structure.

In short

StoryScope shows that you can often tell who wrote a story by reading its blueprint, not its paint. AI stories tend to be clearer about their morals, tidier in their plots, and more similar to each other. Human stories are messier, more varied, and more likely to try unusual narrative moves. That difference remains even if you rewrite the AI’s style—because the structure underneath hasn’t changed.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, focused list of what remains uncertain, missing, or underexplored in the paper, phrased to guide actionable follow-up research.

  • Data representativeness: Generalization beyond English, Books3-sourced short stories, and contemporary Western literary norms is untested; evaluate on multilingual, cross-cultural, and older/younger (e.g., 19th-century, YA) corpora and on non-Western narrative traditions.
  • Genre coverage: Results are reported for prose fiction (~5k words); assess transfer to other forms (microfiction, novels, fanfiction, poetry, screenplays, narrative nonfiction).
  • Length sensitivity: Narrative-only detection is reported as “unchanged with length” on a matched subset, but robustness at extreme lengths (very short drabbles; full-length novels) and excerpt-only detection remains unquantified.
  • Prompt reverse-engineering validity: The LLM-inferred prompts may not faithfully capture original story premises; validate against author-provided premises or human-curated prompts and quantify how prompt inference errors bias observed human–AI differences.
  • Model training contamination: The evaluated LLMs may have seen Books3 or similar texts; rigorously test on human stories outside likely model training data (e.g., post-training publication dates, licensed corpora) to isolate memorization/leakage effects.
  • Genre/period confounds: Human stories drawn from anthologies and editorial pipelines may differ systematically from model outputs (e.g., narrative norms, editing); control for genre, era, and editorial influence to disentangle true narrative-structure gaps.
  • Mixed-authorship detection: The pipeline does not evaluate partially AI-edited human stories or human-revised AI drafts; measure sensitivity to varying proportions and distributions of edits.
  • Adversarial robustness at the narrative level: Only surface-level editing (LAMP) is tested; evaluate robustness against targeted structural rewrites designed to manipulate the core narrative features (e.g., deliberately adding flashbacks/subplots to evade detection).
  • Cross-model editing robustness: LAMP edits were applied to 278 Gemini stories with Gemini as the rewriter; expand to other models, human editors, and larger samples to assess generality.
  • Future-model generalization: Attribution/detection is trained and tested on the same five model families; evaluate zero-shot generalization to unseen models and newer versions to test durability claims.
  • Sampling hyperparameters: Generation settings (temperature, top‑p, length penalties) are not fully reported; quantify how sampling choices affect narrative-feature distributions and detection performance.
  • Template extraction fidelity: The accuracy of GPT-5.1’s structured templates is not validated against expert gold annotations; perform human evaluation of template correctness and its downstream impact on features/classification.
  • Feature-assignment bias: Using a single model (Gemini 3 Flash) to label features risks annotator-model bias; compare against multi-LLM ensembles and human annotations to assess bias and calibration.
  • Inter-annotator reliability beyond spot checks: Agreement is reported on a 240-feature subset; extend reliability studies across all feature types/dimensions and conduct detailed error analyses.
  • Feature discovery stability: The 304-feature taxonomy is induced from 100 prompts (600 stories); test whether discovered features and their importances are stable across different discovery pools, seeds, and clustering thresholds.
  • Narrative–style disentanglement: The boundary between “narrative” and “style” is audited by an LLM and may leak stylistic cues; design controlled perturbations to quantify residual style contamination in the narrative-only set.
  • Causal validation of core features: Beyond correlational SHAP analyses, perform targeted narrative interventions (e.g., inject/remove time jumps or subplots) to causally test whether manipulating a feature shifts detection outcomes as predicted.
  • Rarity as an originality proxy: The paper equates higher rarity in feature space with “originality” but lacks validation against human judgments or legal criteria; collect expert ratings and test correlations, precision-recall, and cross-genre stability.
  • Distance metric sensitivity: Rarity uses Euclidean distance on z-scored, mixed-type encodings; evaluate alternative metrics (e.g., Gower, learned Mahalanobis, probabilistic density) and feature weighting schemes for robustness.
  • Taxonomy coverage: Two NarraBench dimensions (Paratext, Motivation) are excluded; assess whether including them improves detection/attribution and whether key narrative phenomena are currently omitted.
  • Cross-cultural narrative theory reliance: NarraBench is rooted largely in Western literary theory; adapt or extend the taxonomy for non-Western narrative structures and evaluate shifts in core/fingerprint features.
  • Attribution performance ceiling: Narrative-only 6-way attribution (68.4% macro-F1) lags text-based baselines; investigate methods to boost attribution without stylistic cues (e.g., structured causal graphs, social-network dynamics, event semantics).
  • Per-model fingerprint drift: Track whether identified fingerprints persist across version updates and training regimes; quantify drift and its impact on attribution over time.
  • Hybrid detectors: Explore principled ways to combine narrative features with other robust signals (e.g., watermarking, fingerprinting) while preserving interpretability and minimizing brittleness.
  • Real-world deployment and error costs: Calibrate thresholds for acceptable false positive/negative rates across genres and demographics; assess fairness and potential disparate impact (e.g., on marginalized narrative traditions).
  • Reproducibility constraints: Human texts are not released; provide stronger reproducibility via detailed sampling params, evaluation scripts over public human corpora, or synthetic-but-human-like benchmarks with verified licenses.
  • Policy and legal relevance: Translate narrative-feature evidence into actionable standards for publishers/courts (e.g., confidence intervals, evidentiary thresholds) and study how explanations map to legal tests of “human creative control.”
  • Multilingual extension: Specify how templating, feature discovery, and assignment adapt to other languages (morphology, discourse markers, temporal expressions) and validate cross-lingual consistency.
  • Downstream utility beyond detection: Investigate whether narrative features can guide generation to increase diversity/novelty or support editorial tooling for de-slopping at the narrative level.
  • Open-set and unknown-source scenarios: Evaluate detectors when the author/source is outside the training set (open-set recognition), including detection of machine/human from unseen distributions.

Practical Applications

Immediate Applications

These applications can be deployed now using the paper’s released code, the described pipeline (StoryScope), and existing LLMs for feature extraction and classification.

  • Narrative-authorship screening for publishers and marketplaces
    • Sector: Publishing, platforms (e.g., Amazon KDP), media
    • What: Triage and flag long-form submissions that exhibit AI-like narrative structures (e.g., over-explicit themes, tidy single-track plots, low temporal complexity), with per-feature explanations.
    • Tools/workflows:
    • An API that runs template extraction → narrative feature assignment → XGBoost+SHAP report.
    • Dashboard in editorial submission portals highlighting “core” feature deviations and a rarity score.
    • Assumptions/dependencies:
    • Best performance on long-form (~5k words); shorter texts may reduce accuracy.
    • English focus; domain/genre distribution similar to training data.
    • Should be used as decision support due to false positive/negative risks.
    • Reliant on LLMs (e.g., GPT-5.1/Gemini 3 Flash) for feature extraction quality.
  • Policy and legal evidence bundles for copyright review
    • Sector: Policy, legal, IP offices, compliance
    • What: Provide interpretable, discourse-level evidence (core features + rarity statistic) to support claims of human originality or AI involvement, aligned with USCO “human control” guidance.
    • Tools/workflows:
    • Structured reports attaching per-feature values, SHAP attributions, and rarity percentiles.
    • Chain-of-custody logs for court-admissibility.
    • Assumptions/dependencies:
    • Acceptance by courts/regulators is not guaranteed; narrative rarity is a proxy, not a legal standard.
    • Requires careful calibration and expert review to avoid over-reliance.
  • Academic benchmarking for computational narratology and AI detection
    • Sector: Academia
    • What: Use released prompts, AI stories, and feature taxonomy to benchmark narrative understanding, attribution without style cues, and model convergence in “narrative space.”
    • Tools/workflows:
    • Reproducible experiments on detection and 6-way attribution using only narrative features.
    • Comparative studies of narrative diversity/rarity.
    • Assumptions/dependencies:
    • Human stories are not released (copyright constraints); researchers may need to procure comparable human corpora ethically.
  • Creative writing feedback tools for students and authors
    • Sector: Education, creator tooling
    • What: Formative feedback showing agency distribution, temporal complexity, theme explicitness, subplot integration, and moral ambiguity—encouraging richer narrative construction.
    • Tools/workflows:
    • LMS plug-ins or writing IDEs (e.g., Google Docs, Scrivener) with “narrative feature meter” and suggestions.
    • Assumptions/dependencies:
    • Feedback quality depends on accurate feature extraction.
    • Must avoid nudging all writers toward a single “anti-AI” style; configurable pedagogical goals recommended.
  • Platform integrity and anti-spam triage
    • Sector: Content platforms, marketplaces, moderation
    • What: Batch screening to down-rank mass-uploaded AI fiction and “AI slop” while prioritizing human-diverse narratives.
    • Tools/workflows:
    • Queue-based scanning with thresholds on core features and rarity; human review for borderline cases.
    • Assumptions/dependencies:
    • Adversarial attempts to restructure narratives can reduce effectiveness; periodic model updates needed.
  • Model-provider QA and regression monitoring
    • Sector: AI/Software
    • What: Track narrative convergence, fingerprints, and diversity across model versions; detect drift (e.g., increasing thematic explicitness).
    • Tools/workflows:
    • Internal dashboards with LDA visualizations, centroid distances, and per-feature histograms; can be part of model eval suites.
    • Assumptions/dependencies:
    • Needs consistent sampling protocols across releases; sensitive to prompt distributions.
  • Editorial diagnostics for development editors
    • Sector: Publishing, screenwriting, game writing
    • What: Diagnose “over-tidy” plots, lack of subplots, low ambiguity; guide structural revisions independent of prose style edits.
    • Tools/workflows:
    • Development-edit reports focusing on plot structure, agency, revelation depth, and time discontinuity.
    • Assumptions/dependencies:
    • Most valuable on drafts >2–3k words; feature reliability drops with fragments.
  • Forensic attribution among common model families
    • Sector: Trust & safety, enterprise compliance
    • What: Use fingerprint features to roughly attribute long-form AI content to model families (e.g., Claude vs. GPT), aiding incident response and disclosure audits.
    • Tools/workflows:
    • 6-way classifier outputs with confidence intervals and SHAP explanations.
    • Assumptions/dependencies:
    • Attribution accuracy (~68% macro-F1) is moderate and model-cluster dependent; use as a lead, not conclusive proof.
  • Content discovery and recommendation diversification
    • Sector: Platforms, libraries
    • What: Use narrative rarity and dispersion to surface more diverse human-like narratives and reduce homogeneity.
    • Tools/workflows:
    • Re-ranking modules that include rarity percentiles and feature-space distance to nearest neighbors.
    • Assumptions/dependencies:
    • Requires careful fairness analysis to avoid penalizing certain genres or communities.
  • CMS and newsroom checks for long-form ghostwriting
    • Sector: Media, communications
    • What: Flag narrative-structural signals of AI ghostwriting in essays, op-eds, and long features.
    • Tools/workflows:
    • Editorial integrations producing confidence scores and structure-level rationales.
    • Assumptions/dependencies:
    • Less reliable on short articles; intended for essays and longform pieces.

Long-Term Applications

These require further research, scaling, or institutional adoption before wide deployment.

  • Standards for originality and AI disclosure in policy
    • Sector: Policy, standards bodies (e.g., WIPO, ISO), legal
    • What: Codify discourse-level indicators and rarity-based metrics as part of best-practice guidance or disclosure frameworks for creative works.
    • Tools/workflows:
    • Standardized reporting templates; third-party auditing ecosystems.
    • Assumptions/dependencies:
    • Broad stakeholder agreement and validation across genres, languages, and cultures; governance for error handling.
  • Narrative-level watermarking and provenance signals
    • Sector: AI/Software, trust & safety
    • What: Embed or detect stable, hard-to-edit narrative patterns as complementary signals to lexical watermarks, enabling robust provenance in long-form content.
    • Tools/workflows:
    • Training-time regularizers and post-hoc detectors focused on structural choices (plot arcs, revelation timing).
    • Assumptions/dependencies:
    • Risk of arms race; models may learn to evade or mimic signals; requires cross-vendor collaboration.
  • Controllable narrative generation to reduce AI convergence
    • Sector: AI/Software, creative tools
    • What: Train or fine-tune LLMs with objectives that diversify narrative space (more subplots, temporal nonlinearity, moral ambiguity) or let users dial features explicitly.
    • Tools/workflows:
    • RLHF/RLAIF with narrative-feature rewards; constrained decoding with feature-targeted control.
    • Assumptions/dependencies:
    • Must avoid simply “gaming” the metric; human preference and readability need joint optimization.
  • Cross-genre and cross-lingual generalization
    • Sector: Academia, publishing, global platforms
    • What: Extend pipeline to poetry, screenplays, interactive fiction, non-English works, and short-form content with adjusted feature sets.
    • Tools/workflows:
    • Genre- and language-specific templates; retrained classifiers; calibration datasets curated ethically.
    • Assumptions/dependencies:
    • New annotation schemas and validations needed; performance may vary widely across genres/languages.
  • Automated structural editing assistants
    • Sector: Creator tools, publishing
    • What: Beyond style edits, agents that propose coherent structural rewrites (e.g., alternate chronology, subplot weaving, revelation repositioning) to improve originality.
    • Tools/workflows:
    • Agentic pipelines that simulate outline-level alternatives aligned to feature targets; human-in-the-loop acceptance.
    • Assumptions/dependencies:
    • Hard problem—maintaining coherence and character arcs through structural changes; significant QA needed.
  • Cultural analytics and literary scholarship at scale
    • Sector: Academia, libraries
    • What: Longitudinal mapping of narrative trends (e.g., shifts toward/away from moral explicitness), impacts of GenAI on literary diversity over time.
    • Tools/workflows:
    • Feature extraction on large corpora; time-series analyses; public dashboards for scholars.
    • Assumptions/dependencies:
    • Requires access to broad, legally licensed corpora; careful handling of biases and canon representation.
  • Procurement and grant-funding compliance
    • Sector: Government, foundations, competitions
    • What: Screening long-form submissions for undisclosed AI authorship where human-only clauses apply; auditing compliance post-award.
    • Tools/workflows:
    • Batch evaluation pipelines with manual review of flagged cases and appeals processes.
    • Assumptions/dependencies:
    • Policy acceptance; due process for creators; domain adaptation to proposals/grant narratives.
  • Narrative QA and design in games and interactive media
    • Sector: Gaming, XR, robotics storytelling
    • What: Validate quest/plotlines for excessive linearity or lack of agency; auto-suggest branching structures that increase player engagement.
    • Tools/workflows:
    • Feature-driven validators integrated into narrative design tools; content generators with feature targets.
    • Assumptions/dependencies:
    • Requires genre-specific calibration and user-testing links to engagement metrics.
  • Enterprise knowledge management and documentation quality
    • Sector: Software/enterprise content
    • What: Diagnose overly linear, didactic documentation and suggest restructuring for user comprehension (e.g., controlled reveal of concepts).
    • Tools/workflows:
    • Feature-informed authoring guidelines and linting for docs and training materials.
    • Assumptions/dependencies:
    • Transfer from fiction to instructional prose is non-trivial; would require adapted feature sets.
  • Forensic analysis of influence operations
    • Sector: Security, public policy
    • What: Attribute long-form propaganda narratives to AI pipelines by identifying convergent structural fingerprints across campaigns.
    • Tools/workflows:
    • Cross-corpus clustering, centroid analysis, and fingerprint tracking; integration with OSINT.
    • Assumptions/dependencies:
    • Adversarial adaptation likely; needs multi-signal fusion (metadata, network analysis).
  • Library cataloging and discovery via narrative metadata
    • Sector: Libraries, archives, discovery platforms
    • What: Enrich catalog records with narrative-structure tags (agency patterns, temporality, revelation depth) for advanced search and reader advisory.
    • Tools/workflows:
    • Batch feature extraction across catalogs; UI filters for readers (e.g., “nonlinear timelines,” “morally ambiguous protagonists”).
    • Assumptions/dependencies:
    • Licensing and compute for large-scale processing; validation for misclassification impact on discovery.

Cross-cutting assumptions and dependencies

  • Accuracy depends on consistent, high-quality template extraction and feature assignment by capable LLMs; model updates may shift outputs.
  • Current evidence is strongest for English long-form fiction of about 5,000 words; generalization to short forms, other genres, and languages requires additional validation.
  • Adversaries can reduce detection by performing genuine structural rewrites; however, such changes are costlier than stylistic edits.
  • Ethical and legal considerations (e.g., prior use of Books3 for analysis only) necessitate careful data sourcing for production deployments.
  • Use as decision support with human oversight is recommended to mitigate harms from false positives/negatives.
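The decision-support recommendation above can be made concrete with a small routing step. The sketch below is illustrative only: the function name `triage`, the `review_at` threshold, and the submission ids are hypothetical and do not come from the paper. It assumes a detector has already produced a probability of AI authorship per submission, and it never rejects anything automatically; a high score only queues the item for a human reviewer.

```python
from typing import Dict, List, Tuple


def triage(p_ai: Dict[str, float], review_at: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split submissions into (needs_human_review, no_action).

    p_ai maps a submission id to a detector's estimated probability of AI
    authorship. The threshold is a hypothetical policy choice, not a value
    reported in the paper.
    """
    # Highest-scoring items first, so reviewers see the strongest flags early.
    review = sorted(
        (sid for sid, p in p_ai.items() if p >= review_at),
        key=lambda sid: -p_ai[sid],
    )
    no_action = sorted(sid for sid, p in p_ai.items() if p < review_at)
    return review, no_action


# Made-up scores for three submissions.
scores = {"sub-001": 0.97, "sub-002": 0.12, "sub-003": 0.64}
review, no_action = triage(scores, review_at=0.6)
```

In practice the threshold would be calibrated on held-out data for the target genre, and flagged authors would have access to an appeals process.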

Glossary

  • Ablation: The removal or isolation of components to assess their impact on a system's performance. "Style ablations"
  • Agent (narrative dimension): A narratology category focusing on characters, their roles, and motivations within a story. "We adopt ten of its twelve aspects: Agent, Social Network, Event, Plot, Structure, Setting, Time, Revelation, Perspective, and Style."
  • AUPRC: Area Under the Precision–Recall Curve; a metric summarizing performance across recall levels, useful for imbalanced classification. "macro-F1 and AUPRC as the primary binary metrics."
  • Avalanche endings: A literary term for rapidly escalating, dramatic conclusions. "quiet endings over 'avalanche' endings."
  • Binoculars: A specific zero-shot AI-text detector referenced as a baseline. "Binoculars \citep{hans2024spotting}, a zero-shot AI-text detector."
  • Bootstrap: A resampling technique used to estimate the stability and variance of metrics or model explanations. "bootstrap SHAP analysis (B = 50 iterations with prompt-level resampling)"
  • Break the fourth wall: A narrative technique where the story directly acknowledges the audience. "Humans break the fourth wall far more often"
  • Causal chain: A sequence of events linked by cause-and-effect relationships. "causal chains and key events for Event"
  • Centroid: The center point of a cluster in feature space, used to summarize distributions. "retaining the feature nearest each cluster centroid"
  • Cohen's d: An effect size measure indicating the magnitude of differences between two groups. "Cohen's d = 0.83"
  • Cohen's kappa: An inter-rater agreement metric that accounts for chance agreement. "Cohen's κ = 0.84"
  • Confidence interval (CI): A range of values likely to contain a true metric with a specified confidence level. "95% prompt-bootstrap CI 2.09–3.54"
  • Cosine similarity: A measure of similarity between vectors based on the cosine of the angle between them. "clustered at cosine similarity threshold 0.85"
  • Denouement: The final resolution or winding down of a narrative after the climax. "extended denouements"
  • Discourse-level: Pertaining to narrative structure beyond sentence-level style, such as plot and temporal organization. "discourse-level narrative features"
  • Embedding-based clustering: Grouping items by their vector representations learned from data. "We then deduplicate via embedding-based clustering, retaining the feature nearest each cluster centroid"
  • Epilogue: A concluding section of a story that comments on or extends the narrative's resolution. "It favors epilogues and avoids dream sequences"
  • F2LLM-4B: A specific embedding or encoding model used to represent features. "Each feature is encoded with F2LLM-4B"
  • Few-shot: A learning or prompting paradigm using a small number of examples to guide a model. "using 25 few-shot examples from professional writers."
  • Fingerprint features: Source-specific signals enabling attribution to a particular model or author. "Per-model fingerprint features enable six-way attribution"
  • Flashback: A narrative technique that depicts events from an earlier time than the current storyline. "e.g., flashbacks, nonlinear structure"
  • Gradient boosting: An ensemble learning technique that builds models sequentially to minimize errors. "simple classifiers (e.g., gradient boosting)"
  • Krippendorff’s alpha: A reliability coefficient for agreement among annotators across various data types. "Krippendorff's α = 0.88"
  • LAMP: A span-level rewriting framework used to edit AI-text artifacts. "We test this using \citet{chakrabarty2024salvaged}'s span-level rewriting framework (LAMP)"
  • Linear discriminant components: Axes derived from Linear Discriminant Analysis to separate classes in feature space. "Projection of narrative feature vectors onto the first two linear discriminant components."
  • Macro-F1: The unweighted average F1 score across classes, treating each class equally. "macro-F1 of 93.2% for the binary human vs. AI detection task"
  • ModernBERT: A transformer-based baseline model used on raw text for classification. "ModernBERT"
  • Multi-hot encoding: Encoding multi-select categorical variables as binary vectors with multiple ones. "multi-select features are multi-hot encoded"
  • NarraBench: A taxonomy and framework for narrative benchmarking and dimensions. "grounded in the NarraBench taxonomy"
  • Nearest neighbors: The closest items in feature space used for measuring local density or rarity. "mean Euclidean distance to a story's 25 nearest neighbors."
  • Nonlinear narrative: Story structure that departs from chronological order (e.g., jumps in time). "nonlinear structure"
  • One-hot encoding: Encoding categorical variables as binary vectors with a single one at the category index. "Nominal features are one-hot encoded"
  • Ordinal features: Variables with ordered categories treated numerically in modeling. "ordinal / scale features retain numeric encoding."
  • Paratext: Materials surrounding the main text (e.g., titles, forewords) affecting interpretation. "We exclude Paratext and Motivation"
  • Pastiche: A work that imitates the style of others, often as homage. "explicit named, retelling, pastiche, myth/religion, self referential"
  • Prompt-level grouping: Ensuring train/test splits prevent leakage by grouping data generated from the same prompt. "All evaluations use prompt-level grouping"
  • Purple prose: Overly ornate or flowery writing style considered an artifact in AI text. "purple prose"
  • Rarity percentile: A normalized measure of how uncommon a story’s feature configuration is within a corpus. "mean rarity percentile 0.71 vs. 0.49 for AI"
  • SHAP: A method for explaining model predictions via Shapley values. "whose predictions can be decomposed via SHAP into per-feature contributions"
  • Span-level rewriting: Editing text at the level of contiguous segments to remove artifacts. "After span-level artifact removal"
  • Stylometric features: Quantitative textual characteristics used for authorship or style analysis. "Stylometric+XGB, XGBoost on 144 hand-crafted stylometric features (character n-grams, POS distributions, readability scores)"
  • TF-IDF: Term Frequency–Inverse Document Frequency; a weighting scheme for text features. "TF-IDF+XGB, XGBoost on 5,000 TF-IDF features"
  • XGBoost: A high-performance gradient boosting library for decision trees. "We train XGBoost classifiers"
  • Zero-shot: Performing a task without task-specific training examples. "a zero-shot AI-text detector"
  • z-scored: Standardized features transformed to zero mean and unit variance. "Working in the z-scored encoded feature space with Euclidean distance"
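
Several glossary entries (z-scored, Nearest neighbors, Rarity percentile) fit together into one computation. The following is a minimal sketch of how they might combine, not the paper's implementation. It assumes rarity is the mean Euclidean distance to a story's 25 nearest neighbors in z-scored feature space, as the quoted passages suggest. Converting that distance to a percentile by counting stories with a strictly smaller distance is an assumption of this sketch, and the data is synthetic.

```python
import numpy as np


def zscore(X):
    """Standardize each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros after centering
    return (X - X.mean(axis=0)) / std


def knn_mean_distance(Z, k=25):
    """Mean Euclidean distance from each row to its k nearest other rows.

    Builds a dense n x n distance matrix, which is fine for a toy example;
    a real corpus would likely need a neighbor index instead.
    """
    diffs = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a story is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)


def rarity_percentile(scores):
    """Fraction of stories whose score is strictly smaller (assumed definition)."""
    scores = np.asarray(scores, dtype=float)
    return (scores[:, None] > scores[None, :]).mean(axis=1)


# Synthetic stand-in for encoded narrative features: 200 stories, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[0] += 10.0  # make story 0 an obvious outlier

Z = zscore(X)
d = knn_mean_distance(Z, k=25)
p = rarity_percentile(d)
```

Here story 0 sits far from every other story, so it receives the largest mean neighbor distance and the highest rarity percentile.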

Open Problems

We found no open problems mentioned in this paper.

HackerNews