CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

Published 26 Apr 2026 in cs.MM | (2604.23579v1)

Abstract: Automated movie creation requires coordinating multiple characters, modalities, and narrative elements across extended sequences, a challenge that existing end-to-end approaches struggle to address effectively. We present CineAGI, a hierarchical movie generation framework that decomposes this complex task through specialized multi-agent orchestration. Our framework employs three key innovations: (1) a multi-agent narrative synthesis module where specialized LLM agents collaboratively generate comprehensive cinematic blueprints with character profiles, scene descriptions, and cross-modal specifications; (2) a decoupled character-centric pipeline that maintains identity consistency through instance-level tracking and integration while enabling flexible multi-character composition; and (3) a hierarchical audio-visual synchronization mechanism ensuring frame-level alignment of dialogue, expressions, and music. Extensive experiments demonstrate that CineAGI achieves 40% improvement in overall consistency, 4.4% gain in subject consistency, 5.4% enhancement in aesthetic quality, and 28.7% higher character consistency compared to baselines. Our work establishes a principled foundation for automated multi-scene video generation that preserves narrative coherence and character authenticity.

Authors (7)

Summary

The paper proposes a hierarchical multi-agent framework that decomposes movie synthesis into specialized modules for narrative planning and character integration.
It demonstrates significant improvements over baseline methods by enforcing cross-modal alignment and achieving higher metrics in character consistency and audiovisual coherence.
Experimental evaluations reveal that decoupled character pipelines and agent orchestration yield superior narrative fidelity and robust scalability for long-form content.

CineAGI: Hierarchical Multi-Agent Orchestration for Character-Consistent Automated Movie Generation

Introduction

CineAGI addresses the core problem of end-to-end automated movie generation, focusing on three persistent challenges in large-scale text-to-video (T2V) systems: character identity, cross-scene narrative coherence, and multi-modal (audio-visual) synchronization. Contemporary diffusion-based T2V models and LLM-based pipelines struggle to maintain continuity in character attributes and to align visual, narrative, and audio modalities over temporally extended, multi-scene productions. CineAGI’s hierarchical design decomposes movie synthesis into agent-specialized modules, explicitly coordinating narrative planning, character creation, and scene synthesis with decoupled and synchronized pipelines. This enables fine-grained control over characters and improves narrative fidelity and audiovisual consistency; the paper reports that CineAGI outperforms the compared baselines on both quantitative and human-evaluated benchmarks.

Figure 1: Three critical limitations in prior T2V: character identity inconsistency, scene-level style drifts, and cross-modal audiovisual misalignment; CineAGI’s architecture directly addresses these via hierarchical, agent-driven pipelines.

Hierarchical Multi-Agent Narrative Decomposition

CineAGI introduces a multi-agent framework for narrative synthesis, in contrast to existing monolithic or loosely structured systems. The pipeline uses specialized LLM agents that mirror production roles: Character Designer, Script Writer, Storyteller, Composer, and Quality Inspector. Each agent operates both as an independently parameterized decision module and as part of a strictly coordinated information flow, and together they produce a fully structured cinematic blueprint. This blueprint includes: (1) hierarchical character profiles with multi-scale attribute specification (appearance, persona, behavioral cues); (2) shooting scripts with precise scene decompositions and technical directives; (3) narrative flows and frame-precise dialogue with emotional tempo anchoring; and (4) inter-modality music directives conditioned on both scene and character context.

The Quality Inspector agent is critical, serving as a convergence and validation layer—enforcing cross-agent alignment, filtering information cascades, and serializing outputs for downstream modules. This structure enforces both top-down and bottom-up consistency, enabling robust translation from global narrative intent to scene-specialized production directives.
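
To make the validation role concrete, here is a minimal sketch of a blueprint structure and a consistency check in the spirit of the Quality Inspector. All class, field, and function names (`Blueprint`, `inspect_blueprint`, and so on) are hypothetical and are not taken from the paper; the actual agent is LLM-based, whereas this sketch only illustrates the kind of cross-agent alignment rules such a layer could enforce.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterProfile:          # hypothetical output of the Character Designer
    name: str
    appearance: str
    persona: str


@dataclass
class Scene:                     # hypothetical output of the Script Writer
    scene_id: int
    description: str
    characters: list[str]


@dataclass
class DialogueLine:              # hypothetical output of the Storyteller
    scene_id: int
    speaker: str
    text: str
    start_frame: int
    end_frame: int


@dataclass
class Blueprint:
    characters: list[CharacterProfile] = field(default_factory=list)
    scenes: list[Scene] = field(default_factory=list)
    dialogue: list[DialogueLine] = field(default_factory=list)
    music: dict[int, str] = field(default_factory=dict)  # scene_id -> directive


def inspect_blueprint(bp: Blueprint) -> list[str]:
    """Return cross-agent inconsistencies; an empty list means consistent."""
    issues = []
    known = {c.name for c in bp.characters}
    scenes = {s.scene_id: s for s in bp.scenes}
    for s in bp.scenes:
        for name in s.characters:
            if name not in known:
                issues.append(f"scene {s.scene_id}: unknown character {name!r}")
        if s.scene_id not in bp.music:
            issues.append(f"scene {s.scene_id}: no music directive")
    for d in bp.dialogue:
        scene = scenes.get(d.scene_id)
        if scene is None:
            issues.append(f"dialogue refers to missing scene {d.scene_id}")
        elif d.speaker not in scene.characters:
            issues.append(f"scene {d.scene_id}: speaker {d.speaker!r} not in scene")
        if d.end_frame <= d.start_frame:
            issues.append(f"scene {d.scene_id}: empty dialogue span for {d.speaker!r}")
    return issues
```

In a pipeline of this shape, a non-empty issue list would be sent back to the responsible agent for revision before anything is serialized for downstream modules.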

Figure 2: Overview of CineAGI: hierarchical LLM agent narrative synthesis, decoupled character-centric asset generation, and integrated cinematographic orchestration, with explicit scene-to-character-to-modality flow.

Character-Centric Visual and Audio Asset Generation

After narrative orchestration, CineAGI decouples character asset creation from scene synthesis so that each character's identity persists across scenes. The Character Generation Module has two components, one for image synthesis and one for voice synthesis:

Portrait Artist: Leverages RealVisXL 3.0, conditioned on multi-faceted character profiles, to output high-resolution, multi-view reference portraits. These serve as global anchors for cross-scene face identity.
Sound Generator: Employs ChatTTS for identity- and emotion-conditioned multi-character voice generation, parameterized to preserve consistent voice timbre and emotional expressivity across all scenes.

Assets from both components are precisely linked to narrative slots, supporting direct, disentangled integration with varied scene backgrounds and context-dependent visual/auditory requirements.
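
One way to picture the link between assets and narrative slots is a registry that creates each character's assets once and returns the same record for every scene. The sketch below is an assumption for illustration only: the names `CharacterAssets`, `build_registry`, and `assets_for_slot`, the file layout under `assets/`, and the idea of a fixed per-character voice seed are all hypothetical, and the code does not call RealVisXL or ChatTTS.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterAssets:
    name: str
    portrait_paths: tuple[str, ...]  # multi-view reference portraits
    voice_seed: int                  # fixed seed standing in for a voice identity


def stable_seed(name: str) -> int:
    """Derive a reproducible 32-bit seed from a character name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_registry(names, views=("front", "left", "right")):
    """Create one asset record per character, shared by every scene."""
    return {
        n: CharacterAssets(
            name=n,
            portrait_paths=tuple(f"assets/{n}/{v}.png" for v in views),
            voice_seed=stable_seed(n),
        )
        for n in names
    }


def assets_for_slot(registry, scene_id: int, character: str) -> CharacterAssets:
    """Resolve a narrative slot (scene, character) to the shared asset record."""
    # scene_id is deliberately unused: the same assets serve every scene,
    # which is what keeps face and voice identity constant across the movie.
    return registry[character]
```

Because the lookup ignores the scene, any scene-specific variation (background, emotion, dialogue) has to come from the scene side, which is the point of the decoupling.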

Cinematographic Synthesis and Decoupled Character Integration

Cinematographic Synthesis is built around a decoupled, three-stage character integration pipeline, enabling multi-character scene composition with strict spatial-temporal consistency:

Scene Creator: HunyuanVideo-13B generates context-rich background scenes from structured, narrative-derived cues, omitting direct character visual references to maximize controllability at subsequent integration stages.
Decoupled Character Integration (DCI):
- Segmentation: Grounded-SAM2 isolates and tracks character regions in the scene, offering robust per-character compositional control.
- Face Swapping: SimSwap aligns segmented character regions with reference portraits, maintaining precise facial and physical identity coherence.
- Talking Face: Wav2Lip synchronizes lip motion and facial dynamics to voice tracks with frame-level precision, utilizing frame-timed dialogue from the Storyteller agent.

Figure 3: Visualization of the DCI pipeline: scene segmentation, face-swapping with reference portraits, and talking-head synthesis, enabling identity-preserving, context-consistent character video integration with audio sync.
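
The three DCI stages run in a fixed order, each consuming the output of the previous one. The following sketch shows that control flow with placeholder stages. It is illustrative only: `Clip`, `segment`, `swap_faces`, `sync_lips`, and `run_dci` are hypothetical names, and the stage bodies are stand-ins that do not call Grounded-SAM2, SimSwap, or Wav2Lip.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    scene_id: int
    num_frames: int
    tracks: dict = field(default_factory=dict)  # character -> per-frame region labels
    log: list = field(default_factory=list)     # stages applied, in order


def segment(characters):
    """Stage 1 placeholder: isolate and track one region per character per frame."""
    def run(clip):
        clip.tracks = {c: [f"{c}@{i}" for i in range(clip.num_frames)] for c in characters}
        clip.log.append("segment")
        return clip
    return run


def swap_faces(portraits):
    """Stage 2 placeholder: every tracked character needs a reference portrait."""
    def run(clip):
        missing = [c for c in clip.tracks if c not in portraits]
        if missing:
            raise KeyError(f"no reference portrait for {missing}")
        clip.log.append("swap_faces")
        return clip
    return run


def sync_lips(dialogue):
    """Stage 3 placeholder: dialogue spans must fit the clip and name a tracked speaker."""
    def run(clip):
        for speaker, start, end in dialogue:
            if speaker not in clip.tracks or not (0 <= start < end <= clip.num_frames):
                raise ValueError(f"bad dialogue span for {speaker}")
        clip.log.append("sync_lips")
        return clip
    return run


def run_dci(clip, stages):
    """Apply the stages in order, passing the clip from one to the next."""
    for stage in stages:
        clip = stage(clip)
    return clip
```

A call such as `run_dci(Clip(0, 129), [segment(["Ava"]), swap_faces({"Ava": "ava.png"}), sync_lips([("Ava", 0, 48)])])` runs all three stages; because each stage is an independent callable, a single model could in principle be replaced without touching the others.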

Additionally, scene music is generated via MusicGen, dynamically adapted to each narrative segment and coordinated across scenes by the Composer agent. The Cinematographer module then overlays voice, music, and subtitle tracks, assembling the movie with precise narrative order.
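
The assembly step can be sketched as placing per-scene events on one global timeline. This is a simplified illustration under stated assumptions: every scene has the same length, dialogue timing is given in scene-local frames, and the names `TrackEvent` and `assemble_timeline` are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TrackEvent:
    kind: str         # "voice", "music", or "subtitle"
    scene_id: int
    start_frame: int  # absolute frame in the assembled movie
    end_frame: int
    payload: str


def assemble_timeline(scene_order, frames_per_scene, dialogue, music):
    """Place per-scene events on one global timeline in narrative order.

    dialogue: scene_id -> list of (speaker, text, start, end) in scene-local frames
    music:    scene_id -> music directive covering the whole scene
    """
    events = []
    for position, scene_id in enumerate(scene_order):
        offset = position * frames_per_scene  # where this scene starts in the movie
        events.append(
            TrackEvent("music", scene_id, offset, offset + frames_per_scene, music[scene_id])
        )
        for speaker, text, start, end in dialogue.get(scene_id, []):
            events.append(TrackEvent("voice", scene_id, offset + start, offset + end, speaker))
            events.append(TrackEvent("subtitle", scene_id, offset + start, offset + end, text))
    # Sort by start time so a renderer can consume events sequentially.
    return sorted(events, key=lambda e: (e.start_frame, e.kind))
```

Voice and subtitle events share the same frame span, so subtitles stay aligned with speech regardless of where the scene lands in the final order.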

Experimental Evaluation and Results

CineAGI is evaluated on 100 story prompts spanning multiple genres, under controlled generation conditions (24 FPS, 5.375 s scenes, 512×512 resolution), using the VBench metric suite. It is compared against CogVideoX, VideoCrafter2, and HunyuanVideo on Overall Consistency (OC), Subject Consistency (SC), Aesthetic Quality (AQ), and Motion Smoothness (MS).
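
The scene duration is not arbitrary: at 24 FPS, 5.375 s corresponds to a whole number of frames. The short sketch below works this out with exact rational arithmetic and adds a helper for mapping a timestamp to a frame index, as frame-timed dialogue would require. The helper name `seconds_to_frame` is an assumption for illustration.

```python
from fractions import Fraction

FPS = 24
SCENE_SECONDS = Fraction("5.375")  # exact decimal, avoids float rounding

# 24 frames/s * 5.375 s = 129 frames per scene
FRAMES_PER_SCENE = FPS * SCENE_SECONDS


def seconds_to_frame(t, fps=FPS):
    """Map a non-negative timestamp in seconds to the frame that contains it."""
    return int(Fraction(str(t)) * fps)
```

For example, a dialogue line starting at 1.5 s begins at frame 36, and the scene boundary at 5.375 s falls exactly on frame 129.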

Key quantitative results:

OC: 0.259 (+40% over HunyuanVideo)
SC: 0.949 (+4.4%)
AQ: 0.600 (+5.4%)
MS: 0.987 (+1.1%)
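
To read these percentages, it helps to recall how a relative gain relates a score to its baseline. The sketch below assumes each percentage is a relative improvement, (ours − baseline) / baseline, over the corresponding baseline score. That is an interpretation, not something the summary states explicitly, so the back-computed baselines are rough estimates rather than reported values.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


def implied_baseline(ours: float, gain_percent: float) -> float:
    """Invert relative_gain: the baseline that would yield the stated gain."""
    return ours / (1.0 + gain_percent / 100.0)


# (score, stated gain in percent) as listed above
REPORTED = {
    "OC": (0.259, 40.0),
    "SC": (0.949, 4.4),
    "AQ": (0.600, 5.4),
    "MS": (0.987, 1.1),
}

# Approximate baselines implied by the stated gains (assumption-dependent).
IMPLIED = {m: round(implied_baseline(s, g), 3) for m, (s, g) in REPORTED.items()}
```

Under this reading, the large OC percentage reflects a low absolute baseline on that metric, while SC and MS are already close to 1.0 for the baseline, which leaves little headroom for relative gains.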

Human evaluations corroborate these findings, with CineAGI scoring highest in Visual Quality, Narrative Coherence, Character Consistency (28.7% improvement), and Audio Coherence (only supported by CineAGI among evaluated systems).

Figure 4: Qualitative output comparisons with SOTA baselines; CineAGI achieves superior narrative coherence, character fidelity, visual/artistic consistency, and natural character movement.

Ablation studies validate the necessity of each architectural module. Removal of the Narrative Synthesis Module, DCI, or Quality Inspector causes statistically significant degradations in respective evaluation dimensions, directly linking observed performance gains to CineAGI’s hierarchical multi-agent construction.

Theoretical and Practical Implications

CineAGI’s two principal contributions—strict hierarchical multi-agent narrative planning and decoupled, identity-preserving character integration—articulate a new paradigm for AI-driven long-form video generation. Theoretically, CineAGI moves beyond monolithic or loosely modular T2V pipelines by enforcing multi-scale, cross-modal constraints through explicit agent roles and a cascading, validated information path. This structure supplements current SOTA video diffusion and LLM integration research (Polyak et al., 2024, Lin et al., 2023, Yang et al., 2024) by demonstrating that principled decomposition and agent specialization yield persistent gains in long-range temporal and subject coherence.

Practically, CineAGI supports robust extension to higher degrees of narrative complexity, multi-character interplay, and arbitrarily long content generation—a critical step for applications in entertainment, simulation, and narrative content automation. The introduced decomposed integration and validation mechanisms provide a foundation for scalable, plug-and-play upgrades and human-in-the-loop creative workflows.

Conclusion

CineAGI establishes a rigorous standard for automated movie creation, demonstrating that hierarchical multi-agent orchestration and decoupled, character-centric pipelines are essential for maintaining narrative, visual, and audio consistency in long-form, multi-modal synthesis. With strong empirical performance and architectural extensibility, CineAGI lays the groundwork for persistent, scalable AI-based narrative video systems.
