- The paper introduces Camera Artist, a multi-agent framework that automates cinematic video generation by integrating narrative planning and filmic language injection.
- It employs Recursive Shot Generation and Cinematic Language Injection to maintain shot continuity and enrich visuals with expressive cinematic details.
- Experimental results demonstrate superior narrative coherence and visual quality compared to competing systems, validated by objective metrics and user studies.
Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation
Introduction
Camera Artist is a multi-agent collaboration framework that automates the end-to-end process of narrative video generation with explicit cinematic language. Conventional text-to-video (T2V) and image-to-video (I2V) solutions largely optimize for local visual fidelity and lack global narrative and stylistic reasoning; Camera Artist instead models the hierarchical, agentic workflow of real-world filmmaking. The pipeline targets both narrative consistency and cinematic expressiveness, with explicit mechanisms for shot-to-shot coherence and for injecting technical film language.
Framework Overview
The system operates via three specialized agents (Director, Cinematography Shot, and Video Generation) that execute a two-stage process: footage construction and shot generation. The Director Agent decomposes the story outline into hierarchical narrative units, generates assets, and breaks the story down into scenes. The Cinematography Shot Agent recursively generates context-aware shot descriptions and injects detailed cinematic language into them. The Video Generation Agent then renders visually coherent videos shot by shot and merges them into a long-form output.
Figure 1: The Camera Artist agent pipeline, with Director Agent for hierarchical planning, Cinematography Shot Agent for recursive shot generation and cinematic language injection, and Video Generation Agent for visual realization.
This structured workflow enables Camera Artist to bridge the gap between fragmentary clip-level generation and the semantic-temporal requirements of cinematic storytelling.
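The division of labor among the three agents can be pictured with a minimal Python sketch. Everything here is an illustrative assumption, not the paper's implementation: the `Scene` structure, the function names, and the placeholder logic (one scene per sentence, string labels in place of video clips) are hypothetical stand-ins for the LLM and video-model calls the real agents would make.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scene:
    """One scene from the Director Agent's breakdown (hypothetical structure)."""
    description: str
    shots: List[str] = field(default_factory=list)


def director_agent(outline: str) -> List[Scene]:
    """Stage 1 (footage construction): split a story outline into scenes.

    Placeholder logic: one scene per sentence. A real agent would call an LLM.
    """
    sentences = [s.strip() for s in outline.split(".") if s.strip()]
    return [Scene(description=s) for s in sentences]


def cinematography_shot_agent(scene: Scene, shots_per_scene: int = 2) -> Scene:
    """Stage 2, step 1: attach shot descriptions to a scene (placeholder text)."""
    scene.shots = [
        f"Shot {i + 1} of '{scene.description}'" for i in range(shots_per_scene)
    ]
    return scene


def video_generation_agent(scenes: List[Scene]) -> List[str]:
    """Stage 2, step 2: render each shot and keep the clips in story order.

    A clip is represented by a string label instead of real video data.
    """
    return [f"clip<{shot}>" for scene in scenes for shot in scene.shots]


def run_pipeline(outline: str) -> List[str]:
    """Chain the three agents: outline -> scenes -> shots -> ordered clips."""
    scenes = [cinematography_shot_agent(s) for s in director_agent(outline)]
    return video_generation_agent(scenes)


for clip in run_pipeline("A pilot finds a map. She flies north."):
    print(clip)
```

The point of the sketch is the data flow: each agent consumes the previous agent's structured output, so scene and shot order are fixed before any video is rendered.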
Recursive Shot Generation and Cinematic Language Injection
The core of Camera Artist's contribution lies in two modules integrated within the Cinematography Shot Agent: Recursive Shot Generation (RSG) and Cinematic Language Injection (CLI).
Recursive Shot Generation
RSG generates each shot conditioned on prior context (the scene, the previous shot, and the script), explicitly modeling shot-to-shot dependencies to maintain narrative continuity. Shot types are chosen from context as scene-start, midpoint, or endpoint, and shots are chained so that the sequence flows logically and visually. This mirrors how human filmmakers plan shots sequentially and restores the inter-shot connections that prior systems typically lack.
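The recursion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the `ShotWriter` signature, `shot_role`, `generate_shots`, and `stub_writer` are hypothetical names; assigning the role purely from the shot's index is a simplification; and the stub writer stands in for the LLM call that would produce a real shot description.

```python
from typing import Callable, List, Optional

# Hypothetical writer signature: (script, scene, previous_shot, role) -> shot text.
ShotWriter = Callable[[str, str, Optional[str], str], str]


def shot_role(index: int, total: int) -> str:
    """Label a shot by its position in the scene (assumed, simplified rule).

    A scene with a single shot is labeled "scene-start".
    """
    if index == 0:
        return "scene-start"
    if index == total - 1:
        return "endpoint"
    return "midpoint"


def generate_shots(
    script: str,
    scene: str,
    total: int,
    write_shot: ShotWriter,
    previous: Optional[str] = None,
    index: int = 0,
) -> List[str]:
    """Recursively generate shots, each conditioned on the one before it."""
    if index >= total:
        return []  # base case: the scene is complete
    role = shot_role(index, total)
    shot = write_shot(script, scene, previous, role)
    # The new shot becomes the "previous shot" context for the next call.
    return [shot] + generate_shots(script, scene, total, write_shot, shot, index + 1)


def stub_writer(script: str, scene: str, previous: Optional[str], role: str) -> str:
    """Stand-in for an LLM call: records which shot this one follows."""
    after = "none" if previous is None else previous.split(" |")[0]
    return f"{role} | scene={scene} | after={after}"


for line in generate_shots("a short script", "harbor at dawn", 3, stub_writer):
    print(line)
```

Because every call receives the previous shot as an argument, no shot can be written without its predecessor, which is the dependency that keeps the sequence continuous.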
Cinematic Language Injection
CLI produces film-style, cinematically expressive shot descriptions. Qwen3 LLMs are fine-tuned with LoRA on paired data (ordinary VLM-generated shot captions and professional shot annotations) to transform standard shot descriptions into film-oriented technical language. CLI enriches the generated shots with explicit intent regarding shot size, camera movement, composition, and lighting.
Figure 2: (a) Recursive Generation ensures each shot leverages prior context; (b) Cinematic Language Injection transforms vanilla shot descriptions into domain-specific, expressive language.
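To make the paired-data idea concrete, the following Python sketch builds one supervised training record that maps a plain caption to a film-language description. It is an assumption-laden illustration: the `CinematicShot` fields follow the four attributes named above, but the record layout (`prompt` and `completion` keys), the instruction wording, and the example values are all hypothetical and not taken from the paper's dataset. The LoRA fine-tuning step itself is not shown.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CinematicShot:
    """Target-side annotation covering the four attributes CLI makes explicit."""
    shot_size: str
    camera_movement: str
    composition: str
    lighting: str
    action: str

    def to_text(self) -> str:
        """Render the annotation as one film-language shot description."""
        return (
            f"{self.shot_size}, {self.camera_movement}. {self.action} "
            f"Composition: {self.composition}. Lighting: {self.lighting}."
        )


def build_training_pair(plain_caption: str, target: CinematicShot) -> Dict[str, str]:
    """Pair a plain caption (input) with its cinematic rewrite (target).

    The prompt/completion layout is an assumed, generic fine-tuning format.
    """
    instruction = "Rewrite this shot description in professional film language:\n"
    return {"prompt": instruction + plain_caption, "completion": target.to_text()}


# Invented example values, for illustration only.
example = build_training_pair(
    "A woman walks along a pier.",
    CinematicShot(
        shot_size="Wide shot",
        camera_movement="slow dolly-in",
        composition="subject on the left third",
        lighting="low-key, warm backlight",
        action="A woman walks along a pier.",
    ),
)
print(example["prompt"])
print(example["completion"])
```

Keeping the target as structured fields before rendering it to text makes it easy to check that every training example actually specifies all four cinematic attributes.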
Experimental Results
Quantitative and Qualitative Evaluation
Camera Artist is benchmarked against state-of-the-art multi-agent systems (VGoT, Anim-Director, MovieAgent) using MoviePrompts and custom storytelling datasets. The evaluation combines objective metrics (VBench, CLIP-T) with automatic and human assessments of narrative and cinematic quality. Camera Artist achieves strong results across all axes, with the highest scores in video quality, narrative and camera-movement consistency, and aesthetic and dynamic metrics.
Figure 3: For similar shot content, Camera Artist demonstrates superior cinematic richness compared to prior methods.
Figure 4: Inter-shot narrative coherence is maintained by conditioning each shot on explicit scene and preceding shot context.
Notably, Camera Artist surpasses strong baselines on CLIP-based semantic consistency and all VLM-based evaluation dimensions, including Script Consistency, Camera-Movement Consistency, Video Quality, and Real-Movie Similarity.
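As background on the CLIP-based score, CLIP-T is commonly computed as the average cosine similarity between the text embedding of a prompt and the image embeddings of the generated frames. The Python sketch below shows only that arithmetic, assuming this common definition; the summary does not specify the paper's exact protocol. The embeddings are toy two-dimensional vectors rather than real CLIP outputs, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def clip_t_score(
    text_embedding: Sequence[float],
    frame_embeddings: Sequence[Sequence[float]],
) -> float:
    """Average text-to-frame cosine similarity over all frames of a video."""
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    sims = [cosine_similarity(text_embedding, f) for f in frame_embeddings]
    return sum(sims) / len(sims)


# Toy vectors standing in for CLIP embeddings (assumed values).
text = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]
print(clip_t_score(text, frames))
```

In practice the embeddings would come from a CLIP text encoder and image encoder; a higher average indicates that the frames stay closer to the prompt's semantics.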
User Study
User studies using a five-point Likert scale confirm that participants rate Camera Artist above competing systems in script adherence, cinematic motion consistency, and overall visual quality.
Figure 5: User study results demonstrate consistently higher subjective ratings for Camera Artist across four cinematic and narrative metrics.
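For readers unfamiliar with how such ratings are typically summarized, the Python sketch below averages five-point Likert responses per system and criterion. It is a generic illustration, not the paper's analysis: the system labels and ratings are invented placeholders, not study data, and `mean_ratings` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Response = Tuple[str, str, int]  # (system, criterion, rating on a 1-5 scale)


def mean_ratings(responses: Iterable[Response]) -> Dict[Tuple[str, str], float]:
    """Average Likert ratings for each (system, criterion) pair."""
    buckets: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for system, criterion, rating in responses:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
        buckets[(system, criterion)].append(rating)
    return {key: mean(values) for key, values in buckets.items()}


# Placeholder responses: invented numbers, NOT results from the paper.
placeholder_responses = [
    ("system_a", "script adherence", 4),
    ("system_a", "script adherence", 5),
    ("system_b", "script adherence", 3),
    ("system_b", "script adherence", 2),
]

for (system, criterion), score in sorted(mean_ratings(placeholder_responses).items()):
    print(f"{system} | {criterion}: {score:.2f}")
```

Reporting one mean per system and criterion gives the kind of grouped comparison that a bar chart of user-study results usually displays.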
Ablation and Module Analysis
Ablation experiments demonstrate the distinct contributions of RSG and CLI. Removing RSG yields a fragmented narrative and abrupt visual transitions; omitting CLI results in static, generic visuals that lack purposeful cinematic expression. The full Camera Artist configuration is necessary for both cohesive storytelling and rich cinematic realization.
Figure 6: RSG sustains narrative flow across shots; CLI ensures dynamic cinematography and expressive visual style.
Practical and Theoretical Implications
Camera Artist’s explicit modeling of both narrative reasoning and cinematic style represents a significant evolution in the automation of long-form video generation. Practically, this framework enables automated, professional-grade cinematic content creation from textual input, with applications in creative industries, educational content, rapid prototyping, and personalized media production. Theoretically, it demonstrates that integrating recursive, agentic reasoning and domain adaptation (via LoRA) into LLM-driven pipelines is critical for high-level sequence tasks in multimodal generation.
Unique to this framework is its capacity for reference-free generation: it can synthesize character/scene assets from the script, operating purely from a textual outline without pre-existing visual references. This further broadens its applicability.
Figure 7: When only provided with textual stories and no reference images, Camera Artist autonomously produces coherent characters, structured scenes, and cinematic video sequences.
Future Directions
Future work could scale the fine-tuned LLMs with additional cinematic annotation corpora (e.g., extending ShotBench coverage, incorporating annotated real-world film datasets), integrate more tightly with diffusion-based video generation models, and incorporate more nuanced temporal editing and pacing controls. Addressing current limitations in cross-shot identity retention and developing automatic evaluation of “cinematic intent” at the system level are also promising research directions.
Conclusion
Camera Artist systematically integrates multi-agent collaboration, recursive context reasoning, and cinematic language adaptation into automated video generation. It consistently delivers superior results in narrative coherence, visual diversity, motion dynamics, and cinematic quality compared to previous multi-agent frameworks. The framework demonstrates the necessity of domain-specific reasoning over both narrative and stylistic axes, representing a substantive step toward scalable, fully automated cinematic video generation systems.