- The paper introduces a novel neural codec using masked diffusion and I/P-frame temporal conditioning to ensure exact pixel-level reconstruction.
- Its architecture employs bijective tokenization and lightweight reference embeddings to exploit spatial and temporal redundancies, achieving a 29.71% compression rate.
- Experimental results show significant bitrate reductions over H.264 and H.265 lossless modes, though encoding speed remains a challenge for real-time applications.
NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning
Introduction and Motivation
Lossless video compression is critical in domains requiring pixel-exact fidelity, such as medical imaging, professional broadcast, and digital film archiving. Traditional codecs, notably H.264 and H.265 in their lossless modes, utilize fixed block-based prediction and hand-designed entropy coding to exploit spatial and temporal redundancies. Contemporary neural video coding, however, has predominantly focused on lossy compression, leveraging deep spatiotemporal context modeling to optimize rate-distortion trade-offs—an approach fundamentally incompatible with strict lossless requirements.
While neural techniques have advanced lossless image compression—culminating in strong methods such as LC-FDNet, ArIB-BPS, and, more recently, hierarchical parallel adaptive coding (HPAC)—the extension to lossless video remains largely unaddressed in the literature. Notably, existing neural image compressors operate frame-independently and thus fail to exploit temporal redundancy, while established generative video compressors explicitly trade off perfect fidelity for perceptual realism and bitrate gains via generative modeling.
The NeuralLVC approach (2604.03353) addresses this gap, proposing a codec architecture that combines masked diffusion entropy modeling with explicit I/P-frame temporal conditioning. The design ensures exact pixel recovery in YUV420 or RGB domains and leverages lightweight reference embeddings to model short-term temporal structure efficiently, setting a new benchmark in neural lossless video coding.
NeuralLVC Architecture
NeuralLVC employs a two-branch architecture reflecting the I/P-frame paradigm. The I-frame branch handles the first frame in a group of pictures (GOP) independently, while subsequent frames are coded as temporally-conditioned P-frames, referencing the previously decoded frame. This design directly exploits the temporal redundancy present in video streams, as illustrated below.


Figure 2: Temporal redundancy and compression cost for the coastguard sequence (Y channel); static regions compress much more efficiently due to high temporal redundancy.
The I-frame model uses a bijective linear tokenization that guarantees pixel-level losslessness. Each 8-bit pixel is deterministically mapped to a unique even-numbered token. The P-frame model encodes the per-pixel difference relative to the previous frame, centered to guarantee invertibility. In both cases, tokens are packed into non-overlapping 32×32 patches for parallel processing.
The entropy model is a masked diffusion Transformer (derived from LLaDA), employing bidirectional attention within each patch to exploit non-causal spatial context. Key architectural optimizations include: (a) group-wise parallelism (from HPAC) for efficient decoding, and (b) a reference embedding table for P-frames, which conditions token representations on co-located pixel values from the reference frame with minimal parameter overhead (≈1.3% increase).
Temporal Conditioning and Group-wise Parallelism
Temporal conditioning is realized by passing, at every pixel position, an embedding of the corresponding token from the previous frame (reference embedding) into the masked diffusion backbone. This enables the entropy model to receive local temporal context, significantly reducing uncertainty in P-frame residuals and hence the average code length.
Group-wise parallelism mitigates the computational inefficiency of sequential diffusion-based autoregressive decoding. Patches are divided into groups (based on a user-tunable δ parameter), and all masked tokens in one group are predicted in parallel per diffusion step. This yields 10× acceleration over naive per-token inference, a necessary prerequisite for practical deployment at video-scale resolutions.
Experimental Results and Analysis
NeuralLVC is evaluated on the Xiph.org CIF video dataset (9 sequences, YUV420), with I/P-architecture and full end-to-end arithmetic coding for exact reconstruction verification. The codec demonstrates a 29.71% compression rate on average—18.3% lower than H.265 lossless and 19.2% lower than H.264 lossless, both widely deployed traditional solutions. These gains are consistent across low-motion and high-motion sequences.
The ablation experiments strongly isolate the source of the performance improvements: without temporal conditioning (i.e., I-frame model or mere frame differencing without reference embeddings), compression rates remain high (e.g., 49.56% and 45.91%). Adding reference embedding drops the rate dramatically to 29.71%, indicating that temporal conditioning via lightweight reference embeddings is the enabling factor for bridging the performance gap with hand-designed codecs.
Analysis of per-frame rates shows that P-frame costs are nearly constant and that the initial I-frame rate is amortized to below 1% of total video bitrate for long sequences. This stability contrasts with large fluctuations seen in H.265, which arise from its complex B/P/I GOP structure.
Speed remains a limitation: NeuralLVC encodes at approximately 0.06 FPS on CIF resolution (NVIDIA GH200), compared to 2.2 FPS for H.265 ("veryslow" preset) and 0.13 FPS for VVC (QP=0). Still, this speed, enabled through group-wise parallel patch processing, is practical for offline archival workloads, which constitute the primary application domain for lossless video codecs.
Implications and Theoretical Advancements
NeuralLVC demonstrates that masked diffusion models, when augmented with local temporal references, can achieve and surpass traditional codec performance in the lossless video compression domain. The strict bijectivity of the tokenization scheme ensures provable reconstruction guarantees, overcoming prior neural proposals (e.g., cluster-based tokenization in LMCompress) that could only offer token-level or approximate losslessness.
The work substantiates the claim that temporal redundancy in video is not trivially extractable using naive differencing or per-frame image coding, as the learned entropy model's capacity to adaptively exploit both spatial and temporal structure is crucial. This is reflected in the strong numerical results compared to both classic and state-of-the-art neural image codecs.
Future advances could focus on addressing speed limitations, incorporating cross-patch and cross-channel modeling for further efficiency, and extending the architecture to handle arbitrary GOP structures, adaptive scene-change detection, or more sophisticated temporal models (potentially leveraging long-context memory as in recent DCVC-LCG).
Furthermore, the framework provides a path for integrating pre-trained multimodal models, LLM-based entropy coders, or distillation techniques to trade-off between rate, speed, and computational resources for deployment in different production pipelines.
Conclusion
NeuralLVC establishes a new paradigm for neural lossless video compression, rigorously combining masked diffusion entropy modeling with explicit I/P-frame temporal conditioning and bijective linear tokenization for exact reconstruction. The results demonstrate compression rates substantially better than H.264 and H.265 lossless on Xiph CIF sequences and validate the hypothesis that temporal conditioning is critical for neural lossless video compression performance. Practical adoption will likely depend on further engineering optimizations to improve computational throughput and integration into diverse archival and industrial workflows, but this work provides strong theoretical and empirical evidence for advancing neural coding in the strict lossless setting.