KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models

Published 3 Apr 2026 in cs.CV | (2604.03414v1)

Abstract: Video Large Language Models (Video LLMs) achieve strong performance on video understanding tasks but suffer from high inference costs due to the large number of visual tokens. We propose KiToke, a training-free, query-agnostic token compression approach that reduces spatiotemporal redundancy while preserving critical visual information. Our method estimates token diversity globally using a kernel-based redundancy measure, enabling content-adaptive selection that remains effective under extreme token budgets, and further introduces a lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence. Unlike prior methods that rely on local or segment-level heuristics, KiToke explicitly captures global redundancy across an entire video, leading to more efficient token utilization. Extensive experiments on multiple video understanding benchmarks and Video LLM backbones demonstrate that KiToke consistently outperforms existing training-free compression methods, with particularly large gains at aggressive retention ratios down to 1%.


Authors (2)

Summary

The paper introduces KiToke, a training-free framework that uses kernel density estimation to maximize token diversity and address spatiotemporal redundancy.
It employs adaptive temporal interval segmentation and diversity-weighted token merging to preserve semantic structure even under extreme compression ratios.
Empirical results across multiple benchmarks and Video LLM backbones show KiToke maintains high accuracy and efficiency with minimal overhead.

Kernel-based Interval-aware Token Compression for Efficient Video LLMs

Motivation and Problem Context

Video LLMs have demonstrated strong performance on a range of multimodal understanding tasks but face scalability bottlenecks due to the large number of visual tokens required, particularly for videos with long temporal duration and high spatial resolution. Unlike images, videos exhibit compounded redundancy both spatially and temporally—repeated backgrounds, slow motion, recurring objects—that results in excessive token proliferation. Prior approaches to token compression either involve training-time modifications, requiring costly retraining and specific architecture adaptation, or employ inference-time heuristics that often handle spatial or temporal redundancy separately and degrade rapidly under extreme compression budgets.

KiToke Framework: Global Diversity-Driven Compression

KiToke introduces a principled, training-free, query-agnostic token compression framework. The core objective is to maximize information content within a fixed token budget by explicitly addressing global spatiotemporal redundancy and preserving temporal structure. Unlike local or segment-level heuristics, KiToke formulates token selection and aggregation as a global diversity maximization problem using Kernel Density Estimation (KDE) on visual token embeddings across the entire video. This approach yields robust compression and maintains high performance even under extreme retention ratios, such as $\gamma=1\%$.

Figure 1: Performance vs. retention curves across multiple benchmarks and backbones, showing that KiToke degrades more gracefully than prior state-of-the-art methods under extreme token compression.
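To make the retention ratio concrete, the following minimal sketch converts $\gamma$ into an absolute token budget. The frame count, the tokens-per-frame value, and the rounding rule are illustrative assumptions, not settings taken from the paper.

```python
def token_budget(num_frames: int, tokens_per_frame: int, gamma: float) -> int:
    """Number of visual tokens kept at retention ratio gamma (at least one).

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    total = num_frames * tokens_per_frame
    return max(1, round(gamma * total))


# Hypothetical video: 32 sampled frames with 196 visual tokens each (6272 in total).
print(token_budget(32, 196, 0.01))  # prints 63
```

Under these assumed numbers, a 1% budget leaves roughly two tokens per frame on average, which is why selection has to be global rather than per-frame.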

Model Architecture and Compression Pipeline

KiToke operates in three main stages:

1. Kernel-based diversity estimation: KDE is applied to all visual tokens extracted from a video to produce a smooth, nonparametric density. Each token's diversity score is defined as the inverse of its density, penalizing redundant embeddings that cluster in high-density regions.
2. Temporal interval construction: Fine-grained intervals are dynamically constructed using token-level visual difference measures, including positionwise and best-match displacement between consecutive frames. Temporal boundaries are detected via abrupt deviations relative to local dynamic context, ensuring intervals coincide with semantic transitions.
3. Interval-aware token merging: Within each interval, unselected tokens are merged into retained representatives via diversity-weighted averaging. This restricts aggregation to semantically coherent spans, preventing information blurring across unrelated moments.

Figure 2: Schematic overview of KiToke’s architecture, detailing global diversity estimation, adaptive interval segmentation, and diversity-aware token merging.

Figure 3: KDE-based visualization of token diversity estimation and selection; distinctive tokens are preferentially retained through diversity-weighted stochastic sampling.
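The diversity estimation stage can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel applied to L2-normalized embeddings, and the function name `kde_diversity` and the default bandwidth value are hypothetical. Only the use of a Gaussian bandwidth $\alpha$ and the inverse-density definition of diversity come from the summary above.

```python
import numpy as np


def kde_diversity(tokens: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Return one diversity score per token: the inverse of a Gaussian KDE density.

    tokens: (N, D) array of visual token embeddings for the whole video
            (rows are assumed to be non-zero).
    alpha:  Gaussian bandwidth; the default here is an arbitrary illustrative value.
    """
    # L2-normalize so that squared Euclidean distance equals 2 - 2 * cosine similarity.
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sq_dist = np.clip(2.0 - 2.0 * (x @ x.T), 0.0, None)
    # Density of each token: mean kernel response against every token in the video.
    density = np.exp(-sq_dist / (2.0 * alpha ** 2)).mean(axis=1)
    # Tokens in crowded regions get low scores; isolated tokens get high scores.
    return 1.0 / density
```

Because every token contributes a kernel value of 1 to its own density, the density is always at least 1/N and the inverse is well defined. The dense N x N similarity matrix is the simplest formulation; a chunked computation would be needed for very long videos.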

Adaptive Temporal Partitioning and Visual Dynamics Preservation

KiToke’s temporal segmentation combines both absolute magnitude and relative deviation measures, capturing subtle and abrupt changes in token-level visual statistics. This approach overcomes the limitations of global thresholding or cluster-based strategies, which either over-segment or miss semantically relevant transitions. The resulting content-adaptive intervals form the basis for token merging, maintaining temporal coherence and preventing aggregation across misaligned events.

Figure 4: Comparative illustration of temporal interval boundaries; KiToke aligns cuts with content-dependent transitions and local dynamics, outperforming static and cluster-based schemes.
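One way to picture the boundary detection is the sketch below. It is an assumption-laden illustration rather than the paper's procedure: the equal weighting of the positionwise and best-match cues, the OR-combination of the absolute and relative tests, and all parameter names and defaults (`abs_thresh`, `rel_factor`, `window`, `floor`) are hypothetical.

```python
import numpy as np


def frame_differences(frames: np.ndarray) -> np.ndarray:
    """frames: (T, P, D) tokens per frame. Returns (T - 1,) change scores."""
    x = frames / np.linalg.norm(frames, axis=-1, keepdims=True)
    prev, curr = x[:-1], x[1:]
    # Positionwise cue: cosine distance between tokens at the same spatial position.
    positionwise = 1.0 - np.einsum("tpd,tpd->tp", prev, curr)
    # Best-match cue: cosine distance from each current token to its closest
    # token anywhere in the previous frame (tolerates content that moved).
    best_match = 1.0 - np.einsum("tpd,tqd->tpq", curr, prev).max(axis=-1)
    # Equal weighting of the two cues is an assumption of this sketch.
    return 0.5 * (positionwise.mean(axis=1) + best_match.mean(axis=1))


def detect_boundaries(diff, abs_thresh=0.3, rel_factor=3.0, window=4, floor=0.02):
    """Return the frame indices at which a new interval starts."""
    starts = []
    for t, d in enumerate(diff):
        context = diff[max(0, t - window):t]  # recent change scores only
        # Relative test: the change is abrupt compared with local dynamics.
        abrupt = context.size > 0 and d > rel_factor * max(float(context.mean()), floor)
        # Absolute test: the change is large regardless of context.
        if d > abs_thresh or abrupt:
            starts.append(t + 1)  # diff[t] compares frame t with frame t + 1
    return starts


def to_intervals(starts, num_frames):
    """Turn interval start indices into half-open (start, end) frame ranges."""
    edges = [0, *starts, num_frames]
    return list(zip(edges[:-1], edges[1:]))
```

For a video of eight frames with a single detected start at frame 4, `to_intervals([4], 8)` yields `[(0, 4), (4, 8)]`. The `floor` term keeps near-static footage from triggering the relative test on negligible noise.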

Token Selection and Merging: Diversity-weighted Sampling

Instead of deterministic top-K selection, which risks omitting entire groups of meaningful but less unique tokens, KiToke employs stochastic sampling weighted by global diversity scores. This approach ensures representative coverage both within dense groups and rare regions of the embedding space, crucial under aggressive compression. Merging is performed within intervals, using cosine similarity matching and diversity-weighted averaging to aggregate information while minimizing redundancy.
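The selection and merging steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: sampling is done without replacement with probability proportional to diversity, each unselected token is assigned to the single most cosine-similar retained token in its interval, and tokens whose interval has no retained representative are dropped. The function names and these specific choices are hypothetical and are not confirmed by the source.

```python
import numpy as np


def sample_tokens(diversity: np.ndarray, budget: int, rng: np.random.Generator) -> np.ndarray:
    """Draw `budget` distinct token indices with probability proportional to diversity."""
    probs = diversity / diversity.sum()
    return np.sort(rng.choice(diversity.size, size=budget, replace=False, p=probs))


def merge_within_intervals(tokens, diversity, interval_id, keep):
    """Fold each unselected token into a retained token from the same interval.

    tokens: (N, D) embeddings; diversity: (N,) scores;
    interval_id: (N,) interval index of each token; keep: indices of retained tokens.
    Returns (len(keep), D) diversity-weighted averages.
    """
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    weighted_sum = tokens[keep] * diversity[keep, None]
    weight_total = diversity[keep].astype(float)
    kept_mask = np.zeros(len(tokens), dtype=bool)
    kept_mask[keep] = True
    for i in np.flatnonzero(~kept_mask):
        # Candidates are retained tokens that share the interval of token i.
        same = np.flatnonzero(interval_id[keep] == interval_id[i])
        if same.size == 0:
            continue  # assumption: drop tokens with no representative in their interval
        # Pick the most cosine-similar candidate and accumulate the weighted token.
        j = same[np.argmax(x[keep[same]] @ x[i])]
        weighted_sum[j] += diversity[i] * tokens[i]
        weight_total[j] += diversity[i]
    return weighted_sum / weight_total[:, None]
```

Sampling in proportion to diversity, rather than taking the top-K scores, still favors rare tokens but leaves dense clusters a chance of contributing a representative, which matches the coverage argument made in the text.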

Empirical Results and Ablations

KiToke was benchmarked across MVBench, LongVideoBench, MLVU, and VideoMME on three Video LLM backbones: LLaVA-OneVision, LLaVA-Video, and Qwen3-VL. The method consistently outperformed prior training-free baselines, preserving baseline-level accuracy at moderate retention and showing substantial robustness at $\gamma=1\%$. Efficiency comparisons highlighted the minimal overhead and high throughput, with the lowest prefill latencies among competitive methods.

Figure 5: Ablation study of $\alpha$ (Gaussian bandwidth in KDE); performance is stable across a wide range, showing low hyperparameter sensitivity.

Figure 6: Model scaling ablation; KiToke consistently maintains superior relative performance across LLaVA-OneVision variants (0.5B, 7B, 72B).

Qualitative Analysis: Interval and Token Selection

Qualitative visualizations show that KiToke's interval segmentation aligns with content more closely than the baselines do; interval boundaries track semantic and dynamic changes. Token selection visuals show adaptive allocation: more tokens are retained in informative segments and fewer in redundant portions. In challenging cases, where the query-relevant evidence occupies only a small part of the frame, KiToke still preserves more of the critical tokens and yields higher model confidence than the other methods.

Figure 7: KiToke yields interval boundaries matching content-dependent temporal transitions, capturing abrupt deviations related to local dynamics.

Figure 8: Case study of token selection: KiToke retains query-relevant visual evidence, sharply increasing answer confidence at $\gamma=1\%$.

Figure 9: Second qualitative case, further illustrating KiToke's coverage of rare evidence and reduced distraction from redundant content.

Comparative Analysis and Limitations

KiToke is empirically and theoretically distinct from prototype-based (VidCom2), cluster-local (PruneVID), and density-prioritized (FastVID) compression strategies. Its global diversity-driven selection and adaptive interval formation minimize redundancy and preserve semantic structure, especially vital under extreme token constraints. Limitations include the potential for missed query-specific evidence in visually minor regions—alleviable through integration with query-aware keyframe or attention-based techniques. Direct compatibility with specialized spatiotemporal formatting (e.g., newline tokens or timestamp encodings) in certain backbones remains an area for further adaptation.

Implications and Future Directions

KiToke enables efficient, scalable multimodal inference for Video LLMs, facilitating deployment in resource-constrained and long-context applications. Its approach is readily extensible: future work may focus on hybrid query-aware compression, domain adaptation for specialized video structures, and principled integration with evolving Video LLM architectures. The kernel-based global diversity paradigm provides a rigorous foundation for further spatiotemporal compression research.

Conclusion

KiToke establishes a rigorous, training-free framework for video token compression, grounded in global KDE-based diversity estimation and adaptive interval-aware aggregation. By minimizing spatiotemporal redundancy and preserving critical information—even under extreme retention ratios—KiToke sets a new standard for plug-and-play inference efficiency in Video LLMs, with demonstrated generalization and scalability across benchmarks and architectures (2604.03414).
