Papers
Topics
Authors
Recent
Search
2000 character limit reached

StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

Published 8 Apr 2026 in cs.CL | (2604.06746v1)

Abstract: As LLMs scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens based on local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on local saliency snapshots at a specific layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the specific layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations: First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify global information hubs. Second, Dynamic Pivot Detection utilizes information-theoretic metrics to adaptively locate the optimal layer for compression. Finally, Structural Propagation and Decoupling separates the computational budget from the memory storage budget. Experimental results on the LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness.

Authors (3)

Summary

  • The paper introduces StructKV, which preserves the structural skeleton of token sequences using global in-degree centrality.
  • The paper employs dynamic pivot detection to compress the KV cache effectively, balancing prefill speed, decoding latency, and accuracy.
  • The paper demonstrates near-lossless long-context inference on benchmarks like LongBench and RULER while significantly reducing computation.

Structure-Aware KV Cache Compression for Long-Context LLM Inference: The StructKV Framework

Introduction: Motivation and Context

Scaling context windows for LLMs to encompass millions of tokens introduces severe memory and compute bottlenecks, predominantly driven by the linear growth of the Key-Value (KV) cache. Standard approaches for long-context inference optimize either the prefill (computationally bound due to quadratic attention complexity) or decoding (memory bound owing to the KV cache) phase, often sacrificing one for improvements in the other. Most recent compression methods operate on local attention saliency at a single layer, systematically discarding tokens that may function as information hubs across the network, thus undermining long-range dependency retention and retrieval robustness.

The StructKV framework defines a fundamentally different paradigm by preserving the "structural skeleton" of the context. It leverages global in-degree centrality, dynamic pivot detection, and decoupled structural propagation to robustly compress the KV cache while maintaining near-lossless performance under extreme memory constraints. StructKV's design addresses the trilemma of prefill speed, decoding latency, and long-context accuracy without their conventional trade-offs. Figure 1

Figure 1: Comparison of StructKV with state-of-the-art methods. StructKV exhibits superior robustness under strict KV cache budgets and resolves the trilemma of speed, latency, and accuracy.

Methodological Innovations

Global In-Degree Centrality

StructKV replaces myopic, layer-local saliency snapshots with an aggregated cross-layer in-degree centrality metric. The centrality score for each token is recursively computed across early network layers, with an exponential decay factor λ\lambda enabling semantic prioritization of deeper layers. This aggregation robustly identifies tokens that serve as persistent information hubs, ensuring structural anchors are retained irrespective of transient local saliency fluctuations. The formalization enables the structural skeleton to persist through aggressive pruning, preserving essential long-range dependencies for logical and retrieval tasks.

Dynamic Pivot Detection

Selecting a fixed pruning layer is brittle given architectural scaling. StructKV employs online pivot detection via information-theoretic metrics (entropy, sparsity, variance) to dynamically locate the phase transition where attention stabilizes. A composite gradient-based transition score Tl\mathcal{T}_l identifies the structural phase boundary, allowing compression to occur precisely when the context's semantic organization is fully formed. This dynamic alignment generalizes across model depths and facilitates reliable adaptation in evolving LLM architectures. Figure 2

Figure 2: The three-phase workflow of StructKV, illustrating global centrality accumulation, pivot detection, and structural propagation with decoupling.

Structural Propagation and Decoupling

StructKV decouples the structural propagation (computation) budget from the KV cache (memory) budget. At the dynamically detected pivot layer, only tokens comprising the structural skeleton (high centrality plus local window) are retained and propagated, significantly reducing computation for subsequent layers. The KV cache used for decoding is separately selected via local saliency, allowing ultra-low memory footprints while preserving rich structural context during prefill. This design enables high-fidelity reasoning and retrieval even under strict memory constraints.

Empirical Evaluation and Results

Robustness Across Compression Regimes

StructKV demonstrates superior robustness on LongBench and RULER benchmarks. Under a 10% KV cache budget, StructKV achieves near-lossless accuracy, outperforming FastKV (+1.02) and GemFilter (+8.21) with a LongBench average of 48.61, closely tracking full-context performance (49.33) while reducing prefill computation by 40%. Summarization and code tasks particularly benefit from global centrality: MultiNews and GovReport scores surpass FastKV, affirming StructKV's ability to preserve distributed structural anchors required for complex summarization and code reasoning. Figure 3

Figure 3: Layer-wise hidden state fidelity demonstrating stable recovery mass with StructKV, substantially outperforming local snapshot methods at depth.

Adaptability to Model Depth

Dynamic pivot detection ensures StructKV generalizes across architectures. Pivot layers shift from 12 (Qwen-2.5-7B) to 28 (Qwen-2.5-32B), illustrating the inapplicability of static pruning methods on deep models. The transition score automatically aligns structural pruning with stabilization of attention patterns, preventing premature structural information loss. Figure 4

Figure 4: Adaptability of pivot selection across Qwen-2.5 Series, Llama-3.1-8B-Instruct, and Ministral-8B-Instruct.

Figure 5

Figure 5: Visualization of dynamic pivot detection on Qwen-2.5-32B-Instruct, showing entropy and sparsity evolution with layer index.

Extreme Context Handling

StructKV sustains high-fidelity retrieval at 128K+ tokens (RULER), outperforming FastKV and other baselines. When local saliency snapshot methods fail under extreme length pressure (FastKV drops to 68.2, StructKV retains 73.6 at 128K), StructKV's historical centrality accumulation acts as a safety net, preserving dormant but topologically critical tokens. Needle-in-a-Haystack tests yield perfect retrieval scores (1.000) for StructKV, matching full-context and FastKV but outperforming GemFilter and SnapKV. Figure 6

Figure 6: Needle-in-a-Haystack results on LLaMA-3.1-8B-Instruct with 10% KV retention rate, confirming stable retrieval across contexts and depths.

Decoupling Analysis and Efficiency

Coupling computation and cache budgets leads to catastrophic accuracy collapse under aggressive compression. StructKV's decoupling enables robust accuracy recovery (+13.8 points when increasing structural rate to 20% under 10% KV), and at 30%+ structural propagation, sustains high-fidelity regardless of tight cache budgets. Figure 7

Figure 7: Decoupling analysis showing robust accuracy under independent computation and memory budgets.

End-to-end latency evaluations demonstrate StructKV provides 1.87×\times prefill speedup with negligible additional overhead (<2.5% latency from centrality and pivot detection components). The efficient vectorized implementation and block-based memory management ensure practical deployment at scale. Figure 8

Figure 8: Prefill latency breakdown (LLaMA-3.1-8B, 32K context) showing substantial attention and FFN computational reductions post-pruning.

Practical and Theoretical Implications

StructKV's global structural awareness addresses the brittleness of local saliency for long-context inference, enabling memory-efficient deployment without quality trade-offs. The dynamic pivot detection mechanism supports rapid adaptation in large/deep LLMs, critical for scalability in real-world production systems. Decoupling propagation and cache constraints is essential for architecting inference systems robust to heterogeneous resource budgets and dynamic workloads.

On a theoretical front, the centrality-based paradigm aligns with graph-theoretic principles of information flow in Transformer networks and invites further exploration into structural importance metrics over attention topologies. The framework's limitations highlight opportunities for scaling to million-token contexts, extending to MoE and non-attention architectures, and optimizing aggregation for hardware with restricted memory bandwidth.

Conclusion

StructKV advances the state of scalable long-context inference by structurally preserving global information hubs, dynamically aligning compression with semantic stabilization, and decoupling computational and memory requirements. Experimental results rigorously validate its superiority over local snapshot, static pruning, and decoding-only baselines in accuracy, robust retrieval, and efficiency. StructKV is architecturally agnostic and poised for practical deployment in memory-constrained LLM applications, with potential extensions to future model architectures and scaling regimes.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.