- The paper presents a novel KV-cache management framework that leverages semantic token clustering and hierarchical indexing to reduce memory usage.
- It employs M-DCI and head-specific ANN retrieval to optimize token selection while preserving near-full accuracy on long sequences.
- Empirical results demonstrate significant latency reduction and scalability over baselines, enabling efficient long-context LLM inference.
IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
Motivation and Problem Statement
Autoregressive generation with LLMs relies on the Key-Value (KV) cache to store intermediate attention states for efficient token-by-token inference. However, the cache’s linear growth with sequence length results in severe memory pressure, especially during tasks needing extended context, such as summarization or chain-of-thought (CoT) reasoning. Existing solutions either evict tokens or offload the cache between GPU and CPU memory, but they identify relevant tokens imprecisely, handle dynamic updates inefficiently, and lose accuracy in long-context scenarios.
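To make the linear growth concrete, here is a back-of-the-envelope sketch of KV-cache size for a single sequence. The function name `kv_cache_bytes` is hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumed Llama-3.1-8B-style configuration rather than a figure taken from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


# Assumed shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16/bf16 elements.
for n in (4_000, 36_000, 250_000):
    gib = kv_cache_bytes(n, num_layers=32, num_kv_heads=8, head_dim=128) / 2**30
    print(f"{n:>7} tokens -> {gib:.2f} GiB")
```

Under these assumptions every cached token costs 128 KiB, so the footprint scales in direct proportion to context length and batch size.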
IceCache Framework and Design
IceCache introduces a novel strategy integrating semantic token clustering with PagedAttention. Rather than organizing memory pages based on token order, IceCache clusters semantically similar tokens in a hierarchical structure (the DCI-tree) mapped directly to memory pages. This structure is dynamically updatable, facilitating efficient maintenance during decoding.
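The benefit of grouping pages by similarity rather than by token order can be seen in a toy example. The sketch below is only an illustration under strong assumptions: keys are synthetic one-hot "topic" vectors, and sorting by topic stands in for the clustering that the DCI-tree performs. All names (`pages_touched`, `order_pages`, `cluster_pages`) are hypothetical.

```python
import numpy as np

PAGE_SIZE = 4
num_tokens, num_topics = 16, 4

# Synthetic keys: token i belongs to topic i % 4 and has a one-hot key.
topics = np.arange(num_tokens) % num_topics
keys = np.eye(num_topics)[topics]


def pages_touched(page_of_token, wanted):
    """Number of distinct pages that must be fetched to cover `wanted` tokens."""
    return len({int(page_of_token[t]) for t in wanted})


# Layout A: pages follow token order (tokens 0-3 on page 0, 4-7 on page 1, ...).
order_pages = np.arange(num_tokens) // PAGE_SIZE

# Layout B: sort tokens by topic (a stand-in for clustering), then cut into pages.
perm = np.argsort(topics, kind="stable")
cluster_pages = np.empty(num_tokens, dtype=int)
cluster_pages[perm] = np.arange(num_tokens) // PAGE_SIZE

# A query aligned with topic 0; the tokens worth fetching are its top scorers.
query = np.eye(num_topics)[0]
scores = keys @ query
wanted = np.argsort(-scores, kind="stable")[:PAGE_SIZE]

print("token-order pages touched:", pages_touched(order_pages, wanted))
print("similarity pages touched: ", pages_touched(cluster_pages, wanted))
```

With this synthetic data the four most relevant tokens are spread over four token-order pages but sit in a single similarity-grouped page, which is the effect the page layout is designed to exploit.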
Key technical components include:
- Semantic clustering: Key embeddings are hierarchically organized via Multi-level Dynamic Continuous Indexing (M-DCI), maximizing the co-location of tokens relevant to a given query.
- Head-specific retrieval: For each attention head, IceCache performs approximate nearest neighbor (ANN) search on the DCI-tree to select top-k relevant pages, enabling fine-grained retrieval.
- Efficient paging and offloading: Bulk GPU-CPU transfers are optimized via preloading buffers, reducing data movement latency. IceCache fully overlaps CPU indexing with GPU computation for latency hiding.
- Incremental updates: New tokens are incrementally inserted into the DCI-tree; page splits maintain balance and ensure efficient access.
This approach substantially increases retrieval hit rates, reduces fragmentation, supports dynamic adaptation as new tokens are generated, and avoids overhead from retrieving irrelevant tokens.
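The components listed above can be pictured with a deliberately simplified per-head index. This sketch is not M-DCI or the DCI-tree: it is flat rather than hierarchical, it inserts by distance to page centroids, it splits an overflowing page at the median of its highest-variance coordinate, and it scores pages exactly instead of using ANN search. The class name `PageIndex`, its methods, and the capacity of 4 are hypothetical.

```python
import numpy as np


class PageIndex:
    """Toy index for one attention head: each page holds ids of similar keys."""

    def __init__(self, dim, capacity=4):
        self.dim = dim
        self.capacity = capacity
        self.keys = []    # key vector for each token id
        self.pages = []   # list of pages, each a list of token ids

    def _centroid(self, page):
        return np.mean([self.keys[t] for t in page], axis=0)

    def insert(self, key):
        """Incrementally add one key; split the target page if it overflows."""
        token_id = len(self.keys)
        self.keys.append(np.asarray(key, dtype=np.float64))
        if not self.pages:
            self.pages.append([token_id])
            return token_id
        dists = [np.linalg.norm(self.keys[token_id] - self._centroid(p))
                 for p in self.pages]
        target = int(np.argmin(dists))
        self.pages[target].append(token_id)
        if len(self.pages[target]) > self.capacity:
            self._split(target)
        return token_id

    def _split(self, idx):
        """Median split along the coordinate with the largest variance."""
        page = self.pages[idx]
        mat = np.stack([self.keys[t] for t in page])
        axis = int(np.argmax(mat.var(axis=0)))
        order = np.argsort(mat[:, axis], kind="stable")
        half = len(page) // 2
        self.pages[idx] = [page[i] for i in order[:half]]
        self.pages.append([page[i] for i in order[half:]])

    def select_pages(self, query, k):
        """Return ids of the k pages whose centroids score highest for the query."""
        scores = [float(np.dot(query, self._centroid(p))) for p in self.pages]
        top = np.argsort(scores)[::-1][:k]
        return [int(i) for i in top]


# One such index would be kept per attention head; a single head is shown here.
index = PageIndex(dim=2, capacity=4)
for j in range(6):
    index.insert([10.0, 0.1 * j])    # "topic A" keys, even token ids
    index.insert([-10.0, 0.1 * j])   # "topic B" keys, odd token ids

selected = index.select_pages(np.array([1.0, 0.0]), k=2)
fetched = sorted(t for p in selected for t in index.pages[p])
print(fetched)
```

Although the two topics arrive interleaved, the splits separate them, so a query aligned with topic A fetches only pages containing even token ids. A real implementation would replace the linear centroid scan with the tree search described above.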
Empirical Results
Accuracy
IceCache was evaluated using LongBench, Passkey Retrieval, GSM8K CoT, and RULER benchmarks across several LLMs (Llama3.1-8B-Instruct, Mistral-7B-Instruct-v0.2, Qwen3-32B, LongChat-7B-v1.5).
- LongBench: With a 256-token budget, IceCache maintains 99% of full KV-cache accuracy. With only 64 tokens, it surpasses PQCache, which needs a 4x larger cache to attain similar performance.
- GSM8K CoT: At a 10% token budget, IceCache yields 47.4% accuracy (vs. 48.2% for full KV), outperforming PQCache (46.2%) and ArkVale (30.9%).
- RULER: For contexts up to 250k tokens, IceCache retains ≥99% accuracy on needle retrieval tasks with only a minimal increase in latency.
Latency
- Time to Second Token (TT2T): On a 36k-token sequence, IceCache achieves a competitive latency of 7.7s. With page reuse (IceCache(reuse)), latency drops to 5.9s, on par with OmniKV (5.8s) and lower than ArkVale and PQCache.
- Decoding Latency: Eviction-based baselines are fastest but lose accuracy. Among retrieval-based methods, IceCache, with and without reuse, offers the best accuracy-latency balance, with a per-token latency of 0.06s for IceCache(reuse), substantially faster than PQCache.
- Latency Scaling: Under extremely long contexts (up to 300k tokens), IceCache's per-token latency grows sublinearly, in sharp contrast to the much steeper growth observed in full-attention baselines.
Robustness
Evaluations across attention architectures (GQA and standard multi-head attention) and model scales show that IceCache consistently delivers strong performance and scalability, suggesting that the approach generalizes.
Theoretical and Practical Implications
IceCache’s semantic clustering strategy represents a fundamental shift in KV-cache management by leveraging dynamic, query-aware token grouping. This minimizes bandwidth requirements and memory overhead in high-throughput inference regimes, laying groundwork for efficient long-context processing.
Practically, IceCache enables LLM deployment on resource-constrained hardware, facilitating new applications in document-level QA, extended summarization, and real-time chain-of-thought generation. The hierarchical, dynamic index aligns with sparse attention and hardware-friendly memory management, paving the way for further optimizations such as page reuse and hybrid quantization.
Future Directions
Potential extensions include:
- Integrating IceCache with context compression and dynamic context selection across layers.
- Further acceleration via hardware-aligned sparse kernels and asynchronous memory management.
- Application to models with even longer context windows and retrieval-augmented generation.
- Exploration of adaptive clustering strategies, leveraging in-situ attention statistics, as well as expansion to multi-modal contexts.
Conclusion
IceCache provides a scalable, memory-efficient framework for KV-cache management in long-sequence LLM inference. By combining semantic clustering, hierarchical indexing, and efficient offloading, it achieves high accuracy and low latency with minimal memory footprint. Experimental validation across diverse tasks and model architectures highlights its superiority over established baselines, establishing IceCache as a robust solution for practical, long-context LLM deployment (2604.10539).