IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

Published 12 Apr 2026 in cs.LG and cs.AI | (2604.10539v1)

Abstract: Key-Value (KV) cache plays a crucial role in accelerating inference in LLMs by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.

Summary

  • The paper presents a novel KV-cache management framework that leverages semantic token clustering and hierarchical indexing to reduce memory usage.
  • It employs M-DCI and head-specific ANN retrieval to optimize token selection while preserving near-full accuracy on long sequences.
  • Empirical results demonstrate significant latency reduction and scalability over baselines, enabling efficient long-context LLM inference.

Motivation and Problem Statement

Autoregressive generation with LLMs relies on the Key-Value (KV) cache to store intermediate attention states for efficient token-by-token inference. However, the cache grows linearly with sequence length, which creates severe memory pressure, especially in tasks that need extended context, such as summarization or chain-of-thought (CoT) reasoning. Existing solutions either evict tokens or offload part of the cache from GPU to CPU memory, but they suffer from imprecise identification of relevant tokens, inefficient dynamic updates, and degraded accuracy in long-context scenarios.
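To make the linear growth concrete, the following is a minimal back-of-the-envelope sketch. The model configuration (32 layers, 8 KV heads under GQA, head dimension 128, fp16 storage) is an assumption chosen to resemble an 8B-parameter model; it is not taken from the paper, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for seq_len tokens."""
    # Factor of 2: every layer keeps one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

# Assumed configuration (illustrative only, not from the paper).
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_cache_bytes(1, **CONFIG)  # bytes added by each new token

for seq_len in (4_096, 36_000, 250_000):
    gib = kv_cache_bytes(seq_len, **CONFIG) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:6.2f} GiB")
```

Under these assumptions each token adds 128 KiB to the cache, so doubling the context doubles the footprint. That footprint is what offloading methods try to keep off the GPU.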

IceCache Framework and Design

IceCache introduces a novel strategy integrating semantic token clustering with PagedAttention. Rather than organizing memory pages based on token order, IceCache clusters semantically similar tokens in a hierarchical structure (the DCI-tree) mapped directly to memory pages. This structure is dynamically updatable, facilitating efficient maintenance during decoding.

Key technical components include:

  • Semantic clustering: Key embeddings are hierarchically organized via Multi-level Dynamic Continuous Indexing (M-DCI), maximizing the co-location of tokens relevant to a given query.
  • Head-specific retrieval: For each attention head, IceCache performs approximate nearest neighbor (ANN) search on the DCI-tree to select top-k relevant pages, enabling fine-grained retrieval.
  • Efficient paging and offloading: Bulk GPU-CPU transfers are optimized via preloading buffers, reducing data movement latency. IceCache fully overlaps CPU indexing with GPU computation for latency hiding.
  • Incremental updates: New tokens are incrementally inserted into the DCI-tree; page splits maintain balance and ensure efficient access.
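The head-specific page selection described above can be illustrated with a deliberately simplified sketch. It scores every page by the dot product between the query and the page's mean key and keeps the top-k pages per head. This brute-force scoring is a stand-in for the ANN search on the DCI-tree, not the paper's algorithm; the page ids, toy vectors, and the `select_pages` helper are all hypothetical.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(keys):
    """Mean key vector of a page."""
    dim = len(keys[0])
    return [sum(k[d] for k in keys) / len(keys) for d in range(dim)]

def select_pages(query, pages, top_k):
    """Return ids of the top_k pages whose centroid best matches the query."""
    scored = ((dot(query, centroid(keys)), page_id)
              for page_id, keys in pages.items())
    return [page_id for _, page_id in heapq.nlargest(top_k, scored)]

# Toy data (assumed): each attention head has its own pages of 2-d keys.
pages_per_head = {
    0: {"p0": [[1.0, 0.0], [0.9, 0.1]],
        "p1": [[0.0, 1.0], [0.1, 0.9]],
        "p2": [[-1.0, 0.0], [-0.9, -0.1]]},
    1: {"p0": [[0.0, 1.0], [0.2, 0.8]],
        "p1": [[1.0, 0.0], [0.8, 0.2]]},
}
queries = {0: [1.0, 0.2], 1: [1.0, 0.0]}

# Each head retrieves independently, so different heads can pick different pages.
selected = {h: select_pages(queries[h], pages_per_head[h], top_k=1)
            for h in pages_per_head}
print(selected)
```

Because similar keys share a page, fetching a selected page brings in a contiguous block of mostly relevant tokens, which is the property the semantic clustering is meant to provide.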

This approach substantially increases retrieval hit rates, reduces fragmentation, supports dynamic adaptation as new tokens are generated, and avoids overhead from retrieving irrelevant tokens.
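A minimal sketch of incremental insertion with page splits follows. It uses a flat list of pages in place of the hierarchical DCI-tree, assigns each new key to the page with the nearest centroid, and splits an overfull page at the median of its widest dimension. The capacity, the split rule, and the function names are assumptions for illustration; the paper's actual update procedure may differ.

```python
PAGE_CAPACITY = 4  # hypothetical page size in tokens

def centroid(keys):
    dim = len(keys[0])
    return [sum(k[d] for k in keys) / len(keys) for d in range(dim)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split_page(keys):
    """Split an overfull page in two at the median of its widest dimension."""
    dim = len(keys[0])
    widest = max(range(dim),
                 key=lambda d: max(k[d] for k in keys) - min(k[d] for k in keys))
    ordered = sorted(keys, key=lambda k: k[widest])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def insert_key(pages, key):
    """Insert key into the page with the nearest centroid; split if overfull."""
    if not pages:
        pages.append([key])
        return
    idx = min(range(len(pages)), key=lambda i: sq_dist(key, centroid(pages[i])))
    pages[idx].append(key)
    if len(pages[idx]) > PAGE_CAPACITY:
        left, right = split_page(pages[idx])
        pages[idx] = left
        pages.append(right)

# Toy stream of 2-d keys drawn from two well-separated groups (assumed data).
pages = []
stream = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [5.2, 5.1], [0.2, 0.1]]
for key in stream:
    insert_key(pages, key)

for i, page in enumerate(pages):
    print(i, page)
```

In this toy run the fifth key overflows the single page, the split separates the two groups, and the sixth key then lands in the page holding its neighbors, so no page ever exceeds its capacity.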

Empirical Results

Accuracy

IceCache was evaluated using LongBench, Passkey Retrieval, GSM8K CoT, and RULER benchmarks across several LLMs (Llama3.1-8B-Instruct, Mistral-7B-Instruct-v0.2, Qwen3-32B, LongChat-7B-v1.5).

  • LongBench: With a 256-token budget, IceCache maintains 99% of full KV-cache accuracy. With only 64 tokens, IceCache surpasses PQCache, which needs a 4x larger cache to attain similar performance.
  • GSM8K CoT: At a 10% token budget, IceCache yields 47.4% accuracy (vs. 48.2% for full KV), outperforming PQCache (46.2%) and ArkVale (30.9%).
  • RULER: For contexts up to 250k tokens, IceCache retains ≥99% accuracy on needle retrieval tasks, maintaining scalability with minimal latency increase.
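The GSM8K CoT figures above are easier to compare as a fraction of full-KV accuracy. The short calculation below uses only the numbers quoted in the list; the variable names are illustrative.

```python
FULL_KV = 48.2  # GSM8K CoT accuracy (%) with the full KV cache

# Accuracy (%) at a 10% token budget, as quoted above.
scores = {"IceCache": 47.4, "PQCache": 46.2, "ArkVale": 30.9}

# Share of full-KV accuracy that each method retains, in percent.
retained = {name: round(100 * acc / FULL_KV, 1) for name, acc in scores.items()}

for name, pct in retained.items():
    print(f"{name:<9} retains {pct}% of full-KV accuracy")
```

On these figures IceCache retains about 98.3% of full-KV accuracy, against about 95.9% for PQCache and 64.1% for ArkVale.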

Latency

  • Time to Second Token (TT2T): On a 36k-token sequence, IceCache achieves competitive latency (7.7s). With page reuse (IceCache(reuse)), latency drops to 5.9s, on par with OmniKV (5.8s) and faster than ArkVale and PQCache.
  • Decoding Latency: Eviction-based baselines are fastest but lose accuracy. Among retrieval-based methods, IceCache, with and without reuse, offers the best accuracy-latency balance, with a per-token latency of 0.06s for IceCache(reuse), substantially faster than PQCache.
  • Latency Scaling: Under extremely long contexts (up to 300k tokens), IceCache's per-token latency grows sublinearly, in contrast to the much steeper growth observed in full-attention baselines, whose per-token attention cost scales linearly with context length.
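For a rough sense of scale, the latency figures quoted above can be turned into a relative TT2T reduction and a decoding throughput. This is plain arithmetic on the reported numbers, not an additional measurement.

```python
tt2t_base, tt2t_reuse = 7.7, 5.9   # seconds, 36k-token sequence
per_token_latency = 0.06           # seconds per token, IceCache(reuse)

# Relative TT2T saving from page reuse, in percent.
reduction_pct = round(100 * (tt2t_base - tt2t_reuse) / tt2t_base, 1)

# Decoding throughput implied by the per-token latency.
tokens_per_sec = round(1 / per_token_latency, 1)

print(f"Page reuse cuts TT2T by {reduction_pct}%")
print(f"Decoding throughput is about {tokens_per_sec} tokens/s")
```

Page reuse therefore cuts TT2T by roughly 23%, and 0.06s per token corresponds to roughly 17 tokens per second.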

Robustness

Evaluation across attention architectures (GQA, standard multi-head) and model scales shows that IceCache consistently delivers strong performance and scalability, suggesting that the approach generalizes.

Theoretical and Practical Implications

IceCache’s semantic clustering strategy represents a fundamental shift in KV-cache management by leveraging dynamic, query-aware token grouping. This minimizes bandwidth requirements and memory overhead in high-throughput inference regimes, laying groundwork for efficient long-context processing.

Practically, IceCache enables LLM deployment on resource-constrained hardware, facilitating new applications in document-level QA, extended summarization, and real-time chain-of-thought generation. The hierarchical, dynamic index aligns with sparse attention and hardware-friendly memory management, paving the way for further optimizations such as page reuse and hybrid quantization.

Future Directions

Potential extensions include:

  • Integrating IceCache with context compression and dynamic context selection across layers.
  • Further acceleration via hardware-aligned sparse kernels and asynchronous memory management.
  • Application to models with even longer context windows and retrieval-augmented generation.
  • Exploration of adaptive clustering strategies, leveraging in-situ attention statistics, as well as expansion to multi-modal contexts.

Conclusion

IceCache provides a scalable, memory-efficient framework for KV-cache management in long-sequence LLM inference. By combining semantic clustering, hierarchical indexing, and efficient offloading, it achieves high accuracy and low latency with a small GPU memory footprint. Experimental validation across diverse tasks and model architectures shows it matching or outperforming other offloading-based baselines, making IceCache a practical option for long-context LLM deployment (2604.10539).
