GPU-Native Approximate Nearest Neighbor Search with IVF-RaBitQ: Fast Index Build and Search

Published 27 Feb 2026 in cs.DB, cs.DS, and cs.IR | (2602.23999v1)

Abstract: Approximate nearest neighbor search (ANNS) on GPUs is gaining increasing popularity for modern retrieval and recommendation workloads that operate over massive high-dimensional vectors. Graph-based indexes deliver high recall and throughput but incur heavy build-time and storage costs. In contrast, cluster-based methods build and scale efficiently yet often need many probes for high recall, straining memory bandwidth and compute. Aiming to simultaneously achieve fast index build, high-throughput search, high recall, and low storage requirement for GPUs, we present IVF-RaBitQ (GPU), a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline. Specifically, for index build, we develop a scalable GPU-native RaBitQ quantization method that enables fast and accurate low-bit encoding at scale. For search, we develop GPU-native distance computation schemes for RaBitQ codes and a fused search kernel to achieve high throughput with high recall. With IVF-RaBitQ implemented and integrated into the NVIDIA cuVS Library, experiments on cuVS Bench across multiple datasets show that IVF-RaBitQ offers a strong performance frontier in recall, throughput, index build time, and storage footprint. For Recall approximately equal to 0.95, IVF-RaBitQ achieves 2.2x higher QPS than the state-of-the-art graph-based method CAGRA, while also constructing indices 7.7x faster on average. Compared to the cluster-based method IVF-PQ, IVF-RaBitQ delivers on average over 2.7x higher throughput while avoiding accessing the raw vectors for reranking.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces a GPU-native, codebook-free quantization method integrated with IVF clustering to accelerate ANN search while reducing storage needs.
It employs a two-stage pipeline with coarse and refinement phases, achieving up to 2.2× higher QPS and 7.7× faster index build times compared to state-of-the-art methods.
Innovative GPU kernel fusion and optimized data layouts minimize global memory traffic, enabling scalable, high-recall retrieval in high-dimensional vector search.

IVF-RaBitQ (GPU): GPU-Native Cluster-Based ANN Search with RaBitQ Quantization

Introduction and Problem Context

Approximate Nearest Neighbor Search (ANNS) is foundational for high-dimensional vector search in domains such as retrieval-augmented generation, recommender systems, and semantic retrieval. Traditional approaches bifurcate into graph-based and cluster-based paradigms. On GPUs, graph-based methods (e.g., HNSW, CAGRA) deliver high recall but at the expense of storage and build efficiency. Cluster-based approaches, notably IVF-PQ, offer efficient index build and adaptability, but often necessitate reranking and incur bandwidth bottlenecks at high recall, especially when using product quantization.

IVF-RaBitQ (GPU) (2602.23999) presents a GPU-native cluster-based ANN indexing and search solution tightly integrating IVF partitioning with RaBitQ vector quantization. The system achieves improvements in throughput, build time, storage footprint, and recall, leveraging architectural and algorithmic innovations tailored for GPU compute and memory hierarchies.

System Design and Algorithmic Innovations

IVF-RaBitQ is architected as a two-stage pipeline: a highly parallelized index build phase and a batched, pipelined search phase.

Index Build: GPU-Optimized RaBitQ Quantization

Following standard Balanced K-means, database vectors are assigned to IVF clusters, normalized, and rotated via a shared orthogonal matrix. Key to RaBitQ is the absence of codebook training. Instead, quantization involves searching over rescaling factors and identifying the nearest code in an analytic codebook defined by per-dimension bit budgets. Recognizing the inefficiency of CPU-optimized, warp-divergent quantization routines, the authors implement a two-phase parallel grid search:

Coarse phase: parallel evaluation of grid-sampled rescaling factors across dimensions.
Refinement phase: narrowed search around the optimal candidate from the coarse phase.

This block-per-vector mapping strategy maximizes GPU parallelism and memory coalescing, enabling quantization of over one million 960-dimensional vectors/s at 8-bit precision.

Search Pipeline: Fused Cluster-Local Kernels

Given a query batch, each vector is rotated, and the nearest clusters are selected by fast centroid distance computation. Search is then scheduled as a set of (query, cluster) pairs, enabling fine-grained parallelism. Each search proceeds via a two-stage estimator:

Stage 1 (Filter): compute lower bounds using 1-bit RaBitQ codes, implemented using two alternative GPU-native inner-product methods.
Stage 2 (Refinement): apply ex-codes only to survivors from the filter stage for full-precision distance estimation.

This logical flow is fused into a single CUDA kernel, minimizing global memory traffic and kernel launch overhead. Careful shared memory reuse is orchestrated across pipeline stages, minimizing SM residency bottlenecks.

Figure 1: Bitwise inner-product computation between 1-bit RaBitQ codes and $B$ -bit quantized queries via bitwise and POPCNT operations.

GPU-Native Inner Product Computation

The innermost kernel operation is the asymmetric inner product between a floating-point (or quantized) query and compressed 1-bit codes. IVF-RaBitQ implements two alternatives:

Lookup Table (LUT) Strategy: Inner products are recast as lookups from $2^U$ -entry tables, trading arithmetic for shared-memory bandwidth. Optimal for moderate dimensions.
Bitwise Decomposition: Queries are quantized. Inner products reduce to combinations of AND and POPCNT instructions across register-packed data, allowing efficient SIMD execution and high parallelism on wide registers, essential for high dimensions.

Index Layout and Data Organization

To maximize global memory bandwidth, IVF-RaBitQ employs a CSR-inspired, cluster-major format. Data is decomposed into contiguous arrays:

1-bit codes (interleaved to align with coalesced 32-thread warps),
ex-codes (for refinement),
auxiliary data (precomputed scaling factors),
vector IDs.
Figure 2: Data layout for IVF-RaBitQ: codes and auxiliary metadata are packed per cluster for high cache coherence, supporting coalesced access by warp-aligned threads.

This layout is tightly coupled to the fused kernel's access patterns, ensuring high L1/L2 cache hit rates and maximizing parallel memory throughput.

Empirical Evaluation

Experiments span diverse datasets (ImageNet, CodeSearchNet, GIST, OpenAI/text, Wiki-all) and contrast IVF-RaBitQ against IVF-Flat, IVF-PQ, and CAGRA.

Time–Accuracy Trade-off

IVF-RaBitQ consistently defines the Pareto frontier, attaining up to 2.2× the QPS of CAGRA and up to 31.4× IVF-PQ (without refinement) throughput at Recall ≈ 0.95, without reranking or storing RAW vectors.

Figure 3: IVF-RaBitQ surpasses cluster- and graph-based competitors in QPS at high recall across datasets and quantization strategies.

Bitwise and LUT modes achieve comparable speed; bitwise scales more effectively with dimensionality (see OpenAI-3072 results for pronounced divergence).

Index Build Time

Thanks to codebook-free RaBitQ and an $O(1)$ -round parallel quantization routine, build time for IVF-RaBitQ is on average 7.7× faster than CAGRA and 4.4× faster than IVF-PQ.

Figure 4: Indexing times show significant build acceleration for IVF-RaBitQ compared to IVF-PQ and CAGRA.

Storage Footprint

By obviating the need for storing and reranking raw vectors, IVF-RaBitQ indexes consume <25% of the storage of CAGRA or IVF-PQ with refinement at equivalent recall. For 10M-vectors (768d Wiki-all), IVF-RaBitQ requires only 7.5GB compared to >30GB for graph-based or reranking methods.

Figure 5: Storage requirement comparisons among indexing techniques. IVF-RaBitQ achieves strong efficiency due to fully quantized search.

Space-Performance Tuning

IVF-RaBitQ exposes a flexible knob (bit-depth per dimension) for trading off recall against storage and throughput. 5 bits/dimension suffices for Recall>0.95 in standard setups; 1-bit mode suffices for sub-0.6 recall with maximal QPS.

Figure 6: IVF-RaBitQ recall and throughput as a function of quantization bits, enabling adaptive deployment under diverse memory/accuracy constraints.

GPU vs. CPU

Direct comparison against RaBitQLib (a high-efficiency CPU system) reveals 12.5–13.0× QPS speedups on comparable hardware at matching accuracy.

Figure 7: IVF-RaBitQ (GPU) vs. RaBitQLib (CPU): order-of-magnitude QPS gains from optimized GPU-native implementation.

Implications and Future Directions

IVF-RaBitQ (GPU) demonstrates that codebook-free quantization can be effectively adapted to highly parallel architectures, rectifying the historical limitations of cluster-based GPU ANN search at high recall and constrained memory. The combination of analytic quantization, index/data layout co-design, and kernel fusion provides a compelling alternative to graph-based methods, especially when rapid index construction and small memory footprints are priorities. Eliminating RAW vector storage and search reranking pipeline dramatically simplifies deployment in production ANN serving.

Theoretically, the work underscores the continued relevance of analytic quantization with strong error guarantees for high-dimensional search. Practically, it paves the way for RaBitQ-based quantization or similar methods to become a mainstay in vector database backends, particularly as embedding dimensionality and dataset sizes continue to scale in modern AI workflows.

Future extensions may explore hybrid index structures combining IVF-RaBitQ with compact proximity graphs for adversarial queries, advanced fusion of code-based and learned quantization, or tight integration with attribute-filtered search and multimodal deployments.

Conclusion

IVF-RaBitQ (GPU) leverages GPU-native algorithm and data layout co-design to enable cluster-based ANN search that is competitive in recall and throughput with state-of-the-art graph-based methods, while being superior in index build time and storage efficiency. The core contribution is showing that analytic, codebook-free quantization (RaBitQ) can be implemented efficiently and scalably for GPU architectures, delivering practical performance advantages across regimes relevant to industry-scale vector search and retrieval systems.

Markdown Report Issue