LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM

Published 6 Apr 2026 in cs.AR | (2604.04523v1)

Abstract: Lookup tables (LUTs) have recently gained attention as an alternative compute mechanism that maps input operands to precomputed results, eliminating the need for arithmetic logic. LUTs not only reduce logic complexity, but also naturally support diverse numerical precisions without requiring separate circuits for each bitwidth, an increasingly important feature in quantized DNNs. This creates a favorable tradeoff in PIM: memory capacity can be used in place of logic to increase computational throughput, aligning well with DRAM-PIM architectures that offer high bandwidth and easily available memory but limited logic density. In this work, we explore this capacity-computation tradeoff in LUT-based PIM designs, where memory capacity is traded for performance by packing multiple MAC operations into a single LUT lookup. Building on this insight, we propose LOCALUT, a PIM-based design for efficient low-bit quantized DNN inference using operation-packed LUTs. First, we observe that these LUTs contain extensive redundancy and introduce LUT canonicalization, which eliminates duplicate entries to reduce LUT size. Second, we propose reordering LUT, a lightweight auxiliary LUT that remaps weight vectors to their canonical form required by LUT canonicalization with a simple LUT lookup. Third, we propose LUT slice streaming, a novel execution strategy that exploits the DRAM-buffer hierarchy by streaming only relevant LUT columns into the buffer and reusing them across multiple weight vectors. Evaluated on a real system based on UPMEM devices, we demonstrate a geometric mean speedup of 1.82x across various numeric precisions and DNN models. We believe LOCALUT opens a path toward scalable, low-logic PIM designs tailored for LUT-based DNN inference. Our implementation of LOCALUT is available at https://github.com/AIS-SNU/LoCaLUT.


Authors (9)

Summary

The paper introduces LOCALUT, which employs LUT canonicalization, reordering, and slice streaming to enhance low-bit DNN inference in DRAM-PIM.
It reduces LUT size by up to 611× and achieves a geometric mean speedup of 1.82× over prior LUT-based methods, and up to 4.73× over naive PIM, by exploiting the capacity-computation tradeoff.
It demonstrates significant energy efficiency improvements and scalable performance across diverse workloads, including BERT, ViT, and OPT.


Introduction and Motivation

The paper addresses the intersection of DRAM-based Processing-in-Memory (PIM) architectures and low-bit quantized Deep Neural Network (DNN) inference, focusing on the disproportion between available DRAM capacity and the area- and energy-limited logic density for on-chip computation. Recent advances in PIM have mitigated the memory wall for data-intensive primitives but remain suboptimal for compute-intensive DNN kernels such as GEMM due to the prohibitive cost of embedding sufficient arithmetic units in DRAM technology nodes. Lookup Table (LUT)-based inference emerges as an alternative, leveraging the abundance of DRAM capacity to replace complex arithmetic with table lookups, particularly advantageous for quantized models with a limited operation space.

Technical Contributions

LOCALUT introduces a memory-centric LUT-based inference architecture for DRAM-PIM, focusing on maximizing arithmetic throughput via operation-packing in LUT entries. Three synergistic techniques are presented:

LUT Canonicalization: By analyzing the inherent permutation redundancy in operation-packed LUT indices, the authors design a canonicalization procedure that substantially reduces LUT size. For $p$-way packed LUTs, only canonical input configurations are stored, leveraging the fact that a MAC result is unchanged when the same permutation is applied to both the weight and activation vectors. At a fixed bitwidth, this reduces the growth of table size with packing degree from exponential to polynomial.
Reordering LUT: Canonicalization introduces runtime permutation overhead for weight alignment. LOCALUT introduces an auxiliary reordering LUT that, given a sorted activation permutation and weight vector, outputs the canonicalized weight vector in a single additional lookup, offloading the bitwise permutation from the limited in-memory processing core and trading slight capacity expansion for significant compute reduction.
LUT Slice Streaming: To exploit the DRAM-buffer hierarchy, LUT slice streaming partitions the LUT vertically, loading only access-relevant columns (slices) into the faster local buffer. With input-stationary reuse, slices are amortized across many weight vector accesses, enabling support for higher packing degrees without incurring full LUT reloads.
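
The interplay between canonicalization and the reordering LUT can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: operands are treated as small unsigned integers, the canonical form is taken to be "activations sorted in nondecreasing order", and the names `build_canonical_lut`, `build_reordering_lut`, and `packed_dot` are hypothetical.

```python
from itertools import combinations_with_replacement, permutations, product

def build_canonical_lut(p, bw, ba):
    """Assumed layout: one row per weight tuple, one column per SORTED activation tuple."""
    cols = list(combinations_with_replacement(range(2 ** ba), p))  # sorted tuples only
    rows = list(product(range(2 ** bw), repeat=p))
    col_index = {c: i for i, c in enumerate(cols)}
    row_index = {r: i for i, r in enumerate(rows)}
    # Each entry packs p multiply-accumulates into a single precomputed value.
    table = [[sum(w * a for w, a in zip(r, c)) for c in cols] for r in rows]
    return table, row_index, col_index

def build_reordering_lut(p, bw):
    """Auxiliary table: (permutation, weight tuple) -> permuted weight tuple."""
    return {
        (perm, w): tuple(w[i] for i in perm)
        for perm in permutations(range(p))
        for w in product(range(2 ** bw), repeat=p)
    }

def packed_dot(weights, acts, table, row_index, col_index, reorder):
    """p-way dot product using one reordering lookup and one canonical LUT lookup."""
    # Permutation that sorts the activations (shared by every weight vector
    # that is multiplied against this activation group).
    perm = tuple(sorted(range(len(acts)), key=lambda i: acts[i]))
    canon_acts = tuple(acts[i] for i in perm)
    canon_w = reorder[(perm, tuple(weights))]  # replaces an explicit shuffle
    return table[row_index[canon_w]][col_index[canon_acts]]

if __name__ == "__main__":
    table, row_index, col_index = build_canonical_lut(p=3, bw=1, ba=2)
    reorder = build_reordering_lut(p=3, bw=1)
    print(len(col_index))  # 20 canonical columns instead of 4**3 = 64
    print(packed_dot((1, 0, 1), (3, 1, 2), table, row_index, col_index, reorder))  # 5
```

In this toy setting the lookup result always equals the directly computed dot product, because permuting weights and activations together leaves the sum of products unchanged. The reordering table costs extra capacity, which mirrors the trade described above: a little more memory in exchange for removing the permutation work from the processing core.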

A first-order analytical model determines the optimal packing degree and when to switch between buffer-resident and streaming LUT designs by balancing slice reuse against DRAM access overheads.
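
The slice reuse that this model weighs against DRAM access overhead can be illustrated with a toy input-stationary matrix-vector product. The sketch below is an assumption-laden illustration rather than the paper's kernel: the "DRAM" table is a Python dictionary keyed by sorted activation tuples, the "buffer" is a single fetched slice, the weight permutation is applied directly instead of through a reordering LUT, and the names `canonical_columns` and `gemv_slice_streaming` are hypothetical.

```python
from itertools import combinations_with_replacement, product

def canonical_columns(p, bw, ba):
    """Assumed DRAM-resident LUT: one slice (column) per sorted activation tuple."""
    weight_rows = list(product(range(2 ** bw), repeat=p))
    return {
        acts: {w: sum(wi * ai for wi, ai in zip(w, acts)) for w in weight_rows}
        for acts in combinations_with_replacement(range(2 ** ba), p)
    }

def gemv_slice_streaming(W, x, p, dram_lut):
    """Input-stationary GEMV: fetch one slice per activation group, reuse it for all rows."""
    out = [0] * len(W)
    slice_loads = 0
    for g in range(0, len(x), p):
        acts = x[g:g + p]
        perm = sorted(range(p), key=lambda i: acts[i])
        # Models the DRAM -> local buffer copy of the only column this group needs.
        buffer_slice = dram_lut[tuple(acts[i] for i in perm)]
        slice_loads += 1
        for m, row in enumerate(W):
            w = row[g:g + p]
            # Every weight vector in this column group hits the same buffered slice.
            out[m] += buffer_slice[tuple(w[i] for i in perm)]
    return out, slice_loads

if __name__ == "__main__":
    W = [[1, 0, 1, 1, 0, 0],
         [0, 1, 1, 0, 1, 1],
         [1, 1, 1, 1, 1, 1]]
    x = [3, 1, 2, 0, 2, 2]
    lut = canonical_columns(p=3, bw=1, ba=2)
    out, loads = gemv_slice_streaming(W, x, 3, lut)
    print(out, loads)
```

The number of slice loads depends only on the number of activation groups, not on the number of weight rows, so each load is amortized over more lookups as the weight matrix grows. That ratio is the kind of quantity a first-order model would compare against the cost of a DRAM access.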

Empirical Results

Evaluation is conducted on a real UPMEM-based DRAM-PIM system and via architectural simulation. The main findings are:

LOCALUT achieves a geometric mean speedup of 1.82$\times$ over state-of-the-art LUT approaches ("LUT Tensor Core" (LTC) [63], "T-MAC" [94]) and up to 4.73$\times$ over naive PIM for low-bit quantized DNN inference.
The energy reduction is substantial, with up to 3.37$\times$ improvement over naive PIM and 1.88$\times$ over LTC for highly quantized (W1A$x$) configurations.
Canonicalization achieves order-of-magnitude LUT size reductions (up to 611$\times$ at $p=7$, $b_a=1$), enabling packing degrees unattainable for prior architectures.
The reordering LUT reduces the runtime weight permutation overhead, which otherwise accounts for a substantial share of total compute time in canonicalized designs.
LUT slice streaming further extends supported packing degrees by enabling partial-LUT reuse in limited local buffers without excessive DRAM traffic.
Speedup is more pronounced for extremely low-bit settings (W1A3, W1A4), consistent with the industry trend toward increasingly aggressive quantization.
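
The scale of the size reduction can be estimated with a simple count. The following is a sketch under one assumption that the summary does not state explicitly: the canonical table keeps one column per multiset of $p$ activation values, each $b_a$ bits wide, instead of one column per ordered tuple. The symbols $N_{\text{full}}$ and $N_{\text{canon}}$ are introduced here for illustration only.

$$
N_{\text{full}} = 2^{b_a p}, \qquad
N_{\text{canon}} = \binom{2^{b_a} + p - 1}{p}, \qquad
\frac{N_{\text{full}}}{N_{\text{canon}}} = \frac{2^{b_a p}}{\binom{2^{b_a} + p - 1}{p}}
$$

For a fixed $b_a$, $N_{\text{canon}}$ is a polynomial in $p$ of degree $2^{b_a} - 1$, while $N_{\text{full}}$ grows exponentially in $p$. As a small example, $b_a = 2$ and $p = 4$ give $256$ ordered tuples but only $\binom{7}{4} = 35$ multisets. Under the same assumption, 3-bit operands at $p = 7$ give $8^7 / \binom{14}{7} = 2097152 / 3432 \approx 611$; the paper's exact accounting and notation may differ.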

Performance robustness is confirmed across a range of DNN workloads, including BERT, ViT, and OPT, with the methods shown to generalize irrespective of matrix size or batch size. The architecture is further demonstrated to be portable to conventional bank-level PIM designs (e.g., HBM-PIM), with minimal area overhead (0.0591 mm² per bank LUT vs 0.0592 mm² for SIMD).

Implications and Theoretical Significance

Practical Implications: LOCALUT provides a scalable mechanism for exploiting DRAM's capacity to boost compute throughput in quantized DNN inference, directly addressing the dichotomy between abundant memory and scarce logic in DRAM-PIM. The architecture enables real hardware deployments to achieve near-optimal low-bit GEMM performance without requiring complex host offload or wide multipliers. The design offers hardware flexibility, as LUTs can be reconfigured for diverse precisions or even alternative operations, allowing applicability across generations of quantized DNN models.

Theoretical Impacts: The canonicalization and slice streaming methods formalize a capacity-computation tradeoff, previously only exploited peripherally in LUT-based neural accelerators. The work shows that operation-packing, coupled with combinatorial redundancy elimination, can asymptotically alter hardware feasibility for high-throughput applications in constrained logic environments.

Contradictory Claims: The authors empirically demonstrate that, contrary to conventional wisdom, DRAM-resident LUTs are not always optimal; buffer-sized LUTs with canonicalization and streaming outperform approaches that naïvely rely on DRAM capacity alone. Moreover, for moderate to large bitwidths, traditional MACs remain competitive because LUT size scaling becomes unfavorable, which marks the limits of LUT dominance.

Future Directions

Avenues for future exploration include:

Integration into accelerators beyond DRAM-PIM such as GPU-side memory systems, providing fine-grained dynamic tradeoff management between table-based and arithmetic paths.
Specialized logic acceleration for reordering procedures or in-memory permutation, as the reordering LUT index calculation remains a dominant latency source.
Extension to sparse or structured quantizations beyond current integer/floating-point formats, leveraging LUTs’ flexibility.
Dynamic LUT partitioning and sharing algorithms to minimize redundant capacity usage in multi-tenant inference scenarios.
Exploration of the interaction between LUT-based computation and the inter-bank/inter-DIMM communication fabric, since network bandwidth, rather than compute or capacity, may become the dominant bottleneck at higher packing degrees.

Conclusion

LOCALUT makes a compelling case for operation-packed LUT inference as the primary compute primitive for DRAM-PIM DNN acceleration in the quantized regime. By synergistically combining canonicalization, lightweight reordering, and slice streaming, LOCALUT demonstrates significant improvements in throughput and energy efficiency at feasible area overheads. The work positions LUT-centric designs as a key enabler for future scalable memory-centric inference architectures and sets the foundation for broader adoption as quantization continues to reduce the operation space of DNNs.

