- The paper introduces LOCALUT, which employs LUT canonicalization, reordering, and slice streaming to enhance low-bit DNN inference in DRAM-PIM.
- It reduces LUT size by up to 611× and, by optimizing capacity-computation tradeoffs, achieves a geometric mean speedup of 1.82× over state-of-the-art LUT approaches and up to 4.73× over naive PIM.
- It improves energy efficiency by up to 3.37× over naive PIM and scales across diverse workloads, including BERT, ViT, and OPT.
LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM
Introduction and Motivation
The paper addresses the intersection of DRAM-based Processing-in-Memory (PIM) architectures and low-bit quantized Deep Neural Network (DNN) inference, focusing on the disproportion between available DRAM capacity and the area- and energy-limited logic density for on-chip computation. Recent advances in PIM have mitigated the memory wall for data-intensive primitives but remain suboptimal for compute-intensive DNN kernels such as GEMM due to the prohibitive cost of embedding sufficient arithmetic units in DRAM technology nodes. Lookup Table (LUT)-based inference emerges as an alternative, leveraging the abundance of DRAM capacity to replace complex arithmetic with table lookups, particularly advantageous for quantized models with a limited operation space.
Technical Contributions
LOCALUT introduces a memory-centric LUT-based inference architecture for DRAM-PIM, focusing on maximizing arithmetic throughput via operation-packing in LUT entries. Three synergistic techniques are presented:
- LUT Canonicalization: By analyzing the inherent permutation redundancy in operation-packed LUT indices, the authors design a canonicalization procedure that massively reduces LUT size. For p-way packed LUTs, only canonical input configurations are stored, leveraging the fact that MAC operations are invariant under joint index permutations. This reduces table size growth from exponential to polynomial with packing degree at fixed bitwidths.
- Reordering LUT: Canonicalization introduces runtime permutation overhead, because the weights must be permuted to match the sorted activations. LOCALUT adds an auxiliary reordering LUT that, given the permutation that sorts the activations and a weight vector, outputs the canonicalized weight vector in a single additional lookup. This offloads the bitwise permutation from the limited in-memory processing core, trading a slight capacity expansion for a significant compute reduction.
- LUT Slice Streaming: To exploit the DRAM-buffer hierarchy, LUT slice streaming partitions the LUT vertically, loading only access-relevant columns (slices) into the faster local buffer. With input-stationary reuse, slices are amortized across many weight vector accesses, enabling support for higher packing degrees without incurring full LUT reloads.
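The interplay between the canonical LUT and the reordering LUT can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes unsigned activation and weight codes, takes the canonical form to be the activation tuple sorted in non-decreasing order, and uses dictionaries in place of DRAM-resident tables. The function names (build_canonical_lut, build_reordering_lut, sorting_permutation, packed_mac) are hypothetical.

```python
from itertools import combinations_with_replacement, permutations, product


def build_canonical_lut(p, act_bits, wgt_bits):
    """Store p-way dot products only for sorted (canonical) activation tuples."""
    act_vals = range(2 ** act_bits)
    wgt_vals = range(2 ** wgt_bits)
    lut = {}
    # combinations_with_replacement yields non-decreasing tuples only.
    for acts in combinations_with_replacement(act_vals, p):
        for wgts in product(wgt_vals, repeat=p):
            lut[(acts, wgts)] = sum(a * w for a, w in zip(acts, wgts))
    return lut


def build_reordering_lut(p, wgt_bits):
    """Map (permutation, weight tuple) to the permuted weight tuple."""
    wgt_vals = range(2 ** wgt_bits)
    return {
        (perm, wgts): tuple(wgts[i] for i in perm)
        for perm in permutations(range(p))
        for wgts in product(wgt_vals, repeat=p)
    }


def sorting_permutation(acts):
    """Indices that put the activations in non-decreasing order."""
    return tuple(sorted(range(len(acts)), key=lambda i: acts[i]))


def packed_mac(acts, wgts, canon_lut, reorder_lut):
    """Compute sum(a_i * w_i) with two lookups instead of p multiplies."""
    perm = sorting_permutation(acts)
    canon_acts = tuple(acts[i] for i in perm)
    canon_wgts = reorder_lut[(perm, tuple(wgts))]  # lookup 1: align weights
    return canon_lut[(canon_acts, canon_wgts)]     # lookup 2: packed MAC


canon = build_canonical_lut(p=3, act_bits=2, wgt_bits=1)
reorder = build_reordering_lut(p=3, wgt_bits=1)
print(packed_mac((3, 0, 2), (1, 1, 0), canon, reorder))  # 3*1 + 0*1 + 2*0 = 3
```

In this sketch the sorting permutation depends only on the activations, so under input-stationary reuse it could be computed once per activation group and reused for every weight vector; only the two lookups repeat per weight vector.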
A first-order analytical model determines the optimal packing degree and when to switch between buffer-resident and streaming LUT designs by balancing slice reuse against DRAM access overheads.
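A toy version of such a first-order model is sketched below to make the tradeoff concrete. It is not the paper's model: the cost terms, the entry counts (canonical activation tuples times all weight patterns), the assumption that one slice holds every weight pattern for a single canonical activation tuple, and all parameter names and values (buf_entries, t_lookup, t_load) are illustrative assumptions.

```python
from math import comb


def canonical_entries(p, act_bits, wgt_bits):
    """Assumed LUT size: sorted activation tuples times all weight tuples."""
    return comb(2 ** act_bits + p - 1, p) * 2 ** (wgt_bits * p)


def slice_entries(p, wgt_bits):
    """Assumed slice size: every weight tuple for one activation tuple."""
    return 2 ** (wgt_bits * p)


def estimate_cost(p, k, n, act_bits, wgt_bits, buf_entries, t_lookup, t_load):
    """Estimate (mode, cost) for one length-k activation vector reused
    against n weight vectors. Costs are in arbitrary time units."""
    groups = -(-k // p)                  # ceil(k / p) packed groups
    lookup_cost = groups * n * t_lookup  # one lookup per group per weight vector
    if canonical_entries(p, act_bits, wgt_bits) <= buf_entries:
        # Whole LUT stays in the local buffer; its one-time load is ignored.
        return "buffer-resident", lookup_cost
    if slice_entries(p, wgt_bits) <= buf_entries:
        # One slice is streamed per group and reused across all n weight vectors.
        stream_cost = groups * slice_entries(p, wgt_bits) * t_load
        return "streaming", lookup_cost + stream_cost
    return "infeasible", float("inf")


def best_packing_degree(max_p, **params):
    """Return (p, (mode, cost)) with the lowest estimated cost."""
    costs = {p: estimate_cost(p, **params) for p in range(1, max_p + 1)}
    return min(costs.items(), key=lambda item: item[1][1])


# Made-up parameters, chosen only to exercise both modes.
params = dict(k=768, n=64, act_bits=3, wgt_bits=1,
              buf_entries=4096, t_lookup=1.0, t_load=0.25)
p, (mode, cost) = best_packing_degree(8, **params)
print(p, mode, cost)
```

With these made-up numbers, the per-group slice load grows with the slice size and eventually outweighs the savings from fewer lookups, so the estimate favors an intermediate packing degree. Increasing n amortizes each slice over more weight vectors and shifts the estimated optimum toward higher p.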
Empirical Results
Evaluation is conducted on a real UPMEM-based DRAM-PIM system and via architectural simulation. The main findings are:
- LOCALUT achieves a geometric mean speedup of 1.82× over state-of-the-art LUT approaches ("LUT Tensor Core" (LTC) [63], "T-MAC" [94]) and up to 4.73× over naive PIM for low-bit quantized DNN inference.
- The energy reduction is substantial, with up to 3.37× improvement over naive PIM and 1.88× over LTC for highly quantized (W1Ax) configurations.
- Canonicalization achieves order-of-magnitude LUT size reductions (up to 611× at p=7, b_a=1), enabling packing degrees unattainable for prior architectures.
- The reordering LUT reduces the runtime weight permutation overhead, which otherwise accounts for a substantial share of total compute time in canonicalized designs.
- LUT slice streaming further extends supported packing degrees by enabling partial-LUT reuse in limited local buffers without excessive DRAM traffic.
- Speedup is more pronounced for extremely low-bit settings (W1A3, W1A4), in line with the industry trend toward increasingly aggressive quantization.
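The size reduction from canonicalization can be estimated by counting entries. The sketch below assumes that only activation tuples are canonicalized (sorted), so the number of canonical activation tuples is the number of multisets of size p drawn from 2^b_a values, C(2^b_a + p - 1, p). This counting is an assumption for illustration, not a formula quoted from the paper, and the function names are hypothetical.

```python
from math import comb


def full_lut_entries(p, act_bits, wgt_bits):
    """Entries when every (activation tuple, weight tuple) pair is stored."""
    return 2 ** ((act_bits + wgt_bits) * p)


def canonical_lut_entries(p, act_bits, wgt_bits):
    """Entries when only sorted activation tuples are stored (assumed scheme)."""
    return comb(2 ** act_bits + p - 1, p) * 2 ** (wgt_bits * p)


def reduction(p, act_bits, wgt_bits):
    """Ratio of full to canonical LUT size."""
    return full_lut_entries(p, act_bits, wgt_bits) / canonical_lut_entries(
        p, act_bits, wgt_bits
    )


# Tabulate sizes for 1-bit weights and 3-bit activations (W1A3).
for p in range(1, 8):
    print(p,
          full_lut_entries(p, 3, 1),
          canonical_lut_entries(p, 3, 1),
          round(reduction(p, 3, 1), 1))
```

Under this assumed counting, 3-bit activations at p=7 give a reduction of about 611×, the same magnitude as the largest reduction quoted above; the sketch does not establish which configuration the paper measured. The weight factor 2^(b_w·p) cancels in the ratio, so the reduction here depends only on the activation bitwidth and p.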
Performance robustness is confirmed across a range of DNN workloads, including BERT, ViT, and OPT, with the methods shown to generalize irrespective of matrix size or batch size. The architecture is further demonstrated to be portable to conventional bank-level PIM designs (e.g., HBM-PIM), with minimal area overhead (0.0591 mm² per bank LUT vs 0.0592 mm² for SIMD).
Implications and Theoretical Significance
Practical Implications: LOCALUT provides a scalable mechanism for exploiting DRAM's capacity to boost compute throughput in quantized DNN inference, directly addressing the dichotomy between abundant memory and scarce logic in DRAM-PIM. The architecture enables real hardware deployments to achieve near-optimal low-bit GEMM performance without requiring complex host offload or wide multipliers. The design offers hardware flexibility, as LUTs can be reconfigured for diverse precisions or even alternative operations, allowing applicability across generations of quantized DNN models.
Theoretical Impacts: The canonicalization and slice streaming methods formalize a capacity-computation tradeoff, previously only exploited peripherally in LUT-based neural accelerators. The work shows that operation-packing, coupled with combinatorial redundancy elimination, can asymptotically alter hardware feasibility for high-throughput applications in constrained logic environments.
Contradictory Claims: The authors empirically demonstrate that, contrary to conventional wisdom, DRAM-resident LUTs are not always optimal; buffer-sized LUTs with canonicalization and streaming outperform approaches that naïvely leverage only DRAM size. Moreover, for moderate and large bitwidths, traditional MACs remain competitive because LUT size scaling becomes unfavorable, highlighting the limits of LUT dominance.
Future Directions
Avenues for future exploration include:
- Integration into accelerators beyond DRAM-PIM such as GPU-side memory systems, providing fine-grained dynamic tradeoff management between table-based and arithmetic paths.
- Specialized logic acceleration for reordering procedures or in-memory permutation, as the reordering LUT index calculation remains a dominant latency source.
- Extension to sparse or structured quantizations beyond current integer/floating-point formats, leveraging LUTs' flexibility.
- Dynamic LUT partitioning and sharing algorithms to minimize redundant capacity usage in multi-tenant inference scenarios.
- Exploration of the interaction between LUT-based computations and inter-bank/inter-DIMM communication fabric as network bandwidth, rather than compute or capacity, may become dominant at higher packing degrees.
Conclusion
LOCALUT makes a compelling case for operation-packed LUT inference as the primary compute primitive for DRAM-PIM DNN acceleration in the quantized regime. By synergistically combining canonicalization, lightweight reordering, and slice streaming, LOCALUT demonstrates significant improvements in throughput and energy efficiency at feasible area overheads. The work positions LUT-centric designs as a key enabler for future scalable memory-centric inference architectures and sets the foundation for broader adoption as quantization continues to reduce the operation space of DNNs.
Reference:
"LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM" (2604.04523)