Adaptive Block-Scaled Data Types

Published 30 Mar 2026 in cs.CL | (2603.28765v1)

Abstract: NVFP4 has grown increasingly popular as a 4-bit format for quantizing LLMs due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor's sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize LLMs, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at https://github.com/mit-han-lab/fouroversix.

Summary

  • The paper introduces the IF4 format, which adaptively selects between FP4 and INT4 representations per block to minimize quantization error in low-precision models.
  • It stores the per-block choice in the otherwise unused sign bit of the FP8 scale factor and rescales INT4 blocks by 6/7 so that both encodings cover the same range.
  • Empirical results show lower training loss and perplexity than NVFP4, and a synthesized IF4 MAC unit adds about 4.7% latency over an NVFP4 MAC.

Adaptive Block-Scaled Data Types: Methodology and Implications

Introduction

This work introduces Adaptive Block-Scaled Data Types, with particular focus on the Int/Float 4 (IF4) format. The goal is to improve the quantization fidelity of LLMs in memory- and compute-constrained settings, particularly when training or running inference at 4 bits per parameter, a regime where quantization error is a principal barrier. Existing block-scaled 4-bit formats such as NVFP4 concentrate quantization error on near-maximal values within each quantization group. IF4 instead selects between FP4 and INT4 encodings for each block and records the choice in the otherwise unused sign bit of the FP8 scale factor, which reduces quantization error across a wide variety of input distributions.

Limitations of NVFP4 and Prior Techniques

NVFP4 is widely adopted because it is hardware-supported and efficient: it scales blocks of 16 FP4 values by an FP8 E4M3 scale factor. However, its mean squared error (MSE) is unevenly distributed, and near-maximal values in each block incur the largest quantization error. Four Over Six (4/6) addresses this by adaptively scaling each block to a maximum of either 4 or 6, which evens out the error distribution but reduces dynamic range and leaves some quantization levels unused.
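The per-block choice made by Four Over Six can be sketched in a few lines. This is a minimal illustration, assuming round-to-nearest and a real-valued block scale (NVFP4 would round the scale to E4M3); the function names are hypothetical and not taken from the paper's code.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 magnitudes

def round_fp4(v):
    """Round v to the nearest signed FP4 value."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(v) - g))
    return -mag if v < 0 else mag

def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max, round, rescale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target_max  # assumption: kept real-valued, not rounded to E4M3
    return [round_fp4(v / scale) * scale for v in block]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def four_over_six(block):
    """Return (chosen_max, dequantized_block) for the candidate with lower MSE."""
    candidates = {m: quantize_block(block, m) for m in (6.0, 4.0)}
    best = min(candidates, key=lambda m: mse(block, candidates[m]))
    return best, candidates[best]

# A block with several near-maximal values prefers a maximum of 4.
near_max_block = [6.0, 5.0, 4.5, -3.0] + [0.0] * 12
chosen_max, dequantized = four_over_six(near_max_block)
```

With a maximum of 6, the values 5.0 and 4.5 both fall in the wide gap between the FP4 levels 4 and 6. Scaling to a maximum of 4 moves them onto finer levels, at the price of a smaller representable range.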

Post-training quantization (PTQ) with 4-bit formats (e.g., W4A4) has been enabled by transformations mitigating outliers (e.g., Hadamard), but standard block-scaled floating-point formats underperform on flattened distributions. Meanwhile, block-scaled integer schemes (e.g., NVINT4) exhibit better error properties under these conditions, motivating hybrid approaches.
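To illustrate how a Hadamard transform flattens a distribution, the sketch below applies a plain orthonormal fast Walsh-Hadamard transform to a block with a single outlier. It is a simplified assumption-laden example: it omits the random sign flips of the random Hadamard transform, and the helper names are hypothetical.

```python
import math

def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]

def peak_to_rms(x):
    """Ratio of the largest magnitude to the root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms

outlier_block = [8.0] + [0.0] * 15          # one large value, fifteen zeros
flat_block = hadamard_transform(outlier_block)
```

The single outlier is spread evenly over all 16 positions, so the peak-to-RMS ratio drops from 4 to 1. Evenly spread values of this kind are the case where uniformly spaced integer levels tend to do well.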

The IF4 Approach

The IF4 data type adaptively encodes each block of 16 values as either scaled FP4 or scaled INT4, with the sign bit of the block's FP8 E4M3 scale factor indicating the mode. This adds no storage overhead relative to NVFP4, where the sign bit is unused. For blocks encoded as INT4, the ranges are aligned by multiplying values by 7/6 before quantization and by the matching 6/7 factor at dequantization, so that the INT4 maximum of 7 corresponds to the FP4 maximum of 6 and both modes cover the same range. The choice between FP4 and INT4 is made per group by computing the MSE of each encoding and keeping the one with lower error.

This mechanism allows IF4 to adapt to local input statistics, accommodating both the outlier-prone and the uniform regimes, thereby generalizing and unifying the error characteristics that motivated Four Over Six and NVINT4.
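The selection rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes round-to-nearest, keeps the block scale in full precision instead of rounding it to E4M3, ignores the tensor-wide scale, and uses hypothetical function names. The boolean `use_int4` stands in for the scale factor's sign bit.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # FP4 E2M1 magnitudes
INT4_GRID = [k * 6.0 / 7.0 for k in range(8)]          # integers 0..7 rescaled by 6/7

def round_to(grid, v):
    """Round v to the nearest signed value whose magnitude is in grid."""
    mag = min(grid, key=lambda g: abs(abs(v) - g))
    return -mag if v < 0 else mag

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def quantize_if4_block(block):
    """Return (use_int4, scale, dequantized) for one group of values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return False, 0.0, list(block)
    scale = amax / 6.0   # assumption: full precision; IF4 stores this as E4M3
    fp4 = [round_to(FP4_GRID, v / scale) * scale for v in block]
    int4 = [round_to(INT4_GRID, v / scale) * scale for v in block]
    use_int4 = mse(block, int4) < mse(block, fp4)   # ties fall back to FP4
    return use_int4, scale, (int4 if use_int4 else fp4)

# Evenly spread values favour INT4; one outlier plus small values favours FP4.
uniform_block = [-6.0, -5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0,
                 2.0, 3.0, 4.0, 5.0, 6.0, 5.0, -5.0, 0.0]
outlier_block = [6.0] + [0.5, -0.5, 1.0, -1.0, 1.5] * 3

uniform_choice = quantize_if4_block(uniform_block)[0]
outlier_choice = quantize_if4_block(outlier_block)[0]
```

In this sketch the evenly spread block selects INT4 because its values near 5 would otherwise fall in the gap between the FP4 levels 4 and 6. The outlier block selects FP4 because its small values sit exactly on FP4's finer levels near zero.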

Empirical Results

Training: In quantized training (W4A4G4), IF4 significantly reduces training loss relative to NVFP4 and its derivatives. Notably, when combined with aggressive backward-path transforms (e.g., MS-EDEN Hadamard), the relative performance benefit grows further, underscoring the adaptability of IF4 to the input distribution.

Inference and Post-Training Quantization: On standard PTQ benchmarks (e.g., WikiText-2 and C4), IF4 consistently achieves lower perplexity across a range of model sizes compared to MXFP4, NVFP4, NVINT4, and Four Over Six. On downstream tasks (ARC-Easy, ARC-Challenge, HellaSwag, LAMBADA, PIQA), IF4 outperforms all other block-scaled 4-bit schemes on average. The magnitude of improvement is more pronounced for large models.

Mean Squared Error: Across Qwen3.5-35B-A3B weight channels and synthetic normal distributions, IF4 exhibits the lowest quantization error, confirming its theoretical advantage.

Hardware Implementation and Overheads

The feasibility of IF4 for on-chip acceleration is validated via a SystemVerilog MAC block synthesized for a 28 nm CMOS process. The architecture supports mixed FP4/INT4 decoding using the scale factor sign as the block selector, with modest additional logic for range alignment. Relative to a pure NVFP4 MAC, IF4 incurs:

  • Latency: +4.7%
  • Throughput: −4.6%
  • Power: +27.8%
  • Area: +66.6%
  • Energy and Area Efficiency: −25.3% and −42.8%

These increases are primarily localized to the MAC datapath. Because overall accelerator energy and area are dominated by memory and data-movement subsystems, and modern AI workloads are largely memory-bound, the system-level impact is expected to be much smaller than these MAC-level figures suggest. In real deployments, the cost is further masked by the mixed-precision support that leading architectures already require.
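As a consistency check on the figures listed above, the efficiency numbers can be approximately recovered from the throughput, power, and area changes. The sketch assumes energy efficiency means throughput per unit power and area efficiency means throughput per unit area; these definitions are an assumption, not stated in the summary.

```python
def efficiency_change(throughput_delta, cost_delta):
    """Relative change in throughput-per-cost, given relative changes in each.

    Inputs and output are fractions, e.g. +27.8% is 0.278.
    """
    return (1.0 + throughput_delta) / (1.0 + cost_delta) - 1.0

throughput = -0.046   # reported throughput change
power = 0.278         # reported power change
area = 0.666          # reported area change

energy_eff = efficiency_change(throughput, power)   # roughly -0.25
area_eff = efficiency_change(throughput, area)      # roughly -0.43
```

Both results land within a fraction of a percentage point of the reported −25.3% and −42.8%. The small residual is expected because the inputs are themselves rounded.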

Broader Implications and Future Directions

The efficacy of IF4 demonstrates that hybrid quantization schemes, which adaptively select the encoding type at the block level, can offer strong fidelity improvements with no storage overhead and a small MAC latency penalty. This bridges the gap between floating-point and integer quantization characteristics, which is especially valuable for activations and gradients in emerging 4-bit training hardware.

Generalization to other bit allocations (IF3, IF6) and block sizes yields similar improvements, as demonstrated by synthetic and real-model perplexity evaluations. The approach is thus extensible, providing a clear path for the design of future numerical formats tailored for highly efficient, low-precision generative model deployment.

For hardware vendors, supporting adaptive block-scaled formats like IF4 will be increasingly attractive, given the low incremental cost and strong empirical benefits. Opportunities for further research include hardware/software co-design for optimal datapath sharing, minimization of quantization bias under stochastic rounding, and automated data type selection strategies.

Conclusion

Adaptive Block-Scaled Data Types, and IF4 in particular, are an empirically validated advance in low-precision quantization for LLMs. By selecting between FP4 and INT4 representations per block and reusing the existing scale-factor storage, IF4 reduces quantization error and improves both training and inference at 4 bits, at the cost of additional MAC area and power and a small latency increase. The format sets a benchmark for future memory- and compute-efficient deep learning deployment and motivates continued exploration of adaptable numerical encoding schemes for AI accelerators.

Explain it Like I'm 14

What is this paper about?

This paper introduces a new way to store and process very small numbers in computers so LLMs can run faster and use less memory without losing too much accuracy. The new method is called IF4 (short for Int/Float 4). It smartly decides, for small groups of numbers, whether to store them like tiny floats (FP4) or tiny integers (INT4), picking whichever is more accurate for that group. This helps reduce rounding errors when using super-low precision.

What questions were the authors trying to answer?

  • Can we make 4-bit math (which is very fast and memory‑efficient) accurate enough to train and run big AI models?
  • Why do current 4-bit formats sometimes make big mistakes, especially on numbers near the biggest value in each group?
  • Is there a better 4-bit format that adapts to the data so it keeps accuracy but still runs fast on modern hardware?
  • Can this new format be built efficiently in future chips?

How did they do it? (Explained simply)

Think of fitting many numbers onto a tiny ruler with only 16 tick marks. That’s what 4-bit numbers are like: they can only represent 16 different values. When you squeeze real numbers onto that tiny ruler, you have to round them, which causes error.

Modern GPUs use a trick called “block scaling”:

  • You break numbers into small groups (here, 16 at a time).
  • Each group shares a “zoom level” (a scale factor), so the biggest number in the group fits on the tiny ruler.
  • Then each number in the group is rounded to the closest tick mark.

The popular format NVFP4 uses 4-bit floats (FP4) for each value and one 8-bit scale factor (FP8) per group of 16. This is fast and supported by new chips, but it has a problem: rounding errors tend to be large for numbers close to the group’s maximum.
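Here is a small sketch of those block-scaling steps with the FP4 tick marks, showing where the rounding error ends up. It assumes plain round-to-nearest and an exact zoom level (real NVFP4 also rounds the zoom level to FP8); the function name and the example numbers are made up for illustration.

```python
FP4_TICKS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def block_scale_fp4(group):
    """Zoom the group so its biggest number lands on 6, round, then undo the zoom."""
    zoom = max(abs(v) for v in group) / 6.0
    if zoom == 0.0:
        return list(group)
    restored = []
    for v in group:
        tick = min(FP4_TICKS, key=lambda t: abs(abs(v) / zoom - t))
        restored.append((tick if v >= 0 else -tick) * zoom)
    return restored

# Three big numbers, three small ones, and ten zeros: 16 values in total.
group = [12.0, 10.4, 9.0, 1.1, -0.9, 0.1] + [0.0] * 10
restored = block_scale_fp4(group)
errors = [abs(a - b) for a, b in zip(group, restored)]
```

In this example the numbers 10.4 and 9.0, which are close to the group's maximum of 12, are off by about 1.6 and 1.0 after rounding. The small numbers are off by only about 0.1, because the ticks are packed more tightly near zero.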

The paper’s key idea, IF4:

  • Let each group of 16 decide between two tiny rulers:
    • FP4 (tiny floats, unevenly spaced ticks)
    • INT4 (tiny integers, evenly spaced ticks)
  • Pick the one that gives less rounding error for that particular group.
  • Use the group’s scale factor to store the “zoom level” and repurpose one unused bit to record which tiny ruler was chosen.
  • Use a small “6/7” adjustment so both choices line up to the same overall range (so nothing overflows and hardware stays simple).
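The two tiny rulers can be written out and compared directly. This is a minimal sketch, assuming the INT4 ruler uses the whole numbers 0 to 7 (and their negatives) shrunk by 6/7; the helper names are hypothetical.

```python
FP4_TICKS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
INT4_TICKS = [k * 6.0 / 7.0 for k in range(8)]   # integers 0..7 shrunk by 6/7

def largest_gap(ticks):
    """Widest distance between two neighbouring tick marks."""
    return max(hi - lo for lo, hi in zip(ticks, ticks[1:]))

def smallest_gap(ticks):
    """Narrowest distance between two neighbouring tick marks."""
    return min(hi - lo for lo, hi in zip(ticks, ticks[1:]))

fp4_widest, fp4_narrowest = largest_gap(FP4_TICKS), smallest_gap(FP4_TICKS)
int4_widest, int4_narrowest = largest_gap(INT4_TICKS), smallest_gap(INT4_TICKS)
```

Both rulers end at 6, which is what the 6/7 adjustment is for. The FP4 ruler has ticks as close as 0.5 near zero but a gap of 2 between 4 and 6, while every gap on the INT4 ruler is 6/7 (about 0.86).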

Analogy:

  • Imagine each group of numbers as a photo you want to fit into a frame. Sometimes a curved frame (FP4) fits better; sometimes a straight-edged frame (INT4) fits better. IF4 lets you choose the better frame for each photo and labels which frame was used so you can display it correctly later.

They also:

  • Tested IF4 while training a model (all weights, activations, and gradients in 4 bits).
  • Tested IF4 after training (post-training quantization) on several real models.
  • Built a hardware design for an IF4 multiply-accumulate unit to check if it can be implemented efficiently.
  • Extended the idea to other sizes (IF3 and IF6).

What did they find, and why is it important?

Main results:

  • Lower error: IF4 reduces rounding (quantization) error compared to NVFP4 and other 4-bit formats.
  • Better training: In 4-bit training, models using IF4 had training loss closer to high-precision training (BF16). The advantage was even bigger when the training used a technique (like Hadamard transforms) that makes numbers more evenly spread—because those cases benefit from INT4’s even spacing.
  • Better after training: In post-training tests on several models (Nemotron 3 and Qwen 3.5 across sizes), IF4 usually achieved lower perplexity (a measure of how well the model predicts text; lower is better) and higher average scores on tasks like ARC, HellaSwag, LAMBADA, and PIQA.
  • Fits in hardware: A prototype IF4 hardware block ran with only about 4.7% more latency than a standard NVFP4 block—small enough that, in real systems that are often limited by memory speed, this overhead should be negligible.
  • No extra storage: IF4 needs no extra memory compared to NVFP4 because it reuses an unused bit to store the format choice.
  • Generalizable: The same “pick float or integer per group” idea also helps at 3 bits (IF3) and 6 bits (IF6), and with different group sizes.
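The "no extra storage" point comes down to simple arithmetic, sketched below. The calculation assumes one 8-bit scale factor per group and ignores any tensor-wide scale, which is shared by so many values that it barely changes the average; the function name is hypothetical.

```python
def bits_per_value(value_bits, scale_bits=8, group_size=16):
    """Average storage per value: payload bits plus the shared per-group scale."""
    return value_bits + scale_bits / group_size

nvfp4_bits = bits_per_value(4)   # FP4 values plus one FP8 scale per 16 values
if4_bits = bits_per_value(4)     # same layout; the mode reuses the scale's sign bit
if3_bits = bits_per_value(3)
if6_bits = bits_per_value(6)
```

Because the format choice lives in a bit that is already stored, IF4 and NVFP4 both come out at 4.5 bits per value under these assumptions.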

Why it matters:

  • 4-bit math can be up to twice as fast as 8-bit and much faster than 16-bit on new GPUs. If we can keep models accurate with 4 bits, we can train and run them cheaper and faster.
  • IF4 adapts to the data in each group, so it avoids the “one-size-fits-all” problem that causes big rounding errors in some cases.

What could this mean for the future?

  • Faster, cheaper AI: IF4 brings us closer to reliable 4-bit training and inference, which could cut costs and energy use while keeping quality high.
  • Simpler recipes: Because IF4 itself reduces rounding error, future training might need fewer extra tricks, saving time and complexity.
  • Hardware-ready: Since IF4 adds little hardware overhead and no memory overhead, future accelerators can support it to get better accuracy at 4 bits.
  • More flexible formats: The “adaptive per-group choice” idea (IF3, IF4, IF6) suggests a family of formats that can balance speed, memory, and accuracy across different needs.

In short, this paper shows a smart way to choose the best tiny ruler for each small group of numbers, leading to better accuracy at very low precision with no extra memory and only a small slowdown in the math unit.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, organized to guide actionable follow-up research.

Method design and algorithmic choices

  • Lack of analysis on alternative selection criteria beyond per-group MSE (e.g., loss-aware, Hessian-aware, KL, or outlier-aware objectives) for choosing FP4 vs INT4; impact on training/inference quality remains unknown.
  • No study of tie-breaking rules or stability when FP4 and INT4 produce near-equal error; risk of instability across steps/devices is unquantified.
  • The 6/7 range-alignment scaling was chosen heuristically; optimality, sensitivity, and alternatives (including learned or per-layer scaling) are not investigated.
  • INT4 coding details (e.g., giving up −8, sign representation) are under-specified; how different INT4 codings affect error symmetry, implementation complexity, and model quality is an open question.
  • Interaction between IF4 selection and other common transforms (e.g., learned rotations, block-wise transforms like WUSH/MR-GPTQ, per-channel scaling) is unstudied; composability may yield larger gains or conflicts.
  • Effect of grouping/layout (how groups of 16 are formed in memory or by tensor axes) on selection decisions and error is not analyzed; optimal grouping strategies remain unknown.

Evaluation scope and rigor

  • End-to-end performance is not measured on real hardware (all formats emulated via FP32 dequantization); actual wall-clock speedups, throughput, and memory/bandwidth benefits for training and inference remain unquantified.
  • W4A4G4 pretraining results use a 340M model and partial high-precision exceptions (e.g., final layers); effectiveness for fully quantized pipelines and for large-scale pretraining (≥7B–70B) is untested.
  • Downstream evaluation is limited (perplexity plus five tasks); sensitivity on broader and harder reasoning/code/math benchmarks (e.g., MMLU, GSM8K, MATH, HumanEval) is unknown.
  • Calibration details for PTQ (dataset size, sampling strategy, sequence lengths) are not fully specified; reproducibility and sensitivity to calibration choices are unclear.
  • No ablations quantifying how often IF4 selects INT4 vs FP4 per layer/module and when/where gains predominantly arise (e.g., Q/K/V, MLP, attention outputs).
  • Interactions with KV-cache quantization and serving recipes (e.g., W4A8KV4) are not evaluated; implications for real LLM serving are unclear.
  • Robustness across architectures (e.g., Llama, Mistral, Transformer variants), modalities (vision, audio, multimodal), and non-Transformer models remains unexplored.
  • Statistical significance of small average gains (perplexity and accuracy deltas) is not reported; variability across random seeds and datasets is not characterized.

Stochastic rounding and quantization bias

  • Quantization bias with stochastic rounding is acknowledged but only probed in small-scale experiments; large-scale impact on convergence stability, final accuracy, and gradient noise is unresolved.
  • Potential mitigations (e.g., bias-corrected selection, dual-pass selection with SR-aware rules, or adaptive scale bias removal) are not explored.
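For readers unfamiliar with stochastic rounding, the sketch below shows the basic mechanism on an integer grid: a value is rounded up with probability equal to its fractional part, so the result is unbiased on average. This is a generic illustration under that simplifying assumption, not the paper's procedure, which rounds to FP4 or INT4 levels; the function name is hypothetical.

```python
import math
import random

def stochastic_round(v, rng):
    """Round v to floor(v) or floor(v) + 1, rounding up with probability frac(v)."""
    lo = math.floor(v)
    return lo + (1 if rng.random() < v - lo else 0)

rng = random.Random(0)   # fixed seed so the run is repeatable
samples = [stochastic_round(2.25, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would always map 2.25 to 2, a systematic error of −0.25, whereas the stochastic average stays close to 2.25. The open question noted above is how such unbiasedness interacts with a selection step that picks the lower-error encoding per block.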

Alternative formats and block sizes

  • IF3 and IF6 are only simulated; their practical training/inference performance, stability, and hardware feasibility are not validated experimentally.
  • Smaller block sizes (e.g., 8) show promising simulations but may lack hardware support; the practical trade-offs (accuracy vs. bandwidth/compute/memory traffic) and feasibility on real accelerators remain open.
  • The paper fixes FP8 E4M3 as scale format; effects of other scale formats (UE8M0, E5M2, per-channel or per-row scales) on IF-type methods are untested.

Hardware and systems considerations

  • The IF4 MAC is evaluated in isolation (28 nm); system-level modeling with realistic memory pipelines is not provided, so the end-to-end effect of the reported MAC-level overheads is unquantified.
  • Added scaling multiplications (6/7, 36/49) are computed in FP32 in the prototype; cost/benefit of lower-precision or fused scaling, table-based approximations, or amortized scaling is unstudied.
  • Backward compatibility and interoperability with existing NVFP4 kernels/ISAs are unclear; using the FP8 scale’s sign bit as an indicator may conflict with existing FP8 semantics, compression, or tooling.
  • Software-side overhead of double encoding (trying FP4 and INT4 per block) at runtime on current GPUs is not benchmarked; practical cost in training/inference frameworks is unknown.
  • Memory-layout implications (format packing, alignment, kernel scheduling) and their impact on real-world throughput are not analyzed.

Numerical analysis and theory

  • No formal analysis of error bounds or probabilistic guarantees comparing IF4 vs NVFP4/NVINT4 under realistic data distributions (beyond Gaussian) and transforms.
  • Accumulation and scaling error propagation (e.g., with FP16 vs FP32 accumulators) are not examined; many accelerators accumulate in FP16.
  • Sensitivity to tensor dynamic range and outliers (with/without Hadamard) lacks a principled model predicting when IF4 should win; a selection-theoretic framework could guide deployment.

Training recipes and integration

  • Limited exploration of optimizer/schedule dependence (e.g., AdamW vs others), gradient clipping, and normalization strategies (e.g., QK-norm variants) on IF4 effectiveness.
  • Interactions with modern stabilization techniques (e.g., oscillation suppression, late-stage precision switching, adaptive transforms) are not systematically tested.
  • No study of finetuning, RLHF, instruction tuning, or continued pretraining under IF formats; generalization to common LLM workflows remains uncertain.

Standardization and reproducibility

  • A formal bit-level specification for IF4/IF3/IF6 encodings suitable for standardization is missing; edge cases (NaNs, subnormals, denormals for scales) are unspecified.
  • Open-source implementations are referenced but complete details for reproducing all experiments, including datasets, seeds, and exact quantization pipelines, are not consolidated in the main text.

These gaps point to clear next steps: rigorous end-to-end hardware measurements; broader and deeper evaluations across models, tasks, and recipes; principled selection criteria and theoretical analysis; and practical integration studies in real training/serving stacks with standardized encodings.

Practical Applications

Immediate Applications

Below are concrete near-term uses that can be implemented in software stacks and workflows today (even without new hardware), primarily leveraging IF4 as a drop‑in block‑scaled quantizer with fused dequantization.

  • Software/Cloud AI (LLM inference)
    • Use case: Lower‑memory W4A4 post‑training quantization with better accuracy than NVFP4/NVINT4.
    • Workflow/products:
      • Add IF4 quantizer/dequantizer to frameworks such as PyTorch, Triton kernels, vLLM, TensorRT‑LLM plugins, and ONNX runtime backends.
      • Fused “IF4→FP16/BF16” dequantization kernels in GEMM epilogues to minimize overhead on current GPUs.
    • Assumptions/dependencies: No native IF4 tensor‑core support yet; gains mainly from bandwidth/memory reduction and accuracy, not from 4‑bit matmul throughput. Requires kernel engineering to avoid dequantization bottlenecks.
  • Model delivery and checkpoint storage (MLOps)
    • Use case: Compress model weights in IF4 for distribution and on‑disk storage with no accuracy penalty relative to other 4‑bit formats and improved perplexity/task accuracy in many cases.
    • Tools/workflows: Model conversion CLI (BF16→IF4), HF Transformers weight converters, model hub “IF4” variants.
    • Assumptions: Consumers must dequantize to standard FP during load or use fused dequantization at runtime.
  • Quantized training research (academia/industry R&D)
    • Use case: Prototype W4A4G4 training with lower loss than NVFP4; test with Hadamard and MS‑EDEN‑style backward passes where IF4 gains are largest.
    • Workflow: Add IF4 emulation ops to PyTorch; integrate into training recipes; swap weight‑gradient quantization to prefer INT4 blocks post‑Hadamard.
    • Assumptions: Training still uses FP16/BF16 matmuls; stochastic rounding and block‑wise scaling pipelines must be available.
  • Edge/embedded inference (software-defined NPUs, FPGAs)
    • Use case: Deploy LLMs or smaller transformer models with IF4 for memory‑bound edge workloads; dequantize to FP16 on programmable accelerators.
    • Tools: FPGA soft cores or DSP kernels with IF4 decode; embedded runtimes (e.g., TVM micro) with IF4 weight loaders.
    • Assumptions: Compute budget for dequantization available; batch sizes remain small, so memory savings dominate.
  • Quantization toolchains and transformations
    • Use case: Combine IF4 with Hadamard/WUSH/rotation transforms that flatten distributions, letting IF4 select INT4 on uniform blocks and FP4 on outlier‑heavy blocks.
    • Products: Plugins for GPTQ/MR‑GPTQ, SpinQuant, FlatQuant, QuaRot; calibration scripts that compute per‑block MSE to choose FP4 vs INT4.
    • Assumptions: Calibration data and transform kernels available; added per‑block evaluation cost kept low via vectorized kernels.
  • Cost/energy reduction for on‑prem LLMs (healthcare, finance, enterprise IT)
    • Use case: Lower memory footprint and bandwidth for compliant on‑prem deployment; reduce per‑token energy and server count.
    • Workflows: Convert existing models to IF4 (W4A4) during PTQ; deploy via vLLM with fused IF4 dequantization.
    • Assumptions: Accuracy targets remain acceptable under W4A4; approval for non‑standard format storage.
  • Curriculum and benchmarking (academia)
    • Use case: Teaching and reproduction of W4A4G4 recipes that align with modern hardware trends; head‑to‑head PTQ comparisons across Qwen/Nemotron variants.
    • Tools: Open‑source repo extensions (e.g., “fouroversix”) with IF4 reference kernels and evaluation scripts.
    • Assumptions: Access to datasets and compute for calibration and evaluation.

Long‑Term Applications

These rely on hardware, compiler, or ecosystem changes to realize full throughput advantages; they include new products and broader cross‑domain adoption.

  • Hardware accelerators with native IF4 support (semiconductors, cloud GPUs/NPUs)
    • Use case: Tensor cores/NPU MACs decoding IF4 per block with minimal overhead (≈4.7% MAC latency increase vs NVFP4 baseline reported), enabling true FP4‑class matmul speedups.
    • Products:
      • IF4‑capable GPU tensor cores and NPU IP blocks (datacenter, mobile, edge).
      • ISA extensions and CUDA/ROCm intrinsics for IF4 load/decode and matmul.
    • Assumptions: Vendor adoption; compiler and kernel support; backward compatibility with NVFP4 paths.
  • End‑to‑end 4‑bit training at scale (foundation model pretraining)
    • Use case: Mainstream W4A4G4 training with fewer high‑precision fallbacks and reduced reliance on heavy outlier control, lowering training cost and energy.
    • Workflows: IF4 everywhere (weights/activations/gradients), Hadamard/MS‑EDEN in backward, stochastic rounding; per‑block dynamic selection of INT4 or FP4.
    • Assumptions: Stable training across model sizes; ecosystem support for 4‑bit matmuls and optimizer states; improved scale‑factor handling.
  • Broader adaptive block‑scaled formats (IF3, IF6, smaller block sizes)
    • Use case: Tailor memory/performance Pareto points for different domains (vision, speech, recommendation, robotics control) and hardware constraints.
    • Products:
      • IF3 for extreme compression (≈3.5–4.0 bpp) and small activations.
      • IF6 for near‑FP8 quality at lower bandwidth.
      • BS8 variants when hardware supports smaller groups.
    • Assumptions: New decode tables and 4/3, 28/31, 7.5/31 alignment factors supported in hardware; careful error analysis per domain.
  • Memory‑bound system optimization (systems/infra, energy sector)
    • Use case: System‑wide throughput gains by cutting HBM and interconnect traffic with IF4 activations and KV‑cache, not just weights.
    • Products/workflows:
      • IF4‑aware memory hierarchies with on‑the‑fly decode in SRAM scratchpads.
      • Network transport of IF4 tensors between nodes; IF4 in collective ops.
    • Assumptions: End‑to‑end toolchain (drivers, NCCL‑like libs) can carry non‑standard numeric payloads; error propagation acceptable.
  • On‑device assistants and robotics (mobile, wearables, autonomous systems)
    • Use case: Run larger LLMs or multi‑modal models on‑device within power/thermal limits; enable private, low‑latency inference.
    • Products: Phones and wearables with IF4‑capable NPUs; robots/UAVs with IF4 MACs to host on‑board LLM planners.
    • Assumptions: Silicon updates; thermal design and memory sized for increased model capacity.
  • Healthcare and finance sector deployments
    • Use case: Privacy‑preserving, on‑prem or edge inference with reduced TCO; greater model capacity per server rack.
    • Products/workflows: IF4‑enabled inference appliances with validated accuracy for clinical/financial NLP; compliance‑ready deployment kits.
    • Assumptions: Regulatory approval; rigorous validation to ensure no clinically/financially significant degradation.
  • Compiler and graph‑level support (software tooling)
    • Use case: Automatic per‑block INT4/FP4 selection during compilation and kernel fusion; schedule‑aware placement of dequantization.
    • Products: Passes in TVM, Triton, XLA, MLIR for IF4; profilers that visualize per‑block error and bandwidth savings.
    • Assumptions: Profiling hooks and IR extensions for mixed block encodings; calibration datasets integrated into compilation.
  • Standards and interoperability (policy/consortia)
    • Use case: Open specification for adaptive block‑scaled formats (metadata, scale encoding, sign‑bit repurposing), enabling multi‑vendor compatibility.
    • Products: Format specs, test vectors, conformance suites; model interchange (ONNX) extensions for IF4/IF3/IF6.
    • Assumptions: Agreement on encoding (e.g., use of E4M3 scale sign‑bit), patent/licensing clarity.
  • Sustainability metrics and incentives (policy/operations)
    • Use case: Track and reward energy savings per token via low‑bit formats; integrate into green SLAs and carbon accounting.
    • Products: Reporting modules in serving stacks that attribute energy reductions to IF4; procurement guidelines favoring low‑bit‑capable hardware.
    • Assumptions: Auditable measurement pipelines; consensus on metrics.
  • New SKUs and services (cloud & enterprise IT)
    • Use case: “IF4‑optimized” inference/training instances and appliances promising higher capacity per GPU and lower cost/token.
    • Products: Cloud SKUs with IF4 kernels pre‑installed; enterprise accelerators with IF4 MACs.
    • Assumptions: Demand for W4A4 workloads; software ecosystem maturity.

Notes on feasibility and dependencies common across applications:

  • Hardware support: Full speedups require native IF4 decode/MAC; without it, benefits are primarily memory/bandwidth and accuracy, offset by dequantization overhead that must be fused.
  • Software kernels: Efficient per‑block FP4/INT4 selection and 6/7 (or analogous) scaling must be implemented without branching bottlenecks.
  • Training recipe dependencies: Stochastic rounding, block scaling, and transforms (e.g., Hadamard, WUSH) affect where IF4 provides the largest gains.
  • Numerical assumptions: Using the FP8 E4M3 scale sign bit to indicate mode; omission of INT4 value −8 for symmetry; group size of 16; global tensor scales must avoid overflow.
  • Validation: Domain‑specific accuracy checks (especially in safety‑critical sectors) are necessary before production.
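The sign-bit convention in the numerical assumptions above can be sketched at the byte level. The decoder follows the standard FP8 E4M3 layout (4 exponent bits with bias 7, 3 mantissa bits, maximum 448, no infinities). Which sign value denotes INT4 is an assumption made here for illustration, and the function names are hypothetical.

```python
def decode_e4m3_magnitude(bits7):
    """Decode the low 7 bits (4 exponent, 3 mantissa) of an FP8 E4M3 byte."""
    if bits7 == 0x7F:
        return float("nan")                       # the only NaN pattern in E4M3
    exp, man = bits7 >> 3, bits7 & 0x7
    if exp == 0:                                  # subnormal range
        return (man / 8.0) * 2.0 ** -6
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)   # normal values, bias 7

def pack_if4_scale(bits7, use_int4):
    """Place the mode flag in the top bit (assumption: 1 means INT4)."""
    return (0x80 if use_int4 else 0x00) | (bits7 & 0x7F)

def unpack_if4_scale(byte):
    """Split a scale byte into (use_int4, magnitude)."""
    return bool(byte >> 7), decode_e4m3_magnitude(byte & 0x7F)

packed = pack_if4_scale(0x38, use_int4=True)   # 0x38 encodes a magnitude of 1.0
mode, magnitude = unpack_if4_scale(packed)
```

Since block scale factors are magnitudes, the top bit carries no numeric information in NVFP4, which is why it can be repurposed without changing the storage layout.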

Glossary

  • 4/6 (Four Over Six): An adaptive NVFP4 scaling technique that chooses a per-group FP4 maximum of 4 or 6 to reduce quantization error. "Another recently proposed operation for NVFP4 training is Four Over Six (4/6), which reduces NVFP4 quantization error by adaptively scaling each group of FP4 values to a maximum value of either 4 or 6, rather than scaling all groups to the default maximum value of 6."
  • BF16: A 16-bit floating-point format (bfloat16) with 8 exponent bits and 7 fraction bits, often used as a high-precision baseline in training. "We find that IF4 outperforms NVFP4 during pre-training, resulting in training loss closer to a high-precision baseline trained using BF16."
  • Block-scaled format: A quantization scheme where groups of low-precision values share a higher-precision scale factor to extend dynamic range. "Like NVFP4, IF4 is a block-scaled format with a tensor-wide FP32 scale factor $\alpha$ and an FP8 E4M3 scale factor $\Delta_i$ for every 16 values."
  • CMOS: Complementary metal-oxide-semiconductor; a semiconductor process technology used to fabricate digital circuits. "To evaluate the hardware feasibility of IF4, we implement an IF4 multiply accumulate (MAC) block in SystemVerilog and synthesize it in 28nm CMOS technology."
  • Dynamic range: The ratio between the largest and smallest representable magnitudes in a numerical format or quantized tensor. "reducing the quantized tensor's dynamic range by 42.9%."
  • E2M0: A floating-point format layout with 2 exponent bits and 0 mantissa bits (all precision in the exponent). "we additionally propose and analyze IF3, which allows values to be stored as E2M0 floating point values or scaled integers"
  • E2M1: A floating-point format layout with 2 exponent bits and 1 mantissa bit (used for FP4 in this work). "where $M^\text{FP4}$ and $M^\text{FP8}$ are the largest values that can be represented by FP4 E2M1 and FP8 E4M3, 6 and 448 respectively"
  • E4M3: An FP8 format layout with 4 exponent bits and 3 mantissa bits, used for per-group scale factors. "Like NVFP4, we scale groups of 16 values by an FP8 E4M3 scale factor with 4 exponent bits and 3 significand bits."
  • FP4: A 4-bit floating-point format (here, E2M1) used for low-precision training and inference. "For FP4, these are $\pm\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$."
  • FP8: An 8-bit floating-point format (e.g., E4M3) used as a balance between precision and speed. "which can be multiplied twice as fast as 8-bit numbers (i.e. FP8)"
  • Hadamard transform: An orthogonal transform using Hadamard matrices; in quantization it helps reduce outliers and flatten distributions. "Other key operations include the random Hadamard transform which smooths outliers"
  • HBM bandwidth: The data transfer capacity of High Bandwidth Memory, often the limiting factor in accelerator performance. "which is instead dominated by HBM bandwidth, cache behavior, and data movement overheads."
  • IF3: A 3-bit adaptive block-scaled data type that selects per-group between an E2M0 float or scaled int representation. "we additionally propose and analyze IF3, which allows values to be stored as E2M0 floating point values or scaled integers"
  • IF4: An adaptive 4-bit block-scaled data type that chooses FP4 or scaled INT4 for each group of 16 values and encodes the choice in the scale factor's sign bit. "For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values"
  • IF6: A 6-bit adaptive block-scaled data type extending IF4’s idea to 6 bits per value. "and IF6, which leverages the same concept but with 6 bits per value."
  • INT4: A 4-bit signed integer representation used as an alternative to FP4 within groups for lower quantization error on uniform-like inputs. "Adaptive Block-Scaled Data Types offer each group the choice between standard FP4 and an even more uniform distribution of error in the form of scaled INT4 values"
  • Lookup table (LUT): A small memory used for fast value decoding or mapping; here used to decode NVFP4’s non-uniform codebook. "NVFP4 values are generated through a lookup table (LUT), whereas INT4 values are decoded using simple shifter-based logic."
  • MAC (Multiply-Accumulate): A hardware unit that multiplies pairs of values and accumulates the results, central to matrix multiplication. "We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators."
  • Mean squared error (MSE): The average of squared differences between quantized and original values, used to choose per-group representation. "selecting the option with less mean squared quantization error."
  • Mixture-of-Experts (MoE): A neural architecture where multiple expert submodules are selectively activated per input. "we employ a fairly aggressive paradigm where weights and activations are quantized using round-to-nearest quantization in all linear and Mixture-of-Experts~\cite{shazeer_outrageously_2017} layers except the LM head."
  • MS-EDEN: A backward-pass technique that applies transforms (e.g., Hadamard) to reduce gradient quantization bias. "When MS-EDEN~\cite{panferov_quartet_2026} is used in the backward pass, IF4 results in an even larger improvement."
  • MXFP4: A block-scaled FP4 format from the OCP Microscaling (MX) specification that uses groups of 32 values, each sharing an 8-bit power-of-two (E8M0) scale factor; supported on some GPUs. "First, most works use block-scaled formats such as NVFP4 or MXFP4, which scale groups of FP4 values (containing 16 or 32 values respectively) by separately-quantized FP8 scale factors"
  • NVFP4: NVIDIA’s block-scaled FP4 format using per-16-value FP8 E4M3 scales; widely supported in recent GPUs. "NVFP4 has grown increasingly popular as a 4-bit format for quantizing LLMs"
  • NVINT4: A block-scaled INT4 scheme with per-16-value FP8 scales, often favorable on more uniform distributions. "such as after undergoing a Hadamard transformation, NVFP4 can underperform relative to other proposed data types such as NVINT4 (an FP8 scale factor for every 16 INT4 values)"
  • Pareto frontier: The trade-off curve of non-dominated choices balancing competing objectives (here, memory vs. performance). "We replicate prior findings of a pareto frontier between model performance and memory"
  • Perplexity: A language-model evaluation metric; lower perplexity indicates better modeling of test data. "as measured by WikiText-2 and C4 perplexity for Nemotron 3 Nano~\cite{nvidia_nvidia_2025} and Qwen 3.5~\cite{qwen_team_qwen35_2026} across different model sizes"
  • Post-training quantization (PTQ): Quantizing a pre-trained model without retraining to reduce memory/compute cost. "When quantizing existing models with post-training quantization (PTQ), much existing literature focuses on W4A16"
  • Query-key normalization: A stabilization technique for attention by normalizing queries and keys to control scale and variance. "We first evaluate the performance of IF4 by pre-training several 340-million-parameter dense Transformer models with query-key normalization"
  • Range-alignment scaling: Extra scaling applied in hardware to reconcile mixed per-group operand types (FP4 vs. INT4) before block scaling. "For IF4, additional range-alignment scaling is applied depending on the operand types."
  • Round-to-nearest (RTN): A deterministic rounding rule used during quantization that rounds to the closest representable value. "W4A4 round-to-nearest post-training quantization accuracy averaged across five tasks"
  • Stochastic rounding: A probabilistic rounding method that yields unbiased quantization, especially useful for gradients. "stochastic rounding which provides unbiased estimates of gradients"
  • SystemVerilog: A hardware description and verification language used to implement and simulate digital designs. "we implement an IF4 multiply accumulate (MAC) block in SystemVerilog and synthesize it in 28nm CMOS technology."
  • Tensor-wide FP32 scale factor: A single FP32 scaling applied to the entire tensor in block-scaled formats to set overall magnitude. "Like NVFP4, IF4 is a block-scaled format with a tensor-wide FP32 scale factor $\alpha$"
  • UE4M3: The unsigned variant of the E4M3 FP8 format (4 exponent bits, 3 mantissa bits, no sign bit carrying a numeric sign), used as a scale format in some schemes. "IF4 & 16 & UE4M3"
  • W4A4: A regime where both weights and activations are quantized to 4 bits. "In this work, we focus on the much more challenging W4A4 paradigm, in which weights and activations are quantized to 4 bits."
  • W4A4G4: A fully quantized training setting with 4-bit weights, activations, and gradients. "W4A4G4 is a difficult paradigm in which weights, activations, and gradients must all be quantized using 4-bit formats during training."
  • W4A8: A regime with 4-bit weights and 8-bit activations (INT8 or FP8). "or W4A8, in which activations are quantized to INT8 or FP8"
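
The IF4, INT4, FP4, and MSE entries above together describe a per-group selection rule: quantize each group of 16 values both ways and keep the representation with the lower mean squared error. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes a symmetric INT4 grid of magnitudes 0 to 7, keeps the per-group scale in full precision instead of rounding it to E4M3, and omits the tensor-wide FP32 scale and the sign-bit encoding. The names `quantize_to_grid` and `select_if4` are hypothetical.

```python
import numpy as np

# Non-negative magnitudes representable by FP4 E2M1 (listed in the FP4 entry).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Assumption: symmetric INT4 magnitudes 0..7 (the -8 code is left unused).
INT4_GRID = np.arange(8, dtype=np.float64)


def quantize_to_grid(group, grid):
    """Scale max|group| to the grid maximum, round to nearest, and dequantize."""
    amax = np.max(np.abs(group))
    if amax == 0.0:
        return np.zeros_like(group)
    # Assumption: the scale stays in full precision; a real block-scaled
    # format would store it as an FP8 E4M3 value.
    scale = amax / grid[-1]
    scaled = np.abs(group) / scale
    # Nearest grid point per element (exact ties go to the smaller magnitude).
    idx = np.argmin(np.abs(scaled[:, None] - grid[None, :]), axis=1)
    return np.sign(group) * grid[idx] * scale


def select_if4(group):
    """Return (use_int, dequantized_values) for one group of values."""
    group = np.asarray(group, dtype=np.float64)
    as_fp4 = quantize_to_grid(group, FP4_GRID)
    as_int4 = quantize_to_grid(group, INT4_GRID)
    mse_fp4 = np.mean((group - as_fp4) ** 2)
    mse_int4 = np.mean((group - as_int4) ** 2)
    use_int = bool(mse_int4 < mse_fp4)  # this flag would go in the scale's sign bit
    return use_int, (as_int4 if use_int else as_fp4)


if __name__ == "__main__":
    uniform_like = [0, 1, 2, 3, 4, 5, 6, 7, -1, -2, -3, -4, -5, -6, -7, 5]
    float_like = [0.5, 1, 1.5, 2, 3, 4, 6, 0, -0.5, -1, -1.5, -2, -3, -4, -6, 0.5]
    print(select_if4(uniform_like)[0])  # evenly spaced values favor INT4
    print(select_if4(float_like)[0])    # values on the FP4 grid favor FP4
```

In this sketch, evenly spaced inputs land exactly on the integer grid and select INT4, while inputs concentrated near zero with a few large values land on the FP4 grid and select FP4.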
