Introduction
#

Model quantization has evolved far beyond the classic INT8 regime. As large language models (LLMs) surpass hundreds of billions of parameters and vision/diffusion models demand ever-increasing computational budgets, researchers have pushed quantization to its extreme limits. This post provides a deep, technical exploration of extreme and mixed-precision quantization – from 8-bit floating point down to single-bit binary representations – along with the sophisticated algorithms that make such aggressive compression possible without catastrophic quality loss.

We will cover the full landscape: the bit-level mechanics of FP8 and INT4 formats, sub-4-bit methods including binary neural networks and BitNet, state-of-the-art algorithms such as QuIP#, AQLM, and HQQ, mixed-precision strategies driven by sensitivity analysis and reinforcement learning, domain-specific challenges for Transformers, vision models, and diffusion models, and finally the hardware-aware inference optimization perspective.


FP8: 8-Bit Floating Point
#

Why Floating Point at 8 Bits?
#

Traditional INT8 quantization maps floating-point values to 256 uniformly spaced integers. While effective for inference, this uniform spacing poorly represents the heavy-tailed distributions common in neural network weights and activations. FP8 retains the logarithmic spacing of floating-point arithmetic, providing higher precision near zero (where most values cluster) and coarser precision for outliers.

E4M3 and E5M2 Bit Layouts
#

Two FP8 formats have become the common baseline across hardware vendors. They were proposed jointly by NVIDIA, Arm, and Intel and later adopted in the Open Compute Project's OFP8 specification. Both use 8 bits total:

E4M3 Format (1 sign + 4 exponent + 3 mantissa):
+---+----+---+---+---+---+---+---+
| S | E3 | E2| E1| E0| M2| M1| M0|
+---+----+---+---+---+---+---+---+
  1    4 bits exponent   3 bits mantissa

E5M2 Format (1 sign + 5 exponent + 2 mantissa):
+---+----+---+---+---+---+---+---+
| S | E4 | E3| E2| E1| E0| M1| M0|
+---+----+---+---+---+---+---+---+
  1    5 bits exponent     2 bits mantissa

The value of a normal FP8 number follows the standard floating-point formula:

$$\text{value} = (-1)^S \times 2^{(E - \text{bias})} \times (1 + \frac{M}{2^{m}})$$

where \(E\) is the stored exponent, \(\text{bias}\) is the exponent bias, \(M\) is the stored mantissa, and \(m\) is the number of mantissa bits.

| Property | E4M3 | E5M2 |
|---|---|---|
| Exponent bits | 4 | 5 |
| Mantissa bits | 3 | 2 |
| Exponent bias | 7 | 15 |
| Max normal value | 448 | 57344 |
| Min positive normal | \(2^{-6}\) = 0.015625 | \(2^{-14}\) = 6.1e-5 |
| Dynamic range (decades, incl. subnormals) | ~5.4 | ~9.6 |
| Precision (ULP at 1.0) | 0.125 | 0.25 |
| Special values | NaN only (no Inf) | NaN and Inf |

Numerical Examples
#

E4M3 encoding of 3.5:

  1. \(3.5 = 1.75 \times 2^1\)
  2. Sign: \(S = 0\) (positive)
  3. Exponent: \(E = 1 + 7 = 8 = 1000_2\)
  4. Mantissa: \(1.75 = 1 + 0.5 + 0.25 = 1 + \frac{M}{8}\), so \(M = 6 = 110_2\)
  5. Final bit pattern: 0 1000 110 = 0x46

E5M2 encoding of 0.1875:

  1. \(0.1875 = 1.5 \times 2^{-3}\)
  2. Sign: \(S = 0\)
  3. Exponent: \(E = -3 + 15 = 12 = 01100_2\)
  4. Mantissa: \(1.5 = 1 + 0.5 = 1 + \frac{M}{4}\), so \(M = 2 = 10_2\)
  5. Final bit pattern: 0 01100 10 = 0x32
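The two encodings above can be checked mechanically. The following is a minimal sketch of a decoder for the value formula; the function name `decode_fp8` is ours, and NaN/Inf bit patterns are deliberately not handled. It also covers subnormals (stored exponent 0), which drop the implicit leading 1.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode one FP8 bit pattern (hypothetical helper; NaN/Inf not handled)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * 2.0 ** (1 - bias) * (mantissa / 2 ** man_bits)
    return sign * 2.0 ** (exponent - bias) * (1 + mantissa / 2 ** man_bits)


E4M3 = dict(exp_bits=4, man_bits=3, bias=7)
E5M2 = dict(exp_bits=5, man_bits=2, bias=15)

print(decode_fp8(0x46, **E4M3))  # 3.5
print(decode_fp8(0x32, **E5M2))  # 0.1875
print(decode_fp8(0x7E, **E4M3))  # 448.0   (largest E4M3 normal)
print(decode_fp8(0x7B, **E5M2))  # 57344.0 (largest E5M2 normal)
```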

Quantization error comparison at value 1.3:

  • E4M3: rounds to 1.25 (error = 0.05, relative = 3.8%)
  • E5M2: rounds to 1.25 (error = 0.05, relative = 3.8%) – same here, but at value 5.3:
  • E4M3: rounds to 5.5 (error = 0.2, relative = 3.8%); the E4M3 step size in [4, 8) is 0.5, so 5.25 is not representable
  • E5M2: rounds to 5.0 (error = 0.3, relative = 5.7%) – E4M3 wins with more mantissa bits

FP8 Training
#

FP8 training uses both formats in a complementary fashion, as pioneered by NVIDIA’s Transformer Engine:

FP8 Mixed-Format Training Pipeline:

                    FP8 E4M3              FP8 E4M3
  Weights -----> [Forward Pass] -----> Activations
  (E4M3)             |                    |
                     |                    |
                     v                    v
               FP8 E5M2              FP8 E5M2
            [Backward Pass] <----- [Loss Gradient]
            (grad weights)          (grad activations)
                     |
                     v
              FP32 Master Weights (optimizer update)

The key insight: E4M3 for forward pass (higher precision needed for accurate outputs) and E5M2 for backward pass (wider dynamic range needed for gradients, which can span many orders of magnitude).

Per-tensor scaling is critical for FP8 training. Each tensor maintains a scaling factor \(s\) updated via a delayed scaling strategy:

$$s_{t+1} = \frac{\text{maxval}(\text{FP8})}{\max(|X_t|)} \times \alpha$$

where \(\alpha\) is a safety margin (typically 0.9) to prevent overflow, and the scaling factor is applied before casting to FP8:

$$X_{\text{FP8}} = \text{cast\_to\_fp8}(X \times s)$$
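A minimal NumPy sketch of the delayed scaling step follows. It is not Transformer Engine's API: the helper names are ours, and the actual FP8 rounding is left out so that only the scale computation and the saturation behaviour are visible.

```python
import numpy as np

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}


def compute_scale(amax_prev: float, fmt: str = "e4m3", alpha: float = 0.9) -> float:
    """Scale for the next step, derived from the previous step's absolute maximum."""
    return FP8_MAX[fmt] / max(amax_prev, 1e-12) * alpha


def scale_and_clip(x: np.ndarray, scale: float, fmt: str = "e4m3") -> np.ndarray:
    """Scale into FP8 range and saturate (the rounding to FP8 itself is omitted)."""
    return np.clip(x * scale, -FP8_MAX[fmt], FP8_MAX[fmt])


x_prev = np.array([0.5, -10.0, 3.0])        # tensor seen at step t
scale = compute_scale(np.abs(x_prev).max())  # 448 / 10 * 0.9 = 40.32
x_now = np.array([1.0, -11.5, 2.0])          # tensor at step t+1 has grown
scaled = scale_and_clip(x_now, scale)
print(scale, scaled)
```

In this example the new maximum (11.5) grew by more than the safety margin allows, so the largest entry saturates at the format maximum. That is the trade-off of delayed scaling: the scale is cheap to obtain but can lag behind the data.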

NVIDIA’s H100 GPU achieves up to 2x throughput improvement with FP8 Tensor Cores compared to FP16, making FP8 training practical for models with hundreds of billions of parameters.


INT4: 4-Bit Integer Quantization
#

Uniform INT4 Quantization
#

At 4 bits, we have only 16 distinct values. For symmetric quantization:

$$q = \text{clamp}\left(\left\lfloor \frac{x}{s} \right\rceil, -8, 7\right), \quad s = \frac{\max(|x|)}{7}$$

For asymmetric quantization:

$$q = \text{clamp}\left(\left\lfloor \frac{x - z}{s} \right\rceil, 0, 15\right), \quad s = \frac{\max(x) - \min(x)}{15}, \quad z = \min(x)$$

With only 16 levels, the quantization error is significant for per-tensor quantization. This motivates group quantization.
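Both formulas translate directly into NumPy. This is a per-tensor sketch with function names of our choosing; it assumes the input is not constant (so the scale is non-zero) and relies on `np.round`, which rounds halves to even.

```python
import numpy as np


def quantize_symmetric_int4(x):
    """Symmetric INT4: integer codes in [-8, 7], a single scale."""
    s = np.abs(x).max() / 7
    q = np.clip(np.round(x / s), -8, 7).astype(np.int8)
    return q, s


def quantize_asymmetric_int4(x):
    """Asymmetric INT4: integer codes in [0, 15], scale plus offset z = min(x)."""
    z = x.min()
    s = (x.max() - x.min()) / 15
    q = np.clip(np.round((x - z) / s), 0, 15).astype(np.uint8)
    return q, s, z


x = np.array([0.1, 0.5, -0.3, 0.8, 0.2, 0.9])
q_sym, s_sym = quantize_symmetric_int4(x)
q_asym, s_asym, z = quantize_asymmetric_int4(x)

err_sym = np.abs(x - q_sym * s_sym).max()
err_asym = np.abs(x - (q_asym * s_asym + z)).max()
print(f"step sizes: symmetric={s_sym:.4f}, asymmetric={s_asym:.4f}")
print(f"max errors: symmetric={err_sym:.4f}, asymmetric={err_asym:.4f}")
```

For a skewed tensor like this one the asymmetric grid has a smaller step, because it spends no levels on the unused part of the negative range.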

Group Quantization
#

Group quantization divides a weight tensor into small groups of \(g\) consecutive elements, each with its own scale and zero-point:

Weight tensor (1x16):
[0.1, 0.5, -0.3, 0.8, | -0.1, 0.2, 0.9, -0.7, | 0.3, -0.4, 0.6, 0.1, | -0.2, 0.7, -0.5, 0.4]
     Group 0 (g=4)          Group 1 (g=4)           Group 2 (g=4)           Group 3 (g=4)
     s0, z0                  s1, z1                  s2, z2                  s3, z3

The overhead of storing per-group parameters adds bits per weight:

$$\text{effective bits} = 4 + \frac{b_s + b_z}{g}$$

where \(b_s\) and \(b_z\) are the bit-widths of the scale and zero-point. For \(g = 128\) with FP16 scale and zero-point:

$$\text{effective bits} = 4 + \frac{16 + 16}{128} = 4.25 \text{ bits}$$

Common group sizes in practice: 32, 64, 128, 256. Smaller groups improve accuracy but increase overhead.
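As a rough illustration, the sketch below applies asymmetric INT4 quantization per group to the 1x16 example tensor and evaluates the effective-bits formula. The helper names are ours, and the function returns dequantized values directly to keep the example short.

```python
import numpy as np


def group_quantize_int4(w, group_size):
    """Asymmetric INT4 with one (scale, min) pair per group; returns dequantized values."""
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    q = np.clip(np.round((groups - w_min) / scale), 0, 15)
    return (q * scale + w_min).reshape(w.shape)


def effective_bits(bits, group_size, scale_bits=16, zero_bits=16):
    """Bits per weight once per-group metadata is included."""
    return bits + (scale_bits + zero_bits) / group_size


w = np.array([0.1, 0.5, -0.3, 0.8, -0.1, 0.2, 0.9, -0.7,
              0.3, -0.4, 0.6, 0.1, -0.2, 0.7, -0.5, 0.4])

err_grouped = np.abs(w - group_quantize_int4(w, 4)).max()   # g = 4
err_tensor = np.abs(w - group_quantize_int4(w, 16)).max()   # one group = per-tensor
print(err_grouped, err_tensor)
print(effective_bits(4, 128), effective_bits(4, 32))        # 4.25 5.0
```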

NF4: NormalFloat 4-bit
#

QLoRA introduced NF4 (NormalFloat4), an information-theoretically optimal data type for normally distributed weights. The key insight: neural network weights after pretraining are approximately normally distributed with zero mean.

NF4 constructs its 16 quantization levels by computing the quantiles of the standard normal distribution \(\mathcal{N}(0,1)\), ensuring each quantization bin contains equal probability mass:

$$q_i = \Phi^{-1}\left(\frac{2i + 1}{2 \times 16}\right), \quad i = 0, 1, \ldots, 15$$

where \(\Phi^{-1}\) is the inverse cumulative distribution function (probit function) of the standard normal.

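The resulting NF4 quantization levels (normalized to [-1, 1]) are listed below. They differ slightly from the idealized formula above because QLoRA estimates the quantiles separately for the negative and positive halves so that zero is represented exactly: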

NF4 levels (16 values):
[-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
  0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230, 1.0]

Notice the non-uniform spacing: levels are denser near zero where the normal distribution has higher probability density. For normally distributed inputs, this lowers the expected quantization error relative to a uniform grid:

$$\mathbb{E}[|x - Q(x)|^2] = \int_{-\infty}^{\infty} |x - Q(x)|^2 \, \phi(x) \, dx$$

where \(\phi(x)\) is the standard normal PDF. NF4 achieves lower expected error than uniform INT4 for normally distributed data.
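In practice NF4 is a lookup: each block is normalized by its absolute maximum and every value is mapped to the index of the nearest level. The sketch below uses the 16 levels listed above; the function names and the block size of 64 are our assumptions, not the bitsandbytes API.

```python
import numpy as np

NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])


def nf4_quantize(block):
    """Return 4-bit indices plus the per-block absmax used for normalization."""
    absmax = np.abs(block).max()
    normalized = block / absmax
    # Nearest-level search: distance from every value to every level.
    idx = np.abs(normalized[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax


def nf4_dequantize(idx, absmax):
    return NF4_LEVELS[idx] * absmax


rng = np.random.default_rng(0)
block = rng.normal(size=64)          # stand-in for one block of weights
idx, absmax = nf4_quantize(block)
recon = nf4_dequantize(idx, absmax)
print("max abs error:", np.abs(block - recon).max())
```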

QLoRA further applies double quantization – quantizing the FP32 group scales themselves to FP8, reducing the per-parameter overhead:

$$\text{effective bits (NF4 + double quant)} = 4 + \frac{8}{64} + \frac{32}{64 \times 256} \approx 4.127 \text{ bits}$$

GGUF Format and Quant Types
#

The GGUF (GPT-Generated Unified Format) file format, developed by the llama.cpp community, has become the de facto standard for distributing quantized LLMs for CPU and mixed CPU/GPU inference. It supports a wide array of quantization types:

| Quant Type | Bits/Weight | Group Size | Scale Format | Description |
|---|---|---|---|---|
| Q2_K | 2.5625 | 256 (super) / 16 (sub) | FP16 + 4-bit | 2-bit with 4-bit importance-based scales |
| Q3_K_S | 3.4375 | 256 / 16 | FP16 + 4-bit | 3-bit small, fewer high-precision groups |
| Q3_K_M | 3.875 | 256 / 16 | FP16 + 4-bit | 3-bit medium |
| Q3_K_L | 4.125 | 256 / 16 | FP16 + 4-bit | 3-bit large, more high-precision groups |
| Q4_0 | 4.5 | 32 | FP16 | Basic 4-bit, per-group absmax |
| Q4_1 | 5.0 | 32 | FP16 + FP16 | 4-bit with scale + min value |
| Q4_K_S | 4.5 | 256 / 32 | FP16 + 6-bit | 4-bit K-quant small |
| Q4_K_M | 4.85 | 256 / 32 | FP16 + 6-bit | 4-bit K-quant medium, mixed precision |
| Q5_0 | 5.5 | 32 | FP16 | 5-bit per-group |
| Q5_1 | 6.0 | 32 | FP16 + FP16 | 5-bit with min |
| Q5_K_S | 5.5 | 256 / 32 | FP16 + 6-bit | 5-bit K-quant small |
| Q5_K_M | 5.75 | 256 / 32 | FP16 + 6-bit | 5-bit K-quant medium |
| Q6_K | 6.5625 | 256 / 16 | FP16 + 8-bit | 6-bit K-quant |
| Q8_0 | 8.5 | 32 | FP16 | 8-bit per-group |
| IQ1_S | 1.5625 | 256 | FP16 | 1-bit importance-weighted |
| IQ2_XXS | 2.0625 | 256 | FP16 | 2-bit ultra-extreme |
| IQ2_XS | 2.3125 | 256 | FP16 | 2-bit extreme |
| IQ2_S | 2.5 | 256 | FP16 | 2-bit |
| IQ3_XXS | 3.0625 | 256 | FP16 | 3-bit ultra-extreme |
| IQ3_XS | 3.3 | 256 | FP16 | 3-bit extreme |
| IQ4_NL | 4.5 | 32 | FP16 | 4-bit non-linear (NF4-like) |
| IQ4_XS | 4.25 | 256 / 32 | FP16 | 4-bit extreme with super-blocks |

The K-quant variants (e.g., Q4_K_M) use a two-level grouping hierarchy: super-blocks of 256 weights containing sub-blocks of 16 or 32 weights. The super-block stores a shared FP16 scale, while sub-blocks store smaller quantized scales relative to the super-block. This hierarchical approach significantly reduces overhead.
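The bits-per-weight figures follow from simple block accounting. The sketch below reproduces a few of them; the block layouts in the comments are our reading of the table (for Q4_K we assume an FP16 scale and an FP16 min per super-block, plus a 6-bit scale and 6-bit min per sub-block), not a statement about the exact llama.cpp structs.

```python
def bits_per_weight(n_weights: int, weight_bits: int, metadata_bits: int) -> float:
    """Average storage per weight for one block, metadata included."""
    return (n_weights * weight_bits + metadata_bits) / n_weights


# Simple formats: one block of 32 weights.
q4_0 = bits_per_weight(32, 4, 16)        # one FP16 scale
q4_1 = bits_per_weight(32, 4, 16 + 16)   # FP16 scale + FP16 min
q8_0 = bits_per_weight(32, 8, 16)        # one FP16 scale

# K-quants: one super-block of 256 weights (assumed layouts, see lead-in).
q4_k = bits_per_weight(256, 4, 8 * (6 + 6) + 16 + 16)  # 8 sub-blocks of 32
q6_k = bits_per_weight(256, 6, 16 * 8 + 16)            # 16 sub-blocks of 16, 8-bit scales

print(q4_0, q4_1, q8_0, q4_k, q6_k)  # 4.5 5.0 8.5 4.5 6.5625
```

Compare Q4_1 with Q4_K: both store a scale and a minimum, but the hierarchical layout brings the overhead down from 1.0 to 0.5 bits per weight.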

The IQ (Importance Quantization) variants use lattice-based codebooks and importance weighting (derived from the Fisher information or Hessian diagonal) to allocate bits more efficiently to important weights.


Sub-4-Bit Quantization
#

INT3 and INT2
#

At 3 bits (8 levels) and 2 bits (4 levels), naive uniform quantization causes severe accuracy degradation. The key challenge can be visualized:

Weight Distribution vs. Quantization Levels:

Probability
  |
  |     ***
  |    *****
  |   *******
  |  *********
  | ***********
  |*************
  +-----|---|---|---|---> value
       L0  L1  L2  L3    (INT2: only 4 levels!)

Most of the distribution's probability mass falls
between L1 and L2, wasting 2 of the 4 levels on
the rarely-occupied tails.

Successful INT3/INT2 methods rely on several key techniques:

  1. Non-uniform quantization: Place levels according to the weight distribution (as in NF4)
  2. Compensation: Adjust remaining FP16 weights to compensate for quantization error in quantized layers
  3. Learned rounding: Optimize the rounding decisions (up or down) jointly rather than independently
  4. Group quantization with very small groups: Groups of 8-32 to capture local statistics
  5. Mixed-precision residuals: Store a small FP16 or INT8 residual correction term

Binary Neural Networks (BNNs)
#

Binary Neural Networks represent the extreme of quantization: weights (and optionally activations) are constrained to \(\{-1, +1\}\), requiring only 1 bit per value.

Binarization function:

$$w_b = \text{sign}(w) = \begin{cases} +1 & \text{if } w \geq 0 \\ -1 & \text{if } w < 0 \end{cases}$$

The key advantage: matrix multiplications reduce to XNOR and popcount operations:

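$$y = \mathbf{w}^T \mathbf{x} \approx \alpha \cdot \left(2 \cdot \text{popcount}(\text{XNOR}(\mathbf{w}_b, \mathbf{x}_b)) - n\right)$$

where \(\alpha\) is a learned or computed scaling factor and \(n\) is the vector length (each agreeing bit pair contributes +1 and each disagreeing pair contributes -1). The XNOR-popcount operation is extremely fast on modern hardware: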

Binary Matrix Multiply (XNOR + Popcount):

w_b = [+1, -1, +1, +1, -1, +1, -1, -1]  -->  [1,0,1,1,0,1,0,0] = 0xB4
x_b = [+1, +1, -1, +1, +1, -1, +1, -1]  -->  [1,1,0,1,1,0,1,0] = 0xDA

XNOR(0xB4, 0xDA) = 0x91 = [1,0,0,1,0,0,0,1]
popcount(0x91) = 3

dot_product = 2 * popcount - n = 2 * 3 - 8 = -2

Verification: (+1)(+1) + (-1)(+1) + (+1)(-1) + (+1)(+1)
            + (-1)(+1) + (+1)(-1) + (-1)(+1) + (-1)(-1)
            = 1 - 1 - 1 + 1 - 1 - 1 - 1 + 1 = -2  (correct)
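The same computation in Python, using plain integers as bit vectors. This is only a sketch of the arithmetic (helper names are ours); real kernels operate on 64-bit or wider words with hardware popcount instructions.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an integer, first element in the top bit."""
    bits = 0
    for v in values:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits


def binary_dot(w_bits: int, x_bits: int, n: int) -> int:
    """Dot product of two {-1,+1}^n vectors via XNOR + popcount."""
    mask = (1 << n) - 1
    agree = ~(w_bits ^ x_bits) & mask        # XNOR, restricted to n bits
    return 2 * bin(agree).count("1") - n     # agreements minus disagreements


w = [+1, -1, +1, +1, -1, +1, -1, -1]
x = [+1, +1, -1, +1, +1, -1, +1, -1]
w_bits, x_bits = pack_signs(w), pack_signs(x)

print(hex(w_bits), hex(x_bits))              # 0xb4 0xda
result = binary_dot(w_bits, x_bits, len(w))
print(result)                                # -2
```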

Computational savings of BNNs:

| Operation | FP32 | Binary |
|---|---|---|
| Multiply | 32-bit FPU multiply | 1-bit XNOR |
| Accumulate | 32-bit FP add | Integer popcount |
| Memory per weight | 32 bits | 1 bit (32x reduction) |
| Theoretical speedup | 1x | ~58x (on specialized hardware) |

However, BNNs suffer from severe accuracy loss. For ImageNet classification, a binary ResNet-18 typically loses roughly 15-20 percentage points of top-1 accuracy compared to the full-precision version. This limits BNNs to edge applications where extreme efficiency is paramount.

Training BNNs requires the Straight-Through Estimator (STE) because the sign function has zero gradient almost everywhere:

$$\frac{\partial L}{\partial w} \approx \frac{\partial L}{\partial w_b} \cdot \mathbb{1}_{|w| \leq 1}$$

The STE passes the gradient through the sign function as if it were the identity (clipped to [-1, 1]).
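Written out by hand, the forward and backward rules look as follows. This is a NumPy sketch with names of our choosing; in a deep learning framework the same logic would live in a custom autograd function.

```python
import numpy as np


def binarize_forward(w):
    """sign(w) with sign(0) = +1, as in the binarization function above."""
    return np.where(w >= 0, 1.0, -1.0)


def binarize_backward(grad_wb, w):
    """STE: pass the upstream gradient through where |w| <= 1, zero it elsewhere."""
    return grad_wb * (np.abs(w) <= 1.0)


w = np.array([-1.7, -0.4, 0.0, 0.6, 2.3])        # latent full-precision weights
grad_wb = np.array([0.5, -1.0, 0.2, 0.3, -0.8])  # dL/dw_b from the layers above

w_b = binarize_forward(w)
grad_w = binarize_backward(grad_wb, w)
print(w_b)     # [-1. -1.  1.  1.  1.]
print(grad_w)  # gradients at w = -1.7 and w = 2.3 are zeroed
```

The clipping matters: weights that have drifted far outside [-1, 1] stop receiving updates, which keeps the latent weights from growing without bound.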

BitNet b1.58
#

BitNet b1.58 (Microsoft Research, 2024) represents a breakthrough in ternary quantization. Instead of binary \(\{-1, +1\}\), it uses ternary weights \(\{-1, 0, +1\}\), requiring \(\log_2(3) \approx 1.58\) bits per weight.

Quantization function:

$$\tilde{w} = \text{RoundClip}\left(\frac{w}{\gamma + \epsilon}, -1, 1\right)$$

where \(\gamma = \frac{1}{nm}\sum_{i,j}|w_{ij}|\) is the mean absolute value of the weight matrix, and:

$$\text{RoundClip}(x, a, b) = \max(a, \min(b, \lfloor x \rceil))$$

Activation quantization uses absmax quantization to \(b\)-bit integers (typically 8-bit):

$$\tilde{x} = \text{Quant}(x) = \text{clamp}\left(\left\lfloor \frac{x}{Q_b} \times (2^{b-1} - 1) \right\rceil, -(2^{b-1}-1), 2^{b-1}-1\right)$$

where \(Q_b = \|x\|_\infty\).

The linear layer in BitNet b1.58:

$$y = \tilde{W} \tilde{x}, \qquad y_i = \sum_{j} \tilde{w}_{ij} \tilde{x}_j$$

Since \(\tilde{w}_{ij} \in \{-1, 0, +1\}\), each multiply becomes:

  • If \(\tilde{w} = +1\): add \(\tilde{x}\)
  • If \(\tilde{w} = -1\): subtract \(\tilde{x}\)
  • If \(\tilde{w} = 0\): skip (no operation)

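This removes multiplications from the inner loop of the matrix product: the accumulation reduces to integer additions and subtractions, and only the final rescaling by the weight and activation scales still needs a multiply.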
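A NumPy sketch of the quantizers and the add/subtract-only product described above. The function names are ours and this is not the reference BitNet implementation; it only mirrors the formulas in this section, with 8-bit activations.

```python
import numpy as np


def absmean_ternary(w, eps=1e-8):
    """Ternary weights via RoundClip(w / (gamma + eps), -1, 1), gamma = mean |w|."""
    gamma = np.abs(w).mean()
    w_t = np.clip(np.round(w / (gamma + eps)), -1, 1).astype(np.int8)
    return w_t, gamma


def absmax_int8(x):
    """Absmax activation quantization to the integer range [-127, 127]."""
    q_b = np.abs(x).max()
    x_q = np.clip(np.round(x / q_b * 127), -127, 127).astype(np.int32)
    return x_q, q_b


def ternary_matvec(w_t, x_q):
    """Matrix-vector product using only additions and subtractions."""
    added = np.where(w_t == 1, x_q, 0).sum(axis=1)        # +1 weights: add x
    subtracted = np.where(w_t == -1, x_q, 0).sum(axis=1)  # -1 weights: subtract x
    return added - subtracted                             # 0 weights contribute nothing


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16))
x = rng.normal(size=16)

w_t, gamma = absmean_ternary(w)
x_q, q_b = absmax_int8(x)
y_int = ternary_matvec(w_t, x_q)
y = y_int * gamma * q_b / 127    # one rescale per output, after the integer sum
print(y, w @ x)                  # ternary approximation vs. full precision
```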

Energy and performance comparison (from the BitNet b1.58 paper):

Energy per Operation (relative to FP16 multiply-add):

  FP16 Multiply:  |========================| 100%
  FP16 Add:       |=====|                    20%
  INT8 Multiply:  |=======|                  31%
  INT8 Add:       |=|                         4%
  1.58-bit (add): |=|                         4%

Memory Footprint for a 70B model:

  FP16:   |========================================| 140 GB
  INT8:   |====================|                     70 GB
  INT4:   |==========|                               35 GB
  1.58b:  |====|                                     17.5 GB at 2-bit packing, ~13.8 GB at 1.58 bits (fits single GPU!)

BitNet b1.58 key results:

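At the 3B parameter scale, BitNet b1.58 is reported to match the full-precision (FP16) LLaMA LLM baseline on perplexity benchmarks while being:

  • 3.55x smaller in memory footprint
  • 2.71x faster in latency on a single device

At the 70B scale, the paper reports 8.9x higher throughput, obtained with a batch size about 11x larger than the FP16 baseline can fit.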

The zero values in the ternary representation provide implicit sparsity (roughly 1/3 of weights are zero), further reducing computation.


Advanced Quantization Algorithms
#

QuIP and QuIP#
#

QuIP (Quantization with Incoherence Processing) and its successor QuIP# push post-training quantization down to 2 bits with modest quality loss by exploiting the concept of incoherence in weight matrices.

The Incoherence Principle:

Quantization error is minimized when the weight matrix and the Hessian (input correlation matrix) are “incoherent” – meaning they have no concentrated structure. Formally, if a matrix \(W\) has its entries spread uniformly rather than concentrated in a few large values, rounding errors tend to cancel out statistically.

QuIP achieves incoherence by applying random orthogonal rotations:

$$W' = U W V^T$$

where \(U\) and \(V\) are random orthogonal matrices. The quantized version is:

$$\hat{W} = U^T \text{Quantize}(U W V^T) V$$

The rotation spreads outlier values across all entries, making the rotated matrix more amenable to uniform quantization.
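The effect is easy to reproduce on a toy matrix. The sketch below plants a single outlier, quantizes with a plain 4-bit min-max grid with and without rotation, and compares the reconstruction error. It uses dense random orthogonal matrices from a QR decomposition and simple rounding, which is an assumption for illustration; QuIP itself uses structured rotations and a Hessian-aware rounding procedure.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))   # sign fix keeps the distribution uniform


def quantize_uniform(w, bits):
    """Per-tensor min-max uniform quantization, returned in dequantized form."""
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (2 ** bits - 1)
    return np.round((w - lo) / step) * step + lo


rng = np.random.default_rng(0)
n = 64
W = rng.normal(size=(n, n)) * 0.02
W[3, 5] = 1.0                                  # a single large outlier

U, V = random_orthogonal(n, rng), random_orthogonal(n, rng)
W_rot = U @ W @ V.T                            # incoherence processing
W_hat = U.T @ quantize_uniform(W_rot, 4) @ V   # quantize, then rotate back

err_plain = np.linalg.norm(W - quantize_uniform(W, 4))
err_rot = np.linalg.norm(W - W_hat)
print(f"max |W| = {np.abs(W).max():.3f}, max |W_rot| = {np.abs(W_rot).max():.3f}")
print(f"error without rotation = {err_plain:.3f}, with rotation = {err_rot:.3f}")
```

Without rotation the outlier stretches the grid so far that almost all ordinary weights collapse onto one or two levels. After rotation the outlier's energy is spread over all entries and the same 16 levels cover a much narrower range.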

QuIP# improvements:

  1. Kronecker product rotations: Instead of storing full random orthogonal matrices, QuIP# uses the Kronecker product of smaller Hadamard matrices: \(U = H_1 \otimes H_2\). This reduces storage from \(O(n^2)\) to \(O(n)\) and enables fast application via the Fast Walsh-Hadamard Transform in \(O(n \log n)\).

  2. E8 Lattice Quantization: Instead of rounding each scalar independently, QuIP# quantizes vectors of 8 values jointly using the \(E_8\) lattice.

The E8 Lattice:

The \(E_8\) lattice is a mathematical structure in 8-dimensional space with remarkable properties. It gives the densest sphere packing in 8 dimensions and is the best known 8-dimensional lattice quantizer for uniformly distributed inputs.

The \(E_8\) lattice points can be defined as:

$$E_8 = \left\{ x \in \mathbb{Z}^8 \cup \left(\mathbb{Z} + \frac{1}{2}\right)^8 : \sum_{i=1}^{8} x_i \equiv 0 \pmod{2} \right\}$$

That is, coordinates are either all integers or all half-integers, and their sum is even.

E8 Lattice Quantization (simplified 2D analogy):

  Scalar Quantization:          Lattice Quantization:
  Each dimension independent    Joint optimization in 8D

       |  .  |  .  |              .     .     .
  -----+-----+-----+---         .   .   .   .   .
       |  .  |  .  |              .     .     .
  -----+-----+-----+---         .   .   .   .   .
       |  .  |  .  |              .     .     .

  Grid points: N^8             Lattice points: same density as the grid,
  (for N levels per dim)       but min. distance sqrt(2) instead of 1

The lattice quantizer finds the nearest \(E_8\) lattice point to each 8-dimensional weight vector:

$$\hat{w}_{1:8} = \arg\min_{v \in E_8 \cap \mathcal{C}} \|w'_{1:8} - v\|^2$$

where \(\mathcal{C}\) is the codebook subset used for 2-bit encoding. Each 8D lattice point is encoded with \(8 \times 2 = 16\) bits, yielding exactly 2 bits per weight.
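For the infinite lattice (ignoring the finite codebook \(\mathcal{C}\)), the nearest point can be found with the classic two-coset decoder: \(E_8\) is the union of \(D_8\) (integer vectors with even sum) and \(D_8\) shifted by \(\frac{1}{2}\) in every coordinate. The sketch below implements that decoder; the function names are ours, and the restriction to a 2-bit codebook used by QuIP# is not modeled.

```python
import numpy as np


def nearest_d8(x):
    """Nearest point of D8 = {integer vectors with an even coordinate sum}."""
    r = np.round(x)
    if int(r.sum()) % 2 != 0:
        # Parity is wrong: re-round the worst coordinate in the other direction.
        i = np.argmax(np.abs(x - r))
        r[i] += 1.0 if x[i] > r[i] else -1.0
    return r


def nearest_e8(x):
    """Nearest E8 point: the better of the two cosets D8 and D8 + 1/2."""
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b


def in_e8(v):
    """Membership test: all-integer or all-half-integer coordinates, even sum."""
    doubled = np.round(2 * v)
    all_int = np.all(doubled % 2 == 0)
    all_half = np.all(doubled % 2 == 1)
    return bool(np.allclose(2 * v, doubled) and (all_int or all_half)
                and int(round(v.sum())) % 2 == 0)


rng = np.random.default_rng(0)
w = rng.normal(size=8)
v = nearest_e8(w)
print(w.round(3))
print(v, in_e8(v))
```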

QuIP# results: At 2 bits per weight on LLaMA-2 70B, QuIP# achieves a perplexity of approximately 4.15 on WikiText-2, compared to 3.32 for the FP16 baseline – a remarkably small degradation for 8x compression.

AQLM: Additive Quantization for Language Models
#

AQLM applies multi-codebook quantization (a form of additive vector quantization) to LLM weight compression.

Core idea: Instead of quantizing each weight independently, AQLM groups weights into vectors and represents each vector as a sum of entries from multiple codebooks:

$$\hat{w}_{1:d} = \sum_{m=1}^{M} C_m[i_m]$$

where \(C_m \in \mathbb{R}^{K \times d}\) is the \(m\)-th codebook with \(K\) entries, each of dimension \(d\), and \(i_m \in \{0, 1, \ldots, K-1\}\) is the index into codebook \(m\).

AQLM Multi-Codebook Quantization:

Weight vector w = [0.12, -0.34, 0.56, 0.78, -0.91, 0.23, -0.45, 0.67]

Codebook 1:  C1[3] = [0.1,  -0.2,  0.3,  0.5,  -0.6,  0.1,  -0.3,  0.4]
Codebook 2:  C2[7] = [0.02, -0.14, 0.26, 0.28, -0.31, 0.13, -0.15, 0.27]
                     -------------------------------------------------------
Approximation:        [0.12, -0.34, 0.56, 0.78, -0.91, 0.23, -0.45, 0.67]

Stored: indices (3, 7) + codebooks C1, C2 (shared across all vectors)

Bit rate calculation:

For \(M\) codebooks, each with \(K = 2^B\) entries, quantizing vectors of dimension \(d\):

$$\text{bits per weight} = \frac{M \times B}{d} + \text{codebook overhead}$$

For example, with \(M = 2\), \(B = 8\) (256 entries per codebook), \(d = 8\):

$$\text{bits per weight} = \frac{2 \times 8}{8} = 2 \text{ bits}$$

The codebook overhead is amortized across the entire weight matrix and is typically negligible.
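Decoding is just a lookup and a sum. The sketch below rebuilds the example vector from the two codebook entries shown above; the remaining codebook rows are random placeholders, and the helper names are ours rather than the AQLM library's.

```python
import numpy as np


def aqlm_decode(codebooks, indices):
    """Reconstruct one weight vector as the sum of one entry per codebook."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))


def bits_per_weight(num_codebooks, index_bits, dim):
    """Index storage per weight; codebook overhead is ignored."""
    return num_codebooks * index_bits / dim


d, K = 8, 256
rng = np.random.default_rng(0)
C1 = rng.normal(size=(K, d))   # placeholder entries
C2 = rng.normal(size=(K, d))
C1[3] = [0.1, -0.2, 0.3, 0.5, -0.6, 0.1, -0.3, 0.4]
C2[7] = [0.02, -0.14, 0.26, 0.28, -0.31, 0.13, -0.15, 0.27]

w_hat = aqlm_decode([C1, C2], (3, 7))
print(w_hat.round(2))
print(bits_per_weight(num_codebooks=2, index_bits=8, dim=8))  # 2.0
```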

AQLM optimization uses beam search combined with fine-tuning:

  1. Initialize codebooks using k-means on weight vectors
  2. Beam search over index combinations to minimize \(\|W - \hat{W}\|_H^2\) (Hessian-weighted error)
  3. Fine-tune codebook entries end-to-end with a small calibration dataset

AQLM reported state-of-the-art results at 2-bit precision at the time of its publication, outperforming QuIP# on several benchmarks at the same bit budget.

HQQ: Half-Quadratic Quantization
#

HQQ takes a fundamentally different approach to quantization by formulating it as a half-quadratic optimization problem, enabling fast, data-free quantization.

Problem formulation:

Most PTQ methods minimize the layer-wise output error:

$$\min_{\hat{W}} \|WX - \hat{W}X\|^2$$

This requires calibration data \(X\). HQQ instead directly minimizes the weight reconstruction error with a sparsity-promoting penalty:

$$\min_{Q} \|W - Q\|_p^p$$

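where \(\|\cdot\|_p\) is the \(\ell_p\) norm with \(0 < p \leq 1\) (promoting sparse residuals), and \(Q\) is constrained to the quantization grid. In HQQ the grid itself is what gets optimized: the free variables are the per-group quantization parameters (zero-point and scale) that define \(Q\).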

Half-quadratic splitting introduces an auxiliary variable \(Z\):

$$\min_{Q, Z} \|W - Z\|_p^p + \frac{\mu}{2}\|Z - Q\|_2^2$$

This decouples into two tractable sub-problems that are solved alternately:

  1. Z-update (proximal operator of the \(\ell_p\) penalty, applied to the residual \(W - Q^{(k)}\)): has a closed-form solution for \(p = 1\) (soft-thresholding) and in the limiting case \(p \to 0\) (hard-thresholding); for intermediate \(p\) a generalized soft-thresholding operator is used
$$Z^{(k+1)} = W - \text{prox}_{\frac{1}{\mu}\|\cdot\|_p^p}\left(W - Q^{(k)}\right)$$
  2. Q-update (nearest quantization level): simple rounding
$$Q^{(k+1)} = \text{Quantize}(Z^{(k+1)})$$

HQQ Iteration:

Step 0:  W = [0.12, -0.87, 0.34, 0.93, -0.21, 0.78, -0.56, 0.45]

Step 1 (Z-update): Apply proximal operator (soft-thresholding)
         Z = [0.10, -0.85, 0.32, 0.91, -0.19, 0.76, -0.54, 0.43]

Step 2 (Q-update): Round to nearest INT4 grid point
         Q = [0.13, -0.87, 0.33, 0.93, -0.20, 0.73, -0.53, 0.40]

Repeat steps 1-2 until convergence (typically 10-20 iterations)

HQQ advantages:

  • No calibration data needed: Works directly on weights, no forward passes required
  • Extremely fast: Quantizing a 70B model takes minutes, not hours
  • Strong quality: Competitive with GPTQ and AWQ at INT4, and superior at INT3/INT2
  • Simple implementation: No Hessian computation, no matrix decomposition

Mixed-Precision Quantization
#

Mixed-precision quantization assigns different bit-widths to different layers (or even different channels/heads) based on their sensitivity to quantization. The insight is simple: not all layers are equally sensitive. Some layers can tolerate 2-bit quantization with minimal accuracy loss, while others require 8 bits.

Layer Sensitivity Analysis
#

The most straightforward approach measures each layer’s sensitivity independently:

Perturbation-based sensitivity:

For each layer \(l\), quantize it to \(b\) bits while keeping all other layers at full precision, and measure the change in task loss:

$$\Delta L_l(b) = L(\theta_1, \ldots, \theta_l^{(b)}, \ldots, \theta_N) - L(\theta_1, \ldots, \theta_N)$$

Sensitivity Profile of a Typical LLM:

Sensitivity
  |
  |##                                                    ##
  |##                                                    ##
  |###                                                  ###
  |###                                                  ###
  |####                                                ####
  |####              ##                ##              ####
  |#####            ####              ####            #####
  |######          ######            ######          ######
  |########      ########          ########        ########
  |############################################################
  +-------------------------------------------------------------> Layer
   0  2  4  6  8  10  12  14  16  18  20  22  24  26  28  30
                      First & Last layers: HIGH sensitivity
                      Middle layers: LOW sensitivity

This U-shaped sensitivity curve is commonly observed across architectures, although the exact profile depends on the model and the quantization method. The first few layers (embedding projection, early attention) and the last few layers (final attention, output projection) are most sensitive, while middle layers are more robust to quantization.

Hessian-based sensitivity (second-order):

The sensitivity can be estimated more efficiently using the Hessian:

$$\Delta L_l \approx \frac{1}{2} \delta_l^T H_l \delta_l = \frac{1}{2} \text{tr}(\delta_l \delta_l^T H_l)$$

where \(\delta_l = \theta_l - \theta_l^{(b)}\) is the quantization perturbation and \(H_l\) is the Hessian of the loss with respect to layer \(l\) parameters. The trace of the Hessian (or its top eigenvalue) serves as a sensitivity metric.

HAWQ: Hessian AWare Quantization
#

HAWQ (and its successors HAWQ-V2, HAWQ-V3) use Hessian information to automatically determine per-layer bit-widths.

HAWQ-V1 uses the top eigenvalue of the per-layer Hessian:

$$\Omega_l = \lambda_{\max}(H_l)$$

Layers with larger \(\Omega_l\) receive more bits. The bit-width assignment is formulated as a constrained optimization:

$$\min_{\{b_l\}} \sum_{l=1}^{L} \Omega_l \cdot \mathbb{E}[\|\delta_l(b_l)\|^2] \quad \text{s.t.} \quad \sum_{l=1}^{L} n_l \cdot b_l \leq B_{\text{total}}$$

where \(n_l\) is the number of parameters in layer \(l\), \(b_l \in \{2, 4, 8\}\) is the bit-width, and \(B_{\text{total}}\) is the total bit budget.

HAWQ-V2 improves by using the average Hessian trace instead of the top eigenvalue:

$$\bar{\Omega}_l = \frac{1}{n_l} \text{tr}(H_l)$$

This is more robust and cheaper to compute (via Hutchinson’s stochastic trace estimator):

$$\text{tr}(H_l) \approx \frac{1}{T} \sum_{t=1}^{T} z_t^T H_l z_t$$

where \(z_t\) are random Rademacher vectors (\(\pm 1\) with equal probability).
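The estimator only needs Hessian-vector products, never the Hessian itself. In the sketch below an explicit symmetric matrix stands in for the Hessian so the estimate can be compared with the exact trace; in a real model `matvec` would be an autodiff Hessian-vector product. The function name and sample counts are our choices.

```python
import numpy as np


def hutchinson_trace(matvec, n, num_samples, rng):
    """Estimate tr(H) from products H @ z with random Rademacher vectors z."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=n)
        total += z @ matvec(z)
    return total / num_samples


rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H = A @ A.T                       # symmetric PSD stand-in for a layer Hessian

estimate = hutchinson_trace(lambda v: H @ v, n=50, num_samples=2000, rng=rng)
exact = np.trace(H)
print(f"estimate = {estimate:.1f}, exact = {exact:.1f}")
```

With Rademacher vectors the diagonal part of \(H\) is captured exactly in every sample; all of the estimator's variance comes from the off-diagonal entries.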

HAWQ-V3 extends to integer-only quantization with mixed INT4/INT8 and hardware-aware latency constraints:

$$\min_{\{b_l\}} \sum_{l=1}^{L} \bar{\Omega}_l \cdot \mathbb{E}[\|\delta_l(b_l)\|^2] \quad \text{s.t.} \quad \text{LAT}(\{b_l\}) \leq T_{\text{target}}$$

where \(\text{LAT}(\cdot)\) is the measured latency on target hardware.

HAQ: Hardware-Aware Quantization with Reinforcement Learning
#

HAQ frames mixed-precision quantization as a sequential decision problem solved by reinforcement learning.

State space: For each layer \(l\), the state encodes:

  • Layer index, type (Conv, FC, Attention, etc.)
  • Input/output channels, kernel size
  • Number of parameters
  • Computational cost (FLOPs)
  • Current model size and latency

Action space: Choose a bit-width \(b_l \in \{1, 2, 3, 4, 5, 6, 7, 8\}\) for layer \(l\).

Reward: After all layers are assigned, the reward is:

$$R = -\Delta \text{Accuracy} \quad \text{s.t.} \quad \text{Model size} \leq S_{\text{target}} \text{ or } \text{Latency} \leq T_{\text{target}}$$

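A simple way to enforce the constraint is to give a large negative reward when it is violated. The original HAQ paper instead limits the action space: if the assigned bit-widths exceed the budget, they are decreased layer by layer until the budget is met.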

HAQ Reinforcement Learning Loop:

  RL Agent (DDPG)
       |
       |  action: bit-width for layer l
       v
  [Layer 0] --> [Layer 1] --> ... --> [Layer L-1]
       |              |                     |
       | state        | state               | state
       v              v                     v
  (layer info,   (layer info,          (layer info,
   remaining      remaining             remaining
   budget)        budget)               budget)
                                            |
                                            v
                                      Evaluate accuracy
                                            |
                                            v
                                        Reward R

HAQ uses DDPG (Deep Deterministic Policy Gradient), a continuous-action RL algorithm, where the continuous action is mapped to discrete bit-widths via rounding. The agent is trained on a proxy task (e.g., a few hundred calibration samples) and generalizes well.

Key HAQ findings:

  1. On MobileNet-V2, HAQ achieves 2x compression with only 0.3% accuracy drop
  2. Depthwise separable convolutions are assigned higher bit-widths (more sensitive)
  3. The RL agent discovers hardware-specific patterns: on accelerators with efficient INT8 units, it prefers INT8 over INT4 even when INT4 would fit the size budget

Transformer and LLM-Specific Challenges
#

Activation Outliers
#

Transformers exhibit persistent activation outliers – individual features with magnitudes 10-100x larger than the rest. These outliers appear in specific hidden dimensions consistently across most tokens and layers (discovered by Dettmers et al. in the “LLM.int8()” paper).

Activation magnitude across hidden dimensions (typical LLM):

Magnitude
 100 |                    *
     |                    *
  50 |                    *
     |                    *
  10 |  * * **  *  *  *   *  * *  ** *  *  *  ** *
   5 | ** ****  ** ** ** * * ** ** ** ** ** *  ** **
   1 |*****************************************************
     +-----------------------------------------------------> Hidden dim
                          ^
                    Outlier channel(s)

These outliers cause catastrophic quantization error if quantized uniformly. Solutions include:

  1. LLM.int8(): Mixed INT8/FP16 decomposition – outlier dimensions stay in FP16
  2. SmoothQuant: Migrate quantization difficulty from activations to weights via a mathematically equivalent scaling transform
  3. Rotation-based methods: Apply Hadamard rotation to spread outliers (as in QuIP#)
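The SmoothQuant transform in item 2 is small enough to show in full. The sketch below uses the per-channel smoothing factor \(s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}\) from the SmoothQuant paper; the function name, shapes, and the planted outlier channel are illustrative assumptions.

```python
import numpy as np


def smooth(X, W, alpha=0.5):
    """Move activation outliers into the weights without changing X @ W.

    X: (tokens, in_features), W: (in_features, out_features).
    """
    act_max = np.abs(X).max(axis=0)   # per input channel, over tokens
    w_max = np.abs(W).max(axis=1)     # per input channel, over outputs
    s = act_max ** alpha / w_max ** (1 - alpha)
    return X / s, W * s[:, None], s


rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X[:, 2] *= 50                          # one persistent outlier channel
W = rng.normal(size=(8, 4))

X_s, W_s, s = smooth(X, W)
print("output unchanged:", np.allclose(X @ W, X_s @ W_s))
print(f"max |X|: {np.abs(X).max():.1f} -> {np.abs(X_s).max():.1f}")
print(f"max |W|: {np.abs(W).max():.1f} -> {np.abs(W_s).max():.1f}")
```

The activation range shrinks considerably while the weight range grows only moderately. That is the intended trade: weights are static and easier to quantize than activations.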

KV-Cache Quantization
#

The Key-Value (KV) cache is a major memory bottleneck during autoregressive LLM inference. For each token generated, the KV cache grows by:

$$\Delta_{\text{KV}} = 2 \times L \times H \times d_h \times b$$

where \(L\) is the number of layers, \(H\) is the number of KV heads (which may differ from query heads in GQA), \(d_h\) is the head dimension, and \(b\) is the bytes per element.

Total KV-cache memory for a sequence of length \(n\):

$$M_{\text{KV}} = 2 \times L \times H \times d_h \times n \times b$$

Concrete example – LLaMA-2 70B with 32K context:

| Parameter | Value |
|---|---|
| Layers (\(L\)) | 80 |
| KV heads (\(H\), GQA) | 8 |
| Head dimension (\(d_h\)) | 128 |
| Sequence length (\(n\)) | 32768 |

$$M_{\text{KV}}^{\text{FP16}} = 2 \times 80 \times 8 \times 128 \times 32768 \times 2 \approx 10.74 \text{ GB}$$

$$M_{\text{KV}}^{\text{INT4}} = 2 \times 80 \times 8 \times 128 \times 32768 \times 0.5 \approx 2.68 \text{ GB}$$

$$M_{\text{KV}}^{\text{INT2}} = 2 \times 80 \times 8 \times 128 \times 32768 \times 0.25 \approx 1.34 \text{ GB}$$
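The formula is easy to wrap in a helper for other configurations. This sketch counts only the cached elements for a single sequence; the scales and zero-points that quantized formats also store are ignored, so the low-bit numbers are slightly optimistic.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    """KV-cache size for one sequence: keys + values across all layers."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# LLaMA-2 70B style configuration from the table above.
cfg = dict(layers=80, kv_heads=8, head_dim=128, seq_len=32768)

for name, bytes_per_elem in [("FP16", 2), ("INT4", 0.5), ("INT2", 0.25)]:
    size = kv_cache_bytes(bytes_per_elem=bytes_per_elem, **cfg)
    print(f"{name}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```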

KV-cache quantization approaches:

| Method | Bits | Key Insight | Quality |
|---|---|---|---|
| KIVI | K:2, V:2 | Per-channel K, per-token V quantization | ~0.1 PPL increase |
| KVQuant | 2-4 | Sensitivity-aware, non-uniform | < 0.1 PPL increase |
| Gear | 2-4 | Low-rank + sparse residual | Minimal loss |
| CacheQuant | 4 | Outlier-aware dynamic quantization | < 0.05 PPL increase |

A key asymmetry: Keys and Values have different quantization sensitivities. Keys participate in the softmax attention computation where small errors can shift probability mass significantly, while Values are linearly combined. However, Keys tend to have more structured distributions (amenable to per-channel quantization), while Values have more per-token variation.

Attention Score Quantization
#

The attention mechanism involves:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Quantizing the intermediate attention scores (\(QK^T\)) and the post-softmax probabilities is challenging because:

  1. Pre-softmax scores can have large dynamic range across heads and positions
  2. Post-softmax probabilities are in \([0, 1]\) with a heavy-tailed distribution (most values near 0, a few near 1)
  3. Causal masking introduces discontinuities (negative infinity values)

Effective strategies:

  • Quantize \(Q\) and \(K\) to INT8 with per-head scaling, compute \(QK^T\) in INT32, then dequantize before softmax
  • Keep softmax computation in FP16/FP32 (numerically sensitive)
  • Quantize the attention output (\(\text{softmax} \times V\)) to INT8

Quantized Attention Computation:

  Q (INT8) x K^T (INT8) -> S (INT32) -> dequant -> S (FP16)
                                                      |
                                                   softmax (FP16)
                                                      |
                                                   P (FP16) -> quant -> P (INT8)
                                                                           |
                                              P (INT8) x V (INT8) -> O (INT32)
                                                                           |
                                                                   dequant -> O (FP16)
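The pipeline can be emulated in NumPy to see how much error it introduces. This is a single-head sketch with per-tensor scales; quantizing the probabilities \(P\) with a fixed scale of 1/127 so that the second product accumulates in integers is our assumption, and all names are illustrative.

```python
import numpy as np


def quant_int8(x):
    """Symmetric per-tensor INT8 quantization; returns codes and scale."""
    scale = np.abs(x).max() / 127
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale


def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def quantized_attention(Q, K, V):
    d_k = Q.shape[-1]
    q_i8, sq = quant_int8(Q)
    k_i8, sk = quant_int8(K)
    v_i8, sv = quant_int8(V)
    # INT8 x INT8 product accumulated in INT32, then dequantized.
    s_i32 = q_i8.astype(np.int32) @ k_i8.astype(np.int32).T
    scores = s_i32 * (sq * sk) / np.sqrt(d_k)
    P = softmax(scores)                        # kept in floating point
    # P lies in [0, 1], so a fixed scale of 1/127 is used (assumption).
    p_q = np.round(P * 127).astype(np.int32)
    o_i32 = p_q @ v_i8.astype(np.int32)        # integer accumulation
    return o_i32 * (sv / 127)                  # dequantize the output


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 16)) for _ in range(3))
reference = softmax(Q @ K.T / np.sqrt(16)) @ V
output = quantized_attention(Q, K, V)
print("max abs error:", np.abs(output - reference).max())
```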

Vision Transformer Quantization
#

Vision Transformers (ViTs) present distinct quantization challenges compared to language models:

ViT-Specific Challenges
#

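  1. Post-LayerNorm activations: The activations coming out of LayerNorm in ViTs show strong inter-channel variation, so a single per-tensor scale fits them poorly. Most ViTs, like LLMs, use a pre-norm block layout; the difficulty lies in the distribution of the normalized activations rather than in where the norm is placed.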

  2. Softmax attention bottleneck: ViTs process all spatial tokens simultaneously (no causal mask), leading to attention maps with very high entropy. Small quantization errors in attention probabilities can shift focus to wrong spatial regions.

  3. Patch embedding sensitivity: The initial patch embedding layer projects raw pixel values to token representations. Quantization errors here propagate through the entire network.

  4. Class token dependence: Classification ViTs rely on a single [CLS] token, making the network especially sensitive to quantization error that affects this token’s representation.

Quantization strategies for ViTs:

| Strategy | Description | Typical Accuracy Impact |
|---|---|---|
| PTQ4ViT | Twin uniform quantization for softmax, Hessian-guided | -0.5% at W4A4 |
| FQ-ViT | Power-of-two factor for LayerNorm, log2 quantizer for softmax | -0.3% at W4A4 |
| RepQ-ViT | Reparameterize LayerNorm and softmax to quantization-friendly forms | -0.5% at W4A4 |
| I-ViT | Integer-only ViT with Shiftmax and ShiftGELU | -0.2% at W8A8 |
| NoisyQuant | Add fixed noise before quantization to break outlier structure | -0.4% at W8A8 |

Log2 quantizer for post-softmax values:

Since attention probabilities follow a roughly log-normal distribution after softmax, a log-scale quantizer is more appropriate:

$$q = \text{clamp}\left(\lfloor -\log_2(p) \rceil, 0, 2^b - 1\right)$$

$$\hat{p} = 2^{-q}$$

This places more quantization levels near zero (where most probabilities lie) and fewer near one.
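A direct NumPy transcription of the two formulas is shown below. The function names are ours, and the guard against \(p = 0\) is an added assumption: such inputs are simply mapped to the last level.

```python
import numpy as np

def log2_quantize(p, bits):
    """q = clamp(round(-log2(p)), 0, 2^bits - 1)."""
    qmax = 2 ** bits - 1
    # Guard against log2(0): anything this small is clamped to qmax anyway.
    p_safe = np.maximum(p, 2.0 ** -(qmax + 1))
    q = np.clip(np.round(-np.log2(p_safe)), 0, qmax)
    return q.astype(np.int32)

def log2_dequantize(q):
    """p_hat = 2^(-q)."""
    return np.exp2(-q.astype(np.float32))

p = np.array([1.0, 0.5, 0.25, 0.1, 0.01, 0.0], dtype=np.float32)
q = log2_quantize(p, bits=4)
p_hat = log2_dequantize(q)
print(q.tolist())      # [0, 1, 2, 3, 7, 15]
print(p_hat.tolist())
```

Each reconstructed value is a power of two, so multiplying by \(\hat{p}\) can be implemented as a bit shift in integer pipelines.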


Diffusion Model Quantization
#

Diffusion models (DDPM, Stable Diffusion, DALL-E, etc.) introduce unique quantization challenges due to their iterative denoising process.

Time-Step Dependent Distributions
#

The core challenge: diffusion models are evaluated at many different noise levels (time steps), and the activation distributions change dramatically across time steps.

Activation distribution at different time steps:

t = 0 (clean):     t = 500 (medium):     t = 1000 (noisy):
    ***                 ****                    *****
   *****               ******                 ********
  *******             ********               **********
 *********           **********             ************
  narrow,             moderate,               wide,
  sharp peak          broader                very spread out

A single set of quantization parameters (scale, zero-point) cannot optimally handle all time steps. Solutions include:

  1. Time-step aware quantization (TDQ): Maintain separate quantization parameters for different time-step ranges
  2. Temporal information-aware quantization: Use the time-step embedding to dynamically adjust quantization parameters
  3. PTQ4DM: Calibrate quantization parameters on a representative set of time steps
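The first option, separate parameters per time-step range, can be sketched as follows. This is a simplified illustration and not the procedure of any specific paper: the grouping rule, the max-based calibration, and the synthetic activations whose spread grows with the time step are all assumptions.

```python
import numpy as np

def calibrate_scales_by_timestep(activations, timesteps, num_groups, total_steps, bits=8):
    """Return one symmetric scale per time-step group.

    activations: list of arrays, one per calibration sample
    timesteps:   matching list of integer time steps in [0, total_steps)
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.zeros(num_groups, dtype=np.float32)
    for act, t in zip(activations, timesteps):
        g = min(t * num_groups // total_steps, num_groups - 1)
        max_abs[g] = max(max_abs[g], float(np.abs(act).max()))
    return np.maximum(max_abs, 1e-8) / qmax

def fake_quantize(x, scale, bits=8):
    """Quantize and immediately dequantize, to measure the rounding error."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
total_steps = 1000
timesteps = list(range(0, total_steps, 50))
# Hypothetical trend: the activation spread grows with the noise level.
activations = [rng.normal(scale=0.5 + 3.0 * t / total_steps, size=4096) for t in timesteps]

scales = calibrate_scales_by_timestep(activations, timesteps, num_groups=4, total_steps=total_steps)
global_scale = max(float(np.abs(a).max()) for a in activations) / 127

x = activations[0]  # a low-noise sample (t = 0)
mse_grouped = float(np.mean((x - fake_quantize(x, scales[0])) ** 2))
mse_global = float(np.mean((x - fake_quantize(x, global_scale)) ** 2))
print(f"per-group scale MSE: {mse_grouped:.2e}, single scale MSE: {mse_global:.2e}")
```

In this toy setup the single global scale is dictated by the noisiest steps, so the low-noise steps are quantized on a grid that is coarser than they need.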

Diffusion-Specific Methods
#

| Method | Approach | Result |
|---|---|---|
| Q-Diffusion | Time-step aware PTQ, shortcut-splitting | W4A8 with < 0.5 FID increase |
| PTQD | Time-step grouping, correlation-aware | W4A8 competitive with FP32 |
| TDQ | Dedicated scales per time-step group | W8A8 near-lossless |
| EfficientDM | QAT with quantization-aware low-rank adaptation | W4A4 with minor FID increase |

Error accumulation is a critical issue: in diffusion models, the output of step \(t\) becomes the input to step \(t-1\). Quantization errors accumulate across the 20-50+ denoising steps:

$$\epsilon_{\text{total}} \approx \sum_{t=T}^{1} \epsilon_t \cdot \prod_{s=1}^{t-1} (1 + \alpha_s)$$

where \(\epsilon_t\) is the per-step quantization error and \(\alpha_s\) captures error amplification. This makes diffusion models more sensitive to quantization than single-pass models.
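To get a feel for the formula, the short function below evaluates it for given per-step errors and amplification factors. The numbers in the example (50 steps, a constant error of 0.01, a constant amplification of 0.02) are made up for illustration and are not measurements.

```python
def accumulated_error(step_errors, amplification):
    """Evaluate sum_t eps_t * prod_{s=1}^{t-1} (1 + alpha_s).

    step_errors[t-1] holds eps_t and amplification[s-1] holds alpha_s,
    for t, s = 1..T.
    """
    total = 0.0
    growth = 1.0  # running product; the empty product for t = 1 is 1
    for eps_t, alpha_t in zip(step_errors, amplification):
        total += eps_t * growth
        growth *= 1.0 + alpha_t
    return total

T = 50
no_amplification = accumulated_error([0.01] * T, [0.0] * T)
with_amplification = accumulated_error([0.01] * T, [0.02] * T)
print(f"alpha = 0:    {no_amplification:.3f}")
print(f"alpha = 0.02: {with_amplification:.3f}")
```

Even a small constant amplification factor makes the total noticeably larger than the plain sum of per-step errors, which is the effect the formula is meant to capture.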

Practical recommendation for Stable Diffusion:

  • UNet: W8A8 is safe; W4A8 is achievable with careful calibration; W4A4 requires QAT
  • VAE decoder: Keep at FP16 (highly sensitive, runs only once)
  • Text encoder (CLIP): W8A8 is typically safe
  • Time-step embedding MLP: Keep at higher precision (FP16 or INT8)

Inference Optimization and the Roofline Model
#

The Roofline Model for Quantized Inference
#

Understanding when quantization actually speeds up inference requires the roofline model, which characterizes computation as either compute-bound or memory-bound.

Arithmetic intensity (operational intensity):

$$I = \frac{\text{FLOPs}}{\text{Bytes transferred}}$$

The roofline model defines achievable performance as:

$$\text{Performance} = \min\left(\text{Peak FLOPS}, \quad I \times \text{Memory Bandwidth}\right)$$

Roofline Model with Quantization:

Performance
(TOPS)        Peak INT4
  |          /   Peak INT8
  |         /  /   Peak FP16
  |        / /  /
  |       //  /
  |      /  /
  |     / /     <-- Compute-bound region
  |    //        (quantization helps with peak TOPS)
  |   //
  |  /  <-- Memory-bound region
  | /    (quantization helps with bandwidth)
  |/
  +-----------------------------------------> Arithmetic Intensity
                                              (FLOPs/Byte)
          ^           ^
          |           |
     LLM decode   LLM prefill / CNN batch
     (batch=1)     inference

LLM inference phases:

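  1. Prefill (prompt processing): High arithmetic intensity (large matrix multiplications with many tokens). Often compute-bound. Quantization helps by raising peak throughput, but only when activations are quantized too and the hardware has matching low-precision units: on GPUs that expose INT4 Tensor Cores (for example NVIDIA Ampere), peak INT4 throughput is 2x that of INT8. Weight-only schemes still run the multiply in FP16.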

  2. Decode (token generation): Low arithmetic intensity (matrix-vector multiply, batch size = 1). Almost always memory-bound. Quantization helps primarily by reducing memory bandwidth requirements.
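The two regimes can be checked with a few lines of arithmetic. The sketch below computes the arithmetic intensity of a single linear layer and applies the roofline formula. The layer size and the hardware figures (300 TFLOPS peak, 1 TB/s bandwidth) are hypothetical round numbers, not the specification of any particular GPU.

```python
def gemm_arithmetic_intensity(tokens, d_in, d_out, weight_bits, act_bytes=2):
    """FLOPs per byte for Y = X @ W with X: (tokens, d_in) and W: (d_in, d_out)."""
    flops = 2 * tokens * d_in * d_out
    weight_bytes = d_in * d_out * weight_bits / 8
    activation_bytes = act_bytes * tokens * (d_in + d_out)  # read X, write Y
    return flops / (weight_bytes + activation_bytes)

def attainable_tflops(intensity, peak_tflops, bandwidth_tb_s):
    """Roofline: min(peak compute, intensity * memory bandwidth)."""
    return min(peak_tflops, intensity * bandwidth_tb_s)

PEAK_TFLOPS = 300.0    # hypothetical accelerator
BANDWIDTH_TB_S = 1.0   # hypothetical accelerator

decode_fp16 = gemm_arithmetic_intensity(tokens=1, d_in=4096, d_out=4096, weight_bits=16)
decode_int4 = gemm_arithmetic_intensity(tokens=1, d_in=4096, d_out=4096, weight_bits=4)
prefill_fp16 = gemm_arithmetic_intensity(tokens=2048, d_in=4096, d_out=4096, weight_bits=16)

for name, intensity in [("decode FP16", decode_fp16),
                        ("decode INT4", decode_int4),
                        ("prefill FP16", prefill_fp16)]:
    perf = attainable_tflops(intensity, PEAK_TFLOPS, BANDWIDTH_TB_S)
    print(f"{name:12s} I = {intensity:8.1f} FLOPs/byte -> {perf:6.1f} TFLOPS attainable")
```

Under these assumptions decode sits far to the left of the ridge point and gains almost exactly the bandwidth saving from INT4 weights, while prefill already hits the compute roof.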

For the decode phase, the speedup from quantization is approximately:

$$\text{Speedup}_{\text{decode}} \approx \frac{b_{\text{original}}}{b_{\text{quantized}}} \times \frac{\text{BW}_{\text{quantized}}}{\text{BW}_{\text{original}}}$$

For INT4 vs FP16 on the same hardware (bandwidth ratio = 1):

$$\text{Speedup}_{\text{decode}} \approx \frac{16}{4} = 4\times$$

In practice, the speedup is lower (2-3x) due to dequantization overhead, group scale fetching, and non-weight memory accesses (KV cache, activations).
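Two of these effects are easy to quantify with a rough byte-count model: group metadata raises the effective bits per weight, and non-weight traffic dilutes the saving. The sketch below assumes a 16-bit scale and a 16-bit zero-point per group and a made-up 1 GB of non-weight traffic per token; it ignores dequantization compute entirely.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, zero_point_bits=16):
    """Bits per weight once per-group metadata is included."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size

def decode_speedup(n_params, baseline_bits, quant_bits, other_bytes_per_token):
    """Ratio of bytes read per generated token, baseline over quantized."""
    baseline = n_params * baseline_bits / 8 + other_bytes_per_token
    quantized = n_params * quant_bits / 8 + other_bytes_per_token
    return baseline / quantized

bits_g128 = effective_bits_per_weight(weight_bits=4, group_size=128)
ideal = decode_speedup(7e9, 16, bits_g128, other_bytes_per_token=0.0)
diluted = decode_speedup(7e9, 16, bits_g128, other_bytes_per_token=1e9)  # hypothetical 1 GB

print(f"effective bits/weight (g=128): {bits_g128}")
print(f"speedup, weights only:         {ideal:.2f}x")
print(f"speedup, with 1 GB other:      {diluted:.2f}x")
```

The model already drops below the ideal 4x before any kernel overhead is counted, which is consistent with the lower speedups seen in practice.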

Dequantization Overhead
#

Quantized weights must be dequantized before computation (or during, in fused kernels). The dequantization cost depends on the quantization scheme:

| Scheme | Dequant Operations per Weight | Relative Overhead |
|---|---|---|
| Per-tensor symmetric | 1 multiply | Very low |
| Per-channel symmetric | 1 multiply | Low |
| Per-group affine (g=128) | 1 multiply + 1 add | Low |
| NF4 (lookup table) | 1 table lookup + 1 multiply | Medium |
| AQLM (codebook) | 1-2 table lookups + 1 add | Medium-High |
| QuIP# (E8 lattice + rotation) | Lattice decode + Hadamard transform | High |

Efficient GPU kernels (e.g., from Marlin, ExLlamaV2, or TensorRT-LLM) fuse dequantization with the matrix multiply, hiding most of the overhead behind the memory latency of loading weights.
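As a reference for what such kernels compute, here is an unfused NumPy sketch of the per-group affine case: 4-bit codes packed two per byte, then unpacked and dequantized with one multiply and one add per weight. The packing order (low nibble first) and the function names are assumptions; real kernels use their own layouts tuned to the hardware.

```python
import numpy as np

GROUP_SIZE = 128

def quantize_per_group(w, bits=4, group_size=GROUP_SIZE):
    """Affine quantization of a 1-D weight vector, one (scale, offset) per group."""
    groups = w.reshape(-1, group_size)
    qmax = 2 ** bits - 1
    lo = groups.min(axis=1, keepdims=True)
    scale = np.maximum(groups.max(axis=1, keepdims=True) - lo, 1e-8) / qmax
    q = np.clip(np.round((groups - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale.astype(np.float32), lo.astype(np.float32)

def pack_int4(q):
    """Pack pairs of 4-bit codes into single bytes (low nibble first)."""
    flat = q.reshape(-1)
    return (flat[0::2] | (flat[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

def dequantize_per_group(packed, scale, lo, group_size=GROUP_SIZE):
    q = unpack_int4(packed).reshape(-1, group_size).astype(np.float32)
    return (q * scale + lo).reshape(-1)  # one multiply and one add per weight

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale, lo = quantize_per_group(w)
packed = pack_int4(q)
w_hat = dequantize_per_group(packed, scale, lo)
print("packed bytes:", packed.nbytes, "max abs error:", float(np.abs(w - w_hat).max()))
```

A fused kernel performs the unpack, multiply, and add in registers right after loading the packed bytes, so the dequantized weights never travel through memory.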

End-to-End Throughput Comparison
#

The following table compares practical inference throughput for a 7B-parameter LLM on a single NVIDIA RTX 4090 (24 GB VRAM):

| Quantization | Bits/Weight | Model Size | Tokens/sec (decode) | Perplexity (WikiText-2) |
|---|---|---|---|---|
| FP16 | 16 | 14.0 GB | ~35 | 5.68 (baseline) |
| GPTQ INT8 | 8 | 7.0 GB | ~65 | 5.69 |
| GPTQ INT4 (g128) | 4.25 | 4.0 GB | ~110 | 5.85 |
| AWQ INT4 (g128) | 4.25 | 4.0 GB | ~115 | 5.79 |
| GGUF Q4_K_M | 4.85 | 4.6 GB | ~100 (CPU+GPU) | 5.82 |
| GGUF Q3_K_M | 3.875 | 3.5 GB | ~120 (CPU+GPU) | 6.15 |
| GGUF Q2_K | 2.5625 | 2.5 GB | ~135 (CPU+GPU) | 7.89 |
| QuIP# 2-bit | 2 | 2.0 GB | ~80 | 6.45 |
| AQLM 2-bit | 2 | 2.0 GB | ~75 | 6.32 |
| BitNet 1.58b | 1.58 | ~1.6 GB | ~150 (specialized) | ~5.70 (trained) |

Note: BitNet requires training from scratch with ternary weights; all others are post-training quantization applied to a pre-trained FP16 model.
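A quick sanity check for numbers like these is the bandwidth-bound ceiling: if every generated token has to stream all weights once, decode throughput cannot exceed memory bandwidth divided by model size. The sketch below applies this to three rows of the table, using the RTX 4090's nominal 1008 GB/s; sustained bandwidth is lower in practice, and the bound ignores KV-cache and activation traffic.

```python
def max_decode_tokens_per_sec(model_size_gb, bandwidth_gb_s):
    """Upper bound when each generated token streams all weights once."""
    return bandwidth_gb_s / model_size_gb

RTX_4090_BANDWIDTH_GB_S = 1008  # nominal spec; sustained bandwidth is lower

# (model size in GB, reported tokens/sec), copied from the table above
rows = {
    "FP16": (14.0, 35),
    "GPTQ INT4 (g128)": (4.0, 110),
    "QuIP# 2-bit": (2.0, 80),
}
bounds = {}
for name, (size_gb, reported) in rows.items():
    bounds[name] = max_decode_tokens_per_sec(size_gb, RTX_4090_BANDWIDTH_GB_S)
    print(f"{name:18s} bound ~{bounds[name]:4.0f} tok/s, reported ~{reported} tok/s")
```

The gap between bound and reported throughput widens at 2 bits, which fits the earlier observation that codebook and lattice schemes carry a higher dequantization cost.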


State-of-the-Art Comparison (2024-2025)
#

The following table summarizes the major quantization methods, their characteristics, and results as of early 2025:

| Method | Year | Type | Bits | Calibration Data | Key Innovation | LLaMA-2 7B PPL | LLaMA-2 70B PPL |
|---|---|---|---|---|---|---|---|
| GPTQ | 2022 | PTQ | 3-8 | Yes (128 samples) | OBQ with lazy batching | 6.29 (4-bit) | 3.85 (4-bit) |
| AWQ | 2023 | PTQ | 3-8 | Yes (small) | Activation-aware scaling | 5.89 (4-bit) | 3.56 (4-bit) |
| SqueezeLLM | 2023 | PTQ | 3-4 | Yes | Dense-and-sparse; non-uniform | 5.88 (4-bit) | |
| QuIP | 2023 | PTQ | 2-4 | Yes | Incoherence processing | 6.90 (2-bit) | 4.55 (2-bit) |
| QuIP# | 2023 | PTQ | 2-4 | Yes | E8 lattice, Hadamard rotation | 6.45 (2-bit) | 4.15 (2-bit) |
| AQLM | 2024 | PTQ | 2-4 | Yes | Multi-codebook additive VQ | 6.32 (2-bit) | 4.02 (2-bit) |
| HQQ | 2023 | PTQ | 2-8 | No | Half-quadratic optimization | 6.58 (4-bit) | 3.68 (4-bit) |
| GGUF IQ2_XS | 2024 | PTQ | 2.3 | Yes | Importance-weighted lattice | 7.21 (2.3-bit) | 4.42 (2.3-bit) |
| OmniQuant | 2023 | PTQ/QAT | 2-8 | Yes | Learnable weight clipping + equiv. transform | 5.86 (4-bit) | 3.54 (4-bit) |
| QLoRA NF4 | 2023 | QAT | 4 | Training data | NF4 + double quantization | 5.70* (fine-tuned) | |
| SpQR | 2023 | PTQ | 3-4 | Yes | Sparse outlier + dense quantized | 5.84 (4-bit) | 3.53 (4-bit) |
| SmoothQuant | 2022 | PTQ | W8A8 | Yes | Smoothing transform for activations | – (W8A8) | – (W8A8) |
| KIVI | 2024 | PTQ | KV:2 | Yes | Asymmetric K/V quantization | ~0.1 PPL increase | ~0.1 PPL increase |
| BitNet b1.58 | 2024 | QAT | 1.58 | Training data | Ternary weights from scratch | ~5.7 (trained) | |
| OneBit | 2024 | QAT | 1 | Training data | 1-bit with value-aware knowledge distillation | ~6.2 (trained) | |
| EfficientQAT | 2024 | QAT | 2-4 | Training data | Block-wise QAT + end-to-end | 5.72 (4-bit) | 3.42 (4-bit) |

*QLoRA perplexity varies by fine-tuning task and dataset.

Key takeaways from the 2024-2025 landscape:

  1. 4-bit is the sweet spot for post-training quantization: methods like AWQ, GPTQ, and HQQ achieve near-lossless compression at 4x size reduction.

  2. 2-bit PTQ is viable for large models: QuIP#, AQLM, and GGUF IQ variants push the frontier below 3 bits, with 70B+ models maintaining reasonable quality. The larger the model, the more gracefully it quantizes.

  3. 1-2 bit requires training-aware methods: BitNet b1.58 demonstrates that training from scratch with extreme quantization can match full-precision performance, but this requires the full training compute budget.

  4. KV-cache quantization is critical: For long-context applications, KV-cache memory can exceed model weight memory. Specialized methods like KIVI enable 2-bit KV caches with minimal quality loss.

  5. Hardware support is evolving: NVIDIA Blackwell (B100/B200) adds native FP4 Tensor Cores. AMD MI300X supports FP8. Custom silicon (Groq, Cerebras) increasingly targets INT4/INT8. Software stacks (TensorRT-LLM, vLLM, llama.cpp) are key enablers.


Practical Decision Guide
#

Choosing a quantization strategy depends on your constraints. Here is a decision framework:

                         START
                           |
                    Do you have training
                    compute budget?
                      /          \
                    Yes           No
                    /               \
             Need <2 bits?     Need <3 bits?
               /    \            /        \
             Yes    No         Yes        No
             /        \        /            \
        BitNet     QLoRA    AQLM/         AWQ/GPTQ
        b1.58      NF4     QuIP#          INT4
        (train     (fine-   (2-bit        (4-bit PTQ,
        from       tune     PTQ)          best balance)
        scratch)   adapter)    |               |
                              |               |
                         Need fast        Hardware-
                         quantization?    specific?
                           /    \          /     \
                         Yes    No       Yes     No
                         /        \      /         \
                       HQQ     AQLM   HAQ       AWQ + Marlin
                      (no cal) (better (RL-based  kernel
                               quality) search)
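The same tree can be written as a small function, which makes the branch order explicit. The function and parameter names are ours; the returned strings mirror the leaves of the diagram and are starting points rather than definitive recommendations.

```python
def choose_quantization(has_training_budget, target_bits,
                        need_fast_quantization=False, hardware_specific=False):
    """Walk the decision tree above and return the suggested method."""
    if has_training_budget:
        if target_bits < 2:
            return "BitNet b1.58 (train from scratch)"
        return "QLoRA NF4 (fine-tune adapter)"
    if target_bits < 3:
        # 2-bit PTQ branch: trade quantization time against quality.
        if need_fast_quantization:
            return "HQQ (no calibration)"
        return "AQLM (better quality)"
    # 4-bit PTQ branch.
    if hardware_specific:
        return "HAQ (RL-based search)"
    return "AWQ + Marlin kernel"

print(choose_quantization(has_training_budget=False, target_bits=4))
print(choose_quantization(has_training_budget=False, target_bits=2,
                          need_fast_quantization=True))
```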

Conclusion
#

Extreme and mixed-precision quantization has progressed from an academic curiosity to a practical necessity. The key developments of 2024-2025 demonstrate that:

  • FP8 is becoming a standard choice for training, with native support on recent accelerators such as NVIDIA Hopper and Blackwell and AMD MI300X.
  • INT4 with group quantization (AWQ, GPTQ, GGUF K-quants) is the production standard for LLM inference.
  • 2-bit quantization (QuIP#, AQLM) is practical for the largest models (70B+), enabling single-GPU deployment of models that previously required multi-node clusters.
  • 1.58-bit (BitNet b1.58) points toward a future where extreme quantization is built into the training process. Because ternary weights reduce weight matrix products to additions and subtractions, this could remove the need for floating-point multipliers in those layers.
  • Mixed-precision strategies (HAWQ, HAQ) provide the theoretical and practical framework for optimally allocating bits across heterogeneous model components.

The field continues to advance rapidly. As new architectures (Mixture of Experts, State Space Models, hybrid designs) and new hardware (FP4 Tensor Cores, custom accelerators) emerge, the quantization landscape will continue to evolve. The fundamental principle remains: compress aggressively where the model is robust, preserve precision where it is sensitive, and always measure on your target task and hardware.


References
#

  1. Micikevicius et al., “FP8 Formats for Deep Learning,” arXiv:2209.05433 (2022)
  2. Dettmers et al., “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,” NeurIPS 2022
  3. Dettmers et al., “QLoRA: Efficient Finetuning of Quantized Language Models,” NeurIPS 2023
  4. Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,” ICLR 2023
  5. Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” MLSys 2024
  6. Chee et al., “QuIP: 2-Bit Quantization of Large Language Models With Guarantees,” NeurIPS 2023
  7. Tseng et al., “QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks,” ICML 2024
  8. Egiazarian et al., “AQLM: Extreme Compression of Large Language Models via Additive Quantization,” ICML 2024
  9. Badri & Shaji, “Half-Quadratic Quantization of Large Machine Learning Models,” Mobius Labs blog (2023)
  10. Dong et al., “HAWQ: Hessian AWare Quantization of Neural Networks,” ICCV 2019
  11. Wang et al., “HAQ: Hardware-Aware Automated Quantization with Mixed Precision,” CVPR 2019
  12. Ma et al., “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits,” arXiv:2402.17764 (2024)
  13. Liu et al., “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache,” arXiv:2402.02750 (2024)
  14. Li et al., “Q-Diffusion: Quantizing Diffusion Models,” ICCV 2023
  15. Yuan et al., “PTQ4ViT: Post-Training Quantization for Vision Transformers,” ECCV 2022
  16. Xiao et al., “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models,” ICML 2023