Introduction#
Model quantization has evolved far beyond the classic INT8 regime. As large language models (LLMs) surpass hundreds of billions of parameters and vision/diffusion models demand ever-increasing computational budgets, researchers have pushed quantization to its extreme limits. This post provides a deep, technical exploration of extreme and mixed-precision quantization – from 8-bit floating point down to single-bit binary representations – along with the sophisticated algorithms that make such aggressive compression possible without catastrophic quality loss.
We will cover the full landscape: the bit-level mechanics of FP8 and INT4 formats, sub-4-bit methods including binary neural networks and BitNet, state-of-the-art algorithms such as QuIP#, AQLM, and HQQ, mixed-precision strategies driven by sensitivity analysis and reinforcement learning, domain-specific challenges for Transformers, vision models, and diffusion models, and finally the hardware-aware inference optimization perspective.
FP8: 8-Bit Floating Point#
Why Floating Point at 8 Bits?#
Traditional INT8 quantization maps floating-point values to 256 uniformly spaced integers. While effective for inference, this uniform spacing poorly represents the heavy-tailed distributions common in neural network weights and activations. FP8 retains the logarithmic spacing of floating-point arithmetic, providing higher precision near zero (where most values cluster) and coarser precision for outliers.
E4M3 and E5M2 Bit Layouts#
Hardware vendors (including NVIDIA, Arm, Intel, and AMD) have converged on two FP8 formats, described in the Open Compute Project's OFP8 specification. Both use 8 bits total:
E4M3 Format (1 sign + 4 exponent + 3 mantissa):
+---+----+---+---+---+---+---+---+
| S | E3 | E2| E1| E0| M2| M1| M0|
+---+----+---+---+---+---+---+---+
sign (1 bit) | exponent (4 bits) | mantissa (3 bits)
E5M2 Format (1 sign + 5 exponent + 2 mantissa):
+---+----+---+---+---+---+---+---+
| S | E4 | E3| E2| E1| E0| M1| M0|
+---+----+---+---+---+---+---+---+
sign (1 bit) | exponent (5 bits) | mantissa (2 bits)
The value of a normal FP8 number follows the standard floating-point formula:
$$\text{value} = (-1)^S \times 2^{(E - \text{bias})} \times \left(1 + \frac{M}{2^{m}}\right)$$
where \(E\) is the stored exponent, \(\text{bias}\) is the exponent bias, \(M\) is the stored mantissa, and \(m\) is the number of mantissa bits.
| Property | E4M3 | E5M2 |
|---|---|---|
| Exponent bits | 4 | 5 |
| Mantissa bits | 3 | 2 |
| Exponent bias | 7 | 15 |
| Max normal value | 448 | 57344 |
| Min positive normal | \(2^{-6}\) = 0.015625 | \(2^{-14}\) = 6.1e-5 |
| Dynamic range (decades, including subnormals) | ~5.4 | ~9.6 |
| Precision (ULP at 1.0) | 0.125 | 0.25 |
| Special values | NaN only (no Inf) | NaN and Inf |
Numerical Examples#
E4M3 encoding of 3.5:
- \(3.5 = 1.75 \times 2^1\)
- Sign: \(S = 0\) (positive)
- Exponent: \(E = 1 + 7 = 8 = 1000_2\)
- Mantissa: \(1.75 = 1 + 0.5 + 0.25 = 1 + \frac{M}{8}\), so \(M = 6 = 110_2\)
- Final bit pattern:
0 1000 110= 0x46
E5M2 encoding of 0.1875:
- \(0.1875 = 1.5 \times 2^{-3}\)
- Sign: \(S = 0\)
- Exponent: \(E = -3 + 15 = 12 = 01100_2\)
- Mantissa: \(1.5 = 1 + 0.5 = 1 + \frac{M}{4}\), so \(M = 2 = 10_2\)
- Final bit pattern:
0 01100 10= 0x32
Quantization error comparison at value 1.3:
- E4M3: rounds to 1.25 (error = 0.05, relative = 3.8%)
- E5M2: rounds to 1.25 (error = 0.05, relative = 3.8%), the same result, because 1.25 is representable in both formats
Quantization error comparison at value 5.3 (in \([4, 8)\) the E4M3 step is 0.5 and the E5M2 step is 1.0):
- E4M3: rounds to 5.5 (error = 0.2, relative = 3.8%)
- E5M2: rounds to 5.0 (error = 0.3, relative = 5.7%), so E4M3 wins with more mantissa bits
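The encodings and rounding behaviour above can be checked with a small brute-force sketch. The helper names (`fp8_decode`, `finite_values`, `fp8_round`) are illustrative and not part of any library, and the rounding simply picks the closest representable value without implementing round-to-nearest-even for ties.

```python
def fp8_decode(byte, exp_bits, man_bits, bias):
    """Decode one FP8 byte to a float (normals and subnormals; NaN/Inf not handled)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * 2.0 ** (1 - bias) * (man / (1 << man_bits))
    return sign * 2.0 ** (exp - bias) * (1 + man / (1 << man_bits))


def finite_values(exp_bits, man_bits, bias, ieee_specials):
    """Enumerate every finite value of the format by decoding all 256 codes."""
    max_exp = (1 << exp_bits) - 1
    max_man = (1 << man_bits) - 1
    values = []
    for byte in range(256):
        exp = (byte >> man_bits) & max_exp
        man = byte & max_man
        if ieee_specials and exp == max_exp:
            continue  # E5M2: top exponent is reserved for Inf/NaN
        if not ieee_specials and exp == max_exp and man == max_man:
            continue  # E4M3: only the all-ones pattern is NaN
        values.append(fp8_decode(byte, exp_bits, man_bits, bias))
    return values


def fp8_round(x, values):
    """Closest representable value (ties broken arbitrarily in this sketch)."""
    return min(values, key=lambda v: abs(v - x))


E4M3_VALUES = finite_values(4, 3, 7, ieee_specials=False)
E5M2_VALUES = finite_values(5, 2, 15, ieee_specials=True)

print(fp8_decode(0x46, 4, 3, 7), fp8_decode(0x32, 5, 2, 15))
for x in (1.3, 5.3):
    print(x, fp8_round(x, E4M3_VALUES), fp8_round(x, E5M2_VALUES))
```

Enumerating all 256 codes is only practical because the format is so small, but it makes the special-value rules of each format explicit.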
FP8 Training#
FP8 training uses both formats in a complementary fashion, as pioneered by NVIDIA’s Transformer Engine:
FP8 Mixed-Format Training Pipeline:
FP8 E4M3 FP8 E4M3
Weights -----> [Forward Pass] -----> Activations
(E4M3) | |
| |
v v
FP8 E5M2 FP8 E5M2
[Backward Pass] <----- [Loss Gradient]
(grad weights) (grad activations)
|
v
FP32 Master Weights (optimizer update)
The key insight: E4M3 for the forward pass (higher precision needed for accurate outputs) and E5M2 for the backward pass (wider dynamic range needed for gradients, which can span many orders of magnitude).
Per-tensor scaling is critical for FP8 training. Each tensor maintains a scaling factor \(s\) updated via a delayed scaling strategy:
$$s_{t+1} = \frac{\text{maxval}(\text{FP8})}{\max(|X_t|)} \times \alpha$$
where \(\alpha\) is a safety margin (typically 0.9) to prevent overflow. The scaling factor is applied before casting to FP8:
$$X_{\text{FP8}} = \text{cast\_to\_fp8}(X \times s)$$
NVIDIA’s H100 GPU achieves up to 2x throughput improvement with FP8 Tensor Cores compared to FP16, making FP8 training practical for models with hundreds of billions of parameters.
INT4: 4-Bit Integer Quantization#
Uniform INT4 Quantization#
At 4 bits, we have only 16 distinct values. For symmetric quantization:
$$q = \text{clamp}\left(\left\lfloor \frac{x}{s} \right\rceil, -8, 7\right), \quad s = \frac{\max(|x|)}{7}$$
For asymmetric quantization:
$$q = \text{clamp}\left(\left\lfloor \frac{x - z}{s} \right\rceil, 0, 15\right), \quad s = \frac{\max(x) - \min(x)}{15}, \quad z = \min(x)$$
With only 16 levels, the quantization error is significant for per-tensor quantization. This motivates group quantization.
Group Quantization#
Group quantization divides a weight tensor into small groups of \(g\) consecutive elements, each with its own scale and zero-point:
Weight tensor (1x16):
[0.1, 0.5, -0.3, 0.8, | -0.1, 0.2, 0.9, -0.7, | 0.3, -0.4, 0.6, 0.1, | -0.2, 0.7, -0.5, 0.4]
Group 0 (g=4) Group 1 (g=4) Group 2 (g=4) Group 3 (g=4)
s0, z0 s1, z1 s2, z2 s3, z3
The overhead of storing per-group parameters adds bits per weight:
$$\text{effective bits} = 4 + \frac{b_s + b_z}{g}$$
where \(b_s\) and \(b_z\) are the bit-widths of the scale and zero-point. For \(g = 128\) with FP16 scale and zero-point:
$$\text{effective bits} = 4 + \frac{16 + 16}{128} = 4.25 \text{ bits}$$
Common group sizes in practice: 32, 64, 128, 256. Smaller groups improve accuracy but increase overhead.
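A minimal NumPy sketch of asymmetric INT4 group quantization, applied to the 1x16 example tensor with \(g = 4\), might look as follows. The function names are illustrative, and the zero-point is kept as a floating-point minimum as in the formula above rather than as an integer offset.

```python
import numpy as np


def quantize_groups_int4(w, group_size):
    """Asymmetric INT4 quantization with one (scale, zero-point) pair per group."""
    groups = w.reshape(-1, group_size)
    z = groups.min(axis=1, keepdims=True)                # zero-point = group minimum
    s = (groups.max(axis=1, keepdims=True) - z) / 15.0   # 16 levels -> 15 steps
    s = np.where(s == 0, 1.0, s)                         # guard for constant groups
    q = np.clip(np.round((groups - z) / s), 0, 15).astype(np.uint8)
    return q, s, z


def dequantize_groups(q, s, z):
    return q * s + z


def effective_bits(bits, group_size, scale_bits=16, zero_bits=16):
    """Bits per weight including per-group scale and zero-point overhead."""
    return bits + (scale_bits + zero_bits) / group_size


w = np.array([0.1, 0.5, -0.3, 0.8, -0.1, 0.2, 0.9, -0.7,
              0.3, -0.4, 0.6, 0.1, -0.2, 0.7, -0.5, 0.4])
q, s, z = quantize_groups_int4(w, group_size=4)
w_hat = dequantize_groups(q, s, z).reshape(w.shape)
print("max abs error:", np.abs(w - w_hat).max())
print("effective bits at g=128:", effective_bits(4, 128))
```

Each group's worst-case error is half of its own step size, so a group with a narrow value range is reconstructed more accurately than it would be under a single per-tensor scale.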
NF4: NormalFloat 4-bit#
QLoRA introduced NF4 (NormalFloat4), an information-theoretically optimal data type for normally distributed weights. The key insight: neural network weights after pretraining are approximately normally distributed with zero mean.
NF4 builds its 16 quantization levels from quantiles of the standard normal distribution \(\mathcal{N}(0,1)\), so that each quantization bin holds roughly equal probability mass. In simplified form:
$$q_i = \Phi^{-1}\left(\frac{2i + 1}{2 \times 16}\right), \quad i = 0, 1, \ldots, 15$$
where \(\Phi^{-1}\) is the inverse cumulative distribution function (probit function) of the standard normal. The construction used in QLoRA is slightly asymmetric so that zero is represented exactly, and the levels are then rescaled to \([-1, 1]\).
The resulting NF4 quantization levels (normalized to [-1, 1]):
NF4 levels (16 values):
[-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
Notice the non-uniform spacing: levels are denser near zero, where the normal distribution has higher probability density. This keeps the expected quantization error low:
$$\mathbb{E}[|x - Q(x)|^2] = \int_{-\infty}^{\infty} |x - Q(x)|^2 \, \phi(x) \, dx$$
where \(\phi(x)\) is the standard normal PDF. For normally distributed data, NF4 achieves lower expected error than uniform INT4.
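To make the lookup concrete, here is a small sketch that quantizes a block of weights to the 16 levels listed above. It assumes absmax normalization per block (so the block fits in \([-1, 1]\)) and a nearest-level search; the function names are illustrative and the level values are the rounded ones from this post.

```python
import numpy as np

NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])


def nf4_quantize(block):
    """Return 4-bit level indices and the absmax scale for one block."""
    absmax = np.abs(block).max()
    if absmax == 0:
        absmax = 1.0  # all-zero block: any scale works
    normalized = block / absmax
    # Nearest level for every element (distance matrix is block_size x 16)
    idx = np.abs(normalized[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax


def nf4_dequantize(idx, absmax):
    return NF4_LEVELS[idx] * absmax


rng = np.random.default_rng(0)
block = rng.standard_normal(64) * 0.02   # hypothetical block of 64 weights
idx, absmax = nf4_quantize(block)
recon = nf4_dequantize(idx, absmax)
print("mean squared error:", np.mean((block - recon) ** 2))
```

Only the 4-bit indices and one scale per block need to be stored; the level table itself is a fixed constant shared by every block.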
QLoRA further applies double quantization – quantizing the FP32 group scales themselves to FP8, reducing the per-parameter overhead:
$$\text{effective bits (NF4 + double quant)} = 4 + \frac{8}{64} + \frac{32}{64 \times 256} \approx 4.127 \text{ bits}$$
GGUF Format and Quant Types#
The GGUF (GPT-Generated Unified Format) file format, developed by the llama.cpp community, has become the de facto standard for distributing quantized LLMs for CPU and mixed CPU/GPU inference. It supports a wide array of quantization types:
| Quant Type | Bits/Weight | Group Size | Scale Format | Description |
|---|---|---|---|---|
| Q2_K | 2.5625 | 256 (super) / 16 (sub) | FP16 + 4-bit | 2-bit with 4-bit importance-based scales |
| Q3_K_S | 3.4375 | 256 / 16 | FP16 + 4-bit | 3-bit small, fewer high-precision groups |
| Q3_K_M | 3.875 | 256 / 16 | FP16 + 4-bit | 3-bit medium |
| Q3_K_L | 4.125 | 256 / 16 | FP16 + 4-bit | 3-bit large, more high-precision groups |
| Q4_0 | 4.5 | 32 | FP16 | Basic 4-bit, per-group absmax |
| Q4_1 | 5.0 | 32 | FP16 + FP16 | 4-bit with scale + min value |
| Q4_K_S | 4.5 | 256 / 32 | FP16 + 6-bit | 4-bit K-quant small |
| Q4_K_M | 4.85 | 256 / 32 | FP16 + 6-bit | 4-bit K-quant medium, mixed precision |
| Q5_0 | 5.5 | 32 | FP16 | 5-bit per-group |
| Q5_1 | 6.0 | 32 | FP16 + FP16 | 5-bit with min |
| Q5_K_S | 5.5 | 256 / 32 | FP16 + 6-bit | 5-bit K-quant small |
| Q5_K_M | 5.75 | 256 / 32 | FP16 + 6-bit | 5-bit K-quant medium |
| Q6_K | 6.5625 | 256 / 16 | FP16 + 8-bit | 6-bit K-quant |
| Q8_0 | 8.5 | 32 | FP16 | 8-bit per-group |
| IQ1_S | 1.5625 | 256 | FP16 | 1-bit importance-weighted |
| IQ2_XXS | 2.0625 | 256 | FP16 | 2-bit ultra-extreme |
| IQ2_XS | 2.3125 | 256 | FP16 | 2-bit extreme |
| IQ2_S | 2.5 | 256 | FP16 | 2-bit |
| IQ3_XXS | 3.0625 | 256 | FP16 | 3-bit ultra-extreme |
| IQ3_XS | 3.3 | 256 | FP16 | 3-bit extreme |
| IQ4_NL | 4.5 | 32 | FP16 | 4-bit non-linear (NF4-like) |
| IQ4_XS | 4.25 | 256 / 32 | FP16 | 4-bit extreme with super-blocks |
The K-quant variants (e.g., Q4_K_M) use a two-level grouping hierarchy: super-blocks of 256 weights containing sub-blocks of 16 or 32 weights. The super-block stores a shared FP16 scale, while sub-blocks store smaller quantized scales relative to the super-block. This hierarchical approach significantly reduces overhead.
The IQ (Importance Quantization) variants use lattice-based codebooks and importance weighting (derived from the Fisher information or Hessian diagonal) to allocate bits more efficiently to important weights.
Sub-4-Bit Quantization#
INT3 and INT2#
At 3 bits (8 levels) and 2 bits (4 levels), naive uniform quantization causes severe accuracy degradation. The key challenge can be visualized:
Weight Distribution vs. Quantization Levels:
Probability
|
| ***
| *****
| *******
| *********
| ***********
|*************
+-----|---|---|---|---> value
L0 L1 L2 L3 (INT2: only 4 levels!)
Most of the distribution's probability mass falls
between L1 and L2, wasting 2 of the 4 levels on
the rarely-occupied tails.
Successful INT3/INT2 methods rely on several key techniques:
- Non-uniform quantization: Place levels according to the weight distribution (as in NF4)
- Compensation: Adjust remaining FP16 weights to compensate for quantization error in quantized layers
- Learned rounding: Optimize the rounding decisions (up or down) jointly rather than independently
- Group quantization with very small groups: Groups of 8-32 to capture local statistics
- Mixed-precision residuals: Store a small FP16 or INT8 residual correction term
Binary Neural Networks (BNNs)#
Binary Neural Networks represent the extreme of quantization: weights (and optionally activations) are constrained to \(\{-1, +1\}\), requiring only 1 bit per value.
Binarization function:
$$w_b = \text{sign}(w) = \begin{cases} +1 & \text{if } w \geq 0 \\ -1 & \text{if } w < 0 \end{cases}$$
The key advantage: matrix multiplications reduce to XNOR and popcount operations. With \(+1\) encoded as bit 1, \(-1\) as bit 0, and vectors of length \(n\):
$$y = \mathbf{w}^T \mathbf{x} \approx \alpha \cdot \left(2 \cdot \text{popcount}(\text{XNOR}(\mathbf{w}_b, \mathbf{x}_b)) - n\right)$$
where \(\alpha\) is a learned or computed scaling factor. The XNOR-popcount operation is extremely fast on modern hardware:
Binary Matrix Multiply (XNOR + Popcount):
w_b = [+1, -1, +1, +1, -1, +1, -1, -1] --> [1,0,1,1,0,1,0,0] = 0xB4
x_b = [+1, +1, -1, +1, +1, -1, +1, -1] --> [1,1,0,1,1,0,1,0] = 0xDA
XNOR(0xB4, 0xDA) = 0x91 = [1,0,0,1,0,0,0,1]
popcount(0x91) = 3
dot_product = 2 * popcount - n = 2 * 3 - 8 = -2
Verification: (+1)(+1) + (-1)(+1) + (+1)(-1) + (+1)(+1)
+ (-1)(+1) + (+1)(-1) + (-1)(+1) + (-1)(-1)
= 1 - 1 - 1 + 1 - 1 - 1 - 1 + 1 = -2 (correct)
Computational savings of BNNs:
| Operation | FP32 | Binary |
|---|---|---|
| Multiply | 32-bit FPU multiply | 1-bit XNOR |
| Accumulate | 32-bit FP add | Integer popcount |
| Memory per weight | 32 bits | 1 bit (32x reduction) |
| Theoretical speedup | 1x | ~58x (on specialized hardware) |
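The XNOR-popcount dot product from the worked example can be reproduced in a few lines of plain Python. The helpers `pack_bits` and `binary_dot` are illustrative names; real kernels operate on 32- or 64-bit machine words with hardware popcount instructions.

```python
def pack_bits(signs):
    """Pack +1/-1 values into an int, first element as the most significant bit."""
    word = 0
    for s in signs:
        word = (word << 1) | (1 if s > 0 else 0)
    return word


def binary_dot(w_word, x_word, n):
    """Dot product of two length-n {-1,+1} vectors stored as bit words."""
    xnor = ~(w_word ^ x_word) & ((1 << n) - 1)  # 1 where the signs agree
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n                      # agreements minus disagreements


w = [+1, -1, +1, +1, -1, +1, -1, -1]
x = [+1, +1, -1, +1, +1, -1, +1, -1]
w_word, x_word = pack_bits(w), pack_bits(x)
print(hex(w_word), hex(x_word), binary_dot(w_word, x_word, len(w)))
```

The mask `(1 << n) - 1` matters in Python because `~` on an unbounded integer yields a negative number; in C the same role is played by the fixed word width.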
However, BNNs suffer from severe accuracy loss. For ImageNet classification, a binary ResNet-18 typically loses 15-20 percentage points of top-1 accuracy compared to the full-precision version. This largely limits BNNs to edge applications where extreme efficiency is paramount.
Training BNNs requires the Straight-Through Estimator (STE) because the sign function has zero gradient almost everywhere:
$$\frac{\partial L}{\partial w} \approx \frac{\partial L}{\partial w_b} \cdot \mathbb{1}_{|w| \leq 1}$$
The STE passes the gradient through the sign function as if it were the identity (clipped to [-1, 1]).
BitNet b1.58#
BitNet b1.58 (Microsoft Research, 2024) represents a breakthrough in ternary quantization. Instead of binary \(\{-1, +1\}\), it uses ternary weights \(\{-1, 0, +1\}\), requiring \(\log_2(3) \approx 1.58\) bits per weight.
Quantization function:
$$\tilde{w} = \text{RoundClip}\left(\frac{w}{\gamma + \epsilon}, -1, 1\right)$$
where \(\gamma = \frac{1}{nm}\sum_{i,j}|w_{ij}|\) is the mean absolute value of the weight matrix, and:
$$\text{RoundClip}(x, a, b) = \max(a, \min(b, \lfloor x \rceil))$$
Activation quantization uses absmax quantization to \(b\)-bit integers (typically 8-bit):
$$\tilde{x} = \text{Quant}(x) = \text{clamp}\left(\left\lfloor \frac{x}{Q_b} \times (2^{b-1} - 1) \right\rceil, -(2^{b-1}-1), 2^{b-1}-1\right)$$
where \(Q_b = \|x\|_\infty\).
The linear layer in BitNet b1.58:
$$y = \tilde{W} \tilde{x} = \sum_{j} \tilde{w}_j \tilde{x}_j$$
Since \(\tilde{w}_j \in \{-1, 0, +1\}\), each multiply becomes:
- If \(\tilde{w} = +1\): add \(\tilde{x}\)
- If \(\tilde{w} = -1\): subtract \(\tilde{x}\)
- If \(\tilde{w} = 0\): skip (no operation)
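This removes multiplications from the inner loop of the matrix multiply, which reduces to integer additions and subtractions. Only a single rescale by the weight and activation scales remains, applied to the output.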
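The quantizers and the multiplication-free product can be sketched in NumPy as below. This is a simplified illustration under assumptions, not the reference implementation: the function names are made up, and `eps` is added to the activation scale only as a division guard.

```python
import numpy as np


def absmean_ternarize(w, eps=1e-5):
    """Ternary weights in {-1, 0, +1} plus the absmean scale gamma."""
    gamma = np.abs(w).mean()
    w_t = np.clip(np.round(w / (gamma + eps)), -1, 1).astype(np.int8)
    return w_t, gamma


def absmax_quantize(x, bits=8, eps=1e-5):
    """Symmetric absmax quantization of activations to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = qmax / (np.abs(x).max() + eps)
    x_q = np.clip(np.round(x * scale), -qmax, qmax).astype(np.int32)
    return x_q, scale


def ternary_matvec(w_t, x_q):
    """y_i = sum of x where w = +1 minus sum of x where w = -1; zeros are skipped."""
    added = np.where(w_t == 1, x_q, 0).sum(axis=1)
    subtracted = np.where(w_t == -1, x_q, 0).sum(axis=1)
    return added - subtracted


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16))
x = rng.standard_normal(16)

w_t, gamma = absmean_ternarize(W)
x_q, x_scale = absmax_quantize(x)
y_int = ternary_matvec(w_t, x_q)
y = y_int * gamma / x_scale   # one rescale per output, outside the inner loop
print(y, W @ x)
```

Comparing `y` with `W @ x` shows the approximation error of ternarization on random weights; the paper's results rely on training with these quantizers in the loop rather than applying them post hoc.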
Energy and performance comparison (from the BitNet b1.58 paper):
Energy per Operation (relative to FP16 multiply-add):
FP16 Multiply: |========================| 100%
FP16 Add: |=====| 20%
INT8 Multiply: |=======| 31%
INT8 Add: |=| 4%
1.58-bit (add): |=| 4%
Memory Footprint for a 70B model:
FP16: |========================================| 140 GB
INT8: |====================| 70 GB
INT4: |==========| 35 GB
1.58b: |====| 17.5 GB (assuming 2-bit packing of ternary weights; fits single GPU!)
BitNet b1.58 key results:
At the 3B parameter scale, BitNet b1.58 matches the full-precision (FP16) LLaMA baseline on perplexity benchmarks while being:
- 3.55x smaller in GPU memory than FP16
- 2.71x faster on a single device (latency)
At the 70B scale, the paper reports up to 8.9x higher throughput, made possible by the much larger batch size that fits in memory.
The zero values in the ternary representation provide implicit sparsity (roughly 1/3 of weights are zero), further reducing computation.
Advanced Quantization Algorithms#
QuIP and QuIP##
QuIP (Quantization with Incoherence Processing) and its successor QuIP# achieve usable 2-bit quantization with modest quality loss by exploiting the concept of incoherence in weight matrices.
The Incoherence Principle:
Quantization error is minimized when the weight matrix and the Hessian (input correlation matrix) are “incoherent” – meaning they have no concentrated structure. Formally, if a matrix \(W\) has its entries spread uniformly rather than concentrated in a few large values, rounding errors tend to cancel out statistically.
QuIP achieves incoherence by applying random orthogonal rotations:
$$W' = U W V^T$$
where \(U\) and \(V\) are random orthogonal matrices. The quantized version is:
$$\hat{W} = U^T \text{Quantize}(U W V^T) V$$
The rotation spreads outlier values across all entries, making the rotated matrix more amenable to uniform quantization.
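The effect of the rotation can be seen on a toy matrix with one planted outlier. The sketch below uses dense random orthogonal matrices from a QR decomposition and a plain per-tensor uniform quantizer; both are stand-ins chosen for brevity, and the helper names are illustrative.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # fix column signs


def uniform_quantize(w, bits):
    """Per-tensor min/max uniform quantization, returned in dequantized form."""
    steps = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / steps
    return np.round((w - lo) / scale) * scale + lo


rng = np.random.default_rng(0)
n = 64
W = rng.standard_normal((n, n))
W[0, 0] = 50.0  # plant a single large outlier

U, V = random_orthogonal(n, rng), random_orthogonal(n, rng)
W_rot = U @ W @ V.T
W_hat = U.T @ uniform_quantize(W_rot, bits=4) @ V   # quantize in the rotated basis
W_naive = uniform_quantize(W, bits=4)               # quantize directly

print("max |W|, max |W_rot|:", np.abs(W).max(), np.abs(W_rot).max())
print("error naive, rotated:", np.linalg.norm(W - W_naive), np.linalg.norm(W - W_hat))
```

Because \(U\) and \(V\) are orthogonal, the Frobenius error measured after rotating back equals the error made in the rotated basis, so any gain comes entirely from the rotated matrix having a tighter value range.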
QuIP# improvements:
Structured Hadamard rotations: Instead of storing full random orthogonal matrices, QuIP# uses a Hadamard matrix combined with a random sign vector. A Hadamard matrix of size \(2^k\) is itself a Kronecker product of \(2 \times 2\) Hadamard matrices, \(H = H_2 \otimes \cdots \otimes H_2\). This reduces storage from \(O(n^2)\) to \(O(n)\) (only the signs are stored) and enables fast application via the Fast Walsh-Hadamard Transform in \(O(n \log n)\).
E8 Lattice Quantization: Instead of rounding each scalar independently, QuIP# quantizes vectors of 8 values jointly using the \(E_8\) lattice.
The E8 Lattice:
The \(E_8\) lattice is a mathematical structure in 8-dimensional space with remarkable properties. It gives the densest sphere packing in 8 dimensions and is among the best known lattice quantizers for 8D data.
The \(E_8\) lattice points can be defined as:
$$E_8 = \left\{ x \in \mathbb{Z}^8 \cup \left(\mathbb{Z} + \frac{1}{2}\right)^8 : \sum_{i=1}^{8} x_i \equiv 0 \pmod{2} \right\}$$
That is, coordinates are either all integers or all half-integers, and their sum is even.
E8 Lattice Quantization (simplified 2D analogy):
Scalar Quantization: Lattice Quantization:
Each dimension independent Joint optimization in 8D
| . | . | . . .
-----+-----+-----+--- . . . . .
| . | . | . . .
-----+-----+-----+--- . . . . .
| . | . | . . .
Grid points: N^8 Lattice points: ~ N^8 / 4
(for N levels per dim) (denser packing, fewer wasted points)
The lattice quantizer finds the nearest \(E_8\) lattice point to each 8-dimensional weight vector:
$$\hat{w}_{1:8} = \arg\min_{v \in E_8 \cap \mathcal{C}} \|w'_{1:8} - v\|^2$$
where \(\mathcal{C}\) is the codebook subset used for 2-bit encoding. Each 8D lattice point is encoded with \(8 \times 2 = 16\) bits, yielding exactly 2 bits per weight.
QuIP# results: At 2 bits per weight on LLaMA-2 70B, QuIP# achieves a perplexity of approximately 4.15 on WikiText-2, compared to 3.32 for the FP16 baseline – a remarkably small degradation for 8x compression.
AQLM: Additive Quantization for Language Models#
AQLM applies multi-codebook quantization (a form of additive vector quantization) to LLM weight compression.
Core idea: Instead of quantizing each weight independently, AQLM groups weights into vectors and represents each vector as a sum of entries from multiple codebooks:
$$\hat{w}_{1:d} = \sum_{m=1}^{M} C_m[i_m]$$
where \(C_m \in \mathbb{R}^{K \times d}\) is the \(m\)-th codebook with \(K\) entries, each of dimension \(d\), and \(i_m \in \{0, 1, \ldots, K-1\}\) is the index into codebook \(m\).
AQLM Multi-Codebook Quantization:
Weight vector w = [0.12, -0.34, 0.56, 0.78, -0.91, 0.23, -0.45, 0.67]
Codebook 1: C1[3] = [0.1, -0.2, 0.3, 0.5, -0.6, 0.1, -0.3, 0.4]
Codebook 2: C2[7] = [0.02, -0.14, 0.26, 0.28, -0.31, 0.13, -0.15, 0.27]
-------------------------------------------------------
Approximation: [0.12, -0.34, 0.56, 0.78, -0.91, 0.23, -0.45, 0.67]
Stored: indices (3, 7) + codebooks C1, C2 (shared across all vectors)
Bit rate calculation:
For \(M\) codebooks, each with \(K = 2^B\) entries, quantizing vectors of dimension \(d\):
$$\text{bits per weight} = \frac{M \times B}{d} + \text{codebook overhead}$$
For example, with \(M = 2\), \(B = 8\) (256 entries per codebook), \(d = 8\):
$$\text{bits per weight} = \frac{2 \times 8}{8} = 2 \text{ bits}$$
The codebook overhead is amortized across the entire weight matrix and is typically negligible.
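Decoding and the bit-rate formula are simple enough to sketch directly. The codebooks below are mostly zeros, with only the two entries from the example filled in, purely for illustration; the function names are hypothetical.

```python
import numpy as np


def aqlm_decode(codebooks, indices):
    """Sum one entry per codebook. codebooks: (M, K, d), indices: length M."""
    return sum(codebooks[m][indices[m]] for m in range(len(indices)))


def bits_per_weight(num_codebooks, index_bits, dim):
    """Index cost only; codebook storage is amortized over the whole matrix."""
    return num_codebooks * index_bits / dim


M, K, d = 2, 256, 8
codebooks = np.zeros((M, K, d))
codebooks[0, 3] = [0.1, -0.2, 0.3, 0.5, -0.6, 0.1, -0.3, 0.4]
codebooks[1, 7] = [0.02, -0.14, 0.26, 0.28, -0.31, 0.13, -0.15, 0.27]

w_hat = aqlm_decode(codebooks, indices=(3, 7))
print(w_hat)
print(bits_per_weight(M, 8, d), "bits per weight")
```

Decoding is just table lookups and additions; the expensive part of AQLM is the encoding side, which has to search over index combinations.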
AQLM optimization uses beam search combined with fine-tuning:
- Initialize codebooks using k-means on weight vectors
- Beam search over index combinations to minimize \(\|W - \hat{W}\|_H^2\) (Hessian-weighted error)
- Fine-tune codebook entries end-to-end with a small calibration dataset
In its authors' evaluation, AQLM achieves state-of-the-art results at 2-bit precision, outperforming QuIP# on several benchmarks when both use the same bit budget.
HQQ: Half-Quadratic Quantization#
HQQ takes a fundamentally different approach to quantization by formulating it as a half-quadratic optimization problem, enabling fast, data-free quantization.
Problem formulation:
Most PTQ methods minimize the layer-wise output error:
$$\min_{\hat{W}} \|WX - \hat{W}X\|^2$$
This requires calibration data \(X\). HQQ instead directly minimizes the weight reconstruction error with a sparsity-promoting penalty:
$$\min_{Q} \|W - Q\|_p^p$$
where \(\|\cdot\|_p\) is the \(\ell_p\) norm with \(0 < p \leq 1\) (promoting sparse residuals), and \(Q\) is constrained to the quantization grid.
Half-quadratic splitting introduces an auxiliary variable \(Z\):
$$\min_{Q, Z} \|W - Z\|_p^p + \frac{\mu}{2}\|Z - Q\|_2^2$$
This decouples into two tractable sub-problems that are solved alternately:
- Z-update (proximal operator of the \(\ell_p\) norm): has a closed-form solution for \(p = 1\) (soft-thresholding) and in the limiting case \(p = 0\) (hard-thresholding)
- Q-update (nearest quantization level): simple rounding
HQQ Iteration:
Step 0: W = [0.12, -0.87, 0.34, 0.93, -0.21, 0.78, -0.56, 0.45]
Step 1 (Z-update): Apply proximal operator (soft-thresholding)
Z = [0.10, -0.85, 0.32, 0.91, -0.19, 0.76, -0.54, 0.43]
Step 2 (Q-update): Round to nearest INT4 grid point
Q = [0.13, -0.87, 0.33, 0.93, -0.20, 0.73, -0.53, 0.40]
Repeat steps 1-2 until convergence (typically 10-20 iterations)
HQQ advantages:
- No calibration data needed: Works directly on weights, no forward passes required
- Extremely fast: Quantizing a 70B model takes minutes, not hours
- Strong quality: Competitive with GPTQ and AWQ at INT4, and superior at INT3/INT2
- Simple implementation: No Hessian computation, no matrix decomposition
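The alternating structure can be sketched in NumPy as follows. This is a simplified illustration and not the reference implementation: it uses a generalized soft-thresholding operator for the \(\ell_p\) shrinkage step, keeps the per-row scale fixed, and re-estimates a per-row zero-point on each iteration. The zero-point refinement and all hyperparameter values (`p`, `beta`, `kappa`, `iters`) are assumptions of this sketch.

```python
import numpy as np


def shrink(x, beta, p):
    """Generalized soft-thresholding, used here as the l_p shrinkage step."""
    if p == 1:
        return np.sign(x) * np.maximum(np.abs(x) - 1.0 / beta, 0.0)
    threshold = (1.0 / beta) * np.power(np.abs(x) + 1e-8, p - 1)
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)


def hqq_sketch(W, bits=4, p=0.7, beta=10.0, kappa=1.01, iters=20):
    """Alternate rounding (Q-update) and shrinkage (Z-update), refining the zero-point."""
    qmax = 2 ** bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = qmax / np.maximum(w_max - w_min, 1e-8)  # per-row scale, kept fixed
    zero = -w_min * scale                           # initial zero-point
    for _ in range(iters):
        W_q = np.clip(np.round(W * scale + zero), 0, qmax)  # round to the grid
        W_r = (W_q - zero) / scale                          # dequantized weights
        W_e = shrink(W - W_r, beta, p)                      # sparse residual estimate
        zero = np.mean(W_q - (W - W_e) * scale, axis=1, keepdims=True)
        beta *= kappa
    W_q = np.clip(np.round(W * scale + zero), 0, qmax)
    return W_q.astype(np.uint8), scale, zero


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 32))
W_q, scale, zero = hqq_sketch(W)
W_hat = (W_q - zero) / scale
print("mean abs error:", np.abs(W - W_hat).mean())
```

No activations or Hessians appear anywhere in the loop, which is what makes this style of quantization data-free and fast.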
Mixed-Precision Quantization#
Mixed-precision quantization assigns different bit-widths to different layers (or even different channels/heads) based on their sensitivity to quantization. The insight is simple: not all layers are equally sensitive. Some layers can tolerate 2-bit quantization with minimal accuracy loss, while others require 8 bits.
Layer Sensitivity Analysis#
The most straightforward approach measures each layer’s sensitivity independently:
Perturbation-based sensitivity:
For each layer \(l\), quantize it to \(b\) bits while keeping all other layers at full precision, and measure the change in task loss:
$$\Delta L_l(b) = L(\theta_1, \ldots, \theta_l^{(b)}, \ldots, \theta_N) - L(\theta_1, \ldots, \theta_N)$$
Sensitivity Profile of a Typical LLM:
Sensitivity
|
|## ##
|## ##
|### ###
|### ###
|#### ####
|#### ## ## ####
|##### #### #### #####
|###### ###### ###### ######
|######## ######## ######## ########
|############################################################
+-------------------------------------------------------------> Layer
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
First & Last layers: HIGH sensitivity
Middle layers: LOW sensitivity
This U-shaped sensitivity curve is commonly observed across architectures. The first few layers (embedding projection, early attention) and the last few layers (final attention, output projection) tend to be the most sensitive, while middle layers are more robust to quantization.
Hessian-based sensitivity (second-order):
The sensitivity can be estimated more efficiently using the Hessian:
$$\Delta L_l \approx \frac{1}{2} \delta_l^T H_l \delta_l = \frac{1}{2} \text{tr}(\delta_l \delta_l^T H_l)$$
where \(\delta_l = \theta_l - \theta_l^{(b)}\) is the quantization perturbation and \(H_l\) is the Hessian of the loss with respect to layer \(l\) parameters. The trace of the Hessian (or its top eigenvalue) serves as a sensitivity metric.
HAWQ: Hessian AWare Quantization#
HAWQ (and its successors HAWQ-V2, HAWQ-V3) use Hessian information to automatically determine per-layer bit-widths.
HAWQ-V1 uses the top eigenvalue of the per-layer Hessian:
$$\Omega_l = \lambda_{\max}(H_l)$$
Layers with larger \(\Omega_l\) receive more bits. The bit-width assignment is formulated as a constrained optimization:
$$\min_{\{b_l\}} \sum_{l=1}^{L} \Omega_l \cdot \mathbb{E}[\|\delta_l(b_l)\|^2] \quad \text{s.t.} \quad \sum_{l=1}^{L} n_l \cdot b_l \leq B_{\text{total}}$$
where \(n_l\) is the number of parameters in layer \(l\), \(b_l \in \{2, 4, 8\}\) is the bit-width, and \(B_{\text{total}}\) is the total bit budget.
HAWQ-V2 improves by using the average Hessian trace instead of the top eigenvalue:
$$\bar{\Omega}_l = \frac{1}{n_l} \text{tr}(H_l)$$
This is more robust and cheaper to compute (via Hutchinson’s stochastic trace estimator):
$$\text{tr}(H_l) \approx \frac{1}{T} \sum_{t=1}^{T} z_t^T H_l z_t$$
where \(z_t\) are random Rademacher vectors (\(\pm 1\) with equal probability).
HAWQ-V3 extends to integer-only quantization with mixed INT4/INT8 and hardware-aware latency constraints:
$$\min_{\{b_l\}} \sum_{l=1}^{L} \bar{\Omega}_l \cdot \mathbb{E}[\|\delta_l(b_l)\|^2] \quad \text{s.t.} \quad \text{LAT}(\{b_l\}) \leq T_{\text{target}}$$
where \(\text{LAT}(\cdot)\) is the measured latency on target hardware.
HAQ: Hardware-Aware Quantization with Reinforcement Learning#
HAQ frames mixed-precision quantization as a sequential decision problem solved by reinforcement learning.
State space: For each layer \(l\), the state encodes:
- Layer index, type (Conv, FC, Attention, etc.)
- Input/output channels, kernel size
- Number of parameters
- Computational cost (FLOPs)
- Current model size and latency
Action space: Choose a bit-width \(b_l \in \{1, 2, 3, 4, 5, 6, 7, 8\}\) for layer \(l\).
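Reward: After all layers are assigned, the reward is based on the accuracy change relative to the full-precision model, subject to a resource budget:
$$R = -\Delta \text{Accuracy} \quad \text{s.t.} \quad \text{Model size} \leq S_{\text{target}} \text{ or } \text{Latency} \leq T_{\text{target}}$$
In the HAQ paper the budget is enforced on the action side rather than through the reward: if the chosen policy exceeds the budget, bit-widths are decreased layer by layer until the constraint is met.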
HAQ Reinforcement Learning Loop:
RL Agent (DDPG)
|
| action: bit-width for layer l
v
[Layer 0] --> [Layer 1] --> ... --> [Layer L-1]
| | |
| state | state | state
v v v
(layer info, (layer info, (layer info,
remaining remaining remaining
budget) budget) budget)
|
v
Evaluate accuracy
|
v
Reward R
HAQ uses DDPG (Deep Deterministic Policy Gradient), a continuous-action RL algorithm, where the continuous action is mapped to discrete bit-widths via rounding. The agent is trained on a proxy task (e.g., a few hundred calibration samples) and generalizes well.
Key HAQ findings:
- On MobileNet-V2, HAQ achieves 2x compression with only 0.3% accuracy drop
- Depthwise separable convolutions are assigned higher bit-widths (more sensitive)
- The RL agent discovers hardware-specific patterns: on accelerators with efficient INT8 units, it prefers INT8 over INT4 even when INT4 would fit the size budget
Transformer and LLM-Specific Challenges#
Activation Outliers#
Transformers exhibit persistent activation outliers: individual features with magnitudes 10-100x larger than the rest. These outliers appear in specific hidden dimensions, consistently across many tokens and layers (reported by Dettmers et al. in the “LLM.int8()” paper).
Activation magnitude across hidden dimensions (typical LLM):
Magnitude
100 | *
| *
50 | *
| *
10 | * * ** * * * * * * ** * * * ** *
5 | ** **** ** ** ** * * ** ** ** ** ** * ** **
1 |*****************************************************
+-----------------------------------------------------> Hidden dim
^
Outlier channel(s)
These outliers cause catastrophic quantization error if quantized uniformly. Solutions include:
- LLM.int8(): Mixed INT8/FP16 decomposition – outlier dimensions stay in FP16
- SmoothQuant: Migrate quantization difficulty from activations to weights via a mathematically equivalent scaling transform
- Rotation-based methods: Apply Hadamard rotation to spread outliers (as in QuIP#)
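The "mathematically equivalent scaling transform" used by SmoothQuant can be illustrated in a few lines: divide each activation channel by a factor \(s_j\) and multiply the matching weight row by the same factor, so the product is unchanged. The sketch uses the commonly cited choice \(s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}\); the function name and the toy data are assumptions of the example.

```python
import numpy as np


def smooth(X, W, alpha=0.5):
    """Rescale so that (X / s) @ (s * W) equals X @ W. X: (tokens, in), W: (in, out)."""
    act_max = np.abs(X).max(axis=0)   # per input channel, over tokens
    w_max = np.abs(W).max(axis=1)     # per input channel, over outputs
    s = act_max ** alpha / w_max ** (1 - alpha)
    return X / s, W * s[:, None]


rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))
X[:, 3] *= 50.0                       # one outlier channel
W = rng.standard_normal((8, 4))

X_s, W_s = smooth(X, W)
print("activation max before/after:", np.abs(X).max(), np.abs(X_s).max())
print("weight max before/after:", np.abs(W).max(), np.abs(W_s).max())
```

The activation range shrinks while the weight range grows, which is the intended trade: weights are static and can be quantized offline with finer granularity.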
KV-Cache Quantization#
The Key-Value (KV) cache is a major memory bottleneck during autoregressive LLM inference. For each token generated, the KV cache grows by:
$$\Delta_{\text{KV}} = 2 \times L \times H \times d_h \times b$$
where \(L\) is the number of layers, \(H\) is the number of KV heads (which may differ from query heads in GQA), \(d_h\) is the head dimension, and \(b\) is the bytes per element.
Total KV-cache memory for a sequence of length \(n\):
$$M_{\text{KV}} = 2 \times L \times H \times d_h \times n \times b$$
Concrete example – LLaMA-2 70B with 32K context:
| Parameter | Value |
|---|---|
| Layers (\(L\)) | 80 |
| KV heads (\(H\), GQA) | 8 |
| Head dimension (\(d_h\)) | 128 |
| Sequence length (\(n\)) | 32768 |
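Plugging the table's values into the formula for \(M_{\text{KV}}\) gives the cache size for a single sequence at different element widths. The helper name is illustrative, and the sub-byte cases ignore the storage needed for quantization scales.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    """Keys and values (factor 2) for every layer, head, position, and channel."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 2 ** 30
config = dict(layers=80, kv_heads=8, head_dim=128, seq_len=32768)

for name, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5), ("INT2", 0.25)]:
    size = kv_cache_bytes(bytes_per_elem=nbytes, **config)
    print(f"{name}: {size / GIB:.2f} GiB")
```

At FP16 this comes to 10 GiB per sequence (320 KiB per token), so a batch of eight such sequences needs 80 GiB for the cache alone; 4-bit and 2-bit caches reduce the per-sequence figure to 2.5 GiB and 1.25 GiB.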
KV-cache quantization approaches:
| Method | Bits | Key Insight | Quality |
|---|---|---|---|
| KIVI | K:2, V:2 | Per-channel K, per-token V quantization | ~0.1 PPL increase |
| KVQuant | 2-4 | Sensitivity-aware, non-uniform | < 0.1 PPL increase |
| GEAR | 2-4 | Low-rank + sparse residual | Minimal loss |
| CacheQuant | 4 | Outlier-aware dynamic quantization | < 0.05 PPL increase |
A key asymmetry: Keys and Values have different quantization sensitivities. Keys participate in the softmax attention computation where small errors can shift probability mass significantly, while Values are linearly combined. However, Keys tend to have more structured distributions (amenable to per-channel quantization), while Values have more per-token variation.
Attention Score Quantization#
The attention mechanism involves:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Quantizing the intermediate attention scores (\(QK^T\)) and the post-softmax probabilities is challenging because:
- Pre-softmax scores can have large dynamic range across heads and positions
- Post-softmax probabilities are in \([0, 1]\) with a heavy-tailed distribution (most values near 0, a few near 1)
- Causal masking introduces discontinuities (negative infinity values)
Effective strategies:
- Quantize \(Q\) and \(K\) to INT8 with per-head scaling, compute \(QK^T\) in INT32, then dequantize before softmax
- Keep softmax computation in FP16/FP32 (numerically sensitive)
- Quantize the attention output (\(\text{softmax} \times V\)) to INT8
Quantized Attention Computation:
Q (INT8) x K^T (INT8) -> S (INT32) -> dequant -> S (FP16)
|
softmax (FP16)
|
P (FP16)
|
P (FP16) x V (INT8) -> O (INT32)
|
dequant -> O (FP16)
Vision Transformer Quantization#
Vision Transformers (ViTs) present distinct quantization challenges compared to language models:
ViT-Specific Challenges#
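Post-LayerNorm activations: The activations that come out of LayerNorm in ViTs show strong inter-channel variation, so a single per-tensor scale fits some channels poorly. This differs from the outlier pattern seen in LLMs (which typically use pre-LayerNorm or RMSNorm).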
Softmax attention bottleneck: ViTs process all spatial tokens simultaneously (no causal mask), leading to attention maps with very high entropy. Small quantization errors in attention probabilities can shift focus to wrong spatial regions.
Patch embedding sensitivity: The initial patch embedding layer projects raw pixel values to token representations. Quantization errors here propagate through the entire network.
Class token dependence: Classification ViTs rely on a single [CLS] token, making the network especially sensitive to quantization error that affects this token’s representation.
Quantization strategies for ViTs:
| Strategy | Description | Typical Accuracy Impact |
|---|---|---|
| PTQ4ViT | Twin uniform quantization for softmax, Hessian-guided | -0.5% at W4A4 |
| FQ-ViT | Power-of-two factor for LayerNorm, log2 quantizer for softmax | -0.3% at W4A4 |
| RepQ-ViT | Reparameterize LayerNorm and softmax to quantization-friendly forms | -0.5% at W4A4 |
| I-ViT | Integer-only ViT with Shiftmax and ShiftGELU | -0.2% at W8A8 |
| NoisyQuant | Add fixed noise before quantization to break outlier structure | -0.4% at W8A8 |
Log2 quantizer for post-softmax values:
Since attention probabilities follow a roughly log-normal distribution after softmax, a log-scale quantizer is more appropriate:
$$q = \text{clamp}\left(\lfloor -\log_2(p) \rceil, 0, 2^b - 1\right)$$
$$\hat{p} = 2^{-q}$$
This places more quantization levels near zero (where most probabilities lie) and fewer near one.
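A direct NumPy transcription of the log2 quantizer is shown below. The only addition is a small floor on \(p\) so that exact zeros map to the largest code instead of producing an infinite logarithm; the function names are illustrative.

```python
import numpy as np


def log2_quantize(p, bits=4):
    """q = clamp(round(-log2(p)), 0, 2^bits - 1) for probabilities p in [0, 1]."""
    qmax = 2 ** bits - 1
    p = np.maximum(p, 2.0 ** -(qmax + 1))   # floor so that p = 0 clamps to qmax
    return np.clip(np.round(-np.log2(p)), 0, qmax).astype(np.uint8)


def log2_dequantize(q):
    """p_hat = 2^(-q); cast first so the negation does not wrap in uint8."""
    return 2.0 ** (-q.astype(np.float64))


p = np.array([1.0, 0.5, 0.25, 0.3, 0.01, 0.0])
q = log2_quantize(p)
print(q, log2_dequantize(q))
```

Because every reconstructed value is a power of two, multiplying by \(\hat{p}\) can be implemented as a bit shift in integer-only pipelines.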
Diffusion Model Quantization#
Diffusion models (DDPM, Stable Diffusion, DALL-E, etc.) introduce unique quantization challenges due to their iterative denoising process.
Time-Step Dependent Distributions#
The core challenge: diffusion models are evaluated at many different noise levels (time steps), and the activation distributions change dramatically across time steps.
Activation distribution at different time steps:
t = 0 (clean): t = 500 (medium): t = 1000 (noisy):
*** **** *****
***** ****** ********
******* ******** **********
********* ********** ************
narrow, moderate, wide,
sharp peak broader very spread out
A single set of quantization parameters (scale, zero-point) cannot optimally handle all time steps. Solutions include:
- Time-step aware quantization (TDQ): Maintain separate quantization parameters for different time-step ranges
- Temporal information-aware quantization: Use the time-step embedding to dynamically adjust quantization parameters
- PTQ4DM: Calibrate quantization parameters on a representative set of time steps
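The idea of keeping separate quantization parameters per time-step range can be sketched as follows. This is a simplified illustration using symmetric max-abs calibration on synthetic activations; the function names, the equal-width grouping, and the synthetic data are assumptions for this example and do not reproduce TDQ, PTQ4DM, or any specific method.

```python
import numpy as np

def calibrate_timestep_groups(acts_by_t, num_groups, total_steps, bits=8):
    """One symmetric scale per time-step group, from max-abs calibration.

    acts_by_t maps an integer time step to an array of activations.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.zeros(num_groups)
    for t, acts in acts_by_t.items():
        g = min(t * num_groups // total_steps, num_groups - 1)
        max_abs[g] = max(max_abs[g], np.abs(acts).max())
    return np.maximum(max_abs, 1e-8) / qmax

def fake_quantize(x, t, scales, total_steps, bits=8):
    """Quantize and dequantize x with the scale of the group containing t."""
    qmax = 2 ** (bits - 1) - 1
    g = min(t * len(scales) // total_steps, len(scales) - 1)
    q = np.clip(np.rint(x / scales[g]), -qmax - 1, qmax)
    return q * scales[g]

# Synthetic activations whose spread grows with the time step (assumption).
rng = np.random.default_rng(0)
acts_by_t = {t: rng.normal(0.0, 0.1 + t / 1000, size=4096)
             for t in range(0, 1000, 50)}

scales = calibrate_timestep_groups(acts_by_t, num_groups=4, total_steps=1000)
x = acts_by_t[0]
x_hat = fake_quantize(x, t=0, scales=scales, total_steps=1000)
```

With four groups, the low-noise steps get a much finer scale than the high-noise steps, which is exactly what a single shared scale cannot provide.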
Diffusion-Specific Methods#
| Method | Approach | Result |
|---|---|---|
| Q-Diffusion | Time-step aware PTQ, shortcut-splitting | W4A8 with < 0.5 FID increase |
| PTQD | Time-step grouping, correlation-aware | W4A8 competitive with FP32 |
| TDQ | Dedicated scales per time-step group | W8A8 near-lossless |
| EfficientDM | QAT with quantization-aware low-rank adaptation | W4A4 with minor FID increase |
Error accumulation is a critical issue: in diffusion models, the output of step \(t\) becomes the input to step \(t-1\). Quantization errors accumulate across the 20-50+ denoising steps:
$$\epsilon_{\text{total}} \approx \sum_{t=T}^{1} \epsilon_t \cdot \prod_{s=1}^{t-1} (1 + \alpha_s)$$
where \(\epsilon_t\) is the per-step quantization error and \(\alpha_s\) captures error amplification. This makes diffusion models more sensitive to quantization than single-pass models.
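To get a feel for the accumulation formula above, the following sketch evaluates it directly. The per-step error of 0.01 and the amplification factor of 0.02 are made-up illustrative values, not measurements from any model.

```python
def accumulated_error(eps, alpha):
    """Evaluate sum_{t=T..1} eps_t * prod_{s=1..t-1} (1 + alpha_s).

    eps[t - 1] holds eps_t and alpha[s - 1] holds alpha_s, for t, s = 1..T.
    """
    total = 0.0
    for t in range(len(eps), 0, -1):
        amplification = 1.0
        for s in range(1, t):
            amplification *= 1.0 + alpha[s - 1]
        total += eps[t - 1] * amplification
    return total

T = 50
eps = [0.01] * T                               # hypothetical per-step error
no_amp = accumulated_error(eps, [0.0] * T)     # plain sum: 50 * 0.01
with_amp = accumulated_error(eps, [0.02] * T)  # each later step amplifies by 2%
```

Even a 2% per-step amplification makes the total noticeably larger than the plain sum of per-step errors, and the gap widens as the number of denoising steps grows.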
Practical recommendation for Stable Diffusion:
- UNet: W8A8 is safe; W4A8 is achievable with careful calibration; W4A4 requires QAT
- VAE decoder: Keep at FP16 (highly sensitive, runs only once)
- Text encoder (CLIP): W8A8 is typically safe
- Time-step embedding MLP: Keep at higher precision (FP16 or INT8)
Inference Optimization and the Roofline Model#
The Roofline Model for Quantized Inference#
Understanding when quantization actually speeds up inference requires the roofline model, which characterizes computation as either compute-bound or memory-bound.
Arithmetic intensity (operational intensity):
$$I = \frac{\text{FLOPs}}{\text{Bytes transferred}}$$
The roofline model defines achievable performance as:
$$\text{Performance} = \min\left(\text{Peak FLOPS}, \quad I \times \text{Memory Bandwidth}\right)$$
Roofline Model with Quantization (figure: performance in TOPS versus arithmetic intensity in FLOPs/byte):
- Memory-bound region (low arithmetic intensity, e.g., LLM decode at batch=1): performance rises with intensity, and quantization helps with bandwidth.
- Compute-bound region (high arithmetic intensity, e.g., LLM prefill or CNN batch inference): performance is capped by the compute ceiling, and quantization helps with peak TOPS. The ceilings are ordered Peak INT4 > Peak INT8 > Peak FP16.
LLM inference phases:
Prefill (prompt processing): High arithmetic intensity (large matrix multiplications with many tokens). Often compute-bound. Quantization helps by increasing peak throughput (on GPUs with native INT4 Tensor Cores, such as NVIDIA Ampere, peak INT4 throughput is 2x that of INT8).
Decode (token generation): Low arithmetic intensity (matrix-vector multiply, batch size = 1). Almost always memory-bound. Quantization helps primarily by reducing memory bandwidth requirements.
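The roofline formula can be turned into a small calculator to see why batch-1 decode is memory-bound. The hardware numbers below are round hypothetical values, not the specification of any particular GPU, and the intensity estimate counts only weight traffic.

```python
def attainable_performance(intensity, peak_ops, mem_bw):
    """Roofline: min(peak compute, arithmetic intensity * memory bandwidth)."""
    return min(peak_ops, intensity * mem_bw)

def gemv_intensity(bits_per_weight):
    """Approximate intensity of a batch-1 matrix-vector product.

    Each weight contributes 2 FLOPs (multiply + add) and bits/8 bytes of
    traffic. Activation and KV-cache traffic are ignored (assumption).
    """
    return 2.0 / (bits_per_weight / 8.0)

PEAK_OPS = 300e12   # hypothetical peak, operations per second
MEM_BW = 1e12       # hypothetical bandwidth, bytes per second

perf_fp16 = attainable_performance(gemv_intensity(16), PEAK_OPS, MEM_BW)
perf_int4 = attainable_performance(gemv_intensity(4), PEAK_OPS, MEM_BW)
ridge_point = PEAK_OPS / MEM_BW   # intensity where compute becomes the limit
```

Under these assumptions both decode configurations sit far to the left of the ridge point (1 and 4 FLOPs/byte versus 300), so performance scales with bytes saved rather than with peak compute.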
For the decode phase, the speedup from quantization is approximately:
$$\text{Speedup}_{\text{decode}} \approx \frac{b_{\text{original}}}{b_{\text{quantized}}} \times \frac{\text{BW}_{\text{quantized}}}{\text{BW}_{\text{original}}}$$
For INT4 vs FP16 on the same hardware (bandwidth ratio = 1):
$$\text{Speedup}_{\text{decode}} \approx \frac{16}{4} = 4\times$$
In practice, the speedup is lower (2-3x) due to dequantization overhead, group scale fetching, and non-weight memory accesses (KV cache, activations).
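A rough way to see why the ideal 4x is not reached is to model per-token decode time as total bytes read divided by bandwidth, and then add the traffic that quantization does not shrink. The 1 GB of non-weight traffic below is a made-up figure for illustration, and dequantization compute is ignored entirely.

```python
def decode_speedup(num_params, bits_orig, bits_quant, other_bytes):
    """Ratio of bytes read per token before and after weight quantization.

    other_bytes is traffic that is not reduced by weight quantization
    (KV cache, activations); its value here is a hypothetical input.
    """
    bytes_orig = num_params * bits_orig / 8 + other_bytes
    bytes_quant = num_params * bits_quant / 8 + other_bytes
    return bytes_orig / bytes_quant

# 7B parameters, FP16 -> 4-bit with group scales (4.25 effective bits).
ideal = decode_speedup(7e9, 16, 4.0, other_bytes=0.0)
with_scales = decode_speedup(7e9, 16, 4.25, other_bytes=0.0)
with_kv = decode_speedup(7e9, 16, 4.25, other_bytes=1e9)
```

In this model, fetching group scales alone lowers the bound from 4x to about 3.8x, and a modest amount of non-weight traffic pushes it toward 3x, before any kernel overhead is counted.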
Dequantization Overhead#
Quantized weights must be dequantized before computation (or during, in fused kernels). The dequantization cost depends on the quantization scheme:
| Scheme | Dequant Operations per Weight | Relative Overhead |
|---|---|---|
| Per-tensor symmetric | 1 multiply | Very low |
| Per-channel symmetric | 1 multiply | Low |
| Per-group affine (g=128) | 1 multiply + 1 add | Low |
| NF4 (lookup table) | 1 table lookup + 1 multiply | Medium |
| AQLM (codebook) | 1-2 table lookups + 1 add | Medium-High |
| QuIP# (E8 lattice + rotation) | Lattice decode + Hadamard transform | High |
Efficient GPU kernels (e.g., from Marlin, ExLlamaV2, or TensorRT-LLM) fuse dequantization with the matrix multiply, hiding most of the overhead behind the memory latency of loading weights.
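As a concrete reference for the "1 multiply + 1 add" row, here is a minimal NumPy sketch of per-group affine quantization and dequantization. It stores a floating-point minimum per group instead of an integer zero-point, which is one of several equivalent conventions; the function names are illustrative and do not correspond to any library's API.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=128):
    """Affine quantization with one (scale, offset) pair per group."""
    w = w.reshape(-1, group_size)
    qmax = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    q = np.clip(np.rint((w - w_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, w_min

def dequantize_groupwise(q, scale, w_min):
    """Per weight: one multiply by the group scale, one add of the offset."""
    return q.astype(np.float32) * scale + w_min

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, scale, w_min = quantize_groupwise(w, bits=4, group_size=128)
w_hat = dequantize_groupwise(q, scale, w_min)
max_err = np.abs(w.reshape(-1, 128) - w_hat).max()
```

A fused kernel performs the same multiply and add in registers right after loading the packed integers, so the dequantized weights never touch global memory.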
End-to-End Throughput Comparison#
The following table compares practical inference throughput for a 7B-parameter LLM on a single NVIDIA RTX 4090 (24 GB VRAM):
| Quantization | Bits/Weight | Model Size | Tokens/sec (decode) | Perplexity (WikiText-2) |
|---|---|---|---|---|
| FP16 | 16 | 14.0 GB | ~35 | 5.68 (baseline) |
| GPTQ INT8 | 8 | 7.0 GB | ~65 | 5.69 |
| GPTQ INT4 (g128) | 4.25 | 4.0 GB | ~110 | 5.85 |
| AWQ INT4 (g128) | 4.25 | 4.0 GB | ~115 | 5.79 |
| GGUF Q4_K_M | 4.85 | 4.6 GB | ~100 (CPU+GPU) | 5.82 |
| GGUF Q3_K_M | 3.875 | 3.5 GB | ~120 (CPU+GPU) | 6.15 |
| GGUF Q2_K | 2.5625 | 2.5 GB | ~135 (CPU+GPU) | 7.89 |
| QuIP# 2-bit | 2 | 2.0 GB | ~80 | 6.45 |
| AQLM 2-bit | 2 | 2.0 GB | ~75 | 6.32 |
| BitNet 1.58b | 1.58 | ~1.6 GB | ~150 (specialized) | ~5.70 (trained) |
Note: BitNet requires training from scratch with ternary weights; all others are post-training quantization applied to a pre-trained FP16 model.
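The fractional bits-per-weight figures in the table come from per-group metadata. One accounting that reproduces the 4.25 value for g128 is a 16-bit scale plus a 16-bit zero-point shared by each group of 128 weights; this is an assumption for illustration, since actual storage layouts differ between formats.

```python
def effective_bits(weight_bits, group_size, scale_bits=16, zero_bits=16):
    """Bits per weight including per-group scale and zero-point metadata."""
    return weight_bits + (scale_bits + zero_bits) / group_size

def model_size_gb(num_params, bits_per_weight):
    """Weight storage in GB (10^9 bytes), ignoring unquantized layers."""
    return num_params * bits_per_weight / 8 / 1e9

int4_g128_bits = effective_bits(4, 128)               # 4.25
fp16_size = model_size_gb(7e9, 16)                    # 14.0 GB
int4_g128_size = model_size_gb(7e9, int4_g128_bits)   # about 3.72 GB
```

Real checkpoints are typically somewhat larger than this estimate because embeddings and some other layers are usually kept at higher precision.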
State-of-the-Art Comparison (2024-2025)#
The following table summarizes the major quantization methods, their characteristics, and results as of early 2025:
| Method | Year | Type | Bits | Calibration Data | Key Innovation | LLaMA-2 7B PPL | LLaMA-2 70B PPL |
|---|---|---|---|---|---|---|---|
| GPTQ | 2022 | PTQ | 3-8 | Yes (128 samples) | OBQ with lazy batching | 6.29 (4-bit) | 3.85 (4-bit) |
| AWQ | 2023 | PTQ | 3-8 | Yes (small) | Activation-aware scaling | 5.89 (4-bit) | 3.56 (4-bit) |
| SqueezeLLM | 2023 | PTQ | 3-4 | Yes | Dense-and-sparse; non-uniform | 5.88 (4-bit) | – |
| QuIP | 2023 | PTQ | 2-4 | Yes | Incoherence processing | 6.90 (2-bit) | 4.55 (2-bit) |
| QuIP# | 2023 | PTQ | 2-4 | Yes | E8 lattice, Hadamard rotation | 6.45 (2-bit) | 4.15 (2-bit) |
| AQLM | 2024 | PTQ | 2-4 | Yes | Multi-codebook additive VQ | 6.32 (2-bit) | 4.02 (2-bit) |
| HQQ | 2023 | PTQ | 2-8 | No | Half-quadratic optimization | 6.58 (4-bit) | 3.68 (4-bit) |
| GGUF IQ2_XS | 2024 | PTQ | 2.3 | Yes | Importance-weighted lattice | 7.21 (2.3-bit) | 4.42 (2.3-bit) |
| OmniQuant | 2023 | PTQ/QAT | 2-8 | Yes | Learnable weight clipping + equiv. transform | 5.86 (4-bit) | 3.54 (4-bit) |
| QLoRA NF4 | 2023 | QAT | 4 | Training data | NF4 + double quantization | 5.70* (fine-tuned) | – |
| SpQR | 2023 | PTQ | 3-4 | Yes | Sparse outlier + dense quantized | 5.84 (4-bit) | 3.53 (4-bit) |
| SmoothQuant | 2022 | PTQ | W8A8 | Yes | Smoothing transform for activations | – (W8A8) | – (W8A8) |
| KIVI | 2024 | PTQ | KV:2 | Yes | Asymmetric K/V quantization | ~0.1 PPL increase | ~0.1 PPL increase |
| BitNet b1.58 | 2024 | QAT | 1.58 | Training data | Ternary weights from scratch | ~5.7 (trained) | – |
| OneBit | 2024 | QAT | 1 | Training data | 1-bit with value-aware knowledge distillation | ~6.2 (trained) | – |
| EfficientQAT | 2024 | QAT | 2-4 | Training data | Block-wise QAT + end-to-end | 5.72 (4-bit) | 3.42 (4-bit) |
*QLoRA perplexity varies by fine-tuning task and dataset.
Key takeaways from the 2024-2025 landscape:
4-bit is the sweet spot for post-training quantization: methods like AWQ, GPTQ, and HQQ achieve near-lossless compression at 4x size reduction.
2-bit PTQ is viable for large models: QuIP#, AQLM, and GGUF IQ variants push the frontier below 3 bits, with 70B+ models maintaining reasonable quality. The larger the model, the more gracefully it quantizes.
Below 2 bits requires training-aware methods: BitNet b1.58 demonstrates that training from scratch with extreme quantization can match full-precision performance, but this requires the full training compute budget.
KV-cache quantization is critical: For long-context applications, KV-cache memory can exceed model weight memory. Specialized methods like KIVI enable 2-bit KV caches with minimal quality loss.
Hardware support is evolving: NVIDIA Blackwell (B100/B200) adds native FP4 Tensor Cores. AMD MI300X supports FP8. Custom silicon (Groq, Cerebras) increasingly targets INT4/INT8. Software stacks (TensorRT-LLM, vLLM, llama.cpp) are key enablers.
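The claim that KV-cache memory can exceed weight memory is easy to check with back-of-the-envelope arithmetic. The sketch below uses the LLaMA-2 7B shape (32 layers, 32 KV heads, head dimension 128) with a hypothetical serving load of 8 sequences of 4096 tokens, and it ignores the scale and zero-point metadata that a real 2-bit cache would also store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bits):
    """Bytes for keys and values across all layers (metadata excluded)."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch
    return elements * bits // 8

# LLaMA-2 7B shape; batch size and sequence length are hypothetical.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

kv_fp16 = kv_cache_bytes(**shape, seq_len=4096, batch=8, bits=16)
kv_2bit = kv_cache_bytes(**shape, seq_len=4096, batch=8, bits=2)
weights_fp16 = 7_000_000_000 * 2   # about 14 GB of FP16 weights
```

Under these assumptions the FP16 cache is 16 GiB, which is already larger than the FP16 weights, while a 2-bit cache shrinks it by 8x before metadata.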
Practical Decision Guide#
Choosing a quantization strategy depends on your constraints. Here is a decision framework:
Start: do you have a training compute budget?
- Yes: do you need fewer than 2 bits?
  - Yes: BitNet b1.58 (train from scratch)
  - No: QLoRA NF4 (fine-tune an adapter)
- No: do you need fewer than 3 bits?
  - Yes: AQLM/QuIP# (2-bit PTQ). Do you need fast quantization?
    - Yes: HQQ (no calibration)
    - No: AQLM (better quality)
  - No: AWQ/GPTQ INT4 (4-bit PTQ, best balance). Is the deployment hardware-specific?
    - Yes: HAQ (RL-based search)
    - No: AWQ + Marlin kernel
Conclusion#
Extreme and mixed-precision quantization has progressed from an academic curiosity to a practical necessity. The key developments of 2024-2025 demonstrate that:
- FP8 has become the standard for training, with hardware support now widespread.
- INT4 with group quantization (AWQ, GPTQ, GGUF K-quants) is the production standard for LLM inference.
- 2-bit quantization (QuIP#, AQLM) is practical for the largest models (70B+), enabling single-GPU deployment of models that previously required multi-node clusters.
- 1.58-bit (BitNet b1.58) points toward a future where extreme quantization is built into the training process, potentially eliminating floating-point multiply hardware entirely.
- Mixed-precision strategies (HAWQ, HAQ) provide the theoretical and practical framework for optimally allocating bits across heterogeneous model components.
The field continues to advance rapidly. As new architectures (Mixture of Experts, State Space Models, hybrid designs) and new hardware (FP4 Tensor Cores, custom accelerators) emerge, the quantization landscape will continue to evolve. The fundamental principle remains: compress aggressively where the model is robust, preserve precision where it is sensitive, and always measure on your target task and hardware.
References#
- Micikevicius et al., “FP8 Formats for Deep Learning,” arXiv:2209.05433 (2022)
- Dettmers et al., “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,” NeurIPS 2022
- Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs,” NeurIPS 2023
- Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,” ICLR 2023
- Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” MLSys 2024
- Chee et al., “QuIP: 2-Bit Quantization of Large Language Models With Guarantees,” NeurIPS 2023
- Tseng et al., “QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks,” ICML 2024
- Egiazarian et al., “AQLM: Extreme Compression of Large Language Models via Additive Quantization,” ICML 2024
- Badri & Shaji, “Half-Quadratic Quantization of Large Machine Learning Models,” Mobius Labs blog (2023)
- Dong et al., “HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision,” ICCV 2019
- Wang et al., “HAQ: Hardware-Aware Automated Quantization with Mixed Precision,” CVPR 2019
- Ma et al., “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits,” arXiv:2402.17764 (2024)
- Liu et al., “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache,” arXiv:2402.02750 (2024)
- Li et al., “Q-Diffusion: Quantizing Diffusion Models,” ICCV 2023
- Yuan et al., “PTQ4ViT: Post-Training Quantization for Vision Transformers,” ECCV 2022
- Xiao et al., “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models,” ICML 2023