Extreme and Mixed-Precision Quantization: From FP8 to Binary Neural Networks

Introduction

Model quantization has evolved far beyond the classic INT8 regime. As large language models (LLMs) surpass hundreds of billions of parameters and vision/diffusion models demand ever-increasing computational budgets, researchers have pushed quantization to its extreme limits. This post provides a deep, technical exploration of extreme and mixed-precision quantization – from 8-bit floating point down to single-bit binary representations – along with the sophisticated algorithms that make such aggressive compression possible without catastrophic quality loss.

We will cover the full landscape: the bit-level mechanics of FP8 and INT4 formats, sub-4-bit methods including binary neural networks and BitNet, state-of-the-art algorithms such as QuIP#, AQLM, and HQQ, mixed-precision strategies driven by sensitivity analysis and reinforcement learning, domain-specific challenges for Transformers, vision models, and diffusion models, and finally the hardware-aware inference optimization perspective.

FP8: 8-Bit Floating Point

Why Floating Point at 8 Bits?

Traditional INT8 quantization maps floating-point values to 256 uniformly spaced integers. While effective for inference, this uniform spacing poorly represents the heavy-tailed distributions common in neural network weights and activations. FP8 retains the logarithmic spacing of floating-point arithmetic, providing higher precision near zero (where most values cluster) and coarser precision for outliers.

E4M3 and E5M2 Bit Layouts

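Two FP8 formats have become the de facto standard. They were proposed jointly by NVIDIA, Arm, and Intel and later specified by the Open Compute Project (OCP) with support from other vendors, including AMD; an IEEE working group (P3109) is also developing a standard for 8-bit floating point. Both formats use 8 bits total: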

E4M3 Format (1 sign + 4 exponent + 3 mantissa):
+---+----+---+---+---+---+---+---+
| S | E3 | E2| E1| E0| M2| M1| M0|
+---+----+---+---+---+---+---+---+
  1    4 bits exponent   3 bits mantissa

E5M2 Format (1 sign + 5 exponent + 2 mantissa):
+---+----+---+---+---+---+---+---+
| S | E4 | E3| E2| E1| E0| M1| M0|
+---+----+---+---+---+---+---+---+
  1    5 bits exponent     2 bits mantissa

The value of a normal FP8 number follows the standard floating-point formula:

$$\text{value} = (-1)^S \times 2^{(E - \text{bias})} \times (1 + \frac{M}{2^{m}})$$

where $E$ is the stored exponent, $\text{bias}$ is the exponent bias, $M$ is the stored mantissa, and $m$ is the number of mantissa bits.

Property	E4M3	E5M2
Exponent bits	4	5
Mantissa bits	3	2
Exponent bias	7	15
Max normal value	448	57344
Min positive normal	$2^{-6}$ = 0.015625	$2^{-14}$ = 6.1e-5
Dynamic range, max / min normal (decades)	~4.5	~9.0
Precision (ULP at 1.0)	0.125	0.25
Special values	NaN only (no Inf)	NaN and Inf

Numerical Examples

E4M3 encoding of 3.5:

$3.5 = 1.75 \times 2^1$
Sign: $S = 0$ (positive)
Exponent: $E = 1 + 7 = 8 = 1000_2$
Mantissa: $1.75 = 1 + 0.5 + 0.25 = 1 + \frac{M}{8}$, so $M = 6 = 110_2$
Final bit pattern: 0 1000 110 = 0x46

E5M2 encoding of 0.1875:

$0.1875 = 1.5 \times 2^{-3}$
Sign: $S = 0$
Exponent: $E = -3 + 15 = 12 = 01100_2$
Mantissa: $1.5 = 1 + 0.5 = 1 + \frac{M}{4}$, so $M = 2 = 10_2$
Final bit pattern: 0 01100 10 = 0x32

Quantization error comparison at value 1.3:

E4M3: rounds to 1.25 (error = 0.05, relative = 3.8%)
E5M2: rounds to 1.25 (error = 0.05, relative = 3.8%) – same here, but at value 5.3:
E4M3: rounds to 5.5 (error = 0.2, relative = 3.8%), because the spacing between representable values in [4, 8) is 0.5
E5M2: rounds to 5.0 (error = 0.3, relative = 5.7%) – E4M3 wins with more mantissa bits
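
The encodings and rounding behaviour above can be checked with a few lines of Python. This is a minimal sketch rather than a full FP8 implementation: the function names and the `min_exp` parameter are hypothetical, and special values (NaN, Inf) and overflow are deliberately ignored.

```python
import math

def decode_fp8(byte, exp_bits, man_bits, bias):
    """Decode one FP8 byte. Handles normal and subnormal values only;
    NaN/Inf encodings are NOT treated specially (simplifying assumption)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * 2.0 ** (1 - bias) * (man / 2 ** man_bits)
    return sign * 2.0 ** (exp - bias) * (1 + man / 2 ** man_bits)

def round_to_fp8(x, man_bits, min_exp):
    """Round x to the nearest point of the FP8 grid (no overflow handling).
    min_exp is the exponent of the smallest normal value."""
    if x == 0:
        return 0.0
    exp = max(math.floor(math.log2(abs(x))), min_exp)
    step = 2.0 ** (exp - man_bits)  # spacing between neighbours in this binade
    return round(x / step) * step

print(decode_fp8(0x46, exp_bits=4, man_bits=3, bias=7))    # E4M3 pattern from above
print(decode_fp8(0x32, exp_bits=5, man_bits=2, bias=15))   # E5M2 pattern from above
print(round_to_fp8(5.3, man_bits=3, min_exp=-6))           # nearest E4M3 value
print(round_to_fp8(5.3, man_bits=2, min_exp=-14))          # nearest E5M2 value
```

The `step` line makes the precision trade-off explicit: each extra mantissa bit halves the spacing within a binade.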

FP8 Training

FP8 training uses both formats in a complementary fashion, as pioneered by NVIDIA’s Transformer Engine:

FP8 Mixed-Format Training Pipeline:

                    FP8 E4M3              FP8 E4M3
  Weights -----> [Forward Pass] -----> Activations
  (E4M3)             |                    |
                     |                    |
                     v                    v
               FP8 E5M2              FP8 E5M2
            [Backward Pass] <----- [Loss Gradient]
            (grad weights)          (grad activations)
                     |
                     v
              FP32 Master Weights (optimizer update)

The key insight: E4M3 for forward pass (higher precision needed for accurate outputs) and E5M2 for backward pass (wider dynamic range needed for gradients, which can span many orders of magnitude).

Per-tensor scaling is critical for FP8 training. Each tensor maintains a scaling factor $s$ updated via a delayed scaling strategy:

$$s_{t+1} = \frac{\text{maxval}(\text{FP8})}{\max(|X_t|)} \times \alpha$$

where $\alpha$ is a safety margin (typically 0.9) to prevent overflow, and the scaling factor is applied before casting to FP8:

$$X_{\text{FP8}} = \text{cast\_to\_fp8}(X \times s)$$
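
A minimal sketch of the delayed-scaling bookkeeping is shown below. The class name, the `history_len` window, and the use of clamping in place of an actual FP8 cast are assumptions made for illustration; this is not the Transformer Engine API.

```python
from collections import deque

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}

class DelayedScaler:
    """Hypothetical helper: keeps a per-tensor scale derived from past amax values."""

    def __init__(self, fmt="e4m3", history_len=16, margin=0.9):
        self.fp8_max = FP8_MAX[fmt]
        self.margin = margin                      # the safety factor alpha
        self.amax_history = deque(maxlen=history_len)
        self.scale = 1.0

    def scale_and_record(self, values):
        # Apply the scale computed from EARLIER steps (hence "delayed"),
        # clamping to the representable range instead of a real FP8 cast.
        scaled = [max(-self.fp8_max, min(self.fp8_max, v * self.scale))
                  for v in values]
        # Record this step's amax and refresh the scale for the next step.
        self.amax_history.append(max(abs(v) for v in values))
        amax = max(self.amax_history)
        if amax > 0:
            self.scale = self.fp8_max / amax * self.margin
        return scaled

scaler = DelayedScaler("e4m3", history_len=4)
scaler.scale_and_record([1.0, -2.0])
print(scaler.scale)
```

Using the maximum over a short history rather than only the last step is one common way to make the scale less sensitive to a single unusually small tensor.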

NVIDIA’s H100 GPU achieves up to 2x throughput improvement with FP8 Tensor Cores compared to FP16, making FP8 training practical for models with hundreds of billions of parameters.

INT4: 4-Bit Integer Quantization

Uniform INT4 Quantization

At 4 bits, we have only 16 distinct values. For symmetric quantization:

$$q = \text{clamp}\left(\left\lfloor \frac{x}{s} \right\rceil, -8, 7\right), \quad s = \frac{\max(|x|)}{7}$$

For asymmetric quantization:

$$q = \text{clamp}\left(\left\lfloor \frac{x - z}{s} \right\rceil, 0, 15\right), \quad s = \frac{\max(x) - \min(x)}{15}, \quad z = \min(x)$$

With only 16 levels, the quantization error is significant for per-tensor quantization. This motivates group quantization.
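
The two formulas translate directly into code. The sketch below uses plain Python lists and hypothetical function names, and assumes the input is not constant (otherwise the scale would be zero).

```python
def quantize_symmetric_int4(xs):
    """Symmetric INT4: scale from the absolute maximum, codes in [-8, 7]."""
    s = max(abs(x) for x in xs) / 7
    q = [max(-8, min(7, round(x / s))) for x in xs]
    return q, s

def quantize_asymmetric_int4(xs):
    """Asymmetric INT4: scale from the range, offset z = min(x), codes in [0, 15]."""
    z = min(xs)
    s = (max(xs) - z) / 15
    q = [max(0, min(15, round((x - z) / s))) for x in xs]
    return q, s, z

def dequantize_symmetric(q, s):
    return [qi * s for qi in q]

def dequantize_asymmetric(q, s, z):
    return [qi * s + z for qi in q]

xs = [0.1, 0.5, -0.3, 0.8]
q_sym, s_sym = quantize_symmetric_int4(xs)
q_asym, s_asym, z = quantize_asymmetric_int4(xs)
print(q_sym, dequantize_symmetric(q_sym, s_sym))
print(q_asym, dequantize_asymmetric(q_asym, s_asym, z))
```

For inputs that are not centred on zero, the asymmetric variant has a smaller step size because it spends all 16 levels on the occupied range.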

Group Quantization

Group quantization divides a weight tensor into small groups of $g$ consecutive elements, each with its own scale and zero-point:

Weight tensor (1x16):
[0.1, 0.5, -0.3, 0.8, | -0.1, 0.2, 0.9, -0.7, | 0.3, -0.4, 0.6, 0.1, | -0.2, 0.7, -0.5, 0.4]
     Group 0 (g=4)          Group 1 (g=4)           Group 2 (g=4)           Group 3 (g=4)
     s0, z0                  s1, z1                  s2, z2                  s3, z3

The overhead of storing per-group parameters adds bits per weight:

$$\text{effective bits} = 4 + \frac{b_s + b_z}{g}$$

where $b_s$ and $b_z$ are the bit-widths of the scale and zero-point. For $g = 128$ with FP16 scale and zero-point:

$$\text{effective bits} = 4 + \frac{16 + 16}{128} = 4.25 \text{ bits}$$

Common group sizes in practice: 32, 64, 128, 256. Smaller groups improve accuracy but increase overhead.
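
To make the accuracy and overhead trade-off concrete, here is a small sketch that applies asymmetric 4-bit quantization per group to the 16-element tensor shown earlier and compares the error against a single group covering the whole tensor. The helper names are hypothetical.

```python
def effective_bits(weight_bits, scale_bits, zero_bits, group_size):
    """Bits per weight including the per-group scale and zero-point."""
    return weight_bits + (scale_bits + zero_bits) / group_size

def fake_quantize_group(xs, levels=16):
    """Quantize one group to `levels` uniform steps, then dequantize."""
    lo, hi = min(xs), max(xs)
    s = (hi - lo) / (levels - 1) or 1.0   # guard against a constant group
    return [round((x - lo) / s) * s + lo for x in xs]

def group_quantize(xs, g):
    out = []
    for i in range(0, len(xs), g):
        out.extend(fake_quantize_group(xs[i:i + g]))
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

w = [0.1, 0.5, -0.3, 0.8, -0.1, 0.2, 0.9, -0.7,
     0.3, -0.4, 0.6, 0.1, -0.2, 0.7, -0.5, 0.4]

print(effective_bits(4, 16, 16, 128))   # overhead formula from above
print(mse(w, group_quantize(w, 16)))    # one group for the whole tensor
print(mse(w, group_quantize(w, 4)))     # four groups of four
```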

NF4: NormalFloat 4-bit

QLoRA introduced NF4 (NormalFloat4), an information-theoretically optimal data type for normally distributed weights. The key insight: neural network weights after pretraining are approximately normally distributed with zero mean.

NF4 constructs its 16 quantization levels by computing the quantiles of the standard normal distribution $\mathcal{N}(0,1)$, ensuring each quantization bin contains equal probability mass:

$$q_i = \Phi^{-1}\left(\frac{2i + 1}{2 \times 16}\right), \quad i = 0, 1, \ldots, 15$$

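where $\Phi^{-1}$ is the inverse cumulative distribution function (probit function) of the standard normal. This formula is the idealized equal-mass construction. The NF4 type used in QLoRA builds the negative and positive halves from separate quantile sets so that zero is represented exactly, then rescales the levels to $[-1, 1]$, which is why the values listed below are not symmetric around zero.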

The resulting NF4 quantization levels (normalized to [-1, 1]):

NF4 levels (16 values):
[-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
  0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230, 1.0]

Notice the non-uniform spacing: levels are denser near zero where the normal distribution has higher probability density. The design goal is a low expected quantization error for normally distributed inputs:

$$\mathbb{E}[|x - Q(x)|^2] = \int_{-\infty}^{\infty} |x - Q(x)|^2 \, \phi(x) \, dx$$

where $\phi(x)$ is the standard normal PDF. NF4 achieves lower expected error than uniform INT4 for normally distributed data.
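
Quantizing with NF4 amounts to a nearest-neighbour lookup in the 16-entry table after normalizing each block by its absolute maximum. The sketch below uses the rounded level values listed above; the function names are hypothetical and the block size is whatever list is passed in.

```python
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def nf4_quantize(block):
    """Return 4-bit indices and the per-block absmax scale."""
    absmax = max(abs(x) for x in block) or 1.0   # avoid division by zero
    idx = [min(range(16), key=lambda i: abs(x / absmax - NF4_LEVELS[i]))
           for x in block]
    return idx, absmax

def nf4_dequantize(idx, absmax):
    return [NF4_LEVELS[i] * absmax for i in idx]

idx, absmax = nf4_quantize([0.5, -0.25, 0.0, 0.1])
print(idx, nf4_dequantize(idx, absmax))
```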

QLoRA further applies double quantization – quantizing the FP32 group scales themselves to FP8, reducing the per-parameter overhead:

$$\text{effective bits (NF4 + double quant)} = 4 + \frac{8}{64} + \frac{32}{64 \times 256} \approx 4.127 \text{ bits}$$

GGUF Format and Quant Types

The GGUF (GPT-Generated Unified Format) file format, developed by the llama.cpp community, has become the de facto standard for distributing quantized LLMs for CPU and mixed CPU/GPU inference. It supports a wide array of quantization types:

Quant Type	Bits/Weight	Group Size	Scale Format	Description
Q2_K	2.5625	256 (super) / 16 (sub)	FP16 + 4-bit	2-bit with 4-bit importance-based scales
Q3_K_S	3.4375	256 / 16	FP16 + 4-bit	3-bit small, fewer high-precision groups
Q3_K_M	3.875	256 / 16	FP16 + 4-bit	3-bit medium
Q3_K_L	4.125	256 / 16	FP16 + 4-bit	3-bit large, more high-precision groups
Q4_0	4.5	32	FP16	Basic 4-bit, per-group absmax
Q4_1	5.0	32	FP16 + FP16	4-bit with scale + min value
Q4_K_S	4.5	256 / 32	FP16 + 6-bit	4-bit K-quant small
Q4_K_M	4.85	256 / 32	FP16 + 6-bit	4-bit K-quant medium, mixed precision
Q5_0	5.5	32	FP16	5-bit per-group
Q5_1	6.0	32	FP16 + FP16	5-bit with min
Q5_K_S	5.5	256 / 32	FP16 + 6-bit	5-bit K-quant small
Q5_K_M	5.75	256 / 32	FP16 + 6-bit	5-bit K-quant medium
Q6_K	6.5625	256 / 16	FP16 + 8-bit	6-bit K-quant
Q8_0	8.5	32	FP16	8-bit per-group
IQ1_S	1.5625	256	FP16	1-bit importance-weighted
IQ2_XXS	2.0625	256	FP16	2-bit ultra-extreme
IQ2_XS	2.3125	256	FP16	2-bit extreme
IQ2_S	2.5	256	FP16	2-bit
IQ3_XXS	3.0625	256	FP16	3-bit ultra-extreme
IQ3_XS	3.3	256	FP16	3-bit extreme
IQ4_NL	4.5	32	FP16	4-bit non-linear (NF4-like)
IQ4_XS	4.25	256 / 32	FP16	4-bit extreme with super-blocks

The K-quant variants (e.g., Q4_K_M) use a two-level grouping hierarchy: super-blocks of 256 weights containing sub-blocks of 16 or 32 weights. The super-block stores a shared FP16 scale, while sub-blocks store smaller quantized scales relative to the super-block. This hierarchical approach significantly reduces overhead.

The IQ (Importance Quantization) variants use lattice-based codebooks and importance weighting (derived from the Fisher information or Hessian diagonal) to allocate bits more efficiently to important weights.

Sub-4-Bit Quantization

INT3 and INT2

At 3 bits (8 levels) and 2 bits (4 levels), naive uniform quantization causes severe accuracy degradation. The key challenge can be visualized:

Weight Distribution vs. Quantization Levels:

Probability
  |
  |     ***
  |    *****
  |   *******
  |  *********
  | ***********
  |*************
  +-----|---|---|---|---> value
       L0  L1  L2  L3    (INT2: only 4 levels!)

Most of the distribution's probability mass falls
between L1 and L2, wasting 2 of the 4 levels on
the rarely-occupied tails.

Successful INT3/INT2 methods rely on several key techniques:

Non-uniform quantization: Place levels according to the weight distribution (as in NF4)
Compensation: Adjust remaining FP16 weights to compensate for quantization error in quantized layers
Learned rounding: Optimize the rounding decisions (up or down) jointly rather than independently
Group quantization with very small groups: Groups of 8-32 to capture local statistics
Mixed-precision residuals: Store a small FP16 or INT8 residual correction term

Binary Neural Networks (BNNs)

Binary Neural Networks represent the extreme of quantization: weights (and optionally activations) are constrained to $\{-1, +1\}$, requiring only 1 bit per value.

Binarization function:

$$w_b = \text{sign}(w) = \begin{cases} +1 & \text{if } w \geq 0 \\ -1 & \text{if } w < 0 \end{cases}$$

The key advantage: matrix multiplications reduce to XNOR and popcount operations:

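$$y = \mathbf{w}^T \mathbf{x} \approx \alpha \cdot \left(2 \cdot \text{popcount}(\text{XNOR}(\mathbf{w}_b, \mathbf{x}_b)) - n\right)$$

where $\alpha$ is a learned or computed scaling factor, $n$ is the vector length, and the values $+1$ and $-1$ are stored as bits 1 and 0. The popcount counts positions where the two vectors agree, so $2 \cdot \text{popcount} - n$ equals matches minus mismatches. The XNOR-popcount operation is extremely fast on modern hardware: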

Binary Matrix Multiply (XNOR + Popcount):

w_b = [+1, -1, +1, +1, -1, +1, -1, -1]  -->  [1,0,1,1,0,1,0,0] = 0xB4
x_b = [+1, +1, -1, +1, +1, -1, +1, -1]  -->  [1,1,0,1,1,0,1,0] = 0xDA

XNOR(0xB4, 0xDA) = 0x91 = [1,0,0,1,0,0,0,1]
popcount(0x91) = 3

dot_product = 2 * popcount - n = 2 * 3 - 8 = -2

Verification: (+1)(+1) + (-1)(+1) + (+1)(-1) + (+1)(+1)
            + (-1)(+1) + (+1)(-1) + (-1)(+1) + (-1)(-1)
            = 1 - 1 - 1 + 1 - 1 - 1 - 1 + 1 = -2  (correct)
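
The same computation in Python, using arbitrary-precision integers as bit vectors. The helper names are hypothetical; real kernels would operate on packed 32- or 64-bit words with hardware popcount.

```python
def pack_signs(values):
    """Pack a +1/-1 vector into an int; the first element becomes the top bit."""
    bits = 0
    for v in values:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits

def binary_dot(w_bits, x_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # mask to n bits after the NOT
    return 2 * bin(xnor).count("1") - n

w = [+1, -1, +1, +1, -1, +1, -1, -1]
x = [+1, +1, -1, +1, +1, -1, +1, -1]
w_bits, x_bits = pack_signs(w), pack_signs(x)
print(hex(w_bits), hex(x_bits), binary_dot(w_bits, x_bits, len(w)))
```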

Computational savings of BNNs:

Operation	FP32	Binary
Multiply	32-bit FPU multiply	1-bit XNOR
Accumulate	32-bit FP add	Integer popcount
Memory per weight	32 bits	1 bit (32x reduction)
Theoretical speedup	1x	~58x (reported by XNOR-Net for binary convolutions on CPU)

However, BNNs suffer from severe accuracy loss. For ImageNet classification, a binary ResNet-18 typically loses 15-20% top-1 accuracy compared to the full-precision version. This limits BNNs to edge applications where extreme efficiency is paramount.

Training BNNs requires the Straight-Through Estimator (STE) because the sign function has zero gradient almost everywhere:

$$\frac{\partial L}{\partial w} \approx \frac{\partial L}{\partial w_b} \cdot \mathbb{1}_{|w| \leq 1}$$

The STE passes the gradient through the sign function as if it were the identity (clipped to [-1, 1]).

BitNet b1.58

BitNet b1.58 (Microsoft Research, 2024) represents a breakthrough in ternary quantization. Instead of binary $\{-1, +1\}$, it uses ternary weights $\{-1, 0, +1\}$, requiring $\log_2(3) \approx 1.58$ bits per weight.

Quantization function:

$$\tilde{w} = \text{RoundClip}\left(\frac{w}{\gamma + \epsilon}, -1, 1\right)$$

where $\gamma = \frac{1}{nm}\sum_{i,j}|w_{ij}|$ is the mean absolute value of the weight matrix, and:

$$\text{RoundClip}(x, a, b) = \max(a, \min(b, \lfloor x \rceil))$$

Activation quantization uses absmax quantization to $b$-bit integers (typically 8-bit):

$$\tilde{x} = \text{Quant}(x) = \text{clamp}\left(\left\lfloor \frac{x}{Q_b} \times (2^{b-1} - 1) \right\rceil, -(2^{b-1}-1), 2^{b-1}-1\right)$$

where $Q_b = \|x\|_\infty$ is the largest absolute activation value.

The linear layer in BitNet b1.58:

$$y = \tilde{W} \tilde{x} = \sum_{j} \tilde{w}_j \tilde{x}_j$$

Since $\tilde{w}_j \in \{-1, 0, +1\}$, each multiply becomes:

If $\tilde{w} = +1$: add $\tilde{x}$
If $\tilde{w} = -1$: subtract $\tilde{x}$
If $\tilde{w} = 0$: skip (no operation)

Within the matrix multiply this removes multiplications entirely: the accumulation reduces to integer additions and subtractions, and only the final rescaling of the output still needs a multiply.
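
A minimal sketch of the absmean ternarization and the resulting multiply-free mat-vec is given below. The function names are hypothetical, the activations are assumed to be already quantized to integers, and the output rescaling by $\gamma$ is omitted.

```python
def absmean_ternary(weights, eps=1e-8):
    """weights: list of rows. Returns ternary rows and the scale gamma."""
    flat = [abs(w) for row in weights for w in row]
    gamma = sum(flat) / len(flat)                      # mean absolute value
    tern = [[max(-1, min(1, round(w / (gamma + eps)))) for w in row]
            for row in weights]
    return tern, gamma

def ternary_matvec(tern, x_int):
    """Mat-vec using only add, subtract, and skip."""
    out = []
    for row in tern:
        acc = 0
        for w, x in zip(row, x_int):
            if w == 1:
                acc += x
            elif w == -1:
                acc -= x
            # w == 0: skip
        out.append(acc)
    return out

W = [[0.9, -0.05, -1.2],
     [0.02, 0.6, -0.4]]
tern, gamma = absmean_ternary(W)
print(tern, ternary_matvec(tern, [3, 5, -2]))
```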

Energy and performance comparison (from the BitNet b1.58 paper):

Energy per Operation (relative to FP16 multiply-add):

  FP16 Multiply:  |========================| 100%
  FP16 Add:       |=====|                    20%
  INT8 Multiply:  |=======|                  31%
  INT8 Add:       |=|                         4%
  1.58-bit (add): |=|                         4%

Memory Footprint for a 70B model:

  FP16:   |========================================| 140 GB
  INT8:   |====================|                     70 GB
  INT4:   |==========|                               35 GB
  1.58b:  |====|                                     ~13.8 GB at 1.58 bits/weight (17.5 GB if stored as 2-bit codes)

BitNet b1.58 key results:

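At the 3B parameter scale, BitNet b1.58 is reported to match the full-precision (FP16) LLaMA baseline on perplexity benchmarks while using:

3.55x less memory than FP16
2.71x lower latency on a single device

The 8.9x throughput figure from the same paper refers to the 70B scale, where the smaller memory footprint allows a batch size about 11x larger than the FP16 baseline.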

The zero values in the ternary representation provide implicit sparsity, since every zero weight is an operation that can be skipped, which further reduces computation.

Advanced Quantization Algorithms

QuIP and QuIP#

QuIP (Quantization with Incoherence Processing) and its successor QuIP# make 2-bit quantization of LLMs practical, with modest perplexity degradation, by exploiting the concept of incoherence in weight matrices.

The Incoherence Principle:

Quantization error is minimized when the weight matrix and the Hessian (input correlation matrix) are “incoherent” – meaning they have no concentrated structure. Formally, if a matrix $W$ has its entries spread uniformly rather than concentrated in a few large values, rounding errors tend to cancel out statistically.

QuIP achieves incoherence by applying random orthogonal rotations:

$$W' = U W V^T$$

where $U$ and $V$ are random orthogonal matrices. The quantized version is:

$$\hat{W} = U^T \text{Quantize}(U W V^T) V$$

The rotation spreads outlier values across all entries, making the rotated matrix more amenable to uniform quantization.
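
The effect of such a rotation is easy to see on a single vector with one outlier. The sketch below is a simplification under stated assumptions: it rotates a vector rather than a full matrix, and it uses a normalized Hadamard matrix combined with random sign flips as the orthogonal transform. All names are illustrative.

```python
import math
import random

def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return [[v / math.sqrt(n) for v in row] for row in h]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

random.seed(0)
n = 8
signs = [random.choice((-1.0, 1.0)) for _ in range(n)]   # random sign flips
w = [0.1, -0.2, 0.05, 8.0, 0.15, -0.1, 0.2, -0.05]       # one large outlier
H = hadamard(n)

rotated = matvec(H, [s * v for s, v in zip(signs, w)])   # U = H * diag(signs)
# H is symmetric and orthogonal, so U^T = diag(signs) * H undoes the rotation.
restored = [s * v for s, v in zip(signs, matvec(H, rotated))]

print(max(abs(v) for v in w), max(abs(v) for v in rotated))
```

After rotation the energy of the outlier is shared by all eight coordinates, so a uniform grid sized for the largest entry wastes far fewer levels.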

QuIP# improvements:

Kronecker product rotations: Instead of storing full random orthogonal matrices, QuIP# uses the Kronecker product of smaller Hadamard matrices: $U = H_1 \otimes H_2$. This reduces storage from $O(n^2)$ to $O(n)$ and enables fast application via the Fast Walsh-Hadamard Transform in $O(n \log n)$.
E8 Lattice Quantization: Instead of rounding each scalar independently, QuIP# quantizes vectors of 8 values jointly using the $E_8$ lattice.

The E8 Lattice:

The $E_8$ lattice is a mathematical structure in 8-dimensional space with remarkable properties. It gives the densest sphere packing in 8 dimensions and is the best known lattice quantizer in 8 dimensions for uniformly distributed inputs.

The $E_8$ lattice points can be defined as:

$$E_8 = \left\{ x \in \mathbb{Z}^8 \cup \left(\mathbb{Z} + \frac{1}{2}\right)^8 : \sum_{i=1}^{8} x_i \equiv 0 \pmod{2} \right\}$$

That is, coordinates are either all integers or all half-integers, and their sum is even.
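
Finding the nearest $E_8$ point does not require a search. Since $E_8$ is the union of $D_8$ (integer vectors with even sum) and $D_8$ shifted by $\frac{1}{2}$ in every coordinate, one can decode in each coset and keep the closer result, which is the classic Conway and Sloane approach. The sketch below covers the unrestricted lattice only; it is not the codebook-restricted quantizer that QuIP# uses, and the function names are hypothetical.

```python
import math

def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = [math.floor(v + 0.5) for v in x]
    if sum(r) % 2 != 0:
        # Parity is wrong: move the coordinate with the largest rounding
        # error to its second-nearest integer.
        i = max(range(len(x)), key=lambda k: abs(x[k] - r[k]))
        r[i] += 1 if x[i] > r[i] else -1
    return r

def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2)."""
    a = nearest_d8(x)
    b = [v + 0.5 for v in nearest_d8([v - 0.5 for v in x])]
    dist = lambda p: sum((u - v) ** 2 for u, v in zip(x, p))
    return a if dist(a) <= dist(b) else b

print(nearest_e8([0.4, 0.6, 0.1, -0.2, 0.9, 0.3, -0.6, 0.5]))
```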

E8 Lattice Quantization (simplified 2D analogy):

  Scalar Quantization:          Lattice Quantization:
  Each dimension independent    Joint optimization in 8D

       |  .  |  .  |              .     .     .
  -----+-----+-----+---         .   .   .   .   .
       |  .  |  .  |              .     .     .
  -----+-----+-----+---         .   .   .   .   .
       |  .  |  .  |              .     .     .

  Z^8 grid: min. distance 1    E8: same points per unit volume,
  (cubic cells)                min. distance sqrt(2) (rounder cells)

The lattice quantizer finds the nearest $E_8$ lattice point to each 8-dimensional weight vector:

$$\hat{w}_{1:8} = \arg\min_{v \in E_8 \cap \mathcal{C}} \|w'_{1:8} - v\|^2$$

where $\mathcal{C}$ is the codebook subset used for 2-bit encoding. Each 8D lattice point is encoded with $8 \times 2 = 16$ bits, yielding exactly 2 bits per weight.

QuIP# results: At 2 bits per weight on LLaMA-2 70B, QuIP# achieves a perplexity of approximately 4.15 on WikiText-2, compared to 3.32 for the FP16 baseline – a remarkably small degradation for 8x compression.

AQLM: Additive Quantization for Language Models

AQLM applies multi-codebook quantization (a form of additive vector quantization) to LLM weight compression.

Core idea: Instead of quantizing each weight independently, AQLM groups weights into vectors and represents each vector as a sum of entries from multiple codebooks:

$$\hat{w}_{1:d} = \sum_{m=1}^{M} C_m[i_m]$$

where $C_m \in \mathbb{R}^{K \times d}$ is the $m$-th codebook with $K$ entries, each of dimension $d$, and $i_m \in \{0, 1, \ldots, K-1\}$ is the index into codebook $m$.

AQLM Multi-Codebook Quantization:

Weight vector w = [0.12, -0.34, 0.56, 0.78, -0.91, 0.23, -0.45, 0.67]

Codebook 1:  C1[3] = [0.1,  -0.2,  0.3,  0.5,  -0.6,  0.1,  -0.3,  0.4]
Codebook 2:  C2[7] = [0.02, -0.14, 0.26, 0.28, -0.31, 0.13, -0.15, 0.27]
                     -------------------------------------------------------
Approximation:        [0.12, -0.34, 0.56, 0.78, -0.91, 0.23, -0.45, 0.67]

Stored: indices (3, 7) + codebooks C1, C2 (shared across all vectors)

Bit rate calculation:

For $M$ codebooks, each with $K = 2^B$ entries, quantizing vectors of dimension $d$:

$$\text{bits per weight} = \frac{M \times B}{d} + \text{codebook overhead}$$

For example, with $M = 2$, $B = 8$ (256 entries per codebook), $d = 8$:

$$\text{bits per weight} = \frac{2 \times 8}{8} = 2 \text{ bits}$$

The codebook overhead is amortized across the entire weight matrix and is typically negligible.
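
Decoding is just a sum of table lookups. The sketch below reuses the two codebook entries from the worked example; the codebooks are stored as sparse dictionaries holding only those entries, and the function names are hypothetical.

```python
def aqlm_decode(codebooks, indices):
    """Reconstruct a weight vector as the sum of one entry per codebook."""
    d = len(codebooks[0][indices[0]])
    out = [0.0] * d
    for book, idx in zip(codebooks, indices):
        for j, v in enumerate(book[idx]):
            out[j] += v
    return out

def bits_per_weight(num_codebooks, index_bits, vec_dim):
    """Index cost only; codebook storage is amortized and ignored here."""
    return num_codebooks * index_bits / vec_dim

C1 = {3: [0.1, -0.2, 0.3, 0.5, -0.6, 0.1, -0.3, 0.4]}
C2 = {7: [0.02, -0.14, 0.26, 0.28, -0.31, 0.13, -0.15, 0.27]}

w_hat = aqlm_decode([C1, C2], (3, 7))
print([round(v, 2) for v in w_hat])
print(bits_per_weight(num_codebooks=2, index_bits=8, vec_dim=8))
```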

AQLM optimization uses beam search combined with fine-tuning:

Initialize codebooks using k-means on weight vectors
Beam search over index combinations to minimize $\|W - \hat{W}\|_H^2$ (Hessian-weighted error)
Fine-tune codebook entries end-to-end with a small calibration dataset

At the time of its publication, AQLM reported state-of-the-art results at 2-bit precision, outperforming QuIP# on several benchmarks when both use the same bit budget.

HQQ: Half-Quadratic Quantization

HQQ takes a fundamentally different approach to quantization by formulating it as a half-quadratic optimization problem, enabling fast, data-free quantization.

Problem formulation:

Most PTQ methods minimize the layer-wise output error:

$$\min_{\hat{W}} \|WX - \hat{W}X\|^2$$

This requires calibration data $X$. HQQ instead directly minimizes the weight reconstruction error with a sparsity-promoting penalty:

$$\min_{Q} \|W - Q\|_p^p$$

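where $\|\cdot\|_p$ is the $\ell_p$ norm with $0 < p \leq 1$ (promoting sparse residuals), and $Q$ is constrained to the quantization grid. In the published method the free variables are the quantization parameters (zero-point and scale) that define $Q$, rather than the entries of $Q$ themselves; the formulation here is a simplified view.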

Half-quadratic splitting introduces an auxiliary variable $Z$:

$$\min_{Q, Z} \|W - Z\|_p^p + \frac{\mu}{2}\|Z - Q\|_2^2$$

This decouples into two tractable sub-problems that are solved alternately:

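Z-update (proximal operator of the $\ell_p$ penalty): has a closed-form solution for $p = 1$ (soft-thresholding) and in the limiting case $p = 0$ (hard-thresholding). Writing the residual as $E = W - Z$, the sub-problem becomes $\min_E \|E\|_p^p + \frac{\mu}{2}\|E - (W - Q^{(k)})\|_2^2$, so

$$Z^{(k+1)} = W - \text{prox}_{\frac{1}{\mu}\|\cdot\|_p^p}\left(W - Q^{(k)}\right)$$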

Q-update (nearest quantization level): simple rounding

$$Q^{(k+1)} = \text{Quantize}(Z^{(k+1)})$$

HQQ Iteration (schematic; the numbers are illustrative rather than an exact computation):

Step 0:  W = [0.12, -0.87, 0.34, 0.93, -0.21, 0.78, -0.56, 0.45]

Step 1 (Z-update): Apply proximal operator (soft-thresholding)
         Z = [0.10, -0.85, 0.32, 0.91, -0.19, 0.76, -0.54, 0.43]

Step 2 (Q-update): Round to nearest INT4 grid point
         Q = [0.13, -0.87, 0.33, 0.93, -0.20, 0.73, -0.53, 0.40]

Repeat steps 1-2 until convergence (typically 10-20 iterations)
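
The soft-thresholding operator named in the Z-update is the proximal operator of the $\ell_1$ norm: it shrinks every entry toward zero by a fixed amount and zeroes anything smaller than the threshold. A minimal sketch follows, with a hypothetical function name and an arbitrary threshold of 0.02; it illustrates only this one building block, not the full HQQ procedure.

```python
def soft_threshold(r, lam):
    """Element-wise argmin over e of  lam * |e| + 0.5 * (e - r)**2."""
    return [max(abs(v) - lam, 0.0) * (1 if v > 0 else -1) for v in r]

# Residuals between a weight vector and its quantized version (made-up values).
residual = [-0.013, 0.060, -0.059, 0.0, 0.056, -0.017, -0.029, 0.051]
print([round(v, 3) for v in soft_threshold(residual, 0.02)])
```

Small residuals are set exactly to zero while large ones are kept (minus the threshold), which is the sparsity-promoting behaviour the $\ell_p$ penalty is meant to encourage.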

HQQ advantages:

No calibration data needed: Works directly on weights, no forward passes required
Extremely fast: Quantizing a 70B model takes minutes, not hours
Strong quality: Competitive with GPTQ and AWQ at INT4, and superior at INT3/INT2
Simple implementation: No Hessian computation, no matrix decomposition

Mixed-Precision Quantization

Mixed-precision quantization assigns different bit-widths to different layers (or even different channels/heads) based on their sensitivity to quantization. The insight is simple: not all layers are equally sensitive. Some layers can tolerate 2-bit quantization with minimal accuracy loss, while others require 8 bits.

Layer Sensitivity Analysis

The most straightforward approach measures each layer’s sensitivity independently:

Perturbation-based sensitivity:

For each layer $l$, quantize it to $b$ bits while keeping all other layers at full precision, and measure the change in task loss:

$$\Delta L_l(b) = L(\theta_1, \ldots, \theta_l^{(b)}, \ldots, \theta_N) - L(\theta_1, \ldots, \theta_N)$$

Sensitivity Profile of a Typical LLM:

Sensitivity
  |
  |##                                                    ##
  |##                                                    ##
  |###                                                  ###
  |###                                                  ###
  |####                                                ####
  |####              ##                ##              ####
  |#####            ####              ####            #####
  |######          ######            ######          ######
  |########      ########          ########        ########
  |############################################################
  +-------------------------------------------------------------> Layer
   0  2  4  6  8  10  12  14  16  18  20  22  24  26  28  30
                      First & Last layers: HIGH sensitivity
                      Middle layers: LOW sensitivity

This roughly U-shaped sensitivity curve is commonly reported, although the exact profile depends on the architecture and on the sensitivity metric. The first few layers (embedding projection, early attention) and the last few layers (final attention, output projection) tend to be most sensitive, while middle layers are more robust to quantization.

Hessian-based sensitivity (second-order):

The sensitivity can be estimated more efficiently using the Hessian:

$$\Delta L_l \approx \frac{1}{2} \delta_l^T H_l \delta_l = \frac{1}{2} \text{tr}(\delta_l \delta_l^T H_l)$$

where $\delta_l = \theta_l - \theta_l^{(b)}$ is the quantization perturbation and $H_l$ is the Hessian of the loss with respect to layer $l$ parameters. The trace of the Hessian (or its top eigenvalue) serves as a sensitivity metric.

HAWQ: Hessian AWare Quantization

HAWQ (and its successors HAWQ-V2, HAWQ-V3) use Hessian information to automatically determine per-layer bit-widths.

HAWQ-V1 uses the top eigenvalue of the per-layer Hessian:

$$\Omega_l = \lambda_{\max}(H_l)$$

Layers with larger $\Omega_l$ receive more bits. The bit-width assignment is formulated as a constrained optimization:

$$\min_{\{b_l\}} \sum_{l=1}^{L} \Omega_l \cdot \mathbb{E}[\|\delta_l(b_l)\|^2] \quad \text{s.t.} \quad \sum_{l=1}^{L} n_l \cdot b_l \leq B_{\text{total}}$$

where $n_l$ is the number of parameters in layer $l$, $b_l \in \{2, 4, 8\}$ is the bit-width, and $B_{\text{total}}$ is the total bit budget.
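
For a handful of layers the constrained problem can be solved by exhaustive search, which makes the trade-off easy to inspect. The sketch below is an illustration, not the HAWQ solver: the function name is hypothetical, and it assumes the expected perturbation $\mathbb{E}[\|\delta_l(b_l)\|^2]$ scales as $n_l \cdot 4^{-b_l}$ (each extra bit halves the step size).

```python
from itertools import product

def allocate_bits(sensitivity, num_params, budget_bits, choices=(2, 4, 8)):
    """Exhaustive search over bit-width assignments; practical only for few layers."""
    best, best_cost = None, float("inf")
    for bits in product(choices, repeat=len(sensitivity)):
        if sum(n * b for n, b in zip(num_params, bits)) > budget_bits:
            continue  # violates the total bit budget
        # Assumed error model: E[||delta||^2] proportional to n * 4**(-b).
        cost = sum(o * n * 4.0 ** (-b)
                   for o, n, b in zip(sensitivity, num_params, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return best

# Three equally sized layers with very different sensitivities.
print(allocate_bits(sensitivity=[10.0, 0.1, 5.0],
                    num_params=[100, 100, 100],
                    budget_bits=1400))
```

With real networks the search space grows as $3^L$, which is why practical methods rely on integer programming or Pareto-frontier heuristics instead.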

HAWQ-V2 improves by using the average Hessian trace instead of the top eigenvalue:

$$\bar{\Omega}_l = \frac{1}{n_l} \text{tr}(H_l)$$

This is more robust and cheaper to compute (via Hutchinson’s stochastic trace estimator):

$$\text{tr}(H_l) \approx \frac{1}{T} \sum_{t=1}^{T} z_t^T H_l z_t$$

where $z_t$ are random Rademacher vectors ($\pm 1$ with equal probability).
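
A minimal sketch of the estimator on a small explicit matrix is shown below. In practice the matrix is never formed and `matvec` would be a Hessian-vector product obtained from automatic differentiation; the function name and sample count are illustrative assumptions.

```python
import random

def hutchinson_trace(matvec, n, num_samples=200, seed=0):
    """Estimate tr(H) using only products H @ z with Rademacher vectors z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        hz = matvec(z)
        total += sum(a * b for a, b in zip(z, hz))
    return total / num_samples

H = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.5],
     [0.0, 0.5, 2.0]]

def h_matvec(z):
    return [sum(h * v for h, v in zip(row, z)) for row in H]

print(hutchinson_trace(h_matvec, n=3), "vs exact", sum(H[i][i] for i in range(3)))
```

The diagonal contributes exactly because $z_i^2 = 1$; only the off-diagonal terms add noise, and that noise averages out as the number of samples grows.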

HAWQ-V3 extends to integer-only quantization with mixed INT4/INT8 and hardware-aware latency constraints:

$$\min_{\{b_l\}} \sum_{l=1}^{L} \bar{\Omega}_l \cdot \mathbb{E}[\|\delta_l(b_l)\|^2] \quad \text{s.t.} \quad \text{LAT}(\{b_l\}) \leq T_{\text{target}}$$

where $\text{LAT}(\cdot)$ is the measured latency on target hardware.

HAQ: Hardware-Aware Quantization with Reinforcement Learning

HAQ frames mixed-precision quantization as a sequential decision problem solved by reinforcement learning.

State space: For each layer $l$, the state encodes:

Layer index, type (Conv, FC, Attention, etc.)
Input/output channels, kernel size
Number of parameters
Computational cost (FLOPs)
Current model size and latency

Action space: Choose a bit-width $b_l \in \{2, 3, \ldots, 8\}$ for layer $l$ (the HAQ paper uses a minimum of 2 and a maximum of 8 bits in its experiments).

Reward: After all layers are assigned, the reward is:

$$R = -\Delta \text{Accuracy} \quad \text{s.t.} \quad \text{Model size} \leq S_{\text{target}} \text{ or } \text{Latency} \leq T_{\text{target}}$$

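In the HAQ paper the constraint is enforced by limiting the action space rather than through the reward: if the proposed policy exceeds the resource budget, the bit-widths are decreased layer by layer until the budget is met.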

HAQ Reinforcement Learning Loop:

  RL Agent (DDPG)
       |
       |  action: bit-width for layer l
       v
  [Layer 0] --> [Layer 1] --> ... --> [Layer L-1]
       |              |                     |
       | state        | state               | state
       v              v                     v
  (layer info,   (layer info,          (layer info,
   remaining      remaining             remaining
   budget)        budget)               budget)
                                            |
                                            v
                                      Evaluate accuracy
                                            |
                                            v
                                        Reward R

HAQ uses DDPG (Deep Deterministic Policy Gradient), a continuous-action RL algorithm, where the continuous action is mapped to discrete bit-widths via rounding. The agent is trained on a proxy task (e.g., a few hundred calibration samples) and generalizes well.

Key HAQ findings:

On MobileNet-V2, HAQ achieves 2x compression with only 0.3% accuracy drop
Depthwise separable convolutions are assigned higher bit-widths (more sensitive)
The RL agent discovers hardware-specific patterns: on accelerators with efficient INT8 units, it prefers INT8 over INT4 even when INT4 would fit the size budget

Transformer and LLM-Specific Challenges

Activation Outliers

Transformers exhibit persistent activation outliers – individual features with magnitudes 10-100x larger than the rest. These outliers concentrate in a small number of hidden dimensions and recur across most tokens and many layers (analyzed by Dettmers et al. in the “LLM.int8()” paper).

Activation magnitude across hidden dimensions (typical LLM):

Magnitude
 100 |                    *
     |                    *
  50 |                    *
     |                    *
  10 |  * * **  *  *  *   *  * *  ** *  *  *  ** *
   5 | ** ****  ** ** ** * * ** ** ** ** ** *  ** **
   1 |*****************************************************
     +-----------------------------------------------------> Hidden dim
                          ^
                    Outlier channel(s)

These outliers cause catastrophic quantization error if quantized uniformly. Solutions include:

LLM.int8(): Mixed INT8/FP16 decomposition – outlier dimensions stay in FP16
SmoothQuant: Migrate quantization difficulty from activations to weights via a mathematically equivalent scaling transform
Rotation-based methods: Apply Hadamard rotation to spread outliers (as in QuIP#)
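
The SmoothQuant idea can be illustrated in a few lines: divide each activation channel by a factor $s_j$ and multiply the matching weight row by the same factor, which leaves the layer output unchanged. The sketch below uses the per-channel factor $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ from the SmoothQuant paper with $\alpha = 0.5$; the variable names and the toy numbers are assumptions for illustration.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel smoothing factors."""
    return [a ** alpha / w ** (1 - alpha)
            for a, w in zip(act_absmax, weight_absmax)]

def vecmat(vec, mat):
    """Row vector times matrix of shape (in_channels, out_channels)."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

x = [0.5, 60.0, -0.8]                 # one token; channel 1 is an outlier
W = [[0.2, -0.1],
     [0.05, 0.3],
     [-0.4, 0.1]]

s = smooth_scales([abs(v) for v in x],
                  [max(abs(v) for v in row) for row in W])
x_smooth = [v / sj for v, sj in zip(x, s)]
W_smooth = [[v * sj for v in row] for row, sj in zip(W, s)]

print(vecmat(x, W), vecmat(x_smooth, W_smooth))       # same output
print(max(abs(v) for v in x), max(abs(v) for v in x_smooth))
```

In a real model the activation statistics come from calibration data over many tokens rather than from a single vector.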

KV-Cache Quantization

The Key-Value (KV) cache is a major memory bottleneck during autoregressive LLM inference. For each token generated, the KV cache grows by:

$$\Delta_{\text{KV}} = 2 \times L \times H \times d_h \times b$$

where $L$ is the number of layers, $H$ is the number of KV heads (which may differ from query heads in GQA), $d_h$ is the head dimension, and $b$ is the bytes per element.

Total KV-cache memory for a sequence of length $n$:

$$M_{\text{KV}} = 2 \times L \times H \times d_h \times n \times b$$

Concrete example – LLaMA-2 70B with 32K context:

Parameter	Value
Layers ($L$)	80
KV heads ($H$, GQA)	8
Head dimension ($d_h$)	128
Sequence length ($n$)	32768

$$M_{\text{KV}}^{\text{FP16}} = 2 \times 80 \times 8 \times 128 \times 32768 \times 2 \text{ bytes} \approx 10.74 \text{ GB}$$

$$M_{\text{KV}}^{\text{INT4}} = 2 \times 80 \times 8 \times 128 \times 32768 \times 0.5 \text{ bytes} \approx 2.68 \text{ GB}$$

$$M_{\text{KV}}^{\text{INT2}} = 2 \times 80 \times 8 \times 128 \times 32768 \times 0.25 \text{ bytes} \approx 1.34 \text{ GB}$$
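
The memory formula is simple enough to script, which is handy for trying other model shapes and context lengths. The function name is hypothetical, and the result ignores the scales and zero-points that a quantized cache would also have to store.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    """Keys plus values for one sequence (the leading 2 counts K and V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Configuration from the table above, at three element sizes.
for name, nbytes in (("FP16", 2), ("INT4", 0.5), ("INT2", 0.25)):
    total = kv_cache_bytes(80, 8, 128, 32768, nbytes)
    print(f"{name}: {total / 1e9:.2f} GB")
```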

KV-cache quantization approaches:

Method	Bits	Key Insight	Quality
KIVI	K:2, V:2	Per-channel K, per-token V quantization	~0.1 PPL increase
KVQuant	2-4	Sensitivity-aware, non-uniform	< 0.1 PPL increase
Gear	2-4	Low-rank + sparse residual	Minimal loss
CacheQuant	4	Outlier-aware dynamic quantization	< 0.05 PPL increase

A key asymmetry: Keys and Values have different quantization sensitivities. Keys participate in the softmax attention computation where small errors can shift probability mass significantly, while Values are linearly combined. However, Keys tend to have more structured distributions (amenable to per-channel quantization), while Values have more per-token variation.

Attention Score Quantization

The attention mechanism involves:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Quantizing the intermediate attention scores ($QK^T$) and the post-softmax probabilities is challenging because:

Pre-softmax scores can have large dynamic range across heads and positions
Post-softmax probabilities are in $[0, 1]$ with a heavy-tailed distribution (most values near 0, a few near 1)
Causal masking introduces discontinuities (negative infinity values)

Effective strategies:

Quantize $Q$ and $K$ to INT8 with per-head scaling, compute $QK^T$ in INT32, then dequantize before softmax
Keep softmax computation in FP16/FP32 (numerically sensitive)
Quantize the attention output ($\text{softmax} \times V$) to INT8

Quantized Attention Computation:

  Q (INT8) x K^T (INT8) -> S (INT32) -> dequant -> S (FP16)
                                                      |
                                                   softmax (FP16)
                                                      |
                                                   P (FP16)
                                                      |
                                                   quantize -> P (INT8)
                                                      |
                                         P (INT8) x V (INT8) -> O (INT32)
                                                                   |
                                                                dequant -> O (FP16)

Vision Transformer Quantization

Vision Transformers (ViTs) present distinct quantization challenges compared to language models:

ViT-Specific Challenges

Post-LayerNorm activations: The activations that follow LayerNorm in ViTs show severe inter-channel variation, so a single per-tensor scale fits some channels poorly and per-tensor quantization becomes lossy
Softmax attention bottleneck: ViTs process all spatial tokens simultaneously (no causal mask), leading to attention maps with very high entropy. Small quantization errors in attention probabilities can shift focus to wrong spatial regions.
Patch embedding sensitivity: The initial patch embedding layer projects raw pixel values to token representations. Quantization errors here propagate through the entire network.
Class token dependence: Classification ViTs rely on a single [CLS] token, making the network especially sensitive to quantization error that affects this token’s representation.

Quantization strategies for ViTs:

Strategy	Description	Typical Accuracy Impact
PTQ4ViT	Twin uniform quantization for softmax, Hessian-guided	-0.5% at W4A4
FQ-ViT	Power-of-two factor for LayerNorm, log2 quantizer for softmax	-0.3% at W4A4
RepQ-ViT	Reparameterize LayerNorm and softmax to quantization-friendly forms	-0.5% at W4A4
I-ViT	Integer-only ViT with Shiftmax and ShiftGELU	-0.2% at W8A8
NoisyQuant	Add fixed noise before quantization to break outlier structure	-0.4% at W8A8

Log2 quantizer for post-softmax values:

Since attention probabilities follow a roughly log-normal distribution after softmax, a log-scale quantizer is more appropriate:

$$q = \text{clamp}\left(\lfloor -\log_2(p) \rceil, 0, 2^b - 1\right)$$

$$\hat{p} = 2^{-q}$$

This places more quantization levels near zero (where most probabilities lie) and fewer near one.
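The two formulas above translate almost directly into NumPy. The sketch below is illustrative: the function names are made up, and the clamp of `p` away from zero is an added safeguard so that `log2(0)` never occurs.

```python
import numpy as np

def log2_quantize(p, bits=4):
    """q = clamp(round(-log2(p)), 0, 2^b - 1) for probabilities p in [0, 1]."""
    p = np.maximum(p, 2.0 ** -(2 ** bits))   # avoid log2(0); maps to the last level
    q = np.clip(np.round(-np.log2(p)), 0, 2 ** bits - 1)
    return q.astype(np.uint8)

def log2_dequantize(q):
    """p_hat = 2^(-q)."""
    return np.exp2(-q.astype(np.float32))

probs = np.array([1.0, 0.5, 0.3, 0.01, 0.0])
codes = log2_quantize(probs, bits=4)
recovered = log2_dequantize(codes)
```

Because the reconstruction levels are powers of two, multiplying by a dequantized probability can be implemented as a bit shift on integer hardware.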

Diffusion Model Quantization
#

Diffusion models (DDPM, Stable Diffusion, DALL-E, etc.) introduce unique quantization challenges due to their iterative denoising process.

Time-Step Dependent Distributions
#

The core challenge: diffusion models are evaluated at many different noise levels (time steps), and the activation distributions change dramatically across time steps.

Activation distribution at different time steps:

t = 0 (clean):     t = 500 (medium):     t = 1000 (noisy):
    ***                 ****                    *****
   *****               ******                 ********
  *******             ********               **********
 *********           **********             ************
  narrow,             moderate,               wide,
  sharp peak          broader                very spread out

A single set of quantization parameters (scale, zero-point) cannot optimally handle all time steps. Solutions include:

Time-step aware quantization (TDQ): Maintain separate quantization parameters for different time-step ranges
Temporal information-aware quantization: Use the time-step embedding to dynamically adjust quantization parameters
PTQ4DM: Calibrate quantization parameters on a representative set of time steps
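As a rough illustration of the first idea (separate parameters per time-step range), the sketch below keeps one symmetric INT8 scale per group of time steps. The class name, the equal-width grouping, and the max-abs calibration rule are all assumptions made for this example; published methods use more careful calibration.

```python
import numpy as np

class TimestepGroupedQuantizer:
    """Hypothetical helper: one symmetric INT8 scale per time-step range."""

    def __init__(self, num_timesteps=1000, num_groups=4):
        self.group_size = -(-num_timesteps // num_groups)   # ceiling division
        self.scales = np.ones(num_groups, dtype=np.float32)
        self._max_abs = np.zeros(num_groups, dtype=np.float32)

    def group_of(self, t):
        """Map an integer time step to its group index."""
        return min(t // self.group_size, len(self.scales) - 1)

    def calibrate(self, t, activations):
        """Track the largest magnitude seen in the group that contains t."""
        g = self.group_of(t)
        self._max_abs[g] = max(self._max_abs[g], float(np.max(np.abs(activations))))
        self.scales[g] = max(float(self._max_abs[g]), 1e-8) / 127.0

    def quantize(self, t, x):
        """Quantize x with the scale of the group that contains t."""
        s = self.scales[self.group_of(t)]
        return np.clip(np.round(x / s), -127, 127).astype(np.int8), s
```

With this layout, early (narrow) and late (wide) time steps no longer share a scale, so the narrow distributions are not forced onto a grid sized for the wide ones.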

Diffusion-Specific Methods
#

Method	Approach	Result
Q-Diffusion	Time-step aware PTQ, shortcut-splitting	W4A8 with < 0.5 FID increase
PTQD	Time-step grouping, correlation-aware	W4A8 competitive with FP32
TDQ	Dedicated scales per time-step group	W8A8 near-lossless
EfficientDM	QAT with quantization-aware low-rank adaptation	W4A4 with minor FID increase

Error accumulation is a critical issue: in diffusion models, the output of step $t$ becomes the input to step $t-1$. Quantization errors accumulate across the 20-50+ denoising steps:

$$\epsilon_{\text{total}} \approx \sum_{t=T}^{1} \epsilon_t \cdot \prod_{s=1}^{t-1} (1 + \alpha_s)$$

where $\epsilon_t$ is the per-step quantization error and $\alpha_s$ captures error amplification. This makes diffusion models more sensitive to quantization than single-pass models.
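To get a feel for the formula, the following sketch evaluates it for given per-step errors and amplification factors. The function name and the numbers in the example are illustrative assumptions, not measurements from any model.

```python
def accumulated_error(step_errors, amplification):
    """Evaluate sum_t eps_t * prod_{s=1}^{t-1} (1 + alpha_s).

    step_errors[t-1] holds eps_t and amplification[s-1] holds alpha_s,
    for steps 1..T.
    """
    total = 0.0
    growth = 1.0                     # empty product for t = 1
    for eps_t, alpha_t in zip(step_errors, amplification):
        total += eps_t * growth
        growth *= 1.0 + alpha_t      # extend the product to include step t
    return total

# Illustrative values: 50 steps, constant per-step error and amplification.
no_amplification = accumulated_error([0.01] * 50, [0.0] * 50)
with_amplification = accumulated_error([0.01] * 50, [0.02] * 50)
```

With these made-up values, the total grows from 0.5 without amplification to roughly 0.85 with a 2% per-step amplification, which is why even small per-step gains matter over long sampling schedules.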

Practical recommendation for Stable Diffusion:

UNet: W8A8 is safe; W4A8 is achievable with careful calibration; W4A4 requires QAT
VAE decoder: Keep at FP16 (highly sensitive, runs only once)
Text encoder (CLIP): W8A8 is typically safe
Time-step embedding MLP: Keep at higher precision (FP16 or INT8)

Inference Optimization and the Roofline Model
#

The Roofline Model for Quantized Inference
#

Understanding when quantization actually speeds up inference requires the roofline model, which characterizes computation as either compute-bound or memory-bound.

Arithmetic intensity (operational intensity):

$$I = \frac{\text{FLOPs}}{\text{Bytes transferred}}$$

The roofline model defines achievable performance as:

$$\text{Performance} = \min\left(\text{Peak FLOPS}, \quad I \times \text{Memory Bandwidth}\right)$$

Roofline Model with Quantization:

Performance
(TOPS)        Peak INT4
  |          /   Peak INT8
  |         /  /   Peak FP16
  |        / /  /
  |       //  /
  |      /  /
  |     / /     <-- Compute-bound region
  |    //        (quantization helps with peak TOPS)
  |   //
  |  /  <-- Memory-bound region
  | /    (quantization helps with bandwidth)
  |/
  +-----------------------------------------> Arithmetic Intensity
                                              (FLOPs/Byte)
          ^           ^
          |           |
     LLM decode   LLM prefill / CNN batch
     (batch=1)     inference
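The roofline formula is simple enough to compute directly. In the sketch below, the function names and the hardware figures (100 TOPS peak, 1000 GB/s bandwidth) are hypothetical values chosen for round numbers, not the specification of a real device.

```python
def attainable_tops(arithmetic_intensity, peak_tops, bandwidth_gbs):
    """min(peak, I * BW), with intensity in ops/byte and bandwidth in GB/s."""
    memory_roof_tops = arithmetic_intensity * bandwidth_gbs / 1000.0
    return min(peak_tops, memory_roof_tops)

def is_memory_bound(arithmetic_intensity, peak_tops, bandwidth_gbs):
    """True when the workload sits left of the ridge point of the roofline."""
    ridge_point = peak_tops * 1000.0 / bandwidth_gbs   # ops/byte where the roofs meet
    return arithmetic_intensity < ridge_point

# Hypothetical device: 100 TOPS peak, 1000 GB/s memory bandwidth.
decode_like = attainable_tops(2.0, 100.0, 1000.0)     # low intensity
prefill_like = attainable_tops(500.0, 100.0, 1000.0)  # high intensity
```

On this hypothetical device the ridge point is 100 ops/byte, so the low-intensity case is limited by bandwidth while the high-intensity case reaches the compute peak.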

LLM inference phases:

Prefill (prompt processing): High arithmetic intensity (large matrix multiplications over many tokens), so it is often compute-bound. Quantization helps by raising peak throughput: on GPUs that expose INT4 Tensor Cores, peak INT4 throughput is typically about 2x that of INT8.
Decode (token generation): Low arithmetic intensity (matrix-vector multiply, batch size = 1). Almost always memory-bound. Quantization helps primarily by reducing memory bandwidth requirements.

For the decode phase, the speedup from quantization is approximately:

$$\text{Speedup}_{\text{decode}} \approx \frac{b_{\text{original}}}{b_{\text{quantized}}} \times \frac{\text{BW}_{\text{quantized}}}{\text{BW}_{\text{original}}}$$

For INT4 vs FP16 on the same hardware (bandwidth ratio = 1):

$$\text{Speedup}_{\text{decode}} \approx \frac{16}{4} = 4\times$$

In practice, the speedup is lower (2-3x) due to dequantization overhead, group scale fetching, and non-weight memory accesses (KV cache, activations).
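A back-of-the-envelope estimate shows why the ideal 4x shrinks once non-weight traffic is included. The sketch assumes that each decoded token streams every weight from memory exactly once; the function name, the 1000 GB/s bandwidth, and the 1 GB of extra traffic per token are illustrative assumptions.

```python
def decode_tokens_per_sec(params_billion, bits_per_weight, bandwidth_gbs, extra_gb=0.0):
    """Bandwidth-bound upper limit on decode throughput.

    extra_gb models non-weight bytes read per token (KV cache, activations, scales).
    """
    weight_gb = params_billion * bits_per_weight / 8.0
    return bandwidth_gbs / (weight_gb + extra_gb)

# 7B parameters on a hypothetical 1000 GB/s device.
ideal_speedup = (decode_tokens_per_sec(7, 4, 1000.0)
                 / decode_tokens_per_sec(7, 16, 1000.0))
realistic_speedup = (decode_tokens_per_sec(7, 4, 1000.0, extra_gb=1.0)
                     / decode_tokens_per_sec(7, 16, 1000.0, extra_gb=1.0))
```

Under these assumptions the weights-only ratio is 4x, while adding the same 1 GB of extra traffic to both configurations lowers it to about 3.3x, before any dequantization cost is counted.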

Dequantization Overhead
#

Quantized weights must be dequantized before computation (or during, in fused kernels). The dequantization cost depends on the quantization scheme:

Scheme	Dequant Operations per Weight	Relative Overhead
Per-tensor symmetric	1 multiply	Very low
Per-channel symmetric	1 multiply	Low
Per-group affine (g=128)	1 multiply + 1 add	Low
NF4 (lookup table)	1 table lookup + 1 multiply	Medium
AQLM (codebook)	1-2 table lookups + 1 add	Medium-High
QuIP# (E8 lattice + rotation)	Lattice decode + Hadamard transform	High

Efficient GPU kernels (e.g., from Marlin, ExLlamaV2, or TensorRT-LLM) fuse dequantization with the matrix multiply, hiding most of the overhead behind the memory latency of loading weights.
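For reference, the per-group affine row of the table corresponds to the following NumPy sketch. It is an unfused, illustrative version with hypothetical function names; it stores the group minimum as a floating-point offset so that dequantization is exactly one multiply and one add per weight.

```python
import numpy as np

def quantize_per_group(w, bits=4, group_size=128):
    """Affine quantization with one scale and one offset per group of weights."""
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    qmax = 2 ** bits - 1
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    q = np.clip(np.round((groups - w_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, w_min

def dequantize_per_group(q, scale, offset, shape):
    """One multiply and one add per weight, then restore the original shape."""
    return (q.astype(np.float32) * scale + offset).reshape(shape)
```

Real kernels additionally pack two 4-bit codes into each byte and perform this arithmetic in registers while the next tile of weights is being loaded.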

End-to-End Throughput Comparison
#

The following table compares practical inference throughput for a 7B-parameter LLM on a single NVIDIA RTX 4090 (24 GB VRAM):

Quantization	Bits/Weight	Model Size	Tokens/sec (decode)	Perplexity (WikiText-2)
FP16	16	14.0 GB	~35	5.68 (baseline)
GPTQ INT8	8	7.0 GB	~65	5.69
GPTQ INT4 (g128)	4.25	4.0 GB	~110	5.85
AWQ INT4 (g128)	4.25	4.0 GB	~115	5.79
GGUF Q4_K_M	4.85	4.6 GB	~100 (CPU+GPU)	5.82
GGUF Q3_K_M	3.875	3.5 GB	~120 (CPU+GPU)	6.15
GGUF Q2_K	2.5625	2.5 GB	~135 (CPU+GPU)	7.89
QuIP# 2-bit	2	2.0 GB	~80	6.45
AQLM 2-bit	2	2.0 GB	~75	6.32
BitNet 1.58b	1.58	~1.6 GB	~150 (specialized)	~5.70 (trained)

Note: BitNet requires training from scratch with ternary weights; all others are post-training quantization applied to a pre-trained FP16 model.

State-of-the-Art Comparison (2024-2025)
#

The following table summarizes the major quantization methods, their characteristics, and results as of early 2025:

Method	Year	Type	Bits	Calibration Data	Key Innovation	LLaMA-2 7B PPL	LLaMA-2 70B PPL
GPTQ	2022	PTQ	3-8	Yes (128 samples)	OBQ with lazy batching	6.29 (4-bit)	3.85 (4-bit)
AWQ	2023	PTQ	3-8	Yes (small)	Activation-aware scaling	5.89 (4-bit)	3.56 (4-bit)
SqueezeLLM	2023	PTQ	3-4	Yes	Dense-and-sparse; non-uniform	5.88 (4-bit)	–
QuIP	2023	PTQ	2-4	Yes	Incoherence processing	6.90 (2-bit)	4.55 (2-bit)
QuIP#	2023	PTQ	2-4	Yes	E8 lattice, Hadamard rotation	6.45 (2-bit)	4.15 (2-bit)
AQLM	2024	PTQ	2-4	Yes	Multi-codebook additive VQ	6.32 (2-bit)	4.02 (2-bit)
HQQ	2023	PTQ	2-8	No	Half-quadratic optimization	6.58 (4-bit)	3.68 (4-bit)
GGUF IQ2_XS	2024	PTQ	2.3	Yes	Importance-weighted lattice	7.21 (2.3-bit)	4.42 (2.3-bit)
OmniQuant	2023	PTQ/QAT	2-8	Yes	Learnable weight clipping + equiv. transform	5.86 (4-bit)	3.54 (4-bit)
QLoRA NF4	2023	QAT	4	Training data	NF4 + double quantization	5.70* (fine-tuned)	–
SpQR	2023	PTQ	3-4	Yes	Sparse outlier + dense quantized	5.84 (4-bit)	3.53 (4-bit)
SmoothQuant	2022	PTQ	W8A8	Yes	Smoothing transform for activations	– (W8A8)	– (W8A8)
KIVI	2024	PTQ	KV:2	Yes	Asymmetric K/V quantization	~0.1 PPL increase	~0.1 PPL increase
BitNet b1.58	2024	QAT	1.58	Training data	Ternary weights from scratch	~5.7 (trained)	–
OneBit	2024	QAT	1	Training data	1-bit with value-aware knowledge distillation	~6.2 (trained)	–
EfficientQAT	2024	QAT	2-4	Training data	Block-wise QAT + end-to-end	5.72 (4-bit)	3.42 (4-bit)

*QLoRA perplexity varies by fine-tuning task and dataset.

Key takeaways from the 2024-2025 landscape:

4-bit is the sweet spot for post-training quantization: methods like AWQ, GPTQ, and HQQ achieve near-lossless compression at 4x size reduction.
2-bit PTQ is viable for large models: QuIP#, AQLM, and GGUF IQ variants push the frontier below 3 bits, with 70B+ models maintaining reasonable quality. The larger the model, the more gracefully it quantizes.
Going below 2 bits requires training-aware methods: BitNet b1.58 demonstrates that training from scratch with extreme quantization can match full-precision performance, but this requires the full training compute budget.
KV-cache quantization is critical: For long-context applications, KV-cache memory can exceed model weight memory. Specialized methods like KIVI enable 2-bit KV caches with minimal quality loss.
Hardware support is evolving: NVIDIA Blackwell (B100/B200) adds native FP4 Tensor Cores. AMD MI300X supports FP8. Custom silicon (Groq, Cerebras) increasingly targets INT4/INT8. Software stacks (TensorRT-LLM, vLLM, llama.cpp) are key enablers.

Practical Decision Guide
#

Choosing a quantization strategy depends on your constraints. Here is a decision framework:

                         START
                           |
                    Do you have training
                    compute budget?
                      /          \
                    Yes           No
                    /               \
             Need <2 bits?     Need <3 bits?
               /    \            /        \
             Yes    No         Yes        No
             /        \        /            \
        BitNet     QLoRA    AQLM/         AWQ/GPTQ
        b1.58      NF4     QuIP#          INT4
        (train     (fine-   (2-bit        (4-bit PTQ,
        from       tune     PTQ)          best balance)
        scratch)   adapter)    |               |
                              |               |
                         Need fast        Hardware-
                         quantization?    specific?
                           /    \          /     \
                         Yes    No       Yes     No
                         /        \      /         \
                       HQQ     AQLM   HAQ       AWQ + Marlin
                      (no cal) (better (RL-based  kernel
                               quality) search)
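The same tree can be written as a small function, which is convenient when the choice has to be made inside a deployment script. The function name and argument names are hypothetical; the branches mirror the diagram and nothing more.

```python
def choose_quantization(has_training_budget, target_bits,
                        need_fast_quantization=False, hardware_specific=False):
    """Walk the decision tree above and return the suggested method."""
    if has_training_budget:
        if target_bits < 2:
            return "BitNet b1.58 (train from scratch)"
        return "QLoRA NF4 (fine-tune adapter)"
    if target_bits < 3:
        # 2-bit PTQ branch (AQLM / QuIP#).
        if need_fast_quantization:
            return "HQQ (no calibration)"
        return "AQLM (better quality)"
    # 4-bit PTQ branch (AWQ / GPTQ INT4).
    if hardware_specific:
        return "HAQ (RL-based search)"
    return "AWQ + Marlin kernel"

suggestion = choose_quantization(has_training_budget=False, target_bits=4)
```

Treat the result as a starting point: the tree ignores constraints such as context length or available kernels, which may shift the choice in practice.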

Conclusion
#

Extreme and mixed-precision quantization has progressed from an academic curiosity to a practical necessity. The key developments of 2024-2025 demonstrate that:

FP8 is increasingly the default for large-scale training on hardware with native support (for example NVIDIA Hopper and Blackwell, and AMD MI300X).
INT4 with group quantization (AWQ, GPTQ, GGUF K-quants) is the production standard for LLM inference.
2-bit quantization (QuIP#, AQLM) is practical for the largest models (70B+), enabling single-GPU deployment of models that previously required multiple GPUs.
1.58-bit (BitNet b1.58) points toward a future where extreme quantization is built into the training process. With ternary weights, most weight multiplications reduce to additions and subtractions, which could greatly reduce the need for floating-point multiply hardware.
Mixed-precision strategies (HAWQ, HAQ) provide the theoretical and practical framework for optimally allocating bits across heterogeneous model components.

The field continues to advance rapidly. As new architectures (Mixture of Experts, State Space Models, hybrid designs) and new hardware (FP4 Tensor Cores, custom accelerators) emerge, the quantization landscape will continue to evolve. The fundamental principle remains: compress aggressively where the model is robust, preserve precision where it is sensitive, and always measure on your target task and hardware.

References
#

Micikevicius et al., “FP8 Formats for Deep Learning,” arXiv:2209.05433 (2022)
Dettmers et al., “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,” NeurIPS 2022
Dettmers et al., “QLoRA: Efficient Finetuning of Quantized Language Models,” NeurIPS 2023
Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,” ICLR 2023
Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” MLSys 2024
Chee et al., “QuIP: 2-Bit Quantization of Large Language Models With Guarantees,” NeurIPS 2023
Tseng et al., “QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks,” ICML 2024
Egiazarian et al., “AQLM: Extreme Compression of Large Language Models via Additive Quantization,” ICML 2024
Badri & Shaji, “Half-Quadratic Quantization of Large Machine Learning Models,” Mobius Labs blog (2023)
Dong et al., “HAWQ: Hessian AWare Quantization of Neural Networks,” ICCV 2019
Wang et al., “HAQ: Hardware-Aware Automated Quantization with Mixed Precision,” CVPR 2019
Ma et al., “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits,” arXiv:2402.17764 (2024)
Liu et al., “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache,” arXiv:2402.02750 (2024)
Li et al., “Q-Diffusion: Quantizing Diffusion Models,” ICCV 2023
Yuan et al., “PTQ4ViT: Post-Training Quantization for Vision Transformers,” ECCV 2022
Xiao et al., “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models,” ICML 2023

