Overview
#

Modern deep learning models are remarkably powerful, but their size and computational demands present serious deployment challenges. A single GPT-class large language model can exceed hundreds of billions of parameters, each stored as a 32-bit floating-point number. That translates to hundreds of gigabytes of memory just for the weights alone — before we even consider activations, gradients, or optimizer states.

Quantization is the systematic process of reducing the numerical precision of a model’s weights and activations — for example, converting from 32-bit floating point (FP32) to 8-bit integer (INT8). This seemingly simple transformation yields profound benefits across every axis that matters for deployment:

  • Memory reduction: Storing a weight in INT8 instead of FP32 cuts memory by 4x. For a 70-billion-parameter model, that is the difference between requiring 280 GB and 70 GB — the difference between a multi-GPU cluster and a single high-end GPU.
  • Compute throughput: Modern hardware (NVIDIA Tensor Cores, Google TPUs, Apple Neural Engine) provides 2x to 4x higher throughput for INT8 operations compared to FP32. Lower-precision formats like INT4 and FP8 push this even further.
  • Latency: Fewer bits means less data movement across the memory hierarchy. Inference is frequently memory-bandwidth-bound (autoregressive LLM decoding at small batch sizes is the typical case), so quantization often reduces wall-clock latency directly.
  • Power efficiency: Smaller operands require less energy per operation. An INT8 multiply consumes roughly 18x less energy than an FP32 multiply in typical CMOS implementations.
  • Edge deployment: Microcontrollers, mobile SoCs, and dedicated AI accelerators often lack FP32 hardware entirely. Quantization is not optional for these targets — it is a hard requirement.

The Precision-Accuracy Tradeoff
#

The central tension in quantization is the precision-accuracy tradeoff. Every reduction in numerical precision introduces quantization error — a form of noise injected into the computation. The key insight is that neural networks are remarkably robust to this noise, far more so than most numerical algorithms. This robustness stems from several properties:

  1. Neural networks are trained with stochastic gradient descent, which itself injects noise. The learned representations are therefore inherently noise-tolerant.
  2. The loss landscape around a well-trained model’s minimum is typically flat, meaning small perturbations to weights do not catastrophically change outputs.
  3. Redundancy in over-parameterized networks means that many weights carry overlapping information.

The practical consequence is that we can often quantize models to INT8 with negligible accuracy loss (less than 0.1% on standard benchmarks), and even to INT4 with careful technique and modest degradation. The goal of quantization research is to push this frontier: achieve the lowest possible precision with the least possible accuracy loss.


Number Representation Basics
#

Before we can understand quantization, we must understand how numbers are represented in hardware. This section covers every format relevant to modern deep learning inference.

IEEE 754 Floating Point: FP32, FP16, BF16
#

The IEEE 754 standard defines floating-point formats as a triplet of fields: sign, exponent, and mantissa (also called significand or fraction). A floating-point number represents the value:

$$v = (-1)^{s} \times 2^{e - \text{bias}} \times (1 + m)$$

where \(s\) is the sign bit, \(e\) is the stored (biased) exponent, \(\text{bias}\) is a format-specific constant, and \(m\) is the fractional part of the mantissa (with an implicit leading 1 for normalized numbers).

FP32 (Single Precision) — 32 bits total

Bit layout (32 bits):
 31   30        23   22                           0
+----+------------+------------------------------+
| S  |  Exponent  |          Mantissa            |
| 1  |   8 bits   |          23 bits             |
+----+------------+------------------------------+

S = Sign (1 bit)
E = Exponent (8 bits), bias = 127
M = Mantissa (23 bits)

Value = (-1)^S x 2^(E-127) x (1.M)

Dynamic range: ~1.18e-38  to  ~3.40e+38
Precision:     ~7.2 decimal digits

Example: representing the number 6.625 in FP32.

  • 6.625 in binary: 110.101
  • Normalized: 1.10101 x 2^2
  • Sign = 0 (positive)
  • Exponent = 2 + 127 = 129 = 10000001 in binary
  • Mantissa = 10101000000000000000000
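The field values in this example can be checked mechanically. The following is a minimal sketch using Python's standard `struct` module; the helper name `fp32_fields` is introduced here for illustration only.

```python
import struct

def fp32_fields(value: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, mantissa) of value encoded as IEEE 754 binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                  # top bit
    exponent = (bits >> 23) & 0xFF     # next 8 bits (still biased by 127)
    mantissa = bits & 0x7FFFFF         # low 23 bits (fraction without the implicit 1)
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(6.625)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# prints: 0 10000001 10101000000000000000000
```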

FP16 (Half Precision) — 16 bits total

Bit layout (16 bits):
 15   14    10   9                0
+----+--------+------------------+
| S  |  Exp   |    Mantissa      |
| 1  | 5 bits |    10 bits       |
+----+--------+------------------+

S = Sign (1 bit)
E = Exponent (5 bits), bias = 15
M = Mantissa (10 bits)

Value = (-1)^S x 2^(E-15) x (1.M)

Dynamic range: ~6.10e-5   to  ~6.55e+4
Precision:     ~3.3 decimal digits

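FP16 halves the memory of FP32, but its limited dynamic range is a problem during training. Large activations can overflow past the maximum value of 65504, and small gradient values can underflow to zero below the representable minimum. Loss scaling, which multiplies the loss by a constant before backpropagation and divides the gradients by it afterward, addresses the underflow side and is why FP16 training requires it.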

BF16 (Brain Floating Point) — 16 bits total

Bit layout (16 bits):
 15   14       7   6              0
+----+---------+-----------------+
| S  |   Exp   |    Mantissa     |
| 1  |  8 bits |    7 bits       |
+----+---------+-----------------+

S = Sign (1 bit)
E = Exponent (8 bits), bias = 127
M = Mantissa (7 bits)

Value = (-1)^S x 2^(E-127) x (1.M)

Dynamic range: ~1.18e-38  to  ~3.40e+38  (same as FP32!)
Precision:     ~2.4 decimal digits

BF16 was designed by Google Brain specifically for deep learning. It preserves the full dynamic range of FP32 (same 8-bit exponent) while sacrificing precision (7 mantissa bits vs. 23). This is an excellent tradeoff for neural networks because:

  • The dynamic range prevents overflow/underflow without loss scaling.
  • The reduced precision is tolerable because neural network computations are noise-tolerant.
  • Conversion to/from FP32 is trivial: just truncate or zero-pad the lower 16 mantissa bits.
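The truncate/zero-pad conversion can be sketched with plain bit manipulation. This is an illustrative sketch, not a production routine: the function names are hypothetical, and real conversions typically round to nearest even instead of truncating.

```python
import struct

def fp32_to_bf16_bits(value: float) -> int:
    """Keep the upper 16 bits of the FP32 encoding (truncation, i.e. round toward zero)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits >> 16

def bf16_bits_to_fp32(bits16: int) -> float:
    """Zero-pad a BF16 bit pattern back to a full FP32 value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value

x = 3.14159
x_bf16 = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
print(x_bf16)   # prints: 3.140625
```

The result keeps only 7 fraction bits, so the relative error stays below 2^-7, while the exponent field is untouched.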

Fixed-Point Representation
#

Fixed-point numbers use a fixed number of integer bits and fractional bits. For a format denoted Q\(m\).\(n\) (where \(m\) is integer bits and \(n\) is fractional bits, plus one sign bit):

$$v = -s \cdot 2^{m} + \sum_{i=0}^{m-1} b_i \cdot 2^{i} + \sum_{j=1}^{n} b_{-j} \cdot 2^{-j}$$

Example: Q3.4 format (8 bits total: 1 sign + 3 integer + 4 fractional)

  Bit:    S    2^2   2^1   2^0  .  2^-1  2^-2  2^-3  2^-4
         [0]   [0]   [1]   [1]  .  [1]   [0]   [1]   [0]

Value = -0*8 + 0*4 + 1*2 + 1*1 + 1*0.5 + 0*0.25 + 1*0.125 + 0*0.0625
      = 3.625

Range:   [-8.0, +7.9375]
Step:    0.0625  (= 2^-4)

Fixed-point is heavily used in DSPs and microcontrollers. Its main advantage is that addition and subtraction use the same hardware as integer operations. Multiplication requires a post-shift to realign the radix point. The disadvantage is the rigid tradeoff between range and precision — you must choose the radix position at design time.
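A small sketch of Q3.4 arithmetic makes the post-shift concrete. The helper names below are hypothetical; the values are stored as ordinary integers scaled by 2^4.

```python
FRACTIONAL_BITS = 4   # Q3.4: 1 sign + 3 integer + 4 fractional bits
TOTAL_BITS = 8

def q34_encode(value: float) -> int:
    """Encode a real value as a signed 8-bit Q3.4 integer, saturating at the range limits."""
    raw = round(value * (1 << FRACTIONAL_BITS))
    lowest, highest = -(1 << (TOTAL_BITS - 1)), (1 << (TOTAL_BITS - 1)) - 1
    return max(lowest, min(highest, raw))

def q34_decode(raw: int) -> float:
    return raw / (1 << FRACTIONAL_BITS)

def q34_multiply(a: int, b: int) -> int:
    """The raw product carries 8 fractional bits, so shift right by 4 to realign the radix point."""
    return (a * b) >> FRACTIONAL_BITS

a = q34_encode(3.625)   # stored as the integer 58
b = q34_encode(1.5)     # stored as the integer 24
print(a, q34_decode(a), q34_decode(q34_multiply(a, b)))
# prints: 58 3.625 5.4375
```

Addition needs no correction (`a + b` is already in Q3.4), which is the hardware advantage described above. This sketch does not saturate the product; a real implementation would also handle overflow there.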

Integer Representation: INT8 and INT4
#

Integer formats are the most common quantization targets because integer arithmetic units are small, fast, and energy-efficient.

INT8 (signed and unsigned)

Range:  [-128, +127]   (two's complement)
        [0, 255]       (unsigned)
Values: 256 discrete levels

INT4 (signed and unsigned)

Range:  [-8, +7]       (two's complement)
        [0, 15]        (unsigned)
Values: 16 discrete levels

INT4 provides 8x compression over FP32 but with only 16 representable values per quantization group. This extreme compression requires sophisticated techniques (group quantization, mixed precision) to maintain accuracy.

FP8 Formats: E4M3 and E5M2
#

FP8 is a recently standardized family of 8-bit floating-point formats, proposed jointly by NVIDIA, Arm, and Intel and published as the OFP8 specification through the Open Compute Project. Two variants exist, optimized for different use cases:

E4M3 (4-bit exponent, 3-bit mantissa)

Bit layout (8 bits):
  7    6      3    2      0
+----+--------+----------+
| S  |  Exp   | Mantissa |
| 1  | 4 bits | 3 bits   |
+----+--------+----------+

Bias = 7
Dynamic range: ~1.56e-2 to 448  (smallest subnormal: ~1.95e-3)
Precision:     ~1.2 decimal digits
Special: NaN = 0x7F (S=0,E=1111,M=111), no Inf representation

E5M2 (5-bit exponent, 2-bit mantissa)

Bit layout (8 bits):
  7    6      2    1    0
+----+--------+----+----+
| S  |  Exp   | Mantissa |
| 1  | 5 bits | 2 bits   |
+----+--------+----------+

Bias = 15
Dynamic range: ~6.10e-5 to 57344
Precision:     ~0.9 decimal digits
Special: Inf and NaN follow IEEE 754 conventions

The design philosophy is:

  • E4M3 for weights and forward activations: more precision (3 mantissa bits), moderate range.
  • E5M2 for gradients during training: wider dynamic range (5 exponent bits) to handle gradient magnitudes, accepting lower precision.
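To make the two layouts concrete, here is a hedged sketch of a decoder for both variants. The function names are introduced for illustration, and the special-value handling follows the rules listed above (E4M3: only the all-ones pattern is NaN; E5M2: IEEE-style Inf and NaN).

```python
import math

def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int, ieee_specials: bool) -> float:
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    max_exponent = (1 << exp_bits) - 1
    max_mantissa = (1 << man_bits) - 1

    if ieee_specials and exponent == max_exponent:
        # E5M2: all-ones exponent encodes Inf (mantissa 0) or NaN
        return sign * math.inf if mantissa == 0 else math.nan
    if not ieee_specials and exponent == max_exponent and mantissa == max_mantissa:
        # E4M3: only S.1111.111 is NaN; there is no Inf
        return math.nan
    if exponent == 0:
        # Subnormal: no implicit leading 1
        return sign * 2.0 ** (1 - bias) * (mantissa / (1 << man_bits))
    return sign * 2.0 ** (exponent - bias) * (1 + mantissa / (1 << man_bits))

def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)

def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)

print(decode_e4m3(0x7E), decode_e5m2(0x7B))   # largest finite values: 448.0 57344.0
print(decode_e4m3(0x01))                      # smallest E4M3 subnormal: 0.001953125
```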

Dynamic Range vs. Precision Tradeoff
#

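For any fixed bit-width \(b\), increasing the exponent bits widens the dynamic range but reduces precision (fewer mantissa bits), and vice versa. For IEEE-style formats, which reserve the all-ones exponent for Inf and NaN, the largest representable magnitude and the spacing at 1.0 are approximately:

$$\text{Dynamic Range} \approx 2^{2^{e}-1-\text{bias}}$$

$$\text{Precision (ULP at 1.0)} = 2^{-m}$$

where \(e\) is the number of exponent bits and \(m\) is the number of mantissa bits. E4M3 is the exception to the first formula: it uses the all-ones exponent for finite values, which is how it reaches 448 rather than the 240 an IEEE-style layout would give. The following table summarizes: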

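| Format | Total Bits | Exponent Bits | Mantissa Bits | Dynamic Range | Precision (decimal digits) |
|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | \(\pm 3.4 \times 10^{38}\) | ~7.2 |
| FP16 | 16 | 5 | 10 | \(\pm 6.55 \times 10^{4}\) | ~3.3 |
| BF16 | 16 | 8 | 7 | \(\pm 3.4 \times 10^{38}\) | ~2.4 |
| FP8 E4M3 | 8 | 4 | 3 | \(\pm 448\) | ~1.2 |
| FP8 E5M2 | 8 | 5 | 2 | \(\pm 57344\) | ~0.9 |
| INT8 | 8 | N/A | N/A | \([-128, 127]\) | N/A (uniform step of 1) |
| INT4 | 4 | N/A | N/A | \([-8, 7]\) | N/A (uniform step of 1) |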

What is Quantization?
#

Quantization, in the mathematical sense, is the process of mapping a continuous or high-precision set of values to a finite, discrete, lower-precision set. In the context of deep learning, we map floating-point tensors (weights and activations) to lower-precision representations.

The Quantization Function
#

The affine quantization function maps a real-valued input \(x\) to a quantized integer \(x_q\):

$$x_q = \text{clamp}\!\left(\left\lfloor \frac{x}{s} \right\rceil + z,\; q_{\min},\; q_{\max}\right)$$

where:

  • \(s\) is the scale factor (a positive real number),
  • \(z\) is the zero-point (an integer),
  • \(\lfloor \cdot \rceil\) denotes rounding to nearest integer,
  • \(q_{\min}, q_{\max}\) define the representable range (e.g., \(-128, 127\) for signed INT8).

The clamp function prevents overflow:

$$\text{clamp}(x, a, b) = \min(\max(x, a), b)$$

The Dequantization Function
#

To recover an approximate real value from the quantized representation:

$$\hat{x} = s \cdot (x_q - z)$$

Note that \(\hat{x} \neq x\) in general — quantization is a lossy transformation. The value \(\hat{x}\) is the dequantized value, which lies on the quantization grid.

Full Round-Trip Example
#

Let us quantize the value \(x = 1.572\) to signed INT8 (\(q_{\min} = -128\), \(q_{\max} = 127\)) with scale \(s = 0.02\) and zero-point \(z = 0\).

Step 1: Quantize

$$x_q = \text{clamp}\!\left(\left\lfloor \frac{1.572}{0.02} \right\rceil + 0, -128, 127\right) = \text{clamp}(\lfloor 78.6 \rceil, -128, 127) = \text{clamp}(79, -128, 127) = 79$$

Step 2: Dequantize

$$\hat{x} = 0.02 \times (79 - 0) = 1.58$$

Step 3: Quantization Error

$$\epsilon = x - \hat{x} = 1.572 - 1.58 = -0.008$$

The absolute error is bounded by half the step size: \(|\epsilon| \leq s/2 = 0.01\).
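The round trip above can be reproduced with a few lines of code. This is a minimal sketch; the function names are illustrative, and rounding uses `floor(v + 0.5)` (nearest, ties rounded up), which is one of several possible tie-breaking conventions.

```python
import math

def quantize(x: float, scale: float, zero_point: int, q_min: int, q_max: int) -> int:
    # Round to nearest integer, add the zero-point, then clamp to the representable range
    q = math.floor(x / scale + 0.5) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(x_q: int, scale: float, zero_point: int) -> float:
    return scale * (x_q - zero_point)

x, scale, zero_point = 1.572, 0.02, 0
x_q = quantize(x, scale, zero_point, -128, 127)
x_hat = dequantize(x_q, scale, zero_point)
print(x_q, round(x_hat, 4), round(x - x_hat, 4))

# Values outside the representable range saturate instead of wrapping around
print(quantize(10.0, scale, zero_point, -128, 127))
```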

Quantization Error Analysis
#

For a uniform quantizer with step size \(s\), the rounding error \(\epsilon = x - \hat{x}\) is uniformly distributed in \([-s/2, +s/2]\) (assuming the input is not at the clipping boundaries). The statistical properties are:

$$\mathbb{E}[\epsilon] = 0$$

$$\text{Var}[\epsilon] = \frac{s^2}{12}$$

$$\text{MSE} = \mathbb{E}[\epsilon^2] = \frac{s^2}{12}$$

This is the classic quantization noise model from signal processing theory. The variance scales quadratically with the step size, which means halving the step size (adding one bit of precision) reduces quantization noise power by a factor of 4 (6 dB).

For \(b\)-bit quantization over a range \([\alpha, \beta]\):

$$s = \frac{\beta - \alpha}{2^b - 1}$$

$$\text{MSE}_{\text{round}} = \frac{1}{12}\left(\frac{\beta - \alpha}{2^b - 1}\right)^2$$
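The \(s^2/12\) model is easy to check empirically. The sketch below assumes inputs drawn uniformly from \([-1, 1]\) and ignores clipping; the function name and sample count are arbitrary choices made for this illustration.

```python
import random

def empirical_rounding_mse(scale: float, num_samples: int = 100_000, seed: int = 0) -> float:
    """Average squared rounding error of a uniform quantizer on uniform inputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        x_hat = scale * round(x / scale)   # quantize then dequantize, no clipping
        total += (x - x_hat) ** 2
    return total / num_samples

for bits in (4, 6, 8):
    scale = 2.0 / (2**bits - 1)            # range [-1, 1], so beta - alpha = 2
    print(bits, empirical_rounding_mse(scale), scale**2 / 12)
```

Each pair of printed values should agree closely, and each extra bit should cut both by roughly a factor of 4.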

Uniform vs. Non-Uniform Quantization
#

Uniform Quantization
#

In uniform quantization, the quantization levels are equally spaced. The step size (also called the quantization step or resolution) is constant:

$$s = \frac{\beta - \alpha}{2^b - 1}$$

where \([\alpha, \beta]\) is the clipping range and \(b\) is the number of bits. The quantization levels are:

$$\hat{x}_i = \alpha + i \cdot s, \quad i = 0, 1, \ldots, 2^b - 1$$

Uniform Quantization (3-bit unsigned, 8 levels):

  Input range:  [0.0 ──────────────────────── 7.0]
                 |                               |
  Quant levels: 0.0  1.0  2.0  3.0  4.0  5.0  6.0  7.0
                 |    |    |    |    |    |    |    |
  Codes:         0    1    2    3    4    5    6    7

  Step size s = 1.0 (uniform everywhere)

Uniform quantization is by far the most common in practice because:

  1. The quantize/dequantize operations require only multiply-add, which maps efficiently to hardware.
  2. Quantized arithmetic (especially matrix multiplication) can be performed entirely in the integer domain.
  3. Mainstream AI accelerators expose integer matrix units designed around uniform quantization.

Non-Uniform Quantization
#

In non-uniform quantization, the quantization levels are not equally spaced. This allows allocating more levels to regions where the data is dense and fewer levels to sparse regions, minimizing overall distortion.

Logarithmic (Log-scale) Quantization

A common non-uniform scheme places levels on a logarithmic scale. For a positive value \(x\):

$$x_q = \text{round}\!\left(\frac{\log_2(x) - \log_2(\alpha)}{\log_2(\beta) - \log_2(\alpha)} \cdot (2^b - 1)\right)$$

This concentrates more levels near zero, which aligns well with the typical bell-shaped distribution of neural network weights (most values are small, with exponentially decaying tails).

Non-Uniform (Log) Quantization (3-bit, 8 levels):

  Input range:   [0.01 ──────────────────────────────────── 10.0]

  Quant levels:  0.010  0.027  0.072  0.19  0.52  1.4  3.7  10.0
  Codes:           0      1      2     3     4     5    6    7

  Adjacent levels differ by a constant ratio of 10^(3/7), about 2.68
  Step sizes: SMALL near zero ──────> LARGE near max
  (More resolution where most weights live)

K-Means Based Quantization

Given a tensor of values \(\{x_1, x_2, \ldots, x_n\}\), we can find the optimal non-uniform levels by running k-means clustering with \(k = 2^b\) clusters. The cluster centroids become the quantization levels, and each value is assigned to its nearest centroid.

The objective is to minimize the total squared error:

$$\min_{\{c_1, \ldots, c_k\}} \sum_{i=1}^{n} \min_{j} (x_i - c_j)^2$$

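This is the objective of the Lloyd-Max quantizer from signal processing and information theory. In one dimension, k-means is Lloyd's algorithm: it converges to a local minimum of the squared error, which in practice is usually close to the optimal non-uniform quantizer for the empirical distribution.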
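As an illustration, the sketch below runs a few Lloyd iterations on synthetic Gaussian "weights" and compares the resulting error with a uniform grid of the same size. The data, the iteration count, and the uniform-grid initialization are assumptions made for this example, not a prescribed recipe.

```python
import random

def nearest(levels, x):
    return min(levels, key=lambda c: (x - c) ** 2)

def mse(values, levels):
    return sum((x - nearest(levels, x)) ** 2 for x in values) / len(values)

def lloyd_levels(values, bits, iterations=20):
    """1-D k-means (Lloyd iterations) started from a uniform grid over [min, max]."""
    k = 2**bits
    lo, hi = min(values), max(values)
    levels = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    for _ in range(iterations):
        sums, counts = [0.0] * k, [0] * k
        for x in values:
            j = min(range(k), key=lambda i: (x - levels[i]) ** 2)   # assignment step
            sums[j] += x
            counts[j] += 1
        # Update step: move each level to the mean of its cluster;
        # empty clusters keep their previous level
        levels = [sums[j] / counts[j] if counts[j] else levels[j] for j in range(k)]
    return levels

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.1) for _ in range(2000)]
lo, hi = min(weights), max(weights)
uniform = [lo + i * (hi - lo) / 7 for i in range(8)]
learned = lloyd_levels(weights, bits=3)
print("uniform MSE:", mse(weights, uniform))
print("k-means MSE:", mse(weights, learned))
```

Because each Lloyd iteration cannot increase the squared error, the learned levels are never worse than the uniform grid they start from.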

Lookup Table (LUT) Implementation

Non-uniform quantization stores a lookup table mapping each code to its corresponding dequantized value:

Code -> Value Lookup Table (example, 4 levels):

  Code:   00  ->  -0.42
  Code:   01  ->  -0.03
  Code:   10  ->  +0.05
  Code:   11  ->  +0.51

  Storage: 4 entries x FP16 = 8 bytes overhead per group

  Dequantization: value = LUT[code]   (simple table lookup)

Powers-of-Two Quantization
#

A special case of non-uniform quantization restricts all values to powers of two:

$$\hat{x} = \text{sign}(x) \cdot 2^{\text{round}(\log_2 |x|)}$$

The major advantage is that multiplication by a power-of-two is a simple bit-shift operation, eliminating the need for hardware multipliers entirely. This is extremely attractive for ultra-low-power edge devices.

Powers-of-Two levels (4-bit signed example):

  ..., -4, -2, -1, -0.5, -0.25, 0, +0.25, +0.5, +1, +2, +4, ...

  Multiply by 2^k  =  left-shift by k bits  (FREE in hardware!)
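A minimal sketch of this scheme follows. It rounds in the log domain exactly as in the formula above, and the helper names are illustrative.

```python
import math

def quantize_power_of_two(x: float) -> float:
    """Round x to the nearest power of two in the log domain (zero stays zero)."""
    if x == 0.0:
        return 0.0
    exponent = round(math.log2(abs(x)))
    return math.copysign(2.0**exponent, x)

def shift_multiply(activation: int, exponent: int) -> int:
    """Multiply an integer activation by 2**exponent using only shifts."""
    return activation << exponent if exponent >= 0 else activation >> -exponent

print(quantize_power_of_two(0.3), quantize_power_of_two(-3.0))   # prints: 0.25 -4.0
print(shift_multiply(12, 2), shift_multiply(12, -2))              # prints: 48 3
```

Negative exponents become right shifts, which discard low-order bits, so a real design has to decide how to round in that case.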

Comparison: Uniform vs. Non-Uniform Quantization
#

| Property | Uniform | Non-Uniform |
|---|---|---|
| Level spacing | Equal | Variable |
| Optimal for | Uniform distributions | Peaked/skewed distributions |
| Hardware support | Native on all accelerators | Requires LUT or special logic |
| Arithmetic in quantized domain | Simple integer ops | Complex; usually dequantize first |
| Calibration cost | Low (just find scale + zero-point) | High (k-means, profiling) |
| Compression ratio | Fixed by bit-width | Same bit-width, better accuracy |
| Typical use case | Production inference | Research, weight-only compression |
| Dequantization speed | Multiply-add (fast) | Table lookup (cache-dependent) |

Symmetric vs. Asymmetric Quantization
#

The choice of zero-point \(z\) defines two major quantization modes.

Symmetric Quantization
#

In symmetric quantization, the zero-point is fixed at zero (\(z = 0\)), and the clipping range is symmetric around the origin: \([-\alpha, +\alpha]\).

The scale factor is:

$$s = \frac{\alpha}{2^{b-1} - 1}$$

where \(\alpha = \max(|x_{\min}|, |x_{\max}|)\) and \(b\) is the bit-width. The quantization and dequantization functions simplify to:

$$x_q = \text{clamp}\!\left(\left\lfloor \frac{x}{s} \right\rceil, -2^{b-1}+1, 2^{b-1}-1\right)$$

$$\hat{x} = s \cdot x_q$$

Note that for signed \(b\)-bit integers, the range \([-2^{b-1}+1, 2^{b-1}-1]\) is used instead of \([-2^{b-1}, 2^{b-1}-1]\) to maintain exact symmetry (the value \(-2^{b-1}\) has no positive counterpart).

Numerical Example (INT8 Symmetric)

Suppose the weight tensor has \(x_{\min} = -1.5\), \(x_{\max} = 0.9\).

  • \(\alpha = \max(1.5, 0.9) = 1.5\)
  • \(s = 1.5 / 127 = 0.011811\)
  • Quantize \(x = 0.45\): \(x_q = \lfloor 0.45 / 0.011811 \rceil = \lfloor 38.1 \rceil = 38\)
  • Dequantize: \(\hat{x} = 0.011811 \times 38 = 0.4488\)
  • Error: \(|0.45 - 0.4488| = 0.0012\)

Symmetric Quantization (INT8, alpha = 1.5):

  Real axis:
  -1.5                    0.0                    +1.5
    |--------|--------|--------|--------|---------|
  -127                     0                    +127
  Quantized axis (integer codes)

  Note: The range [-1.5, +1.5] maps to [-127, +127]
        Real zero maps EXACTLY to integer zero
        The range [0.9, 1.5] is "wasted" (few/no weights there)

The key property: real zero maps exactly to quantized zero. This is critical for operations like zero-padding in convolutions, where injected zeros must remain exactly zero after quantization.
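The symmetric INT8 example (range \([-1.5, 0.9]\), \(x = 0.45\)) can be reproduced with a short sketch. The function names are illustrative and not taken from any particular library.

```python
def symmetric_scale(x_min: float, x_max: float, bits: int) -> float:
    alpha = max(abs(x_min), abs(x_max))
    return alpha / (2 ** (bits - 1) - 1)

def quantize_symmetric(x: float, scale: float, bits: int) -> int:
    q_max = 2 ** (bits - 1) - 1
    # Clamp to the symmetric range [-q_max, +q_max]; the zero-point is implicitly 0
    return max(-q_max, min(q_max, round(x / scale)))

scale = symmetric_scale(-1.5, 0.9, bits=8)
x_q = quantize_symmetric(0.45, scale, bits=8)
x_hat = scale * x_q
print(round(scale, 6), x_q, round(x_hat, 4))   # prints: 0.011811 38 0.4488
print(quantize_symmetric(0.0, scale, bits=8))  # real zero maps to integer 0
```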

Asymmetric Quantization
#

In asymmetric quantization, the clipping range \([\beta_{\min}, \beta_{\max}]\) is not necessarily symmetric around zero. Both scale and zero-point are computed:

$$s = \frac{\beta_{\max} - \beta_{\min}}{2^b - 1}$$

$$z = \text{round}\!\left(q_{\min} - \frac{\beta_{\min}}{s}\right)$$

where \(q_{\min}\) is the minimum quantized value (e.g., 0 for unsigned INT8, or \(-128\) for signed INT8). The quantization function is:

$$x_q = \text{clamp}\!\left(\left\lfloor \frac{x}{s} \right\rceil + z,\; q_{\min},\; q_{\max}\right)$$

$$\hat{x} = s \cdot (x_q - z)$$

Numerical Example (UINT8 Asymmetric)

Suppose an activation tensor has \(\beta_{\min} = -0.2\), \(\beta_{\max} = 5.8\). Using unsigned INT8 (\(q_{\min} = 0\), \(q_{\max} = 255\)):

  • \(s = (5.8 - (-0.2)) / 255 = 6.0 / 255 = 0.023529\)
  • \(z = \text{round}(0 - (-0.2) / 0.023529) = \text{round}(8.5) = 9\), rounding the tie up (so integer 9 represents real zero)
  • Quantize \(x = 3.0\): \(x_q = \text{clamp}(\lfloor 3.0 / 0.023529 \rceil + 9, 0, 255) = \text{clamp}(\lfloor 127.5 \rceil + 9, 0, 255) = \text{clamp}(128 + 9, 0, 255) = 137\), again rounding the tie up
  • Dequantize: \(\hat{x} = 0.023529 \times (137 - 9) = 0.023529 \times 128 = 3.0118\)
  • Error: \(|3.0 - 3.0118| = 0.0118\)

Asymmetric Quantization (UINT8, range [-0.2, 5.8]):

  Real axis:
  -0.2             0.0                              5.8
    |--------|------+-----------|---------|----------|
    0        9                                      255
  Quantized axis (integer codes)

  Note: Real zero maps to integer 9 (the zero-point)
        The full [0, 255] range covers [-0.2, 5.8]
        No range is "wasted": the mapping is tight
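The asymmetric parameter computation can be sketched as follows. To keep the output independent of floating-point tie-breaking, this sketch deliberately uses a slightly different range, \([-0.5, 5.5]\), and input, \(x = 3.1\), than the worked example; the function names are illustrative.

```python
import math

def round_half_up(v: float) -> int:
    return math.floor(v + 0.5)

def asymmetric_params(beta_min: float, beta_max: float, q_min: int, q_max: int):
    scale = (beta_max - beta_min) / (q_max - q_min)
    zero_point = round_half_up(q_min - beta_min / scale)
    return scale, zero_point

def quantize(x: float, scale: float, zero_point: int, q_min: int, q_max: int) -> int:
    return max(q_min, min(q_max, round_half_up(x / scale) + zero_point))

scale, zero_point = asymmetric_params(-0.5, 5.5, 0, 255)
x_q = quantize(3.1, scale, zero_point, 0, 255)
x_hat = scale * (x_q - zero_point)
print(zero_point, x_q, round(x_hat, 4))          # prints: 21 153 3.1059
print(quantize(0.0, scale, zero_point, 0, 255))  # real zero maps to the zero-point: 21
```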

When to Use Which
#

| Criterion | Symmetric | Asymmetric |
|---|---|---|
| Zero-point overhead | None (z = 0) | Stored per tensor/channel/group |
| Range utilization | Poor if distribution is skewed | Optimal (no wasted range) |
| Computation overhead | Lower (no z in multiply) | Higher (z term in integer GEMM) |
| Best for weights | Yes (typically near-symmetric) | Overkill (weights are ~symmetric) |
| Best for activations | Poor (ReLU outputs are \([0, +\infty)\)) | Yes (covers one-sided ranges) |
| Zero-padding correctness | Guaranteed (z = 0) | Must handle z carefully |

Rule of thumb: Use symmetric quantization for weights (which tend to be roughly symmetric around zero) and asymmetric for activations (which are often one-sided after ReLU or have shifted distributions).


Granularity of Quantization
#

Quantization parameters (scale \(s\) and zero-point \(z\)) can be computed at different granularities. Finer granularity provides better accuracy at the cost of additional storage and computational overhead.

Per-Tensor Quantization
#

A single scale and zero-point for the entire tensor:

Weight tensor W (shape: [out_channels, in_channels]):

  +------------------------------------------+
  |                                          |
  |        All elements share one s, z       |
  |                                          |
  +------------------------------------------+

  Parameters: 1 scale + 1 zero-point = 2 values

This is the coarsest granularity. If different regions of the tensor have very different value distributions, a single scale factor will be suboptimal: it must accommodate the global extremes, so most values occupy only a small part of the quantized range.

Per-Channel Quantization
#

A separate scale and zero-point for each output channel (row of the weight matrix or filter of a convolution):

Weight tensor W (shape: [out_channels, in_channels]):

  Channel 0: [-------- s0, z0 --------]
  Channel 1: [-------- s1, z1 --------]
  Channel 2: [-------- s2, z2 --------]
  ...
  Channel N: [-------- sN, zN --------]

  Parameters: N scales + N zero-points = 2N values

This is the de facto standard for weight quantization. Different output channels often have very different magnitude distributions, and per-channel quantization handles this gracefully. The overhead is minimal: for a matrix with 4096 output channels, we store only 4096 extra scale values — a negligible fraction of the total parameter count.

Per-Group Quantization
#

Elements within each channel are further divided into groups of size \(g\), each with its own scale and zero-point:

Weight tensor W, one channel (in_channels = 12, group_size = 4):

  [--- g0: s0,z0 ---][--- g1: s1,z1 ---][--- g2: s2,z2 ---]
  [ w0  w1  w2  w3  ][ w4  w5  w6  w7  ][ w8  w9 w10 w11 ]

  Parameters per channel: (in_channels / g) * 2
  Total parameters:       out_channels * (in_channels / g) * 2

Common group sizes are 32, 64, or 128. Per-group quantization is critical for aggressive low-bit quantization (INT4, INT3) because it allows each small group to use its own scale, dramatically reducing quantization error within each group.

Numerical Example: For a weight matrix of shape [4096, 4096] with INT4 quantization and group size 128:

  • Weight storage: 4096 x 4096 x 4 bits = 8 MB
  • Scale storage: 4096 x (4096/128) x 16 bits = 4096 x 32 x 2 bytes = 256 KB
  • Overhead: 256 KB / 8 MB = 3.1% — a small price for significantly better accuracy.
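The effect of group-wise scales, and the overhead arithmetic above, can be sketched on a toy row. The data below is made up so that one half of the row has much larger magnitudes than the other; the helper names are illustrative.

```python
def quantize_dequantize(values, bits):
    """Symmetric max-abs quantization of one group, returned in dequantized form."""
    q_max = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / q_max
    if scale == 0.0:
        return list(values)
    return [scale * max(-q_max, min(q_max, round(v / scale))) for v in values]

def mse(original, reconstructed):
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

row = [0.10, -0.22, 0.31, -0.40, 5.0, -3.2, 4.0, -6.0]   # second half dominates the range
group_size = 4

per_row = quantize_dequantize(row, bits=4)               # one scale for the whole row
per_group = []
for start in range(0, len(row), group_size):             # one scale per group of 4
    per_group += quantize_dequantize(row[start:start + group_size], bits=4)

print("one scale per row:  ", mse(row, per_row))
print("one scale per group:", mse(row, per_group))

# Storage overhead for the [4096, 4096] INT4 example with FP16 scales and group size 128
weight_bytes = 4096 * 4096 * 4 // 8
scale_bytes = 4096 * (4096 // 128) * 2
print(scale_bytes / weight_bytes)                        # prints: 0.03125
```

With a single scale, the four small values all round to zero; with per-group scales they keep most of their precision.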

Per-Token Quantization
#

For activations in transformer models, a separate scale is computed for each token (each row of the activation matrix):

Activation tensor X (shape: [seq_len, hidden_dim]):

  Token 0: [--------- s0 ---------]
  Token 1: [--------- s1 ---------]
  Token 2: [--------- s2 ---------]
  ...
  Token T: [--------- sT ---------]

  Computed dynamically at runtime for each input

Per-token quantization is particularly useful because different tokens can have wildly different activation magnitudes. It is computed on-the-fly (no calibration needed) and adds negligible overhead since \(T \ll T \times d\).

Granularity Comparison
#

| Granularity | Overhead | Accuracy | Hardware Friendliness | Typical Use |
|---|---|---|---|---|
| Per-tensor | Minimal (2 values) | Lowest | Best | Activations (simple) |
| Per-channel | Low (2 x C) | Good | Good (standard) | Weights (standard) |
| Per-group | Moderate (2 x C x K/g) | Very good | Moderate | INT4 / INT3 weights |
| Per-token | Low (T values) | Good | Good | Transformer activations |

Quantization of Weights vs. Activations
#

Weights and activations present fundamentally different quantization challenges.

Why Weights Are Easier to Quantize
#

Weight distributions are static — they do not change after training. This means:

  1. We can analyze the full distribution offline during a calibration phase.
  2. Quantization parameters are computed once and stored alongside the model.
  3. Weight distributions tend to be approximately Gaussian centered near zero, which is well-suited for symmetric quantization.
  4. Outliers in weights are relatively rare and manageable.

Typical Weight Distribution:

  Frequency
    |          *****
    |        **     **
    |      **         **
    |    **             **
    |  **                 **
    |**                     **
    +--*------|---------|---*----->  Value
         -0.3    0.0    +0.3

  Nearly symmetric, bell-shaped, compact range
  => Easy to quantize with symmetric INT8

Why Activations Are Harder to Quantize
#

Activation distributions are dynamic — they change with every input. The challenges are:

  1. The distribution depends on the input data, so quantization parameters must either be precomputed from calibration data or computed at runtime.
  2. After ReLU, activations are one-sided (\([0, +\infty)\)), so a symmetric signed range wastes half of its codes; asymmetric (or unsigned) quantization is the natural fit.
  3. Outliers are more common and more extreme. A small number of channels may have activations 10x to 100x larger than typical, forcing the scale factor to accommodate these extremes and wasting quantization range for the majority of values.
  4. Different layers and different sequence positions can have very different distributions.

Typical Activation Distribution (post-ReLU):

  Frequency
    |*
    | *
    |  *
    |   **
    |     ***
    |        ****
    |            ******
    +-----|------|---------|------->  Value
         0.0   0.5       5.0     (outlier at 50.0!)

  One-sided, heavy-tailed, with potential outliers
  => Harder to quantize; outliers waste range

Mixed Strategies
#

A common practical approach is:

  • Weights: INT8 or INT4, per-channel, symmetric, determined offline.
  • Activations: INT8, per-tensor or per-token, asymmetric, calibrated or dynamic.

This combination balances accuracy and efficiency well.

Quantized Matrix Multiplication: Full Math
#

The core operation in neural networks is matrix multiplication \(Y = XW^T\), where \(X\) is the activation matrix and \(W\) is the weight matrix. Let us derive how this works when both are quantized.

Let:

  • \(X_q = \text{round}(X / s_x) + z_x\) (quantized activations)
  • \(W_q = \text{round}(W / s_w) + z_w\) (quantized weights)

The dequantized values are:

  • \(\hat{X} = s_x (X_q - z_x)\)
  • \(\hat{W} = s_w (W_q - z_w)\)

The approximate matrix multiplication is:

$$\begin{aligned}
\hat{Y} = \hat{X} \hat{W}^T &= s_x(X_q - z_x) \cdot [s_w(W_q - z_w)]^T \\
&= s_x s_w (X_q - z_x)(W_q - z_w)^T
\end{aligned}$$

Expanding the product for a single output element \(\hat{Y}_{ij}\):

$$\begin{aligned}
\hat{Y}_{ij} &= s_x s_w \sum_{k=1}^{K} (X_{q,ik} - z_x)(W_{q,jk} - z_w) \\
&= s_x s_w \left[\sum_{k} X_{q,ik} W_{q,jk} - z_w \sum_{k} X_{q,ik} - z_x \sum_{k} W_{q,jk} + K \cdot z_x z_w\right]
\end{aligned}$$

Let us define:

$$P_{ij} = \sum_{k} X_{q,ik} \cdot W_{q,jk} \quad \text{(integer dot product: the main compute)}$$

$$A_i = \sum_{k} X_{q,ik} \quad \text{(row sum of quantized activations)}$$

$$B_j = \sum_{k} W_{q,jk} \quad \text{(row sum of quantized weights, precomputable)}$$

Then:

$$\hat{Y}_{ij} = s_x s_w \left[P_{ij} - z_w A_i - z_x B_j + K \cdot z_x z_w\right]$$

Key observations:

  1. \(P_{ij}\) is a pure integer matrix multiplication — this is what the hardware accelerates.
  2. \(B_j\) is constant (weights are static) and can be precomputed.
  3. \(K \cdot z_x z_w\) is a scalar constant (can be precomputed if both zero-points are static).
  4. \(A_i\) must be computed at runtime but is just a row sum — cheap.
  5. If we use symmetric quantization for weights (\(z_w = 0\)), the formula simplifies to:

$$\hat{Y}_{ij} = s_x s_w \left[P_{ij} - z_x B_j\right]$$

And if activations are also symmetric (\(z_x = 0\)):

$$\hat{Y}_{ij} = s_x s_w \cdot P_{ij}$$

This is the simplest form: pure integer matmul followed by a single scale multiplication. This is why symmetric quantization is preferred when possible — it eliminates the zero-point correction terms.
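The decomposition into \(P_{ij}\), \(A_i\), and \(B_j\) can be verified numerically. The sketch below uses small random integer matrices and arbitrary scales and zero-points (all made up for this illustration) and compares the integer-domain formula against dequantizing first.

```python
import random

def quantized_matmul(X_q, W_q, s_x, z_x, s_w, z_w):
    """s_x * s_w * [P_ij - z_w * A_i - z_x * B_j + K * z_x * z_w] with integer accumulation."""
    K = len(X_q[0])
    A = [sum(row) for row in X_q]   # activation row sums (computed at runtime)
    B = [sum(row) for row in W_q]   # weight row sums (precomputable)
    Y = []
    for i, x_row in enumerate(X_q):
        out = []
        for j, w_row in enumerate(W_q):
            P = sum(x * w for x, w in zip(x_row, w_row))   # pure integer dot product
            out.append(s_x * s_w * (P - z_w * A[i] - z_x * B[j] + K * z_x * z_w))
        Y.append(out)
    return Y

def reference_matmul(X_q, W_q, s_x, z_x, s_w, z_w):
    """Dequantize both operands first, then multiply in floating point (Y = X W^T)."""
    X_hat = [[s_x * (x - z_x) for x in row] for row in X_q]
    W_hat = [[s_w * (w - z_w) for w in row] for row in W_q]
    return [[sum(x * w for x, w in zip(xr, wr)) for wr in W_hat] for xr in X_hat]

rng = random.Random(0)
X_q = [[rng.randint(0, 255) for _ in range(16)] for _ in range(3)]      # UINT8 activations
W_q = [[rng.randint(-127, 127) for _ in range(16)] for _ in range(4)]   # INT8 weights
args = (X_q, W_q, 0.02, 9, 0.01, 3)
fast, ref = quantized_matmul(*args), reference_matmul(*args)
max_diff = max(abs(a - b) for fr, rr in zip(fast, ref) for a, b in zip(fr, rr))
print(max_diff)   # differs only by floating-point rounding
```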


Clipping and Calibration
#

The quantization range \([\alpha, \beta]\) need not equal the actual \([\min(x), \max(x)]\) of the data. Clipping — choosing a tighter range that excludes some extreme values — can reduce overall quantization error by trading increased clipping error for decreased rounding error.

MinMax Calibration
#

The simplest approach: set \(\alpha = \min(x)\) and \(\beta = \max(x)\) over the calibration data.

$$s = \frac{\max(x) - \min(x)}{2^b - 1}$$

  • Pros: No clipping error; every value is representable.
  • Cons: Highly sensitive to outliers. A single extreme value can stretch the range, increasing rounding error for all other values.

Percentile Calibration
#

Use the \(p\)-th and \((100-p)\)-th percentiles instead of the true min/max:

$$\alpha = \text{percentile}(x, p), \quad \beta = \text{percentile}(x, 100 - p)$$

Common choices are \(p = 0.01\) (99.99th percentile) or \(p = 0.1\) (99.9th percentile). Values outside \([\alpha, \beta]\) are clipped.

  • Pros: Robust to outliers; easy to compute.
  • Cons: The choice of \(p\) is a hyperparameter that may require tuning per layer.
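A small sketch shows how much a single outlier inflates the MinMax scale compared with a percentile range. The synthetic data and the linear-interpolation percentile helper are assumptions made for this illustration.

```python
def percentile(sorted_values, p):
    """p-th percentile (0 <= p <= 100) with linear interpolation between neighbours."""
    position = (len(sorted_values) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def percentile_range(values, p):
    ordered = sorted(values)
    return percentile(ordered, p), percentile(ordered, 100.0 - p)

# Values spread evenly over [-5, 5], plus one extreme outlier
activations = [i / 1000 for i in range(-5000, 5001)] + [100.0]
alpha, beta = percentile_range(activations, p=0.1)

bits = 8
minmax_scale = (max(activations) - min(activations)) / (2**bits - 1)
percentile_scale = (beta - alpha) / (2**bits - 1)
print("minmax scale:    ", minmax_scale)
print("percentile scale:", percentile_scale)
```

The percentile range ignores the outlier, so its step size is roughly ten times smaller on this data; the outlier itself is clipped to \(\beta\).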

MSE-Based Optimal Clipping
#

We can find the clipping range that minimizes the mean squared error between the original and dequantized values. The total MSE has two components:

$$\text{MSE}_{\text{total}}(\alpha, \beta) = \text{MSE}_{\text{round}}(\alpha, \beta) + \text{MSE}_{\text{clip}}(\alpha, \beta)$$

Rounding error (for values within \([\alpha, \beta]\)):

$$\text{MSE}_{\text{round}} = \frac{s^2}{12} \cdot P(\alpha \leq x \leq \beta) = \frac{(\beta - \alpha)^2}{12(2^b - 1)^2} \cdot P(\alpha \leq x \leq \beta)$$

Clipping error (for values outside \([\alpha, \beta]\)):

$$\text{MSE}_{\text{clip}} = \int_{-\infty}^{\alpha} (x - \alpha)^2 f(x)\,dx + \int_{\beta}^{\infty} (x - \beta)^2 f(x)\,dx$$

where \(f(x)\) is the probability density function of the data.

The optimal \(\alpha^*, \beta^*\) minimize \(\text{MSE}_{\text{total}}\):

$$(\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \text{MSE}_{\text{total}}(\alpha, \beta)$$

For symmetric quantization (\(\alpha = -\beta\)), this reduces to a one-dimensional search over \(\beta\). If we assume a Gaussian distribution \(x \sim \mathcal{N}(0, \sigma^2)\), the optimal clipping threshold \(\beta^*\) can be shown to satisfy:

$$\beta^* \approx \sigma \cdot c(b)$$

where \(c(b)\) is a constant that depends on the bit-width. For INT8, \(c(8) \approx 3.89\) (compared to the naive \(3\sigma\) or \(6\sigma\) rules). This slightly aggressive clipping clips about 0.01% of values but significantly reduces rounding error for the remaining 99.99%.

In practice, the MSE-optimal clipping threshold is found by grid search:

Algorithm: MSE-Based Calibration

1. Collect activation histograms from calibration data
2. For each candidate threshold t in (0, max_val]:
   a. Compute scale s = t / (2^(b-1) - 1)  (symmetric)
   b. Quantize the bin centers with clipping:
      q = clip(round(bins / s), -(2^(b-1) - 1), 2^(b-1) - 1) * s
   c. Compute MSE = sum((bins - q)^2 * counts) / sum(counts)
3. Select t* = argmin_t MSE
4. Set scale = t* / (2^(b-1) - 1)
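The grid search can be sketched directly on raw samples instead of a histogram, which keeps the code short. The sketch assumes symmetric quantization with the scale defined as in the symmetric section, \(s = t / (2^{b-1} - 1)\); the synthetic data, candidate count, and function names are illustrative.

```python
import random

def quantization_mse(values, threshold, bits):
    """Mean squared error after symmetric quantization with clipping range [-t, +t]."""
    q_max = 2 ** (bits - 1) - 1
    scale = threshold / q_max
    total = 0.0
    for x in values:
        q = max(-q_max, min(q_max, round(x / scale)))   # round, then clip
        total += (x - scale * q) ** 2
    return total / len(values)

def mse_calibrate(values, bits, num_candidates=50):
    """Try evenly spaced thresholds up to max|x| and keep the one with the lowest MSE."""
    max_val = max(abs(x) for x in values)
    candidates = [max_val * i / num_candidates for i in range(1, num_candidates + 1)]
    return min(candidates, key=lambda t: quantization_mse(values, t, bits))

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(5000)] + [20.0]   # Gaussian bulk plus one outlier
t_star = mse_calibrate(samples, bits=4)
print("chosen threshold:", t_star)
print("MSE at t*:       ", quantization_mse(samples, t_star, 4))
print("MSE at max|x|:   ", quantization_mse(samples, 20.0, 4))
```

At 4 bits, clipping the single outlier costs far less than the rounding error saved on the other 5000 values, so the search settles well below \(\max|x|\).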

KL-Divergence Calibration (Entropy-Based)
#

This method, popularized by NVIDIA’s TensorRT, finds the clipping range that minimizes the information loss between the original and quantized distributions. The Kullback-Leibler divergence is:

$$D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

where \(P\) is the original (FP32) distribution and \(Q\) is the quantized distribution.

Algorithm: KL-Divergence Calibration (TensorRT style)

1. Collect a histogram of activation values with fine bins
   (e.g., 2048 bins over the full FP32 range)
2. For each candidate number of bins to keep, n = 128, 129, ..., 2048:
   a. Clip the histogram at n bins, adding the clipped outlier counts to the last kept bin
   b. Quantize the clipped histogram into 2^b levels:
      - Merge adjacent bins to create 2^b "super-bins"
      - The quantized distribution assigns uniform
        probability within each super-bin
   c. Compute D_KL(original || quantized)
3. Select n* = argmin_n D_KL
4. Set the clipping threshold from n*

The intuition is that KL divergence measures how much information is lost when approximating \(P\) with \(Q\). Minimizing it preserves the statistical structure of the activation distribution as faithfully as possible within the quantization constraints.
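The procedure can be sketched as follows. This is a simplified illustration in the spirit of the algorithm above, not TensorRT's actual implementation: it histograms \(|x|\), uses 256 fine bins and 16 quantization levels to keep the example small, and skips any candidate for which the quantized distribution has zero mass where the reference does not. All function names and the synthetic data are assumptions.

```python
import math
import random

def kl_divergence(p, q):
    """D_KL(P || Q) for two lists of unnormalized, bin-aligned counts."""
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                return float("inf")   # Q cannot represent mass that P has
            total += (pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum))
    return total

def kl_calibrate(samples, num_bins=256, num_levels=16):
    """Return (threshold, kl) for the best clipping point on |x|."""
    max_val = max(abs(x) for x in samples)
    width = max_val / num_bins
    hist = [0] * num_bins
    for x in samples:
        hist[min(int(abs(x) / width), num_bins - 1)] += 1

    best_n, best_kl = num_bins, float("inf")
    for n in range(num_levels, num_bins + 1):
        ref = hist[:n]
        ref[-1] += sum(hist[n:])          # fold clipped outliers into the last kept bin
        quant = [0.0] * n
        for level in range(num_levels):   # merge n bins into num_levels super-bins
            start = level * n // num_levels
            stop = (level + 1) * n // num_levels
            occupied = [j for j in range(start, stop) if ref[j] > 0]
            if occupied:
                avg = sum(hist[start:stop]) / len(occupied)
                for j in occupied:        # uniform mass over the occupied bins
                    quant[j] = avg
        kl = kl_divergence(ref, quant)
        if kl < best_kl:
            best_n, best_kl = n, kl
    return best_n * width, best_kl

random.seed(0)
# Hypothetical activations: a narrow Gaussian plus a small, much wider component.
acts = [random.gauss(0.0, 1.0) for _ in range(20000)]
acts += [random.gauss(0.0, 5.0) for _ in range(200)]
threshold, kl = kl_calibrate(acts)
print(f"max |x| = {max(abs(x) for x in acts):.2f}, threshold = {threshold:.2f}, KL = {kl:.5f}")
```

Production implementations typically smooth empty bins instead of discarding candidates, which matters when the tail of the histogram is sparse.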

Cross-Entropy Calibration
#

Cross-entropy calibration directly optimizes the task loss. Instead of minimizing a proxy (MSE or KL divergence on the distributions), it evaluates the model’s cross-entropy loss on calibration data for each candidate clipping threshold:

$$\alpha^* = \arg\min_{\alpha} \mathcal{L}_{\text{CE}}(f_{\alpha}(X_{\text{cal}}), Y_{\text{cal}})$$

where \(f_{\alpha}\) is the model with quantization using clipping threshold \(\alpha\), and \((X_{\text{cal}}, Y_{\text{cal}})\) is the calibration dataset.

  • Pros: Directly optimizes what we care about (task performance).
  • Cons: Expensive (requires forward passes for each candidate); risk of overfitting to calibration data.
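As a toy illustration of the search, the sketch below calibrates the input clipping threshold of a small linear softmax classifier. Everything here is hypothetical: the weights, the synthetic calibration set, and the helper names are made up for the example, and a real model would need a full forward pass per candidate.

```python
import math
import random

def fake_quant(x, alpha, bits):
    """Symmetric fake-quantization of a scalar with clipping threshold alpha."""
    scale = 2.0 * alpha / (2 ** bits - 1)
    clipped = max(-alpha, min(alpha, x))
    return -alpha + round((clipped + alpha) / scale) * scale

def cross_entropy(weights, xs, ys, alpha, bits):
    """Mean cross-entropy of a linear classifier whose inputs are quantized."""
    total = 0.0
    for x, y in zip(xs, ys):
        xq = [fake_quant(v, alpha, bits) for v in x]
        logits = [sum(w * v for w, v in zip(row, xq)) for row in weights]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
        total += log_z - logits[y]
    return total / len(xs)

def ce_calibrate(weights, xs, ys, bits=4, num_candidates=50):
    """Return (alpha*, loss*) over a grid of clipping thresholds."""
    max_val = max(abs(v) for x in xs for v in x)
    best_alpha, best_loss = max_val, float("inf")
    for i in range(1, num_candidates + 1):
        alpha = max_val * i / num_candidates
        loss = cross_entropy(weights, xs, ys, alpha, bits)
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss

random.seed(0)
weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
xs = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(300)]
xs[0][0] = 25.0  # one hypothetical outlier activation
# Labels come from the unquantized classifier, so the FP32 model is the reference.
ys = [max(range(3), key=lambda c: sum(w * v for w, v in zip(weights[c], x))) for x in xs]
alpha_star, loss_star = ce_calibrate(weights, xs, ys, bits=4)
print(f"alpha* = {alpha_star:.2f}, calibration CE = {loss_star:.4f}")
```

The cost structure is visible even in this toy: every candidate threshold requires evaluating the whole calibration set.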

Calibration Methods Comparison
#

| Method | Optimizes | Cost | Outlier Robustness | Accuracy |
|---|---|---|---|---|
| MinMax | None (uses raw range) | Very low | Poor | Baseline |
| Percentile | Outlier rejection | Low | Good | Good |
| MSE | Reconstruction error | Medium | Good | Very good |
| KL-Divergence | Distribution match | Medium | Good | Very good |
| Cross-Entropy | Task loss | High | Best | Best |

Quantization Error and Its Effects
#

Rounding Error Analysis
#

For a single element quantized with step size \(s\), the rounding error is:

$$\epsilon_{\text{round}} = x - s \cdot \left\lfloor \frac{x}{s} \right\rceil$$

Under the assumption that \(x\) is uniformly distributed within a quantization bin (valid for smooth distributions and fine quantization), \(\epsilon_{\text{round}}\) is uniformly distributed on \([-s/2, s/2]\):

$$\epsilon_{\text{round}} \sim \text{Uniform}(-s/2, +s/2)$$

$$\text{Var}[\epsilon_{\text{round}}] = \frac{s^2}{12}$$

For \(b\)-bit quantization over range \(R = \beta - \alpha\):

$$s = \frac{R}{2^b - 1}$$

$$\text{Var}[\epsilon_{\text{round}}] = \frac{R^2}{12(2^b - 1)^2}$$

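Adding one bit of precision halves \(s\) and reduces the error variance by a factor of 4, which is a gain of about 6.02 dB in signal-to-quantization-noise ratio (SQNR). With \(R\) the full range and \(2^b - 1 \approx 2^b\):

$$\text{SQNR (dB)} \approx 6.02b + 10.79 - 20\log_{10}(R / \sigma_x)$$

The constant is \(10\log_{10} 12 \approx 10.79\). The more familiar 4.77 dB applies when the last term uses the half-range \(R/2\) instead of \(R\).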
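A quick simulation makes the 6 dB-per-bit rule concrete. The sketch below assumes Gaussian inputs and a quantizer range of \(R = 12\sigma\), wide enough that clipping is negligible. It compares the measured SQNR with the prediction from \(\text{Var}[\epsilon_{\text{round}}] = s^2/12\). Taking \(R\) as the full range \(\beta - \alpha\), the constant in the prediction is \(10\log_{10} 12 \approx 10.79\) dB. The function names are illustrative.

```python
import math
import random

def measured_sqnr_db(samples, bits, value_range):
    """Empirical SQNR of a uniform quantizer spanning [-R/2, R/2]."""
    half = value_range / 2.0
    step = value_range / (2 ** bits - 1)
    signal = noise = 0.0
    for x in samples:
        clipped = max(-half, min(half, x))
        xq = -half + round((clipped + half) / step) * step
        signal += x * x
        noise += (x - xq) ** 2
    return 10.0 * math.log10(signal / noise)

def predicted_sqnr_db(bits, value_range, sigma):
    """sigma^2 / (s^2 / 12) in dB, using the approximation 2^b - 1 ~ 2^b."""
    return (20.0 * math.log10(2.0) * bits + 10.0 * math.log10(12.0)
            - 20.0 * math.log10(value_range / sigma))

random.seed(0)
sigma = 1.0
samples = [random.gauss(0.0, sigma) for _ in range(20000)]
for bits in (6, 8, 10):
    measured = measured_sqnr_db(samples, bits, 12.0 * sigma)
    predicted = predicted_sqnr_db(bits, 12.0 * sigma, sigma)
    print(f"b={bits}: measured {measured:.2f} dB, predicted {predicted:.2f} dB")
```

The wide range is what makes the prediction accurate here. Tightening \(R\) raises the SQNR from rounding but starts to add clipping error, which this formula does not model.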

Clipping Error Analysis
#

For a symmetric quantizer with threshold \(\alpha\), values outside \([-\alpha, +\alpha]\) are clipped. Assuming \(x \sim \mathcal{N}(0, \sigma^2)\), the clipping MSE is:

$$\text{MSE}_{\text{clip}} = 2\int_{\alpha}^{\infty} (x - \alpha)^2 \cdot \frac{1}{\sqrt{2\pi}\sigma} e^{-x^2/(2\sigma^2)} dx$$

This integral can be expressed in terms of the Gaussian Q-function. As \(\alpha\) increases, clipping error decreases exponentially. As \(\alpha\) decreases, clipping error increases polynomially.
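For reference, the integral has a short closed form. Write \(a = \alpha/\sigma\), let \(\varphi\) be the standard normal density, and let \(Q(a) = \int_a^\infty \varphi(z)\,dz\). Using \(\int_a^\infty z\,\varphi(z)\,dz = \varphi(a)\) and \(\int_a^\infty z^2 \varphi(z)\,dz = a\,\varphi(a) + Q(a)\), and expanding \((z - a)^2\), a sketch of the derivation gives:

$$\text{MSE}_{\text{clip}} = 2\sigma^2 \left[ (1 + a^2)\, Q(a) - a\, \varphi(a) \right]$$

For \(a \gg 1\), the expansion \(Q(a) \approx \varphi(a)\left(\tfrac{1}{a} - \tfrac{1}{a^3} + \tfrac{3}{a^5}\right)\) gives, to leading order:

$$\text{MSE}_{\text{clip}} \approx \frac{4\sigma^2\, \varphi(a)}{a^3}$$

which makes the exponential decay in \(\alpha\) explicit through the \(e^{-a^2/2}\) factor in \(\varphi(a)\).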

Total Quantization Error
#

$$\text{MSE}_{\text{total}} = \text{MSE}_{\text{round}} + \text{MSE}_{\text{clip}}$$

Error Tradeoff as a Function of Clipping Threshold alpha:

  MSE
   |
   | \                              ___---   Total Error
   |  \                         ---
   |   \       ___---*---___---      <-- Optimal alpha*
   |    \  ---       |
   |     \/          |
   |    / \          |
   |   /   \         |
   |  /     ---------.----------     Rounding Error
   | /                \
   |/                  \_________    Clipping Error
   +--------|---------|----------|-->  alpha
           small    optimal     large

   Small alpha: little rounding error, lots of clipping
   Large alpha: no clipping, but large step size => rounding
   Optimal alpha*: minimizes the sum

Error Propagation Through Layers
#

In a deep network with \(L\) layers, quantization error in layer \(l\) propagates forward through subsequent layers. Consider a simplified linear model \(y = W_L \cdot W_{L-1} \cdots W_1 \cdot x\).

If quantization replaces each weight matrix \(W_l\) with \(W_l + \Delta W_l\), where the additive perturbation \(\Delta W_l\) is the quantization error, the output perturbation to first order is:

$$\Delta y \approx \sum_{l=1}^{L} \left(\prod_{j=l+1}^{L} W_j\right) \Delta W_l \left(\prod_{j=1}^{l-1} W_j\right) x$$

The key insight is that errors in early layers are amplified by all subsequent layers’ weight matrices. This has several practical implications:

  1. Early layers are more sensitive: Quantization error in the first layers passes through more subsequent multiplications.
  2. Narrow layers (bottlenecks) are more sensitive: They have less redundancy to absorb quantization noise.
  3. Layers with large weight norms amplify error more: The magnification factor depends on the spectral norms of the weight matrices.
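The first-order formula can be checked on a small example. The sketch below builds a three-layer linear chain of random 4x4 matrices, quantizes each matrix with a per-tensor symmetric scheme (an assumption made for this example), and compares the exact output change with the sum of per-layer first-order terms. The helper names are illustrative.

```python
import math
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def apply_chain(mats, x):
    """Compute W_L ... W_1 x, where mats = [W_1, ..., W_L] and x is a column."""
    out = x
    for w in mats:
        out = matmul(w, out)
    return out

def fake_quant(w, bits):
    """Per-tensor symmetric fake-quantization of a matrix."""
    scale = max(abs(v) for row in w for v in row) / (2 ** (bits - 1) - 1)
    return [[round(v / scale) * scale for v in row] for row in w]

def norm(col):
    return math.sqrt(sum(row[0] ** 2 for row in col))

def propagation(mats, x, bits=8):
    """Return (exact delta y, first-order delta y, per-layer first-order terms)."""
    quant = [fake_quant(w, bits) for w in mats]
    deltas = [[[q - v for q, v in zip(qr, wr)] for qr, wr in zip(qm, wm)]
              for qm, wm in zip(quant, mats)]
    y, yq = apply_chain(mats, x), apply_chain(quant, x)
    exact = [[b[0] - a[0]] for a, b in zip(y, yq)]
    # Term l: replace W_l by Delta W_l and keep every other layer unquantized.
    terms = [apply_chain(mats[:l] + [deltas[l]] + mats[l + 1:], x)
             for l in range(len(mats))]
    first = [[sum(t[i][0] for t in terms)] for i in range(len(x))]
    return exact, first, terms

random.seed(0)
mats = [[[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)] for _ in range(3)]
x = [[random.gauss(0.0, 1.0)] for _ in range(4)]
exact, first, terms = propagation(mats, x, bits=8)
print("||exact||       =", norm(exact))
print("||first order|| =", norm(first))
print("per-layer term norms:", [round(norm(t), 5) for t in terms])
```

The per-layer norms show how much each layer's quantization error contributes at the output after passing through the remaining layers. For a real network the nonlinearities change the picture, so this should be read as an intuition aid only.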

Sensitivity Analysis Across Layer Types
#

Different layer types exhibit different sensitivity to quantization:

| Layer Type | Sensitivity | Reason |
|---|---|---|
| Embedding layers | Very High | Discrete lookup; errors directly corrupt tokens |
| First conv / linear | High | Error propagates through entire network |
| Attention (Q, K) | High | Softmax amplifies small differences in dot products |
| Attention (V, O) | Medium | Linear projection, more robust |
| Feed-forward (up/down) | Medium-Low | High redundancy, large hidden dim |
| Final classifier head | High | Directly impacts logits and predictions |
| Batch/Layer norm | Low | Renormalization absorbs scale errors |
| Depthwise convolution | High | Few parameters per channel, no redundancy |

A practical consequence is mixed-precision quantization: keeping sensitive layers at higher precision (e.g., INT8) while aggressively quantizing robust layers (e.g., INT4).
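One simple way to turn sensitivity scores into a bit allocation is a greedy heuristic, sketched below. This is only an illustration: the sensitivity values and parameter counts are made-up numbers, `assign_bits` is a hypothetical helper, and practical methods usually measure sensitivity empirically per layer.

```python
def assign_bits(sensitivity, params, avg_bits_budget, high=8, low=4):
    """Greedy mixed precision under an average-bits budget.

    Every layer starts at `low` bits. Layers are then upgraded to `high` bits
    in order of sensitivity per parameter while the budget allows.
    """
    bits = {name: low for name in sensitivity}
    total_params = sum(params.values())
    used = low * total_params
    budget = avg_bits_budget * total_params
    order = sorted(sensitivity, key=lambda n: sensitivity[n] / params[n], reverse=True)
    for name in order:
        extra = (high - low) * params[name]
        if used + extra <= budget:
            bits[name] = high
            used += extra
    return bits, used / total_params

# Hypothetical scores (e.g., loss increase when only that layer is INT4)
# and parameter counts in millions.
sensitivity = {"embed": 0.9, "attn_qk": 0.6, "ffn": 0.2, "head": 0.7}
params = {"embed": 50, "attn_qk": 20, "ffn": 120, "head": 10}
bits, avg_bits = assign_bits(sensitivity, params, avg_bits_budget=5.0)
print(bits, f"average bits = {avg_bits:.2f}")
```

Ranking by sensitivity per parameter favors small but fragile layers, such as the classifier head in this made-up example, because upgrading them costs little memory.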


Hardware Support for Quantization
#

The benefits of quantization are only realized if hardware can accelerate low-precision operations. Modern AI hardware provides extensive support.

NVIDIA Tensor Cores
#

NVIDIA’s Tensor Cores, available from Volta (2017) onward, perform matrix multiply-accumulate (MMA) operations at various precisions:

| GPU | Architecture | Supported Precisions | Peak INT8 TOPS |
|---|---|---|---|
| V100 | Volta | FP16 | N/A |
| T4 | Turing | FP16, INT8, INT4, INT1 | 130 |
| A100 | Ampere | TF32, FP16, BF16, INT8, INT4 | 624 |
| H100 | Hopper | FP8, FP16, BF16, INT8 | 1979 |
| B200 | Blackwell | FP8, FP6, FP4, INT8 | 4500+ |

The MMA operation computes \(D = A \times B + C\), where \(A\) and \(B\) are low-precision (e.g., INT8) and \(C, D\) are accumulated in higher precision (INT32 or FP32). This mixed-precision accumulation is critical: it prevents overflow during the summation of many low-precision products.

Tensor Core MMA Operation:

  A (INT8)     B (INT8)        C (INT32)       D (INT32)
  [m x k]   x  [k x n]    +   [m x n]     =   [m x n]

     Low-precision         High-precision     High-precision
     multiply              accumulate         result

  Typical tile: m=16, n=8, k=32 for INT8 on Ampere (PTX mma.sync shape m16n8k32)
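The need for a wide accumulator is easy to demonstrate in software. The sketch below emulates \(D = A \times B + C\) with INT8 inputs and a configurable accumulator width using plain Python integers. It models only the arithmetic, not how Tensor Cores are implemented, and the function names are illustrative.

```python
def wrap_signed(value, bits):
    """Wrap an integer into the two's-complement range of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value

def mma_int8(a, b, c, acc_bits=32):
    """D = A x B + C with INT8 inputs and an accumulator of acc_bits bits."""
    m, k, n = len(a), len(b), len(b[0])
    d = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = c[i][j]
            for p in range(k):
                # Each INT8 x INT8 product fits in 16 bits; the running sum does not.
                acc = wrap_signed(acc + a[i][p] * b[p][j], acc_bits)
            d[i][j] = acc
    return d

# Worst case for k = 32: every operand is +127.
a = [[127] * 32]            # 1 x 32
b = [[127] for _ in range(32)]  # 32 x 1
c = [[0]]
print("INT32 accumulator:", mma_int8(a, b, c, acc_bits=32))  # [[516128]]
print("INT16 accumulator:", mma_int8(a, b, c, acc_bits=16))  # [[-8160]]
```

The true sum is 32 x 127 x 127 = 516128, far beyond the INT16 range, so the narrow accumulator wraps around to a meaningless value while the INT32 accumulator is exact.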

Google TPU
#

Google’s Tensor Processing Units are designed from the ground up for matrix operations:

| TPU Version | Precisions | INT8 TOPS | Notes |
|---|---|---|---|
| TPU v2 | BF16, INT8 | 45 | Systolic array design |
| TPU v3 | BF16, INT8 | 90 | Liquid cooling |
| TPU v4 | BF16, INT8, FP8 | 275 | Optical interconnect |
| TPU v5e | BF16, INT8, FP8 | 400 | Optimized for inference |
| TPU v6 | BF16, INT8, FP8, INT4 | 900+ | Latest generation |

TPUs use a systolic array architecture that is naturally suited for quantized inference: data flows through a 2D grid of multiply-accumulate units, with low-precision inputs and high-precision accumulators.

ARM NEON and Apple Neural Engine
#

For mobile and edge deployment:

ARM NEON (available in all modern ARM Cortex-A processors):

  • SIMD operations: 16 x INT8 operations in a single 128-bit register
  • Dot-product instructions (SDOT/UDOT, optional from Armv8.2-A): each 32-bit lane accumulates four INT8 products into an INT32, for 16 multiply-accumulates per 128-bit instruction
  • Available on virtually every smartphone

Apple Neural Engine (ANE):

  • Dedicated matrix engine supporting INT8 and INT16
  • Up to 38 TOPS on M4 chip
  • Tightly integrated with the Apple ecosystem (Core ML)

Intel VNNI and AMX
#

VNNI (Vector Neural Network Instructions), available from Cascade Lake (server) and Ice Lake (client) onward:

  • Fuses multiply + pairwise add + accumulate for INT8/UINT8
  • 4x throughput improvement over standard SSE/AVX INT8

AMX (Advanced Matrix Extensions), available from Sapphire Rapids:

  • Dedicated tile-based matrix engine
  • Supports BF16 and INT8 tile operations
  • Similar concept to NVIDIA Tensor Cores but for x86

Dedicated Edge Accelerators
#

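| Accelerator | Precision Support | Peak TOPS | Power (W) | TOPS/W |
|---|---|---|---|---|
| Google Edge TPU | INT8 | 4 | 2 | 2.0 |
| Intel Movidius | FP16, INT8 | 4 | 1.5 | 2.7 |
| NVIDIA Jetson Orin | FP8, INT8, INT4 | 275 | 60 | 4.6 |
| Qualcomm Hexagon | INT8, INT4 | 73 | 15 | 4.9 |
| Hailo-8 | INT8, INT4 | 26 | 2.5 | 10.4 |
| Syntiant NDP120 | INT8 | 7.7 | 0.001 | 7700 |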

The trend is clear: every major hardware vendor now treats INT8 as a first-class citizen, and support for INT4 and FP8 is rapidly expanding. The TOPS/W column illustrates why quantization is not merely an optimization — it fundamentally determines what computations are feasible under power and thermal constraints.

Throughput Comparison Across Precisions
#

The following table shows relative throughput on NVIDIA A100 (as a representative modern GPU):

| Precision | Peak Throughput | Relative to FP32 | Memory per Parameter |
|---|---|---|---|
| FP32 | 19.5 TFLOPS | 1.0x | 4 bytes |
| TF32 | 156 TFLOPS | 8.0x | 4 bytes (internal) |
| FP16/BF16 | 312 TFLOPS | 16.0x | 2 bytes |
| INT8 | 624 TOPS | 32.0x | 1 byte |
| INT4 | 1248 TOPS | 64.0x | 0.5 bytes |

The combined effect of higher compute throughput and lower memory traffic makes quantization a double win: INT8 weights take 4x less memory than FP32, and on the A100 INT8 also delivers 2x the peak throughput of FP16 (INT4 delivers 4x).
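The arithmetic behind these ratios is simple enough to script. The sketch below hard-codes the A100 peak figures and bytes per parameter from the table above and computes weight memory and relative peak throughput. The 70-billion-parameter model is an illustrative size, and peak throughput is an upper bound rather than a measured speedup.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}
A100_PEAK_TERA_OPS = {"FP32": 19.5, "FP16/BF16": 312.0, "INT8": 624.0, "INT4": 1248.0}

def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def peak_speedup(precision, baseline="FP32"):
    """Ratio of theoretical peak throughput between two precisions."""
    return A100_PEAK_TERA_OPS[precision] / A100_PEAK_TERA_OPS[baseline]

num_params = 70e9  # illustrative 70B-parameter model
for precision in BYTES_PER_PARAM:
    print(f"{precision:>9}: {weight_memory_gb(num_params, precision):6.1f} GB, "
          f"{peak_speedup(precision):5.1f}x FP32 peak, "
          f"{peak_speedup(precision, 'FP16/BF16'):4.2f}x FP16 peak")
```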


Summary
#

Key Takeaways
#

| Concept | Key Point |
|---|---|
| Why quantize | 2-8x memory reduction, 2-4x compute speedup, essential for edge |
| Number formats | FP32 > BF16 > FP16 > FP8 > INT8 > INT4 (precision vs. efficiency) |
| Quantization function | \(x_q = \text{round}(x/s) + z\); fully defined by scale and zero-point |
| Uniform vs. non-uniform | Uniform is standard (hardware-friendly); non-uniform for research |
| Symmetric vs. asymmetric | Symmetric for weights (simpler math); asymmetric for activations |
| Granularity | Per-channel for weights; per-tensor or per-token for activations |
| Weights vs. activations | Weights are static and easy; activations are dynamic and harder |
| Calibration | MSE and KL-divergence are the best general-purpose methods |
| Error analysis | Total error = rounding + clipping; minimize via optimal clipping |
| Hardware | INT8 is universally supported; FP8 and INT4 are the frontier |
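As a compact recap of the quantization function in the table, the sketch below implements \(x_q = \text{round}(x/s) + z\) for an asymmetric unsigned range, together with its inverse. The parameter choice \(s = (\beta - \alpha)/(2^b - 1)\), \(z = \text{round}(-\alpha/s)\) is one common convention and is assumed here. The function names are illustrative.

```python
def choose_qparams(alpha, beta, bits=8):
    """Scale and zero-point mapping [alpha, beta] onto [0, 2^b - 1]."""
    scale = (beta - alpha) / (2 ** bits - 1)
    zero_point = round(-alpha / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, bits=8):
    """x_q = round(x / s) + z, clamped to the unsigned b-bit range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))

def dequantize(q, scale, zero_point):
    """x_hat = s * (x_q - z)."""
    return scale * (q - zero_point)

scale, zero_point = choose_qparams(-1.0, 3.0, bits=8)
for x in (-1.0, 0.0, 0.5, 3.0, 10.0):
    q = quantize(x, scale, zero_point)
    print(f"x={x:5.2f} -> q={q:3d} -> x_hat={dequantize(q, scale, zero_point):.4f}")
```

Inside \([\alpha, \beta]\) the round trip is off by at most \(s/2\). Values outside the range, such as 10.0 here, are clamped to the nearest end of the grid, which is the clipping error discussed earlier.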

What Comes Next
#

This post covered the fundamentals: the mathematical framework, number representations, and design decisions that underpin all quantization methods. In the next post, we will apply these fundamentals to Post-Training Quantization (PTQ) — the family of techniques that quantize a pretrained model without any retraining:

  • Naive PTQ and its limitations
  • Advanced PTQ methods: AdaRound, BRECQ, GPTQ, AWQ, SqueezeLLM
  • Weight-only quantization for large language models
  • Practical PTQ pipelines with real code examples

The theory in this post provides the vocabulary and mathematical tools you will need to understand why those methods work and when they fail.