Overview
#

What Is Post-Training Quantization?
#

Post-Training Quantization (PTQ) is the process of converting a pre-trained floating-point neural network into a lower-precision representation without any retraining or fine-tuning. The core idea is straightforward: take a model that was trained in FP32 (or BF16), and map its weights and activations to INT8, INT4, or other reduced-precision formats so that inference becomes faster, smaller, and more energy-efficient.

The uniform affine quantization function that underpins nearly all PTQ methods is:

$$q = \text{clamp}\!\left(\left\lfloor \frac{x}{s} \right\rceil + z,\; 0,\; 2^b - 1\right)$$

where \(x\) is the real-valued input, \(s\) is the scale factor, \(z\) is the zero-point, \(b\) is the bit-width, and \(\lfloor \cdot \rceil\) denotes rounding to the nearest integer.

The corresponding dequantization (reconstruction) is:

$$\hat{x} = s \cdot (q - z)$$

The quantization error for a single value is therefore:

$$\epsilon = x - \hat{x} = x - s\!\left(\left\lfloor \frac{x}{s} \right\rceil + z - z\right) = x - s\left\lfloor \frac{x}{s} \right\rceil$$

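This error is bounded by \(|\epsilon| \le \frac{s}{2}\) for any \(x\) that falls inside the representable range (values outside it are clamped and incur an additional clipping error), so the scale \(s\) directly controls the worst-case rounding noise.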
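
The quantize/dequantize pair above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production kernel: the function names, the sample values, and the MinMax-style choice of `scale` and `zero_point` are assumptions made for the example.

```python
import numpy as np

def quantize(x, scale, zero_point, bits=8):
    """Uniform affine quantization: q = clamp(round(x / s) + z, 0, 2^b - 1)."""
    # np.round rounds halves to even; the |error| <= s/2 bound still holds.
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2**bits - 1).astype(np.int64)

def dequantize(q, scale, zero_point):
    """Reconstruction: x_hat = s * (q - z)."""
    return scale * (q.astype(np.float64) - zero_point)

# Illustrative values (assumptions, not taken from the text).
x = np.array([-1.0, -0.26, 0.0, 0.49, 1.5])
x_min, x_max, bits = -1.0, 1.5, 8
scale = (x_max - x_min) / (2**bits - 1)
zero_point = int(round(-x_min / scale))

q = quantize(x, scale, zero_point, bits)
x_hat = dequantize(q, scale, zero_point)
max_err = np.max(np.abs(x - x_hat))   # stays within scale / 2 because nothing is clipped
```

Because every sample lies inside `[x_min, x_max]`, no value is clamped and the observed error respects the half-step bound.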

Why PTQ?
#

PTQ offers several compelling advantages over alternatives:

| Aspect | PTQ | QAT (Quantization-Aware Training) |
|---|---|---|
| Training required | No (or minimal calibration) | Yes, full or partial retraining |
| Data required | 0 to ~1000 unlabeled samples | Full labeled training set |
| Time to quantize | Minutes to hours | Hours to days |
| Accuracy (INT8) | Typically < 1% drop | Typically < 0.5% drop |
| Accuracy (INT4) | Can degrade significantly | Usually recoverable |
| Engineering effort | Low | High |
| Applicable to closed models | Yes | No (need training pipeline) |

PTQ is the go-to first approach in practice because:

  1. Speed: You can quantize and deploy a model in minutes.
  2. No training infrastructure: No GPUs for backpropagation, no hyperparameter tuning.
  3. Minimal data: Many PTQ methods need zero data (weight-only) or just a small calibration set (typically 128-1024 samples).
  4. Black-box friendly: Works even when you only have the exported model weights.

PTQ vs QAT: When to Use Which
#

The decision tree is simple:

Does basic PTQ (RTN + calibration) meet the accuracy target at your bit-width?
  |-- YES --> Ship PTQ. Done.
  |-- NO  --> Does an advanced PTQ method (GPTQ, AWQ) meet it?
                |-- YES --> Ship PTQ with the advanced method.
                |-- NO  --> Do you have the training pipeline?
                              |-- YES --> Use QAT.
                              |-- NO  --> Use reconstruction-based PTQ (BRECQ, OmniQuant)
                                          or mixed-precision PTQ.

In the era of large language models (LLMs), PTQ has become even more critical because retraining a 70B-parameter model is prohibitively expensive, yet deployment demands INT4 or lower precision.


The PTQ Pipeline
#

A complete PTQ workflow proceeds through the following stages:

+------------------+     +-------------------+     +-------------------+
|  Pre-trained     |     |  Graph            |     |  Calibration      |
|  FP32 Model      |---->|  Optimization     |---->|  Data Collection  |
|                  |     |  (BN folding,     |     |  (128-1024        |
|                  |     |   constant fold)  |     |   samples)        |
+------------------+     +-------------------+     +-------------------+
                                                          |
                                                          v
+------------------+     +-------------------+     +-------------------+
|  Quantized       |     |  Quantization     |     |  Range            |
|  Model           |<----|  Parameter        |<----|  Estimation       |
|  (INT8/INT4)     |     |  Assignment       |     |  (MinMax, MSE,    |
|                  |     |  (scale, zp,      |     |   KL-Div, etc.)   |
+------------------+     |   bit-width)      |     +-------------------+
        |                +-------------------+
        v
+------------------+
|  Accuracy        |
|  Validation      |
|  & Deployment    |
+------------------+

Step-by-step breakdown:

  1. Load pre-trained model: Import the FP32 (or BF16/FP16) model with all trained weights.
  2. Graph optimization: Fold batch normalization layers into preceding convolutions, fuse operations (Conv+ReLU), and perform constant folding to simplify the graph.
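  3. Insert observer nodes: Place observers at strategic points: on weight tensors and on the outputs of activation-producing operations. Observers only record statistics; they are distinct from the “fake quantization” nodes used in QAT, which simulate rounding in the forward pass.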
  4. Run calibration: Feed a small representative dataset through the model. Observers collect statistics (min, max, histograms) for each tensor.
  5. Compute quantization parameters: Using the collected statistics, determine the optimal scale \(s\) and zero-point \(z\) for each quantized tensor.
  6. Quantize: Replace floating-point operations with their quantized counterparts, embedding the computed parameters.
  7. Validate: Measure accuracy on a held-out set to verify acceptable degradation.
  8. Deploy: Export to the target runtime (TensorRT, ONNX Runtime, etc.).

Weight Quantization in PTQ
#

Round-to-Nearest (RTN)
#

The simplest weight quantization strategy is Round-to-Nearest (RTN): compute the scale from the weight tensor’s range, then round each weight to the nearest integer grid point.

For symmetric quantization of a weight tensor \(\mathbf{W}\):

$$s = \frac{\max(|\mathbf{W}|)}{2^{b-1} - 1}$$

$$q_i = \text{clamp}\!\left(\left\lfloor \frac{w_i}{s} \right\rceil,\; -2^{b-1},\; 2^{b-1} - 1\right)$$

Numerical example (INT8, symmetric):

Original weights: [0.12, -0.45, 0.78, -1.02, 0.33]
max(|W|) = 1.02
s = 1.02 / 127 = 0.008031

Quantized:
  0.12  / 0.008031 =  14.94  -> round -> 15   -> reconstruct: 15 * 0.008031 = 0.1205
 -0.45  / 0.008031 = -56.04  -> round -> -56  -> reconstruct: -56 * 0.008031 = -0.4497
  0.78  / 0.008031 =  97.12  -> round -> 97   -> reconstruct: 97 * 0.008031 = 0.7790
 -1.02  / 0.008031 = -127.0  -> round -> -127 -> reconstruct: -127 * 0.008031 = -1.0199
  0.33  / 0.008031 =  41.09  -> round -> 41   -> reconstruct: 41 * 0.008031 = 0.3293

Max absolute error: |0.78 - 0.779| = 0.001

At INT8, RTN works surprisingly well for most models because the quantization step size \(s\) is small enough that rounding errors average out.
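
The numerical example above can be reproduced with a short NumPy sketch. The helper name `rtn_symmetric` is hypothetical; the weights are the ones from the example.

```python
import numpy as np

def rtn_symmetric(w, bits=8):
    """Symmetric round-to-nearest: returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for INT8
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int64)
    return q, scale

w = np.array([0.12, -0.45, 0.78, -1.02, 0.33])
q, scale = rtn_symmetric(w, bits=8)
w_hat = q * scale                                 # dequantized weights
max_abs_err = np.max(np.abs(w - w_hat))           # about 1e-3, attained at w = 0.78
```

The integer codes come out as 15, -56, 97, -127 and 41, matching the table of hand-computed values.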

Why RTN Fails at Low Bit-Widths
#

At INT4 (15 usable levels for the symmetric grid \([-7, 7]\), or 16 levels for unsigned), the step size becomes dramatically larger:

$$s_{\text{INT4}} = \frac{1.02}{7} = 0.1457$$

Now the maximum rounding error is \(\frac{s}{2} = 0.073\), which is about 18x larger than in the INT8 case (\(127/7 \approx 18\)). A weight of 0.12 rounds to level 1 and reconstructs as 0.1457, a relative error of roughly 21%.

The problem compounds across matrix multiplications. For a layer computing \(\mathbf{y} = \mathbf{W}\mathbf{x}\), the output error is:

$$\Delta \mathbf{y} = (\mathbf{W} - \hat{\mathbf{W}})\mathbf{x} = \boldsymbol{\epsilon}_W \mathbf{x}$$

The expected squared error scales as:

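$$\mathbb{E}[\|\Delta \mathbf{y}\|^2] = \sum_{k=1}^{m} \sum_{j} x_j^2 \,\text{Var}(\epsilon_{W,kj}) \approx m \cdot \|\mathbf{x}\|^2 \cdot \frac{s^2}{12}$$

where \(m\) is the number of output features (rows of \(\mathbf{W}\)) and the rounding errors are modeled as independent and uniform on \([-s/2, s/2]\). Since \(s^2\) scales as \(2^{-2b}\), going from 8 to 4 bits increases the expected error by a factor of roughly \(2^8 = 256\) (more precisely \((127/7)^2 \approx 329\) for the symmetric grids above).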
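
A quick empirical check of this scaling, assuming synthetic Gaussian weights (the data and the helper `rtn_mse` are illustrative assumptions). The measured ratio should land in the same order of magnitude as the \(2^8\) estimate; for symmetric grids the ratio of squared step sizes is \((127/7)^2 \approx 329\).

```python
import numpy as np

def rtn_mse(w, bits):
    """Mean squared error of symmetric round-to-nearest at the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    w_hat = np.clip(np.round(w / scale), -qmax, qmax) * scale
    return np.mean((w - w_hat) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(size=100_000)          # synthetic weights, an assumption

mse_int8 = rtn_mse(w, bits=8)
mse_int4 = rtn_mse(w, bits=4)
ratio = mse_int4 / mse_int8           # expected to be a few hundred
```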

Per-Channel vs Per-Tensor Quantization
#

Per-tensor quantization uses a single scale and zero-point for an entire weight tensor:

$$s = \frac{\max(\mathbf{W}) - \min(\mathbf{W})}{2^b - 1}$$

Per-channel quantization computes separate parameters for each output channel \(c\):

$$s_c = \frac{\max(\mathbf{W}[c,:]) - \min(\mathbf{W}[c,:])}{2^b - 1}$$

Per-Tensor Quantization:          Per-Channel Quantization:
+-------------------------+       +-------------------------+
| s=0.008, z=128          |       | s0=0.003, z0=128        |
| All channels share      |       | s1=0.012, z1=128        |
| one (s, z) pair         |       | s2=0.005, z2=128        |
+-------------------------+       | Each channel has own    |
                                  | (s_c, z_c) pair         |
                                  +-------------------------+

| Property | Per-Tensor | Per-Channel |
|---|---|---|
| Parameters | 1 scale + 1 zero-point | C scales + C zero-points |
| Accuracy | Lower (penalized by outlier channels) | Higher (adapts to each channel) |
| Hardware support | Universal | Most modern accelerators |
| Overhead | Minimal | Negligible (C is small vs tensor size) |

Per-channel quantization is almost always more accurate for weights and is the default in most modern PTQ frameworks. The reason is that different output channels of a convolution or linear layer can have very different weight magnitudes, and a single scale must accommodate the largest channel, wasting precision for smaller ones.
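
To make the difference concrete, the sketch below compares per-tensor and per-channel symmetric INT8 quantization on a synthetic weight matrix whose rows have very different magnitudes. The matrix, the channel standard deviations, and the helper name are assumptions chosen for illustration.

```python
import numpy as np

def quant_dequant_symmetric(w, scale, bits=8):
    """Quantize then dequantize with a symmetric grid; scale may broadcast per row."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
# Hypothetical weights: 4 output channels with very different magnitudes.
channel_std = np.array([0.01, 0.1, 1.0, 5.0])
W = rng.normal(size=(4, 256)) * channel_std[:, None]

qmax = 127
s_tensor = np.max(np.abs(W)) / qmax                          # one scale for everything
s_channel = np.max(np.abs(W), axis=1, keepdims=True) / qmax  # one scale per output row

mse_tensor = np.mean((W - quant_dequant_symmetric(W, s_tensor)) ** 2, axis=1)
mse_channel = np.mean((W - quant_dequant_symmetric(W, s_channel)) ** 2, axis=1)
```

With a shared scale, the smallest channel collapses almost entirely to zero, while its per-channel error is orders of magnitude lower.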

Weight Equalization
#

Weight equalization (proposed in Data-Free Quantization Through Weight Equalization and Bias Correction, Nagel et al., 2019) exploits the scale-equivariance of consecutive layers to balance weight ranges across channels.

Consider two consecutive layers without nonlinearity between them (or with ReLU, which is positive-scale-equivariant):

$$\mathbf{y} = f(\mathbf{W}_2 \cdot f(\mathbf{W}_1 \cdot \mathbf{x}))$$

We can insert a diagonal scaling matrix \(\mathbf{S}\) between the layers:

$$\mathbf{y} = f(\mathbf{W}_2 \mathbf{S}^{-1} \cdot f(\mathbf{S} \mathbf{W}_1 \cdot \mathbf{x}))$$

This does not change the output in floating-point, but it rescales the weight ranges. The optimal equalization factor for channel \(i\) is:

$$s_i = \frac{1}{\sqrt{r_i^{(1)} / r_i^{(2)}}}$$

where \(r_i^{(1)}\) is the range of the \(i\)-th output channel of \(\mathbf{W}_1\) and \(r_i^{(2)}\) is the range of the \(i\)-th input channel of \(\mathbf{W}_2\). This geometric-mean balancing minimizes the maximum quantization error across both layers.

Before equalization:

Layer 1 output channel ranges: [0.1, 5.0, 0.3, 4.8]  (highly unbalanced)
Layer 2 input channel ranges:  [4.5, 0.2, 4.0, 0.3]  (inversely unbalanced)

After equalization:

s = [1/sqrt(0.1/4.5), 1/sqrt(5.0/0.2), 1/sqrt(0.3/4.0), 1/sqrt(4.8/0.3)]
  = [6.71, 0.20, 3.65, 0.25]

New Layer 1 ranges: [0.67, 1.00, 1.10, 1.20]  (balanced!)
New Layer 2 ranges: [0.67, 1.00, 1.10, 1.20]  (balanced!)

This data-free technique can significantly improve quantization quality, especially for models with batch normalization (which tends to create unbalanced weight distributions).
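
A minimal sketch of cross-layer equalization for two linear layers with a ReLU in between. It assumes the "range" of a channel is its maximum absolute value and uses random matrices scaled to mimic the unbalanced ranges above; both choices are assumptions for illustration.

```python
import numpy as np

def channel_range(w, axis):
    # Range per channel, taken here as the max absolute value (an assumption).
    return np.max(np.abs(w), axis=axis)

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8)) * np.array([0.1, 5.0, 0.3, 4.8])[:, None]  # rows = output channels
b1 = rng.normal(size=4)
W2 = rng.normal(size=(3, 4)) * np.array([4.5, 0.2, 4.0, 0.3])[None, :]  # cols = input channels

r1 = channel_range(W1, axis=1)
r2 = channel_range(W2, axis=0)
s = np.sqrt(r2 / r1)                 # s_i = 1 / sqrt(r1_i / r2_i)

W1_eq = W1 * s[:, None]              # S W1 (the bias is scaled the same way)
b1_eq = b1 * s
W2_eq = W2 / s[None, :]              # W2 S^{-1}

x = rng.normal(size=8)
y_ref = W2 @ relu(W1 @ x + b1)
y_eq = W2_eq @ relu(W1_eq @ x + b1_eq)   # identical up to float error, since s > 0
```

After the transformation, each channel has the same range in both layers, equal to the geometric mean of the two original ranges.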


Activation Quantization in PTQ
#

Dynamic Range and the Outlier Challenge
#

Unlike weights, which are fixed after training, activations depend on the input data and vary at runtime. This creates two fundamental challenges:

  1. Range determination: We must estimate the activation range before deployment.
  2. Outliers: A small fraction of activation values can have extreme magnitudes, forcing a large scale that wastes precision for the majority of values.

Typical activation distribution:

Count
  |
  |          *****
  |        **     **
  |      **         **
  |    **             **
  |  **                 **
  | *                     *                               *  <- outlier
  |*                       *         *                    *
  +-------------------------------------------------------------> Value
  -2        -1        0        1        2        3  ...  15

The outlier at 15 forces the scale to accommodate a range of [-2, 15], even though 99.9% of values lie in [-2, 3]. This means 70% of the quantization levels are wasted on the [3, 15] range that almost no values occupy.

Calibration Dataset Requirements
#

To estimate activation ranges, PTQ requires running a small calibration dataset through the model. Guidelines for calibration data:

  • Size: 128-1024 samples is typically sufficient. Diminishing returns beyond 1024.
  • Representativeness: Should reflect the actual inference distribution. For image models, use images from the target domain. For language models, use text from the target domain.
  • No labels needed: Only forward passes are required; labels are unnecessary.
  • Diversity: Include a variety of inputs to capture the full activation range. Avoid calibrating on a single class or topic.

Batch Normalization Folding
#

Batch normalization (BN) layers introduce additional scaling and shifting that interacts poorly with quantization. The solution is to fold BN parameters into the preceding convolution or linear layer before quantization.

Full mathematical derivation:

A convolution followed by batch normalization computes:

$$\mathbf{y}_{\text{BN}} = \gamma \cdot \frac{\mathbf{W}\mathbf{x} + \mathbf{b}_{\text{conv}} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where:

  • \(\mathbf{W}, \mathbf{b}_{\text{conv}}\) are the convolution weights and bias
  • \(\mu, \sigma^2\) are the running mean and variance from BN
  • \(\gamma, \beta\) are the learned BN scale and shift
  • \(\epsilon\) is a small constant for numerical stability

We can rewrite this as a single affine transformation. Define:

$$\hat{\sigma} = \sqrt{\sigma^2 + \epsilon}$$

Then:

$$\mathbf{y}_{\text{BN}} = \frac{\gamma}{\hat{\sigma}} \mathbf{W}\mathbf{x} + \frac{\gamma}{\hat{\sigma}}(\mathbf{b}_{\text{conv}} - \mu) + \beta$$

The folded weights and bias are:

$$\mathbf{W}_{\text{fold}} = \frac{\gamma}{\hat{\sigma}} \mathbf{W}$$

$$\mathbf{b}_{\text{fold}} = \frac{\gamma}{\hat{\sigma}}(\mathbf{b}_{\text{conv}} - \mu) + \beta$$

For per-channel folding (the standard approach), each output channel \(c\) gets:

$$\mathbf{W}_{\text{fold}}[c, :] = \frac{\gamma_c}{\hat{\sigma}_c} \cdot \mathbf{W}[c, :]$$

$$b_{\text{fold},c} = \frac{\gamma_c}{\hat{\sigma}_c}(b_{\text{conv},c} - \mu_c) + \beta_c$$

Numerical example:

Conv weights (one output channel): W = [0.5, -0.3, 0.8]
Conv bias: b_conv = 0.1
BN parameters: gamma = 1.2, beta = 0.5, mu = 0.3, sigma^2 = 0.04, eps = 1e-5

sigma_hat = sqrt(0.04 + 1e-5) ≈ 0.20002
scale = gamma / sigma_hat ≈ 1.2 / 0.2 = 6.0   (eps is negligible here)

W_fold = 6.0 * [0.5, -0.3, 0.8] = [3.0, -1.8, 4.8]
b_fold = 6.0 * (0.1 - 0.3) + 0.5 = 6.0 * (-0.2) + 0.5 = -0.7

After folding, the BN layer is removed entirely, and the model has one fewer layer to quantize. This is essential because quantizing both the conv output and the BN output would introduce two quantization steps where only one is needed.

Important caveat: BN folding changes the weight distribution. Channels where \(\gamma / \hat{\sigma}\) is large will have amplified weights, potentially creating outliers. This is one reason weight equalization (discussed above) is performed after BN folding.
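
The folding formulas can be checked numerically. The sketch below uses the numbers from the example and treats the convolution as a single-output-channel linear map; the helper name `fold_bn` and the test input `x` are assumptions.

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold per-channel BN parameters into a weight of shape (C_out, ...)."""
    scale = gamma / np.sqrt(var + eps)                       # gamma_c / sigma_hat_c
    W_fold = W * scale.reshape(-1, *([1] * (W.ndim - 1)))    # scale each output channel
    b_fold = scale * (b - mu) + beta
    return W_fold, b_fold

W = np.array([[0.5, -0.3, 0.8]])      # one output channel
b = np.array([0.1])
gamma, beta = np.array([1.2]), np.array([0.5])
mu, var = np.array([0.3]), np.array([0.04])

W_fold, b_fold = fold_bn(W, b, gamma, beta, mu, var)

x = np.array([1.0, 2.0, -1.0])        # arbitrary test input
y_bn = gamma * (W @ x + b - mu) / np.sqrt(var + 1e-5) + beta
y_fold = W_fold @ x + b_fold          # same value, computed by a single layer
```

The folded parameters agree with the hand-computed values up to the small effect of `eps`.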


Calibration Methods (Deep Dive)
#

Calibration is the most critical step in PTQ. The choice of calibration method directly determines the quantization parameters \(s\) and \(z\), which in turn determine accuracy. This section covers all major approaches in depth.

MinMax Calibration
#

The simplest approach: use the observed minimum and maximum values.

$$s = \frac{x_{\max} - x_{\min}}{2^b - 1}, \quad z = \left\lfloor -\frac{x_{\min}}{s} \right\rceil$$

For symmetric quantization:

$$s = \frac{\max(|x_{\max}|, |x_{\min}|)}{2^{b-1} - 1}, \quad z = 0$$

Pros: Simple, deterministic, no hyperparameters.

Cons: Highly sensitive to outliers. A single extreme value can ruin the scale.
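
A minimal MinMax observer might look like the following. The class and method names are hypothetical and do not correspond to any particular framework's API.

```python
import numpy as np

class MinMaxObserver:
    """Tracks the global min/max over all calibration batches (hypothetical helper)."""

    def __init__(self):
        self.x_min, self.x_max = np.inf, -np.inf

    def update(self, x):
        self.x_min = min(self.x_min, float(np.min(x)))
        self.x_max = max(self.x_max, float(np.max(x)))

    def affine_params(self, bits=8):
        scale = (self.x_max - self.x_min) / (2 ** bits - 1)
        zero_point = int(round(-self.x_min / scale))
        return scale, zero_point

    def symmetric_params(self, bits=8):
        scale = max(abs(self.x_max), abs(self.x_min)) / (2 ** (bits - 1) - 1)
        return scale, 0

obs = MinMaxObserver()
for batch in (np.array([-0.5, 0.2, 1.0]), np.array([-1.0, 0.7, 3.0])):
    obs.update(batch)

scale, zero_point = obs.affine_params(bits=8)   # range [-1, 3] over both batches
sym_scale, sym_zp = obs.symmetric_params(bits=8)
```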

Moving Average MinMax
#

Instead of taking the global min/max across all calibration batches, use an exponential moving average:

$$x_{\max}^{(t)} = \alpha \cdot x_{\max}^{(t-1)} + (1 - \alpha) \cdot \max(\mathbf{x}^{(t)})$$

$$x_{\min}^{(t)} = \alpha \cdot x_{\min}^{(t-1)} + (1 - \alpha) \cdot \min(\mathbf{x}^{(t)})$$

where \(\alpha\) is typically 0.9 or 0.99. This smooths out batch-to-batch noise and reduces outlier sensitivity, though it introduces a hyperparameter and depends on calibration order.
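
The update rule can be sketched as a small stateful observer. Initializing the running estimates from the first batch is an assumption of this sketch (the formulas above do not specify the starting value), and the class name is hypothetical.

```python
class MovingAverageMinMaxObserver:
    """Exponential moving average of per-batch min and max (hypothetical helper)."""

    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.x_min = None
        self.x_max = None

    def update(self, batch):
        b_min, b_max = min(batch), max(batch)
        if self.x_min is None:
            # Assumption: the first batch initializes the running estimates.
            self.x_min, self.x_max = b_min, b_max
        else:
            a = self.alpha
            self.x_min = a * self.x_min + (1 - a) * b_min
            self.x_max = a * self.x_max + (1 - a) * b_max

ema_obs = MovingAverageMinMaxObserver(alpha=0.9)
for batch in ([-1.0, 1.0], [-1.0, 1.0], [-1.0, 10.0]):   # last batch has an outlier
    ema_obs.update(batch)
```

A plain MinMax observer would report a maximum of 10 here; the moving average reports 0.9 * 1 + 0.1 * 10 = 1.9, which shows how a single outlier batch is damped.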

Percentile / Histogram Calibration
#

Instead of using the absolute min/max, clip to a percentile of the distribution:

$$x_{\max} = \text{Percentile}(\mathbf{x}, p), \quad x_{\min} = \text{Percentile}(\mathbf{x}, 100 - p)$$

Typical values are \(p = 99.9\) or \(p = 99.99\). The implementation collects a histogram of activation values during calibration, then finds the percentile thresholds.

Histogram of activations:
Count
  |
  |     +--+
  |     |  |  +--+
  |  +--+  |  |  |
  |  |  |  |  |  |  +--+
  |  |  |  |  |  |  |  |
  |  |  |  |  |  |  |  |  +--+            +--+
  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--> Value
  0     1     2     3     4     5     6  ...  15

  Percentile 99.9% threshold: ~5.5
  (clips the outlier at 15, much better scale)
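
A small sketch of percentile clipping on synthetic data. For brevity it applies `np.percentile` to the raw samples rather than to an accumulated histogram, which is a simplification of the procedure described above; the data and the injected outliers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(loc=0.0, scale=1.0, size=100_000)   # synthetic activations
acts[:5] = 15.0                                        # inject a handful of outliers

p = 99.9
x_max = np.percentile(acts, p)
x_min = np.percentile(acts, 100 - p)

scale_percentile = (x_max - x_min) / 255               # INT8 affine scale after clipping
scale_minmax = (acts.max() - acts.min()) / 255         # scale forced by the outliers
```

The percentile range ignores the injected outliers, so the resulting step size is markedly smaller than the MinMax one.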

MSE Minimization
#

Find the clipping range \([\alpha, \beta]\) that minimizes the mean squared error between original and quantized values:

$$(\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \; \mathbb{E}\!\left[(x - Q(x; \alpha, \beta))^2\right]$$

where \(Q(x; \alpha, \beta)\) is the quantize-then-dequantize operation with clipping range \([\alpha, \beta]\).

Full derivation of the MSE objective:

The quantized reconstruction of \(x\) is:

$$\hat{x} = \begin{cases} \alpha & \text{if } x < \alpha \\ s \cdot \lfloor (x - \alpha)/s \rceil + \alpha & \text{if } \alpha \le x \le \beta \\ \beta & \text{if } x > \beta \end{cases}$$

where \(s = (\beta - \alpha) / (2^b - 1)\).

The MSE decomposes into three regions:

$$\text{MSE} = \underbrace{\int_{-\infty}^{\alpha} (x - \alpha)^2 p(x)\,dx}_{\text{clipping error (low)}} + \underbrace{\int_{\alpha}^{\beta} (x - \hat{x})^2 p(x)\,dx}_{\text{rounding error}} + \underbrace{\int_{\beta}^{\infty} (x - \beta)^2 p(x)\,dx}_{\text{clipping error (high)}}$$

As we increase the clipping range \([\alpha, \beta]\):

  • Clipping error decreases (fewer values clipped)
  • Rounding error increases (step size \(s\) grows)

The optimal range balances these two competing effects. In practice, this is solved by grid search over candidate thresholds, evaluating the MSE for each.

Grid search algorithm:

1. Collect histogram H of activation values with N bins
2. For each candidate threshold t in [t_min, t_max]:
     a. Compute scale s = 2*t / (2^b - 1)    (symmetric case)
     b. Compute clipping error: sum over bins outside [-t, t] of count * (|bin_center| - t)^2
     c. Compute rounding error: s^2/12 * (number of samples in bins inside [-t, t])
     d. Total MSE = (clipping_error + rounding_error) / (total number of samples)
3. Return t* = argmin(Total MSE)
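
The grid search can be sketched as follows. Unlike the histogram-based steps above, this version evaluates the MSE directly on the samples, which is simpler but assumes the calibration activations fit in memory. The function names and the synthetic data are assumptions.

```python
import numpy as np

def quant_dequant(x, t, bits):
    """Symmetric grid on [-t, t] with 2^b - 1 steps, matching s = 2t / (2^b - 1)."""
    s = 2 * t / (2 ** bits - 1)
    x_clipped = np.clip(x, -t, t)
    return np.round((x_clipped + t) / s) * s - t

def mse_clip_search(x, bits=4, num_candidates=100):
    """Return (best threshold, its MSE, MSE of the MinMax threshold)."""
    t_max = np.max(np.abs(x))
    candidates = np.linspace(t_max / num_candidates, t_max, num_candidates)
    errors = [np.mean((x - quant_dequant(x, t, bits)) ** 2) for t in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best], errors[-1]

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)                     # synthetic activations
t_star, mse_star, mse_minmax = mse_clip_search(x, bits=4)
```

The last candidate equals the MinMax threshold, so the search can never do worse than MinMax under this error metric.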

KL Divergence (TensorRT Approach)
#

NVIDIA’s TensorRT uses KL divergence (Kullback-Leibler divergence) to find the optimal clipping range. The idea is to find the quantized distribution \(Q\) that best approximates the original distribution \(P\) in an information-theoretic sense.

$$D_{\text{KL}}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

The TensorRT histogram binning algorithm (step by step):

1. Collect a high-resolution histogram of activations
   - Use 2048 bins (or more) covering [0, max_abs] for ReLU outputs
   - Accumulate counts across all calibration batches

2. For each candidate number of bins T = 128, 129, ..., 2048:
   a. Reference distribution P = histogram[0:T]; add the clipped mass
      sum(histogram[T:]) to the last bin P[T-1], then normalize

   b. Create quantized distribution Q:
      - Divide the T bins into 2^(b-1) quantization levels
        (128 for INT8; only the non-negative half of the grid is needed)
      - Each quantization level covers T/2^(b-1) consecutive bins
      - For each level, sum the counts -> assign uniform
        probability across non-zero bins in that level

   c. Compute KL(P || Q)

3. Select T* = argmin KL(P || Q)

4. Compute threshold: threshold = (T* + 0.5) * bin_width
   Set scale s = threshold / (2^(b-1) - 1)

Detailed example for INT8 symmetric with ReLU activations:

Suppose we have 2048 histogram bins, max activation = 10.0
bin_width = 10.0 / 2048 = 0.00488

For candidate T = 512 (covering range [0, 2.5]):
  - We have 512 bins to map into 128 quantization levels
  - Each level covers 512/128 = 4 bins

  Level 0: bins[0:4]   -> sum counts -> spread back
  Level 1: bins[4:8]   -> sum counts -> spread back
  ...
  Level 127: bins[508:512] -> sum counts -> spread back

  Remaining bins[512:2048] are clipped -> their mass is added to the last bin of P

  Normalize both P and Q, compute KL divergence.

For candidate T = 1024 (covering range [0, 5.0]):
  - Each level covers 1024/128 = 8 bins
  - Less clipping but coarser quantization

The T* that minimizes KL divergence is selected.
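
The procedure can be sketched in NumPy as below. This is an illustrative approximation of the steps described above, not TensorRT's actual implementation. It uses 512 histogram bins instead of 2048 to keep the example fast, and it simply skips candidates where the KL divergence is undefined; both are assumptions of the sketch, as are the function name and the synthetic data.

```python
import numpy as np

def kl_calibrate(hist, bin_width, num_levels=128):
    """Return (threshold, T) minimizing KL(P || Q) over candidate bin counts T."""
    hist = np.asarray(hist, dtype=np.float64)
    best_kl, best_T = np.inf, len(hist)
    for T in range(num_levels, len(hist) + 1):
        # Reference P: first T bins, clipped mass folded into the last kept bin.
        p = hist[:T].copy()
        p[-1] += hist[T:].sum()

        # Candidate Q: merge T bins into num_levels groups, then spread each
        # group's total uniformly over the bins that were non-empty.
        q = np.zeros(T)
        edges = np.linspace(0, T, num_levels + 1).astype(int)
        for lo, hi in zip(edges[:-1], edges[1:]):
            chunk = hist[lo:hi]
            nonzero = chunk > 0
            if nonzero.any():
                q[lo:hi][nonzero] = chunk.sum() / nonzero.sum()

        if q.sum() == 0:
            continue
        p /= p.sum()
        q /= q.sum()
        support = p > 0
        if np.any(q[support] == 0):
            continue  # KL would be infinite; skipping is a simplification
        kl = np.sum(p[support] * np.log(p[support] / q[support]))
        if kl < best_kl:
            best_kl, best_T = kl, T
    return (best_T + 0.5) * bin_width, best_T

rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=50_000))    # stand-in for post-ReLU activations
acts[:10] = 10.0                          # a few synthetic outliers
hist, bin_edges = np.histogram(acts, bins=512, range=(0.0, acts.max()))
bin_width = bin_edges[1] - bin_edges[0]
threshold, T_star = kl_calibrate(hist, bin_width)
scale = threshold / 127                   # INT8 symmetric scale
```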

ACIQ: Analytical Clipping for Integer Quantization
#

ACIQ (Banner et al., 2019) derives closed-form optimal clipping thresholds by assuming the weight or activation distribution follows a known parametric form (Gaussian or Laplacian).

Gaussian distribution derivation:

Assume \(x \sim \mathcal{N}(0, \sigma^2)\). We seek the symmetric clipping threshold \(\alpha\) that minimizes MSE.

The MSE consists of:

$$\text{MSE}(\alpha) = \underbrace{2 \int_{\alpha}^{\infty} (x - \alpha)^2 \phi(x)\,dx}_{\text{clipping MSE}} + \underbrace{\frac{\alpha^2}{3 \cdot 2^{2b}}}_{\text{quantization MSE}}$$

where \(\phi(x)\) is the standard Gaussian PDF (with appropriate scaling for \(\sigma\)).

The clipping MSE for a Gaussian with mean 0 and variance \(\sigma^2\) evaluates to:

$$\text{MSE}_{\text{clip}} = 2\sigma^2\!\left[\left(\frac{\alpha^2}{\sigma^2} + 1\right)\!\left(1 - \Phi\!\left(\frac{\alpha}{\sigma}\right)\right) - \frac{\alpha}{\sigma}\phi\!\left(\frac{\alpha}{\sigma}\right)\right]$$

where \(\Phi\) is the Gaussian CDF and \(\phi\) is the PDF.

The quantization MSE (rounding error for uniform quantization in \([-\alpha, \alpha]\) with \(2^b\) levels) is:

$$\text{MSE}_{\text{quant}} = \frac{(2\alpha)^2}{12 \cdot 2^{2b}} \cdot \left(2\Phi\!\left(\frac{\alpha}{\sigma}\right) - 1\right) \approx \frac{\alpha^2}{3 \cdot 2^{2b}}$$

Setting \(\frac{d}{d\alpha}[\text{MSE}_{\text{clip}} + \text{MSE}_{\text{quant}}] = 0\) and solving numerically yields optimal \(\alpha^* / \sigma\) values:

| Bit-width | Gaussian \(\alpha^*/\sigma\) | Laplacian \(\alpha^*/b_{\text{lap}}\) |
|---|---|---|
| 2 | 1.71 | 2.83 |
| 3 | 2.15 | 3.89 |
| 4 | 2.55 | 5.03 |
| 8 | 3.89 | 9.90 |

For example, for INT4 Gaussian-distributed activations with \(\sigma = 1.0\), the optimal clip is \(\alpha^* = 2.55\), meaning we clip about 0.54% of values on each tail (roughly 1.1% in total).

Laplacian distribution derivation:

For \(x \sim \text{Laplace}(0, b_{\text{lap}})\) with PDF \(p(x) = \frac{1}{2b_{\text{lap}}} e^{-|x|/b_{\text{lap}}}\), the clipping MSE is:

$$\text{MSE}_{\text{clip}} = 2\int_{\alpha}^{\infty} (x - \alpha)^2 \, p(x)\,dx = 2b_{\text{lap}}^2 \, e^{-\alpha/b_{\text{lap}}}$$

The total MSE is again minimized by differentiating and solving for \(\alpha\). The Laplacian model is often more appropriate for weights, while activations (especially post-ReLU) tend to be better modeled by half-Gaussian or exponential distributions.
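
The Gaussian thresholds can be recovered numerically from the two MSE terms given earlier. The sketch below uses a plain grid search with the standard library only; the function names and grid resolution are assumptions, and the result should land near the tabulated Gaussian values rather than match them to the last digit.

```python
import math

def gaussian_mse(alpha, bits, sigma=1.0):
    """Clipping MSE plus quantization MSE for x ~ N(0, sigma^2) clipped to [-alpha, alpha]."""
    a = alpha / sigma
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)
    tail = 0.5 * math.erfc(a / math.sqrt(2))            # 1 - Phi(a)
    clip = 2 * sigma ** 2 * ((a * a + 1) * tail - a * pdf)
    quant = alpha ** 2 / (3 * 2 ** (2 * bits))
    return clip + quant

def optimal_clip(bits, sigma=1.0, grid=20_000, alpha_max=12.0):
    """Grid search for the alpha that minimizes the total MSE."""
    candidates = [alpha_max * sigma * (i + 1) / grid for i in range(grid)]
    return min(candidates, key=lambda alpha: gaussian_mse(alpha, bits, sigma))

alpha_2bit = optimal_clip(2)
alpha_3bit = optimal_clip(3)
alpha_4bit = optimal_clip(4)
```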

Calibration Methods Comparison
#

| Method | Data Needed | Compute Cost | Outlier Robustness | Best For |
|---|---|---|---|---|
| MinMax | Minimal | Very low | Poor | Weights (per-channel) |
| Moving Avg MinMax | Minimal | Low | Moderate | Streaming calibration |
| Percentile | Moderate | Low | Good | General activations |
| MSE | Moderate | Medium | Good | Accuracy-sensitive tasks |
| KL Divergence | Moderate | Medium | Good | TensorRT deployment |
| ACIQ | None (analytical) | Very low | Moderate | Data-free PTQ |

Advanced PTQ Techniques
#

When simple calibration-based PTQ fails (particularly at low bit-widths like INT4), more sophisticated methods are needed. These methods typically use a small calibration set and optimize quantization parameters beyond simple range estimation.

AdaRound (2020)
#

Key insight: Rounding-to-nearest is not optimal. Sometimes rounding up when the nearest integer is down (or vice versa) can reduce the overall task loss.

Problem formulation:

For a single layer with weight matrix \(\mathbf{W}\) and input \(\mathbf{X}\), the layer-wise reconstruction objective is:

$$\min_{\mathbf{V}} \; \left\| \mathbf{W}\mathbf{X} - \hat{\mathbf{W}}(\mathbf{V})\mathbf{X} \right\|_F^2$$

where \(\mathbf{V} \in [0,1]^{m \times n}\) is a matrix of continuous rounding variables. The quantized weight is:

$$\hat{w}_{ij} = s \cdot \text{clamp}\!\left(\left\lfloor \frac{w_{ij}}{s} \right\rfloor + v_{ij},\; n_{\min},\; n_{\max}\right)$$

Here, \(\lfloor \cdot \rfloor\) is the floor function (not round), and \(v_{ij} \in \{0, 1\}\) decides whether to round up or down. When \(v_{ij} = 0\), we round down; when \(v_{ij} = 1\), we round up.

Relaxation to continuous optimization:

Since optimizing binary \(v_{ij} \in \{0, 1\}\) is combinatorial, AdaRound relaxes to a continuous surrogate using a rectified sigmoid \(h\):

$$\tilde{v}_{ij} = h(\theta_{ij}) = \text{clip}\!\left(\sigma(\theta_{ij}) \cdot (\zeta - \gamma) + \gamma, \; 0, \; 1\right)$$

where \(\theta_{ij}\) are learnable parameters and \(\zeta = 1.1, \gamma = -0.1\) are stretch parameters that allow the sigmoid to reach exactly 0 and 1.

Full loss function:

$$\mathcal{L} = \left\| \mathbf{W}\mathbf{X} - \hat{\mathbf{W}}(\tilde{\mathbf{V}})\mathbf{X} \right\|_F^2 + \lambda \sum_{i,j} \left(1 - |2\tilde{v}_{ij} - 1|^\beta\right)$$

The first term is the reconstruction loss. The second term is a regularizer that pushes \(\tilde{v}_{ij}\) toward 0 or 1 (binary), controlled by:

  • \(\lambda\): regularization strength, annealed during optimization
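  • \(\beta\): starts at a large value (e.g., 20) and anneals to a small value (e.g., 2). A large \(\beta\) keeps the penalty nearly constant across most of \((0, 1)\), so early iterations focus on the reconstruction loss; as \(\beta\) shrinks, the pull toward 0 or 1 strengthens and the values binarize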

AdaRound algorithm:

For each layer l in the network:
  1. Collect input activations X_l using calibration data
  2. Initialize theta so that v_tilde = frac(w_ij / s), i.e. W_hat = W at the start: theta_ij = stretched_sigmoid_inv(frac(w_ij / s))
  3. For t = 1 to T iterations:
     a. Compute soft rounding: v_tilde = stretched_sigmoid(theta)
     b. Compute quantized weights: W_hat = s * clamp(floor(W/s) + v_tilde)
     c. Compute reconstruction loss: L_rec = ||WX - W_hat X||_F^2
     d. Compute regularizer: L_reg = lambda(t) * sum(1 - |2v_tilde - 1|^beta(t))
     e. Update theta via gradient descent on L_rec + L_reg
  4. Final binary rounding: v_ij = round(v_tilde_ij)

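AdaRound needs only a small unlabeled calibration set. The original paper reports using about 1,000 images and on the order of 10,000 iterations per layer, which still completes in minutes for a ResNet-18-sized network on a single GPU.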
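
The building blocks of the algorithm (rectified sigmoid, initialization, regularizer, and soft quantization) can be sketched in NumPy. The gradient-based optimization loop is omitted, and the function names and the example weights are assumptions made for illustration.

```python
import numpy as np

ZETA, GAMMA = 1.1, -0.1   # stretch parameters

def rectified_sigmoid(theta):
    """h(theta) = clip(sigmoid(theta) * (zeta - gamma) + gamma, 0, 1)."""
    sig = 1.0 / (1.0 + np.exp(-theta))
    return np.clip(sig * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def init_theta(w, scale):
    """Choose theta so that rectified_sigmoid(theta) equals frac(w / s)."""
    rest = w / scale - np.floor(w / scale)
    return -np.log((ZETA - GAMMA) / (rest - GAMMA) - 1.0)

def round_regularizer(v, beta):
    """Sum of 1 - |2v - 1|^beta; zero exactly when every v is 0 or 1."""
    return np.sum(1.0 - np.abs(2.0 * v - 1.0) ** beta)

def soft_quantize(w, theta, scale, bits=4):
    """W_hat = s * clamp(floor(W / s) + h(theta), n_min, n_max)."""
    n_min, n_max = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    return scale * np.clip(np.floor(w / scale) + rectified_sigmoid(theta), n_min, n_max)

w = np.array([0.12, -0.45, 0.78, -1.02, 0.33])
scale = np.max(np.abs(w)) / 7                 # INT4 symmetric scale
theta0 = init_theta(w, scale)
w_soft = soft_quantize(w, theta0, scale)      # reproduces w before any optimization
```

Starting from `theta0`, the soft-quantized weights equal the original weights, so the reconstruction loss begins at zero and the regularizer gradually forces each rounding variable to a binary decision.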

BRECQ (2021)
#

Block Reconstruction Quantization (BRECQ; Li et al., 2021) extends the per-layer reconstruction idea to blocks of layers, using second-order (Fisher) information to motivate the block granularity and to weight the reconstruction objective.

Key contributions:

  1. Block-wise reconstruction: Instead of optimizing one layer at a time (AdaRound) or the entire network (too expensive), BRECQ optimizes blocks of layers (e.g., a ResNet basic block or a Transformer attention block).

  2. Fisher-weighted objective: The reconstruction loss is weighted by the Fisher information matrix, which measures how sensitive the task loss is to perturbations in each layer’s output.

The Fisher-weighted block reconstruction objective is:

$$\min_{\hat{\mathbf{W}}_1, \ldots, \hat{\mathbf{W}}_L} \; \left(\mathbf{f}(\mathbf{x}; \mathbf{W}) - \mathbf{f}(\mathbf{x}; \hat{\mathbf{W}})\right)^T \mathbf{F} \left(\mathbf{f}(\mathbf{x}; \mathbf{W}) - \mathbf{f}(\mathbf{x}; \hat{\mathbf{W}})\right)$$

where \(\mathbf{F}\) is the Fisher information matrix of the block output. In practice, this is approximated as a diagonal matrix, reducing to a channel-wise weighted MSE:

$$\mathcal{L} = \sum_c F_c \cdot \left\| \mathbf{o}_c - \hat{\mathbf{o}}_c \right\|^2$$

where \(F_c\) is the Fisher information for output channel \(c\).

  3. Cross-layer dependency: By optimizing all layers within a block jointly, BRECQ captures how rounding decisions in one layer affect the optimal rounding in subsequent layers.

QDrop (2022)
#

QDrop (Wei et al., 2022) introduces a surprisingly simple yet effective regularization technique for PTQ: randomly dropping quantization during the optimization process.

Mechanism: During the block reconstruction optimization (similar to BRECQ), QDrop randomly keeps some activations in full precision (skipping quantization) with probability \(p\). This is analogous to dropout but applied to quantization itself.

$$\hat{x}_i = \begin{cases} x_i & \text{with probability } p \\ Q(x_i) & \text{with probability } 1 - p \end{cases}$$

Why it works: Randomly mixing quantized and full-precision activations during optimization creates a flatter loss landscape around the quantized weights. This is crucial because the quantized model operates at a discrete point in weight space, and a flat minimum is more robust to the inherent discretization error.

The training objective becomes:

$$\mathcal{L} = \mathbb{E}_{\text{mask}}\!\left[\left\| \mathbf{y}_{\text{FP}} - \mathbf{y}_{\text{QDrop}} \right\|^2\right]$$

A typical drop probability is \(p = 0.5\), applied throughout the reconstruction phase. Dropping is disabled at inference, so the deployed model is fully quantized.
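
The element-wise mixing rule can be sketched as follows. This shows only the masking step, not the surrounding block reconstruction loop; the function names, the symmetric fake-quantizer, and the sample data are assumptions.

```python
import numpy as np

def fake_quant(x, scale, bits=8):
    """Quantize then dequantize on a symmetric grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def qdrop(x, scale, p, rng, bits=8):
    """Keep each element in full precision with probability p, otherwise quantize it."""
    keep_fp = rng.random(x.shape) < p
    return np.where(keep_fp, x, fake_quant(x, scale, bits))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
scale = np.max(np.abs(x)) / 127

x_mixed = qdrop(x, scale, p=0.5, rng=rng)    # roughly half the entries stay in FP
x_all_q = qdrop(x, scale, p=0.0, rng=rng)    # p = 0: everything is quantized
x_all_fp = qdrop(x, scale, p=1.0, rng=rng)   # p = 1: nothing is quantized
```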

OmniQuant (2023)
#

OmniQuant (Shao et al., 2023) introduces two complementary learnable transformations that make weights and activations more quantization-friendly:

1. Learnable Weight Clipping (LWC):

Instead of using the full weight range, learn per-channel clipping thresholds:

$$\hat{w}_{ij} = Q\!\left(\text{clamp}(w_{ij}, -h_c, h_c)\right)$$

where \(h_c = s_c \cdot (2^{b-1} - 1)\) and \(s_c\) is a learnable per-channel scale parameter. The gradient flows through the clamp operation via straight-through estimation.

2. Learnable Equivalent Transformation (LET):

Learn channel-wise scaling and shifting parameters that transform activations into a more quantization-friendly distribution:

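$$\mathbf{y} = Q_w\!\left(\mathbf{W}\,\text{diag}(\mathbf{s})\right) \cdot Q_a\!\left(\text{diag}(\mathbf{s})^{-1}(\mathbf{x} - \boldsymbol{\delta})\right) + \left(\mathbf{b} + \mathbf{W}\boldsymbol{\delta}\right)$$

where \(\mathbf{s}\) and \(\boldsymbol{\delta}\) are learned per-channel scaling and shifting parameters and \(\mathbf{b}\) is the layer bias. Without the quantizers the expression reduces to \(\mathbf{W}\mathbf{x} + \mathbf{b}\), so the transformation is exact in floating point. This is similar in spirit to SmoothQuant but with parameters learned by gradient descent rather than set by a fixed rule.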

The loss is the block-wise reconstruction error between the full-precision and quantized block outputs:

$$\mathcal{L} = \left\| \mathbf{y}_{\text{FP}} - \hat{\mathbf{y}} \right\|^2$$

OmniQuant optimizes only the clipping and transformation parameters (not the model weights themselves). The paper reports using 128 calibration samples and roughly 1 to 16 hours on a single A100-40G GPU for the LLaMA-2 family (7B to 70B).


PTQ for Large Language Models
#

Large language models (LLMs) present unique challenges for PTQ due to their massive scale (billions of parameters), attention mechanisms, and peculiar activation distributions.

The Outlier Problem in LLMs
#

Dettmers et al. (2022) discovered that Transformer models develop systematic activation outliers in specific feature dimensions. These outliers:

  • Appear consistently in the same channels across all tokens and layers
  • Can be 10-100x larger than typical activations
  • Emerge during pre-training and grow with model scale
  • Are concentrated in a small fraction (<1%) of feature dimensions

LLM Activation Distribution (one hidden dimension):

     Feature dim 0-4094: values in [-1, 1]
     Feature dim 4095:    values in [-60, 60]   <-- OUTLIER CHANNEL

     +--+--+--+--+--+--+    +--+
     |  |  |  |  |  |  |    |  |  <- outlier
     |  |  |  |  |  |  |    |  |     channel
     |  |  |  |  |  |  |    |  |
     |  |  |  |  |  |  |    |  |
     +--+--+--+--+--+--+----+--+---> channel index
      0  1  2  3  4  5  ... 4095

If quantized per-tensor: scale = 60/127 = 0.472
  -> 99% of values (in [-1,1]) map to only ~5 quantization levels (-2..2)!

This means naive per-tensor INT8 quantization can catastrophically fail for LLMs. The following methods address this challenge.

SmoothQuant (2023)
#

SmoothQuant (Xiao et al., 2023) tackles the outlier problem by migrating the quantization difficulty from activations to weights, exploiting the observation that weights are much easier to quantize.

Core idea: For a linear layer \(\mathbf{Y} = \mathbf{X}\mathbf{W}\), introduce a per-channel smoothing factor \(\mathbf{s}\):

$$\mathbf{Y} = (\mathbf{X} \text{diag}(\mathbf{s})^{-1}) \cdot (\text{diag}(\mathbf{s}) \mathbf{W}) = \hat{\mathbf{X}} \hat{\mathbf{W}}$$

This is mathematically equivalent but changes the distributions:

  • \(\hat{\mathbf{X}} = \mathbf{X} \text{diag}(\mathbf{s})^{-1}\): divides each activation channel by \(s_j\), shrinking outlier channels
  • \(\hat{\mathbf{W}} = \text{diag}(\mathbf{s}) \mathbf{W}\): multiplies each weight input channel by \(s_j\), absorbing the difficulty

Smoothing factor derivation:

The optimal \(s_j\) for channel \(j\) balances the quantization difficulty between \(\hat{\mathbf{X}}\) and \(\hat{\mathbf{W}}\):

$$s_j = \frac{\max(|\mathbf{X}_j|)^\alpha}{\max(|\mathbf{W}_j|)^{1-\alpha}}$$

where:

  • \(\max(|\mathbf{X}_j|)\) is the maximum absolute activation in channel \(j\) (across calibration data)
  • \(\max(|\mathbf{W}_j|)\) is the maximum absolute weight in input channel \(j\)
  • \(\alpha \in [0, 1]\) is a migration strength hyperparameter

Analysis of \(\alpha\):

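  • \(\alpha = 0\): \(s_j = 1/\max(|\mathbf{W}_j|)\). Nothing is migrated to the weights: the factor depends only on the weights, so the activation outliers stay where they are.
  • \(\alpha = 1\): \(s_j = \max(|\mathbf{X}_j|)\). Full migration to weights: every activation channel is normalized to a maximum of 1.
  • \(\alpha = 0.5\): Equal difficulty sharing. Both \(\hat{\mathbf{X}}_j\) and \(\hat{\mathbf{W}}_j\) end up with the same maximum, the geometric mean \(\sqrt{\max(|\mathbf{X}_j|)\max(|\mathbf{W}_j|)}\).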

In practice, \(\alpha = 0.5\) works well for many LLMs, achieving W8A8 quantization with negligible accuracy loss; models with more severe activation outliers may need a larger value such as 0.75.

Per-channel math example:

Channel j = 42 (outlier channel):
  max(|X_42|) = 60.0  (huge outlier)
  max(|W_42|) = 0.5   (normal weight)

  alpha = 0.5:
  s_42 = 60.0^0.5 / 0.5^0.5 = 7.746 / 0.707 = 10.95

  After smoothing:
  max(|X_hat_42|) = 60.0 / 10.95 = 5.48  (much more manageable)
  max(|W_hat_42|) = 0.5 * 10.95 = 5.48   (still fine for weight quant)
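The same computation can be sketched for a whole toy layer. The matrices below are made-up values, and `smooth_scales` and `matmul` are illustrative helper names rather than part of the SmoothQuant code base; the sketch only checks that the output is unchanged and that the per-channel maxima are balanced when \(\alpha = 0.5\).

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one factor per input channel."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical toy layer: 2 tokens x 3 input channels, 3 input channels x 2 outputs.
X = [[0.5, -60.0, 0.8],
     [-1.0, 45.0, 0.3]]
W = [[0.2, -0.4],
     [0.5, 0.1],
     [-0.3, 0.25]]

act_absmax = [max(abs(row[j]) for row in X) for j in range(3)]   # per activation channel
weight_absmax = [max(abs(w) for w in W[j]) for j in range(3)]    # per weight input channel
s = smooth_scales(act_absmax, weight_absmax, alpha=0.5)

X_hat = [[x / s_j for x, s_j in zip(row, s)] for row in X]   # X diag(s)^-1
W_hat = [[w * s_j for w in row] for row, s_j in zip(W, s)]   # diag(s) W

Y, Y_hat = matmul(X, W), matmul(X_hat, W_hat)
print("s =", [round(v, 3) for v in s])
print("Y     =", Y)
print("Y_hat =", Y_hat)
```

In a real model the division by \(s_j\) is usually folded into the preceding layer (for example a LayerNorm), so no extra operation is needed at inference time.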

GPTQ (2023)
#

GPTQ (Frantar et al., 2023) is a one-shot weight quantization method based on approximate second-order information. It builds on the Optimal Brain Quantization (OBQ) framework but scales to billion-parameter models through clever algorithmic choices.

Mathematical foundation:

For a linear layer \(\mathbf{Y} = \mathbf{XW}\), quantizing the weight matrix \(\mathbf{W}\) to \(\hat{\mathbf{W}}\) introduces an error. The layer-wise objective is:

$$\min_{\hat{\mathbf{W}}} \; \left\| \mathbf{XW} - \mathbf{X}\hat{\mathbf{W}} \right\|_F^2 = \min_{\hat{\mathbf{W}}} \; \text{tr}\!\left[(\mathbf{W} - \hat{\mathbf{W}})^T \mathbf{H} (\mathbf{W} - \hat{\mathbf{W}})\right]$$

where \(\mathbf{H} = \mathbf{X}^T\mathbf{X}\) is the Hessian of the layer-wise loss with respect to the weights (for a linear layer, this equals the input correlation matrix).

OBQ update formula:

When quantizing weight \(w_q\) at position \(q\), the optimal update to the remaining (unquantized) weights is:

$$\boldsymbol{\delta}_F = -\frac{w_q - \hat{w}_q}{[\mathbf{H}^{-1}]_{qq}} \cdot (\mathbf{H}^{-1})_{:,q}$$

This compensates for the quantization error by adjusting the remaining weights using second-order information. The resulting increase in the layer-wise objective from quantizing weight \(q\) is:

$$E_q = \frac{(w_q - \hat{w}_q)^2}{2[\mathbf{H}^{-1}]_{qq}}$$

GPTQ’s key optimizations:

  1. Fixed quantization order: Instead of greedily selecting the weight with minimum \(E_q\) (expensive), GPTQ quantizes all weights in a fixed order (column by column), which enables batched computation.

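  2. Column-wise processing with Cholesky decomposition: Process the weight matrix column by column. Instead of updating the inverse Hessian after every column, GPTQ precomputes the rows it needs from a Cholesky decomposition of \(\mathbf{H}^{-1}\), which is also numerically more stable.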

  3. Block processing: Process \(B = 128\) columns at a time, applying updates lazily for better GPU utilization.

Step-by-step GPTQ algorithm:

Input: Weight matrix W (d_out x d_in), Hessian H = X^T X (d_in x d_in), bit-width b

1. Compute H_inv = (H + lambda*I)^{-1}  (with damping lambda ~= 0.01 * mean(diag(H)))
2. Compute the upper Cholesky factor U of H_inv (H_inv = U^T U);
   row q of U holds the inverse-Hessian information needed when column q is quantized

3. For each column group g = 0, B, 2B, ..., d_in-B:
   a. For q = g to g+B-1:
      i.   Quantize: W_hat[:, q] = Quantize(W[:, q])
      ii.  Compute scaled error: E[:, q] = (W[:, q] - W_hat[:, q]) / U[q, q]
      iii. Update remaining columns in block (outer product):
           W[:, q+1:(g+B)] -= E[:, q] * U[q, q+1:(g+B)]
   b. Update remaining unprocessed columns:
      W[:, (g+B):] -= E[:, g:(g+B)] * U[g:(g+B), (g+B):]

Output: Quantized weight matrix W_hat

Numerical walkthrough (simplified 3x3 example):

W = [[0.12, -0.45, 0.78],      H_inv = [[2.0, 0.3, 0.1],
     [-1.02, 0.33, 0.56],               [0.3, 1.5, 0.2],
     [0.67, -0.89, 0.11]]               [0.1, 0.2, 1.8]]

INT4 symmetric, scale per-row (row 0: scale = 0.78 / 7 = 0.1114).

Step 1: Quantize column 0
  Row 0: w=0.12, quantize to w_hat = 1 * 0.1114 = 0.1114 (nearest grid point)
  Error delta = (0.12 - 0.1114) / 2.0 = 0.0043
  Update col 1: W[0,1] -= 0.0043 * 0.3  ->  -0.45 - 0.0013 = -0.4513
  Update col 2: W[0,2] -= 0.0043 * 0.1  ->  0.78 - 0.0004 = 0.7796
  (similar for rows 1, 2)

Step 2: Quantize column 1 (with updated values)
  ...and so on
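The column loop can also be sketched in plain Python on the 3x3 matrices from the walkthrough. This is a minimal sketch under several assumptions: no column blocking, a fixed per-row symmetric INT4 scale computed up front, and updates driven by the upper Cholesky factor of H_inv, which is how the GPTQ reference implementation organizes them. `cholesky_upper` and `gptq_quantize` are illustrative names, not library functions.

```python
import math

def cholesky_upper(A):
    """Upper-triangular U with A = U^T U, for a symmetric positive-definite A."""
    n = len(A)
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = A[i][j] - sum(U[k][i] * U[k][j] for k in range(i))
            U[i][j] = math.sqrt(s) if i == j else s / U[i][i]
    return U

def gptq_quantize(W, H_inv, qmax=7):
    """Quantize W (d_out x d_in) column by column, pushing each column's
    rounding error onto the columns that have not been quantized yet."""
    W = [row[:] for row in W]                                  # work on a copy
    U = cholesky_upper(H_inv)
    scales = [max(abs(w) for w in row) / qmax for row in W]    # per-row scale (assumption)
    Q = [[0.0] * len(row) for row in W]
    for q in range(len(W[0])):
        for r, row in enumerate(W):
            level = max(-qmax, min(qmax, round(row[q] / scales[r])))
            Q[r][q] = level * scales[r]
            err = (row[q] - Q[r][q]) / U[q][q]                 # scaled rounding error
            for j in range(q + 1, len(row)):                   # compensate later columns
                row[j] -= err * U[q][j]
    return Q, scales

W = [[0.12, -0.45, 0.78],
     [-1.02, 0.33, 0.56],
     [0.67, -0.89, 0.11]]
H_inv = [[2.0, 0.3, 0.1],
         [0.3, 1.5, 0.2],
         [0.1, 0.2, 1.8]]

Q, scales = gptq_quantize(W, H_inv)
for row in Q:
    print([round(v, 4) for v in row])
```

For the first column, dividing by `U[0][0]` and multiplying by `U[0][j]` is the same as multiplying the error by `H_inv[0][j] / H_inv[0][0]`; the Cholesky factor matters from the second column on, where it accounts for the columns already eliminated.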

GPTQ achieves remarkable results: it can quantize a 175B-parameter model to 3-4 bits in approximately 4 GPU-hours with minimal perplexity increase.

AWQ (2024)
#

Activation-Aware Weight Quantization (Lin et al., 2024) observes that not all weights are equally important. Weights connected to large-magnitude activation channels (the outlier channels) are salient and should be protected from quantization error.

Core observation: For \(\mathbf{y} = \mathbf{Wx}\), the output error from quantizing weight column \(j\) is proportional to \(|x_j|\):

$$|\Delta y_i| = |w_{ij} - \hat{w}_{ij}| \cdot |x_j|$$

Therefore, weight columns corresponding to large \(|x_j|\) are more important.

AWQ’s approach — per-channel scaling:

Instead of keeping salient weights in higher precision (which complicates hardware), AWQ scales up salient weight channels before quantization:

$$Q(\mathbf{w} \cdot s) \cdot \frac{\mathbf{x}}{s} \approx \mathbf{w} \cdot \mathbf{x}$$

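Scaling up \(\mathbf{w}\) by \(s > 1\) typically reduces the relative quantization error for that channel:

$$\frac{|w \cdot s - Q(w \cdot s)|}{|w \cdot s|} < \frac{|w - Q(w)|}{|w|}$$

because the quantization step size is set by the largest weight in the group, which usually does not change when only a few salient channels are scaled. The absolute rounding error therefore stays about the same while the scaled weight is \(s\) times larger, and dividing the activation by \(s\) shrinks that channel's contribution to the output error by roughly \(1/s\).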

Optimal scaling factor:

AWQ searches for the optimal per-channel scale \(s_j\) by minimizing:

$$s^* = \arg\min_s \; \left\| Q(\mathbf{W} \cdot \text{diag}(\mathbf{s})) \cdot (\text{diag}(\mathbf{s})^{-1} \mathbf{X}) - \mathbf{WX} \right\|$$

In practice, AWQ restricts the search to \(s_j = x_j^\alpha\), where \(x_j = \text{mean}(|\mathbf{X}_{:,j}|)\) is measured on calibration data, and grid-searches \(\alpha\) over \([0, 1]\) (the reference implementation evaluates 20 evenly spaced values).
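A grid search of this kind can be sketched as follows. The sketch assumes symmetric INT4 round-to-nearest with one scale per weight row, which is simpler than the group-wise quantizer AWQ actually uses; the toy data, the grid resolution `n_grid`, and the helper names are all illustrative.

```python
def quantize_row(row, qmax=7):
    """Symmetric round-to-nearest with one scale per row (one quantization group)."""
    scale = max(abs(w) for w in row) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]

def output_error(W, X, s):
    """Squared error between W x and Q(W diag(s)) (diag(s)^-1 x), summed over samples."""
    W_q = [quantize_row([w * s_j for w, s_j in zip(row, s)]) for row in W]
    err = 0.0
    for x in X:                                    # x: one calibration sample
        x_scaled = [x_j / s_j for x_j, s_j in zip(x, s)]
        for row, row_q in zip(W, W_q):
            y = sum(w * x_j for w, x_j in zip(row, x))
            y_q = sum(w * x_j for w, x_j in zip(row_q, x_scaled))
            err += (y - y_q) ** 2
    return err

def search_scales(W, X, n_grid=20):
    """Try s_j = mean(|X_j|)^alpha for alpha on a grid in [0, 1]; keep the best."""
    n_ch = len(W[0])
    x_mean = [sum(abs(x[j]) for x in X) / len(X) for j in range(n_ch)]
    best = None
    for step in range(n_grid + 1):
        alpha = step / n_grid
        s = [m ** alpha for m in x_mean]
        err = output_error(W, X, s)
        if best is None or err < best[0]:
            best = (err, alpha, s)
    return best

# Hypothetical toy data: 2 output rows x 4 input channels, channel 2 is the outlier.
W = [[0.30, -0.70, 0.52, 0.11],
     [-0.45, 0.20, 0.61, -0.90]]
X = [[0.5, -1.0, 20.0, 0.8],
     [-0.7, 0.9, -25.0, 0.4],
     [0.6, 1.1, 22.0, -0.5]]

best_err, best_alpha, best_s = search_scales(W, X)
print("alpha =", best_alpha, "error =", best_err)
```

Because \(\alpha = 0\) (no scaling) is part of the grid, the selected scales can never be worse than plain quantization on the calibration data; how much they help depends on how pronounced the salient channels are.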

SpinQuant (2024)
#

SpinQuant (Liu et al., 2024) introduces rotation matrices to transform weights and activations into distributions that are more amenable to quantization.

Key insight: Outliers in specific channels make quantization difficult, but applying a random rotation spreads energy more uniformly across all channels, reducing the dynamic range.

Mathematical formulation:

For a linear layer \(\mathbf{Y} = \mathbf{XW}\), insert orthogonal rotation matrices \(\mathbf{R}_1, \mathbf{R}_2\):

$$\mathbf{Y} = (\mathbf{X}\mathbf{R}_1)(\mathbf{R}_1^T\mathbf{W}\mathbf{R}_2)(\mathbf{R}_2^T)$$

Since \(\mathbf{R}\mathbf{R}^T = \mathbf{I}\), this is mathematically equivalent. However, the rotated weight \(\mathbf{R}_1^T\mathbf{W}\mathbf{R}_2\) and rotated activation \(\mathbf{X}\mathbf{R}_1\) have more uniform distributions.

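SpinQuant goes beyond random rotations by learning the rotations on a small calibration set. It optimizes them with Cayley SGD, an optimizer whose updates are built from the Cayley transform, ensuring the rotation remains orthogonal throughout optimization. The Cayley transform maps a skew-symmetric matrix to an orthogonal one:

$$\mathbf{R}(\mathbf{A}) = (\mathbf{I} - \mathbf{A})^{-1}(\mathbf{I} + \mathbf{A})$$

where \(\mathbf{A}\) is a skew-symmetric matrix (\(\mathbf{A}^T = -\mathbf{A}\)), constructed from the gradient at each optimization step.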
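Both ingredients can be illustrated on tiny matrices. The sketch below uses a fixed normalized 4x4 Hadamard matrix as a stand-in for a rotation (only \(\mathbf{R}_1\), no \(\mathbf{R}_2\)) and a closed-form 2x2 Cayley transform; the activation and weight values are made up, and nothing here is learned.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def cayley_2x2(a):
    """Closed form of R = (I - A)^-1 (I + A) for A = [[0, a], [-a, 0]]."""
    d = 1.0 + a * a
    return [[(1 - a * a) / d, 2 * a / d],
            [-2 * a / d, (1 - a * a) / d]]

# Normalized 4x4 Hadamard matrix: a fixed orthogonal rotation (R R^T = I).
R = [[0.5 * v for v in row] for row in
     [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]]

X = [[60.0, 1.0, -1.0, 0.5]]    # hypothetical activation row with an outlier in channel 0
W = [[0.2, -0.1],
     [0.4, 0.3],
     [-0.5, 0.2],
     [0.1, -0.3]]               # hypothetical 4x2 weight matrix

X_rot = matmul(X, R)                 # X R_1
W_rot = matmul(transpose(R), W)      # R_1^T W
Y, Y_rot = matmul(X, W), matmul(X_rot, W_rot)

print("max |X|     =", max(abs(v) for v in X[0]))
print("max |X R_1| =", max(abs(v) for v in X_rot[0]))
print("Y     =", Y)
print("Y_rot =", Y_rot)

R2 = cayley_2x2(0.7)
print("R2 R2^T =", matmul(R2, transpose(R2)))
```

The rotated activation has roughly half the peak magnitude of the original because the outlier's energy is spread over all four channels, while the layer output is unchanged up to floating-point error.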


Mixed-Precision PTQ
#

Not all layers in a neural network are equally sensitive to quantization. Mixed-precision quantization assigns different bit-widths to different layers (or even different channels) to maximize accuracy under a given resource budget.

Sensitivity Analysis
#

The fundamental question is: how much does quantizing layer \(l\) to \(b\) bits degrade overall accuracy?

Perturbation-based sensitivity: Quantize one layer at a time while keeping all others in FP32, and measure the accuracy drop:

$$\text{Sensitivity}(l, b) = \text{Acc}_{\text{FP32}} - \text{Acc}(\text{layer } l \text{ at } b \text{ bits, rest FP32})$$

Layer Sensitivity Analysis (example):

Layer     | INT8 drop | INT4 drop | Sensitivity
----------|-----------|-----------|------------
conv1     |  0.01%    |   0.3%    | Low
resblock1 |  0.02%    |   0.5%    | Low
resblock2 |  0.05%    |   2.1%    | MEDIUM
resblock3 |  0.01%    |   0.4%    | Low
attention |  0.15%    |   5.3%    | HIGH
head      |  0.20%    |   8.1%    | VERY HIGH

Strategy: Keep attention and head at INT8,
          quantize rest to INT4.
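The procedure behind such a table can be sketched as a loop over layers. In the sketch below, `evaluate(layer, bits)` is a hypothetical callback that would quantize one layer and return validation accuracy; here it simply replays the INT4 drops from the table, and both the FP32 baseline and the threshold of 3.0 percentage points are assumptions chosen for illustration.

```python
def measure_sensitivity(layers, evaluate, fp32_accuracy, bits=4):
    """Accuracy drop when one layer at a time is quantized to `bits`, rest kept in FP32."""
    return {layer: fp32_accuracy - evaluate(layer, bits) for layer in layers}

def assign_bits(sensitivity, threshold, low_bits=4, high_bits=8):
    """Keep layers whose drop exceeds `threshold` at the higher precision."""
    return {layer: high_bits if drop > threshold else low_bits
            for layer, drop in sensitivity.items()}

# INT4 drops (percentage points) taken from the example table.
INT4_DROP = {"conv1": 0.3, "resblock1": 0.5, "resblock2": 2.1,
             "resblock3": 0.4, "attention": 5.3, "head": 8.1}
FP32_ACC = 76.0   # hypothetical FP32 baseline accuracy in percent

def replay_evaluate(layer, bits):
    """Stand-in for a real evaluation run: replays the tabulated INT4 result."""
    return FP32_ACC - INT4_DROP[layer]

sensitivity = measure_sensitivity(INT4_DROP, replay_evaluate, FP32_ACC)
plan = assign_bits(sensitivity, threshold=3.0)
print(plan)
```

A real run needs one evaluation pass per layer and candidate bit-width, which is what makes exhaustive perturbation analysis expensive for deep networks.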

HAWQ: Hessian-Aware Quantization
#

HAWQ (Dong et al., 2019) and its successors use the Hessian spectrum (eigenvalues of the loss Hessian with respect to each layer’s weights) to determine quantization sensitivity without exhaustive per-layer evaluation.

The original HAWQ ranks layers by the top eigenvalue of the Hessian (its spectral norm), computed efficiently via the power method:

$$\lambda_{\max}^{(l)} \approx \frac{\mathbf{v}^T \mathbf{H}_l \mathbf{v}}{\mathbf{v}^T \mathbf{v}}$$

after \(k\) iterations of \(\mathbf{v} \leftarrow \mathbf{H}_l \mathbf{v} / \|\mathbf{H}_l \mathbf{v}\|\). Layers with a large \(\lambda_{\max}^{(l)}\) are more sensitive to perturbation and should receive higher bit-widths.

HAWQ-V2 replaces the top eigenvalue with the trace of the Hessian for each layer (averaged over the layer's parameters), which takes the whole spectrum into account:

$$\Omega_l = \text{tr}(\mathbf{H}_l) = \sum_i \lambda_i^{(l)}$$

where \(\lambda_i^{(l)}\) are the eigenvalues of the Hessian block corresponding to layer \(l\). The trace is estimated with Hutchinson's randomized method, which also needs only Hessian-vector products.

The Hessian-vector product \(\mathbf{H}_l \mathbf{v}\) is computed without forming \(\mathbf{H}_l\) explicitly, using the identity:

$$\mathbf{H}\mathbf{v} = \nabla_\theta \left(\nabla_\theta \mathcal{L} \cdot \mathbf{v}\right)$$

which requires only two backpropagation passes.
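The power method only ever touches the Hessian through Hessian-vector products, which a short sketch makes explicit. Here `hvp` is a hypothetical callback standing in for the double-backpropagation product; for the toy check it multiplies by a small explicit matrix, which a real network would never materialize.

```python
import math
import random

def power_iteration(hvp, dim, iters=100, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD operator given only v -> H v."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        hv = hvp(v)
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]                 # v <- H v / ||H v||
    hv = hvp(v)
    # Rayleigh quotient v^T H v / v^T v
    return sum(a * b for a, b in zip(v, hv)) / sum(a * a for a in v)

# Toy stand-in for one layer's Hessian block (symmetric, positive definite).
H_TOY = [[4.0, 1.0],
         [1.0, 3.0]]

def toy_hvp(v):
    return [sum(h * x for h, x in zip(row, v)) for row in H_TOY]

lam_max = power_iteration(toy_hvp, dim=2)
print("estimated top eigenvalue:", lam_max)
```

For this 2x2 matrix the exact top eigenvalue is \((7 + \sqrt{5})/2 \approx 4.618\), so the estimate can be checked by hand.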

Latency-Aware Mixed-Precision
#

Sensitivity alone is insufficient; we also need to account for the hardware latency of each precision choice. A layer that is highly sensitive but also very fast at INT4 may still be worth quantizing aggressively.

The optimization problem is:

$$\max_{\{b_l\}} \; \text{Accuracy}(\{b_l\}) \quad \text{s.t.} \quad \sum_l \text{Latency}(l, b_l) \le T_{\text{budget}}$$

This is typically solved via:

  1. Integer Linear Programming (ILP): Enumerate candidate bit-widths per layer, measure latency on target hardware, and solve the ILP.
  2. Pareto frontier: Compute the Pareto-optimal set of (latency, accuracy) configurations and let the user pick their operating point.

Accuracy vs Latency Pareto Frontier:

Accuracy
  |  * FP32 baseline (100%)
  |
  |    * Mixed W8/W4 (99.5%)
  |
  |        * All W8A8 (99.2%)
  |
  |             * Mixed W4/W8 (98.5%)
  |
  |                    * All W4A8 (97.0%)
  |
  |                              * All W4A4 (94.0%)
  |
  +---------------------------------------------------> Latency
  slow                                            fast
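For a handful of layers the constrained problem can be solved by brute force, which makes the structure of the ILP easy to see. The sketch below assumes that per-layer accuracy drops add up (a simplification), takes the drops from the sensitivity table, and uses made-up latencies in microseconds; an actual deployment would measure latency on the target hardware and hand the problem to an ILP solver.

```python
from itertools import product

# layer -> {bits: (accuracy drop in percentage points, latency in microseconds)}
# Drops follow the sensitivity table; latencies are hypothetical placeholders.
CANDIDATES = {
    "conv1":     {8: (0.01, 1000), 4: (0.3, 600)},
    "resblock1": {8: (0.02, 2000), 4: (0.5, 1200)},
    "resblock2": {8: (0.05, 2000), 4: (2.1, 1200)},
    "resblock3": {8: (0.01, 2000), 4: (0.4, 1200)},
    "attention": {8: (0.15, 3000), 4: (5.3, 1800)},
    "head":      {8: (0.20, 1000), 4: (8.1, 600)},
}

def best_assignment(candidates, latency_budget):
    """Enumerate every bit-width combination and keep the feasible one with the
    smallest total accuracy drop (assumes drops are additive across layers)."""
    layers = list(candidates)
    best = None
    for choice in product(*(candidates[layer] for layer in layers)):
        drop = sum(candidates[l][b][0] for l, b in zip(layers, choice))
        latency = sum(candidates[l][b][1] for l, b in zip(layers, choice))
        if latency <= latency_budget and (best is None or drop < best[0]):
            best = (drop, latency, dict(zip(layers, choice)))
    return best

drop, latency, plan = best_assignment(CANDIDATES, latency_budget=9000)
print(plan, "total drop:", round(drop, 2), "latency:", latency)
```

Enumeration grows exponentially with the number of layers, which is why larger networks need an ILP formulation or a greedy heuristic instead.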

Practical PTQ Tools and Frameworks
#

Tool Overview and Comparison
#

Framework               | Primary Use              | Weight Quant          | Activation Quant | LLM Support   | Target Hardware
------------------------|--------------------------|-----------------------|------------------|---------------|-------------------
TensorRT                | NVIDIA deployment        | INT8, INT4            | INT8             | Yes (TRT-LLM) | NVIDIA GPUs
ONNX Runtime            | Cross-platform inference | INT8, INT4            | INT8             | Yes           | CPU, GPU, NPU
PyTorch (ao)            | Research & production    | INT8, INT4, NF4       | INT8, dynamic    | Yes (torchao) | CPU, GPU
llama.cpp / GGUF        | LLM on consumer HW       | Q2-Q8 (various)       | FP16/FP32        | Yes (primary) | CPU, Apple Silicon
HuggingFace Optimum     | Model hub integration    | INT8, INT4 (GPTQ/AWQ) | INT8             | Yes           | Various
Intel Neural Compressor | Intel hardware           | INT8, INT4            | INT8             | Yes           | Intel CPU/GPU
Qualcomm AIMET          | Mobile/edge              | INT8, INT4            | INT8, INT16      | Limited       | Snapdragon NPU

TensorRT
#

NVIDIA TensorRT is the most mature PTQ framework for GPU deployment. Its INT8 calibration pipeline is the origin of the KL-divergence calibration method discussed earlier.

TensorRT PTQ Workflow:

  ONNX Model --> TensorRT Builder --> Calibration --> INT8 Engine
                      |                    |
                      v                    v
                 Layer fusion          KL-divergence
                 (Conv+BN+ReLU)        calibrator
                 Kernel autotuning     (128-1024 images)

Key features:

  • Automatic layer fusion and kernel selection
  • Dynamic shape support with per-profile calibration
  • Mixed-precision with layer-level control
  • Sparse tensor core support (2:4 structured sparsity + INT8)

PyTorch Native Quantization (torchao)
#

PyTorch’s quantization stack has evolved significantly. The modern approach uses torchao (Torch Architecture Optimization):

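from torchao.quantization import (
    quantize_,
    int4_weight_only,
    int8_dynamic_activation_int8_weight,
)

# `model` is an already-loaded torch.nn.Module; quantize_ modifies it in place,
# so apply only ONE of the configurations below to a given model.
# (Newer torchao releases expose the same configs as Int4WeightOnlyConfig and
# Int8DynamicActivationInt8WeightConfig.)

# Weight-only INT4 quantization (for LLMs)
quantize_(model, int4_weight_only())

# Dynamic INT8 quantization (weights static, activations dynamic)
quantize_(model, int8_dynamic_activation_int8_weight())

# Static INT8 quantization (both calibrated): the entry point and its calibration
# arguments differ between torchao versions, so check the documentation of the
# installed release before relying on a specific call.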

llama.cpp and GGUF Format
#

For LLM deployment on consumer hardware, llama.cpp and its GGUF format have become the de facto standard. GGUF supports numerous quantization types:

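GGUF Type | Bits/Weight | Method                 | Typical Use
----------|-------------|------------------------|------------------------------
Q2_K      | ~2.6        | K-quant mixed 2/3-bit  | Maximum compression
Q3_K_M    | ~3.3        | K-quant mixed 3/4-bit  | Small models, constrained RAM
Q4_0      | 4.5         | RTN, block size 32     | Legacy, fast
Q4_K_M    | ~4.6        | K-quant mixed 4/5-bit  | Best 4-bit quality
Q5_K_M    | ~5.5        | K-quant mixed 5/6-bit  | Near-lossless for most models
Q6_K      | 6.6         | K-quant 6-bit          | High quality
Q8_0      | 8.5         | RTN, block size 32     | Baseline, nearly lossless

Bits/weight includes the per-block scale metadata, which is why Q4_0 and Q8_0 cost 4.5 and 8.5 bits rather than 4 and 8.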

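The “K-quant” variants organize weights into super-blocks of 256, each split into smaller sub-blocks. Every sub-block has its own scale (and, for some types, its own minimum), and these per-sub-block values are themselves stored in quantized form relative to a super-block scale. The mixture of bit-widths happens across tensors rather than inside a block: the _S/_M/_L suffixes denote presets that store the more sensitive tensors in a higher-bit K-quant type.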
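The storage cost of block-wise formats follows directly from the block layout. The sketch below is a simplified illustration and not llama.cpp's exact bit packing: `quantize_block_rtn` uses one symmetric scale per block, and the super-block figure assumes 8 sub-blocks with a 6-bit scale and 6-bit minimum each plus two FP16 super-block factors.

```python
def bits_per_weight(n_weights, weight_bits, metadata_bits):
    """Effective storage cost of one block, including its scale metadata."""
    return (n_weights * weight_bits + metadata_bits) / n_weights

def quantize_block_rtn(block, bits=4):
    """Simplified symmetric round-to-nearest for one block with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in block)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

# Deterministic toy weights for one block of 32 values.
block = [((i * 37) % 64 - 32) / 40 for i in range(32)]
scale, codes = quantize_block_rtn(block)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))

legacy_4bit = bits_per_weight(32, 4, 16)                  # 4-bit codes + one FP16 scale
legacy_8bit = bits_per_weight(32, 8, 16)                  # 8-bit codes + one FP16 scale
super_block_4bit = bits_per_weight(256, 4, 8 * 12 + 32)   # assumed super-block layout

print(legacy_4bit, legacy_8bit, super_block_4bit)
print("max reconstruction error:", max_err, "half step:", scale / 2)
```

The per-block scale is why a nominally 4-bit format with 32-weight blocks costs more than 4 bits per weight; larger super-blocks with compactly stored sub-block scales amortize that overhead while keeping scales local.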

HuggingFace Optimum and AutoGPTQ/AutoAWQ
#

The HuggingFace ecosystem provides the simplest path from a pre-trained model to a quantized deployment:

from awq import AutoAWQForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-quantized GPTQ model (a GPTQ backend must be installed)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# Or quantize your own model with AWQ
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)  # used to build calibration data
model = AutoAWQForCausalLM.from_pretrained(model_id)
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "w_bit": 4, "q_group_size": 128, "version": "GEMM"}
)

Framework Selection Guide
#

What is your model type?
|
|-- CNN/Vision Model
|     |-- Target: NVIDIA GPU --> TensorRT
|     |-- Target: Mobile/Edge --> Qualcomm AIMET or TFLite
|     |-- Target: Intel CPU --> Intel Neural Compressor
|     |-- Target: General --> ONNX Runtime
|
|-- Large Language Model
|     |-- Serving at scale --> TensorRT-LLM or vLLM
|     |-- Consumer GPU (NVIDIA) --> AutoGPTQ or AutoAWQ
|     |-- Consumer CPU/Apple Silicon --> llama.cpp (GGUF)
|     |-- Research/Experimentation --> torchao
|
|-- Other (Audio, Multimodal, etc.)
      |-- General purpose --> ONNX Runtime or PyTorch

Evaluation and Debugging
#

Key Metrics
#

Evaluating a quantized model requires measuring both accuracy and efficiency:

Accuracy metrics:

Metric                    | Applicable To          | What It Measures
--------------------------|------------------------|------------------------------
Top-1/Top-5 accuracy drop | Classification         | Overall prediction quality
Perplexity increase       | Language models        | Token prediction quality
mAP/mIoU drop             | Detection/Segmentation | Localization + classification
BLEU/ROUGE drop           | Generation             | Output text quality
Cosine similarity         | Embeddings             | Representation fidelity
Layer-wise SQNR           | Any                    | Per-layer quantization noise

Layer-wise Signal-to-Quantization-Noise Ratio (SQNR):

$$\text{SQNR}_l = 10 \log_{10} \frac{\|\mathbf{W}_l\|_F^2}{\|\mathbf{W}_l - \hat{\mathbf{W}}_l\|_F^2} \quad \text{(dB)}$$

A healthy quantized model typically has SQNR > 20 dB for all layers at INT8. Layers below 15 dB are candidates for higher precision.
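Computing the metric takes only a few lines. The sketch below assumes symmetric round-to-nearest with a single scale per tensor and uses a synthetic weight vector as a stand-in for a real layer; `fake_quantize` is an illustrative helper, not a library function.

```python
import math

def sqnr_db(w, w_hat):
    """10 * log10(||w||^2 / ||w - w_hat||^2) in dB; infinite if there is no error."""
    signal = sum(x * x for x in w)
    noise = sum((x - y) ** 2 for x, y in zip(w, w_hat))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)

def fake_quantize(w, bits):
    """Quantize and immediately dequantize with one symmetric per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in w]

weights = [math.sin(i) for i in range(64)]   # synthetic stand-in for a layer's weights
report = {bits: sqnr_db(weights, fake_quantize(weights, bits)) for bits in (8, 4)}

for bits, db in report.items():
    print(f"INT{bits}: {db:.1f} dB")
```

Running the same function over every layer of a model and sorting by the result gives a quick ranking of which layers to inspect first.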

Efficiency metrics:

Metric               | Description
---------------------|---------------------------------------
Model size reduction | Compressed size vs FP32
Inference latency    | Wall-clock time per sample
Throughput           | Samples per second
Memory footprint     | Peak GPU/CPU memory during inference
Energy consumption   | Joules per inference (edge devices)

Sensitivity Analysis Workflow
#

A systematic debugging workflow when PTQ accuracy is unsatisfactory:

PTQ Accuracy Debugging Flowchart:

1. Measure overall accuracy drop
   |
   |-- Drop < 1% --> SHIP IT
   |
   |-- Drop 1-3% --> Layer-wise analysis
   |     |
   |     v
   |   2. Compute per-layer SQNR
   |     |
   |     v
   |   3. Identify low-SQNR layers
   |     |
   |     |-- Few layers bad --> Mixed precision (keep those FP16)
   |     |-- Many layers bad --> Better calibration method
   |     |                       (MSE or KL instead of MinMax)
   |     |
   |     v
   |   4. Re-evaluate
   |
   |-- Drop > 3% --> Advanced techniques needed
         |
         v
       5. Try advanced PTQ:
         |-- AdaRound / BRECQ for CNNs
         |-- GPTQ / AWQ for LLMs
         |-- OmniQuant for extreme compression
         |
         v
       6. Still failing?
         |-- Mixed-precision with sensitivity analysis
         |-- Consider QAT
         |-- Increase bit-width (INT4 -> INT6 or INT8)

Common Failure Modes and Solutions
#

1. Activation outliers destroying per-tensor quantization

  • Symptom: Large accuracy drop even at INT8; a few layers have very low SQNR.
  • Diagnosis: Check activation histograms for extreme outliers.
  • Solution: Use per-channel activation quantization, SmoothQuant, or dynamic quantization.

2. First and last layers are overly sensitive

  • Symptom: Accuracy recovers significantly when first/last layers are kept in FP16.
  • Diagnosis: These layers often have wider value ranges and direct impact on input/output.
  • Solution: Keep first and last layers in FP16 (common practice, negligible overhead).

3. Depthwise separable convolutions quantize poorly

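  • Symptom: Large accuracy drop in MobileNet-like architectures at INT8.
  • Diagnosis: Each depthwise channel has its own small filter, and weight ranges can differ by orders of magnitude between channels, so a single per-tensor scale leaves the narrow channels with almost no resolution.
  • Solution: Use per-channel weight quantization with careful calibration, apply cross-layer weight equalization, or keep depthwise layers at higher precision.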

4. Attention layers with softmax produce narrow distributions

  • Symptom: Self-attention outputs have very small dynamic range after softmax, leading to poor quantization utilization.
  • Diagnosis: Softmax outputs are in [0,1] and often concentrated near 0, wasting quantization levels.
  • Solution: Use asymmetric quantization for post-softmax activations, or keep attention in higher precision.

5. Accumulated error across many layers

  • Symptom: Individual layers look fine, but end-to-end accuracy drops significantly.
  • Diagnosis: Small per-layer errors compound through the network.
  • Solution: Block-wise reconstruction (BRECQ), or end-to-end fine-tuning of quantization parameters.

6. Calibration data mismatch

  • Symptom: Good accuracy on calibration domain, poor on deployment domain.
  • Diagnosis: Activation ranges differ between calibration and deployment distributions.
  • Solution: Use dynamic quantization for activations, or ensure calibration data matches deployment distribution.


Summary
#

Key Takeaways
#

  1. PTQ is the first line of defense for model deployment. Always try PTQ before QAT.

  2. INT8 PTQ is largely solved: With proper calibration (MSE or KL-divergence) and per-channel weight quantization, most models lose less than 1% accuracy at INT8.

  3. INT4 PTQ requires advanced methods: Plain RTN typically loses substantial accuracy at 4 bits; methods like GPTQ, AWQ, and AdaRound are needed to recover it.

  4. LLMs have unique challenges: Systematic activation outliers require specialized approaches (SmoothQuant, rotation-based methods) rather than generic PTQ.

  5. Calibration method matters more than you think: The difference between MinMax and MSE-optimal calibration can be the difference between a usable and unusable model.

  6. Mixed precision is your safety net: When uniform quantization fails, keeping a few sensitive layers at higher precision often recovers most of the accuracy.

  7. The ecosystem is mature: Tools like TensorRT, llama.cpp, and the HuggingFace ecosystem make PTQ accessible without deep expertise.

Decision Flowchart
#

START: You have a trained model to deploy
  |
  v
Is it a Large Language Model (>1B params)?
  |
  |-- YES --> Choose by target precision
  |     |
  |     |-- Need W8A8? --> SmoothQuant + per-channel calibration
  |     |
  |     |-- Need W4A16? --> GPTQ or AWQ
  |     |     |
  |     |     |-- Serving at scale? --> AWQ (faster inference)
  |     |     |-- Maximum accuracy? --> GPTQ (slightly better quality)
  |     |     |-- Consumer hardware? --> llama.cpp GGUF Q4_K_M
  |     |
  |     |-- Need W3 or lower? --> OmniQuant or QuIP#
  |
  |-- NO --> Standard PTQ pipeline
        |
        |-- Step 1: BN folding + weight equalization
        |-- Step 2: Calibrate with MSE or KL-divergence (512 samples)
        |-- Step 3: Evaluate INT8 accuracy
        |     |
        |     |-- Acceptable --> Deploy
        |     |-- Not acceptable -->
        |           |-- Try per-channel activations
        |           |-- Try AdaRound
        |           |-- Try BRECQ
        |           |-- Sensitivity analysis + mixed precision
        |           |-- If all fail --> QAT
        |
        |-- Step 4: Optimize for target hardware
              |-- Measure actual latency (not just model size)
              |-- Profile memory usage
              |-- Verify numerical correctness on edge cases

The Quantization Landscape in 2026
#

The field continues to advance rapidly. Key trends:

  • Sub-4-bit quantization is becoming practical for LLMs, with methods achieving reasonable quality at 2-3 bits per weight.
  • Hardware support is catching up: INT4 tensor cores, variable-precision accelerators, and lookup-table-based computation are making low-bit inference efficient.
  • Quantization-aware architectures: Models are being designed with quantization in mind from the start, featuring smoother activation distributions and more quantization-friendly attention mechanisms.
  • Rotation and transformation-based methods (SpinQuant, QuIP#) represent a paradigm shift from simply choosing clipping thresholds to actively reshaping distributions.

PTQ remains the most practical path from a trained model to an efficient deployment, and understanding its principles, limitations, and state-of-the-art methods is essential for anyone working in AI deployment and edge computing.


References:

  • Nagel et al., “Data-Free Quantization Through Weight Equalization and Bias Correction,” ICCV 2019
  • Banner et al., “Post Training 4-bit Quantization of Convolutional Networks for Rapid-Deployment,” NeurIPS 2019 (ACIQ)
  • Nagel et al., “Up or Down? Adaptive Rounding for Post-Training Quantization,” ICML 2020 (AdaRound)
  • Li et al., “BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction,” ICLR 2021
  • Wei et al., “QDrop: Randomly Dropping Quantization for Extremely Low-Bit Post-Training Quantization,” ICLR 2022
  • Dettmers et al., “GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale,” NeurIPS 2022
  • Xiao et al., “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models,” ICML 2023
  • Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers,” ICLR 2023
  • Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” MLSys 2024
  • Shao et al., “OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models,” ICLR 2024
  • Liu et al., “SpinQuant: LLM Quantization with Learned Rotations,” arXiv 2024
  • Dong et al., “HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision,” ICCV 2019