Fast Differentiable Rasterizer for Gaussians

Table of Contents

Overview
#

The fast differentiable rasterizer is a key component enabling real-time 3D Gaussian Splatting. This GPU-based approach achieves high performance through tile-based processing and efficient sorting.

Design Goals
#

Fast overall rendering
Fast sorting for approximate α-blending
No limit on splats receiving gradients
Constant memory overhead per pixel

Tile-Based Architecture
#

Screen Division
#

Divide screen into 16×16 pixel tiles:

┌────┬────┬────┬────┐
│Tile│Tile│Tile│Tile│
│ 0  │ 1  │ 2  │ 3  │
├────┼────┼────┼────┤
│Tile│Tile│Tile│Tile│
│ 4  │ 5  │ 6  │ 7  │
├────┼────┼────┼────┤
│ ...                │

Why Tiles?
#

Benefit	Description
Parallelism	Each tile processed independently
Cache efficiency	Nearby pixels share data
GPU-friendly	Maps to thread blocks

Culling and Preprocessing
#

Frustum Culling
#

Retain only Gaussians visible in view:

$$ \text{Keep if: } \mu_{projected} + 3\sigma \text{ intersects view} $$

99% confidence interval ensures no visible Gaussians are missed.

Guard Band
#

Reject primitives near camera plane:

Avoids numerical instabilities
Handles edge cases

Sorting Strategy
#

Key Construction
#

Each Gaussian creates key combining:

Key = [Tile ID (upper bits)] | [Depth (lower bits)]

Instance Creation
#

One instance per overlapping tile:

Gaussian G overlaps tiles 5, 6, 9, 10
  → Create 4 instances with keys:
     (5, depth_G), (6, depth_G), (9, depth_G), (10, depth_G)

GPU Radix Sort
#

Single parallel sort organizes all instances:

O(n) complexity with radix sort
No per-pixel sorting needed
Approximate but fast

Forward Pass
#

Thread Block Processing
#

Each tile handled by one thread block (256 threads for 16×16):

// Pseudocode
for each Gaussian in tile (front-to-back):
    load to shared memory
    for each pixel in tile:
        compute Gaussian contribution
        accumulate color: C += α_i * T_i * c_i
        update transmittance: T *= (1 - α_i)
        if (T < threshold) mark saturated
    if (all pixels saturated) break

Early Termination
#

When all pixels in tile reach α saturation:

Stop processing remaining Gaussians
Significant speedup for dense scenes

Memory Efficiency
#

No per-pixel storage of blend lists:

Only final color accumulated
Constant memory per pixel

Backward Pass
#

Challenge
#

Need gradients for all Gaussians, but didn’t store intermediate values.

Solution: Recover from Final Values
#

Traverse Gaussians back-to-front:

$$ T_i = \frac{T_{final}}{\prod_{j=i}^{N}(1 - \alpha_j)} $$

Gradient Computation
#

For each Gaussian (reverse order):

$$ \frac{\partial L}{\partial c_i} = T_i \cdot \alpha_i \cdot \frac{\partial L}{\partial C} $$$$ \frac{\partial L}{\partial \alpha_i} = T_i \cdot c_i \cdot \frac{\partial L}{\partial C} + \text{(accumulated terms)} $$

Memory Trade-off
#

Approach	Memory	Speed
Store all	O(N×P)	Fast backward
Recompute	O(P)	Slower backward
This paper	O(P)	Fast backward

Where N = Gaussians, P = Pixels.

Key Advantages
#

No Hard Limits
#

All blended primitives receive gradients:

No arbitrary cutoff
Better optimization

Scene Agnostic
#

No hyperparameter tuning needed:

Works for dense and sparse scenes
No tile size adjustment
Automatic load balancing

Differentiability
#

Every operation is differentiable:

Sorting (discrete but gradients flow through values)
Blending
Projection

Performance
#

Typical Numbers
#

Scene	Gaussians	FPS
Indoor	500K	150+
Outdoor	1M	100+
Complex	2M+	60+

Bottlenecks
#

Sorting: O(n log n) or O(n) with radix
Rasterization: Depends on overlap
Backward pass: Similar to forward

Implementation Details
#

CUDA Considerations
#

Thread block = 16×16 = 256 threads
Shared memory for Gaussian data
Warp-level primitives for efficiency

Numerical Stability
#

Clamp α to avoid division by zero
Log-space for very small transmittance
Guard band for near-plane issues

Comparison
#

Method	Speed	Quality	Memory
Ray marching	Slow	High	High
Point splatting	Fast	Lower	Low
This rasterizer	Fast	High	Low

Overview#

Design Goals#

Tile-Based Architecture#

Screen Division#

Why Tiles?#

Culling and Preprocessing#

Frustum Culling#

Guard Band#

Sorting Strategy#

Key Construction#

Instance Creation#

GPU Radix Sort#

Forward Pass#

Thread Block Processing#

Early Termination#

Memory Efficiency#

Backward Pass#

Challenge#

Solution: Recover from Final Values#

Gradient Computation#

Memory Trade-off#

Key Advantages#

No Hard Limits#

Scene Agnostic#

Differentiability#

Performance#

Typical Numbers#

Bottlenecks#

Implementation Details#

CUDA Considerations#

Numerical Stability#

Comparison#