Table of Contents
Overview#
A GPU is not a faster CPU. It is a fundamentally different machine designed to solve a fundamentally different problem. Where a CPU excels at running a few complex tasks with low latency, a GPU excels at running thousands of simple tasks simultaneously with high throughput.
This architectural difference is not accidental. It follows directly from the workloads each processor was designed for. Understanding GPU architecture from the ground up — how the hardware is organized, how threads execute, how memory is structured — is essential for writing efficient GPU code and understanding why deep learning runs on GPUs.
CPU vs GPU: Different Problems, Different Designs#
The Design Trade-off#
A transistor budget is finite. CPU designers spend most transistors on structures that make a single thread fast: large caches, branch predictors, out-of-order execution engines. GPU designers spend most transistors on more execution units, accepting that each individual unit is simpler and slower.
CPU Die Area (conceptual):
┌──────────────────────────────────────────┐
│ │
│ ┌───────────────────┐ ┌─────────┐ │
│ │ Control & │ │ │ │
│ │ Branch Predictor │ │ Cache │ │
│ │ (large) │ │ (large) │ │
│ └───────────────────┘ │ │ │
│ ┌──────────┐ ┌──────┐ │ │ │
│ │ Core 0 │ │Core 1│ │ │ │
│ │ (complex)│ │ │ └─────────┘ │
│ └──────────┘ └──────┘ │
│ ┌──────────┐ ┌──────┐ │
│ │ Core 2 │ │Core 3│ │
│ └──────────┘ └──────┘ │
└──────────────────────────────────────────┘
GPU Die Area (conceptual):
┌──────────────────────────────────────────┐
│ ┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐ │
│ │SM││SM││SM││SM││SM││SM││SM││SM││SM│ │
│ └──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘ │
│ ┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐ │
│ │SM││SM││SM││SM││SM││SM││SM││SM││SM│ │
│ └──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘ │
│ ┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐ │
│ │SM││SM││SM││SM││SM││SM││SM││SM││SM│ │
│ └──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘ │
│ ... (many SMs) │
│ ┌─────────────────────────────────────┐ │
│ │ Small cache / Shared Mem │ │
│ └─────────────────────────────────────┘ │
└──────────────────────────────────────────┘Side-by-Side Comparison#
| Aspect | CPU | GPU |
|---|---|---|
| Core count | 4–64 large cores | Thousands of small cores |
| Clock speed | 4–6 GHz | 1.5–2.5 GHz |
| Cache size | Large (tens of MB) | Small (few MB per SM) |
| Branch prediction | Sophisticated (TAGE, BTB) | Very simple or none |
| Out-of-order execution | Yes (ROB, reservation stations) | No (in-order) |
| Latency hiding | Cache + speculation | Massive thread switching |
| Optimal workload | Sequential, branch-heavy | Parallel, data-parallel |
| Transistor priority | Make one thread fast | Make thousands of threads run |
The key question is: how does the GPU hide memory latency without caches or out-of-order execution? The answer is thread-level parallelism, and understanding that requires understanding the GPU’s hardware organization.
GPU Hardware Organization#
This section uses NVIDIA terminology, as it is the most widely documented. AMD’s architecture is structurally similar (with different names).
Hierarchical Structure#
A GPU is organized in a hierarchy: the full chip contains multiple GPCs (Graphics Processing Clusters), each GPC contains multiple SMs (Streaming Multiprocessors), and each SM contains the actual execution units.
┌──────────────────────────────────────────────────────┐
│ GPU Chip │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ GPC 0 │ │ GPC 1 │ │ GPC N │ │
│ │ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ │ │
│ │ │SM0│ │SM1│ │ │ │SM4│ │SM5│ │ │ │SMk│ │...│ │ │
│ │ └───┘ └───┘ │ │ └───┘ └───┘ │ │ └───┘ └───┘ │ │
│ │ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ │ │
│ │ │SM2│ │SM3│ │ │ │SM6│ │SM7│ │ │ │...│ │...│ │ │
│ │ └───┘ └───┘ │ │ └───┘ └───┘ │ │ └───┘ └───┘ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ L2 Cache (shared) │ │
│ └──────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────┐ │
│ │ VRAM (GDDR6X or HBM) │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘Real GPU Specifications#
| GPU | Generation | SMs | CUDA Cores | L2 Cache | VRAM | Bandwidth | TDP |
|---|---|---|---|---|---|---|---|
| RTX 3090 | Ampere | 82 | 10,496 | 6 MB | 24 GB GDDR6X | 936 GB/s | 350W |
| RTX 4090 | Ada Lovelace | 128 | 16,384 | 72 MB | 24 GB GDDR6X | 1,008 GB/s | 450W |
| A100 | Ampere | 108 | 6,912 | 40 MB | 80 GB HBM2e | 2,039 GB/s | 400W |
| H100 | Hopper | 132 | 16,896 | 50 MB | 80 GB HBM3 | 3,350 GB/s | 700W |
Notice the enormous difference in core counts compared to CPUs (thousands vs. tens).
Streaming Multiprocessor (SM): The Core Building Block#
The SM is the fundamental compute unit of a GPU. Understanding the SM is the key to understanding GPU performance.
SM Internal Architecture#
Each SM contains multiple groups of execution units, shared memory, caches, and warp schedulers. Here is the layout for a modern (Ada Lovelace generation) SM:
┌────────────────────────────────────────────────────┐
│ Streaming Multiprocessor (SM) │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Warp Schedulers × 4 │ │
│ │ (Each can issue 1 instruction per cycle │ │
│ │ to its own set of execution units) │ │
│ └──────────┬───────┬───────┬───────┬───────────┘ │
│ │ │ │ │ │
│ ┌──────────▼───────▼───────▼───────▼───────────┐ │
│ │ Execution Units │ │
│ │ │ │
│ │ FP32 Units × 128 (single-precision float) │ │
│ │ INT32 Units × 128 (integer) │ │
│ │ FP64 Units × 2 (double-precision) │ │
│ │ Tensor Cores × 4 (matrix multiply) │ │
│ │ Load/Store Units × 32 │ │
│ │ SFU × 16 (sin, cos, exp, rsqrt) │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ Register File │ │ Shared Memory / │ │
│ │ (256 KB) │ │ L1 Cache (128 KB) │ │
│ └─────────────────┘ └──────────────────────┘ │
└────────────────────────────────────────────────────┘What Each Component Does#
| Component | Role | Why it matters |
|---|---|---|
| Warp Scheduler | Picks a ready warp and issues its next instruction | 4 schedulers = 4 warps can issue per cycle |
| FP32 Units | Single-precision floating-point arithmetic | The main workhorse for most GPU compute |
| INT32 Units | Integer arithmetic, address calculation | Can execute in parallel with FP32 on some architectures |
| FP64 Units | Double-precision floating-point | Important for scientific computing, very few per SM |
| Tensor Cores | Hardware matrix multiply-accumulate | Dramatically accelerate deep learning (10–20× over FP32 cores) |
| LD/ST Units | Load and store data from/to memory | Memory access is often the bottleneck |
| SFU | Transcendental functions (sin, cos, exp, etc.) | Slow but necessary for certain computations |
| Register File | Per-thread fast storage | 256 KB — much larger than a CPU’s register file |
| Shared Memory | Programmer-managed on-chip memory, shared within a thread block | Critical for reducing global memory access |
The register file deserves special attention. At 256 KB per SM, it is larger than most CPU L1 caches. This is because the GPU needs to hold state for thousands of threads simultaneously. Each thread gets its own set of registers (up to 255 per thread).
SIMT Execution Model#
The GPU’s execution model is called SIMT: Single Instruction, Multiple Threads. It is similar to SIMD (Single Instruction, Multiple Data) on CPUs, but with important differences.
What Is a Warp?#
A warp is a group of 32 threads that execute the same instruction at the same time on NVIDIA GPUs. (On AMD GPUs, the equivalent is called a wavefront of 64 threads.)
When you launch a CUDA kernel with 256 threads in a block, the hardware organizes them into warps:
Thread Block (256 threads):
├── Warp 0: Thread 0 – Thread 31
├── Warp 1: Thread 32 – Thread 63
├── Warp 2: Thread 64 – Thread 95
├── Warp 3: Thread 96 – Thread 127
├── Warp 4: Thread 128 – Thread 159
├── Warp 5: Thread 160 – Thread 191
├── Warp 6: Thread 192 – Thread 223
└── Warp 7: Thread 224 – Thread 255How a Warp Executes#
All 32 threads in a warp execute in lockstep — the same instruction, at the same time, on different data.
Warp 0 execution (32 threads in lockstep):
Cycle 1: ALL 32 threads execute: ADD R1, R2, R3
Thread 0: R1 = R2 + R3 (with thread 0's data)
Thread 1: R1 = R2 + R3 (with thread 1's data)
...
Thread 31: R1 = R2 + R3 (with thread 31's data)
Cycle 2: ALL 32 threads execute: MUL R4, R1, R5
Cycle 3: ALL 32 threads execute: ST [R6], R4This is extremely efficient: one instruction fetch and decode serves 32 threads. The hardware cost is shared across all threads in the warp.
SIMD vs SIMT#
| Aspect | CPU SIMD | GPU SIMT |
|---|---|---|
| Vector width | Explicit (128/256/512 bits) | Implicit (warp of 32 threads) |
| Programming | Vector intrinsics or auto-vectorization | Write scalar code per thread |
| Branch handling | Mask register | Automatic predication (divergence) |
| Thread identity | No per-lane identity | Each thread has unique threadIdx |
The programming model difference is significant. With CPU SIMD (AVX-512), you explicitly pack data into 512-bit vectors and use special vector instructions. With GPU SIMT, you write code as if it runs on a single thread, and the hardware maps 32 threads onto the execution units automatically. This makes GPU programming more intuitive, but you need to be aware of warp divergence.
Warp Divergence: The Performance Trap#
When threads within the same warp take different branches, the warp must execute both paths sequentially, with some threads disabled on each path. This is called warp divergence.
Example code:
if (threadIdx.x < 16) {
a[idx] = x + 1; // Path A
} else {
a[idx] = x * 2; // Path B
}Step-by-step execution:
Step 1: Evaluate condition for all 32 threads
Thread 0: threadIdx.x = 0 → condition TRUE (Path A)
Thread 1: threadIdx.x = 1 → condition TRUE (Path A)
...
Thread 15: threadIdx.x = 15 → condition TRUE (Path A)
Thread 16: threadIdx.x = 16 → condition FALSE (Path B)
...
Thread 31: threadIdx.x = 31 → condition FALSE (Path B)
Step 2: Execute Path A (threads 0-15 active, threads 16-31 MASKED)
Thread 0: a[0] = x + 1 ← executes
Thread 1: a[1] = x + 1 ← executes
...
Thread 15: a[15] = x + 1 ← executes
Thread 16: — (idle) ← masked, wastes a lane
...
Thread 31: — (idle) ← masked, wastes a lane
Step 3: Execute Path B (threads 0-15 MASKED, threads 16-31 active)
Thread 0: — (idle)
...
Thread 15: — (idle)
Thread 16: a[16] = x * 2 ← executes
...
Thread 31: a[31] = x * 2 ← executes
Step 4: Reconverge — all threads active againResult: The if-else block takes 2× the time because both paths execute sequentially. Half the execution units are idle during each path.
$$ \text{SIMT Efficiency} = \frac{\text{Active threads per instruction}}{\text{Warp size (32)}} $$In this example, efficiency = 16/32 = 50%.
Key optimization rule: Design your code so that threads within the same warp take the same branch. Divergence between warps is free (different warps are independent), but divergence within a warp is costly.
Latency Hiding Through Warp Scheduling#
This is the GPU’s most important trick. A CPU hides memory latency with large caches and speculative execution. A GPU hides it by switching to a different warp — instantly, at zero cost.
How It Works Step by Step#
Suppose an SM has 32 warps assigned to it. At any moment, some warps are ready to execute and others are waiting for memory.
Warp states on an SM:
Warp 0: READY ← can execute next instruction
Warp 1: MEM_WAIT ← waiting for global memory load (400 cycles)
Warp 2: READY
Warp 3: MEM_WAIT
Warp 4: READY
Warp 5: READY
...
Warp 31: MEM_WAITThe warp scheduler picks a READY warp every cycle:
Cycle 1: Schedule Warp 0 → issues ADD instruction
Cycle 2: Warp 0 issues LD (memory request) → now MEM_WAIT
Schedule Warp 2 → issues MUL instruction
Cycle 3: Schedule Warp 4 → issues ADD instruction
Cycle 4: Schedule Warp 5 → issues SUB instruction
...
Cycle 400: Warp 0's memory data arrives → READY again
Cycle 401: Schedule Warp 0 → continues executionCritical point: Switching from one warp to another takes zero cycles. This is because every warp’s register state is always resident in the register file. There is no context switch — the scheduler just points to a different set of registers.
CPU approach to latency:
Thread runs → Cache miss (200 cycles) → STALL → Data arrives → Resume
^^^^^^
Wasted cycles (or OS context switch, ~1000+ cycles)
GPU approach to latency:
Warp 0 runs → Memory request → Switch to Warp 2 (0 cycles!) → Warp 2 runs
→ Warp 4 runs → ... → Warp 0's data arrives → Warp 0 continues
(No cycles wasted — other warps filled the gap)Occupancy#
Occupancy measures how many warps are resident on an SM relative to the maximum:
$$ \text{Occupancy} = \frac{\text{Active warps on SM}}{\text{Maximum warps supported by SM}} $$Higher occupancy means more warps available to hide latency. But occupancy is limited by per-thread resource usage:
Step by step — how occupancy is determined:
- An SM supports a maximum of, say, 48 warps (1,536 threads).
- The SM has 256 KB of registers and 128 KB of shared memory.
- Your kernel uses 64 registers per thread and 48 KB of shared memory per block.
- Register limit: 256 KB ÷ (64 regs × 4 bytes × 32 threads/warp) = 32 warps max.
- Shared memory limit: 128 KB ÷ 48 KB = 2 blocks max. If each block has 256 threads (8 warps), that is 16 warps.
- The binding constraint is shared memory: occupancy = 16/48 = 33%.
| Occupancy | Latency hiding ability | Resource flexibility |
|---|---|---|
| 100% | Maximum | Very constrained registers/shared mem |
| 50% | Usually sufficient | Moderate |
| 25% | May be insufficient | Very flexible per-thread |
In practice, 50% occupancy is often enough because each warp issues multiple independent instructions, providing latency hiding even with fewer warps. But if your kernel is very memory-bound, higher occupancy helps significantly.
GPU Memory Hierarchy#
Memory access is the dominant bottleneck in most GPU workloads. Understanding the memory hierarchy is essential for writing fast GPU code.
The Complete Memory Map#
Speed ┌──────────────────────────────────────────┐
▲ │ Per-Thread: Registers │
│ │ Up to 255 registers per thread │
│ │ Access: 0 cycles (operand read) │
│ │ Size: 256 KB per SM total │
│ ├──────────────────────────────────────────┤
│ │ Per-Block: Shared Memory │
│ │ Shared among all threads in a block │
│ │ Programmer-managed (explicit load/store)│
│ │ Access: ~20-30 cycles │
│ │ Size: up to 228 KB per SM │
│ ├──────────────────────────────────────────┤
│ │ Per-SM: L1 Cache │
│ │ Hardware-managed, automatic │
│ │ Access: ~30 cycles │
│ │ Size: configurable, shares SRAM w/ smem │
│ ├──────────────────────────────────────────┤
│ │ Chip-wide: L2 Cache │
│ │ Shared across all SMs │
│ │ Access: ~200 cycles │
│ │ Size: 6-72 MB │
│ ├──────────────────────────────────────────┤
│ │ Off-chip: Global Memory (VRAM) │
│ │ GDDR6X or HBM │
│ │ Access: ~400-600 cycles │
│ │ Size: 24-80 GB │
▼ └──────────────────────────────────────────┘
SpeedRegisters: The Fastest Storage#
Each thread can use up to 255 registers. Accessing a register takes 0 extra cycles — it is directly wired into the execution unit. But registers are a limited, shared resource. The more registers each thread uses, the fewer threads (warps) can be resident on the SM.
Trade-off example:
- 32 regs/thread → 48 warps fit → high occupancy, good latency hiding
- 128 regs/thread → 16 warps fit → lower occupancy, but each thread computes faster
- 255 regs/thread → 8 warps fit → risk of poor latency hiding
If a thread needs more than 255 registers, the compiler spills extras to local memory (actually global memory, very slow). Register spilling is a major performance killer.
Shared Memory: The Programmer-Managed Cache#
Shared memory is a block of fast on-chip SRAM that is shared among all threads in a thread block and managed explicitly by the programmer. It physically shares the same SRAM as the L1 cache, and you can configure the split.
Configurable SRAM split (Ampere/Ada):
Option A: Shared 128 KB + L1 0 KB
Option B: Shared 100 KB + L1 28 KB
Option C: Shared 64 KB + L1 64 KB
Option D: Shared 28 KB + L1 100 KBWhy use shared memory? Consider a scenario where multiple threads in a block need the same data:
Without shared memory:
Thread 0: Load A[0] from global memory — 400 cycles
Thread 1: Load A[0] from global memory — 400 cycles (same data!)
Thread 2: Load A[0] from global memory — 400 cycles (same data!)
...
Total wasted bandwidth: enormous
With shared memory:
Step 1: Thread 0 loads A[0] from global → shared memory (400 cycles, once)
Step 2: __syncthreads() (barrier — ensure all threads see the data)
Step 3: All threads read A[0] from shared memory (~20 cycles each)
Total: 400 + 20 per access — much cheaper for repeated accessStep-by-step shared memory usage pattern (tiling):
- Each thread cooperatively loads a tile of data from global memory into shared memory.
- Call
__syncthreads()to ensure all loads complete. - All threads compute using the data in shared memory (fast).
- Call
__syncthreads()again before loading the next tile. - Repeat until the full computation is done.
This pattern is fundamental in matrix multiplication, convolution, and many other GPU algorithms.
Bank Conflicts#
Shared memory is divided into 32 banks (one per warp lane). If multiple threads access different addresses that map to the same bank, the accesses are serialized.
Step by step — how bank mapping works:
Bank assignment (32 banks, 4 bytes per bank per row):
Address 0 → Bank 0
Address 4 → Bank 1
Address 8 → Bank 2
...
Address 124 → Bank 31
Address 128 → Bank 0 (wraps around)
Address 132 → Bank 1
...No conflict — all threads access different banks:
Thread 0 → Address 0 → Bank 0
Thread 1 → Address 4 → Bank 1
Thread 2 → Address 8 → Bank 2
...
Thread 31 → Address 124 → Bank 31
→ All 32 accesses happen simultaneously (1 cycle)2-way conflict — two threads hit the same bank:
Thread 0 → Address 0 → Bank 0
Thread 1 → Address 128 → Bank 0 ← same bank!
Thread 2 → Address 8 → Bank 2
...
→ Bank 0 must serve two requests sequentially (2 cycles for those threads)Broadcast — all threads access the same address:
Thread 0 → Address 0 → Bank 0
Thread 1 → Address 0 → Bank 0 (same address!)
Thread 2 → Address 0 → Bank 0 (same address!)
...
→ Hardware broadcasts — treated as 1 access (no conflict)Memory Coalescing: Getting Data Efficiently from Global Memory#
When a warp executes a load instruction, all 32 threads issue memory requests. The hardware tries to coalesce (merge) these into as few memory transactions as possible.
Coalesced access (ideal):
Thread 0 → Address 0x1000
Thread 1 → Address 0x1004
Thread 2 → Address 0x1008
...
Thread 31 → Address 0x107C
→ Addresses are consecutive → merged into ONE 128-byte transaction
→ 128 bytes transferred, 128 bytes useful = 100% efficiencyStrided access (wasteful):
Thread 0 → Address 0x1000
Thread 1 → Address 0x1100 (stride = 256 bytes)
Thread 2 → Address 0x1200
...
Thread 31 → Address 0x2F00
→ Addresses span many cache lines → up to 32 SEPARATE transactions
→ 32 × 128 bytes transferred, only 32 × 4 = 128 bytes useful
→ Efficiency: 128 / 4096 = 3.1%Random access (worst case):
Thread 0 → Address 0xA000
Thread 1 → Address 0x3400
Thread 2 → Address 0x7800
...
→ Each thread hits a different cache line → 32 transactions
→ Same 3% efficiencyOptimization rule: Structure your data so that consecutive threads (within a warp) access consecutive memory addresses. This is called a coalesced access pattern and is one of the most important GPU performance optimizations.
Tensor Cores: Hardware for Matrix Multiply#
Starting with the Volta architecture (2017), NVIDIA GPUs include Tensor Cores — specialized hardware units that perform small matrix multiply-accumulate (MMA) operations in a single cycle.
What Tensor Cores Do#
A single Tensor Core computes:
$$ D = A \times B + C $$Where \(A\), \(B\), \(C\), and \(D\) are small matrices (e.g., 16×16 for FP16).
Comparison of throughput:
Regular FP32 CUDA Cores:
1 core performs 1 multiply + 1 add = 2 FLOP per cycle
128 cores per SM → 256 FLOP/cycle/SM
4th Gen Tensor Core (Hopper, FP16):
One MMA operation: 16 × 16 × 16 × 2 = 8,192 FLOP
4 Tensor Cores per SM → ~32,768 FLOP/cycle/SM
→ Tensor Cores are ~128× more throughput for matrix opsSupported Data Types#
Different precisions trade off accuracy for speed:
| Data Type | Bits | Use Case | Relative Speed |
|---|---|---|---|
| FP64 | 64 | Scientific computing | 1× (baseline) |
| TF32 | 19 | Training (drop-in FP32 replacement) | ~8× |
| FP16 | 16 | Training and inference | ~16× |
| BF16 | 16 | Training (wider range than FP16) | ~16× |
| INT8 | 8 | Inference | ~32× |
| FP8 (E4M3/E5M2) | 8 | Inference (Hopper+) | ~32× |
| FP4 | 4 | Inference (Blackwell) | ~64× |
FP16 vs BF16: Why Two 16-bit Formats?#
FP32: [1 sign] [8 exponent] [23 mantissa]
Range: ±3.4 × 10³⁸, Precision: ~7 decimal digits
FP16: [1 sign] [5 exponent] [10 mantissa]
Range: ±65,504, Precision: ~3.3 decimal digits
BF16: [1 sign] [8 exponent] [7 mantissa]
Range: ±3.4 × 10³⁸, Precision: ~2.4 decimal digitsFP16 has better precision but a very narrow range. If gradients during training fall outside ±65,504, they overflow to infinity. BF16 has the same range as FP32 (same 8-bit exponent), so it rarely overflows, making training more stable despite the lower precision. This is why BF16 has become the default for large model training.
CUDA Programming Model → Hardware Mapping#
Understanding how software concepts map to hardware is essential for performance tuning.
The Mapping#
CUDA Software GPU Hardware
──────────── ────────────
Grid (all blocks) → Entire GPU
└── Block (group of threads) → Assigned to one SM
└── Thread → Runs on a CUDA Core (within a warp)Step by step — what happens when you launch a kernel:
- You specify a grid of blocks:
kernel<<<gridDim, blockDim>>>(...)- Example:
kernel<<<128, 256>>>(...)→ 128 blocks, each with 256 threads
- Example:
- The GigaThread Engine (top-level scheduler) distributes blocks to SMs.
- Each SM may receive multiple blocks (if resources allow).
- Within each block, threads are grouped into warps of 32.
- The SM’s warp schedulers manage all resident warps, issuing instructions each cycle.
Example: kernel<<<128, 256>>>
Grid: 128 blocks
Block 0 → assigned to SM 3
Block 1 → assigned to SM 7
Block 2 → assigned to SM 3 (SM 3 gets multiple blocks)
Block 3 → assigned to SM 12
...
Block 127 → assigned to SM 45
Within Block 0 on SM 3:
256 threads ÷ 32 = 8 warps
Warp 0: threads 0-31
Warp 1: threads 32-63
...
Warp 7: threads 224-255A block stays on the same SM for its entire lifetime. It cannot migrate. Threads within a block can synchronize (__syncthreads()) and share data via shared memory. Threads in different blocks cannot directly communicate during execution.
Choosing Block Size#
| Block Size | Warps/Block | Typical Use Case |
|---|---|---|
| 32 | 1 | Very simple kernels, debugging |
| 128 | 4 | Good general default |
| 256 | 8 | Most common choice |
| 512 | 16 | When more shared memory per block is needed |
| 1024 | 32 | Maximum allowed (use sparingly) |
128 or 256 threads per block works well in most cases. Too small means underutilizing the SM. Too large may cause resource pressure (registers, shared memory) that reduces occupancy.
GPU Memory Bandwidth: GDDR vs HBM#
GPU workloads are often memory-bandwidth limited, meaning the compute units are starved for data. This is why GPU memory bandwidth is so important.
GDDR6X vs HBM3#
GDDR6X (consumer GPUs, e.g., RTX 4090):
┌──────┐ ┌────┐ ┌────┐
│ GPU │──[384-bit bus]────→│GDDR│ │GDDR│ ... (12 chips on PCB)
└──────┘ └────┘ └────┘
Bus width: 384 bits
Data rate: 21 Gbps
Bandwidth: 384 × 21 / 8 = 1,008 GB/s
HBM3 (datacenter GPUs, e.g., H100):
┌─────────────────────┐
│ ┌─────┐ ┌─────┐ │
│ │ HBM │ │ HBM │ │ HBM stacks sit next to GPU die
│ │stack│ │stack│ │ on a silicon interposer
│ └─────┘ └─────┘ │
│ GPU Die │
│ ┌─────┐ ┌─────┐ │
│ │ HBM │ │ HBM │ │
│ │stack│ │stack│ │
│ └─────┘ └─────┘ │
└─────────────────────┘
Bus width: 5,120 bits (massively wide)
Data rate: 6.4 Gbps (lower, but width compensates)
Bandwidth: 5,120 × 6.4 / 8 = 3,350 GB/s| Aspect | GDDR6X | HBM3 |
|---|---|---|
| Bandwidth | ~1 TB/s | ~3.4 TB/s |
| Power efficiency | Moderate | High (short wires on interposer) |
| Capacity | 24 GB typical | 80 GB typical |
| Cost | Lower | Much higher |
| Physical design | Chips on PCB edge | Stacks on silicon interposer |
| Typical use | Gaming, workstations | Data center, AI training |
HBM achieves higher bandwidth by using a very wide bus (5,120 bits vs 384 bits) at a lower clock speed. The short wires on the silicon interposer also use less power per bit, which matters at data center scale.
Roofline Model: Understanding Performance Limits#
Every GPU kernel is limited by one of two factors: compute throughput or memory bandwidth. The Roofline Model visualizes this.
$$ \text{Attainable Performance} = \min\left(\text{Peak FLOP/s}, \quad \text{Memory Bandwidth} \times \text{Arithmetic Intensity}\right) $$$$ \text{Arithmetic Intensity (AI)} = \frac{\text{FLOPs performed}}{\text{Bytes transferred from memory}} $$Performance
(FLOP/s)
│ ╱ Peak Compute ─────────────
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱ ← Memory-bound region
│ ╱
│ ╱ Compute-bound region →
│ ╱
│ ╱
│ ╱
│ ╱
└───────────────────────────────────────
Arithmetic Intensity (FLOP/Byte)
↑
Ridge PointReading the roofline:
- Calculate your kernel’s arithmetic intensity: count FLOPs and bytes transferred.
- If AI is left of the ridge point: you are memory-bound. Optimize memory access (coalescing, shared memory, caching).
- If AI is right of the ridge point: you are compute-bound. Optimize computation (Tensor Cores, reduced precision).
Examples:
| Operation | AI (FLOP/Byte) | Typically |
|---|---|---|
| Vector addition | 0.25 | Memory-bound |
| Matrix-vector multiply | ~1 | Memory-bound |
| Matrix-matrix multiply | ~N/8 | Compute-bound (for large N) |
| Convolution (large) | ~100+ | Compute-bound |
Matrix multiplication is compute-bound because it reuses each loaded element \(O(N)\) times. Vector addition loads two elements, does one add, stores one result — almost no reuse, so it is entirely bandwidth-limited.
GPU Architecture Evolution#
| Generation | Year | Key Innovation |
|---|---|---|
| Tesla (G80) | 2006 | Unified shaders, CUDA introduced — GPUs become programmable for general compute |
| Fermi | 2010 | L1/L2 caches added, ECC memory, true IEEE FP |
| Kepler | 2012 | Dynamic parallelism (kernels launch kernels), Hyper-Q |
| Maxwell | 2014 | Major power efficiency improvement (~2× perf/watt) |
| Pascal | 2016 | HBM2 (P100), NVLink, native FP16 support |
| Volta | 2017 | Tensor Cores introduced — first hardware matrix multiply acceleration |
| Turing | 2018 | RT Cores (hardware ray tracing), INT8/INT4 inference |
| Ampere | 2020 | 3rd gen Tensor Cores, TF32, sparsity support, 80GB HBM2e |
| Hopper | 2022 | Transformer Engine, FP8, DPX instructions, 3TB/s HBM3 |
| Blackwell | 2024 | 2nd gen Transformer Engine, FP4, dual-die design, 8TB/s HBM3e |
Three inflection points:
G80 / CUDA (2006): Transformed GPUs from graphics-only to general-purpose parallel processors. Made GPU computing accessible to non-graphics programmers.
Volta / Tensor Cores (2017): Purpose-built hardware for matrix multiplication gave deep learning training a ~10× speedup over regular CUDA cores. This is when GPU = AI training hardware became firmly established.
Hopper / Transformer Engine (2022): Hardware-level support for Transformer-specific operations (attention, layer norm) with automatic FP8 precision management. Acknowledged that Transformers are the dominant AI architecture and optimized silicon for them.
Summary#
| Concept | Key Takeaway |
|---|---|
| GPU vs CPU | GPU trades single-thread speed for massive parallelism |
| SM | The fundamental compute unit — contains cores, schedulers, shared memory, register file |
| Warp (32 threads) | Executes in lockstep — same instruction, different data |
| SIMT | Write scalar code, hardware executes across 32 threads |
| Warp divergence | Different branches within a warp → both paths execute serially → wasted lanes |
| Latency hiding | Zero-cost warp switching — when one warp stalls, another runs |
| Occupancy | More resident warps = more latency hiding potential |
| Shared memory | Fast on-chip, programmer-managed — essential for data reuse |
| Bank conflicts | Same-bank accesses are serialized — design access patterns to avoid |
| Memory coalescing | Consecutive threads accessing consecutive addresses → single transaction |
| Tensor Cores | Hardware matrix multiply — 10–100× faster than regular cores for matrix ops |
| HBM | Wide bus, high bandwidth, on-interposer — enables feeding compute-hungry GPUs |
| Roofline model | Performance limited by min(compute ceiling, bandwidth × arithmetic intensity) |
The fundamental principle of GPU computing is: trade latency for throughput. A single GPU thread is much slower than a single CPU thread. But by running thousands of threads and using their collective activity to hide memory latency, the GPU achieves orders-of-magnitude higher throughput for data-parallel workloads. This is why deep learning, scientific simulation, and graphics rendering all live on GPUs.