Table of Contents

Overview
#

A GPU is not a faster CPU. It is a fundamentally different machine designed to solve a fundamentally different problem. Where a CPU excels at running a few complex tasks with low latency, a GPU excels at running thousands of simple tasks simultaneously with high throughput.

This architectural difference is not accidental. It follows directly from the workloads each processor was designed for. Understanding GPU architecture from the ground up — how the hardware is organized, how threads execute, how memory is structured — is essential for writing efficient GPU code and understanding why deep learning runs on GPUs.

CPU vs GPU: Different Problems, Different Designs
#

The Design Trade-off
#

A transistor budget is finite. CPU designers spend most transistors on structures that make a single thread fast: large caches, branch predictors, out-of-order execution engines. GPU designers spend most transistors on more execution units, accepting that each individual unit is simpler and slower.

CPU Die Area (conceptual):
┌──────────────────────────────────────────┐
│                                          │
│   ┌───────────────────┐   ┌─────────┐   │
│   │     Control &     │   │         │   │
│   │  Branch Predictor │   │  Cache  │   │
│   │     (large)       │   │ (large) │   │
│   └───────────────────┘   │         │   │
│   ┌──────────┐  ┌──────┐ │         │   │
│   │  Core 0  │  │Core 1│ │         │   │
│   │ (complex)│  │      │ └─────────┘   │
│   └──────────┘  └──────┘               │
│   ┌──────────┐  ┌──────┐               │
│   │  Core 2  │  │Core 3│               │
│   └──────────┘  └──────┘               │
└──────────────────────────────────────────┘

GPU Die Area (conceptual):
┌──────────────────────────────────────────┐
│ ┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐  │
│ │SM││SM││SM││SM││SM││SM││SM││SM││SM│  │
│ └──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘  │
│ ┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐  │
│ │SM││SM││SM││SM││SM││SM││SM││SM││SM│  │
│ └──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘  │
│ ┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐  │
│ │SM││SM││SM││SM││SM││SM││SM││SM││SM│  │
│ └──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘  │
│            ... (many SMs)               │
│ ┌─────────────────────────────────────┐ │
│ │       Small cache / Shared Mem      │ │
│ └─────────────────────────────────────┘ │
└──────────────────────────────────────────┘

Side-by-Side Comparison
#

AspectCPUGPU
Core count4–64 large coresThousands of small cores
Clock speed4–6 GHz1.5–2.5 GHz
Cache sizeLarge (tens of MB)Small (few MB per SM)
Branch predictionSophisticated (TAGE, BTB)Very simple or none
Out-of-order executionYes (ROB, reservation stations)No (in-order)
Latency hidingCache + speculationMassive thread switching
Optimal workloadSequential, branch-heavyParallel, data-parallel
Transistor priorityMake one thread fastMake thousands of threads run

The key question is: how does the GPU hide memory latency without caches or out-of-order execution? The answer is thread-level parallelism, and understanding that requires understanding the GPU’s hardware organization.


GPU Hardware Organization
#

This section uses NVIDIA terminology, as it is the most widely documented. AMD’s architecture is structurally similar (with different names).

Hierarchical Structure
#

A GPU is organized in a hierarchy: the full chip contains multiple GPCs (Graphics Processing Clusters), each GPC contains multiple SMs (Streaming Multiprocessors), and each SM contains the actual execution units.

┌──────────────────────────────────────────────────────┐
│                        GPU Chip                      │
│                                                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │   GPC 0     │  │   GPC 1     │  │   GPC N     │  │
│  │ ┌───┐ ┌───┐ │  │ ┌───┐ ┌───┐ │  │ ┌───┐ ┌───┐ │  │
│  │ │SM0│ │SM1│ │  │ │SM4│ │SM5│ │  │ │SMk│ │...│ │  │
│  │ └───┘ └───┘ │  │ └───┘ └───┘ │  │ └───┘ └───┘ │  │
│  │ ┌───┐ ┌───┐ │  │ ┌───┐ ┌───┐ │  │ ┌───┐ ┌───┐ │  │
│  │ │SM2│ │SM3│ │  │ │SM6│ │SM7│ │  │ │...│ │...│ │  │
│  │ └───┘ └───┘ │  │ └───┘ └───┘ │  │ └───┘ └───┘ │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
│                                                      │
│  ┌──────────────────────────────────────────────┐    │
│  │              L2 Cache (shared)                │    │
│  └──────────────────────────────────────────────┘    │
│  ┌──────────────────────────────────────────────┐    │
│  │          VRAM (GDDR6X or HBM)                │    │
│  └──────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────┘

Real GPU Specifications
#

GPUGenerationSMsCUDA CoresL2 CacheVRAMBandwidthTDP
RTX 3090Ampere8210,4966 MB24 GB GDDR6X936 GB/s350W
RTX 4090Ada Lovelace12816,38472 MB24 GB GDDR6X1,008 GB/s450W
A100Ampere1086,91240 MB80 GB HBM2e2,039 GB/s400W
H100Hopper13216,89650 MB80 GB HBM33,350 GB/s700W

Notice the enormous difference in core counts compared to CPUs (thousands vs. tens).


Streaming Multiprocessor (SM): The Core Building Block
#

The SM is the fundamental compute unit of a GPU. Understanding the SM is the key to understanding GPU performance.

SM Internal Architecture
#

Each SM contains multiple groups of execution units, shared memory, caches, and warp schedulers. Here is the layout for a modern (Ada Lovelace generation) SM:

┌────────────────────────────────────────────────────┐
│              Streaming Multiprocessor (SM)          │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │          Warp Schedulers × 4                 │  │
│  │  (Each can issue 1 instruction per cycle     │  │
│  │   to its own set of execution units)         │  │
│  └──────────┬───────┬───────┬───────┬───────────┘  │
│             │       │       │       │              │
│  ┌──────────▼───────▼───────▼───────▼───────────┐  │
│  │              Execution Units                  │  │
│  │                                               │  │
│  │  FP32 Units × 128    (single-precision float) │  │
│  │  INT32 Units × 128   (integer)                │  │
│  │  FP64 Units × 2      (double-precision)       │  │
│  │  Tensor Cores × 4    (matrix multiply)        │  │
│  │  Load/Store Units × 32                        │  │
│  │  SFU × 16            (sin, cos, exp, rsqrt)   │  │
│  └───────────────────────────────────────────────┘  │
│                                                    │
│  ┌─────────────────┐  ┌──────────────────────┐     │
│  │  Register File  │  │   Shared Memory /    │     │
│  │   (256 KB)      │  │   L1 Cache (128 KB)  │     │
│  └─────────────────┘  └──────────────────────┘     │
└────────────────────────────────────────────────────┘

What Each Component Does
#

ComponentRoleWhy it matters
Warp SchedulerPicks a ready warp and issues its next instruction4 schedulers = 4 warps can issue per cycle
FP32 UnitsSingle-precision floating-point arithmeticThe main workhorse for most GPU compute
INT32 UnitsInteger arithmetic, address calculationCan execute in parallel with FP32 on some architectures
FP64 UnitsDouble-precision floating-pointImportant for scientific computing, very few per SM
Tensor CoresHardware matrix multiply-accumulateDramatically accelerate deep learning (10–20× over FP32 cores)
LD/ST UnitsLoad and store data from/to memoryMemory access is often the bottleneck
SFUTranscendental functions (sin, cos, exp, etc.)Slow but necessary for certain computations
Register FilePer-thread fast storage256 KB — much larger than a CPU’s register file
Shared MemoryProgrammer-managed on-chip memory, shared within a thread blockCritical for reducing global memory access

The register file deserves special attention. At 256 KB per SM, it is larger than most CPU L1 caches. This is because the GPU needs to hold state for thousands of threads simultaneously. Each thread gets its own set of registers (up to 255 per thread).


SIMT Execution Model
#

The GPU’s execution model is called SIMT: Single Instruction, Multiple Threads. It is similar to SIMD (Single Instruction, Multiple Data) on CPUs, but with important differences.

What Is a Warp?
#

A warp is a group of 32 threads that execute the same instruction at the same time on NVIDIA GPUs. (On AMD GPUs, the equivalent is called a wavefront of 64 threads.)

When you launch a CUDA kernel with 256 threads in a block, the hardware organizes them into warps:

Thread Block (256 threads):
├── Warp 0:  Thread   0 – Thread  31
├── Warp 1:  Thread  32 – Thread  63
├── Warp 2:  Thread  64 – Thread  95
├── Warp 3:  Thread  96 – Thread 127
├── Warp 4:  Thread 128 – Thread 159
├── Warp 5:  Thread 160 – Thread 191
├── Warp 6:  Thread 192 – Thread 223
└── Warp 7:  Thread 224 – Thread 255

How a Warp Executes
#

All 32 threads in a warp execute in lockstep — the same instruction, at the same time, on different data.

Warp 0 execution (32 threads in lockstep):

Cycle 1:  ALL 32 threads execute:  ADD R1, R2, R3
          Thread 0: R1 = R2 + R3  (with thread 0's data)
          Thread 1: R1 = R2 + R3  (with thread 1's data)
          ...
          Thread 31: R1 = R2 + R3 (with thread 31's data)

Cycle 2:  ALL 32 threads execute:  MUL R4, R1, R5

Cycle 3:  ALL 32 threads execute:  ST [R6], R4

This is extremely efficient: one instruction fetch and decode serves 32 threads. The hardware cost is shared across all threads in the warp.

SIMD vs SIMT
#

AspectCPU SIMDGPU SIMT
Vector widthExplicit (128/256/512 bits)Implicit (warp of 32 threads)
ProgrammingVector intrinsics or auto-vectorizationWrite scalar code per thread
Branch handlingMask registerAutomatic predication (divergence)
Thread identityNo per-lane identityEach thread has unique threadIdx

The programming model difference is significant. With CPU SIMD (AVX-512), you explicitly pack data into 512-bit vectors and use special vector instructions. With GPU SIMT, you write code as if it runs on a single thread, and the hardware maps 32 threads onto the execution units automatically. This makes GPU programming more intuitive, but you need to be aware of warp divergence.

Warp Divergence: The Performance Trap
#

When threads within the same warp take different branches, the warp must execute both paths sequentially, with some threads disabled on each path. This is called warp divergence.

Example code:

if (threadIdx.x < 16) {
    a[idx] = x + 1;      // Path A
} else {
    a[idx] = x * 2;      // Path B
}

Step-by-step execution:

Step 1: Evaluate condition for all 32 threads
  Thread  0: threadIdx.x =  0  → condition TRUE  (Path A)
  Thread  1: threadIdx.x =  1  → condition TRUE  (Path A)
  ...
  Thread 15: threadIdx.x = 15  → condition TRUE  (Path A)
  Thread 16: threadIdx.x = 16  → condition FALSE (Path B)
  ...
  Thread 31: threadIdx.x = 31  → condition FALSE (Path B)

Step 2: Execute Path A (threads 0-15 active, threads 16-31 MASKED)
  Thread  0: a[0] = x + 1     ← executes
  Thread  1: a[1] = x + 1     ← executes
  ...
  Thread 15: a[15] = x + 1    ← executes
  Thread 16: — (idle)          ← masked, wastes a lane
  ...
  Thread 31: — (idle)          ← masked, wastes a lane

Step 3: Execute Path B (threads 0-15 MASKED, threads 16-31 active)
  Thread  0: — (idle)
  ...
  Thread 15: — (idle)
  Thread 16: a[16] = x * 2    ← executes
  ...
  Thread 31: a[31] = x * 2    ← executes

Step 4: Reconverge — all threads active again

Result: The if-else block takes 2× the time because both paths execute sequentially. Half the execution units are idle during each path.

$$ \text{SIMT Efficiency} = \frac{\text{Active threads per instruction}}{\text{Warp size (32)}} $$

In this example, efficiency = 16/32 = 50%.

Key optimization rule: Design your code so that threads within the same warp take the same branch. Divergence between warps is free (different warps are independent), but divergence within a warp is costly.


Latency Hiding Through Warp Scheduling
#

This is the GPU’s most important trick. A CPU hides memory latency with large caches and speculative execution. A GPU hides it by switching to a different warp — instantly, at zero cost.

How It Works Step by Step
#

Suppose an SM has 32 warps assigned to it. At any moment, some warps are ready to execute and others are waiting for memory.

Warp states on an SM:
  Warp  0: READY      ← can execute next instruction
  Warp  1: MEM_WAIT   ← waiting for global memory load (400 cycles)
  Warp  2: READY
  Warp  3: MEM_WAIT
  Warp  4: READY
  Warp  5: READY
  ...
  Warp 31: MEM_WAIT

The warp scheduler picks a READY warp every cycle:

Cycle 1:   Schedule Warp 0 → issues ADD instruction
Cycle 2:   Warp 0 issues LD (memory request) → now MEM_WAIT
           Schedule Warp 2 → issues MUL instruction
Cycle 3:   Schedule Warp 4 → issues ADD instruction
Cycle 4:   Schedule Warp 5 → issues SUB instruction
...
Cycle 400: Warp 0's memory data arrives → READY again
Cycle 401: Schedule Warp 0 → continues execution

Critical point: Switching from one warp to another takes zero cycles. This is because every warp’s register state is always resident in the register file. There is no context switch — the scheduler just points to a different set of registers.

CPU approach to latency:
  Thread runs → Cache miss (200 cycles) → STALL → Data arrives → Resume
                                          ^^^^^^
                                     Wasted cycles (or OS context switch, ~1000+ cycles)

GPU approach to latency:
  Warp 0 runs → Memory request → Switch to Warp 2 (0 cycles!) → Warp 2 runs
  → Warp 4 runs → ... → Warp 0's data arrives → Warp 0 continues
  (No cycles wasted — other warps filled the gap)

Occupancy
#

Occupancy measures how many warps are resident on an SM relative to the maximum:

$$ \text{Occupancy} = \frac{\text{Active warps on SM}}{\text{Maximum warps supported by SM}} $$

Higher occupancy means more warps available to hide latency. But occupancy is limited by per-thread resource usage:

Step by step — how occupancy is determined:

  1. An SM supports a maximum of, say, 48 warps (1,536 threads).
  2. The SM has 256 KB of registers and 128 KB of shared memory.
  3. Your kernel uses 64 registers per thread and 48 KB of shared memory per block.
  4. Register limit: 256 KB ÷ (64 regs × 4 bytes × 32 threads/warp) = 32 warps max.
  5. Shared memory limit: 128 KB ÷ 48 KB = 2 blocks max. If each block has 256 threads (8 warps), that is 16 warps.
  6. The binding constraint is shared memory: occupancy = 16/48 = 33%.
OccupancyLatency hiding abilityResource flexibility
100%MaximumVery constrained registers/shared mem
50%Usually sufficientModerate
25%May be insufficientVery flexible per-thread

In practice, 50% occupancy is often enough because each warp issues multiple independent instructions, providing latency hiding even with fewer warps. But if your kernel is very memory-bound, higher occupancy helps significantly.


GPU Memory Hierarchy
#

Memory access is the dominant bottleneck in most GPU workloads. Understanding the memory hierarchy is essential for writing fast GPU code.

The Complete Memory Map
#

Speed   ┌──────────────────────────────────────────┐
  ▲     │         Per-Thread: Registers            │
  │     │  Up to 255 registers per thread          │
  │     │  Access: 0 cycles (operand read)         │
  │     │  Size: 256 KB per SM total               │
  │     ├──────────────────────────────────────────┤
  │     │         Per-Block: Shared Memory         │
  │     │  Shared among all threads in a block     │
  │     │  Programmer-managed (explicit load/store)│
  │     │  Access: ~20-30 cycles                   │
  │     │  Size: up to 228 KB per SM               │
  │     ├──────────────────────────────────────────┤
  │     │         Per-SM: L1 Cache                 │
  │     │  Hardware-managed, automatic             │
  │     │  Access: ~30 cycles                      │
  │     │  Size: configurable, shares SRAM w/ smem │
  │     ├──────────────────────────────────────────┤
  │     │         Chip-wide: L2 Cache              │
  │     │  Shared across all SMs                   │
  │     │  Access: ~200 cycles                     │
  │     │  Size: 6-72 MB                           │
  │     ├──────────────────────────────────────────┤
  │     │         Off-chip: Global Memory (VRAM)   │
  │     │  GDDR6X or HBM                          │
  │     │  Access: ~400-600 cycles                 │
  │     │  Size: 24-80 GB                          │
  ▼     └──────────────────────────────────────────┘
Speed

Registers: The Fastest Storage
#

Each thread can use up to 255 registers. Accessing a register takes 0 extra cycles — it is directly wired into the execution unit. But registers are a limited, shared resource. The more registers each thread uses, the fewer threads (warps) can be resident on the SM.

Trade-off example:

  • 32 regs/thread → 48 warps fit → high occupancy, good latency hiding
  • 128 regs/thread → 16 warps fit → lower occupancy, but each thread computes faster
  • 255 regs/thread → 8 warps fit → risk of poor latency hiding

If a thread needs more than 255 registers, the compiler spills extras to local memory (actually global memory, very slow). Register spilling is a major performance killer.

Shared Memory: The Programmer-Managed Cache
#

Shared memory is a block of fast on-chip SRAM that is shared among all threads in a thread block and managed explicitly by the programmer. It physically shares the same SRAM as the L1 cache, and you can configure the split.

Configurable SRAM split (Ampere/Ada):
  Option A:  Shared 128 KB  +  L1  0 KB
  Option B:  Shared 100 KB  +  L1 28 KB
  Option C:  Shared  64 KB  +  L1 64 KB
  Option D:  Shared  28 KB  +  L1 100 KB

Why use shared memory? Consider a scenario where multiple threads in a block need the same data:

Without shared memory:
  Thread 0: Load A[0] from global memory — 400 cycles
  Thread 1: Load A[0] from global memory — 400 cycles (same data!)
  Thread 2: Load A[0] from global memory — 400 cycles (same data!)
  ...
  Total wasted bandwidth: enormous

With shared memory:
  Step 1: Thread 0 loads A[0] from global → shared memory (400 cycles, once)
  Step 2: __syncthreads()  (barrier — ensure all threads see the data)
  Step 3: All threads read A[0] from shared memory (~20 cycles each)
  Total: 400 + 20 per access — much cheaper for repeated access

Step-by-step shared memory usage pattern (tiling):

  1. Each thread cooperatively loads a tile of data from global memory into shared memory.
  2. Call __syncthreads() to ensure all loads complete.
  3. All threads compute using the data in shared memory (fast).
  4. Call __syncthreads() again before loading the next tile.
  5. Repeat until the full computation is done.

This pattern is fundamental in matrix multiplication, convolution, and many other GPU algorithms.

Bank Conflicts
#

Shared memory is divided into 32 banks (one per warp lane). If multiple threads access different addresses that map to the same bank, the accesses are serialized.

Step by step — how bank mapping works:

Bank assignment (32 banks, 4 bytes per bank per row):
  Address 0   → Bank 0
  Address 4   → Bank 1
  Address 8   → Bank 2
  ...
  Address 124 → Bank 31
  Address 128 → Bank 0  (wraps around)
  Address 132 → Bank 1
  ...

No conflict — all threads access different banks:

Thread 0  → Address 0   → Bank 0
Thread 1  → Address 4   → Bank 1
Thread 2  → Address 8   → Bank 2
...
Thread 31 → Address 124 → Bank 31
→ All 32 accesses happen simultaneously (1 cycle)

2-way conflict — two threads hit the same bank:

Thread 0  → Address 0   → Bank 0
Thread 1  → Address 128 → Bank 0  ← same bank!
Thread 2  → Address 8   → Bank 2
...
→ Bank 0 must serve two requests sequentially (2 cycles for those threads)

Broadcast — all threads access the same address:

Thread 0  → Address 0   → Bank 0
Thread 1  → Address 0   → Bank 0  (same address!)
Thread 2  → Address 0   → Bank 0  (same address!)
...
→ Hardware broadcasts — treated as 1 access (no conflict)

Memory Coalescing: Getting Data Efficiently from Global Memory
#

When a warp executes a load instruction, all 32 threads issue memory requests. The hardware tries to coalesce (merge) these into as few memory transactions as possible.

Coalesced access (ideal):

Thread 0  → Address 0x1000
Thread 1  → Address 0x1004
Thread 2  → Address 0x1008
...
Thread 31 → Address 0x107C

→ Addresses are consecutive → merged into ONE 128-byte transaction
→ 128 bytes transferred, 128 bytes useful = 100% efficiency

Strided access (wasteful):

Thread 0  → Address 0x1000
Thread 1  → Address 0x1100  (stride = 256 bytes)
Thread 2  → Address 0x1200
...
Thread 31 → Address 0x2F00

→ Addresses span many cache lines → up to 32 SEPARATE transactions
→ 32 × 128 bytes transferred, only 32 × 4 = 128 bytes useful
→ Efficiency: 128 / 4096 = 3.1%

Random access (worst case):

Thread 0  → Address 0xA000
Thread 1  → Address 0x3400
Thread 2  → Address 0x7800
...
→ Each thread hits a different cache line → 32 transactions
→ Same 3% efficiency
$$ \text{Memory Efficiency} = \frac{\text{Useful bytes loaded}}{\text{Total bytes transferred}} $$

Optimization rule: Structure your data so that consecutive threads (within a warp) access consecutive memory addresses. This is called a coalesced access pattern and is one of the most important GPU performance optimizations.


Tensor Cores: Hardware for Matrix Multiply
#

Starting with the Volta architecture (2017), NVIDIA GPUs include Tensor Cores — specialized hardware units that perform small matrix multiply-accumulate (MMA) operations in a single cycle.

What Tensor Cores Do
#

A single Tensor Core computes:

$$ D = A \times B + C $$

Where \(A\), \(B\), \(C\), and \(D\) are small matrices (e.g., 16×16 for FP16).

Comparison of throughput:

Regular FP32 CUDA Cores:
  1 core performs 1 multiply + 1 add = 2 FLOP per cycle
  128 cores per SM → 256 FLOP/cycle/SM

4th Gen Tensor Core (Hopper, FP16):
  One MMA operation: 16 × 16 × 16 × 2 = 8,192 FLOP
  4 Tensor Cores per SM → ~32,768 FLOP/cycle/SM

  → Tensor Cores are ~128× more throughput for matrix ops

Supported Data Types
#

Different precisions trade off accuracy for speed:

Data TypeBitsUse CaseRelative Speed
FP6464Scientific computing1× (baseline)
TF3219Training (drop-in FP32 replacement)~8×
FP1616Training and inference~16×
BF1616Training (wider range than FP16)~16×
INT88Inference~32×
FP8 (E4M3/E5M2)8Inference (Hopper+)~32×
FP44Inference (Blackwell)~64×

FP16 vs BF16: Why Two 16-bit Formats?
#

FP32:   [1 sign] [8 exponent] [23 mantissa]
        Range: ±3.4 × 10³⁸, Precision: ~7 decimal digits

FP16:   [1 sign] [5 exponent] [10 mantissa]
        Range: ±65,504, Precision: ~3.3 decimal digits

BF16:   [1 sign] [8 exponent] [7 mantissa]
        Range: ±3.4 × 10³⁸, Precision: ~2.4 decimal digits

FP16 has better precision but a very narrow range. If gradients during training fall outside ±65,504, they overflow to infinity. BF16 has the same range as FP32 (same 8-bit exponent), so it rarely overflows, making training more stable despite the lower precision. This is why BF16 has become the default for large model training.


CUDA Programming Model → Hardware Mapping
#

Understanding how software concepts map to hardware is essential for performance tuning.

The Mapping
#

CUDA Software                    GPU Hardware
────────────                     ────────────
Grid (all blocks)           →    Entire GPU
  └── Block (group of threads) → Assigned to one SM
        └── Thread            → Runs on a CUDA Core (within a warp)

Step by step — what happens when you launch a kernel:

  1. You specify a grid of blocks: kernel<<<gridDim, blockDim>>>(...)
    • Example: kernel<<<128, 256>>>(...) → 128 blocks, each with 256 threads
  2. The GigaThread Engine (top-level scheduler) distributes blocks to SMs.
  3. Each SM may receive multiple blocks (if resources allow).
  4. Within each block, threads are grouped into warps of 32.
  5. The SM’s warp schedulers manage all resident warps, issuing instructions each cycle.
Example: kernel<<<128, 256>>>

Grid: 128 blocks
  Block 0   → assigned to SM 3
  Block 1   → assigned to SM 7
  Block 2   → assigned to SM 3  (SM 3 gets multiple blocks)
  Block 3   → assigned to SM 12
  ...
  Block 127 → assigned to SM 45

Within Block 0 on SM 3:
  256 threads ÷ 32 = 8 warps
  Warp 0: threads 0-31
  Warp 1: threads 32-63
  ...
  Warp 7: threads 224-255

A block stays on the same SM for its entire lifetime. It cannot migrate. Threads within a block can synchronize (__syncthreads()) and share data via shared memory. Threads in different blocks cannot directly communicate during execution.

Choosing Block Size
#

Block SizeWarps/BlockTypical Use Case
321Very simple kernels, debugging
1284Good general default
2568Most common choice
51216When more shared memory per block is needed
102432Maximum allowed (use sparingly)

128 or 256 threads per block works well in most cases. Too small means underutilizing the SM. Too large may cause resource pressure (registers, shared memory) that reduces occupancy.


GPU Memory Bandwidth: GDDR vs HBM
#

GPU workloads are often memory-bandwidth limited, meaning the compute units are starved for data. This is why GPU memory bandwidth is so important.

GDDR6X vs HBM3
#

GDDR6X (consumer GPUs, e.g., RTX 4090):
  ┌──────┐                    ┌────┐ ┌────┐
  │ GPU  │──[384-bit bus]────→│GDDR│ │GDDR│  ... (12 chips on PCB)
  └──────┘                    └────┘ └────┘

  Bus width: 384 bits
  Data rate: 21 Gbps
  Bandwidth: 384 × 21 / 8 = 1,008 GB/s

HBM3 (datacenter GPUs, e.g., H100):
  ┌─────────────────────┐
  │   ┌─────┐ ┌─────┐  │
  │   │ HBM │ │ HBM │  │  HBM stacks sit next to GPU die
  │   │stack│ │stack│  │  on a silicon interposer
  │   └─────┘ └─────┘  │
  │      GPU Die        │
  │   ┌─────┐ ┌─────┐  │
  │   │ HBM │ │ HBM │  │
  │   │stack│ │stack│  │
  │   └─────┘ └─────┘  │
  └─────────────────────┘

  Bus width: 5,120 bits (massively wide)
  Data rate: 6.4 Gbps (lower, but width compensates)
  Bandwidth: 5,120 × 6.4 / 8 = 3,350 GB/s
AspectGDDR6XHBM3
Bandwidth~1 TB/s~3.4 TB/s
Power efficiencyModerateHigh (short wires on interposer)
Capacity24 GB typical80 GB typical
CostLowerMuch higher
Physical designChips on PCB edgeStacks on silicon interposer
Typical useGaming, workstationsData center, AI training

HBM achieves higher bandwidth by using a very wide bus (5,120 bits vs 384 bits) at a lower clock speed. The short wires on the silicon interposer also use less power per bit, which matters at data center scale.

Roofline Model: Understanding Performance Limits
#

Every GPU kernel is limited by one of two factors: compute throughput or memory bandwidth. The Roofline Model visualizes this.

$$ \text{Attainable Performance} = \min\left(\text{Peak FLOP/s}, \quad \text{Memory Bandwidth} \times \text{Arithmetic Intensity}\right) $$$$ \text{Arithmetic Intensity (AI)} = \frac{\text{FLOPs performed}}{\text{Bytes transferred from memory}} $$
Performance
(FLOP/s)
    │               ╱ Peak Compute ─────────────
    │              ╱
    │             ╱
    │            ╱
    │           ╱
    │          ╱       ← Memory-bound region
    │         ╱
    │        ╱            Compute-bound region →
    │       ╱
    │      ╱
    │     ╱
    │    ╱
    └───────────────────────────────────────
        Arithmetic Intensity (FLOP/Byte)
            Ridge Point

Reading the roofline:

  1. Calculate your kernel’s arithmetic intensity: count FLOPs and bytes transferred.
  2. If AI is left of the ridge point: you are memory-bound. Optimize memory access (coalescing, shared memory, caching).
  3. If AI is right of the ridge point: you are compute-bound. Optimize computation (Tensor Cores, reduced precision).

Examples:

OperationAI (FLOP/Byte)Typically
Vector addition0.25Memory-bound
Matrix-vector multiply~1Memory-bound
Matrix-matrix multiply~N/8Compute-bound (for large N)
Convolution (large)~100+Compute-bound

Matrix multiplication is compute-bound because it reuses each loaded element \(O(N)\) times. Vector addition loads two elements, does one add, stores one result — almost no reuse, so it is entirely bandwidth-limited.


GPU Architecture Evolution
#

GenerationYearKey Innovation
Tesla (G80)2006Unified shaders, CUDA introduced — GPUs become programmable for general compute
Fermi2010L1/L2 caches added, ECC memory, true IEEE FP
Kepler2012Dynamic parallelism (kernels launch kernels), Hyper-Q
Maxwell2014Major power efficiency improvement (~2× perf/watt)
Pascal2016HBM2 (P100), NVLink, native FP16 support
Volta2017Tensor Cores introduced — first hardware matrix multiply acceleration
Turing2018RT Cores (hardware ray tracing), INT8/INT4 inference
Ampere20203rd gen Tensor Cores, TF32, sparsity support, 80GB HBM2e
Hopper2022Transformer Engine, FP8, DPX instructions, 3TB/s HBM3
Blackwell20242nd gen Transformer Engine, FP4, dual-die design, 8TB/s HBM3e

Three inflection points:

  1. G80 / CUDA (2006): Transformed GPUs from graphics-only to general-purpose parallel processors. Made GPU computing accessible to non-graphics programmers.

  2. Volta / Tensor Cores (2017): Purpose-built hardware for matrix multiplication gave deep learning training a ~10× speedup over regular CUDA cores. This is when GPU = AI training hardware became firmly established.

  3. Hopper / Transformer Engine (2022): Hardware-level support for Transformer-specific operations (attention, layer norm) with automatic FP8 precision management. Acknowledged that Transformers are the dominant AI architecture and optimized silicon for them.


Summary
#

ConceptKey Takeaway
GPU vs CPUGPU trades single-thread speed for massive parallelism
SMThe fundamental compute unit — contains cores, schedulers, shared memory, register file
Warp (32 threads)Executes in lockstep — same instruction, different data
SIMTWrite scalar code, hardware executes across 32 threads
Warp divergenceDifferent branches within a warp → both paths execute serially → wasted lanes
Latency hidingZero-cost warp switching — when one warp stalls, another runs
OccupancyMore resident warps = more latency hiding potential
Shared memoryFast on-chip, programmer-managed — essential for data reuse
Bank conflictsSame-bank accesses are serialized — design access patterns to avoid
Memory coalescingConsecutive threads accessing consecutive addresses → single transaction
Tensor CoresHardware matrix multiply — 10–100× faster than regular cores for matrix ops
HBMWide bus, high bandwidth, on-interposer — enables feeding compute-hungry GPUs
Roofline modelPerformance limited by min(compute ceiling, bandwidth × arithmetic intensity)

The fundamental principle of GPU computing is: trade latency for throughput. A single GPU thread is much slower than a single CPU thread. But by running thousands of threads and using their collective activity to hide memory latency, the GPU achieves orders-of-magnitude higher throughput for data-parallel workloads. This is why deep learning, scientific simulation, and graphics rendering all live on GPUs.