GPU Architecture: The Engine Behind Parallel Computing

Table of Contents

Overview
#

A GPU is not a faster CPU. It is a fundamentally different machine designed to solve a fundamentally different problem. Where a CPU excels at running a few complex tasks with low latency, a GPU excels at running thousands of simple tasks simultaneously with high throughput.

This architectural difference is not accidental. It follows directly from the workloads each processor was designed for. Understanding GPU architecture from the ground up — how the hardware is organized, how threads execute, how memory is structured — is essential for writing efficient GPU code and understanding why deep learning runs on GPUs.

CPU vs GPU: Different Problems, Different Designs
#

The Design Trade-off
#

A transistor budget is finite. CPU designers spend most transistors on structures that make a single thread fast: large caches, branch predictors, out-of-order execution engines. GPU designers spend most transistors on more execution units, accepting that each individual unit is simpler and slower.

CPU Die Area (conceptual):
┌──────────────────────────────────────────┐
│                                          │
│   ┌───────────────────┐   ┌─────────┐   │
│   │     Control &     │   │         │   │
│   │  Branch Predictor │   │  Cache  │   │
│   │     (large)       │   │ (large) │   │
│   └───────────────────┘   │         │   │
│   ┌──────────┐  ┌──────┐ │         │   │
│   │  Core 0  │  │Core 1│ │         │   │
│   │ (complex)│  │      │ └─────────┘   │
│   └──────────┘  └──────┘               │
│   ┌──────────┐  ┌──────┐               │
│   │  Core 2  │  │Core 3│               │
│   └──────────┘  └──────┘               │
└──────────────────────────────────────────┘

GPU Die Area (conceptual):
┌──────────────────────────────────────────┐
│ ┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐  │
│ │SM││SM││SM││SM││SM││SM││SM││SM││SM│  │
│ └──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘  │
│ ┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐  │
│ │SM││SM││SM││SM││SM││SM││SM││SM││SM│  │
│ └──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘  │
│ ┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐┌──┐  │
│ │SM││SM││SM││SM││SM││SM││SM││SM││SM│  │
│ └──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘└──┘  │
│            ... (many SMs)               │
│ ┌─────────────────────────────────────┐ │
│ │       Small cache / Shared Mem      │ │
│ └─────────────────────────────────────┘ │
└──────────────────────────────────────────┘

Side-by-Side Comparison
#

Aspect	CPU	GPU
Core count	4–64 large cores	Thousands of small cores
Clock speed	4–6 GHz	1.5–2.5 GHz
Cache size	Large (tens of MB)	Small (few MB per SM)
Branch prediction	Sophisticated (TAGE, BTB)	Very simple or none
Out-of-order execution	Yes (ROB, reservation stations)	No (in-order)
Latency hiding	Cache + speculation	Massive thread switching
Optimal workload	Sequential, branch-heavy	Parallel, data-parallel
Transistor priority	Make one thread fast	Make thousands of threads run

The key question is: how does the GPU hide memory latency without caches or out-of-order execution? The answer is thread-level parallelism, and understanding that requires understanding the GPU’s hardware organization.

GPU Hardware Organization
#

This section uses NVIDIA terminology, as it is the most widely documented. AMD’s architecture is structurally similar (with different names).

Hierarchical Structure
#

A GPU is organized in a hierarchy: the full chip contains multiple GPCs (Graphics Processing Clusters), each GPC contains multiple SMs (Streaming Multiprocessors), and each SM contains the actual execution units.

┌──────────────────────────────────────────────────────┐
│                        GPU Chip                      │
│                                                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │   GPC 0     │  │   GPC 1     │  │   GPC N     │  │
│  │ ┌───┐ ┌───┐ │  │ ┌───┐ ┌───┐ │  │ ┌───┐ ┌───┐ │  │
│  │ │SM0│ │SM1│ │  │ │SM4│ │SM5│ │  │ │SMk│ │...│ │  │
│  │ └───┘ └───┘ │  │ └───┘ └───┘ │  │ └───┘ └───┘ │  │
│  │ ┌───┐ ┌───┐ │  │ ┌───┐ ┌───┐ │  │ ┌───┐ ┌───┐ │  │
│  │ │SM2│ │SM3│ │  │ │SM6│ │SM7│ │  │ │...│ │...│ │  │
│  │ └───┘ └───┘ │  │ └───┘ └───┘ │  │ └───┘ └───┘ │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
│                                                      │
│  ┌──────────────────────────────────────────────┐    │
│  │              L2 Cache (shared)                │    │
│  └──────────────────────────────────────────────┘    │
│  ┌──────────────────────────────────────────────┐    │
│  │          VRAM (GDDR6X or HBM)                │    │
│  └──────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────┘

Real GPU Specifications
#

GPU	Generation	SMs	CUDA Cores	L2 Cache	VRAM	Bandwidth	TDP
RTX 3090	Ampere	82	10,496	6 MB	24 GB GDDR6X	936 GB/s	350W
RTX 4090	Ada Lovelace	128	16,384	72 MB	24 GB GDDR6X	1,008 GB/s	450W
A100	Ampere	108	6,912	40 MB	80 GB HBM2e	2,039 GB/s	400W
H100	Hopper	132	16,896	50 MB	80 GB HBM3	3,350 GB/s	700W

Notice the enormous difference in core counts compared to CPUs (thousands vs. tens).

Streaming Multiprocessor (SM): The Core Building Block
#

The SM is the fundamental compute unit of a GPU. Understanding the SM is the key to understanding GPU performance.

SM Internal Architecture
#

Each SM contains multiple groups of execution units, shared memory, caches, and warp schedulers. Here is the layout for a modern (Ada Lovelace generation) SM:

┌────────────────────────────────────────────────────┐
│              Streaming Multiprocessor (SM)          │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │          Warp Schedulers × 4                 │  │
│  │  (Each can issue 1 instruction per cycle     │  │
│  │   to its own set of execution units)         │  │
│  └──────────┬───────┬───────┬───────┬───────────┘  │
│             │       │       │       │              │
│  ┌──────────▼───────▼───────▼───────▼───────────┐  │
│  │              Execution Units                  │  │
│  │                                               │  │
│  │  FP32 Units × 128    (single-precision float) │  │
│  │  INT32 Units × 128   (integer)                │  │
│  │  FP64 Units × 2      (double-precision)       │  │
│  │  Tensor Cores × 4    (matrix multiply)        │  │
│  │  Load/Store Units × 32                        │  │
│  │  SFU × 16            (sin, cos, exp, rsqrt)   │  │
│  └───────────────────────────────────────────────┘  │
│                                                    │
│  ┌─────────────────┐  ┌──────────────────────┐     │
│  │  Register File  │  │   Shared Memory /    │     │
│  │   (256 KB)      │  │   L1 Cache (128 KB)  │     │
│  └─────────────────┘  └──────────────────────┘     │
└────────────────────────────────────────────────────┘

What Each Component Does
#

Component	Role	Why it matters
Warp Scheduler	Picks a ready warp and issues its next instruction	4 schedulers = 4 warps can issue per cycle
FP32 Units	Single-precision floating-point arithmetic	The main workhorse for most GPU compute
INT32 Units	Integer arithmetic, address calculation	Can execute in parallel with FP32 on some architectures
FP64 Units	Double-precision floating-point	Important for scientific computing, very few per SM
Tensor Cores	Hardware matrix multiply-accumulate	Dramatically accelerate deep learning (10–20× over FP32 cores)
LD/ST Units	Load and store data from/to memory	Memory access is often the bottleneck
SFU	Transcendental functions (sin, cos, exp, etc.)	Slow but necessary for certain computations
Register File	Per-thread fast storage	256 KB — much larger than a CPU’s register file
Shared Memory	Programmer-managed on-chip memory, shared within a thread block	Critical for reducing global memory access

The register file deserves special attention. At 256 KB per SM, it is larger than most CPU L1 caches. This is because the GPU needs to hold state for thousands of threads simultaneously. Each thread gets its own set of registers (up to 255 per thread).

SIMT Execution Model
#

The GPU’s execution model is called SIMT: Single Instruction, Multiple Threads. It is similar to SIMD (Single Instruction, Multiple Data) on CPUs, but with important differences.

What Is a Warp?
#

A warp is a group of 32 threads that execute the same instruction at the same time on NVIDIA GPUs. (On AMD GPUs, the equivalent is called a wavefront of 64 threads.)

When you launch a CUDA kernel with 256 threads in a block, the hardware organizes them into warps:

Thread Block (256 threads):
├── Warp 0:  Thread   0 – Thread  31
├── Warp 1:  Thread  32 – Thread  63
├── Warp 2:  Thread  64 – Thread  95
├── Warp 3:  Thread  96 – Thread 127
├── Warp 4:  Thread 128 – Thread 159
├── Warp 5:  Thread 160 – Thread 191
├── Warp 6:  Thread 192 – Thread 223
└── Warp 7:  Thread 224 – Thread 255

How a Warp Executes
#

All 32 threads in a warp execute in lockstep — the same instruction, at the same time, on different data.

Warp 0 execution (32 threads in lockstep):

Cycle 1:  ALL 32 threads execute:  ADD R1, R2, R3
          Thread 0: R1 = R2 + R3  (with thread 0's data)
          Thread 1: R1 = R2 + R3  (with thread 1's data)
          ...
          Thread 31: R1 = R2 + R3 (with thread 31's data)

Cycle 2:  ALL 32 threads execute:  MUL R4, R1, R5

Cycle 3:  ALL 32 threads execute:  ST [R6], R4

This is extremely efficient: one instruction fetch and decode serves 32 threads. The hardware cost is shared across all threads in the warp.

SIMD vs SIMT
#

Aspect	CPU SIMD	GPU SIMT
Vector width	Explicit (128/256/512 bits)	Implicit (warp of 32 threads)
Programming	Vector intrinsics or auto-vectorization	Write scalar code per thread
Branch handling	Mask register	Automatic predication (divergence)
Thread identity	No per-lane identity	Each thread has unique threadIdx

The programming model difference is significant. With CPU SIMD (AVX-512), you explicitly pack data into 512-bit vectors and use special vector instructions. With GPU SIMT, you write code as if it runs on a single thread, and the hardware maps 32 threads onto the execution units automatically. This makes GPU programming more intuitive, but you need to be aware of warp divergence.

Warp Divergence: The Performance Trap
#

When threads within the same warp take different branches, the warp must execute both paths sequentially, with some threads disabled on each path. This is called warp divergence.

Example code:

if (threadIdx.x < 16) {
    a[idx] = x + 1;      // Path A
} else {
    a[idx] = x * 2;      // Path B
}

Step-by-step execution:

Step 1: Evaluate condition for all 32 threads
  Thread  0: threadIdx.x =  0  → condition TRUE  (Path A)
  Thread  1: threadIdx.x =  1  → condition TRUE  (Path A)
  ...
  Thread 15: threadIdx.x = 15  → condition TRUE  (Path A)
  Thread 16: threadIdx.x = 16  → condition FALSE (Path B)
  ...
  Thread 31: threadIdx.x = 31  → condition FALSE (Path B)

Step 2: Execute Path A (threads 0-15 active, threads 16-31 MASKED)
  Thread  0: a[0] = x + 1     ← executes
  Thread  1: a[1] = x + 1     ← executes
  ...
  Thread 15: a[15] = x + 1    ← executes
  Thread 16: — (idle)          ← masked, wastes a lane
  ...
  Thread 31: — (idle)          ← masked, wastes a lane

Step 3: Execute Path B (threads 0-15 MASKED, threads 16-31 active)
  Thread  0: — (idle)
  ...
  Thread 15: — (idle)
  Thread 16: a[16] = x * 2    ← executes
  ...
  Thread 31: a[31] = x * 2    ← executes

Step 4: Reconverge — all threads active again

Result: The if-else block takes 2× the time because both paths execute sequentially. Half the execution units are idle during each path.

$$ \text{SIMT Efficiency} = \frac{\text{Active threads per instruction}}{\text{Warp size (32)}} $$

In this example, efficiency = 16/32 = 50%.

Key optimization rule: Design your code so that threads within the same warp take the same branch. Divergence between warps is free (different warps are independent), but divergence within a warp is costly.

Latency Hiding Through Warp Scheduling
#

This is the GPU’s most important trick. A CPU hides memory latency with large caches and speculative execution. A GPU hides it by switching to a different warp — instantly, at zero cost.

How It Works Step by Step
#

Suppose an SM has 32 warps assigned to it. At any moment, some warps are ready to execute and others are waiting for memory.

Warp states on an SM:
  Warp  0: READY      ← can execute next instruction
  Warp  1: MEM_WAIT   ← waiting for global memory load (400 cycles)
  Warp  2: READY
  Warp  3: MEM_WAIT
  Warp  4: READY
  Warp  5: READY
  ...
  Warp 31: MEM_WAIT

The warp scheduler picks a READY warp every cycle:

Cycle 1:   Schedule Warp 0 → issues ADD instruction
Cycle 2:   Warp 0 issues LD (memory request) → now MEM_WAIT
           Schedule Warp 2 → issues MUL instruction
Cycle 3:   Schedule Warp 4 → issues ADD instruction
Cycle 4:   Schedule Warp 5 → issues SUB instruction
...
Cycle 400: Warp 0's memory data arrives → READY again
Cycle 401: Schedule Warp 0 → continues execution

Critical point: Switching from one warp to another takes zero cycles. This is because every warp’s register state is always resident in the register file. There is no context switch — the scheduler just points to a different set of registers.

CPU approach to latency:
  Thread runs → Cache miss (200 cycles) → STALL → Data arrives → Resume
                                          ^^^^^^
                                     Wasted cycles (or OS context switch, ~1000+ cycles)

GPU approach to latency:
  Warp 0 runs → Memory request → Switch to Warp 2 (0 cycles!) → Warp 2 runs
  → Warp 4 runs → ... → Warp 0's data arrives → Warp 0 continues
  (No cycles wasted — other warps filled the gap)

Occupancy
#

Occupancy measures how many warps are resident on an SM relative to the maximum:

$$ \text{Occupancy} = \frac{\text{Active warps on SM}}{\text{Maximum warps supported by SM}} $$

Higher occupancy means more warps available to hide latency. But occupancy is limited by per-thread resource usage:

Step by step — how occupancy is determined:

An SM supports a maximum of, say, 48 warps (1,536 threads).
The SM has 256 KB of registers and 128 KB of shared memory.
Your kernel uses 64 registers per thread and 48 KB of shared memory per block.
Register limit: 256 KB ÷ (64 regs × 4 bytes × 32 threads/warp) = 32 warps max.
Shared memory limit: 128 KB ÷ 48 KB = 2 blocks max. If each block has 256 threads (8 warps), that is 16 warps.
The binding constraint is shared memory: occupancy = 16/48 = 33%.

Occupancy	Latency hiding ability	Resource flexibility
100%	Maximum	Very constrained registers/shared mem
50%	Usually sufficient	Moderate
25%	May be insufficient	Very flexible per-thread

In practice, 50% occupancy is often enough because each warp issues multiple independent instructions, providing latency hiding even with fewer warps. But if your kernel is very memory-bound, higher occupancy helps significantly.

GPU Memory Hierarchy
#

Memory access is the dominant bottleneck in most GPU workloads. Understanding the memory hierarchy is essential for writing fast GPU code.

The Complete Memory Map
#

Speed   ┌──────────────────────────────────────────┐
  ▲     │         Per-Thread: Registers            │
  │     │  Up to 255 registers per thread          │
  │     │  Access: 0 cycles (operand read)         │
  │     │  Size: 256 KB per SM total               │
  │     ├──────────────────────────────────────────┤
  │     │         Per-Block: Shared Memory         │
  │     │  Shared among all threads in a block     │
  │     │  Programmer-managed (explicit load/store)│
  │     │  Access: ~20-30 cycles                   │
  │     │  Size: up to 228 KB per SM               │
  │     ├──────────────────────────────────────────┤
  │     │         Per-SM: L1 Cache                 │
  │     │  Hardware-managed, automatic             │
  │     │  Access: ~30 cycles                      │
  │     │  Size: configurable, shares SRAM w/ smem │
  │     ├──────────────────────────────────────────┤
  │     │         Chip-wide: L2 Cache              │
  │     │  Shared across all SMs                   │
  │     │  Access: ~200 cycles                     │
  │     │  Size: 6-72 MB                           │
  │     ├──────────────────────────────────────────┤
  │     │         Off-chip: Global Memory (VRAM)   │
  │     │  GDDR6X or HBM                          │
  │     │  Access: ~400-600 cycles                 │
  │     │  Size: 24-80 GB                          │
  ▼     └──────────────────────────────────────────┘
Speed

Registers: The Fastest Storage
#

Each thread can use up to 255 registers. Accessing a register takes 0 extra cycles — it is directly wired into the execution unit. But registers are a limited, shared resource. The more registers each thread uses, the fewer threads (warps) can be resident on the SM.

Trade-off example:

32 regs/thread → 48 warps fit → high occupancy, good latency hiding
128 regs/thread → 16 warps fit → lower occupancy, but each thread computes faster
255 regs/thread → 8 warps fit → risk of poor latency hiding

If a thread needs more than 255 registers, the compiler spills extras to local memory (actually global memory, very slow). Register spilling is a major performance killer.

Shared Memory: The Programmer-Managed Cache
#

Shared memory is a block of fast on-chip SRAM that is shared among all threads in a thread block and managed explicitly by the programmer. It physically shares the same SRAM as the L1 cache, and you can configure the split.

Configurable SRAM split (Ampere/Ada):
  Option A:  Shared 128 KB  +  L1  0 KB
  Option B:  Shared 100 KB  +  L1 28 KB
  Option C:  Shared  64 KB  +  L1 64 KB
  Option D:  Shared  28 KB  +  L1 100 KB

Why use shared memory? Consider a scenario where multiple threads in a block need the same data:

Without shared memory:
  Thread 0: Load A[0] from global memory — 400 cycles
  Thread 1: Load A[0] from global memory — 400 cycles (same data!)
  Thread 2: Load A[0] from global memory — 400 cycles (same data!)
  ...
  Total wasted bandwidth: enormous

With shared memory:
  Step 1: Thread 0 loads A[0] from global → shared memory (400 cycles, once)
  Step 2: __syncthreads()  (barrier — ensure all threads see the data)
  Step 3: All threads read A[0] from shared memory (~20 cycles each)
  Total: 400 + 20 per access — much cheaper for repeated access

Step-by-step shared memory usage pattern (tiling):

Each thread cooperatively loads a tile of data from global memory into shared memory.
Call __syncthreads() to ensure all loads complete.
All threads compute using the data in shared memory (fast).
Call __syncthreads() again before loading the next tile.
Repeat until the full computation is done.

This pattern is fundamental in matrix multiplication, convolution, and many other GPU algorithms.

Bank Conflicts
#

Shared memory is divided into 32 banks (one per warp lane). If multiple threads access different addresses that map to the same bank, the accesses are serialized.

Step by step — how bank mapping works:

Bank assignment (32 banks, 4 bytes per bank per row):
  Address 0   → Bank 0
  Address 4   → Bank 1
  Address 8   → Bank 2
  ...
  Address 124 → Bank 31
  Address 128 → Bank 0  (wraps around)
  Address 132 → Bank 1
  ...

No conflict — all threads access different banks:

Thread 0  → Address 0   → Bank 0
Thread 1  → Address 4   → Bank 1
Thread 2  → Address 8   → Bank 2
...
Thread 31 → Address 124 → Bank 31
→ All 32 accesses happen simultaneously (1 cycle)

2-way conflict — two threads hit the same bank:

Thread 0  → Address 0   → Bank 0
Thread 1  → Address 128 → Bank 0  ← same bank!
Thread 2  → Address 8   → Bank 2
...
→ Bank 0 must serve two requests sequentially (2 cycles for those threads)

Broadcast — all threads access the same address:

Thread 0  → Address 0   → Bank 0
Thread 1  → Address 0   → Bank 0  (same address!)
Thread 2  → Address 0   → Bank 0  (same address!)
...
→ Hardware broadcasts — treated as 1 access (no conflict)

Memory Coalescing: Getting Data Efficiently from Global Memory
#

When a warp executes a load instruction, all 32 threads issue memory requests. The hardware tries to coalesce (merge) these into as few memory transactions as possible.

Coalesced access (ideal):

Thread 0  → Address 0x1000
Thread 1  → Address 0x1004
Thread 2  → Address 0x1008
...
Thread 31 → Address 0x107C

→ Addresses are consecutive → merged into ONE 128-byte transaction
→ 128 bytes transferred, 128 bytes useful = 100% efficiency

Strided access (wasteful):

Thread 0  → Address 0x1000
Thread 1  → Address 0x1100  (stride = 256 bytes)
Thread 2  → Address 0x1200
...
Thread 31 → Address 0x2F00

→ Addresses span many cache lines → up to 32 SEPARATE transactions
→ 32 × 128 bytes transferred, only 32 × 4 = 128 bytes useful
→ Efficiency: 128 / 4096 = 3.1%

Random access (worst case):

Thread 0  → Address 0xA000
Thread 1  → Address 0x3400
Thread 2  → Address 0x7800
...
→ Each thread hits a different cache line → 32 transactions
→ Same 3% efficiency

$$ \text{Memory Efficiency} = \frac{\text{Useful bytes loaded}}{\text{Total bytes transferred}} $$

Optimization rule: Structure your data so that consecutive threads (within a warp) access consecutive memory addresses. This is called a coalesced access pattern and is one of the most important GPU performance optimizations.

Tensor Cores: Hardware for Matrix Multiply
#

Starting with the Volta architecture (2017), NVIDIA GPUs include Tensor Cores — specialized hardware units that perform small matrix multiply-accumulate (MMA) operations in a single cycle.

What Tensor Cores Do
#

A single Tensor Core computes:

$$ D = A \times B + C $$

Where $A$, $B$, $C$, and $D$ are small matrices (e.g., 16×16 for FP16).

Comparison of throughput:

Regular FP32 CUDA Cores:
  1 core performs 1 multiply + 1 add = 2 FLOP per cycle
  128 cores per SM → 256 FLOP/cycle/SM

4th Gen Tensor Core (Hopper, FP16):
  One MMA operation: 16 × 16 × 16 × 2 = 8,192 FLOP
  4 Tensor Cores per SM → ~32,768 FLOP/cycle/SM

  → Tensor Cores are ~128× more throughput for matrix ops

Supported Data Types
#

Different precisions trade off accuracy for speed:

Data Type	Bits	Use Case	Relative Speed
FP64	64	Scientific computing	1× (baseline)
TF32	19	Training (drop-in FP32 replacement)	~8×
FP16	16	Training and inference	~16×
BF16	16	Training (wider range than FP16)	~16×
INT8	8	Inference	~32×
FP8 (E4M3/E5M2)	8	Inference (Hopper+)	~32×
FP4	4	Inference (Blackwell)	~64×

FP16 vs BF16: Why Two 16-bit Formats?
#

FP32:   [1 sign] [8 exponent] [23 mantissa]
        Range: ±3.4 × 10³⁸, Precision: ~7 decimal digits

FP16:   [1 sign] [5 exponent] [10 mantissa]
        Range: ±65,504, Precision: ~3.3 decimal digits

BF16:   [1 sign] [8 exponent] [7 mantissa]
        Range: ±3.4 × 10³⁸, Precision: ~2.4 decimal digits

FP16 has better precision but a very narrow range. If gradients during training fall outside ±65,504, they overflow to infinity. BF16 has the same range as FP32 (same 8-bit exponent), so it rarely overflows, making training more stable despite the lower precision. This is why BF16 has become the default for large model training.

CUDA Programming Model → Hardware Mapping
#

Understanding how software concepts map to hardware is essential for performance tuning.

The Mapping
#

CUDA Software                    GPU Hardware
────────────                     ────────────
Grid (all blocks)           →    Entire GPU
  └── Block (group of threads) → Assigned to one SM
        └── Thread            → Runs on a CUDA Core (within a warp)

Step by step — what happens when you launch a kernel:

You specify a grid of blocks: kernel<<<gridDim, blockDim>>>(...)
- Example: kernel<<<128, 256>>>(...) → 128 blocks, each with 256 threads
The GigaThread Engine (top-level scheduler) distributes blocks to SMs.
Each SM may receive multiple blocks (if resources allow).
Within each block, threads are grouped into warps of 32.
The SM’s warp schedulers manage all resident warps, issuing instructions each cycle.

Example: kernel<<<128, 256>>>

Grid: 128 blocks
  Block 0   → assigned to SM 3
  Block 1   → assigned to SM 7
  Block 2   → assigned to SM 3  (SM 3 gets multiple blocks)
  Block 3   → assigned to SM 12
  ...
  Block 127 → assigned to SM 45

Within Block 0 on SM 3:
  256 threads ÷ 32 = 8 warps
  Warp 0: threads 0-31
  Warp 1: threads 32-63
  ...
  Warp 7: threads 224-255

A block stays on the same SM for its entire lifetime. It cannot migrate. Threads within a block can synchronize (__syncthreads()) and share data via shared memory. Threads in different blocks cannot directly communicate during execution.

Choosing Block Size
#

Block Size	Warps/Block	Typical Use Case
32	1	Very simple kernels, debugging
128	4	Good general default
256	8	Most common choice
512	16	When more shared memory per block is needed
1024	32	Maximum allowed (use sparingly)

128 or 256 threads per block works well in most cases. Too small means underutilizing the SM. Too large may cause resource pressure (registers, shared memory) that reduces occupancy.

GPU Memory Bandwidth: GDDR vs HBM
#

GPU workloads are often memory-bandwidth limited, meaning the compute units are starved for data. This is why GPU memory bandwidth is so important.

GDDR6X vs HBM3
#

GDDR6X (consumer GPUs, e.g., RTX 4090):
  ┌──────┐                    ┌────┐ ┌────┐
  │ GPU  │──[384-bit bus]────→│GDDR│ │GDDR│  ... (12 chips on PCB)
  └──────┘                    └────┘ └────┘

  Bus width: 384 bits
  Data rate: 21 Gbps
  Bandwidth: 384 × 21 / 8 = 1,008 GB/s

HBM3 (datacenter GPUs, e.g., H100):
  ┌─────────────────────┐
  │   ┌─────┐ ┌─────┐  │
  │   │ HBM │ │ HBM │  │  HBM stacks sit next to GPU die
  │   │stack│ │stack│  │  on a silicon interposer
  │   └─────┘ └─────┘  │
  │      GPU Die        │
  │   ┌─────┐ ┌─────┐  │
  │   │ HBM │ │ HBM │  │
  │   │stack│ │stack│  │
  │   └─────┘ └─────┘  │
  └─────────────────────┘

  Bus width: 5,120 bits (massively wide)
  Data rate: 6.4 Gbps (lower, but width compensates)
  Bandwidth: 5,120 × 6.4 / 8 = 3,350 GB/s

Aspect	GDDR6X	HBM3
Bandwidth	~1 TB/s	~3.4 TB/s
Power efficiency	Moderate	High (short wires on interposer)
Capacity	24 GB typical	80 GB typical
Cost	Lower	Much higher
Physical design	Chips on PCB edge	Stacks on silicon interposer
Typical use	Gaming, workstations	Data center, AI training

HBM achieves higher bandwidth by using a very wide bus (5,120 bits vs 384 bits) at a lower clock speed. The short wires on the silicon interposer also use less power per bit, which matters at data center scale.

Roofline Model: Understanding Performance Limits
#

Every GPU kernel is limited by one of two factors: compute throughput or memory bandwidth. The Roofline Model visualizes this.

$$ \text{Attainable Performance} = \min\left(\text{Peak FLOP/s}, \quad \text{Memory Bandwidth} \times \text{Arithmetic Intensity}\right) $$$$ \text{Arithmetic Intensity (AI)} = \frac{\text{FLOPs performed}}{\text{Bytes transferred from memory}} $$

Performance
(FLOP/s)
    │               ╱ Peak Compute ─────────────
    │              ╱
    │             ╱
    │            ╱
    │           ╱
    │          ╱       ← Memory-bound region
    │         ╱
    │        ╱            Compute-bound region →
    │       ╱
    │      ╱
    │     ╱
    │    ╱
    └───────────────────────────────────────
        Arithmetic Intensity (FLOP/Byte)
                 ↑
            Ridge Point

Reading the roofline:

Calculate your kernel’s arithmetic intensity: count FLOPs and bytes transferred.
If AI is left of the ridge point: you are memory-bound. Optimize memory access (coalescing, shared memory, caching).
If AI is right of the ridge point: you are compute-bound. Optimize computation (Tensor Cores, reduced precision).

Examples:

Operation	AI (FLOP/Byte)	Typically
Vector addition	0.25	Memory-bound
Matrix-vector multiply	~1	Memory-bound
Matrix-matrix multiply	~N/8	Compute-bound (for large N)
Convolution (large)	~100+	Compute-bound

Matrix multiplication is compute-bound because it reuses each loaded element $O(N)$ times. Vector addition loads two elements, does one add, stores one result — almost no reuse, so it is entirely bandwidth-limited.

GPU Architecture Evolution
#

Generation	Year	Key Innovation
Tesla (G80)	2006	Unified shaders, CUDA introduced — GPUs become programmable for general compute
Fermi	2010	L1/L2 caches added, ECC memory, true IEEE FP
Kepler	2012	Dynamic parallelism (kernels launch kernels), Hyper-Q
Maxwell	2014	Major power efficiency improvement (~2× perf/watt)
Pascal	2016	HBM2 (P100), NVLink, native FP16 support
Volta	2017	Tensor Cores introduced — first hardware matrix multiply acceleration
Turing	2018	RT Cores (hardware ray tracing), INT8/INT4 inference
Ampere	2020	3rd gen Tensor Cores, TF32, sparsity support, 80GB HBM2e
Hopper	2022	Transformer Engine, FP8, DPX instructions, 3TB/s HBM3
Blackwell	2024	2nd gen Transformer Engine, FP4, dual-die design, 8TB/s HBM3e

Three inflection points:

G80 / CUDA (2006): Transformed GPUs from graphics-only to general-purpose parallel processors. Made GPU computing accessible to non-graphics programmers.
Volta / Tensor Cores (2017): Purpose-built hardware for matrix multiplication gave deep learning training a ~10× speedup over regular CUDA cores. This is when GPU = AI training hardware became firmly established.
Hopper / Transformer Engine (2022): Hardware-level support for Transformer-specific operations (attention, layer norm) with automatic FP8 precision management. Acknowledged that Transformers are the dominant AI architecture and optimized silicon for them.

Summary
#

Concept	Key Takeaway
GPU vs CPU	GPU trades single-thread speed for massive parallelism
SM	The fundamental compute unit — contains cores, schedulers, shared memory, register file
Warp (32 threads)	Executes in lockstep — same instruction, different data
SIMT	Write scalar code, hardware executes across 32 threads
Warp divergence	Different branches within a warp → both paths execute serially → wasted lanes
Latency hiding	Zero-cost warp switching — when one warp stalls, another runs
Occupancy	More resident warps = more latency hiding potential
Shared memory	Fast on-chip, programmer-managed — essential for data reuse
Bank conflicts	Same-bank accesses are serialized — design access patterns to avoid
Memory coalescing	Consecutive threads accessing consecutive addresses → single transaction
Tensor Cores	Hardware matrix multiply — 10–100× faster than regular cores for matrix ops
HBM	Wide bus, high bandwidth, on-interposer — enables feeding compute-hungry GPUs
Roofline model	Performance limited by min(compute ceiling, bandwidth × arithmetic intensity)

The fundamental principle of GPU computing is: trade latency for throughput. A single GPU thread is much slower than a single CPU thread. But by running thousands of threads and using their collective activity to hide memory latency, the GPU achieves orders-of-magnitude higher throughput for data-parallel workloads. This is why deep learning, scientific simulation, and graphics rendering all live on GPUs.

Overview#

CPU vs GPU: Different Problems, Different Designs#

The Design Trade-off#

Side-by-Side Comparison#

GPU Hardware Organization#

Hierarchical Structure#

Real GPU Specifications#

Streaming Multiprocessor (SM): The Core Building Block#

SM Internal Architecture#

What Each Component Does#

SIMT Execution Model#

What Is a Warp?#

How a Warp Executes#

SIMD vs SIMT#

Warp Divergence: The Performance Trap#

Latency Hiding Through Warp Scheduling#

How It Works Step by Step#

Occupancy#

GPU Memory Hierarchy#

The Complete Memory Map#

Registers: The Fastest Storage#

Shared Memory: The Programmer-Managed Cache#

Bank Conflicts#

Memory Coalescing: Getting Data Efficiently from Global Memory#

Tensor Cores: Hardware for Matrix Multiply#

What Tensor Cores Do#

Supported Data Types#

FP16 vs BF16: Why Two 16-bit Formats?#

CUDA Programming Model → Hardware Mapping#

The Mapping#

Choosing Block Size#

GPU Memory Bandwidth: GDDR vs HBM#

GDDR6X vs HBM3#

Roofline Model: Understanding Performance Limits#

GPU Architecture Evolution#

Summary#