Table of Contents
Overview#
A modern CPU is far more than a simple fetch-decode-execute loop. Decades of microarchitectural innovation have turned it into a deeply pipelined, speculative, out-of-order execution engine. The goal is simple: maximize IPC (Instructions Per Cycle) — the number of useful instructions completed every clock cycle.
$$ \text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} $$Clock speed (the third term) has hit physical limits around 4–5 GHz. So modern CPUs focus on reducing CPI (the second term) through architectural techniques. This post walks through each major technique in detail.
The Big Picture: Pipeline Overview#
A modern CPU pipeline has roughly 15–20+ stages. At a high level, it divides into two halves.
┌──────────────────────────────────────────────────────────────┐
│ CPU Pipeline │
│ │
│ FRONT-END (supply instructions) │
│ ┌───────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Fetch │──→│Predecode│──→│ Decode │──→│ Rename │ │
│ └───────┘ └────────┘ └────────┘ └───┬────┘ │
│ │ │
│ ═══════════════════════════════════════════╪═══════════ │
│ │ │
│ BACK-END (execute and retire) │ │
│ ┌─────────▼────────┐ │
│ │ Allocate (ROB) │ │
│ └─────────┬────────┘ │
│ ┌─────────▼────────┐ │
│ │ Scheduler │ │
│ │(Reservation Stn.) │ │
│ └────┬───┬───┬─────┘ │
│ ┌────▼┐┌─▼─┐┌▼────┐ │
│ │ALU ││ALU││FPU │ │
│ │ ││AGU││ │ │
│ └────┬┘└─┬─┘└┬────┘ │
│ ┌────▼───▼───▼────┐ │
│ │ Retire (Commit) │ │
│ └─────────────────┘ │
└──────────────────────────────────────────────────────────────┘Front-end responsibility: Fetch instructions from memory, decode them, and supply a steady stream of work to the back-end.
Back-end responsibility: Execute instructions as soon as their inputs are ready (possibly out of program order), then commit results in the original program order.
Let us walk through each stage.
1. Front-End: Instruction Supply#
The front-end must deliver enough instructions every cycle to keep the execution units busy. A modern CPU tries to fetch and decode 4–8 instructions per cycle.
1.1 Instruction Fetch#
The fetch unit reads instructions from the L1 Instruction Cache (L1I). It fetches a block of bytes each cycle (typically 16–32 bytes) and feeds them into the decode stage.
┌──────────────┐
L1 I-$ ──│ Fetch Unit │──→ Instruction Buffer
│ (16-32B/cyc) │
└──────┬───────┘
│
┌──────▼───────┐
│ Branch │
│ Predictor │──→ Next fetch address
└──────────────┘But there is a problem: the fetch unit needs to know where to fetch next. For sequential code, this is simply PC + instruction_size. But branches change the control flow. The CPU cannot wait for a branch to be resolved — that would stall the pipeline for many cycles. So it predicts.
1.2 Branch Prediction#
Branch prediction is arguably the single most important factor for modern CPU performance. In a 20-stage pipeline, a mispredicted branch wastes ~15–20 cycles of work. Modern predictors achieve 95–99% accuracy.
Static Prediction#
The simplest approach: always predict one direction.
| Strategy | Rule | Accuracy |
|---|---|---|
| Always not taken | Predict fall-through | ~40–50% |
| Always taken | Predict branch target | ~60–70% |
| Backward taken, forward not | Loops taken, if-else not | ~65–75% |
These are too inaccurate for high-performance CPUs. They use dynamic prediction instead.
2-Bit Saturating Counter#
Each branch gets a 2-bit counter that tracks history. The counter must be wrong twice in a row before the prediction flips.
State machine:
Strongly Weakly Weakly Strongly
Not Taken ──→ Not Taken ──→ Taken ──→ Taken
(00) (01) (10) (11)
↑ ←─NT─ ↑ ←─NT─ ↑ ←─NT─ ↑
└──────── └──────── └──────── │
│
┌──T──→ ┌──T──→ ┌──T──→ │
│ │ │ └──T──→ (stays)Step by step for a loop that iterates 10 times:
- First encounter: counter starts at 00 (predict Not Taken). Branch is Taken → wrong. Counter: 00 → 01.
- Second iteration: counter at 01 (predict Not Taken). Branch is Taken → wrong. Counter: 01 → 10.
- Iterations 3–10: counter at 10 or 11 (predict Taken). Branch is Taken → correct. Counter saturates at 11.
- Loop exits: counter at 11 (predict Taken). Branch is Not Taken → wrong. Counter: 11 → 10.
- Next loop entry: counter at 10 (predict Taken). Branch is Taken → correct again.
Result: only 2 mispredictions at the start + 1 at exit = 3 wrong out of 11 = ~73% accuracy for this branch.
GShare: Correlating Predictor#
Many branches are correlated with other branches:
if (x > 0) // Branch A
if (y > 0) // Branch B — outcome often depends on Branch A
GShare uses a Global History Register (GHR) to capture this correlation.
Step by step:
- The GHR is a shift register that stores the last N branch outcomes (Taken=1, Not Taken=0).
- To predict a branch, XOR the GHR with the branch’s PC address: $$\text{index} = \text{PC} \oplus \text{GHR}$$
- Use this index to look up a 2-bit counter in a Pattern History Table (PHT).
- After the branch resolves, shift the outcome into the GHR and update the counter.
GHR (last 8 outcomes): [1, 0, 1, 1, 0, 1, 0, 1]
PC of current branch: [0, 1, 0, 0, 1, 1, 1, 0]
XOR
Index: [1, 1, 1, 1, 1, 0, 1, 1]
│
▼
PHT[0b11111011] → 2-bit counter → predictionThis captures patterns like “if branch A was taken and branch B was not taken, then branch C is usually taken.”
TAGE: The State of the Art#
Modern high-performance CPUs (Intel, AMD, Apple) use TAGE (TAgged GEometric history length) predictors. TAGE is the most accurate general-purpose predictor known.
Core idea: Use multiple tables, each indexed with a different history length. Longer history captures more complex patterns but needs more entries to avoid collisions.
┌──────────────────────────────────────────────────┐
│ TAGE Predictor │
│ │
│ Base Predictor (bimodal, no history) │
│ │ │
│ Table T1: indexed with history length L1 = 4 │
│ │ │
│ Table T2: indexed with history length L2 = 8 │
│ │ │
│ Table T3: indexed with history length L3 = 16 │
│ │ │
│ Table T4: indexed with history length L4 = 32 │
│ │ │
│ Table T5: indexed with history length L5 = 64 │
│ │ │
│ Table T6: indexed with history length L6 = 128 │
│ │
│ History lengths grow geometrically: │
│ Lᵢ = Lᵢ₋₁ × α, where α ≈ 2 │
└──────────────────────────────────────────────────┘Step by step:
- For a given branch, look up ALL tables in parallel.
- Each table entry has a tag to detect if the lookup is a true match or an alias.
- Find the table with the longest matching history.
- Use that table’s prediction. If no table matches, fall back to the base predictor.
- After resolution, update the matching table (and possibly allocate in a longer-history table on misprediction).
The geometric spacing ensures coverage of both short patterns (like simple loops) and very long patterns (like nested loops or correlated branches far apart).
Branch Target Buffer (BTB)#
Predicting taken/not-taken is only half the problem. The CPU also needs to predict the target address of taken branches.
| Branch type | Target prediction |
|---|---|
| Conditional | BTB lookup (PC → target address) |
| Direct call | BTB lookup |
| Return | Return Address Stack (RAS) — a hardware LIFO stack |
| Indirect jump | Indirect Target Array (history-based prediction) |
BTB Entry:
┌──────────┬────────────────┬──────┐
│ PC Tag │ Target Address │ Type │
├──────────┼────────────────┼──────┤
│ 0x4010.. │ 0x4020.. │ COND │
│ 0x4050.. │ 0x3000.. │ CALL │
│ 0x4080.. │ (use RAS) │ RET │
└──────────┴────────────────┴──────┘The RAS is elegant: every CALL pushes the return address, every RET pops it. Since call/return patterns are strictly nested, prediction accuracy is nearly 100% — unless the stack overflows.
1.3 Decode: x86 and Micro-ops#
x86 instructions are variable-length (1–15 bytes) and can be extremely complex. Modern x86 CPUs do not execute x86 instructions directly. Instead, the decode stage translates them into fixed-length micro-operations (μops).
x86 Instruction → Micro-ops (μops)
─────────────────────────────────────────────────
ADD RAX, RBX → 1 μop (simple register add)
ADD RAX, [RBX+8] → 2 μops (load + add)
PUSH RBP → 2 μops (decrement RSP + store)
REP MOVSB (copy N bytes) → N μops (one per byte)
CALL [RAX] → 3 μops (load target + push return + jump)Why do this?
- μops are fixed-size → easier to schedule and pipeline
- The back-end only sees a uniform stream of simple operations
- Complex x86 semantics are handled once in the decoder
Micro-op Cache (μop Cache / DSB):
Intel calls it the Decoded Stream Buffer. It caches already-decoded μops so that hot loops bypass the decoder entirely.
First execution of a loop:
L1 I-$ → [Decoder] → μop Cache → Back-end
↓ (cache the μops)
Subsequent iterations:
μop Cache → Back-end (decoder bypassed, saving power and latency)2. Back-End: Out-of-Order Execution#
The back-end is where the CPU’s real power lies. It executes instructions out of program order based on data availability, then commits results in program order.
2.1 Why Out-of-Order?#
Consider this code:
1: LW R1, [R2+0] ; Load R1 from memory (may take 100+ cycles on cache miss!)
2: ADD R3, R1, R4 ; R3 = R1 + R4 (depends on instruction 1)
3: MUL R5, R6, R7 ; R5 = R6 × R7 (independent of 1 and 2!)
4: SUB R8, R5, R9 ; R8 = R5 - R9 (depends on instruction 3)In-order execution:
Cycle 1: LW starts (cache miss — will take 100 cycles)
Cycle 2-100: STALL — waiting for LW to complete
Cycle 101: ADD executes (R1 now available)
Cycle 102: MUL executes
Cycle 103: SUB executes
Total: ~103 cyclesOut-of-order execution:
Cycle 1: LW starts (cache miss)
Cycle 2: MUL executes (R6 and R7 are ready — no need to wait!)
Cycle 3: SUB executes (R5 now ready from MUL)
...
Cycle 100: LW data arrives
Cycle 101: ADD executes (R1 now available)
Total: ~101 cycles — MUL and SUB were "free"The CPU overlapped independent work with the long-latency load. This is the fundamental benefit of out-of-order execution.
2.2 Register Renaming#
Before instructions can execute out of order, we must solve the problem of false dependencies.
True dependency (Read After Write, RAW):
ADD R1, R2, R3 ; Writes R1
SUB R4, R1, R5 ; Reads R1 — must wait for ADD to finishThis is a real data flow dependency. The SUB genuinely needs the result of ADD.
False dependency (Write After Write, WAW):
MUL R1, R2, R3 ; Writes R1
ADD R1, R4, R5 ; Also writes R1 — but this is a DIFFERENT computation!False dependency (Write After Read, WAR):
ADD R4, R1, R5 ; Reads R1
SUB R1, R6, R7 ; Writes R1 — different computation, but same register nameWAW and WAR are name dependencies, not data flow dependencies. They exist only because the ISA has a limited number of register names (16 in x86-64).
Solution: Register Renaming
The CPU maintains a large pool of physical registers (typically 200–400) and maps architectural register names to physical register numbers.
Step by step:
- Instruction enters the rename stage.
- For each source register, look up the current physical register mapping in the Register Alias Table (RAT).
- For each destination register, allocate a new physical register from the free list.
- Update the RAT to point the architectural register name to the new physical register.
Before renaming: After renaming:
MUL R1, R2, R3 MUL P47, P12, P33
ADD R1, R4, R5 ADD P48, P9, P21
RAT State:
R1 → P47 (after MUL)
R1 → P48 (after ADD, overrides) ← P47 and P48 are different!
R2 → P12
R3 → P33
R4 → P9
R5 → P21Now MUL writes to P47 and ADD writes to P48. They are completely independent and can execute simultaneously. The false WAW dependency is gone.
2.3 The Reorder Buffer (ROB)#
Out-of-order execution creates a problem: if instruction 5 completes before instruction 3, and instruction 3 later causes an exception (like a page fault), we need to undo instruction 5’s effects. The program should appear to execute in order from the outside.
The Reorder Buffer (ROB) solves this. It is a circular buffer that tracks all in-flight instructions in program order.
ROB (Circular Buffer):
┌─────┬────────────────────┬────────┬──────────┐
│ Idx │ Instruction │ State │ Result │
├─────┼────────────────────┼────────┼──────────┤
│ 0 │ ADD P47, P12, P33 │ Done │ 42 │ ← Head (oldest, retire next)
│ 1 │ LW P50, [P9+8] │ Wait │ — │ ← Waiting for memory
│ 2 │ SUB P48, P9, P21 │ Done │ 17 │ ← Done but cannot retire yet
│ 3 │ MUL P51, P47, P48 │ Wait │ — │
│ 4 │ XOR P52, P50, P48 │ Wait │ — │ ← Tail (newest)
└─────┴────────────────────┴────────┴──────────┘Retirement rules:
- Only the instruction at the head can retire (commit its result to architectural state).
- It can only retire if its state is Done (execution complete, no exceptions).
- Retirement happens in order, even though execution was out of order.
In the example above, ROB[0] (ADD) retires first. Then the head advances to ROB[1] (LW), but it is still waiting for memory — so no more retirements until the load completes. ROB[2] (SUB) is done but must wait its turn.
ROB size determines how far ahead the CPU can look for independent work. Larger ROB = more ILP extraction.
| CPU | ROB Entries |
|---|---|
| Intel Golden Cove (12th/13th Gen P-core) | 512 |
| AMD Zen 4 | 320 |
| Apple Avalanche (M2/M3 P-core) | 630+ |
| ARM Cortex-X3 | 320 |
2.4 Reservation Stations and Scheduling#
After renaming and ROB allocation, instructions enter Reservation Stations (RS) — a holding area where they wait until all operands are ready.
Reservation Station Entry:
┌─────┬──────┬────────┬───────┬────────┬───────┬──────┐
│ Op │ Busy │ Src1 │Ready1 │ Src2 │Ready2 │ Dest │
├─────┼──────┼────────┼───────┼────────┼───────┼──────┤
│ ADD │ 1 │P12 = 7 │ ✓ │P33 = 3 │ ✓ │ P47 │ ← Ready! Can issue
│ MUL │ 1 │P47 │ ✗ │P48 │ ✗ │ P51 │ ← Waiting for P47, P48
│ LW │ 1 │P9 = 20 │ ✓ │imm = 8 │ ✓ │ P50 │ ← Ready! Can issue
└─────┴──────┴────────┴───────┴────────┴───────┴──────┘The Wakeup-Select process runs every cycle:
Wakeup: When an execution unit produces a result (say P47 = 42), it broadcasts this on a common data bus (CDB). Every RS entry compares the broadcast tag against its pending source operands. If it matches, the entry captures the value and marks that source as Ready.
Select: Among all RS entries where both sources are Ready, the scheduler picks the oldest one (or uses other priority rules) and dispatches it to an execution unit.
Issue: The selected instruction leaves the RS and enters the execution unit.
Cycle N: ALU produces P47 = 42 → broadcasts on CDB
RS entry for MUL: P47 matches Src1 → capture 42, Ready1 = ✓
Cycle N+1: MUL still needs P48 → stays in RS
(Meanwhile, other ready instructions can issue)
Cycle N+2: ALU produces P48 = 17 → broadcasts on CDB
RS entry for MUL: P48 matches Src2 → capture 17, Ready2 = ✓
Cycle N+3: MUL is now fully ready → selected and dispatched to multiplierThis entire mechanism — rename, ROB, RS, wakeup, select — is what enables true out-of-order execution. It is the defining feature of every high-performance CPU since the Intel Pentium Pro (1995).
3. Memory Subsystem#
The memory subsystem is critical because memory access is often the bottleneck. A cache miss to DRAM costs 100–200 cycles, during which the CPU has nothing to do unless it can find other independent work.
3.1 Cache Hierarchy#
┌──────────┐
│ Core │
│ ┌──────┐ │
│ │ L1 D │ │ 32–48 KB, 4–5 cycles, 8–12 way
│ │ │ │
│ │ L1 I │ │ 32 KB, instruction cache
│ └──┬───┘ │
│ ┌──▼───┐ │
│ │ L2 │ │ 256 KB–1.25 MB, 12–14 cycles
│ └──┬───┘ │
└────┼─────┘
┌────▼─────┐
│ L3 │ 8–96 MB (shared across cores), 30–50 cycles
└────┬─────┘
│
┌────▼─────┐
│ DRAM │ 100–200+ cycles
└──────────┘| Level | Typical Size | Latency | Associativity | Shared? |
|---|---|---|---|---|
| L1D | 32–48 KB | 4–5 cycles | 8–12 way | Per core |
| L1I | 32 KB | ~4 cycles | 8 way | Per core |
| L2 | 256 KB–1.25 MB | 12–14 cycles | 8 way | Per core |
| L3 | 8–96 MB | 30–50 cycles | 16 way | Shared |
| DRAM | 16–128 GB | 100–200+ cycles | — | Shared |
3.2 How a Cache Lookup Works#
A cache is organized as a set of sets, each containing multiple ways. A memory address is split into three fields:
64-bit Address (example: 32KB L1, 64B lines, 8-way):
63 12 11 6 5 0
┌────────────────────┬───────────────┬──────────────┐
│ Tag │ Index │ Offset │
│ (52 bits) │ (6 bits) │ (6 bits) │
└────────────────────┴───────────────┴──────────────┘
│ │ │
│ │ └─→ Which byte within the 64B line?
│ └─────────────────→ Which set? (64 sets)
└────────────────────────────────────→ Does this line belong to us?Step-by-step cache lookup:
- Extract the Index bits → select a set (one of 64 sets).
- That set has 8 ways. Compare the Tag field against all 8 tags in parallel.
- If a tag matches and the Valid bit is set → Cache Hit! Use the Offset to extract the requested bytes.
- If no tag matches → Cache Miss. Fetch the line from the next level (L2). Pick a victim way to evict using the replacement policy.
3.3 Replacement Policies#
When a cache miss occurs and all ways in a set are occupied, which line should be evicted?
| Policy | How it works | Used in |
|---|---|---|
| True LRU | Track exact usage order of all ways | Small caches (low associativity) |
| Pseudo-LRU (Tree PLRU) | Binary tree approximation of LRU | L1 caches |
| RRIP | Predict re-reference interval, evict “distant” lines | L3 caches |
| Adaptive (DRRIP) | Dynamically switch between LRU-like and scan-resistant policies | Intel L3 |
Why not always use true LRU? For an 8-way cache, true LRU needs to track the ordering of 8 items = \(\log_2(8!) \approx 15\) bits per set. For a 16-way L3, this becomes impractical. Pseudo-LRU uses only \(N-1 = 7\) bits for 8 ways.
3.4 Non-Blocking Cache and MSHR#
In older CPUs, a cache miss would stall the entire pipeline until the data arrived. Modern CPUs use non-blocking caches that continue serving other requests even while a miss is outstanding.
Miss Status Holding Register (MSHR):
┌────────────────┬──────────┬──────────────────────┐
│ Miss Address │ Status │ Waiting Instructions│
├────────────────┼──────────┼──────────────────────┤
│ 0xFF00.. │ Pending │ LD P47, LD P55 │
│ 0xFF40.. │ Pending │ LD P50 │
│ (empty) │ Available│ │
│ (empty) │ Available│ │
└────────────────┴──────────┴──────────────────────┘Step by step:
- Load instruction misses in L1. An MSHR entry is allocated, recording the address and the instruction waiting for the data.
- The request is sent to L2 (or DRAM).
- Meanwhile, the CPU continues executing other instructions. If another load also misses (to a different address), a second MSHR entry is allocated → miss under miss.
- If another load misses to the same address as an outstanding miss, it joins the existing MSHR entry (no duplicate request).
- When data returns from memory, all waiting instructions in that MSHR entry are woken up.
Modern L1 caches support 10–16 outstanding misses simultaneously. This enables Memory Level Parallelism (MLP) — overlapping multiple long-latency memory accesses.
3.5 Hardware Prefetching#
Prefetchers detect memory access patterns and fetch data into the cache before the CPU requests it.
| Prefetcher | Pattern detected | Example workload |
|---|---|---|
| Next-line | Sequential access | Array traversal |
| Stride | Fixed-interval access | a[0], a[4], a[8], ... |
| Spatial | Access within a region | Struct field access |
| Temporal | Repeating irregular pattern | Pointer chasing (limited) |
Step by step (stride prefetcher):
- Observe load to address 0x1000.
- Observe load to address 0x1040 (stride = 0x40 = 64 bytes).
- Observe load to address 0x1080 (same stride confirmed).
- Now confident: prefetch 0x10C0 into cache before the CPU asks for it.
- When the CPU executes the load to 0x10C0 → cache hit instead of miss.
Prefetching is a trade-off:
$$ \text{Net benefit} = \text{Misses eliminated} \times \text{Miss penalty} - \text{Useless prefetches} \times \text{Pollution cost} $$A prefetch that brings in data the CPU never uses wastes bandwidth and may evict useful cache lines (cache pollution).
3.6 Store Buffer and Memory Ordering#
Stores do not write directly to the cache. They go through a Store Buffer, which allows the CPU to continue executing while the store waits to commit.
┌──────────┐ ┌──────────────┐ ┌─────────┐
│ Store │───→│ Store Buffer │───→│ L1 D-$ │
│ Execute │ │ (ordered) │ │ │
└──────────┘ └──────┬───────┘ └─────────┘
│
Store-to-Load Forwarding
│
┌──────────┐ ▼
│ Load │←── If same address, forward directly from Store Buffer
│ Execute │ (no need to access cache)
└──────────┘Memory ordering (x86 = Total Store Ordering):
x86 provides strong memory ordering guarantees:
| Ordering | Guaranteed? | Meaning |
|---|---|---|
| Load → Load | Yes | Loads appear in program order |
| Load → Store | Yes | A load before a store in code executes first |
| Store → Store | Yes | Stores appear in program order |
| Store → Load | No | A later load may execute before an earlier store |
The Store → Load reordering is the only relaxation in x86. If the programmer needs this ordering (e.g., in lock-free algorithms), they must insert an MFENCE instruction.
ARM uses a weaker memory model (allowing more reorderings), which gives the hardware more freedom to optimize but makes programming harder.
4. Cache Coherence#
In a multi-core CPU, each core has its own L1 (and often L2) cache. If Core 0 writes to address X and Core 1 has a cached copy of X, Core 1’s copy is now stale. Cache coherence protocols ensure all cores see a consistent view of memory.
4.1 MESI Protocol#
The most common coherence protocol. Each cache line is in one of four states:
| State | Meaning | Other cores have a copy? | Line is modified? |
|---|---|---|---|
| M (Modified) | This core has the only copy, and it has been written to | No | Yes |
| E (Exclusive) | This core has the only copy, but it is clean | No | No |
| S (Shared) | Multiple cores may have copies, all clean | Possibly | No |
| I (Invalid) | This cache line is not valid | — | — |
State transitions step by step:
Starting state: Line X is Invalid in all cores.
1. Core 0 reads X (cache miss):
→ Fetch from memory → State: Exclusive (only copy, clean)
2. Core 1 reads X (cache miss):
→ Core 0 snoops the request
→ Core 0 transitions E → Shared
→ Core 1 gets copy → State: Shared (both cores have it)
3. Core 0 writes to X:
→ Core 0 sends invalidation to Core 1
→ Core 1 transitions S → Invalid
→ Core 0 transitions S → Modified (sole dirty copy)
4. Core 1 reads X again:
→ Core 0 snoops the request
→ Core 0 writes back modified data to memory (or supplies directly)
→ Core 0 transitions M → Shared
→ Core 1 gets copy → State: Shared4.2 Snoop vs Directory-Based Coherence#
| Approach | How it works | Scalability |
|---|---|---|
| Snoop-based | Every core monitors (“snoops”) the shared bus for requests | Good up to ~8 cores |
| Directory-based | A central directory tracks which cores have each line | Scales to hundreds of cores |
Modern desktop CPUs often use a hybrid: snoop within a cluster, directory between clusters. Server CPUs (AMD EPYC, Intel Xeon) use directory-based protocols for their many cores.
5. Speculative Execution#
The CPU does not wait for branches to resolve. Based on the branch predictor’s output, it speculatively executes instructions from the predicted path.
How It Works Step by Step#
1. CPU encounters a branch instruction.
2. Branch predictor says: "Taken, target = 0x4020"
→ Save a checkpoint (snapshot of RAT, ROB state)
3. CPU fetches and executes instructions from 0x4020 (speculative)
→ These instructions go through rename, execute, and write results
→ But they do NOT retire (commit) — they stay in the ROB
4. Eventually, the branch instruction reaches the execution unit
and the true condition is evaluated.
5a. Prediction was CORRECT:
→ Discard the checkpoint
→ Speculative instructions can now retire normally
→ No penalty
5b. Prediction was WRONG (misprediction):
→ Restore the checkpoint (roll back RAT)
→ Flush all speculative instructions from the pipeline
→ Restart fetch from the correct target
→ Penalty ≈ pipeline depth × width (15–20+ cycles of wasted work)Misprediction cost:
$$ \text{Penalty} \approx \text{Pipeline depth} \approx 15\text{–}20 \text{ cycles} $$For a 6-wide CPU with a 20-cycle penalty, each misprediction wastes up to ~120 μops of work. This is why branch prediction accuracy matters so much.
Security Implication: Spectre#
Speculative execution has a security side effect. During speculation, the CPU may access data it should not (e.g., reading beyond an array boundary). Even though the results are rolled back architecturally, they leave traces in the cache (microarchitectural state). An attacker can observe these cache traces through timing measurements and infer secret data.
// Spectre-style attack (simplified)
if (x < array_size) // Predictor says: Taken
y = array2[array1[x] * 256]; // Speculatively accesses secret data
// array2 access leaves cache footprint
// After rollback, use timing to detect which array2 line was cachedHardware and software mitigations exist (retpolines, IBRS, speculative load hardening), but they come at a performance cost. This is a fundamental tension between performance and security in modern CPU design.
6. Simultaneous Multithreading (SMT)#
Even with out-of-order execution, a single thread rarely keeps all execution units busy. There are stalls from cache misses, branch mispredictions, and dependency chains. SMT fills these idle slots by interleaving instructions from multiple threads on a single physical core.
Intel calls this Hyper-Threading. Most implementations support 2 threads per core (2-way SMT).
How It Works#
Without SMT (1 thread):
Cycle: 1 2 3 4 5 6 7 8
Port 0: [ADD] [ — ] [MUL] [ — ] [ADD] [ — ] [ — ] [SUB]
Port 1: [LD ] [ — ] [ — ] [ST ] [ — ] [LD ] [ — ] [ — ]
Port 2: [ — ] [ADD] [ — ] [ — ] [ — ] [ — ] [ADD] [ — ]
Utilization: ~40-60% — many empty slots
With SMT (2 threads: T0 and T1):
Cycle: 1 2 3 4 5 6 7 8
Port 0: [T0 ] [T1 ] [T0 ] [T1 ] [T0 ] [T1 ] [T1 ] [T0 ]
Port 1: [T1 ] [T0 ] [T1 ] [T0 ] [T1 ] [T0 ] [T0 ] [T1 ]
Port 2: [T0 ] [T1 ] [T1 ] [T0 ] [T0 ] [T1 ] [T0 ] [T1 ]
Utilization: ~70-90% — idle slots filled by the other threadResource Sharing#
| Resource | Sharing strategy | Notes |
|---|---|---|
| Execution units | Competitively shared | Both threads compete for the same ALUs |
| ROB | Statically or dynamically partitioned | Each thread gets half, or allocated on demand |
| Physical registers | Partitioned | Each thread gets its own pool |
| L1 Cache | Competitively shared | Can cause thrashing |
| L2 Cache | Competitively shared | Can cause thrashing |
| TLB | Tagged (PCID) | Entries tagged with thread ID |
| Branch predictor | Shared tables | Histories may interfere |
| Fetch/decode | Alternating or shared | Front-end alternates between threads |
SMT Trade-offs#
Benefit: Higher throughput. Two threads on one SMT core typically achieve ~1.15–1.30× the throughput of a single thread (not 2×, because they compete for resources).
$$ \text{SMT Throughput Gain} \approx 15\%\text{–}30\% $$Costs:
- Each thread gets fewer resources (smaller effective ROB, fewer physical registers)
- Cache thrashing: two threads’ working sets compete for the same L1/L2
- Security: shared microarchitectural state enables side-channel attacks (Spectre, MDS)
- Latency-sensitive workloads may suffer from interference
Apple’s high-performance cores notably do not use SMT. Instead, they invest in an extremely wide pipeline (8-wide decode, 630+ ROB) to extract maximum single-thread performance. This reflects a design philosophy: rather than splitting resources between two threads, give everything to one thread and make it as fast as possible.
7. Putting It All Together: Real CPU Comparison#
Microarchitecture Parameters#
| Parameter | Intel Golden Cove | AMD Zen 4 | Apple Avalanche (M2) |
|---|---|---|---|
| Pipeline depth | ~20 stages | ~19 stages | ~16 stages |
| Decode width | 6 μops/cycle | 4 μops/cycle | 8 μops/cycle |
| Issue width | 12 ports | 6 ports | 8+ ports |
| ROB size | 512 entries | 320 entries | 630+ entries |
| Physical registers (int) | ~280 | ~224 | ~380 |
| L1D cache | 48 KB, 12-way | 32 KB, 8-way | 128 KB |
| L2 cache | 1.25 MB | 1 MB | 16 MB (shared) |
| Branch misprediction penalty | ~14 cycles | ~11 cycles | ~14 cycles |
| SMT | 2-way | 2-way | None |
Design Philosophy#
Intel Golden Cove:
Wide front-end (6-wide) + SMT + large ROB (512)
→ Balanced approach for diverse server/desktop workloads
→ SMT adds ~20% throughput for multi-threaded workloads
AMD Zen 4:
Efficient 4-wide decode + large caches + chiplet architecture
→ Excellent multi-core scaling at lower cost
→ Chiplet design allows mixing core counts flexibly
Apple Avalanche:
Ultra-wide (8-wide decode) + huge ROB (630+) + no SMT
→ Maximum single-thread performance
→ Extreme power efficiency (mobile-first design)
→ Compensates for no SMT with raw widthSummary#
| Technique | What it does | Why it matters |
|---|---|---|
| Branch prediction (TAGE) | Predicts branch direction and target | Avoids 15–20 cycle pipeline flushes |
| Micro-op cache | Caches decoded instructions | Bypasses complex x86 decoder |
| Register renaming | Maps architectural regs to physical regs | Eliminates false WAR/WAW dependencies |
| Out-of-order execution + ROB | Executes instructions when operands are ready, retires in order | Extracts ILP, hides latency |
| Reservation stations | Hold instructions until operands arrive | Enables dynamic scheduling |
| Speculative execution | Executes predicted path before branch is resolved | Hides branch resolution latency |
| Non-blocking cache + MSHR | Handles multiple cache misses simultaneously | Enables Memory Level Parallelism |
| Hardware prefetcher | Fetches data before CPU requests it | Reduces cache misses |
| Store buffer + forwarding | Buffers stores, forwards to dependent loads | Hides store latency |
| Cache coherence (MESI) | Keeps multi-core caches consistent | Correctness for shared memory |
| SMT | Runs multiple threads on one core | Fills idle execution slots |
All of these techniques serve one goal: push IPC as high as possible. A simple in-order pipeline achieves IPC ≈ 0.5–1.0. A modern out-of-order CPU achieves IPC of 4–6 on favorable workloads. This ~10× improvement is the cumulative result of decades of microarchitectural innovation.