Modern CPU Microarchitecture Deep Dive

Table of Contents

Overview
#

A modern CPU is far more than a simple fetch-decode-execute loop. Decades of microarchitectural innovation have turned it into a deeply pipelined, speculative, out-of-order execution engine. The goal is simple: maximize IPC (Instructions Per Cycle) — the number of useful instructions completed every clock cycle.

$$ \text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} $$

Clock speed (the third term) has hit physical limits around 4–5 GHz. So modern CPUs focus on reducing CPI (the second term) through architectural techniques. This post walks through each major technique in detail.

The Big Picture: Pipeline Overview
#

A modern CPU pipeline has roughly 15–20+ stages. At a high level, it divides into two halves.

┌──────────────────────────────────────────────────────────────┐
│                     CPU Pipeline                             │
│                                                              │
│   FRONT-END (supply instructions)                            │
│   ┌───────┐   ┌────────┐   ┌────────┐   ┌────────┐         │
│   │ Fetch │──→│Predecode│──→│ Decode │──→│ Rename │         │
│   └───────┘   └────────┘   └────────┘   └───┬────┘         │
│                                              │               │
│   ═══════════════════════════════════════════╪═══════════    │
│                                              │               │
│   BACK-END (execute and retire)              │               │
│                                    ┌─────────▼────────┐     │
│                                    │ Allocate (ROB)    │     │
│                                    └─────────┬────────┘     │
│                                    ┌─────────▼────────┐     │
│                                    │    Scheduler      │     │
│                                    │(Reservation Stn.) │     │
│                                    └────┬───┬───┬─────┘     │
│                                    ┌────▼┐┌─▼─┐┌▼────┐     │
│                                    │ALU ││ALU││FPU  │      │
│                                    │    ││AGU││     │      │
│                                    └────┬┘└─┬─┘└┬────┘     │
│                                    ┌────▼───▼───▼────┐      │
│                                    │  Retire (Commit) │      │
│                                    └─────────────────┘      │
└──────────────────────────────────────────────────────────────┘

Front-end responsibility: Fetch instructions from memory, decode them, and supply a steady stream of work to the back-end.

Back-end responsibility: Execute instructions as soon as their inputs are ready (possibly out of program order), then commit results in the original program order.

Let us walk through each stage.

1. Front-End: Instruction Supply
#

The front-end must deliver enough instructions every cycle to keep the execution units busy. A modern CPU tries to fetch and decode 4–8 instructions per cycle.

1.1 Instruction Fetch
#

The fetch unit reads instructions from the L1 Instruction Cache (L1I). It fetches a block of bytes each cycle (typically 16–32 bytes) and feeds them into the decode stage.

           ┌──────────────┐
 L1 I-$  ──│  Fetch Unit  │──→ Instruction Buffer
           │ (16-32B/cyc) │
           └──────┬───────┘
                  │
           ┌──────▼───────┐
           │   Branch     │
           │  Predictor   │──→ Next fetch address
           └──────────────┘

But there is a problem: the fetch unit needs to know where to fetch next. For sequential code, this is simply PC + instruction_size. But branches change the control flow. The CPU cannot wait for a branch to be resolved — that would stall the pipeline for many cycles. So it predicts.

1.2 Branch Prediction
#

Branch prediction is arguably the single most important factor for modern CPU performance. In a 20-stage pipeline, a mispredicted branch wastes ~15–20 cycles of work. Modern predictors achieve 95–99% accuracy.

Static Prediction
#

The simplest approach: always predict one direction.

Strategy	Rule	Accuracy
Always not taken	Predict fall-through	~40–50%
Always taken	Predict branch target	~60–70%
Backward taken, forward not	Loops taken, if-else not	~65–75%

These are too inaccurate for high-performance CPUs. They use dynamic prediction instead.

2-Bit Saturating Counter
#

Each branch gets a 2-bit counter that tracks history. The counter must be wrong twice in a row before the prediction flips.

State machine:

  Strongly     Weakly       Weakly      Strongly
  Not Taken ──→ Not Taken ──→ Taken ──→  Taken
    (00)          (01)        (10)        (11)
      ↑  ←─NT─    ↑  ←─NT─    ↑  ←─NT─    ↑
      └────────    └────────   └────────    │
                                            │
      ┌──T──→     ┌──T──→     ┌──T──→      │
      │            │           │            └──T──→ (stays)

Step by step for a loop that iterates 10 times:

First encounter: counter starts at 00 (predict Not Taken). Branch is Taken → wrong. Counter: 00 → 01.
Second iteration: counter at 01 (predict Not Taken). Branch is Taken → wrong. Counter: 01 → 10.
Iterations 3–10: counter at 10 or 11 (predict Taken). Branch is Taken → correct. Counter saturates at 11.
Loop exits: counter at 11 (predict Taken). Branch is Not Taken → wrong. Counter: 11 → 10.
Next loop entry: counter at 10 (predict Taken). Branch is Taken → correct again.

Result: only 2 mispredictions at the start + 1 at exit = 3 wrong out of 11 = ~73% accuracy for this branch.

GShare: Correlating Predictor
#

Many branches are correlated with other branches:

if (x > 0)       // Branch A
    if (y > 0)    // Branch B — outcome often depends on Branch A

GShare uses a Global History Register (GHR) to capture this correlation.

Step by step:

The GHR is a shift register that stores the last N branch outcomes (Taken=1, Not Taken=0).
To predict a branch, XOR the GHR with the branch’s PC address: $$\text{index} = \text{PC} \oplus \text{GHR}$$
Use this index to look up a 2-bit counter in a Pattern History Table (PHT).
After the branch resolves, shift the outcome into the GHR and update the counter.

GHR (last 8 outcomes): [1, 0, 1, 1, 0, 1, 0, 1]
PC of current branch:  [0, 1, 0, 0, 1, 1, 1, 0]
                              XOR
Index:                  [1, 1, 1, 1, 1, 0, 1, 1]
                              │
                              ▼
                    PHT[0b11111011] → 2-bit counter → prediction

This captures patterns like “if branch A was taken and branch B was not taken, then branch C is usually taken.”

TAGE: The State of the Art
#

Modern high-performance CPUs (Intel, AMD, Apple) use TAGE (TAgged GEometric history length) predictors. TAGE is the most accurate general-purpose predictor known.

Core idea: Use multiple tables, each indexed with a different history length. Longer history captures more complex patterns but needs more entries to avoid collisions.

┌──────────────────────────────────────────────────┐
│                  TAGE Predictor                   │
│                                                   │
│  Base Predictor (bimodal, no history)             │
│       │                                           │
│  Table T1: indexed with history length L1 = 4    │
│       │                                           │
│  Table T2: indexed with history length L2 = 8    │
│       │                                           │
│  Table T3: indexed with history length L3 = 16   │
│       │                                           │
│  Table T4: indexed with history length L4 = 32   │
│       │                                           │
│  Table T5: indexed with history length L5 = 64   │
│       │                                           │
│  Table T6: indexed with history length L6 = 128  │
│                                                   │
│  History lengths grow geometrically:              │
│  Lᵢ = Lᵢ₋₁ × α,  where α ≈ 2                   │
└──────────────────────────────────────────────────┘

Step by step:

For a given branch, look up ALL tables in parallel.
Each table entry has a tag to detect if the lookup is a true match or an alias.
Find the table with the longest matching history.
Use that table’s prediction. If no table matches, fall back to the base predictor.
After resolution, update the matching table (and possibly allocate in a longer-history table on misprediction).

$$ L_i = (1 + \alpha)^{i-1} \times L_1 \quad \text{where } \alpha \approx 2 $$

The geometric spacing ensures coverage of both short patterns (like simple loops) and very long patterns (like nested loops or correlated branches far apart).

Branch Target Buffer (BTB)
#

Predicting taken/not-taken is only half the problem. The CPU also needs to predict the target address of taken branches.

Branch type	Target prediction
Conditional	BTB lookup (PC → target address)
Direct call	BTB lookup
Return	Return Address Stack (RAS) — a hardware LIFO stack
Indirect jump	Indirect Target Array (history-based prediction)

BTB Entry:
┌──────────┬────────────────┬──────┐
│ PC Tag   │ Target Address │ Type │
├──────────┼────────────────┼──────┤
│ 0x4010.. │   0x4020..     │ COND │
│ 0x4050.. │   0x3000..     │ CALL │
│ 0x4080.. │   (use RAS)    │ RET  │
└──────────┴────────────────┴──────┘

The RAS is elegant: every CALL pushes the return address, every RET pops it. Since call/return patterns are strictly nested, prediction accuracy is nearly 100% — unless the stack overflows.

1.3 Decode: x86 and Micro-ops
#

x86 instructions are variable-length (1–15 bytes) and can be extremely complex. Modern x86 CPUs do not execute x86 instructions directly. Instead, the decode stage translates them into fixed-length micro-operations (μops).

x86 Instruction              → Micro-ops (μops)
─────────────────────────────────────────────────
ADD RAX, RBX                 → 1 μop  (simple register add)
ADD RAX, [RBX+8]             → 2 μops (load + add)
PUSH RBP                     → 2 μops (decrement RSP + store)
REP MOVSB (copy N bytes)     → N μops (one per byte)
CALL [RAX]                   → 3 μops (load target + push return + jump)

Why do this?

μops are fixed-size → easier to schedule and pipeline
The back-end only sees a uniform stream of simple operations
Complex x86 semantics are handled once in the decoder

Micro-op Cache (μop Cache / DSB):

Intel calls it the Decoded Stream Buffer. It caches already-decoded μops so that hot loops bypass the decoder entirely.

First execution of a loop:
  L1 I-$ → [Decoder] → μop Cache → Back-end
                          ↓ (cache the μops)

Subsequent iterations:
  μop Cache → Back-end  (decoder bypassed, saving power and latency)

2. Back-End: Out-of-Order Execution
#

The back-end is where the CPU’s real power lies. It executes instructions out of program order based on data availability, then commits results in program order.

2.1 Why Out-of-Order?
#

Consider this code:

LW   R1, [R2+0]     ; Load R1 from memory (may take 100+ cycles on cache miss!)
ADD  R3, R1, R4      ; R3 = R1 + R4  (depends on instruction 1)
MUL  R5, R6, R7      ; R5 = R6 × R7  (independent of 1 and 2!)
SUB  R8, R5, R9      ; R8 = R5 - R9  (depends on instruction 3)

In-order execution:

Cycle 1:   LW starts (cache miss — will take 100 cycles)
Cycle 2-100: STALL — waiting for LW to complete
Cycle 101: ADD executes (R1 now available)
Cycle 102: MUL executes
Cycle 103: SUB executes
Total: ~103 cycles

Out-of-order execution:

Cycle 1:   LW starts (cache miss)
Cycle 2:   MUL executes (R6 and R7 are ready — no need to wait!)
Cycle 3:   SUB executes (R5 now ready from MUL)
...
Cycle 100: LW data arrives
Cycle 101: ADD executes (R1 now available)
Total: ~101 cycles — MUL and SUB were "free"

The CPU overlapped independent work with the long-latency load. This is the fundamental benefit of out-of-order execution.

2.2 Register Renaming
#

Before instructions can execute out of order, we must solve the problem of false dependencies.

True dependency (Read After Write, RAW):

ADD R1, R2, R3    ; Writes R1
SUB R4, R1, R5    ; Reads R1 — must wait for ADD to finish

This is a real data flow dependency. The SUB genuinely needs the result of ADD.

False dependency (Write After Write, WAW):

MUL R1, R2, R3    ; Writes R1
ADD R1, R4, R5    ; Also writes R1 — but this is a DIFFERENT computation!

False dependency (Write After Read, WAR):

ADD R4, R1, R5    ; Reads R1
SUB R1, R6, R7    ; Writes R1 — different computation, but same register name

WAW and WAR are name dependencies, not data flow dependencies. They exist only because the ISA has a limited number of register names (16 in x86-64).

Solution: Register Renaming

The CPU maintains a large pool of physical registers (typically 200–400) and maps architectural register names to physical register numbers.

Step by step:

Instruction enters the rename stage.
For each source register, look up the current physical register mapping in the Register Alias Table (RAT).
For each destination register, allocate a new physical register from the free list.
Update the RAT to point the architectural register name to the new physical register.

Before renaming:                    After renaming:
  MUL R1, R2, R3                    MUL P47, P12, P33
  ADD R1, R4, R5                    ADD P48, P9, P21

RAT State:
  R1 → P47 (after MUL)
  R1 → P48 (after ADD, overrides)   ← P47 and P48 are different!
  R2 → P12
  R3 → P33
  R4 → P9
  R5 → P21

Now MUL writes to P47 and ADD writes to P48. They are completely independent and can execute simultaneously. The false WAW dependency is gone.

2.3 The Reorder Buffer (ROB)
#

Out-of-order execution creates a problem: if instruction 5 completes before instruction 3, and instruction 3 later causes an exception (like a page fault), we need to undo instruction 5’s effects. The program should appear to execute in order from the outside.

The Reorder Buffer (ROB) solves this. It is a circular buffer that tracks all in-flight instructions in program order.

ROB (Circular Buffer):
┌─────┬────────────────────┬────────┬──────────┐
│ Idx │    Instruction     │ State  │  Result  │
├─────┼────────────────────┼────────┼──────────┤
│  0  │ ADD P47, P12, P33  │  Done  │  42      │ ← Head (oldest, retire next)
│  1  │ LW  P50, [P9+8]   │  Wait  │  —       │ ← Waiting for memory
│  2  │ SUB P48, P9, P21   │  Done  │  17      │ ← Done but cannot retire yet
│  3  │ MUL P51, P47, P48  │  Wait  │  —       │
│  4  │ XOR P52, P50, P48  │  Wait  │  —       │ ← Tail (newest)
└─────┴────────────────────┴────────┴──────────┘

Retirement rules:

Only the instruction at the head can retire (commit its result to architectural state).
It can only retire if its state is Done (execution complete, no exceptions).
Retirement happens in order, even though execution was out of order.

In the example above, ROB[0] (ADD) retires first. Then the head advances to ROB[1] (LW), but it is still waiting for memory — so no more retirements until the load completes. ROB[2] (SUB) is done but must wait its turn.

ROB size determines how far ahead the CPU can look for independent work. Larger ROB = more ILP extraction.

CPU	ROB Entries
Intel Golden Cove (12th/13th Gen P-core)	512
AMD Zen 4	320
Apple Avalanche (M2/M3 P-core)	630+
ARM Cortex-X3	320

2.4 Reservation Stations and Scheduling
#

After renaming and ROB allocation, instructions enter Reservation Stations (RS) — a holding area where they wait until all operands are ready.

Reservation Station Entry:
┌─────┬──────┬────────┬───────┬────────┬───────┬──────┐
│ Op  │ Busy │  Src1  │Ready1 │  Src2  │Ready2 │ Dest │
├─────┼──────┼────────┼───────┼────────┼───────┼──────┤
│ ADD │   1  │P12 = 7 │  ✓    │P33 = 3 │  ✓    │ P47  │ ← Ready! Can issue
│ MUL │   1  │P47     │  ✗    │P48     │  ✗    │ P51  │ ← Waiting for P47, P48
│ LW  │   1  │P9 = 20 │  ✓    │imm = 8 │  ✓    │ P50  │ ← Ready! Can issue
└─────┴──────┴────────┴───────┴────────┴───────┴──────┘

The Wakeup-Select process runs every cycle:

Wakeup: When an execution unit produces a result (say P47 = 42), it broadcasts this on a common data bus (CDB). Every RS entry compares the broadcast tag against its pending source operands. If it matches, the entry captures the value and marks that source as Ready.
Select: Among all RS entries where both sources are Ready, the scheduler picks the oldest one (or uses other priority rules) and dispatches it to an execution unit.
Issue: The selected instruction leaves the RS and enters the execution unit.

Cycle N:   ALU produces P47 = 42  → broadcasts on CDB
           RS entry for MUL: P47 matches Src1 → capture 42, Ready1 = ✓

Cycle N+1: MUL still needs P48 → stays in RS
           (Meanwhile, other ready instructions can issue)

Cycle N+2: ALU produces P48 = 17 → broadcasts on CDB
           RS entry for MUL: P48 matches Src2 → capture 17, Ready2 = ✓

Cycle N+3: MUL is now fully ready → selected and dispatched to multiplier

This entire mechanism — rename, ROB, RS, wakeup, select — is what enables true out-of-order execution. It is the defining feature of every high-performance CPU since the Intel Pentium Pro (1995).

3. Memory Subsystem
#

The memory subsystem is critical because memory access is often the bottleneck. A cache miss to DRAM costs 100–200 cycles, during which the CPU has nothing to do unless it can find other independent work.

3.1 Cache Hierarchy
#

         ┌──────────┐
         │   Core   │
         │ ┌──────┐ │
         │ │ L1 D │ │  32–48 KB, 4–5 cycles, 8–12 way
         │ │      │ │
         │ │ L1 I │ │  32 KB, instruction cache
         │ └──┬───┘ │
         │ ┌──▼───┐ │
         │ │  L2  │ │  256 KB–1.25 MB, 12–14 cycles
         │ └──┬───┘ │
         └────┼─────┘
         ┌────▼─────┐
         │    L3    │  8–96 MB (shared across cores), 30–50 cycles
         └────┬─────┘
              │
         ┌────▼─────┐
         │   DRAM   │  100–200+ cycles
         └──────────┘

Level	Typical Size	Latency	Associativity	Shared?
L1D	32–48 KB	4–5 cycles	8–12 way	Per core
L1I	32 KB	~4 cycles	8 way	Per core
L2	256 KB–1.25 MB	12–14 cycles	8 way	Per core
L3	8–96 MB	30–50 cycles	16 way	Shared
DRAM	16–128 GB	100–200+ cycles	—	Shared

3.2 How a Cache Lookup Works
#

A cache is organized as a set of sets, each containing multiple ways. A memory address is split into three fields:

64-bit Address (example: 32KB L1, 64B lines, 8-way):

 63                 12  11           6  5          0
┌────────────────────┬───────────────┬──────────────┐
│        Tag         │    Index      │    Offset    │
│     (52 bits)      │   (6 bits)    │   (6 bits)   │
└────────────────────┴───────────────┴──────────────┘
         │                  │               │
         │                  │               └─→ Which byte within the 64B line?
         │                  └─────────────────→ Which set? (64 sets)
         └────────────────────────────────────→ Does this line belong to us?

Step-by-step cache lookup:

Extract the Index bits → select a set (one of 64 sets).
That set has 8 ways. Compare the Tag field against all 8 tags in parallel.
If a tag matches and the Valid bit is set → Cache Hit! Use the Offset to extract the requested bytes.
If no tag matches → Cache Miss. Fetch the line from the next level (L2). Pick a victim way to evict using the replacement policy.

3.3 Replacement Policies
#

When a cache miss occurs and all ways in a set are occupied, which line should be evicted?

Policy	How it works	Used in
True LRU	Track exact usage order of all ways	Small caches (low associativity)
Pseudo-LRU (Tree PLRU)	Binary tree approximation of LRU	L1 caches
RRIP	Predict re-reference interval, evict “distant” lines	L3 caches
Adaptive (DRRIP)	Dynamically switch between LRU-like and scan-resistant policies	Intel L3

Why not always use true LRU? For an 8-way cache, true LRU needs to track the ordering of 8 items = $\log_2(8!) \approx 15$ bits per set. For a 16-way L3, this becomes impractical. Pseudo-LRU uses only $N-1 = 7$ bits for 8 ways.

3.4 Non-Blocking Cache and MSHR
#

In older CPUs, a cache miss would stall the entire pipeline until the data arrived. Modern CPUs use non-blocking caches that continue serving other requests even while a miss is outstanding.

Miss Status Holding Register (MSHR):
┌────────────────┬──────────┬──────────────────────┐
│  Miss Address  │  Status  │  Waiting Instructions│
├────────────────┼──────────┼──────────────────────┤
│   0xFF00..     │ Pending  │ LD P47, LD P55       │
│   0xFF40..     │ Pending  │ LD P50               │
│   (empty)      │ Available│                      │
│   (empty)      │ Available│                      │
└────────────────┴──────────┴──────────────────────┘

Step by step:

Load instruction misses in L1. An MSHR entry is allocated, recording the address and the instruction waiting for the data.
The request is sent to L2 (or DRAM).
Meanwhile, the CPU continues executing other instructions. If another load also misses (to a different address), a second MSHR entry is allocated → miss under miss.
If another load misses to the same address as an outstanding miss, it joins the existing MSHR entry (no duplicate request).
When data returns from memory, all waiting instructions in that MSHR entry are woken up.

Modern L1 caches support 10–16 outstanding misses simultaneously. This enables Memory Level Parallelism (MLP) — overlapping multiple long-latency memory accesses.

3.5 Hardware Prefetching
#

Prefetchers detect memory access patterns and fetch data into the cache before the CPU requests it.

Prefetcher	Pattern detected	Example workload
Next-line	Sequential access	Array traversal
Stride	Fixed-interval access	`a[0], a[4], a[8], ...`
Spatial	Access within a region	Struct field access
Temporal	Repeating irregular pattern	Pointer chasing (limited)

Step by step (stride prefetcher):

Observe load to address 0x1000.
Observe load to address 0x1040 (stride = 0x40 = 64 bytes).
Observe load to address 0x1080 (same stride confirmed).
Now confident: prefetch 0x10C0 into cache before the CPU asks for it.
When the CPU executes the load to 0x10C0 → cache hit instead of miss.

Prefetching is a trade-off:

$$ \text{Net benefit} = \text{Misses eliminated} \times \text{Miss penalty} - \text{Useless prefetches} \times \text{Pollution cost} $$

A prefetch that brings in data the CPU never uses wastes bandwidth and may evict useful cache lines (cache pollution).

3.6 Store Buffer and Memory Ordering
#

Stores do not write directly to the cache. They go through a Store Buffer, which allows the CPU to continue executing while the store waits to commit.

┌──────────┐    ┌──────────────┐    ┌─────────┐
│  Store   │───→│ Store Buffer │───→│ L1 D-$  │
│ Execute  │    │  (ordered)   │    │         │
└──────────┘    └──────┬───────┘    └─────────┘
                       │
          Store-to-Load Forwarding
                       │
┌──────────┐           ▼
│   Load   │←── If same address, forward directly from Store Buffer
│ Execute  │    (no need to access cache)
└──────────┘

Memory ordering (x86 = Total Store Ordering):

x86 provides strong memory ordering guarantees:

Ordering	Guaranteed?	Meaning
Load → Load	Yes	Loads appear in program order
Load → Store	Yes	A load before a store in code executes first
Store → Store	Yes	Stores appear in program order
Store → Load	No	A later load may execute before an earlier store

The Store → Load reordering is the only relaxation in x86. If the programmer needs this ordering (e.g., in lock-free algorithms), they must insert an MFENCE instruction.

ARM uses a weaker memory model (allowing more reorderings), which gives the hardware more freedom to optimize but makes programming harder.

4. Cache Coherence
#

In a multi-core CPU, each core has its own L1 (and often L2) cache. If Core 0 writes to address X and Core 1 has a cached copy of X, Core 1’s copy is now stale. Cache coherence protocols ensure all cores see a consistent view of memory.

4.1 MESI Protocol
#

The most common coherence protocol. Each cache line is in one of four states:

State	Meaning	Other cores have a copy?	Line is modified?
M (Modified)	This core has the only copy, and it has been written to	No	Yes
E (Exclusive)	This core has the only copy, but it is clean	No	No
S (Shared)	Multiple cores may have copies, all clean	Possibly	No
I (Invalid)	This cache line is not valid	—	—

State transitions step by step:

Starting state: Line X is Invalid in all cores.

1. Core 0 reads X (cache miss):
   → Fetch from memory → State: Exclusive (only copy, clean)

2. Core 1 reads X (cache miss):
   → Core 0 snoops the request
   → Core 0 transitions E → Shared
   → Core 1 gets copy → State: Shared (both cores have it)

3. Core 0 writes to X:
   → Core 0 sends invalidation to Core 1
   → Core 1 transitions S → Invalid
   → Core 0 transitions S → Modified (sole dirty copy)

4. Core 1 reads X again:
   → Core 0 snoops the request
   → Core 0 writes back modified data to memory (or supplies directly)
   → Core 0 transitions M → Shared
   → Core 1 gets copy → State: Shared

4.2 Snoop vs Directory-Based Coherence
#

Approach	How it works	Scalability
Snoop-based	Every core monitors (“snoops”) the shared bus for requests	Good up to ~8 cores
Directory-based	A central directory tracks which cores have each line	Scales to hundreds of cores

Modern desktop CPUs often use a hybrid: snoop within a cluster, directory between clusters. Server CPUs (AMD EPYC, Intel Xeon) use directory-based protocols for their many cores.

5. Speculative Execution
#

The CPU does not wait for branches to resolve. Based on the branch predictor’s output, it speculatively executes instructions from the predicted path.

How It Works Step by Step
#

1. CPU encounters a branch instruction.

2. Branch predictor says: "Taken, target = 0x4020"
   → Save a checkpoint (snapshot of RAT, ROB state)

3. CPU fetches and executes instructions from 0x4020 (speculative)
   → These instructions go through rename, execute, and write results
   → But they do NOT retire (commit) — they stay in the ROB

4. Eventually, the branch instruction reaches the execution unit
   and the true condition is evaluated.

5a. Prediction was CORRECT:
    → Discard the checkpoint
    → Speculative instructions can now retire normally
    → No penalty

5b. Prediction was WRONG (misprediction):
    → Restore the checkpoint (roll back RAT)
    → Flush all speculative instructions from the pipeline
    → Restart fetch from the correct target
    → Penalty ≈ pipeline depth × width (15–20+ cycles of wasted work)

Misprediction cost:

$$ \text{Penalty} \approx \text{Pipeline depth} \approx 15\text{–}20 \text{ cycles} $$

For a 6-wide CPU with a 20-cycle penalty, each misprediction wastes up to ~120 μops of work. This is why branch prediction accuracy matters so much.

Security Implication: Spectre
#

Speculative execution has a security side effect. During speculation, the CPU may access data it should not (e.g., reading beyond an array boundary). Even though the results are rolled back architecturally, they leave traces in the cache (microarchitectural state). An attacker can observe these cache traces through timing measurements and infer secret data.

// Spectre-style attack (simplified)
if (x < array_size)                // Predictor says: Taken
    y = array2[array1[x] * 256];   // Speculatively accesses secret data
                                    // array2 access leaves cache footprint
// After rollback, use timing to detect which array2 line was cached

Hardware and software mitigations exist (retpolines, IBRS, speculative load hardening), but they come at a performance cost. This is a fundamental tension between performance and security in modern CPU design.

6. Simultaneous Multithreading (SMT)
#

Even with out-of-order execution, a single thread rarely keeps all execution units busy. There are stalls from cache misses, branch mispredictions, and dependency chains. SMT fills these idle slots by interleaving instructions from multiple threads on a single physical core.

Intel calls this Hyper-Threading. Most implementations support 2 threads per core (2-way SMT).

How It Works
#

Without SMT (1 thread):
Cycle:   1     2     3     4     5     6     7     8
Port 0: [ADD] [ — ] [MUL] [ — ] [ADD] [ — ] [ — ] [SUB]
Port 1: [LD ] [ — ] [ — ] [ST ] [ — ] [LD ] [ — ] [ — ]
Port 2: [ — ] [ADD] [ — ] [ — ] [ — ] [ — ] [ADD] [ — ]
  Utilization: ~40-60% — many empty slots

With SMT (2 threads: T0 and T1):
Cycle:   1     2     3     4     5     6     7     8
Port 0: [T0 ] [T1 ] [T0 ] [T1 ] [T0 ] [T1 ] [T1 ] [T0 ]
Port 1: [T1 ] [T0 ] [T1 ] [T0 ] [T1 ] [T0 ] [T0 ] [T1 ]
Port 2: [T0 ] [T1 ] [T1 ] [T0 ] [T0 ] [T1 ] [T0 ] [T1 ]
  Utilization: ~70-90% — idle slots filled by the other thread

Resource Sharing
#

Resource	Sharing strategy	Notes
Execution units	Competitively shared	Both threads compete for the same ALUs
ROB	Statically or dynamically partitioned	Each thread gets half, or allocated on demand
Physical registers	Partitioned	Each thread gets its own pool
L1 Cache	Competitively shared	Can cause thrashing
L2 Cache	Competitively shared	Can cause thrashing
TLB	Tagged (PCID)	Entries tagged with thread ID
Branch predictor	Shared tables	Histories may interfere
Fetch/decode	Alternating or shared	Front-end alternates between threads

SMT Trade-offs
#

Benefit: Higher throughput. Two threads on one SMT core typically achieve ~1.15–1.30× the throughput of a single thread (not 2×, because they compete for resources).

$$ \text{SMT Throughput Gain} \approx 15\%\text{–}30\% $$

Costs:

Each thread gets fewer resources (smaller effective ROB, fewer physical registers)
Cache thrashing: two threads’ working sets compete for the same L1/L2
Security: shared microarchitectural state enables side-channel attacks (Spectre, MDS)
Latency-sensitive workloads may suffer from interference

Apple’s high-performance cores notably do not use SMT. Instead, they invest in an extremely wide pipeline (8-wide decode, 630+ ROB) to extract maximum single-thread performance. This reflects a design philosophy: rather than splitting resources between two threads, give everything to one thread and make it as fast as possible.

7. Putting It All Together: Real CPU Comparison
#

Microarchitecture Parameters
#

Parameter	Intel Golden Cove	AMD Zen 4	Apple Avalanche (M2)
Pipeline depth	~20 stages	~19 stages	~16 stages
Decode width	6 μops/cycle	4 μops/cycle	8 μops/cycle
Issue width	12 ports	6 ports	8+ ports
ROB size	512 entries	320 entries	630+ entries
Physical registers (int)	~280	~224	~380
L1D cache	48 KB, 12-way	32 KB, 8-way	128 KB
L2 cache	1.25 MB	1 MB	16 MB (shared)
Branch misprediction penalty	~14 cycles	~11 cycles	~14 cycles
SMT	2-way	2-way	None

Design Philosophy
#

Intel Golden Cove:
  Wide front-end (6-wide) + SMT + large ROB (512)
  → Balanced approach for diverse server/desktop workloads
  → SMT adds ~20% throughput for multi-threaded workloads

AMD Zen 4:
  Efficient 4-wide decode + large caches + chiplet architecture
  → Excellent multi-core scaling at lower cost
  → Chiplet design allows mixing core counts flexibly

Apple Avalanche:
  Ultra-wide (8-wide decode) + huge ROB (630+) + no SMT
  → Maximum single-thread performance
  → Extreme power efficiency (mobile-first design)
  → Compensates for no SMT with raw width

Summary
#

Technique	What it does	Why it matters
Branch prediction (TAGE)	Predicts branch direction and target	Avoids 15–20 cycle pipeline flushes
Micro-op cache	Caches decoded instructions	Bypasses complex x86 decoder
Register renaming	Maps architectural regs to physical regs	Eliminates false WAR/WAW dependencies
Out-of-order execution + ROB	Executes instructions when operands are ready, retires in order	Extracts ILP, hides latency
Reservation stations	Hold instructions until operands arrive	Enables dynamic scheduling
Speculative execution	Executes predicted path before branch is resolved	Hides branch resolution latency
Non-blocking cache + MSHR	Handles multiple cache misses simultaneously	Enables Memory Level Parallelism
Hardware prefetcher	Fetches data before CPU requests it	Reduces cache misses
Store buffer + forwarding	Buffers stores, forwards to dependent loads	Hides store latency
Cache coherence (MESI)	Keeps multi-core caches consistent	Correctness for shared memory
SMT	Runs multiple threads on one core	Fills idle execution slots

All of these techniques serve one goal: push IPC as high as possible. A simple in-order pipeline achieves IPC ≈ 0.5–1.0. A modern out-of-order CPU achieves IPC of 4–6 on favorable workloads. This ~10× improvement is the cumulative result of decades of microarchitectural innovation.

Overview#

The Big Picture: Pipeline Overview#

1. Front-End: Instruction Supply#

1.1 Instruction Fetch#

1.2 Branch Prediction#

Static Prediction#

2-Bit Saturating Counter#

GShare: Correlating Predictor#

TAGE: The State of the Art#

Branch Target Buffer (BTB)#

1.3 Decode: x86 and Micro-ops#

2. Back-End: Out-of-Order Execution#

2.1 Why Out-of-Order?#

2.2 Register Renaming#

2.3 The Reorder Buffer (ROB)#

2.4 Reservation Stations and Scheduling#

3. Memory Subsystem#

3.1 Cache Hierarchy#

3.2 How a Cache Lookup Works#

3.3 Replacement Policies#

3.4 Non-Blocking Cache and MSHR#

3.5 Hardware Prefetching#

3.6 Store Buffer and Memory Ordering#

4. Cache Coherence#

4.1 MESI Protocol#

4.2 Snoop vs Directory-Based Coherence#

5. Speculative Execution#

How It Works Step by Step#

Security Implication: Spectre#

6. Simultaneous Multithreading (SMT)#

How It Works#

Resource Sharing#

SMT Trade-offs#

7. Putting It All Together: Real CPU Comparison#

Microarchitecture Parameters#

Design Philosophy#

Summary#

Overview
#

The Big Picture: Pipeline Overview
#

1. Front-End: Instruction Supply
#

1.1 Instruction Fetch
#

1.2 Branch Prediction
#

Static Prediction
#

2-Bit Saturating Counter
#

GShare: Correlating Predictor
#

TAGE: The State of the Art
#

Branch Target Buffer (BTB)
#

1.3 Decode: x86 and Micro-ops
#

2. Back-End: Out-of-Order Execution
#

2.1 Why Out-of-Order?
#

2.2 Register Renaming
#

2.3 The Reorder Buffer (ROB)
#

2.4 Reservation Stations and Scheduling
#

3. Memory Subsystem
#

3.1 Cache Hierarchy
#

3.2 How a Cache Lookup Works
#

3.3 Replacement Policies
#

3.4 Non-Blocking Cache and MSHR
#

3.5 Hardware Prefetching
#

3.6 Store Buffer and Memory Ordering
#

4. Cache Coherence
#

4.1 MESI Protocol
#

4.2 Snoop vs Directory-Based Coherence
#

5. Speculative Execution
#

How It Works Step by Step
#

Security Implication: Spectre
#

6. Simultaneous Multithreading (SMT)
#

How It Works
#

Resource Sharing
#

SMT Trade-offs
#

7. Putting It All Together: Real CPU Comparison
#

Microarchitecture Parameters
#

Design Philosophy
#

Summary
#