Table of Contents

Overview
#

Multi-processor systems use multiple processing units to achieve higher performance through parallel execution. This approach became essential after single-core scaling hit the power wall.

Types of Parallel Systems
#

Flynn’s Taxonomy
#

TypeDescriptionExample
SISDSingle Instruction, Single DataTraditional uniprocessor
SIMDSingle Instruction, Multiple DataGPU, vector processors
MISDMultiple Instruction, Single DataRare (fault tolerance)
MIMDMultiple Instruction, Multiple DataMulti-core CPUs

Shared Memory Architecture
#

Uniform Memory Access (UMA)
#

┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│CPU 0│ │CPU 1│ │CPU 2│ │CPU 3│
└──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘
   │       │       │       │
   └───────┴───┬───┴───────┘
        ┌──────┴──────┐
        │  Shared Bus │
        └──────┬──────┘
        ┌──────┴──────┐
        │   Memory    │
        └─────────────┘

All processors have equal access time to memory.

Non-Uniform Memory Access (NUMA)
#

┌─────────────────┐     ┌─────────────────┐
│     Node 0      │     │     Node 1      │
│ ┌───┐   ┌───┐   │     │ ┌───┐   ┌───┐   │
│ │CPU│   │CPU│   │     │ │CPU│   │CPU│   │
│ └─┬─┘   └─┬─┘   │     │ └─┬─┘   └─┬─┘   │
│   └───┬───┘     │     │   └───┬───┘     │
│   ┌───┴───┐     │←───→│   ┌───┴───┐     │
│   │Local  │     │     │   │Local  │     │
│   │Memory │     │     │   │Memory │     │
│   └───────┘     │     │   └───────┘     │
└─────────────────┘     └─────────────────┘

Local memory: fast access Remote memory: slower access

Cache Coherence
#

The Problem
#

Multiple caches may hold copies of same data:

CPU 0 Cache: X = 5
CPU 1 Cache: X = 5
Memory:      X = 5

CPU 0 writes X = 10
CPU 0 Cache: X = 10
CPU 1 Cache: X = 5  ← Stale!
Memory:      X = 5

Coherence Protocols
#

MSI Protocol States:

  • Modified (M): Exclusive, dirty
  • Shared (S): Clean, may be in other caches
  • Invalid (I): Not valid

MESI Protocol (adds Exclusive):

  • Exclusive (E): Clean, only copy

Snooping
#

Each cache monitors bus transactions:

CPU 0 writes X
Bus broadcast: "Writing X"
CPU 1 snoops, invalidates its copy

Directory-Based
#

Central directory tracks which caches have each line:

Directory entry for X:
- Present in: CPU 0, CPU 2
- State: Shared

CPU 0 wants to write:
- Send invalidate to CPU 2
- Update directory
- Grant write permission

Memory Consistency
#

Sequential Consistency
#

All processors see same order of operations.

Most intuitive, but limits optimizations.

Relaxed Consistency
#

Allow reordering for performance:

  • Writes may be buffered
  • Reads may bypass writes
  • Memory barriers needed

Synchronization
#

Atomic Operations
#

// Compare and Swap
int compare_and_swap(int *ptr, int old, int new) {
    atomic {
        if (*ptr == old) {
            *ptr = new;
            return 1;
        }
        return 0;
    }
}

Lock Implementation
#

void acquire_lock(int *lock) {
    while (!compare_and_swap(lock, 0, 1)) {
        // Spin or yield
    }
}

void release_lock(int *lock) {
    *lock = 0;
}

Scalability
#

Amdahl’s Law for Parallel Systems
#

$$ \text{Speedup} = \frac{1}{(1-p) + \frac{p}{n}} $$

Where:

  • \(p\): Parallel fraction
  • \(n\): Number of processors

Gustafson’s Law
#

With larger problems:

$$ \text{Speedup} = (1-p) + p \cdot n $$

Linear scaling possible with scaled workloads.

Multi-Core vs Multi-Processor
#

AspectMulti-CoreMulti-Processor
LocationSame chipSeparate chips
Cache sharingOften L3 sharedTypically separate
MemorySingle controllerMultiple controllers
CommunicationFast on-chipSlower off-chip
CostLowerHigher

Performance Considerations
#

Bottlenecks
#

  1. Memory bandwidth: Limited shared resource
  2. Cache contention: False sharing
  3. Synchronization: Lock overhead
  4. Load imbalance: Idle processors

False Sharing
#

// Bad: arr[0] and arr[1] likely same cache line
thread 0: writes arr[0]
thread 1: writes arr[1]
// Constant invalidation!

// Good: Pad to separate cache lines
struct padded {
    int value;
    char padding[60];  // 64-byte cache line
};

Summary
#

ConceptKey Point
Shared memoryCommon address space
Cache coherenceKeep caches consistent
Memory consistencyDefine operation ordering
SynchronizationCoordinate access
ScalabilityAmdahl limits parallel speedup