Overview

The evolution of sequence modeling: RNN → LSTM → Transformer → LLM

RNN (Recurrent Neural Network)

Architecture

x_t → [Hidden State h_t] → y_t
           ↑ ↓
         h_{t-1}

Equations

$$ h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) $$

$$ y_t = W_{hy} h_t + b_y $$
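
The two equations map almost line for line onto code. Below is a minimal NumPy sketch of one recurrent step and a loop over a short sequence. The function name `rnn_step`, the layer sizes, and the random initialization are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of the vanilla RNN equations above."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # new hidden state
    y_t = W_hy @ h_t + b_y                           # output for this step
    return h_t, y_t

# Hypothetical sizes, chosen only for illustration.
input_size, hidden_size, output_size = 3, 4, 2
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

# Process a sequence of 5 inputs, carrying the hidden state forward.
xs = rng.normal(size=(5, input_size))
h = np.zeros(hidden_size)
outputs = []
for x_t in xs:
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
    outputs.append(y)
outputs = np.stack(outputs)  # shape: (sequence length, output_size)
```

The loop is the sequential bottleneck: step t cannot start until step t-1 has produced its hidden state.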

Problems

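| Issue | Description |
| --- | --- |
| Vanishing gradient | Gradients shrink exponentially as they are backpropagated through many time steps |
| Exploding gradient | Gradients grow exponentially as they are backpropagated through many time steps |
| Short memory | Information from early inputs is hard to retain over long sequences |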
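
The exponential behavior comes from multiplying by the recurrent weight matrix once per time step during backpropagation. The sketch below is a simplified, linearized illustration: it assumes the tanh derivative is 1 (its maximum, so the real gradient shrinks at least this fast, while real growth may be damped by saturation) and uses a hypothetical `W_hh` whose singular values all equal `scale`. The helper name `backprop_norms` is also an assumption for illustration.

```python
import numpy as np

def backprop_norms(scale, steps=30, hidden_size=8, seed=0):
    """Spectral norm of the accumulated Jacobian after each backward step."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix times a scalar: every singular value equals `scale`.
    Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
    W_hh = scale * Q
    grad = np.eye(hidden_size)
    norms = []
    for _ in range(steps):
        grad = W_hh.T @ grad  # one step back in time, tanh' taken as 1
        norms.append(np.linalg.norm(grad, 2))
    return norms

vanishing = backprop_norms(scale=0.5)  # norm after k steps is 0.5 ** k
exploding = backprop_norms(scale=1.5)  # norm after k steps is 1.5 ** k
```

With singular values below 1 the gradient signal decays geometrically, and above 1 it grows geometrically, which is why plain RNNs are hard to train over long sequences.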

LSTM (Long Short-Term Memory)

Key Innovation: Gates

         ┌──────────────────────────────┐
x_t ────→│  Forget │ Input │ Output    │────→ h_t
h_{t-1} →│   Gate  │ Gate  │  Gate     │
         └──────────────────────────────┘
                  Cell State c_t

Equations

Forget Gate:

$$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) $$

Input Gate:

$$ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) $$

Cell Update:

$$ \tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) $$

$$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t $$

Output Gate:

$$ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) $$

$$ h_t = o_t \odot \tanh(c_t) $$
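
The gate equations can be sketched as a single NumPy step. This is a minimal illustration, not a reference implementation: the names `lstm_step` and `init_params`, the parameter ordering, and the sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the forget/input/cell/output equations above."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # additive cell update
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # new hidden state
    return h_t, c_t

def init_params(input_size, hidden_size, rng):
    """Hypothetical initializer: four (W, b) pairs in the order f, i, c, o."""
    params = []
    for _ in range(4):
        params.append(rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size)))
        params.append(np.zeros(hidden_size))
    return tuple(params)

input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)
params = init_params(input_size, hidden_size, rng)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, params)
```

Note that `c_t` is updated by elementwise scaling and addition rather than by a matrix multiply followed by a squashing function, which is what lets gradients flow along the cell state with less decay.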

Benefits

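  • Mitigates the vanishing gradient problem: the additive cell-state update gives gradients a more direct path through time
  • Selective memory: gates decide what to forget, what to write, and what to expose
  • Better at capturing long-term dependencies than a plain RNN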

Transformer

Key Innovation: Self-Attention

No recurrence: every position attends directly to every other position, so the entire sequence can be processed in parallel.

Attention:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
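
As a concrete reading of the formula, here is a minimal NumPy sketch of scaled dot-product attention for a single head with no masking. The function names and the toy shapes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, also returning the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: 2 queries attending over 3 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
out, weights = attention(Q, K, V)
```

Each output row is a weighted average of the rows of `V`, with weights determined by how well the query matches each key.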

Architecture

Input → Embedding + Position Encoding → [Multi-Head Attention + FFN] × N → Output
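
Because attention itself is order-agnostic, position information has to be injected into the embeddings. The notes do not say which encoding is meant; the sketch below assumes the fixed sinusoidal scheme from the original Transformer paper, and the function name `sinusoidal_positions` is hypothetical. It also assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal position encodings (assumed scheme; d_model must be even)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# Added elementwise to token embeddings of the same shape.
seq_len, d_model = 6, 8
embeddings = np.zeros((seq_len, d_model))          # placeholder embeddings
encoded = embeddings + sinusoidal_positions(seq_len, d_model)
```

Learned position embeddings are a common alternative; the choice here is only for illustration.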

Advantages over RNN/LSTM

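| Aspect | RNN/LSTM | Transformer |
| --- | --- | --- |
| Parallelization | Sequential over time steps | Parallel across positions during training |
| Long-range dependencies | Difficult | Easy (direct attention) |
| Training speed | Slow | Fast |
| Scalability | Limited | Scales well |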

LLM (Large Language Model)

Definition

Transformer-based models with billions of parameters trained on massive text corpora.

Key Examples

| Model | Parameters | Organization |
| --- | --- | --- |
| GPT-3 | 175B | OpenAI |
| GPT-4 | ~1.7T (unofficial estimate) | OpenAI |
| LLaMA | 7B-70B | Meta |
| Claude | Undisclosed | Anthropic |
| PaLM | 540B | Google |

Capabilities

  • Text generation
  • Question answering
  • Summarization
  • Translation
  • Code generation
  • Reasoning

Training

  1. Pre-training: Self-supervised on large corpus
  2. Fine-tuning: Task-specific or instruction tuning
  3. RLHF: Reinforcement Learning from Human Feedback
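
To make the pre-training step more concrete, the sketch below shows one common self-supervised objective: next-token prediction with a cross-entropy loss. This is an assumption about the objective (the notes only say "self-supervised"), and `next_token_loss` and the toy inputs are hypothetical names and values for illustration.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy where position t predicts token t+1.

    logits:    (seq_len, vocab_size) model outputs, one row per position
    token_ids: (seq_len,) integer ids of the input tokens
    """
    pred = logits[:-1]         # the last position has no next token
    targets = token_ids[1:]    # labels are the inputs shifted by one
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy check: a model with uniform predictions scores log(vocab_size).
vocab_size = 10
token_ids = np.array([1, 4, 2, 7, 3])
uniform_logits = np.zeros((len(token_ids), vocab_size))
loss = next_token_loss(uniform_logits, token_ids)
```

The labels come from the text itself, which is why no manual annotation is needed at this stage.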

Evolution Summary

RNN (1986)
  ↓ (vanishing gradient)
LSTM (1997)
  ↓ (sequential bottleneck)
Transformer (2017)
  ↓ (scale up)
LLM (2020+)