Numerical Descent

Overview
#

Numerical descent methods are iterative optimization algorithms used to find local minima of differentiable functions. In deep learning, these methods minimize loss functions to train neural networks.

Gradient Descent
#

The fundamental update rule:

$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) $$

Where:

$\theta$: Model parameters
$\eta$: Learning rate
$L$: Loss function
$\nabla_\theta L$: Gradient of loss with respect to parameters

Types of Gradient Descent
#

Batch Gradient Descent
#

Uses entire dataset for each update:

$$ \theta = \theta - \eta \cdot \nabla_\theta L(\theta; X, Y) $$

Pros	Cons
Stable convergence	Slow for large datasets
Guaranteed descent	High memory usage

Stochastic Gradient Descent (SGD)
#

Updates using single sample:

$$ \theta = \theta - \eta \cdot \nabla_\theta L(\theta; x_i, y_i) $$

Pros	Cons
Fast updates	Noisy gradients
Can escape local minima	Unstable convergence

Mini-batch Gradient Descent
#

Uses subset of data:

$$ \theta = \theta - \eta \cdot \nabla_\theta L(\theta; X_{batch}, Y_{batch}) $$

Typical batch sizes: 32, 64, 128, 256

Advanced Optimizers
#

Momentum
#

Accumulates velocity in consistent gradient directions:

$$ v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta_t) $$$$ \theta_{t+1} = \theta_t - v_t $$

Where $\gamma$ is momentum coefficient (typically 0.9).

RMSprop
#

Adapts learning rate per parameter:

$$ E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2 $$$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t $$

Adam
#

Combines momentum and adaptive learning rates:

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t $$$$ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 $$

Bias-corrected estimates:

$$ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} $$

Update:

$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $$

Default values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

Comparison
#

Optimizer	Adaptive LR	Momentum	Use Case
SGD	No	Optional	Simple, well-tuned
RMSprop	Yes	No	RNNs, non-stationary
Adam	Yes	Yes	Default choice
AdamW	Yes	Yes	With weight decay

Learning Rate Schedules
#

Step Decay
#

$$ \eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor} $$

Exponential Decay
#

$$ \eta_t = \eta_0 \cdot e^{-kt} $$

Cosine Annealing
#

$$ \eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{t}{T}\pi)) $$

Convergence Considerations
#

Learning rate too high: Divergence or oscillation
Learning rate too low: Slow convergence, stuck in local minima
Gradient clipping: Prevents exploding gradients

$$ g = \min\left(1, \frac{\text{threshold}}{\|g\|}\right) \cdot g $$

Overview#

Gradient Descent#

Types of Gradient Descent#

Batch Gradient Descent#

Stochastic Gradient Descent (SGD)#

Mini-batch Gradient Descent#

Advanced Optimizers#

Momentum#

RMSprop#

Adam#

Comparison#

Learning Rate Schedules#

Step Decay#

Exponential Decay#

Cosine Annealing#

Convergence Considerations#