Table of Contents
1. Forward Pass: Neural Network Output Pipeline#
In the final stage of a classification neural network, the output flows through three sequential transformations:
$$ \underbrace{\vphantom{\frac{A}{B}}\text{Neural Network}}_{\text{feature extraction}} \longrightarrow \underbrace{\vphantom{\frac{A}{B}}z_i}_{\text{logits}} \longrightarrow \underbrace{\vphantom{\frac{A}{B}}p_i}_{\text{softmax}} \longrightarrow \underbrace{\vphantom{\frac{A}{B}}L}_{\text{cross entropy}} $$Each stage has a specific role:
- Logits ($z_i$): Raw, unnormalized scores from the final linear layer
- Softmax ($p_i$): Converts logits into a valid probability distribution
- Cross Entropy ($L$): Measures the difference between predicted and true distributions
2. Mathematical Definitions#
2.1 Logits to Probability: Softmax Function#
The softmax function transforms raw logits into probabilities that sum to 1:
$$ \underbrace{\vphantom{\frac{e^{z_i}}{\sum_j}}{\color{blue}p_i}}_{\text{probability}} = \underbrace{\frac{e^{z_i}}{\sum_j e^{z_j}}}_{\text{softmax function}} \tag{1} $$where:
- $z_i$ is the logit (raw score) for class $i$
- $e^{z_i}$ ensures all values are positive
- $\sum_j e^{z_j}$ normalizes so that $\sum_i p_i = 1$
2.2 Probability to Loss: Cross Entropy#
Cross entropy measures how well the predicted distribution $p$ matches the true distribution $y$:
$$ L = - \underbrace{\vphantom{\sum_i^n}\sum_i}_{\text{all classes}} \overbrace{\vphantom{\sum_i^n}y_i}^{\text{true label}} \underbrace{\vphantom{\sum_i^n}\log({\color{blue}p_i})}_{\text{log probability}} \tag{2} $$where:
- $y_i = 1$ for the correct class, $y_i = 0$ otherwise (one-hot encoding)
- Since $0 < p_i \leq 1$, we have $\log(p_i) \leq 0$. The negative sign flips this to ensure $L \geq 0$
3. Backpropagation: Chain Rule Setup#
To update network weights, we need $\frac{\partial L}{\partial z_i}$ (gradient w.r.t. logits).
By the chain rule, we decompose this through intermediate variables:
$$ \underbrace{\vphantom{\frac{\partial L}{\partial p}}\frac{\partial L}{\partial z_i}}_{\text{what we want}} = {\color{red}\underbrace{\vphantom{\frac{\partial L}{\partial p}}\frac{\partial L}{\partial p}}_{\text{CE gradient}}} \cdot {\color{blue}\underbrace{\vphantom{\frac{\partial L}{\partial p}}\frac{\partial p}{\partial z}}_{\text{softmax gradient}}} \tag{3} $$We will derive each term separately:
- ${\color{red}\frac{\partial L}{\partial p}}$ (CE gradient): How loss changes with probability
- ${\color{blue}\frac{\partial p}{\partial z}}$ (Softmax gradient): How probability changes with logit
Starting from the output layer and working backwards.
4. Derivation Part 1: Cross Entropy Gradient#
Goal: Find ${\color{red}\frac{\partial L}{\partial p_i}}$
Starting from the cross entropy definition:
$$ L = -\sum_i y_i \log(p_i) $$Taking the partial derivative with respect to $p_i$:
$$ {\color{red}\underbrace{\vphantom{\frac{\partial}{\partial p}}\frac{\partial L}{\partial p_i}}_{\text{CE gradient}}} = \frac{\partial}{\partial p_i}\left[ \underbrace{\vphantom{\frac{\partial}{\partial p}}-y_i}_{\text{label}} \underbrace{\vphantom{\frac{\partial}{\partial p}}\log(p_i)}_{\text{log prob}} \right] \tag{4} $$Using the derivative of natural log: $\frac{d}{dx}\log(x) = \frac{1}{x}$
$$ \underbrace{\vphantom{\frac{1}{p}}\frac{\partial}{\partial p_i}\log_e(p_i)}_{\text{natural log}} = \underbrace{\vphantom{\frac{1}{p}}\frac{1}{p_i}}_{\text{derivative of log}} \tag{5} $$Therefore:
$$ \boxed{{\color{red}\frac{\partial L}{\partial p_i} = -\frac{y_i}{p_i}}} \tag{6} $$5. Derivation Part 2: Softmax Gradient#
Goal: Find ${\color{blue}\frac{\partial p_j}{\partial z_i}}$
Recall the softmax function:
$$ p_i = \frac{e^{z_i}}{\sum_k e^{z_k}} \tag{7} $$For cleaner notation, let’s define the normalization constant:
$$ S \equiv \sum_k e^{z_k} = e^{z_1} + e^{z_2} + \cdots + e^{z_n} \tag{8} $$Now softmax becomes:
$$ \underbrace{\vphantom{\frac{e}{S}}p_i}_{\text{probability}} = \frac{\overbrace{\vphantom{S}e^{z_i}}^{\text{numerator}}} {\underbrace{\vphantom{e}S}_{\text{normalizer}}} \tag{9} $$Key observation: $S$ contains ALL $z_k$ terms. Therefore:
- Changing $z_i$ affects the numerator $e^{z_i}$
- Changing $z_i$ also affects the denominator $S$ (since $z_i$ is one of the terms in the sum)
This is why we need the quotient rule for differentiation.
We must consider two cases due to the summation in the denominator.
Case 1: Same Index ($i = j$)#
When differentiating $p_i$ with respect to $z_i$:
$$ {\color{blue}\underbrace{\vphantom{\frac{\partial}{\partial z}}\frac{\partial p_i}{\partial z_i}}_{\text{diagonal term}}} = \frac{\partial}{\partial z_i}\left( \frac{\overbrace{\vphantom{S}e^{z_i}}^{f}} {\underbrace{\vphantom{e}S}_{g}} \right) \tag{10} $$Applying the quotient rule: $\frac{d}{dx}\left[\frac{f}{g}\right] = \frac{f’g - fg’}{g^2}$
where:
- $f = e^{z_i}$, so $f’ = e^{z_i}$
- $g = S = \sum_k e^{z_k}$, so $g’ = \frac{\partial S}{\partial z_i} = e^{z_i}$
Factoring out $e^{z_i}$:
$$ = \frac{ \overbrace{\vphantom{S}e^{z_i}}^{\text{common}} \cdot \overbrace{\vphantom{S}(S-e^{z_i})}^{\text{remaining}}} {S^2} \tag{12} $$Separating the fractions:
$$ = \underbrace{\vphantom{\frac{S}{S}}\frac{e^{z_i}}{S}}_{p_i} \cdot \underbrace{\vphantom{\frac{S}{S}}\frac{S-e^{z_i}}{S}}_{1-p_i} \tag{13} $$Recognizing $p_i = \frac{e^{z_i}}{S}$:
$$ = \underbrace{\vphantom{\frac{S}{S}}\frac{e^{z_i}}{S}}_{p_i} \cdot \left(1 - \underbrace{\vphantom{\frac{S}{S}}\frac{e^{z_i}}{S}}_{p_i}\right) \tag{14} $$$$ \boxed{{\color{blue}\frac{\partial p_i}{\partial z_i} = p_i(1-p_i)}} \tag{15} $$Case 2: Different Index ($i \neq j$)#
When differentiating $p_i$ with respect to $z_j$ (where $j \neq i$):
$$ {\color{blue}\underbrace{\vphantom{\frac{\partial}{\partial z}}\frac{\partial p_i}{\partial z_j}}_{\text{off-diagonal}}} = \frac{\partial}{\partial z_j}\left( \frac{\overbrace{\vphantom{S}e^{z_i}}^{\text{const w.r.t. }z_j}} {\underbrace{\vphantom{e}S}_{\text{contains }z_j}} \right) \tag{16} $$Here:
- $f = e^{z_i}$ is constant w.r.t. $z_j$, so $f’ = 0$
- $g = S$, so $g’ = \frac{\partial S}{\partial z_j} = e^{z_j}$
Summary: Softmax Jacobian#
$$ {\color{blue}\frac{\partial p_i}{\partial z_j}} = \begin{cases} p_i(1 - p_i) & \text{if } i = j \\ -p_i \cdot p_j & \text{if } i \neq j \end{cases} = p_i(\delta_{ij} - p_j) \tag{21} $$where $\delta_{ij}$ is the Kronecker delta.
6. Combining: Full Gradient Derivation#
Now we combine both gradients using the chain rule.
Key observation: In softmax, changing $z_i$ affects ALL probabilities $p_j$ (not just $p_i$), because $z_i$ appears in the denominator $S = \sum_k e^{z_k}$.
Therefore, we must sum over all paths:
$$ \frac{\partial L}{\partial z_i} = \sum_{j} {\color{red}\frac{\partial L}{\partial p_j}} \cdot {\color{blue}\frac{\partial p_j}{\partial z_i}} \tag{22} $$This splits into two cases based on our softmax derivative:
$$ \frac{\partial L}{\partial z_i} = \underbrace{\vphantom{\sum_{j \neq i}}{\color{red}\frac{\partial L}{\partial p_i}} \cdot {\color{blue}\frac{\partial p_i}{\partial z_i}}}_{\text{when }j=i} + \underbrace{\sum_{j \neq i}{\color{red}\frac{\partial L}{\partial p_j}} \cdot {\color{blue}\frac{\partial p_j}{\partial z_i}}}_{\text{when }j \neq i} \tag{23} $$Substituting our derived values from equations (6), (15), and (20):
$$ = \underbrace{\vphantom{\sum_{j \neq i}}{\color{red}\left(-\frac{y_i}{p_i}\right)} \cdot {\color{blue}p_i(1-p_i)}}_{\text{diagonal term}} + \underbrace{\sum_{j \neq i}{\color{red}\left(-\frac{y_j}{p_j}\right)} \cdot {\color{blue}(-p_j \cdot p_i)}}_{\text{off-diagonal terms}} \tag{24} $$Simplifying the diagonal term:
$$ {\color{red}\left(-\frac{y_i}{p_i}\right)} \cdot {\color{blue}p_i(1-p_i)} = -y_i(1-p_i) = -y_i + y_i p_i \tag{25} $$Simplifying the off-diagonal terms:
$$ {\color{red}\left(-\frac{y_j}{p_j}\right)} \cdot {\color{blue}(-p_j \cdot p_i)} = y_j \cdot p_i \tag{26} $$Combining:
$$ = \underbrace{\vphantom{\sum_{j \neq i}}-y_i + y_i p_i}_{\text{from diagonal}} + \underbrace{\sum_{j \neq i} y_j \cdot p_i}_{\text{from off-diagonal}} \tag{27} $$$$ = -y_i + y_i p_i + p_i \sum_{j \neq i} y_j \tag{28} $$Since $\sum_j y_j = 1$ (one-hot encoding), we have $\sum_{j \neq i} y_j = 1 - y_i$:
$$ = -y_i + y_i p_i + p_i(1 - y_i) \tag{29} $$$$ = -y_i + y_i p_i + p_i - y_i p_i \tag{30} $$$$ = p_i - y_i \tag{31} $$7. Final Result#
$$ \boxed{ \underbrace{\vphantom{\frac{\partial L}{\partial z}}\frac{\partial L}{\partial z_i}}_{\text{gradient}} = \underbrace{\vphantom{\frac{\partial L}{\partial z}}p_i}_{\text{predicted}} - \underbrace{\vphantom{\frac{\partial L}{\partial z}}y_i}_{\text{true}} } \tag{32} $$Interpretation:
- The gradient is simply the difference between predicted probability and true label
- If $y_i = 1$ (correct class): gradient $= p_i - 1 < 0$ (negative, pushing logit up)
- If $y_i = 0$ (wrong class): gradient $= p_i > 0$ (positive, pushing logit down)
This elegant result is why Softmax + Cross Entropy is the standard choice for classification tasks.
Summary Table#
| Step | Forward | Backward |
|---|---|---|
| Softmax | $z_i \xrightarrow{\text{softmax}} p_i$ | ${\color{blue}\frac{\partial p_j}{\partial z_i} = p_i(\delta_{ij} - p_j)}$ |
| Cross Entropy | $p_i \xrightarrow{\text{CE}} L$ | ${\color{red}\frac{\partial L}{\partial p_i} = -\frac{y_i}{p_i}}$ |
| Combined | $z_i \rightarrow L$ | $\frac{\partial L}{\partial z_i} = p_i - y_i$ |