The cross-entropy loss (i.e., the negative average log-likelihood) is given by
\[ Loss(\vec{w}) = -\frac{1}{N} \sum_{n=1}^N{ \left[ y_n \cdot \log(\sigma(\vec{w}^T \vec{x}_n)) + (1 - y_n) \cdot \log(1 - \sigma(\vec{w}^T \vec{x}_n)) \right] } \]
and we want to find the gradient
\[ \frac{\partial Loss(\vec{w})}{\partial \vec{w}} \]
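As a concrete reference to check the algebra against, here is a minimal NumPy sketch of this loss. The names `sigmoid`, `cross_entropy_loss`, `X` (an \(N \times d\) input matrix), `y` (a vector of 0/1 labels), and `w` are illustrative assumptions, not part of the derivation itself.

```python
import numpy as np

def sigmoid(a):
    # Elementwise logistic function: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_loss(w, X, y):
    # p_n = sigma(w^T x_n): predicted probability that sample n has label 1
    p = sigmoid(X @ w)
    # Negative average log-likelihood over the N samples
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```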
Assume we have a nested function \(F(x)=f(g(h(x)))\), and we want to compute the derivative \(\frac{\partial F(x)}{\partial x}\).
We can obtain this derivative using the chain rule, which is based on the following equality:
\[ \frac{\partial F(x)}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x} \]
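As a quick sanity check of this equality, the sketch below composes the same \(\log\) and \(\sigma\) used later with an arbitrarily chosen inner function \(h(x) = 3x\) (a stand-in, not from the derivation), and compares the chain-rule product against a central finite difference.

```python
import numpy as np

# F(x) = f(g(h(x))) with f = log, g = sigmoid, h(x) = 3x (arbitrary inner function)
f = np.log
df = lambda a: 1.0 / a                  # f'(a) = 1/a
g = lambda a: 1.0 / (1.0 + np.exp(-a))  # sigmoid
dg = lambda a: g(a) * (1.0 - g(a))      # g'(a) = sigma(a) * (1 - sigma(a))
h = lambda x: 3.0 * x
dh = lambda x: 3.0

x = 0.7
# Chain rule: each derivative is evaluated at the output of the functions inside it
chain = df(g(h(x))) * dg(h(x)) * dh(x)
# Central finite difference on the full composition
eps = 1e-6
numeric = (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2 * eps)
print(chain, numeric)  # the two values should agree to ~1e-9
```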
Applied to our loss, this nesting looks as follows (shown for the first term):
\[ Loss(\vec{w}) = -\frac{1}{N} \sum_{n=1}^N{ \left[ y_n \cdot \underbrace{\log( \underbrace{\sigma (\underbrace{\vec{w}^T \vec{x}_n}_{h})}_{g} )}_{f} + (1 - y_n) \cdot \log(1 - \sigma (\vec{w}^T \vec{x}_n)) \right] } \]
To apply the chain rule, note that in our case \(F(x)\) corresponds to the part \(\log(\sigma(\vec{w}^T \vec{x}_n))\). Hence, the functions \(f,g,h\) that are “chained” or nested as in \(F(x)=f(g(h(x)))\) are the following:
| function | derivative w.r.t. \(a\) |
|---|---|
| \(f(a)=\log(a)\) | \(f'(a)=\frac{1}{a}\) |
| \(g(a)=\sigma(a)\) | \(g'(a)=\sigma(a) \cdot (1-\sigma(a))\) |
| \(h(\vec{a})=\vec{a}^T \vec{x}\) | \(h'(\vec{a})=\vec{x}\) |
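The sigmoid derivative in the table is the only entry that is not immediate; it follows directly from the definition \(\sigma(a) = \frac{1}{1 + e^{-a}}\):
\[ \sigma'(a) = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}} = \sigma(a) \cdot (1 - \sigma(a)) \]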
\[ Loss(\vec{w}) = -\frac{1}{N} \sum_{n=1}^N{ \left[ y_n \cdot \underbrace{ \log(\sigma (\vec{w}^T \vec{x}_n)) }_{f(g(h(\vec{w})))} + (1 - y_n) \cdot \underbrace{ \log(1 - \sigma (\vec{w}^T \vec{x}_n)) }_{f(g(h(\vec{w})))} \right] } \]
Note that in the second term the logarithm wraps \(1 - \sigma\), so there \(g(a) = 1 - \sigma(a)\) and hence \(g'(a) = -\sigma(a) \cdot (1 - \sigma(a))\); this is the source of the minus sign in the second term below. Term by term, the chain rule gives:
\[ \begin{aligned} \frac{\partial Loss(\vec{w})}{\partial \vec{w}} & = -\frac{1}{N} \sum_{n=1}^N \left[ y_n \cdot \underbrace{ \frac{1}{\sigma(\vec{w}^T \vec{x}_n)} }_{f'(a)} \cdot \underbrace{ \sigma(\vec{w}^T \vec{x}_n) \cdot (1- \sigma(\vec{w}^T \vec{x}_n)) }_{g'(a)} \cdot \underbrace{ \vec{x}_n }_{h'(a)} + (1-y_n) \cdot \underbrace{ \frac{1}{1 -\sigma(\vec{w}^T \vec{x}_n)} }_{f'(a)} \cdot \underbrace{ \left( - \sigma(\vec{w}^T \vec{x}_n) \cdot (1- \sigma(\vec{w}^T \vec{x}_n)) \right) }_{g'(a)} \cdot \underbrace{ \vec{x}_n }_{h'(a)} \right] \\
& = -\frac{1}{N} \sum_{n=1}^N \left[ y_n \cdot \frac{\cancel{\sigma(\vec{w}^T \vec{x}_n)} \cdot (1- \sigma(\vec{w}^T \vec{x}_n))}{\cancel{\sigma(\vec{w}^T \vec{x}_n)}} \cdot \vec{x}_n - (1-y_n) \cdot \frac{\sigma(\vec{w}^T \vec{x}_n) \cdot \cancel{(1- \sigma(\vec{w}^T \vec{x}_n))}}{\cancel{(1 -\sigma(\vec{w}^T \vec{x}_n))}} \cdot \vec{x}_n \right] \\
& = -\frac{1}{N} \sum_{n=1}^N \left[ y_n \cdot (1- \sigma(\vec{w}^T \vec{x}_n)) \cdot \vec{x}_n - (1-y_n) \cdot \sigma(\vec{w}^T \vec{x}_n) \cdot \vec{x}_n \right] \\
& = -\frac{1}{N} \sum_{n=1}^N \left[ y_n \vec{x}_n - y_n \vec{x}_n \sigma(\vec{w}^T \vec{x}_n) - \vec{x}_n \sigma(\vec{w}^T \vec{x}_n) + y_n \vec{x}_n \sigma(\vec{w}^T \vec{x}_n) \right] \\
& = \frac{1}{N} \sum_{n=1}^N \left[ \vec{x}_n \ (\underbrace{\sigma(\vec{w}^T \vec{x}_n)}_\text{predicted prob} - \underbrace{y_n}_\text{label}) \right] \end{aligned} \]
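To close the loop, here is a hedged NumPy sketch of the final formula, with a finite-difference check that the closed form matches the numerical gradient of the loss. The helper names repeat the illustrative sketch from above so the block runs on its own; the random data is made up purely for the check.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_grad(w, X, y):
    # The result just derived: (1/N) * sum_n x_n * (sigma(w^T x_n) - y_n)
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y)

# Synthetic data, chosen only to exercise the check
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = rng.normal(size=3)
y = (rng.random(100) < 0.5).astype(float)

g = cross_entropy_grad(w, X, y)
eps = 1e-6
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    num = (cross_entropy_loss(w + e, X, y) - cross_entropy_loss(w - e, X, y)) / (2 * eps)
    print(f"analytic {g[i]:+.6f}  numeric {num:+.6f}")  # should agree to several decimals
```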