# Exercise: gradient of soft-max error function

The soft-max regression model can be used in the $k$ classes classification problem. The model consists of composition of probabilities distribution for each $k$ classes. So, the activation function $h_\theta(x^{(i)})$ is given by:

$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 | x^{(i)}; \theta) \\ p(y^{(i)} = 2 | x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k | x^{(i)}; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_k^T x^{(i)} } \\ \end{bmatrix}$
And
$h_{\theta_n}(x^{(i)}) = p(y^{(i)} = n | x^{(i)}; \theta) = \frac{ e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }$

The error function is given by:
$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]=- \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log( h_{\theta_j}(x^{(i)}) )\right]$

Notice that Kronecker delta $\delta_{y^{(i)}j}$, with value 1 when
$y^{(i)} = j$, and zero otherwise. This is needed to enforce that the contribution of a specific activation function only to the appropriate training class.

And we want to know $\nabla_{\theta_n} J(\theta)$ and in order to do so, we will compute $\nabla_{\theta_n} \text{log}(h_{\theta_j}(x^{(i)}))$ for two cases:

• $n \neq j$

$\nabla_{\theta_n} \text{log}(h_{\theta_j}(x^{(i)})) = \frac{1}{h_{\theta_j}(x^{(i)})} \nabla_{\theta_n}h_{\theta_j}(x^{(i)}) \\ = \frac{1}{h_{\theta_j}(x^{(i)})} \frac{\partial }{\partial \theta_n}\frac{e^{ \theta_j^T x^{(i)} }}{\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }}} = \frac{1}{h_{\theta_j}(x^{(i)})} e^{ \theta_j^T x^{(i)} } \frac{\partial }{\partial \theta_n}\frac{1}{\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }}} \\ = - \frac{1}{h_{\theta_j}(x^{(i)})} \frac{ e^{\theta_j^T x^{(i)}} }{ (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)}}})^2 }\frac{\partial }{\partial \theta_n}\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)}}} = - \frac{1}{h_{\theta_j}(x^{(i)})} \frac{ e^{\theta_j^T x^{(i)}} }{ (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)}}}) } \frac{ x^{(i)} e^{\theta_n^T x^{(i)}} }{ (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)}}}) } \\ = - \frac{1}{h_{\theta_j}(x^{(i)})} h_{\theta_j}(x^{(i)}) x^{(i)}h_{\theta_n}(x^{(i)}) \\= -x^{(i)}h_{\theta_n}(x^{(i)})$

• $n = j$

$\nabla_{\theta_j} \text{log}(h_{\theta_j}(x^{(i)})) = \frac{1}{h_{\theta_j}(x^{(i)})} \nabla_{\theta_j}h_{\theta_j}(x^{(i)}) \\ = \frac{1}{h_{\theta_j}(x^{(i)})} \frac{\partial }{\partial \theta_j}\frac{e^{ \theta_j^T x^{(i)} }}{\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }}} = \frac{1}{h_{\theta_j}(x^{(i)})} \frac { \frac{\partial e^{\theta_j^T x^{(i)}}}{\partial \theta_j}(\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }}) - e^{\theta_j^T x^{(i)}}\frac{\partial (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})}{\partial \theta_j} } { (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})^2 } \\= \frac{1}{h_{\theta_j}(x^{(i)})} \frac { x^{(i)}e^{\theta_j^T x^{(i)}}(\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})-x^{(i)}(e^{\theta_j^T x^{(i)}})^2 } { (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})^2 } \\= \frac{x^{(i)}}{h_{\theta_j}(x^{(i)})} ( \frac{ e^{\theta_j^T x^{(i)}}}{ (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})} -\frac{(e^{\theta_j^T x^{(i)}})^2}{(\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})^2 } )$
$= \frac{x^{(i)}}{h_{\theta_j}(x^{(i)})} \left[ h_{\theta_j}(x^{(i)}) -h_{\theta_j}(x^{(i)})^2 \right] \\= x^{(i)} \left[ 1-h_{\theta_j}(x^{(i)}) \right] \\= x^{(i)}-x^{(i)}h_{\theta_j}(x^{(i)})= x^{(i)}-x^{(i)}h_{\theta_n}(x^{(i)})$

From the results we may observe that the difference between the two is a single term $x^{(i)}$ that is present when $n = j$. So we can write both in a single formula:
$\nabla_{\theta_n} \text{log}(h_{\theta_j}(x^{(i)})) = x^{(i)}\left[ \delta_{jn}-h_{\theta_n}(x^{(i)})\right ]$

With this result at hand, we are ready to finish the computation of the gradient of the error.
$\nabla_{\theta_n} J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \nabla_{\theta_n}\log( h_{\theta_j}(x^{(i)}) )\right] \\= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} x^{(i)}\left[ \delta_{jn}-h_{\theta_n}(x^{(i)})\right ] \right] \\= - \frac{1}{m} \left[ \sum_{i=1}^{m}x^{(i)} \sum_{j=1}^{k} \delta_{y^{(i)}j}\delta_{jn} - h_{\theta_n}(x^{(i)}) \sum_{j=1}^{k} \delta_{y^{(i)}j} \right] \\= - \frac{1}{m} \left[ \sum_{i=1}^{m} x^{(i)} (\delta_{y^{(i)}n}-h_{\theta_n}(x^{(i)}) ) \right]$

This formula is ready to be used in a Gradient Descent algorithm capable of learning the (local) optimal parameters of the soft-max activation function.