## Exercise: gradient of soft-max error function

The soft-max regression model can be used in the $k$ classes classification problem. The model consists of composition of probabilities distribution for each $k$ classes. So, the activation function $h_\theta(x^{(i)})$ is given by:

$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 | x^{(i)}; \theta) \\ p(y^{(i)} = 2 | x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k | x^{(i)}; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_k^T x^{(i)} } \\ \end{bmatrix}$
And
$h_{\theta_n}(x^{(i)}) = p(y^{(i)} = n | x^{(i)}; \theta) = \frac{ e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }$

The error function is given by:
$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]=- \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log( h_{\theta_j}(x^{(i)}) )\right]$