Exercise: gradient of soft-max error function

The soft-max regression model can be used for classification problems with k classes. The model outputs a probability distribution over the k classes, so the activation function h_\theta(x^{(i)}) is given by:

h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 | x^{(i)}; \theta) \\ p(y^{(i)} = 2 | x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k | x^{(i)}; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_k^T x^{(i)} } \\ \end{bmatrix}
and the n-th component of this vector is
h_{\theta_n}(x^{(i)}) = p(y^{(i)} = n | x^{(i)}; \theta) = \frac{ e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
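To make the model concrete, here is a minimal NumPy sketch of the activation function above (the function name softmax_hypothesis, the use of NumPy, and the (k, d) layout of the parameter matrix are my own choices for illustration, not part of the exercise):

    import numpy as np

    def softmax_hypothesis(Theta, x):
        """h_theta(x): the vector of p(y = n | x; theta) for the k classes.

        Theta is a (k, d) matrix whose n-th row is theta_n; x is a (d,) feature vector.
        """
        logits = Theta @ x                    # theta_n^T x for every class n
        logits = logits - logits.max()        # shift for numerical stability; cancels in the ratio
        exp_logits = np.exp(logits)
        return exp_logits / exp_logits.sum()  # normalise so the components sum to 1

Subtracting the maximum logit before exponentiating is a standard trick to avoid overflow; since the same factor appears in the numerator and the denominator, it does not change the result.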

The error function is given by:
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]=- \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \log( h_{\theta_j}(x^{(i)}) )\right]

Notice the Kronecker delta \delta_{y^{(i)}j}, which equals 1 when y^{(i)} = j and 0 otherwise. It ensures that, for each training example, only the activation of the correct class contributes to the error.
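The error can also be computed directly as a sanity check; the sketch below reuses softmax_hypothesis from above and assumes labels are stored as 0-based integers 0, ..., k-1 rather than the 1-based classes of the exercise:

    def softmax_cost(Theta, X, y):
        """J(theta) for m examples; X is (m, d) and y holds the class of each row.

        The Kronecker delta delta_{y^(i) j} just picks, for each example, the
        log-probability of its true class, so we index instead of summing over j.
        """
        m = X.shape[0]
        probs = np.array([softmax_hypothesis(Theta, X[i]) for i in range(m)])  # (m, k)
        return -np.mean(np.log(probs[np.arange(m), y]))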

We want to compute \nabla_{\theta_n} J(\theta), and in order to do so we first compute \nabla_{\theta_n} \text{log}(h_{\theta_j}(x^{(i)})) in two cases:

  • n \neq j

\nabla_{\theta_n} \text{log}(h_{\theta_j}(x^{(i)})) = \frac{1}{h_{\theta_j}(x^{(i)})} \nabla_{\theta_n}h_{\theta_j}(x^{(i)}) \\ = \frac{1}{h_{\theta_j}(x^{(i)})} \frac{\partial }{\partial \theta_n}\frac{e^{ \theta_j^T x^{(i)} }}{\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }}} = \frac{1}{h_{\theta_j}(x^{(i)})} e^{ \theta_j^T x^{(i)} } \frac{\partial }{\partial \theta_n}\frac{1}{\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }}} \\ = - \frac{1}{h_{\theta_j}(x^{(i)})} \frac{ e^{\theta_j^T x^{(i)}} }{ (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)}}})^2 }\frac{\partial }{\partial \theta_n}\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)}}} = - \frac{1}{h_{\theta_j}(x^{(i)})} \frac{ e^{\theta_j^T x^{(i)}} }{ (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)}}}) } \frac{ x^{(i)} e^{\theta_n^T x^{(i)}} }{ (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)}}}) } \\ = - \frac{1}{h_{\theta_j}(x^{(i)})} h_{\theta_j}(x^{(i)}) x^{(i)}h_{\theta_n}(x^{(i)}) \\= -x^{(i)}h_{\theta_n}(x^{(i)})

  • n = j

\nabla_{\theta_j} \text{log}(h_{\theta_j}(x^{(i)})) = \frac{1}{h_{\theta_j}(x^{(i)})} \nabla_{\theta_j}h_{\theta_j}(x^{(i)}) \\ = \frac{1}{h_{\theta_j}(x^{(i)})} \frac{\partial }{\partial \theta_j}\frac{e^{ \theta_j^T x^{(i)} }}{\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }}} = \frac{1}{h_{\theta_j}(x^{(i)})} \frac { \frac{\partial e^{\theta_j^T x^{(i)}}}{\partial \theta_j}(\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }}) - e^{\theta_j^T x^{(i)}}\frac{\partial (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})}{\partial \theta_j} } { (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})^2 } \\= \frac{1}{h_{\theta_j}(x^{(i)})} \frac { x^{(i)}e^{\theta_j^T x^{(i)}}(\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})-x^{(i)}(e^{\theta_j^T x^{(i)}})^2 } { (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})^2 } \\= \frac{x^{(i)}}{h_{\theta_j}(x^{(i)})} \left( \frac{ e^{\theta_j^T x^{(i)}}}{ (\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})} -\frac{(e^{\theta_j^T x^{(i)}})^2}{(\sum_{l=1}^{k}{e^{ \theta_l^T x^{(i)} }})^2 } \right) \\= \frac{x^{(i)}}{h_{\theta_j}(x^{(i)})} \left[ h_{\theta_j}(x^{(i)}) -h_{\theta_j}(x^{(i)})^2 \right] \\= x^{(i)} \left[ 1-h_{\theta_j}(x^{(i)}) \right] \\= x^{(i)}-x^{(i)}h_{\theta_j}(x^{(i)})= x^{(i)}-x^{(i)}h_{\theta_n}(x^{(i)})

Comparing the two results, the only difference is the extra term x^{(i)} that appears when n = j, so both cases can be written as a single formula:
\nabla_{\theta_n} \text{log}(h_{\theta_j}(x^{(i)})) = x^{(i)}\left[ \delta_{jn}-h_{\theta_n}(x^{(i)})\right ]
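This compact formula is easy to verify numerically. The sketch below (again reusing softmax_hypothesis, with an arbitrary small step eps for the finite differences; the helper names are my own) compares it against a central-difference approximation:

    def grad_log_h(Theta, x, j):
        """Gradient of log h_{theta_j}(x) with respect to every theta_n.

        Returns a (k, d) matrix whose n-th row is x * (delta_{jn} - h_n(x))."""
        h = softmax_hypothesis(Theta, x)
        delta = np.zeros(Theta.shape[0])
        delta[j] = 1.0
        return np.outer(delta - h, x)

    def check_grad_log_h(Theta, x, j, eps=1e-6):
        """Largest absolute gap between the formula and central finite differences."""
        analytic = grad_log_h(Theta, x, j)
        numeric = np.zeros_like(Theta)
        for n in range(Theta.shape[0]):
            for d in range(Theta.shape[1]):
                Tp, Tm = Theta.copy(), Theta.copy()
                Tp[n, d] += eps
                Tm[n, d] -= eps
                numeric[n, d] = (np.log(softmax_hypothesis(Tp, x)[j])
                                 - np.log(softmax_hypothesis(Tm, x)[j])) / (2 * eps)
        return np.abs(analytic - numeric).max()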

With this result at hand, we are ready to finish the computation of the gradient of the error.
\nabla_{\theta_n} J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} \nabla_{\theta_n}\log( h_{\theta_j}(x^{(i)}) )\right] \\= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} \delta_{y^{(i)}j} x^{(i)}\left[ \delta_{jn}-h_{\theta_n}(x^{(i)})\right ] \right] \\= - \frac{1}{m} \left[ \sum_{i=1}^{m}x^{(i)} \left( \sum_{j=1}^{k} \delta_{y^{(i)}j}\delta_{jn} - h_{\theta_n}(x^{(i)}) \sum_{j=1}^{k} \delta_{y^{(i)}j} \right) \right] \\= - \frac{1}{m} \left[ \sum_{i=1}^{m} x^{(i)} \left( \delta_{y^{(i)}n}-h_{\theta_n}(x^{(i)}) \right) \right]
where in the last step we used \sum_{j=1}^{k} \delta_{y^{(i)}j}\delta_{jn} = \delta_{y^{(i)}n} and \sum_{j=1}^{k} \delta_{y^{(i)}j} = 1.
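In code, the whole gradient can be assembled at once by letting a one-hot matrix play the role of \delta_{y^{(i)}n}. The sketch below builds on the helpers above (vectorising over the m examples and stacking the k gradients into one matrix are my own choices, not required by the derivation):

    def softmax_gradient(Theta, X, y):
        """Returns a (k, d) matrix whose n-th row is nabla_{theta_n} J(theta),
        i.e. -(1/m) * sum_i x^(i) * (delta_{y^(i) n} - h_{theta_n}(x^(i)))."""
        m, k = X.shape[0], Theta.shape[0]
        H = np.array([softmax_hypothesis(Theta, X[i]) for i in range(m)])  # (m, k) activations
        Y = np.zeros((m, k))
        Y[np.arange(m), y] = 1.0          # one-hot rows encode delta_{y^(i) n}
        return -(Y - H).T @ X / m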

This formula can be used directly in a gradient descent algorithm to learn the (locally) optimal parameters of the soft-max activation function.
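For completeness, a bare-bones batch gradient descent loop using the pieces above might look like this (the learning rate alpha and the number of iterations are arbitrary illustrative values, not tuned recommendations):

    def gradient_descent(X, y, k, alpha=0.5, iters=500):
        """Plain batch gradient descent on J(theta), starting from theta = 0."""
        Theta = np.zeros((k, X.shape[1]))
        for _ in range(iters):
            Theta = Theta - alpha * softmax_gradient(Theta, X, y)
        return Theta

    # Example usage: Theta = gradient_descent(X, y, k=3)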
