The soft-max regression model can be used in the $K$-class classification problem. The model composes a probability distribution over the classes, so the activation function is given by:

$$ y_k = \frac{\exp(a_k)}{\sum_{i=1}^{K} \exp(a_i)} $$

and the activations are linear in the input:

$$ a_k = \mathbf{w}_k^{\top}\mathbf{x} $$
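As a minimal sketch, the soft-max activation can be computed in NumPy. The max-shift before exponentiating is an implementation detail for numerical stability, not part of the derivation above; the function name `softmax` is my own:

```python
import numpy as np

def softmax(a):
    """Soft-max over the last axis; shifting by the max avoids overflow in exp."""
    z = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# Example: the largest activation receives the largest probability.
y = softmax(np.array([2.0, 1.0, 0.1]))
```

The shift leaves the result unchanged because the common factor $\exp(-\max_i a_i)$ cancels between numerator and denominator.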

The error function is the cross-entropy over the $N$ training examples:

$$ E(\mathbf{w}) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\,\ln y_{nk}, \qquad t_{nk} = \delta_{k\,c_n} $$

where $c_n$ is the class of example $n$.

Notice the Kronecker delta $\delta_{k\,c_n}$, with value 1 when $k = c_n$ and zero otherwise. It enforces that each activation contributes to the error only through the appropriate training class.
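The error function above can be sketched directly with one-hot target rows playing the role of the Kronecker delta (the values below are illustrative, not from the text):

```python
import numpy as np

def cross_entropy(Y, T):
    """E = -sum_n sum_k t_nk * ln(y_nk), with one-hot target matrix T."""
    return -np.sum(T * np.log(Y))

Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])   # predicted class probabilities
T = np.array([[1, 0, 0],
              [0, 1, 0]])         # one-hot rows: the Kronecker-delta targets
E = cross_entropy(Y, T)          # only the true-class probabilities contribute
```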

We want to know $\nabla_{\mathbf{w}_j} E$, and to do so we first compute $\partial y_k / \partial a_j$ for two cases:

$$ \frac{\partial y_k}{\partial a_k} = y_k(1 - y_k), \qquad \frac{\partial y_k}{\partial a_j} = -y_k\,y_j \quad (k \neq j). $$

From the results we may observe that the two cases differ by a single term that appears only when $k = j$, so we can write both in a single formula:

$$ \frac{\partial y_k}{\partial a_j} = y_k\left(\delta_{kj} - y_j\right). $$
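The combined formula can be sanity-checked against central finite differences; this is a sketch with illustrative activation values, and the helper names are my own:

```python
import numpy as np

def softmax(a):
    z = a - np.max(a)
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(a):
    """Closed form dy_k/da_j = y_k (delta_kj - y_j) as a K x K matrix."""
    y = softmax(a)
    return np.diag(y) - np.outer(y, y)

a = np.array([0.5, -1.0, 2.0])
eps = 1e-6
num = np.zeros((3, 3))
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    num[:, j] = (softmax(a + d) - softmax(a - d)) / (2 * eps)  # numerical dy/da_j

ok = np.allclose(num, softmax_jacobian(a), atol=1e-6)
```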

With this result at hand, we are ready to finish the computation of the gradient of the error. Using $\sum_k t_{nk} = 1$:

$$ \frac{\partial E}{\partial a_{nj}} = -\sum_{k} t_{nk}\left(\delta_{kj} - y_{nj}\right) = y_{nj} - t_{nj}, \qquad \nabla_{\mathbf{w}_j} E = \sum_{n=1}^{N} \left(y_{nj} - t_{nj}\right)\mathbf{x}_n. $$

This formula is ready to be used in a gradient-descent algorithm capable of learning the (locally) optimal parameters of the soft-max regression model.
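A minimal batch gradient-descent loop using that gradient might look as follows. The synthetic data, learning rate, and iteration count are my own assumptions for illustration, and no bias term is included:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(A):
    Z = A - A.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Toy data: N points, D features, K classes, labelled by a random linear model.
N, D, K = 200, 2, 3
X = rng.normal(size=(N, D))
true_W = rng.normal(size=(D, K))
T = np.eye(K)[np.argmax(X @ true_W, axis=1)]  # one-hot targets

W = np.zeros((D, K))
lr = 0.1
for _ in range(500):
    Y = softmax(X @ W)          # activations a_n = W^T x_n, then soft-max
    grad = X.T @ (Y - T)        # gradient: sum_n (y_n - t_n) x_n, per class column
    W -= lr * grad / N          # averaged gradient step

acc = np.mean(np.argmax(X @ W, axis=1) == np.argmax(T, axis=1))
```

Since the labels come from a linear model, plain gradient descent on the cross-entropy recovers most of the decision boundaries.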


## Published by zehsilva

I'm in the journey of learning and sharing.
Computer Science, Machine Learning, Information Retrieval.
Technology and Society, Philosophy and Public Theology.