This paper revisits a special type of neural network known under two
names. In the statistics and machine learning community it is known as a
multi-class logistic regression neural network. In the neural network
community, it is simply the soft-max layer. Its importance is underscored by
its role in deep learning: it serves as the last layer, whose output is the
classification of the input patterns, such as images. Our exposition focuses on
a mathematically rigorous derivation of the key equation expressing the gradient.
A fringe benefit of our approach is a fully vectorized expression, which is the
basis of an efficient implementation.
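For concreteness, the vectorized gradient can be sketched as follows; the notation below is introduced here for illustration and need not match the body of the paper. With an input matrix $X \in \mathbb{R}^{n \times d}$, a weight matrix $W \in \mathbb{R}^{d \times c}$, one-hot targets $Y \in \{0,1\}^{n \times c}$, and the row-wise soft-max $S = \operatorname{softmax}(XW)$, the gradient of the summed cross-entropy loss $L(W)$ takes the familiar form
\[
  \nabla_W L \;=\; X^{\top} (S - Y),
\]
which amounts to a single matrix product and is therefore directly implementable with standard linear-algebra routines.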
The second result of this paper is the
positivity of the second derivative of the cross-entropy loss function as a
function of the weights. This result proves that optimization methods based on
convexity may be used to train this network. As a corollary, we demonstrate
that no L2-regularizer is needed to guarantee convergence of gradient
descent.
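One way to see where this convexity comes from (a sketch in notation introduced here, not necessarily the argument given in the paper) is to note that, for a single sample with soft-max probabilities $p$, the Hessian of the cross-entropy with respect to the logits is $\operatorname{diag}(p) - p p^{\top}$, and for any vector $v$,
\[
  v^{\top}\bigl(\operatorname{diag}(p) - p p^{\top}\bigr) v \;=\; \sum_i p_i v_i^2 - \Bigl(\sum_i p_i v_i\Bigr)^2 \;\ge\; 0
\]
by Jensen's inequality, since the $p_i$ are nonnegative and sum to one. Because the logits depend linearly on the weights, the loss remains convex as a function of $W$.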