It is typically understood that the training of modern neural networks is a
process of fitting the probability distribution of the desired outputs. However,
recent paradoxical observations in a number of language generation tasks lead
one to wonder whether this canonical probability-based explanation can really account
for the empirical success of deep learning. To resolve this issue, we propose
an alternative, utility-based explanation of the standard supervised learning
procedure in deep learning. The basic idea is to interpret the learned neural
network not as a probability model but as an ordinal utility function that
encodes the preferences revealed in the training data. From this perspective,
training the neural network corresponds to a utility learning process. Specifically,
we show that for all neural networks with softmax outputs, the SGD learning
dynamics of maximum likelihood estimation (MLE) can be seen as an iterative
process that optimizes the neural network toward an optimal utility function.
This utility-based interpretation can explain several otherwise-paradoxical
observations about the neural networks thus trained. Moreover, our
utility-based theory also entails an equation that can transform the learned
utility values back to a new kind of probability estimation with which
probability-compatible decision rules enjoy dramatic (double-digit)
performance improvements. These pieces of evidence collectively reveal a phenomenon of
utility-probability duality in terms of what modern neural networks are (truly)
modeling: we thought they were one thing (probabilities), until the
unexplainable showed up; changing our mindset and treating them as another thing
(utility values) largely reconciles theory with observation, despite remaining subtleties
regarding their original (probabilistic) identity.
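
The sketch below is a minimal, illustrative rendering (in PyTorch) of the setup the abstract refers to: a network with softmax outputs trained by SGD on the MLE (cross-entropy) objective, whose pre-softmax scores can then be read either conventionally as log-probabilities or, in the utility-based view, as ordinal utility values used to rank candidate outputs. The toy data, names such as toy_batch, and the hyperparameters are assumptions for illustration; the paper's specific utility-to-probability transformation equation is not reproduced here.

```python
# Minimal sketch (assumed toy setup): standard MLE training of a softmax
# network via SGD, with the pre-softmax scores optionally read as ordinal
# utility values rather than as (log-)probabilities.
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32
net = nn.Sequential(nn.Embedding(vocab_size, hidden),
                    nn.Linear(hidden, vocab_size))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()   # cross-entropy = negative log-likelihood (MLE)

def toy_batch(n=64):
    # Hypothetical data: each "context" token prefers the next token as output.
    x = torch.randint(0, vocab_size, (n,))
    y = (x + 1) % vocab_size
    return x, y

for step in range(200):            # SGD learning dynamics of MLE
    x, y = toy_batch()
    scores = net(x)                # pre-softmax scores
    loss = loss_fn(scores, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Two readings of the same trained network:
x, _ = toy_batch(1)
scores = net(x).squeeze(0)
probs = scores.softmax(dim=-1)     # conventional probabilistic reading
best = scores.argmax().item()      # utility-based reading: rank/choose by raw
                                   # scores as an ordinal utility over outputs
```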