Random deep neural networks are biased towards simple functions
We prove that the binary classifiers of bit strings generated by random wide
deep neural networks with ReLU activation function are biased towards simple
functions. The simplicity is captured by the following two properties. For any
given input bit string, the average Hamming distance of the closest input bit
string with a different classification is at least sqrt(n / (2π log n)),
where n is the length of the string. Moreover, if the bits of the initial
string are flipped randomly, the average number of flips required to change the
classification grows linearly with n. These results are confirmed by numerical
experiments on deep neural networks with two hidden layers, and settle the
conjecture stating that random deep neural networks are biased towards simple
functions. This conjecture was proposed and numerically explored in [Valle
Pérez et al., ICLR 2019] to explain the unreasonably good generalization
properties of deep learning algorithms. The probability distribution of the
functions generated by random deep neural networks is a good choice for the
prior probability distribution in the PAC-Bayesian generalization bounds. Our
results constitute a fundamental step forward in the characterization of this
distribution, therefore contributing to the understanding of the generalization
properties of deep learning algorithms.
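The two quantities in this abstract can be explored numerically with a small
sketch. It is not the authors' experimental code: the initialization scale, the
absence of biases, the mapping of bits to ±1, the reading of "log" as the
natural logarithm, and all function names are assumptions made for
illustration.

```python
import math
import numpy as np


def hamming_lower_bound(n):
    """Evaluate sqrt(n / (2*pi*log n)), assuming log is the natural logarithm."""
    return math.sqrt(n / (2 * math.pi * math.log(n)))


def random_relu_classifier(n, width, rng):
    """Build a random network with two hidden ReLU layers (no biases, an assumption).

    Returns a function mapping a bit string of length n to a label in {0, 1}.
    """
    w1 = rng.normal(0.0, math.sqrt(2.0 / n), size=(n, width))
    w2 = rng.normal(0.0, math.sqrt(2.0 / width), size=(width, width))
    w3 = rng.normal(0.0, math.sqrt(1.0 / width), size=width)

    def classify(bits):
        x = 2.0 * np.asarray(bits) - 1.0   # map {0, 1} to {-1, +1}
        h = np.maximum(x @ w1, 0.0)        # first hidden layer
        h = np.maximum(h @ w2, 0.0)        # second hidden layer
        return int(h @ w3 > 0)             # sign of the scalar output

    return classify


def flips_until_change(classify, bits, rng):
    """Flip distinct bits in random order until the label changes.

    Returns the number of flips needed, or None if the label never changes.
    """
    bits = np.array(bits, copy=True)
    label = classify(bits)
    for count, i in enumerate(rng.permutation(len(bits)), start=1):
        bits[i] ^= 1
        if classify(bits) != label:
            return count
    return None


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 64
    classify = random_relu_classifier(n, width=256, rng=rng)
    x = rng.integers(0, 2, size=n)
    print("bound:", hamming_lower_bound(n))
    print("random flips until change:", flips_until_change(classify, x, rng))
```

Averaging `flips_until_change` over many random networks and inputs for several
values of `n` gives a rough empirical check of the linear-growth claim.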
Differentiable PAC–Bayes Objectives with Partially Aggregated Neural Networks
We make two related contributions motivated by the challenge of training stochastic neural networks, particularly in a PAC–Bayesian setting: (1) we show how averaging over an ensemble of stochastic neural networks enables a new class of partially-aggregated estimators, proving that these lead to unbiased lower-variance output and gradient estimators; (2) we reformulate a PAC–Bayesian bound for signed-output networks to derive, in combination with the above, a directly optimisable, differentiable objective and a generalisation guarantee, without using a surrogate loss or loosening the bound. We show empirically that this leads to competitive generalisation guarantees and compares favourably to other methods for training such networks. Finally, we note that the above leads to a simpler PAC–Bayesian training scheme for sign-activation networks than previous work.
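One way to illustrate the idea of aggregating a signed output is the sketch below. It assumes, beyond what the abstract states, that the final-layer weights are Gaussian with mean `mu` and isotropic standard deviation `sigma`; the function names are hypothetical. Under that assumption the expected sign output for a fixed hidden activation `h` has the closed form erf(mu·h / (sqrt(2) sigma ||h||)), which is smooth in `mu` and has no sampling noise, whereas averaging sampled signs only approximates it.

```python
import math
import numpy as np


def aggregated_sign_output(mu, sigma, h):
    """Closed-form E[sign(w . h)] for w ~ N(mu, sigma^2 I) (Gaussian assumption)."""
    margin = float(mu @ h) / (sigma * float(np.linalg.norm(h)))
    return math.erf(margin / math.sqrt(2.0))


def sampled_sign_output(mu, sigma, h, n_samples, rng):
    """Monte Carlo estimate of the same expectation from sampled weight vectors."""
    w = rng.normal(mu, sigma, size=(n_samples, mu.size))  # one weight vector per row
    return float(np.mean(np.sign(w @ h)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.array([0.5, -0.2, 0.1, 0.3, -0.4])
    h = np.array([1.0, 0.0, 2.0, 0.5, 1.5])
    print("closed form:", aggregated_sign_output(mu, 1.0, h))
    print("sampled    :", sampled_sign_output(mu, 1.0, h, 100_000, rng))
```

In a deeper network, `h` would itself come from sampled earlier layers, so only the last layer is averaged analytically; that is the sense in which this sketch is "partial".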