Are Saddles Good Enough for Deep Learning?
Recent years have seen growing interest in understanding deep neural networks from an optimization perspective. It is now understood that converging to low-cost local minima is sufficient for such models to be effective in practice. In this work, however, we propose a new hypothesis, based on recent theoretical findings and empirical studies, that deep neural network models actually converge to saddle points with high degeneracy. Our findings are new and can have a significant impact on the development of gradient-descent-based methods for training deep networks. We validated our hypothesis through an extensive experimental evaluation on standard datasets such as MNIST and CIFAR-10, and also showed that recent methods that attempt to escape saddles ultimately converge to saddles with high degeneracy, which we term 'good saddles'. We also verified the famous Wigner semicircle law in our experimental results.
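The semicircle-law check mentioned in this abstract is easy to reproduce. The following is a minimal illustrative sketch (not the authors' code): for a large random symmetric matrix with i.i.d. Gaussian entries, the empirical eigenvalue distribution concentrates on the semicircle support [-2, 2] after suitable scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Symmetric Gaussian (GOE-like) matrix, scaled so the off-diagonal
# entries have variance 1/n; the spectrum then converges to the
# semicircle density supported on [-2, 2].
A = rng.normal(size=(n, n))
H = (A + A.T) / np.sqrt(2 * n)
eigs = np.linalg.eigvalsh(H)

# Fraction of eigenvalues inside the (slightly padded) semicircle
# support; for large n this approaches 1.
inside = np.mean(np.abs(eigs) <= 2.05)
print(f"fraction inside support: {inside:.3f}")
```

With n = 1000 essentially all eigenvalues land inside the support, and a histogram of `eigs` traces out the semicircle shape.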
Free energy of Bayesian Convolutional Neural Network with Skip Connection
Since the success of the Residual Network (ResNet), many architectures of
Convolutional Neural Networks (CNNs) have adopted skip connections. While
the generalization performance of CNNs with skip connections has been
explained within the framework of ensemble learning, the dependency on the
number of parameters has not been revealed. In this paper, we derive the
Bayesian free energy of Convolutional Neural Networks both with and without
skip connections in Bayesian learning. The upper bound of the free energy of
a Bayesian CNN with skip connections does not depend on the
overparametrization, and the generalization error of the Bayesian CNN has a
similar property. Comment: 16 pages, 4 figures
Bayesian Free Energy of Deep ReLU Neural Network in Overparametrized Cases
In many research fields in artificial intelligence, deep neural networks
have been shown to be useful for estimating unknown functions on
high-dimensional input spaces. However, their generalization performance is
not yet completely clarified from a theoretical point of view, because they
are nonidentifiable and singular learning machines. Moreover, the ReLU
function is not differentiable, so the algebraic and analytic methods of
singular learning theory cannot be applied to it. In this paper, we study a
deep ReLU neural network in overparametrized cases and prove that the
Bayesian free energy, which is equal to the minus log marginal likelihood or
the Bayesian stochastic complexity, is bounded even if the number of layers
is larger than necessary to estimate an unknown data-generating function.
Since the Bayesian generalization error is equal to the increase of the free
energy as a function of sample size, our result also shows that the Bayesian
generalization error does not increase even if a deep ReLU neural network is
designed to be sufficiently large or in an overparametrized state.
Comment: 20 pages, 2 figures
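The relation between free energy and generalization error invoked in this abstract is the standard identity of singular learning theory; as a reminder (not a result of this paper), with F_n the Bayesian free energy at sample size n and S the entropy of the true distribution:

```latex
% Bayesian generalization error as the increase of the free energy:
\mathbb{E}[G_n] \;=\; \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n] - S
% Hence, if F_n - nS stays bounded as the network is made deeper,
% the expected generalization error cannot grow with depth either.
```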
Epistemic uncertainty quantification in deep learning classification by the Delta method
The Delta method is a classical procedure for quantifying epistemic uncertainty in statistical models, but its direct application to deep neural networks is prevented by the large number of parameters. We propose a low-cost approximation of the Delta method, applicable to L2-regularized deep neural networks, based on the top eigenpairs of the Fisher information matrix. We address efficient computation of full-rank approximate eigendecompositions in terms of the exact inverse Hessian, the inverse outer-products-of-gradients approximation, and the so-called Sandwich estimator. Moreover, we provide bounds on the approximation error for the uncertainty of the predictive class probabilities. We show that when the smallest computed eigenvalue of the Fisher information matrix is near the L2-regularization rate, the approximation error will be close to zero even when only a small fraction of the eigenpairs is computed. A demonstration of the methodology is presented using a TensorFlow implementation, and we show that meaningful rankings of images based on predictive uncertainty can be obtained for two LeNet- and ResNet-based neural networks using the MNIST and CIFAR-10 datasets. Further, we observe that false positives have on average a higher predictive epistemic uncertainty than true positives. This suggests that there is supplementary information in the uncertainty measure not captured by the classification alone.
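The role played by the regularization rate in the tail of the spectrum can be illustrated with a toy sketch (all sizes, names, and the L2 rate below are hypothetical, not taken from the paper): when every eigenvalue outside the computed top-K set equals the L2 rate, replacing the tail by that rate makes the low-rank Delta-method variance exact.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N, K = 200, 50, 60      # parameters, samples, eigenpairs kept (toy sizes)
lam = 1e-2                 # hypothetical L2-regularization rate

# Toy "Fisher": outer products of per-sample gradients plus the L2 term.
# With P > N, the P - N directions outside the data span have eigenvalue
# exactly lam, mimicking the regime described in the abstract.
G = rng.normal(size=(N, P))
F = G.T @ G / N + lam * np.eye(P)

evals, evecs = np.linalg.eigh(F)            # ascending eigenvalues
top_vals, top_vecs = evals[-K:], evecs[:, -K:]

def delta_var(g):
    """Approximate Delta-method variance g^T F^{-1} g using the top-K
    eigenpairs and treating every remaining eigenvalue as lam."""
    c = top_vecs.T @ g                      # coordinates in the top eigenspace
    head = np.sum(c**2 / top_vals)          # exact top-K contribution
    tail = (g @ g - c @ c) / lam            # tail directions approximated by lam
    return head + tail

g = rng.normal(size=P)
exact = g @ np.linalg.solve(F, g)           # full-rank reference value
approx = delta_var(g)
print(approx, exact)
```

Because the smallest retained eigenvalue here equals `lam`, `head + tail` reproduces the exact quadratic form, which is the content of the abstract's near-zero-error claim; when the tail eigenvalues exceed `lam`, the sketch over-estimates the variance, i.e. it errs on the conservative side.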
Deep Learning is Singular, and That's Good
In singular models, the optimal set of parameters forms an analytic set with
singularities and classical statistical inference cannot be applied to such
models. This is significant for deep learning as neural networks are singular
and thus "dividing" by the determinant of the Hessian or employing the Laplace
approximation are not appropriate. Despite its potential for addressing
fundamental issues in deep learning, singular learning theory appears to have
made little inroads into the developing canon of deep learning theory. Via a
mix of theory and experiment, we present an invitation to singular learning
theory as a vehicle for understanding deep learning and suggest important
future work to make singular learning theory directly applicable to how deep
learning is performed in practice.
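Why "dividing by the determinant of the Hessian" fails can be made concrete with the standard free-energy asymptotics of singular learning theory (a reminder, not a result of this paper): for a regular model the Laplace/BIC expansion holds, while for a singular model the real log canonical threshold (RLCT) replaces d/2:

```latex
% Regular model (Laplace approximation / BIC), d parameters:
F_n = n L_n(\hat w) + \frac{d}{2}\log n + O_p(1)
% Singular model (Watanabe): the RLCT \lambda \le d/2 and its
% multiplicity m control the asymptotics instead:
F_n = n L_n(\hat w) + \lambda \log n - (m-1)\log\log n + O_p(1)
% Since \lambda can be far below d/2, Hessian-determinant corrections
% systematically overpenalize singular models such as neural networks.
```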
Learning Capacity: A Measure of the Effective Dimensionality of a Model
We exploit a formal correspondence between thermodynamics and inference,
where the number of samples can be thought of as the inverse temperature, to
define a "learning capacity", which is a measure of the effective
dimensionality of a model. We show that the learning capacity is a tiny
fraction of the number of parameters for many deep networks trained on typical
datasets, depends upon the number of samples used for training, and is
numerically consistent with notions of capacity obtained from the PAC-Bayesian
framework. The test error as a function of the learning capacity does not
exhibit double descent. We show that the learning capacity of a model saturates
at very small and very large sample sizes; this provides guidelines as to
whether one should procure more data or whether one should search for new
architectures, to improve performance. We show how the learning capacity can be
used to understand the effective dimensionality, even for non-parametric models
such as random forests and k-nearest neighbor classifiers.
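As a sanity check on the thermodynamic correspondence, here is a sketch using the standard dictionary (inverse temperature β ↔ sample size n), which need not match the paper's exact definitions: if the Bayesian free energy of a regular d-parameter model behaves as F_n ≈ nS + (d/2) log n, then the heat-capacity analogue recovers d/2, one natural reading of "effective dimensionality":

```latex
% With \beta \leftrightarrow n and F_n = -\log Z_n:
F_n \approx nS + \frac{d}{2}\log n
\quad\Longrightarrow\quad
-\,n^2 \frac{\partial^2 F_n}{\partial n^2} \approx \frac{d}{2}
% A measured capacity far below d/2 then signals that the model uses
% far fewer effective dimensions than it has raw parameters.
```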