Capacity Control of ReLU Neural Networks by Basis-path Norm
Recently, the path norm was proposed as a capacity measure for neural
networks with the Rectified Linear Unit (ReLU) activation function; it
takes the rescaling-invariance property of ReLU networks into account. It
has been shown that the generalization error bound in terms of the path
norm explains the empirical generalization behavior of ReLU neural
networks better than bounds based on other capacity measures. Moreover,
optimization algorithms that add the path norm to the loss as a
regularization term, such as Path-SGD, have been shown to achieve better
generalization performance. However, the path norm counts the values of
all paths, so a capacity measure based on it can be improperly influenced
by the dependencies among paths. It is also known that every path of a
ReLU network can be represented, using multiplication and division
operations, by a small group of linearly independent basis paths, which
indicates that the generalization behavior of the network depends on only
a few basis paths. Motivated by this, we propose a new norm, the
\emph{Basis-path Norm}, based on a group of linearly independent paths, to
measure the capacity of neural networks more accurately. We establish a
generalization error bound based on this basis-path norm and show, via
extensive experiments, that it explains the generalization behavior of
ReLU networks more accurately than previous capacity measures. In
addition, we develop optimization algorithms that minimize the empirical
risk regularized by the basis-path norm. Our experiments on benchmark
datasets demonstrate that the proposed regularization method achieves
clearly better test performance than previous regularization approaches.
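To make the rescaling-invariance point concrete, here is a minimal sketch
(my own illustration, not the paper's code) of the ordinary l2 path norm
that the basis-path norm refines: the sum over all input-output paths of
the product of squared weights, computable in one forward pass through the
element-wise squared weight matrices of a bias-free ReLU network.

```python
import numpy as np

def l2_path_norm(weights):
    """l2 path norm of a bias-free ReLU MLP.

    weights: list of matrices W_1, ..., W_L, W_k of shape (n_k, n_{k-1}).
    """
    v = np.ones(weights[0].shape[1])  # one unit of "path mass" per input
    for W in weights:
        v = (W ** 2) @ v              # accumulate products of squared weights
    return np.sqrt(v.sum())           # sum over all paths, then square root

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
print(l2_path_norm(Ws))
# Rescaling invariance: scaling one layer by c and the next by 1/c leaves
# the network function, and the path norm, unchanged.
print(l2_path_norm([2.0 * Ws[0], 0.5 * Ws[1]]))
```

Note that this counts every path; the paper's point is that the paths are
linearly dependent, and its basis-path norm is instead built from a small
group of linearly independent basis paths.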
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
It is widely observed that deep learning models generalize well, even when
the number of learned parameters far exceeds the number of training
samples. We systematically investigate the underlying reasons why deep
neural networks often generalize well, and reveal the difference between
the minima (with the same training error) that generalize well and those
that don't. We show that it is the characteristics of the loss landscape
that explain the good generalization capability. For the loss landscape of
deep networks, the volume of the basin of attraction of good minima
dominates that of poor minima, which ensures that optimization methods
with random initialization converge to good minima. We theoretically
justify our findings by analyzing two-layer neural networks, and show that
the low-complexity solutions have a small norm of the Hessian of the loss
with respect to the model parameters. For deeper networks, extensive
numerical evidence supports our arguments.
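A toy sketch of the complexity measure highlighted above (my illustration
under simplifying assumptions, not the authors' code): train a tiny
two-layer ReLU network by gradient descent, then estimate the Frobenius
norm of the Hessian of the training loss with respect to the parameters by
central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
y = np.tanh(X @ np.array([1.0, -1.0]))   # toy regression targets
n_hidden = 4

def loss(theta):
    W1 = theta[:2 * n_hidden].reshape(n_hidden, 2)  # hidden layer
    w2 = theta[2 * n_hidden:]                       # output layer
    h = np.maximum(X @ W1.T, 0.0)                   # ReLU features
    return np.mean((h @ w2 - y) ** 2)

def num_grad(theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

theta = 0.5 * rng.normal(size=3 * n_hidden)
for _ in range(2000):                 # plain gradient descent to a minimum
    theta -= 0.1 * num_grad(theta)

def hessian(theta, eps=1e-4):         # central second differences
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n); ei[i] = eps
        for j in range(n):
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (loss(theta + ei + ej) - loss(theta + ei - ej)
                       - loss(theta - ei + ej) + loss(theta - ei - ej)) / (4 * eps ** 2)
    return H

print("train loss:", loss(theta))
print("Hessian Frobenius norm:", np.linalg.norm(hessian(theta)))
```

Comparing this norm across minima reached from different random
initializations gives a small-scale analogue of the flat-versus-sharp
comparison the paper carries out at scale.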
Sample Compression, Support Vectors, and Generalization in Deep Learning
Even though Deep Neural Networks (DNNs) are widely celebrated for their
practical performance, they possess many intriguing properties related to depth
that are difficult to explain both theoretically and intuitively. Understanding
how weights in deep networks coordinate together across layers to form useful
learners has proven challenging, in part because the repeated composition of
nonlinearities has proved intractable. This paper presents a reparameterization
of DNNs as a linear function of a feature map that is locally independent of
the weights. This feature map transforms depth-dependencies into simple tensor
products and maps each input to a discrete subset of the feature space. Then,
using a max-margin assumption, the paper develops a sample compression
representation of the neural network in terms of the discrete activation state
of neurons induced by $s$ ``support vectors''. The paper shows that the
number of support vectors $s$ relates to learning guarantees for neural
networks through sample compression bounds, yielding a sample complexity
of $O(ns/\epsilon)$ for networks with $n$ neurons. Finally, the number of
support vectors $s$ is found to monotonically increase with width and
label noise but to decrease with depth.
Comment: 15 pages, 10 figures
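The discrete activation states can be made concrete with a small sketch
(my reading of the setup, with a hypothetical two-layer architecture, not
the authors' code): each input induces a binary pattern recording which
ReLU neurons fire, and within one pattern the network acts as a fixed
linear map of the input, which is what makes a linear (tensor-product)
reparameterization possible.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 4))   # layer 1 weights (hypothetical sizes)
W2 = rng.normal(size=(8, 16))   # layer 2 weights

def activation_state(x):
    """Binary pattern of which neurons fire on input x."""
    h1 = np.maximum(W1 @ x, 0.0)
    h2 = np.maximum(W2 @ h1, 0.0)
    return tuple((h1 > 0).astype(int)) + tuple((h2 > 0).astype(int))

X = rng.normal(size=(500, 4))
states = {activation_state(x) for x in X}
print("distinct activation states over 500 inputs:", len(states))
```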
Data-Dependent Path Normalization in Neural Networks
We propose a unified framework for neural net normalization, regularization
and optimization, which includes Path-SGD and Batch-Normalization and
interpolates between them across two different dimensions. Through this
framework we investigate the issue of invariance of the optimization, data
dependence, and the connection with natural gradients.
Comment: 17 pages, 3 figures
Towards moderate overparameterization: global convergence guarantees for training shallow neural networks
Many modern neural network architectures are trained in an overparameterized
regime where the parameters of the model exceed the size of the training
dataset. Sufficiently overparameterized neural network architectures in
principle have the capacity to fit any set of labels including random noise.
However, given the highly nonconvex nature of the training landscape, it
is not clear what level and kind of overparameterization is required for
first-order methods to converge to a global optimum that perfectly
interpolates any labels. A number of recent theoretical works have shown
that, for very wide neural networks where the number of hidden units is
polynomially large in the size of the training data, gradient descent
starting from a random initialization does indeed converge to a global
optimum. However, in practice much more moderate levels of
overparameterization seem to be sufficient, and in many cases
overparameterized models perfectly interpolate the training data as soon
as the number of parameters exceeds the size of the training data by a
constant factor. Thus there is a huge gap between the existing theoretical
literature and practical experiments. In this paper we take a step towards
closing this gap. Focusing on shallow neural nets and smooth activations,
we show that (stochastic) gradient descent, when initialized at random,
converges at a geometric rate to a nearby global optimum as soon as the
square root of the number of network parameters exceeds the size of the
training data. Our results also benefit from a fast convergence rate and
continue to hold for non-differentiable activations such as Rectified
Linear Units (ReLUs).
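A small empirical sketch of the regime in question (my toy setup, not the
paper's experiments): a one-hidden-layer tanh network whose k*d trainable
first-layer parameters satisfy sqrt(k*d) > n, trained by full-batch
gradient descent from random initialization on random labels; the output
layer is frozen for simplicity. With a suitable step size the loss should
decay toward zero, i.e. the network interpolates the random labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10, 5, 30                   # k*d = 150 params, sqrt(150) > n
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)   # random labels

W = rng.normal(size=(k, d))                        # trained
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # frozen output layer

for _ in range(20000):
    H = np.tanh(X @ W.T)              # hidden activations, shape (n, k)
    r = H @ v - y                     # residuals
    # gradient of 0.5 * ||r||^2 with respect to W (smooth activation)
    G = ((r[:, None] * v[None, :]) * (1.0 - H ** 2)).T @ X
    W -= 0.3 * G

print("final squared loss:", 0.5 * np.sum((np.tanh(X @ W.T) @ v - y) ** 2))
```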
Training wide residual networks for deployment using a single bit for each weight
For fast and energy-efficient deployment of trained deep neural networks on
resource-constrained embedded hardware, each learned weight parameter should
ideally be represented and stored using a single bit. Error rates usually
increase when this requirement is imposed. Here, we report large improvements
in error rates on multiple datasets, for deep convolutional neural networks
deployed with 1-bit-per-weight. Using wide residual networks as our main
baseline, our approach simplifies existing methods that binarize weights by
applying the sign function in training; we apply scaling factors for each layer
with constant unlearned values equal to the layer-specific standard deviations
used for initialization. For CIFAR-10, CIFAR-100 and ImageNet, and models with
1-bit-per-weight requiring less than 10 MB of parameter memory, we achieve
error rates of 3.9%, 18.5% and 26.0% / 8.5% (Top-1 / Top-5) respectively. We
also considered MNIST, SVHN and ImageNet32, achieving 1-bit-per-weight test
results of 0.27%, 1.9%, and 41.3% / 19.1% respectively. For CIFAR, our error
rates halve previously reported values and are within about 1% of our
error rates for the same network with full-precision weights. For networks
that overfit, we also show significant improvements in error rate by not
learning the batch normalization scale and offset parameters; this applies
to both full-precision and 1-bit-per-weight networks. Using a warm-restart
learning-rate schedule, we found that training for 1-bit-per-weight is
just as fast as for full-precision networks, with better accuracy than
standard schedules, achieving about 98%-99% of peak performance in just 62
training epochs for CIFAR-10/100. For full training code and trained
models in MATLAB, Keras, and PyTorch, see
https://github.com/McDonnell-Lab/1-bit-per-weight/ .
Comment: Published as a conference paper at ICLR 2018
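The weight transform described above is simple enough to sketch (a
paraphrase under stated assumptions, not the released code): in the
forward pass each layer uses sign(w) scaled by a constant, unlearned
factor equal to that layer's initialization standard deviation (He
initialization assumed here), while training updates the underlying
real-valued weights via the straight-through estimator.

```python
import numpy as np

def he_std(fan_in):
    # assumed initialization scheme; the paper prescribes layer-specific
    # standard deviations taken from initialization
    return np.sqrt(2.0 / fan_in)

def binarize(w, scale):
    # 1 bit per weight, plus one constant (unlearned) scale per layer
    return scale * np.sign(w)

rng = np.random.default_rng(0)
fan_in, fan_out = 64, 32
scale = he_std(fan_in)
w_real = rng.normal(scale=scale, size=(fan_out, fan_in))  # master weights

x = rng.normal(size=fan_in)
y = binarize(w_real, scale) @ x   # forward pass uses binarized weights

# Straight-through estimator: backprop treats binarize() as the identity,
# so the gradient with respect to the binarized weights is applied to the
# real-valued master weights directly (stand-in gradient below).
grad_y_wrt_wb = rng.normal(size=(fan_out, fan_in))
w_real -= 0.01 * grad_y_wrt_wb
```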
Deep Neural Networks
Deep Neural Networks (DNNs) are universal function approximators providing
state-of-the-art solutions on a wide range of applications. Common
perceptual tasks such as speech recognition, image classification, and
object tracking are now commonly tackled via DNNs. Some fundamental
problems remain: (1) the lack of a mathematical framework providing an
explicit and interpretable input-output formula for any topology, (2)
quantification of DNNs' stability regarding adversarial examples (i.e.,
modified inputs that fool DNN predictions while remaining undetectable to
humans), (3) the absence of generalization guarantees and controllable
behaviors for ambiguous patterns, and (4) how to leverage unlabeled data
to apply DNNs to domains where expert labeling is scarce, as in the
medical field. Answering these points would provide theoretical
perspectives for further developments based on a common ground.
Furthermore, DNNs are now deployed in tremendous societal applications,
pushing the need to fill this theoretical gap to ensure control,
reliability, and interpretability.
Comment: Technical Report
Deep Semi-Random Features for Nonlinear Function Approximation
We propose semi-random features for nonlinear function approximation. The
flexibility of semi-random features lies between the fully adjustable
units in deep learning and the random features used in kernel methods. For
one-hidden-layer models with semi-random features, we prove with no unrealistic
assumptions that the model classes contain an arbitrarily good function as the
width increases (universality), and despite non-convexity, we can find such a
good function (optimization theory) that generalizes to unseen new data
(generalization bound). For deep models, with no unrealistic assumptions, we
prove universal approximation ability, a lower bound on approximation error, a
partial optimization guarantee, and a generalization bound. Depending on the
problems, the generalization bound of deep semi-random features can be
exponentially better than the known bounds of deep ReLU nets; our
generalization error bound can be independent of the depth, the number of
trainable weights, as well as the input dimensionality. In experiments, we
show that semi-random features can match the performance of neural
networks while using slightly more units, and that they outperform random
features while using significantly fewer units. Moreover, we introduce a
new implicit ensemble method using semi-random features.
Comment: AAAI 2018 - Extended version
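A sketch of one hidden layer of semi-random features as I understand the
construction (hedged; consult the paper for the exact definition): each
unit gates a trainable linear response w_k.x with a random, fixed
ReLU-style switch 1[r_k.x > 0], placing it between fully trainable units
and fully random features. With the random gates fixed, a one-hidden-layer
model is linear in its trainable parameters, so least squares can fit it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 64, 200
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + X[:, 1] ** 2          # toy regression target

R = rng.normal(size=(k, d))                 # random weights, never trained
gates = (X @ R.T > 0).astype(float)         # (n, k) fixed random switches

# Unit k on input x contributes 1[r_k.x > 0] * (w_k.x); stacking the
# gated copies of x gives a feature map that is linear in the w_k.
Phi = (gates[:, :, None] * X[:, None, :]).reshape(n, k * d)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("train MSE:", np.mean((Phi @ w - y) ** 2))
```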
Approximation and Estimation for High-Dimensional Deep Learning Networks
It has been experimentally observed in recent years that multi-layer
artificial neural networks have a surprising ability to generalize, even when
trained with far more parameters than observations. Is there a theoretical
basis for this? The best available bounds on their metric entropy and
associated complexity measures are essentially linear in the number of
parameters, which is inadequate to explain this phenomenon. Here we examine the
statistical risk (mean squared predictive error) of multi-layer networks with
$\ell^1$-type controls on their parameters and with ramp activation
functions (also called lower-rectified linear units). In this setting, the
risk is shown to be upper bounded by $[(L^3 \log d)/n]^{1/2}$, where $d$
is the input dimension to each layer, $L$ is the number of layers, and $n$
is the sample size. In this way, the input dimension can be much larger
than the sample size and the estimator can still be accurate, provided the
target function has such $\ell^1$ controls and that the sample size is at
least moderately large compared to $L^3 \log d$. The heart of the analysis
is the development of a sampling strategy that demonstrates the accuracy
of a sparse covering of deep ramp networks. Lower bounds show that the
identified risk is close to being optimal.
What Kinds of Functions do Deep Neural Networks Learn? Insights from Variational Spline Theory
We develop a variational framework to understand the properties of functions
learned by deep neural networks with ReLU activation functions fit to data. We
propose a new function space, which is reminiscent of classical bounded
variation spaces, that captures the compositional structure associated with
deep neural networks. We derive a representer theorem showing that deep ReLU
networks are solutions to regularized data fitting problems in this function
space. The function space consists of compositions of functions from the
(non-reflexive) Banach spaces of second-order bounded variation in the Radon
domain. These are Banach spaces with sparsity-promoting norms, giving insight
into the role of sparsity in deep neural networks. The neural network solutions
have skip connections and rank-bounded weight matrices, providing new
theoretical support for these common architectural choices. The variational
problem we study can be recast as a finite-dimensional neural network training
problem with regularization schemes related to the notions of weight decay and
path-norm regularization. Finally, our analysis builds on techniques from
variational spline theory, providing new connections between deep neural
networks and splines.
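The weight-decay / path-norm connection mentioned above admits a one-unit
numeric check (my illustration, not from the paper): a ReLU unit
x -> v * relu(u.x) is invariant under the rescaling (u, v) -> (c*u, v/c),
and minimizing the weight-decay penalty over that rescaling yields the
path-norm-type quantity |v|*||u||, by the AM-GM inequality with equality
at the balanced scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=3), rng.normal()   # one ReLU unit: x -> v*relu(u.x)

def weight_decay(c):
    # penalty after the function-preserving rescaling (u, v) -> (c*u, v/c)
    return 0.5 * (np.linalg.norm(c * u) ** 2 + (v / c) ** 2)

cs = np.linspace(0.2, 5.0, 10001)
print("min weight decay over rescalings:", min(weight_decay(c) for c in cs))
print("path-norm quantity |v|*||u||    :", abs(v) * np.linalg.norm(u))
```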