Statistical Guarantees for Regularized Neural Networks
Neural networks have become standard tools in the analysis of data, but they
lack comprehensive mathematical theories. For example, there are very few
statistical guarantees for learning neural networks from data, especially for
classes of estimators that are used in practice or that at least resemble them. In
this paper, we develop a general statistical guarantee for estimators that
consist of a least-squares term and a regularizer. We then exemplify this
guarantee with $\ell_1$-regularization, showing that the corresponding
prediction error increases at most sub-linearly in the number of layers and at
most logarithmically in the total number of parameters. Our results establish a
mathematical basis for regularized estimation of neural networks, and they
deepen our mathematical understanding of neural networks and deep learning more
generally.
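To make the estimator class concrete, here is a minimal sketch of a least-squares term plus an $\ell_1$ regularizer over all network weights; the architecture, the penalty level lam, and the use of plain SGD are illustrative assumptions, not the paper's prescriptions.

```python
# Minimal sketch: least-squares fit + l1 penalty on all network weights.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1))
lam = 1e-3  # regularization strength (hypothetical value)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

def objective(x, y):
    # least-squares term + l1 penalty over the total parameter vector
    fit = ((net(x) - y) ** 2).mean()
    penalty = sum(p.abs().sum() for p in net.parameters())
    return fit + lam * penalty

x, y = torch.randn(128, 10), torch.randn(128, 1)
for _ in range(100):
    opt.zero_grad()
    objective(x, y).backward()
    opt.step()
```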
Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview
Substantial progress has been made recently on developing provably accurate
and efficient algorithms for low-rank matrix factorization via nonconvex
optimization. While conventional wisdom often takes a dim view of nonconvex
optimization algorithms due to their susceptibility to spurious local minima,
simple iterative methods such as gradient descent have been remarkably
successful in practice. The theoretical footings, however, had been largely
lacking until recently.
In this tutorial-style overview, we highlight the important role of
statistical models in enabling efficient nonconvex optimization with
performance guarantees. We review two contrasting approaches: (1) two-stage
algorithms, which consist of a tailored initialization step followed by
successive refinement; and (2) global landscape analysis and
initialization-free algorithms. Several canonical matrix factorization problems
are discussed, including but not limited to matrix sensing, phase retrieval,
matrix completion, blind deconvolution, robust principal component analysis,
phase synchronization, and joint alignment. Special care is taken to illustrate
the key technical insights underlying their analyses. This article serves as a
testament that the integrated consideration of optimization and statistics
leads to fruitful research findings.
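The two-stage recipe can be illustrated on symmetric rank-r matrix sensing: a spectral initialization followed by gradient descent on the factorized least-squares objective. The toy below, with made-up problem sizes, step size, and iteration count, is a sketch of the general pattern rather than any specific algorithm from the overview.

```python
# Toy two-stage method for symmetric matrix sensing: y_i = <A_i, X* X*^T>.
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 2, 600
Xstar = rng.normal(size=(n, r))
Mstar = Xstar @ Xstar.T
A = rng.normal(size=(m, n, n))
A = (A + A.transpose(0, 2, 1)) / 2          # symmetrize measurements
y = np.einsum('mij,ij->m', A, Mstar)        # y_i = <A_i, M*>

# Stage 1: spectral initialization from the top-r eigenspace of (1/m) sum y_i A_i
Y = np.einsum('m,mij->ij', y, A) / m
vals, vecs = np.linalg.eigh(Y)
X = vecs[:, -r:] * np.sqrt(np.maximum(vals[-r:], 0))

# Stage 2: vanilla gradient descent on f(X) = (1/2m) sum (<A_i, XX^T> - y_i)^2
eta = 0.25 / np.linalg.norm(Mstar, 2)
for _ in range(300):
    res = np.einsum('mij,ij->m', A, X @ X.T) - y
    grad = 2 * np.einsum('m,mij->ij', res, A) @ X / m
    X -= eta * grad

print(np.linalg.norm(X @ X.T - Mstar) / np.linalg.norm(Mstar))  # relative error
```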
Estimation of High-Dimensional Graphical Models Using Regularized Score Matching
Graphical models are widely used to model stochastic dependences among large
collections of variables. We introduce a new method of estimating undirected
conditional independence graphs based on the score matching loss, introduced by
Hyvarinen (2005), and subsequently extended in Hyvarinen (2007). The
regularized score matching method we propose applies to settings with
continuous observations and allows for computationally efficient treatment of
possibly non-Gaussian exponential family models. In the well-explored Gaussian
setting, regularized score matching avoids issues of asymmetry that arise when
applying the technique of neighborhood selection, and compared to existing
methods that directly yield symmetric estimates, the score matching approach
has the advantage that the considered loss is quadratic and gives piecewise
linear solution paths under $\ell_1$ regularization. Under suitable
irrepresentability conditions, we show that $\ell_1$-regularized score matching
is consistent for graph estimation in sparse high-dimensional settings. Through
numerical experiments and an application to RNAseq data, we confirm that
regularized score matching achieves state-of-the-art performance in the
Gaussian case and provides a valuable tool for computationally efficient
estimation in non-Gaussian graphical models.
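In the Gaussian case the score matching loss is quadratic in the precision matrix K, so a proximal gradient (ISTA) sketch is easy to write down. The form below, L(K) = (1/2) tr(K S K) - tr(K) plus an off-diagonal $\ell_1$ penalty with S the sample covariance, is one standard rendering of this objective; the penalty level and iteration count are illustrative.

```python
# Rough sketch of l1-regularized score matching, Gaussian case, via ISTA.
import numpy as np

def score_matching_graph(X, lam=0.1, iters=500):
    p = X.shape[1]
    S = np.cov(X, rowvar=False)
    K = np.eye(p)
    t = 1.0 / np.linalg.norm(S, 2)          # step size from the Lipschitz constant
    for _ in range(iters):
        grad = 0.5 * (K @ S + S @ K) - np.eye(p)
        K = K - t * grad
        # soft-threshold off-diagonal entries only
        off = np.sign(K) * np.maximum(np.abs(K) - t * lam, 0.0)
        np.fill_diagonal(off, np.diag(K))
        K = (off + off.T) / 2               # keep the estimate symmetric
    return K  # nonzero off-diagonal entries give the estimated graph

X = np.random.default_rng(1).normal(size=(200, 5))
print(np.round(score_matching_graph(X), 2))
```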
Provable Guarantees for Gradient-Based Meta-Learning
We study the problem of meta-learning through the lens of online convex
optimization, developing a meta-algorithm bridging the gap between popular
gradient-based meta-learning and classical regularization-based multi-task
transfer methods. Our method is the first to simultaneously satisfy good sample
efficiency guarantees in the convex setting, with generalization bounds that
improve with task-similarity, while also being computationally scalable to
modern deep learning architectures and the many-task setting. Despite its
simplicity, the algorithm matches, up to a constant factor, a lower bound on
the performance of any such parameter-transfer method under natural task
similarity assumptions. We use experiments in both convex and deep learning
settings to verify and demonstrate the applicability of our theory.
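The flavor of the analyzed family of parameter-transfer methods can be sketched as a Reptile-style loop: run gradient descent within each task from a shared initialization, then nudge the initialization toward the task's solution. The quadratic task losses and step sizes below are invented for illustration and are not the paper's algorithm.

```python
# Sketch of gradient-based parameter transfer across similar tasks.
import numpy as np

rng = np.random.default_rng(0)
phi = np.zeros(5)                 # meta-initialization (shared across tasks)
alpha, beta = 0.1, 0.05           # within-task and meta step sizes

for task in range(200):
    w_star = rng.normal(size=5) * 0.1 + 1.0   # similar tasks: optima cluster
    w = phi.copy()
    for _ in range(10):                        # within-task gradient descent
        grad = w - w_star                      # gradient of (1/2)||w - w_star||^2
        w -= alpha * grad
    phi += beta * (w - phi)                    # meta-update toward task solution

print(phi)  # drifts toward the task-optimum cluster around 1.0
```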
Learning Feature Nonlinearities with Non-Convex Regularized Binned Regression
In many applications, the relationship between the dependent and independent
variables is highly nonlinear. Consequently, for large-scale complex problems,
neural networks and regression trees are commonly preferred over linear models
such as Lasso. This work proposes learning the feature nonlinearities by
binning feature values and finding the best fit in each quantile using
non-convex regularized linear regression. The algorithm first captures the
dependence between neighboring quantiles by enforcing smoothness via
piecewise-constant/linear approximation and then selects a sparse subset of
good features. We prove that the proposed algorithm is statistically and
computationally efficient. In particular, it achieves a linear rate of
convergence while requiring a near-minimal number of samples. Evaluations on
synthetic and real datasets demonstrate that the algorithm is competitive with
current state-of-the-art and accurately learns feature nonlinearities. Finally,
we explore an interesting connection between the binning stage of our algorithm
and sparse Johnson-Lindenstrauss matrices.
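A minimal rendering of the binning stage: map each feature to indicators of its quantile bins and fit a sparse linear model on the binned design, giving a piecewise-constant fit per feature. Plain Lasso stands in for the paper's non-convex regularizer here, and all sizes are toy choices.

```python
# Quantile binning + sparse linear fit = piecewise-constant feature nonlinearities.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, bins = 500, 3, 10
X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)  # nonlinear truth

def bin_features(X, bins):
    cols = []
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0, 1, bins + 1)[1:-1])
        idx = np.searchsorted(edges, X[:, j])   # which quantile bin each value falls in
        cols.append(np.eye(bins)[idx])          # one-hot encode the bin index
    return np.hstack(cols)

B = bin_features(X, bins)
model = Lasso(alpha=0.01).fit(B, y)
print(model.score(B, y))   # R^2 of the piecewise-constant fit
```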
A Unified Framework for Training Neural Networks
The lack of mathematical tractability of Deep Neural Networks (DNNs) has
hindered progress towards having a unified convergence analysis of training
algorithms, in the general setting. We propose a unified optimization framework
for training different types of DNNs, and establish its convergence for
arbitrary loss, activation, and regularization functions, assumed to be smooth.
We show that the framework generalizes well-known first- and second-order training
methods, and thus allows us to show the convergence of these methods for
various DNN architectures and learning tasks, as a special case of our
approach. We discuss some of its applications in training various DNN
architectures (e.g., feed-forward, convolutional, linear networks), to
regression and classification tasks.
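Schematically, such a framework can be viewed as one preconditioned iteration map that recovers gradient descent (identity preconditioner) or Newton-type methods (curvature preconditioner) as special cases. The sketch below illustrates that viewpoint under invented choices; it is not the paper's actual scheme.

```python
# One update rule, different preconditioners -> different training methods.
import numpy as np

def unified_step(theta, grad_fn, precond_fn, lr):
    g = grad_fn(theta)                      # gradient of smooth loss + regularizer
    B = precond_fn(theta)                   # identity -> GD; Hessian -> Newton-type
    return theta - lr * np.linalg.solve(B, g)

# Illustration on a smooth quadratic loss with a smooth l2 regularizer:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_fn = lambda th: A @ th - b + 0.1 * th  # loss gradient + regularizer gradient
theta = np.zeros(2)
for _ in range(100):
    theta = unified_step(theta, grad_fn, lambda th: np.eye(2), lr=0.2)
print(theta)  # approaches the minimizer (A + 0.1 I)^{-1} b
```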
Interaction Screening: Efficient and Sample-Optimal Learning of Ising Models
We consider the problem of learning the underlying graph of an unknown Ising
model on p spins from a collection of i.i.d. samples generated from the model.
We suggest a new estimator that is computationally efficient and requires a
number of samples that is near-optimal with respect to the previously
established information-theoretic lower bound. Our statistical estimator has a physical
interpretation in terms of "interaction screening". The estimator is consistent
and is efficiently implemented using convex optimization. We prove that with
appropriate regularization, the estimator recovers the underlying graph using a
number of samples that is logarithmic in the system size p and exponential in
the maximum coupling-intensity and maximum node-degree.
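For one node u, the interaction-screening objective averages exp(-sigma_u * <theta, sigma_{-u}>) over samples and is minimized with an $\ell_1$ penalty. A proximal-gradient sketch follows; the data, penalty level, and step size are toy choices made only so the code runs end to end.

```python
# Sketch of the interaction-screening estimator for one node u of an Ising model.
import numpy as np

def interaction_screening(sigma, u, lam=0.05, lr=0.1, iters=500):
    p = sigma.shape[1]
    others = [v for v in range(p) if v != u]
    su, s_others = sigma[:, u], sigma[:, others]
    theta = np.zeros(p - 1)
    for _ in range(iters):
        w = np.exp(-su * (s_others @ theta))         # per-sample screening weights
        grad = -(s_others * (w * su)[:, None]).mean(axis=0)
        theta -= lr * grad
        theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0)  # l1 prox
    return theta  # nonzero entries estimate u's neighbors

# Real experiments would use Gibbs-sampled Ising data; random +/-1 spins
# are used here only to exercise the function.
sigma = np.random.default_rng(0).choice([-1, 1], size=(1000, 6))
print(np.round(interaction_screening(sigma, u=0), 3))
```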
To Drop or Not to Drop: Robustness, Consistency and Differential Privacy Properties of Dropout
Training deep belief networks (DBNs) requires optimizing a non-convex
function with an extremely large number of parameters. Naturally, existing
gradient descent (GD) based methods are prone to arbitrarily poor local minima.
In this paper, we rigorously show that such local minima can be avoided (up to
an approximation error) by using the dropout technique, a widely used heuristic
in this domain. In particular, we show that by randomly dropping a few nodes of
a one-hidden layer neural network, the training objective function, up to a
certain approximation error, decreases by a multiplicative factor.
On the flip side, we show that for training convex empirical risk minimizers
(ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple
dropout based GD method for convex ERMs is stable in the face of arbitrary
changes to any one of the training points. Using the above assertion, we show
that dropout provides fast rates for generalization error in learning (convex)
generalized linear models (GLM). Moreover, using the above mentioned stability
properties of dropout, we design dropout based differentially private
algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of
the individual training points while providing accurate predictions for new
test points. Finally, we empirically validate our stability assertions for
dropout in the context of convex ERMs and show that surprisingly, dropout
significantly outperforms (in terms of prediction accuracy) the L2
regularization based methods for several benchmark datasets.
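The stabilizer view of dropout in convex ERM can be illustrated with logistic regression in which each gradient step sees a randomly masked copy of the features. The keep probability, learning rate, and synthetic data below are illustrative assumptions.

```python
# Dropout as a regularizer in a convex ERM: masked-feature gradient descent.
import numpy as np

rng = np.random.default_rng(0)
n, d, keep = 400, 20, 0.8
X = rng.normal(size=(n, d))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)
for _ in range(500):
    mask = rng.random(d) < keep                # drop each feature w.p. 1 - keep
    Xd = X * mask / keep                       # inverted-dropout rescaling
    p = 1 / (1 + np.exp(-Xd @ w))
    w -= 0.1 * Xd.T @ (p - y) / n              # logistic-loss gradient step
print(np.round(w[:4], 2))                      # weight mass sits on the true features
```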
Memory Bounded Deep Convolutional Networks
In this work, we investigate the use of sparsity-inducing regularizers during
training of Convolutional Neural Networks (CNNs). These regularizers encourage
fewer connections in the convolutional and fully connected layers to take
non-zero values, in effect resulting in sparse connectivity between hidden
units in the deep network. This in turn reduces the memory and runtime cost
involved in deploying the learned CNNs. We show that training with such
regularization can still be performed using stochastic gradient descent,
implying that it can be used easily in existing codebases. Experimental
evaluation of our approach on MNIST, CIFAR, and ImageNet datasets shows that
our regularizers can result in dramatic reductions in memory requirements. For
instance, when applied on AlexNet, our method can reduce the memory consumption
by a factor of four with minimal loss in accuracy.
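A minimal sketch of this training setup: ordinary SGD on a small CNN with an added $\ell_1$ (sparsity-inducing) penalty on all weights. The architecture, penalty strength, and random stand-in batch are assumptions, not the paper's configuration.

```python
# SGD on a small CNN with a sparsity-inducing l1 penalty added to the loss.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(8 * 14 * 14, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
lam = 1e-4

x = torch.randn(32, 1, 28, 28)                 # stand-in for an MNIST batch
t = torch.randint(0, 10, (32,))
for _ in range(10):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), t)
    loss = loss + lam * sum(p.abs().sum() for p in model.parameters())
    loss.backward()                            # SGD handles the extra term directly
    opt.step()

with torch.no_grad():
    flat = torch.cat([p.flatten() for p in model.parameters()])
    # fraction of near-zero weights; grows with longer training / larger lam
    print((flat.abs() < 1e-3).float().mean())
```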
Scaleable input gradient regularization for adversarial robustness
In this work we revisit gradient regularization for adversarial robustness
with some new ingredients. First, we derive new per-image theoretical
robustness bounds based on local gradient information. These bounds strongly
motivate input gradient regularization. Second, we implement a scaleable
version of input gradient regularization which avoids double backpropagation:
adversarially robust ImageNet models are trained in 33 hours on four
consumer-grade GPUs. Finally, we show experimentally and through theoretical
certification that input gradient regularization is competitive with
adversarial training. Moreover we demonstrate that gradient regularization does
not lead to gradient obfuscation or gradient masking.
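One way to implement a scaleable variant in the spirit described is to approximate the input-gradient-norm penalty by a directional finite difference, so that training needs no differentiation through a gradient. The model, penalty weight lam, and step h below are illustrative assumptions.

```python
# Input gradient regularization via a finite-difference penalty (no double backprop).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
lam, h = 0.1, 0.01

x = torch.randn(32, 1, 28, 28)
t = torch.randint(0, 10, (32,))
for _ in range(10):
    x.requires_grad_(True)
    loss = F.cross_entropy(model(x), t)
    g, = torch.autograd.grad(loss, x)          # input gradient, not part of the graph
    d = g / (g.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)  # unit direction
    x = x.detach()
    # directional finite difference approximates ||grad_x loss||
    loss_clean = F.cross_entropy(model(x), t)
    loss_pert = F.cross_entropy(model(x + h * d), t)
    penalty = ((loss_pert - loss_clean) / h) ** 2
    opt.zero_grad()
    (loss_clean + lam * penalty).backward()    # one standard backward pass
    opt.step()
```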