2,183 research outputs found
Generalization Error in Deep Learning
Deep learning models have lately shown great performance in various fields
such as computer vision, speech recognition, speech translation, and natural
language processing. However, alongside their state-of-the-art performance, the
source of their generalization ability remains largely unclear. Thus, an
important question is what makes deep neural networks able to generalize well
from the training set to new data. In this article, we provide an overview of
the existing theory and bounds for the characterization of the generalization
error of deep neural networks, combining both classical and more recent
theoretical and empirical results.
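For reference, the generalization error discussed here is usually formalized as the gap between expected and empirical risk; a standard statement (not specific to this article's notation) is:

```latex
% Generalization gap of a hypothesis h trained on a sample S of n i.i.d. pairs
% drawn from a distribution D, with loss function l.
\operatorname{gen}(h)
  = \underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell\!\left(h(x),y\right)}_{\text{expected risk } R(h)}
  \;-\;
  \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\!\left(h(x_i),y_i\right)}_{\text{empirical risk } \hat{R}_S(h)}
```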
Sharp analysis of low-rank kernel matrix approximations
We consider supervised learning problems within the positive-definite kernel
framework, such as kernel ridge regression, kernel logistic regression or the
support vector machine. With kernels leading to infinite-dimensional feature
spaces, a common practical limiting difficulty is the necessity of computing
the kernel matrix, which most frequently leads to algorithms with running time
at least quadratic in the number of observations n, i.e., O(n^2). Low-rank
approximations of the kernel matrix are often considered as they allow the
reduction of running time complexities to O(p^2 n), where p is the rank of the
approximation. The practicality of such methods thus depends on the required
rank p. In this paper, we show that in the context of kernel ridge regression,
for approximations based on a random subset of columns of the original kernel
matrix, the rank p may be chosen to be linear in the degrees of freedom
associated with the problem, a quantity which is classically used in the
statistical analysis of such methods, and is often seen as the implicit number
of parameters of non-parametric estimators. This result enables simple
algorithms that have sub-quadratic running time complexity, but provably
exhibit the same predictive performance as existing algorithms, for any given
problem instance, and not only for worst-case situations.
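As a concrete illustration of the column-sampling approximations discussed above, here is a minimal sketch of kernel ridge regression restricted to a random subset of p columns (a Nyström-style approximation); the RBF kernel, regularization parameter, and function names are illustrative choices, not taken from the paper:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise Gaussian (RBF) kernel between rows of X and Y.
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def nystrom_krr(X, y, p, lam=1e-2, gamma=1.0, seed=0):
    """Kernel ridge regression with a rank-p column-sampling approximation.

    Instead of forming the full n x n kernel matrix, only the n x p block for
    p randomly chosen columns is computed, giving O(p^2 n) training cost.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=p, replace=False)   # random column subset
    K_np = rbf_kernel(X, X[idx], gamma)          # n x p block
    K_pp = K_np[idx]                             # p x p block
    # Subset-of-columns estimator: solve (K_np^T K_np + n*lam*K_pp) alpha = K_np^T y
    A = K_np.T @ K_np + n * lam * K_pp
    alpha = np.linalg.solve(A + 1e-10 * np.eye(p), K_np.T @ y)
    return idx, alpha

def predict(X_train, idx, alpha, X_test, gamma=1.0):
    # Predictions use only the p selected training points.
    return rbf_kernel(X_test, X_train[idx], gamma) @ alpha
```

The point of the paper's analysis is that p can be taken proportional to the degrees of freedom of the problem, rather than to n, without degrading predictive performance.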
Network Lasso: Clustering and Optimization in Large Graphs
Convex optimization is an essential tool for modern data analysis, as it
provides a framework to formulate and solve many problems in machine learning
and data mining. However, general convex optimization solvers do not scale
well, and scalable solvers are often specialized to only work on a narrow class
of problems. Therefore, there is a need for simple, scalable algorithms that
can solve many common optimization problems. In this paper, we introduce the
\emph{network lasso}, a generalization of the group lasso to a network setting
that allows for simultaneous clustering and optimization on graphs. We develop
an algorithm based on the Alternating Direction Method of Multipliers (ADMM) to
solve this problem in a distributed and scalable manner, which allows for
guaranteed global convergence even on large graphs. We also examine a
non-convex extension of this approach. We then demonstrate that many types of
problems can be expressed in our framework. We focus on three in particular -
binary classification, predicting housing prices, and event detection in time
series data - comparing the network lasso to baseline approaches and showing
that it is both a fast and accurate method of solving large optimization
problems.
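To make the formulation concrete, the sketch below solves a small network lasso instance with a generic convex solver; the per-node least-squares losses, edge weights, and cvxpy usage are illustrative assumptions, and the paper's contribution is a distributed ADMM solver rather than this direct approach:

```python
import cvxpy as cp
import numpy as np

def network_lasso(A, b, edges, weights, lam):
    """Solve a small network lasso instance directly with cvxpy.

    Each node i has a local loss ||A[i] @ x_i - b[i]||^2, and each edge (j, k)
    contributes an unsquared l2 penalty lam * w_jk * ||x_j - x_k||_2, which
    encourages neighbouring variables to become exactly equal (clustering).
    """
    n, d = len(A), A[0].shape[1]
    X = [cp.Variable(d) for _ in range(n)]
    loss = sum(cp.sum_squares(A[i] @ X[i] - b[i]) for i in range(n))
    penalty = sum(w * cp.norm(X[j] - X[k], 2) for (j, k), w in zip(edges, weights))
    prob = cp.Problem(cp.Minimize(loss + lam * penalty))
    prob.solve()
    return np.array([x.value for x in X])
```

As lam grows, the edge penalties force more neighbouring x_i to coincide, which is the clustering behaviour the abstract refers to.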
A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning
The Rashomon effect occurs when many different explanations exist for the
same phenomenon. In machine learning, Leo Breiman used this term to
characterize problems where many accurate-but-different models exist to
describe the same data. In this work, we study how the Rashomon effect can be
useful for understanding the relationship between training and test
performance, and the possibility that simple-yet-accurate models exist for many
problems. We consider the Rashomon set - the set of almost-equally-accurate
models for a given problem - and study its properties and the types of models
it could contain. We present the Rashomon ratio as a new measure related to
simplicity of model classes, which is the ratio of the volume of the set of
accurate models to the volume of the hypothesis space; the Rashomon ratio is
different from standard complexity measures from statistical learning theory.
For a hierarchy of hypothesis spaces, the Rashomon ratio can help modelers to
navigate the trade-off between simplicity and accuracy. In particular, we find
empirically that a plot of empirical risk vs. Rashomon ratio forms a
characteristic Γ-shaped Rashomon curve, whose elbow seems to be a
reliable model selection criterion. When the Rashomon set is large, models that
are accurate - but that also have various other useful properties - can often
be obtained. These models might obey various constraints such as
interpretability, fairness, or monotonicity.
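As a rough illustration of the Rashomon ratio, the sketch below samples a simple linear hypothesis class and measures the fraction of sampled models whose empirical risk lies within a tolerance of the best; the hypothesis class, sampling scheme, and tolerance eps are illustrative assumptions rather than the paper's construction:

```python
import numpy as np

def empirical_rashomon_ratio(X, y, n_samples=10000, eps=0.05, seed=0):
    """Monte Carlo estimate of a Rashomon ratio for a toy hypothesis space.

    Hypothesis space: linear classifiers sign(w . x) with ||w|| = 1, sampled
    uniformly on the sphere. The Rashomon set is every sampled hypothesis whose
    empirical risk is within eps of the best sampled risk; the ratio returned
    is the fraction of samples falling in that set.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(n_samples, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)    # uniform directions
    preds = np.sign(X @ W.T)                         # n x n_samples predictions
    risks = np.mean(preds != y[:, None], axis=0)     # empirical risk per hypothesis
    best = risks.min()
    return np.mean(risks <= best + eps)              # fraction in the Rashomon set
```

A larger value of this ratio indicates that many almost-equally-accurate models exist, which is the regime in which the paper argues simple or otherwise constrained models can be found.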
- …