Hypothesis Set Stability and Generalization
We present a study of generalization for data-dependent hypothesis sets. We
give a general learning guarantee for data-dependent hypothesis sets based on a
notion of transductive Rademacher complexity. Our main result is a
generalization bound for data-dependent hypothesis sets expressed in terms of a
notion of hypothesis set stability and a notion of Rademacher complexity for
data-dependent hypothesis sets that we introduce. This bound admits as special
cases both standard Rademacher complexity bounds and algorithm-dependent
uniform stability bounds. We also illustrate the use of these learning bounds
in the analysis of several scenarios.
Comment: Published in NeurIPS 2019. This version is equivalent to the camera-ready version but also includes the supplementary material.
The complexity of algorithmic hypothesis class
University of Technology Sydney. Faculty of Engineering and Information Technology.
Statistical learning theory provides the mathematical and theoretical foundations for statistical learning algorithms and inspires the development of more efficient methods. It is observed that learning algorithms may not output some hypotheses in the predefined hypothesis class. Therefore, in this thesis, we focus on statistical learning theory and study how to measure the complexity of the algorithmic hypothesis class, which is a subset of the predefined hypothesis class that a learning algorithm will (or is likely to) output. By designing complexity measures for the algorithmic hypothesis class, we provide new generalization bounds for k-dimensional coding schemes and multi-task learning and propose two frameworks to derive tighter generalization bounds than the current state-of-the-art.
We take k-dimensional coding schemes, a set of unsupervised learning algorithms, and multi-task learning, a set of supervised learning algorithms, as examples to demonstrate that learning algorithm outputs may have special properties and are therefore included in a subset of the predefined hypothesis class. By analyzing these subsets (the algorithmic hypothesis classes), we shed new light on learning problems and derive tighter generalization bounds than the current state-of-the-art. Specifically, for k-dimensional coding schemes, we show that the induced algorithmic loss function classes are sets of Lipschitz-continuous hypotheses and that a dimensionality-dependent complexity measure helps to derive small Lipschitz constants and thus improve the generalization bounds. For multi-task learning, we prove that tasks can act as regularizers and that feature structures can contribute to a small algorithmic hypothesis class and also help to improve the generalization bounds.
To measure the complexity of the algorithmic hypothesis class more precisely, by exploiting the properties of the hypotheses and of the feature structures, we extend algorithmic robustness and stability to complexity measures for the hypothesis class.
Inspired by the idea of algorithmic robustness, we propose the complexity measure of uniform robustness. Compared to the Rademacher complexity, our measure more finely considers the geometric information of data. For example, when the sample space is covered by a small number of widely separated balls of small radius, the uniform robustness can be very small while the Rademacher complexity can be very large. Moreover, based on the definition of uniform robustness, we also provide a framework to derive generalization bounds for a very general class of learning algorithms.
We exploit the algorithmic hypothesis class of stable algorithms by studying the definition of algorithmic stability. Stable learning algorithms have the property that their outputs will not change much when one training example is changed. This implies that their outputs cannot drift far apart even when the training sample is altered entirely, one example at a time. Thus, stable learning algorithms often have small algorithmic hypothesis classes. However, since no existing measure captures the complexity of this small algorithmic hypothesis class, we design a novel complexity measure called the algorithmic Rademacher complexity to measure the algorithmic hypothesis class of stable learning algorithms and provide sharper error bounds than the current state-of-the-art.
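The single-example perturbation property described above can be illustrated with a small NumPy experiment (a toy sketch, not the thesis's construction): a closed-form ridge regressor is retrained after one training example is replaced, and the largest resulting change in prediction over the sample serves as an empirical proxy for uniform stability.

```python
import numpy as np

def train_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w_full = train_ridge(X, y)

# Replace one training example and retrain; a stable algorithm's
# output should barely move.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d)
y2[0] = rng.normal()
w_pert = train_ridge(X2, y2)

# Largest change in prediction over the sample: a rough empirical
# proxy for uniform stability; it shrinks as n grows.
delta = np.max(np.abs(X @ (w_full - w_pert)))
print(delta)
```

Repeating this for every position of the replaced example and taking the maximum gives a closer analogue of the uniform stability coefficient.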
Sparse mean localization by information theory
Sparse feature selection is necessary when we fit statistical models with
access to a large group of features, do not know which are relevant, but assume
that most are not. Alternatively, when the number of features is larger than
the available data, the model becomes over-parametrized and the sparse feature
selection task involves selecting the most informative variables for the model.
When the model is a simple location model and the number of relevant features
does not grow with the total number of features, sparse feature selection
corresponds to sparse mean estimation. We deal with a simplified mean
estimation problem consisting of an additive model with Gaussian noise and a
mean that lies in a restricted, finite hypothesis space. This restriction simplifies
the mean estimation problem into a selection problem of combinatorial nature.
Although the hypothesis space is finite, its size is exponential in the
dimension of the mean. In limited data settings and when the size of the
hypothesis space depends on the amount of data or on the dimension of the data,
choosing an approximation set of hypotheses is a desirable approach. Choosing a
set of hypotheses instead of a single one implies replacing the bias-variance
trade-off with a resolution-stability trade-off. Generalization capacity
provides a resolution selection criterion based on allowing the learning
algorithm to communicate the largest amount of information in the data to the
learner without error. In this work the theory of approximation set coding and
generalization capacity is explored in order to understand this approach. We
then apply the generalization capacity criterion to the simplified sparse mean
estimation problem and detail an importance sampling algorithm which at once
addresses the difficulty posed by large hypothesis spaces and the slow
convergence of uniform sampling algorithms.
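The contrast between uniform (naive) sampling and importance sampling alluded to above can be seen in a minimal, generic example (not the paper's algorithm): estimating a small Gaussian tail probability, where naive Monte Carlo almost never hits the region of interest while a shifted proposal with density-ratio weights converges quickly.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
N = 10_000
true_p = 0.5 * (1 - erf(4 / sqrt(2)))  # P(Z > 4), about 3.17e-5

# Naive Monte Carlo: almost no samples land in the region of interest,
# so the estimate is essentially 0 at this sample size.
naive = np.mean(rng.normal(size=N) > 4)

# Importance sampling: propose from N(4, 1) and reweight by the
# density ratio phi(z) / q(z) = exp(-z^2/2 + (z-4)^2/2) = exp(8 - 4z).
z = rng.normal(loc=4.0, size=N)
weights = np.exp(8.0 - 4.0 * z)
is_est = np.mean((z > 4) * weights)
```

The same principle, a proposal concentrated where the integrand matters, is what makes sampling over exponentially large hypothesis spaces tractable.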
Almost-everywhere algorithmic stability and generalization error
We explore in some detail the notion of algorithmic stability as a viable
framework for analyzing the generalization error of learning algorithms. We
introduce the new notion of training stability of a learning algorithm and show
that, in a general setting, it is sufficient for good bounds on generalization
error. In the PAC setting, training stability is both necessary and sufficient
for learnability. The approach based on training stability makes no reference
to VC dimension or VC entropy. There is no need to prove uniform convergence,
and generalization error is bounded directly via an extended McDiarmid
inequality. As a result it potentially allows us to deal with a broader class
of learning algorithms than Empirical Risk Minimization. We also explore the
relationships among VC dimension, generalization error, and various notions of
stability. Several examples of learning algorithms are considered.
Comment: Appears in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI 2002).
Stacking and stability
Stacking is a general approach for combining multiple models toward greater
predictive accuracy. It has found various applications across different domains,
owing to its meta-learning nature. Nevertheless, our understanding of how
and why stacking works remains intuitive and lacking in theoretical insight. In
this paper, we use the stability of learning algorithms as an elemental
analysis framework suitable for addressing the issue. To this end, we analyze
the hypothesis stability of stacking, bag-stacking, and dag-stacking and
establish a connection between bag-stacking and weighted bagging. We show that
the hypothesis stability of stacking is a product of the hypothesis stability
of each of the base models and the combiner. Moreover, in bag-stacking and
dag-stacking, the hypothesis stability depends on the sampling strategy used to
generate the training set replicates. Our findings suggest that 1) subsampling
and bootstrap sampling improve the stability of stacking, and 2) stacking
improves the stability of both subbagging and bagging.
Comment: 15 pages, 1 figure.
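A minimal bag-stacking sketch in NumPy (illustrative only; the model choices are arbitrary): level-0 ridge regressors are trained on bootstrap replicates of the training set, and a level-1 combiner is fit on their predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 4
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge regression."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Level-0: base models trained on bootstrap replicates (bag-stacking).
base_weights = []
for _ in range(5):
    idx = rng.integers(0, n, size=n)   # sampling with replacement
    base_weights.append(ridge(X[idx], y[idx], lam=1.0))

# Level-1: combiner trained on the base models' predictions.
Z = np.column_stack([X @ w for w in base_weights])
combiner = ridge(Z, y, lam=1e-3)

y_hat = Z @ combiner
mse = np.mean((y - y_hat) ** 2)
```

Swapping the bootstrap draw for subsampling without replacement turns this into dag-stacking, which is where the paper's sampling-strategy comparison applies.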
Stability of decision trees and logistic regression
Decision trees and logistic regression are among the most popular and
well-known machine learning algorithms, frequently used to solve a variety of
real-world problems. Stability of learning algorithms is a powerful tool to
analyze their performance and sensitivity and subsequently allow researchers to
draw reliable conclusions. However, the stability of these two algorithms has
remained obscure. In this paper, we therefore derive two stability notions for
decision trees and logistic regression: hypothesis and pointwise hypothesis
stability. Additionally, we derive these notions for L2-regularized logistic
regression and confirm existing findings that it is uniformly stable. We show
that the stability of decision trees depends on the number of leaves in the
tree, i.e., its depth, while for logistic regression, it depends on the
smallest eigenvalue of the Hessian matrix of the cross-entropy loss. We show
that logistic regression is not a stable learning algorithm. We construct the
upper bounds on the generalization error of all three algorithms. Moreover, we
present a novel stability measuring framework that allows one to measure the
aforementioned notions of stability. The measures estimate expected loss
differences at an input example and leverage bootstrap sampling to yield
statistically reliable estimates. Finally, we apply this
framework to the three algorithms analyzed in this paper to confirm our
theoretical findings and, in addition, we discuss the possibilities of
developing new training techniques to optimize the stability of logistic
regression, and hence decrease its generalization error.
Comment: 13 pages.
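The idea of estimating loss differences at an input example via bootstrap sampling can be sketched as follows, here with a ridge regressor standing in for the paper's models (an illustrative toy, not the authors' exact procedure): for each bootstrap replicate, the loss at a probed example is compared with and without that example in the training set.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, B = 100, 3, 30
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 0.5, -1.0]) + 0.2 * rng.normal(size=n)

def ridge(X, y, lam=1.0):
    """Closed-form ridge regression."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

diffs = []
for _ in range(B):
    i = rng.integers(n)                  # input example to probe
    idx = rng.integers(0, n, size=n)     # bootstrap training replicate
    keep = idx[idx != i]                 # the same replicate without i
    w_with = ridge(X[idx], y[idx])
    w_without = ridge(X[keep], y[keep])
    # Squared-loss difference at example i when i is removed from training.
    l1 = (X[i] @ w_with - y[i]) ** 2
    l2 = (X[i] @ w_without - y[i]) ** 2
    diffs.append(abs(l1 - l2))

stability_estimate = np.mean(diffs)
```

Averaging over bootstrap replicates is what makes the pointwise estimate statistically reliable; a stable learner yields a small value.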
Stability Analysis and Learning Bounds for Transductive Regression Algorithms
This paper uses the notion of algorithmic stability to derive novel
generalization bounds for several families of transductive regression
algorithms, using both convexity and closed-form solutions. Our analysis
helps compare the stability of these algorithms. It also shows that a number of
widely used transductive regression algorithms are in fact unstable. Finally,
it reports the results of experiments with local transductive regression
demonstrating the benefit of our stability bounds for model selection, for one
of the algorithms, in particular for determining the radius of the local
neighborhood used by the algorithm.
Comment: 26 pages.
A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent
We study the generalization error of randomized learning algorithms --
focusing on stochastic gradient descent (SGD) -- using a novel combination of
PAC-Bayes and algorithmic stability. Importantly, our generalization bounds
hold for all posterior distributions on an algorithm's random hyperparameters,
including distributions that depend on the training data. This inspires an
adaptive sampling algorithm for SGD that optimizes the posterior at runtime. We
analyze this algorithm in the context of our generalization bounds and evaluate
it on a benchmark dataset. Our experiments demonstrate that adaptive sampling
can reduce empirical risk faster than uniform sampling while also improving
out-of-sample accuracy.
Comment: In Neural Information Processing Systems (NIPS) 2017. The latest version specifies that the references to Kuzborskij & Lampert (2017) are for v2 of their manuscript, which was posted to arXiv in March 2017. Importantly, Theorem 3 therein (a stability bound for convex losses) has a different form than the final version.
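A toy sketch in the spirit of adaptive example sampling for SGD (a hypothetical scheme, not the paper's algorithm): on a noise-free least-squares problem, the sampling distribution over training examples is re-weighted by each example's current loss instead of being uniform, so high-error examples are visited more often.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)        # noise-free targets

w = np.zeros(d)
lr = 0.01
init_risk = np.mean(y ** 2)       # empirical risk at w = 0

for t in range(2000):
    # Distribution over example indices: weight each example by its
    # current squared error (plus smoothing) rather than uniformly.
    losses = (X @ w - y) ** 2 + 1e-8
    p = losses / losses.sum()
    i = rng.choice(n, p=p)
    grad = 2 * (X[i] @ w - y[i]) * X[i]
    w -= lr * grad

final_risk = np.mean((X @ w - y) ** 2)
```

In the paper's setting the sampling distribution is a posterior over the algorithm's random hyperparameters and is optimized against a PAC-Bayes bound; this sketch only shows the general shape of loss-driven, data-dependent sampling.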
Graph-based Generalization Bounds for Learning Binary Relations
We investigate the generalizability of learned binary relations: functions
that map pairs of instances to a logical indicator. This problem has
application in numerous areas of machine learning, such as ranking, entity
resolution and link prediction. Our learning framework incorporates an example
labeler that, given a sequence of instances and a desired training size,
subsamples pairs of instances without replacement. The challenge
in analyzing this learning scenario is that pairwise combinations of random
variables are inherently dependent, which prevents us from using traditional
learning-theoretic arguments. We present a unified, graph-based analysis, which
allows us to analyze this dependence using well-known graph identities. We are
then able to bound the generalization error of learned binary relations using
Rademacher complexity and algorithmic stability. The rate of uniform
convergence is partially determined by the labeler's subsampling process. We
thus examine how various assumptions about subsampling affect generalization;
under a natural random subsampling process, our bounds guarantee
uniform convergence.
Learning with Differential Privacy: Stability, Learnability and the Sufficiency and Necessity of ERM Principle
While machine learning has proven to be a powerful data-driven solution to
many real-life problems, its use in sensitive domains has been limited due to
privacy concerns. A popular approach known as **differential privacy** offers
provable privacy guarantees, but it is often observed in practice that it could
substantially hamper learning accuracy. In this paper we study the learnability
(whether a problem can be learned by any algorithm) under Vapnik's general
learning setting with differential privacy constraint, and reveal some
intricate relationships between privacy, stability and learnability.
In particular, we show that a problem is privately learnable **if and only
if** there is a private algorithm that asymptotically minimizes the empirical
risk (AERM). In contrast, for non-private learning AERM alone is not sufficient
for learnability. This result suggests that when searching for private learning
algorithms, we can restrict the search to algorithms that are AERM. In light of
this, we propose a conceptual procedure that always finds a universally
consistent algorithm whenever the problem is learnable under privacy
constraint. We also propose a generic and practical algorithm and show that
under very general conditions it privately learns a wide class of learning
problems. Lastly, we extend some of the results to the more practical
(ε, δ)-differential privacy and establish the existence of a
phase-transition on the class of problems that are approximately privately
learnable with respect to how small δ needs to be.
Comment: to appear, Journal of Machine Learning Research, 201