Data-Dependent Stability of Stochastic Gradient Descent
We establish a data-dependent notion of algorithmic stability for Stochastic Gradient Descent (SGD), and employ it to develop novel generalization bounds. This is in contrast to previous distribution-free algorithmic stability results for SGD, which depend on worst-case constants. By virtue of the data-dependent argument, our bounds provide new insights into learning with SGD on convex and non-convex problems. In the convex case, we show that the bound on the generalization error depends on the risk at the initialization point. In the non-convex case, we prove that the expected curvature of the objective function around the initialization point has a crucial influence on the generalization error. In both cases, our results suggest a simple data-driven strategy to stabilize SGD by pre-screening its initialization. As a corollary, our results allow us to show optimistic generalization bounds that exhibit fast convergence rates for SGD, subject to a vanishing empirical risk and low noise of the stochastic gradients.
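The data-driven strategy mentioned above (pre-screening the initialization) can be pictured with a minimal sketch: try a few candidate starting points, keep the one with the lowest empirical risk, and run SGD from there. The least-squares objective, candidate scales, and step size below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: empirical risk(w) = mean_i (x_i^T w - y_i)^2
d, n = 10, 500
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

def empirical_risk(w):
    return np.mean((X @ w - y) ** 2)

def sgd(w0, steps=2000, lr=0.01):
    w = w0.copy()
    for _ in range(steps):
        i = rng.integers(n)                        # sample one example
        w -= lr * 2 * (X[i] @ w - y[i]) * X[i]     # stochastic gradient step
    return w

# Pre-screening: start SGD from the candidate initialization with the lowest empirical risk.
candidates = [rng.normal(scale=s, size=d) for s in (0.1, 1.0, 3.0)]
w0 = min(candidates, key=empirical_risk)
w_hat = sgd(w0)
print("risk at chosen init:", empirical_risk(w0), "after SGD:", empirical_risk(w_hat))
```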
Transfer learning through greedy subset selection
We study the binary transfer learning problem, focusing on how to select sources from a large pool and how to combine them to yield good performance on a target task. In particular, we consider the transfer learning setting where one does not have direct access to the source data, but rather employs the source hypotheses trained from them. Building on the literature on the best subset selection problem, we propose an efficient algorithm that selects relevant source hypotheses and feature dimensions simultaneously. On three computer vision datasets we achieve state-of-the-art results, substantially outperforming transfer learning and popular feature selection baselines in a small-sample setting. We also theoretically prove that, under reasonable assumptions on the source hypotheses, our algorithm can learn effectively from few examples.
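A minimal sketch, under illustrative assumptions, of the kind of greedy source selection described above: each candidate source hypothesis is represented by its prediction scores on the target sample, and at every step the source whose inclusion most reduces the target training error of a least-squares combination is added. The function name and the fitting rule are assumptions, not the authors' implementation.

```python
import numpy as np

def greedy_source_selection(source_scores, y, k):
    """source_scores: (n, m) scores of m source hypotheses on n target points;
    y: (n,) labels in {-1, +1}; k: maximum number of sources to select."""
    n, m = source_scores.shape
    selected, weights, best_prev_err = [], None, np.inf
    for _ in range(k):
        best_j, best_err, best_w = None, best_prev_err, None
        for j in range(m):
            if j in selected:
                continue
            A = source_scores[:, selected + [j]]
            w, *_ = np.linalg.lstsq(A, y, rcond=None)   # combine the selected sources
            err = np.mean(np.sign(A @ w) != y)          # 0/1 error on the target sample
            if err < best_err:
                best_j, best_err, best_w = j, err, w
        if best_j is None:                              # no remaining source improves the error
            break
        selected.append(best_j)
        weights, best_prev_err = best_w, best_err
    return selected, weights

# Toy usage: 8 noisy source hypotheses, up to 3 of them selected greedily.
rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=200)
scores = 0.5 * y[:, None] + rng.normal(size=(200, 8))
subset, w = greedy_source_selection(scores, y, k=3)
print("selected sources:", subset)
```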
Scalable Greedy Algorithms for Transfer Learning
In this paper we consider the binary transfer learning problem, focusing on how to select and combine sources from a large pool to yield good performance on a target task. Constraining our scenario to the real world, we do not assume direct access to the source data, but rather employ the source hypotheses trained from them. Building on the literature on the best subset selection problem, we propose an efficient algorithm that selects relevant source hypotheses and feature dimensions simultaneously. Our algorithm achieves state-of-the-art results on three computer vision datasets, substantially outperforming both transfer learning and popular feature selection baselines in a small-sample setting. We also present a randomized variant that achieves the same results with a computational cost independent of the number of source hypotheses and feature dimensions. In addition, we theoretically prove that, under reasonable assumptions on the source hypotheses, our algorithm can learn effectively from few examples.
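The randomized variant mentioned above can be sketched, under illustrative assumptions, as follows: rather than scanning all sources at every greedy step, only a fixed-size random batch of candidates is evaluated, so the per-step cost does not grow with the size of the source pool. The batch size and the squared-loss criterion are assumptions, not the paper's exact procedure.

```python
import numpy as np

def randomized_greedy_selection(source_scores, y, k, batch=16, seed=0):
    """Greedy selection that evaluates only `batch` randomly drawn candidate
    sources per step, so each step's cost is independent of the pool size m."""
    rng = np.random.default_rng(seed)
    n, m = source_scores.shape
    selected = []
    for _ in range(k):
        pool = [j for j in rng.choice(m, size=min(batch, m), replace=False)
                if j not in selected]
        if not pool:
            continue
        def squared_err(j):
            A = source_scores[:, selected + [j]]
            w, *_ = np.linalg.lstsq(A, y, rcond=None)
            return np.mean((A @ w - y) ** 2)
        selected.append(min(pool, key=squared_err))     # best candidate within the batch
    return selected
```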
Theory and Algorithms for Hypothesis Transfer Learning
The design and analysis of machine learning algorithms typically considers the problem of learning on a single task, and the nature of learning in such a scenario is well explored. On the other hand, the tasks faced by machine learning systems very often arrive sequentially, and therefore it is reasonable to ask whether a better approach can be taken than retraining such systems from scratch on newly available data. Indeed, drawing an analogy with human learning, a novel skill can be acquired more easily whenever the learner can draw on relevant past experience. In response to this observation, the machine learning community has turned its attention towards a form of learning known as transfer learning: learning a novel task by leveraging auxiliary information extracted from previous tasks. Tangible progress has been made in both the theory and practice of transfer learning; however, many questions remain to be addressed.
In this thesis we focus on an efficient type of transfer learning, known as Hypothesis Transfer Learning (HTL), where auxiliary information is retained in the form of previously induced hypotheses. This is in contrast to the large body of work where one transfers from the data associated with previously encountered tasks. In particular, we theoretically investigate conditions under which HTL guarantees improved generalization on a novel task, given relevant auxiliary (source) hypotheses. We investigate HTL theoretically by considering three scenarios: HTL through regularized least squares with biased regularization, through convex empirical risk minimization, and through stochastic optimization, which also touches on the theory of non-convex transfer learning problems. In addition, we demonstrate the benefits of HTL empirically, by proposing two algorithms tailored to real-life situations with applications to visual learning problems: learning a new class in a multi-class classification setting by transferring from known classes, and an efficient greedy HTL algorithm for learning with a large number of source hypotheses.
From a theoretical point of view, this thesis consistently identifies the key quantitative characteristics of the relatedness between novel and previous tasks, and makes them explicit in generalization bounds. These findings corroborate many previous works in the transfer learning literature and provide a theoretical basis for the design and analysis of new HTL algorithms.
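One of the HTL mechanisms named above, regularized least squares with biased regularization, admits a compact illustration: the regularizer pulls the target solution towards a source hypothesis instead of towards zero. The sketch below uses the standard closed-form solution of this objective; the data and regularization strength are illustrative assumptions.

```python
import numpy as np

def biased_rls(X, y, w_src, lam):
    """Solve min_w (1/n)||Xw - y||^2 + lam * ||w - w_src||^2 in closed form:
    w = (X^T X / n + lam I)^{-1} (X^T y / n + lam * w_src)."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n + lam * w_src
    return np.linalg.solve(A, b)

# Toy usage: a relevant source hypothesis pulls the solution towards the truth
# even when target examples are scarce.
rng = np.random.default_rng(2)
d, n = 5, 30
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
w_src = w_true + 0.1 * rng.normal(size=d)          # a "good" source hypothesis
print(np.linalg.norm(biased_rls(X, y, w_src, lam=1.0) - w_true))
```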
Learning Lipschitz Functions by GD-trained Shallow Overparameterized ReLU Neural Networks
We explore the ability of overparameterized shallow ReLU neural networks to learn Lipschitz, non-differentiable, bounded functions with additive noise when trained by Gradient Descent (GD). To avoid the problem that, in the presence of noise, neural networks trained to nearly zero training error are inconsistent on this class, we focus on early-stopped GD, which allows us to show consistency and optimal rates. In particular, we explore this problem from the viewpoint of the Neural Tangent Kernel (NTK) approximation of a GD-trained finite-width neural network. We show that whenever some early stopping rule is guaranteed to give an optimal rate (of excess risk) on the Hilbert space of the kernel induced by the ReLU activation function, the same rule can be used to achieve the minimax optimal rate for learning the considered class of Lipschitz functions by neural networks. We discuss several practically appealing data-free and data-dependent stopping rules that yield optimal rates.
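A hedged sketch of the setting above: a shallow ReLU network trained by full-batch GD on noisy samples of a Lipschitz, non-differentiable target, stopped by a simple hold-out (data-dependent) rule. The width, step size, and patience-based stopping criterion are illustrative assumptions and do not reproduce the paper's specific stopping rules.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(z, 0.0)

# Noisy samples of a Lipschitz, non-differentiable target f(x) = |x|
n, m = 200, 512                                    # sample size, hidden width
x = rng.uniform(-1, 1, size=(n, 1))
y = np.abs(x[:, 0]) + 0.1 * rng.normal(size=n)
x_val = rng.uniform(-1, 1, size=(100, 1))
y_val = np.abs(x_val[:, 0]) + 0.1 * rng.normal(size=100)

# Wide hidden layer; only W and b are trained, the output weights a stay fixed
W = rng.normal(size=(m, 1))
b = rng.normal(size=m)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def predict(X, W, b):
    return relu(X @ W.T + b) @ a

lr, patience = 0.5, 50
best_val, best_step = np.inf, 0
for t in range(5000):
    H = relu(x @ W.T + b)                          # hidden activations, shape (n, m)
    err = H @ a - y                                # residuals on the training sample
    # Gradient of the mean squared error (up to a constant factor folded into lr)
    G = (err[:, None] * (H > 0)) * a               # shape (n, m)
    W -= lr * (G.T @ x) / n
    b -= lr * G.mean(axis=0)
    val = np.mean((predict(x_val, W, b) - y_val) ** 2)
    if val < best_val:
        best_val, best_step = val, t
    elif t - best_step > patience:                 # early stopping: no recent improvement
        break
print("stopped at step", t, "hold-out MSE", best_val)
```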
Mixture Weight Estimation and Model Prediction in Multi-source Multi-target Domain Adaptation
We consider the problem of learning a model from multiple heterogeneous sources with the goal of performing well on a new target distribution. The goal of the learner is to mix these data sources in a target-distribution-aware way and simultaneously minimize the empirical risk on the mixed source. The literature has made some tangible advances in establishing a theory of learning on mixture domains. However, two problems remain unsolved: first, how to estimate the optimal mixture of sources given a target domain; second, when there are numerous target domains, how to solve empirical risk minimization (ERM) for each target, using a possibly unique mixture of data sources, in a computationally efficient manner. In this paper we address both problems efficiently and with guarantees. We cast the first problem, mixture weight estimation, as a convex-nonconcave compositional minimax problem, and propose an efficient stochastic algorithm with provable stationarity guarantees. Next, for the second problem, we identify that in certain regimes solving ERM for each target domain individually can be avoided; instead, the parameters of a target-optimal model can be viewed as a non-linear function on the space of mixture coefficients. Building upon this, we show that in the offline setting, a GD-trained overparameterized neural network can provably learn such a function and predict the model of a target domain instead of solving a designated ERM problem. Finally, we also consider an online setting and propose a label-efficient online algorithm that predicts parameters for new targets given an arbitrary sequence of mixing coefficients, while enjoying regret guarantees.
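As a minimal illustration of the basic object above, the sketch below fits a model by gradient descent on the mixture-weighted empirical risk over several source datasets for a given vector of mixture weights. The paper's minimax estimation of the weights and its model-prediction network are not reproduced; the data, loss, and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def mixture_erm(sources, alpha, lr=0.1, steps=500):
    """sources: list of (X_k, y_k); alpha: mixture weights (non-negative, summing to 1).
    Minimizes sum_k alpha_k * (mean squared error on source k) by gradient descent."""
    d = sources[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(steps):
        grad = np.zeros(d)
        for a_k, (Xk, yk) in zip(alpha, sources):
            grad += a_k * 2.0 * Xk.T @ (Xk @ w - yk) / len(yk)
        w -= lr * grad
    return w

# Toy usage: three sources; the first two are close to the target task.
d = 8
w_target = rng.normal(size=d)
sources = []
for shift in (0.1, 0.2, 2.0):                      # the third source is far from the target
    Xk = rng.normal(size=(100, d))
    sources.append((Xk, Xk @ (w_target + shift * rng.normal(size=d))))
alpha = np.array([0.5, 0.4, 0.1])                  # target-aware mixture weights
print("distance to target model:", np.linalg.norm(mixture_erm(sources, alpha) - w_target))
```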