12 research outputs found
The Benefit of Multitask Representation Learning
We discuss a general method to learn data representations from multiple
tasks. We provide a justification for this method in both settings of multitask
learning and learning-to-learn. The method is illustrated in detail in the
special case of linear feature learning. Conditions on the theoretical
advantage offered by multitask representation learning over independent task
learning are established. In particular, focusing on the important example of
half-space learning, we derive the regime in which multitask representation
learning is beneficial over independent task learning, as a function of the
sample size, the number of tasks and the intrinsic data dimensionality. Other
potential applications of our results include multitask feature learning in
reproducing kernel Hilbert spaces and multilayer, deep networks.Comment: To appear in Journal of Machine Learning Research (JMLR). 31 page
Structured estimation with atomic norms: General bounds and applications.
Abstract For structured estimation problems with atomic norms, recent advances in the literature express sample complexity and estimation error bounds in terms of certain geometric measures, in particular Gaussian width of the unit norm ball, Gaussian width of a spherical cap induced by a tangent cone, and a restricted norm compatibility constant. However, given an atomic norm, bounding these geometric measures can be difficult. In this paper, we present general upper bounds for such geometric measures, which only require simple information of the atomic norm under consideration, and we establish tightness of these bounds by providing the corresponding lower bounds. We show applications of our analysis to certain atomic norms, especially k-support norm, for which existing result is incomplete
Outlier detection using distributionally robust optimization under the Wasserstein metric
We present a Distributionally Robust Optimization (DRO) approach to outlier detection in a linear regression setting, where the closeness of probability distributions is measured using the Wasserstein metric. Training samples contaminated with outliers skew the regression plane computed by least squares and thus impede outlier detection. Classical approaches, such as robust regression, remedy this problem by downweighting the contribution of atypical data points. In contrast, our Wasserstein DRO approach hedges against a family of distributions that are close to the empirical distribution. We show that the resulting formulation encompasses a class of models, which include the regularized Least Absolute Deviation (LAD) as a special case. We provide new insights into the regularization term and give guidance on the selection of the regularization coefficient from the standpoint of a confidence region. We establish two types of performance guarantees for the solution to our formulation under mild conditions. One is related to its out-of-sample behavior, and the other concerns the discrepancy between the estimated and true regression planes. Extensive numerical results demonstrate the superiority of our approach to both robust regression and the regularized LAD in terms of estimation accuracy and outlier detection rates
Estimation with Norm Regularization
Analysis of non-asymptotic estimation error and structured statistical
recovery based on norm regularized regression, such as Lasso, needs to consider
four aspects: the norm, the loss function, the design matrix, and the noise
model. This paper presents generalizations of such estimation error analysis on
all four aspects compared to the existing literature. We characterize the
restricted error set where the estimation error vector lies, establish
relations between error sets for the constrained and regularized problems, and
present an estimation error bound applicable to any norm. Precise
characterizations of the bound is presented for isotropic as well as
anisotropic subGaussian design matrices, subGaussian noise models, and convex
loss functions, including least squares and generalized linear models. Generic
chaining and associated results play an important role in the analysis. A key
result from the analysis is that the sample complexity of all such estimators
depends on the Gaussian width of a spherical cap corresponding to the
restricted error set. Further, once the number of samples crosses the
required sample complexity, the estimation error decreases as
, where depends on the Gaussian width of the unit norm
ball.Comment: Fixed technical issues. Generalized some result
IST Austria Thesis
Traditionally machine learning has been focusing on the problem of solving a single
task in isolation. While being quite well understood, this approach disregards an
important aspect of human learning: when facing a new problem, humans are able to
exploit knowledge acquired from previously learned tasks. Intuitively, access to several
problems simultaneously or sequentially could also be advantageous for a machine
learning system, especially if these tasks are closely related. Indeed, results of many
empirical studies have provided justification for this intuition. However, theoretical
justifications of this idea are rather limited.
The focus of this thesis is to expand the understanding of potential benefits of information
transfer between several related learning problems. We provide theoretical
analysis for three scenarios of multi-task learning - multiple kernel learning, sequential
learning and active task selection. We also provide a PAC-Bayesian perspective on
lifelong learning and investigate how the task generation process influences the generalization
guarantees in this scenario. In addition, we show how some of the obtained
theoretical results can be used to derive principled multi-task and lifelong learning
algorithms and illustrate their performance on various synthetic and real-world datasets
Multitask and transfer learning for multi-aspect data
Supervised learning aims to learn functional relationships between inputs and outputs. Multitask learning tackles supervised learning tasks by performing them simultaneously to exploit commonalities between them. In this thesis, we focus on the problem of eliminating negative transfer in order to achieve better performance in multitask learning. We start by considering a general scenario in which the relationship between tasks is unknown. We then narrow our analysis to the case where data are characterised by a combination of underlying aspects, e.g., a dataset of images of faces, where each face is determined by a person's facial structure, the emotion being expressed, and the lighting conditions. In machine learning there have been numerous efforts based on multilinear models to decouple these aspects but these have primarily used techniques from the field of unsupervised learning. In this thesis we take inspiration from these approaches and hypothesize that supervised learning methods can also benefit from exploiting these aspects. The contributions of this thesis are as follows: 1. A multitask learning and transfer learning method that avoids negative transfer when there is no prescribed information about the relationships between tasks. 2. A multitask learning approach that takes advantage of a lack of overlapping features between known groups of tasks associated with different aspects. 3. A framework which extends multitask learning using multilinear algebra, with the aim of learning tasks associated with a combination of elements from different aspects. 4. A novel convex relaxation approach that can be applied both to the suggested framework and more generally to any tensor recovery problem. Through theoretical validation and experiments on both synthetic and real-world datasets, we show that the proposed approaches allow fast and reliable inferences. Furthermore, when performing learning tasks on an aspect of interest, accounting for secondary aspects leads to significantly more accurate results than using traditional approaches