128 research outputs found
Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization
We introduce a proximal version of the stochastic dual coordinate ascent
method and show how to accelerate the method using an inner-outer iteration
procedure. We analyze the runtime of the framework and obtain rates that
improve state-of-the-art results for various key machine learning optimization
problems including SVM, logistic regression, ridge regression, Lasso, and
multiclass SVM. Experiments validate our theoretical findings
Distributed Machine Learning via Sufficient Factor Broadcasting
Matrix-parametrized models, including multiclass logistic regression and
sparse coding, are used in machine learning (ML) applications ranging from
computer vision to computational biology. When these models are applied to
large-scale ML problems starting at millions of samples and tens of thousands
of classes, their parameter matrix can grow at an unexpected rate, resulting in
high parameter synchronization costs that greatly slow down distributed
learning. To address this issue, we propose a Sufficient Factor Broadcasting
(SFB) computation model for efficient distributed learning of a large family of
matrix-parameterized models, which share the following property: the parameter
update computed on each data sample is a rank-1 matrix, i.e., the outer product
of two "sufficient factors" (SFs). By broadcasting the SFs among worker
machines and reconstructing the update matrices locally at each worker, SFB
improves communication efficiency --- communication costs are linear in the
parameter matrix's dimensions, rather than quadratic --- without affecting
computational correctness. We present a theoretical convergence analysis of
SFB, and empirically corroborate its efficiency on four different
matrix-parametrized ML models
Discrete-Continuous ADMM for Transductive Inference in Higher-Order MRFs
This paper introduces a novel algorithm for transductive inference in
higher-order MRFs, where the unary energies are parameterized by a variable
classifier. The considered task is posed as a joint optimization problem in the
continuous classifier parameters and the discrete label variables. In contrast
to prior approaches such as convex relaxations, we propose an advantageous
decoupling of the objective function into discrete and continuous subproblems
and a novel, efficient optimization method related to ADMM. This approach
preserves integrality of the discrete label variables and guarantees global
convergence to a critical point. We demonstrate the advantages of our approach
in several experiments including video object segmentation on the DAVIS data
set and interactive image segmentation
Efficient Multi-Template Learning for Structured Prediction
Conditional random field (CRF) and Structural Support Vector Machine
(Structural SVM) are two state-of-the-art methods for structured prediction
which captures the interdependencies among output variables. The success of
these methods is attributed to the fact that their discriminative models are
able to account for overlapping features on the whole input observations. These
features are usually generated by applying a given set of templates on labeled
data, but improper templates may lead to degraded performance. To alleviate
this issue, in this paper, we propose a novel multiple template learning
paradigm to learn structured prediction and the importance of each template
simultaneously, so that hundreds of arbitrary templates could be added into the
learning model without caution. This paradigm can be formulated as a special
multiple kernel learning problem with exponential number of constraints. Then
we introduce an efficient cutting plane algorithm to solve this problem in the
primal, and its convergence is presented. We also evaluate the proposed
learning paradigm on two widely-studied structured prediction tasks,
\emph{i.e.} sequence labeling and dependency parsing. Extensive experimental
results show that the proposed method outperforms CRFs and Structural SVMs due
to exploiting the importance of each template. Our complexity analysis and
empirical results also show that our proposed method is more efficient than
OnlineMKL on very sparse and high-dimensional data. We further extend this
paradigm for structured prediction using generalized -block norm
regularization with , and experiments show competitive performances when
Efficient and Modular Implicit Differentiation
Automatic differentiation (autodiff) has revolutionized machine learning. It
allows expressing complex computations by composing elementary ones in creative
ways and removes the burden of computing their derivatives by hand. More
recently, differentiation of optimization problem solutions has attracted
widespread attention with applications such as optimization as a layer, and in
bi-level problems such as hyper-parameter optimization and meta-learning.
However, the formulas for these derivatives often involve case-by-case tedious
mathematical derivations. In this paper, we propose a unified, efficient and
modular approach for implicit differentiation of optimization problems. In our
approach, the user defines (in Python in the case of our implementation) a
function capturing the optimality conditions of the problem to be
differentiated. Once this is done, we leverage autodiff of and implicit
differentiation to automatically differentiate the optimization problem. Our
approach thus combines the benefits of implicit differentiation and autodiff.
It is efficient as it can be added on top of any state-of-the-art solver and
modular as the optimality condition specification is decoupled from the
implicit differentiation mechanism. We show that seemingly simple principles
allow to recover many recently proposed implicit differentiation methods and
create new ones easily. We demonstrate the ease of formulating and solving
bi-level optimization problems using our framework. We also showcase an
application to the sensitivity analysis of molecular dynamics.Comment: V2: some corrections and link to softwar
Sur les methodes efficaces d’estimation statistique a haute dimension
In this thesis we consider several aspects of parameter estimation for statistics and machine learning and optimization techniques applicable to these problems. The goal of parameter estimation is to find the unknown hidden parameters, which govern the data, for example parameters of an unknown probability density. The construction of estimators through optimization problems is only one side of the coin, finding the optimal value of the parameter often is an optimization problem that needs to be solved, using various optimization techniques. Hopefully these optimization problems are convex for a wide class of problems, and we can exploit their structure to get fast convergence rates. The first main contribution of the thesis is to develop moment-matching techniques for multi-index non-linear regression problems. We consider the classical non-linear regression problem, which is unfeasible in high dimensions due to the curse of dimensionality; that is why we assume a model, which states that in fact the data is a nonlinear function of several linear projections of data. We combine two existing techniques: ADE and SIR to develop the hybrid method without some of the weak sides of its parents: it works both in multi-index models and with weak assumptions on the data distribution. In the second main contribution we use a special type of averaging for stochastic gradient descent. We consider conditional exponential families (such as logistic regression), where the goal is to find the unknown value of the parameter. Classical approaches, such as SGD with constant step-size are known to converge only to some neighborhood of the optimal value of the parameter, even with averaging. We propose the averaging of moment parameters, which we call prediction functions. For finite-dimensional models this type of averaging can lead to negative error, i.e., this approach provides us with the estimator better than any linear estimator can ever achieve. For infinite-dimensional models our approach converges to the optimal prediction, while parameter averaging never does.The third main contribution of this thesis deals with Fenchel-Young losses. We consider multi-class linear classifiers with the losses of a certain type, such that their dual conjugate has a direct product of simplices as a support. The corresponding saddle-point convex-concave formulation has a special form with a bilinear matrix term and classical approaches suffer from the time-consuming multiplication of matrices. We show, that for multi-class SVM losses with smart matrix-multiplication sampling techniques, our approach has an iteration complexity which is sublinear, i.e., we need to pay only trice : for number of classes , number of features and number of samples , whereas all existing techniques have higher complexity. This is possible due to the right choice of geometries and using a mirror descent approach.Dans cette thèse, nous examinons plusieurs aspects de l'estimation des paramètres pour les statistiques et les techniques d'apprentissage automatique, aussi que les méthodes d'optimisation applicables à ces problèmes. Le but de l'estimation des paramètres est de trouver les paramètres cachés inconnus qui régissent les données, par exemple les paramètres dont la densité de probabilité est inconnue. La construction d'estimateurs par le biais de problèmes d'optimisation n'est qu'une partie du problème, trouver la valeur optimale du paramètre est souvent un problème d'optimisation qui doit être résolu, en utilisant diverses techniques. Ces problèmes d'optimisation sont souvent convexes pour une large classe de problèmes, et nous pouvons exploiter leur structure pour obtenir des taux de convergence rapides. La première contribution principale de la thèse est de développer des techniques d'appariement de moments pour des problèmes de régression non linéaire multi-index. Nous considérons le problème classique de régression non linéaire, qui est irréalisable dans des dimensions élevées en raison de la malédiction de la dimensionnalité et c'est pourquoi nous supposons que les données sont en fait une fonction non linéaire de plusieurs projections linéaires des données. Nous combinons deux techniques existantes : ADE et SIR pour développer la méthode hybride sans certain des aspects faibles de ses parents : elle fonctionne à la fois dans des modèles multi-index et avec des hypothèses faibles sur la distribution des données. Dans la deuxième contribution principale, nous utilisons un type particulier de calcul de la moyenne pour la descente stochastique du gradient. Nous considérons les familles exponentielles conditionnelles (comme la régression logistique), où l'objectif est de trouver la valeur inconnue du paramètre. Les approches classiques, telles que SGD avec une taille de pas constante, ne convergent que vers un certain voisinage de la valeur optimale du paramètre, même avec le calcul de la moyenne. Nous proposons le calcul de la moyenne des paramètres de moments, que nous appelons fonctions de prédiction. Pour les modèles à dimensions finies, ce type de calcul de la moyenne peut entraîner une erreur négative, c'est-à-dire que cette approche nous fournit un estimateur meilleur que tout estimateur linéaire ne peut jamais le faire. Pour les modèles à dimensions infinies, notre approche converge vers la prédiction optimale, alors que le calcul de la moyenne des paramètres ne le fait jamais.La troisième contribution principale de cette thèse porte sur les pertes de Fenchel-Young. Nous considérons des classificateurs linéaires multi-classes avec les pertes d'un certain type, de sorte que leur double conjugué a un produit direct de simplices comme support. La formulation convexe-concave à point-selle correspondante a une forme spéciale avec un terme de matrice bilinéaire et les approches classiques souffrent de la multiplication des matrices qui prend beaucoup de temps. Nous montrons que pour les pertes SVM multi-classes avec des techniques d'échantillonnage efficaces, notre approche a une complexité d'itération sous-linéaire, c'est-à-dire que nous devons payer seulement trois fois : pour le nombre de classes , le nombre de caractéristiques et le nombre d'échantillons , alors que toutes les techniques existantes sont plus complexes. Ceci est possible grâce au bon choix des géométries et à l'utilisation d'une approche de descente en miroir
- …