Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics
In a wide range of statistical learning problems such as ranking, clustering
or metric learning among others, the risk is accurately estimated by
U-statistics of degree d ≥ 1, i.e. functionals of the training data with
low variance that take the form of averages over d-tuples. From a
computational perspective, the calculation of such statistics is highly
expensive even for a moderate sample size n, as it requires averaging O(n^d)
terms. This makes learning procedures relying on the optimization of
such data functionals hardly feasible in practice. It is the major goal of this
paper to show that, strikingly, such empirical risks can be replaced by
drastically computationally simpler Monte-Carlo estimates based on O(n) terms
only, usually referred to as incomplete U-statistics, without damaging the
O_P(1/√n) learning rate of Empirical Risk Minimization (ERM)
procedures. For this purpose, we establish uniform deviation results describing
the error made when approximating a U-process by its incomplete version under
appropriate complexity assumptions. Extensions to model selection, fast rate
situations and various sampling techniques are also considered, as well as an
application to stochastic gradient descent for ERM. Finally, numerical examples
are displayed in order to provide strong empirical evidence that the approach
we promote largely surpasses more naive subsampling techniques.
Comment: To appear in Journal of Machine Learning Research. 34 pages. v2:
minor correction to Theorem 4 and its proof, added 1 reference. v3: typo
corrected in Proposition 3. v4: improved presentation, added experiments on
model selection for clustering, fixed minor typo
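To make the incomplete U-statistic idea concrete, here is a minimal sketch
(not the paper's implementation): instead of averaging a kernel over all
O(n^d) tuples, draw B tuples uniformly at random and average over those. The
kernel, the tuple budget B, and the Gini-mean-difference example are
illustrative choices.

```python
import numpy as np

def incomplete_u_statistic(X, kernel, B, degree=2, rng=None):
    """Monte-Carlo approximation of a degree-`degree` U-statistic.

    The complete statistic averages `kernel` over all O(n^degree) tuples;
    here we average over B tuples drawn uniformly at random instead.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    total = 0.0
    for _ in range(B):
        idx = rng.choice(n, size=degree, replace=False)  # one random tuple
        total += kernel(*X[idx])
    return total / B

# Illustrative use: the Gini mean difference is a degree-2 U-statistic
# with kernel h(x, y) = |x - y|.
X = np.random.default_rng(0).normal(size=1000)
estimate = incomplete_u_statistic(X, lambda x, y: abs(x - y), B=5000)
```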
Similarity Learning for High-Dimensional Sparse Data
A good measure of similarity between data points is crucial to many tasks in
machine learning. Similarity and metric learning methods learn such measures
automatically from data, but they do not scale well with respect to the
dimensionality of the data. In this paper, we propose a method that can
efficiently learn a similarity measure from high-dimensional sparse data. The
core idea
is to parameterize the similarity measure as a convex combination of rank-one
matrices with specific sparsity structures. The parameters are then optimized
with an approximate Frank-Wolfe procedure to maximally satisfy relative
similarity constraints on the training data. Our algorithm greedily
incorporates one pair of features at a time into the similarity measure,
providing an efficient way to control the number of active features and thus
reduce overfitting. It enjoys very appealing convergence guarantees and its
time and memory complexity depend on the sparsity of the data instead of the
dimension of the feature space. Our experiments on real-world high-dimensional
datasets demonstrate its potential for classification, dimensionality reduction
and data exploration.
Comment: 14 pages. Proceedings of the 18th International Conference on
Artificial Intelligence and Statistics (AISTATS 2015). Matlab code:
https://github.com/bellet/HDS
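The sketch below illustrates the kind of Frank-Wolfe loop the abstract
describes; it is a toy reconstruction, not the released Matlab code. The atom
set lam * b b^T with b = e_i ± e_j, the hinge loss on triplets, and the
exhaustive O(d^2) oracle scan are simplifying assumptions.

```python
import numpy as np
from itertools import combinations

def frank_wolfe_similarity(X, triplets, T=100, lam=1.0):
    """Toy Frank-Wolfe similarity learning over sparse rank-one atoms.

    Learns S(x, y) = x @ M @ y, with M a convex combination of atoms
    lam * b b^T, b = e_i + s * e_j, s in {+1, -1}: each step activates
    at most one new pair of features, keeping M sparse.
    """
    n, d = X.shape
    M = np.zeros((d, d))

    def subgradient(M):
        # Subgradient of the average hinge loss over relative constraints
        # "x_a should be more similar to x_p than to x_q".
        G = np.zeros((d, d))
        for a, p, q in triplets:
            if X[a] @ M @ (X[p] - X[q]) < 1.0:  # violated/active margin
                G -= np.outer(X[a], X[p] - X[q])
        return G / len(triplets)

    for t in range(T):
        G = subgradient(M)
        # Linear minimization oracle: atom minimizing <G, lam * b b^T>,
        # scanned exhaustively here for clarity (the paper's point is that
        # this can exploit sparsity, which this toy version ignores).
        best, best_val = None, np.inf
        for i, j in combinations(range(d), 2):
            for s in (1.0, -1.0):
                val = lam * (G[i, i] + G[j, j] + s * (G[i, j] + G[j, i]))
                if val < best_val:
                    best_val, best = val, (i, j, s)
        i, j, s = best
        b = np.zeros(d)
        b[i], b[j] = 1.0, s
        gamma = 2.0 / (t + 2.0)  # standard Frank-Wolfe step size
        M = (1.0 - gamma) * M + gamma * lam * np.outer(b, b)
    return M
```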
A Survey on Metric Learning for Feature Vectors and Structured Data
The need for appropriate ways to measure the distance or similarity between
data is ubiquitous in machine learning, pattern recognition and data mining,
but handcrafting such good metrics for specific problems is generally
difficult. This has led to the emergence of metric learning, which aims at
automatically learning a metric from data and has attracted a lot of interest
in machine learning and related fields for the past ten years. This survey
paper proposes a systematic review of the metric learning literature,
highlighting the pros and cons of each approach. We pay particular attention to
Mahalanobis distance metric learning, a well-studied and successful framework,
but additionally present a wide range of methods that have recently emerged as
powerful alternatives, including nonlinear metric learning, similarity learning
and local metric learning. Recent trends and extensions, such as
semi-supervised metric learning, metric learning for histogram data and the
derivation of generalization guarantees, are also covered. Finally, this survey
addresses metric learning for structured data, in particular edit distance
learning, and attempts to give an overview of the remaining challenges in
metric learning for the years to come.
Comment: Technical report, 59 pages. Changes in v2: fixed typos and improved
presentation. Changes in v3: fixed typos. Changes in v4: fixed typos and new
method
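For reference, the Mahalanobis framework the survey highlights parameterizes
the distance by a positive semidefinite matrix M; a minimal sketch of the
distance and of the PSD projection many learners rely on:

```python
import numpy as np

def mahalanobis_distance(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) for a PSD matrix M.

    M = I recovers the Euclidean distance; metric learning methods fit M
    from pair or triplet constraints while keeping it positive semidefinite.
    """
    diff = x - y
    return float(np.sqrt(diff @ M @ diff))

def project_psd(M):
    """Projection onto the PSD cone: clip negative eigenvalues to zero."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T
```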
Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions
In decentralized networks (of sensors, connected objects, etc.), there is an
important need for efficient algorithms to optimize a global cost function, for
instance to learn a global model from the local data collected by each
computing unit. In this paper, we address the problem of decentralized
minimization of pairwise functions of the data points, where these points are
distributed over the nodes of a graph defining the communication topology of
the network. This general problem finds applications in ranking, distance
metric learning and graph inference, among others. We propose new gossip
algorithms based on dual averaging which aim at solving such problems in both
synchronous and asynchronous settings. The proposed framework is flexible
enough to deal with constrained and regularized variants of the optimization
problem. Our theoretical analysis reveals that the proposed algorithms preserve
the convergence rate of centralized dual averaging up to an additive bias term.
We present numerical simulations on Area Under the ROC Curve (AUC) maximization
and metric learning problems which illustrate the practical interest of our
approach.
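A minimal synchronous sketch of decentralized dual averaging over a gossip
matrix, in the spirit of the algorithms described above; the paper's versions
additionally propagate data points to handle pairwise objectives and cover
asynchronous updates, and the ball constraint and step size here are
illustrative assumptions.

```python
import numpy as np

def gossip_dual_averaging(grads, W, d, T=200, R=1.0):
    """Toy synchronous decentralized dual averaging.

    grads: grads[i](x) returns node i's local (sub)gradient at x.
    W: doubly stochastic gossip matrix supported on the network graph.
    d: model dimension. Nodes mix dual variables with their neighbors,
    add their local gradient, then map back to the primal by a projected
    scaling step (here onto the Euclidean ball of radius R).
    """
    n = len(grads)
    Z = np.zeros((n, d))      # dual variables (accumulated gradients)
    X = np.zeros((n, d))      # primal iterates
    X_avg = np.zeros((n, d))  # running averages (what the theory bounds)
    for t in range(1, T + 1):
        G = np.array([grads[i](X[i]) for i in range(n)])
        Z = W @ Z + G         # gossip on duals + local gradient step
        X = -(R / np.sqrt(t)) * Z
        nrm = np.linalg.norm(X, axis=1, keepdims=True)
        X = np.where(nrm > R, R * X / np.maximum(nrm, 1e-12), X)
        X_avg += (X - X_avg) / t
    return X_avg
```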
Extending Gossip Algorithms to Distributed Estimation of U-Statistics
Efficient and robust algorithms for decentralized estimation in networks are
essential to many distributed systems. Whereas distributed estimation of sample
mean statistics has been the subject of a good deal of attention, computation
of U-statistics, relying on more expensive averaging over pairs of
observations, is a less investigated area. Yet, such data functionals are
essential to describe global properties of a statistical population, with
important examples including Area Under the Curve, empirical variance, Gini
mean difference and within-cluster point scatter. This paper proposes new
synchronous and asynchronous randomized gossip algorithms which simultaneously
propagate data across the network and maintain local estimates of the
U-statistic of interest. We establish convergence rate bounds of O(1/t) and
O(log t / t) for the synchronous and asynchronous cases respectively, where t
is the number of iterations, with explicit data and network dependent
terms. Beyond favorable comparisons in terms of rate analysis, numerical
experiments provide empirical evidence that the proposed algorithms surpass
the previously introduced approach.
Comment: to be presented at NIPS 2015
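A simplified sketch of the mechanism (data propagation interleaved with local
averaging), under assumptions: a degree-2 kernel h, swaps along one random
edge per iteration, and a synchronous estimate update at every node.

```python
import numpy as np

def gossip_u_statistic(X, edges, h, T=1000, rng=None):
    """Simplified gossip estimation of a degree-2 U-statistic.

    Each node i keeps its own observation X[i], hosts an auxiliary
    observation that travels the network through edge swaps, and updates
    a running estimate z[i] of the mean of h over pairs. As the auxiliary
    observations mix, every z[i] approaches the global U-statistic.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    aux = X.copy()  # traveling observations start at their home node
    z = np.zeros(n)
    for t in range(1, T + 1):
        for i in range(n):  # local averaging at every node
            z[i] += (h(X[i], aux[i]) - z[i]) / t
        i, j = edges[rng.integers(len(edges))]
        aux[i], aux[j] = aux[j], aux[i]  # one random edge swaps its data
    return z
```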
Decentralized Collaborative Learning of Personalized Models over Networks
We consider a set of learning agents in a collaborative peer-to-peer network,
where each agent learns a personalized model according to its own learning
objective. The question addressed in this paper is: how can agents improve
upon their locally trained model by communicating with other agents that have
similar objectives? We introduce and analyze two asynchronous gossip
algorithms running in a fully decentralized manner. Our first approach,
inspired by label propagation, aims to smooth pre-trained local models over
the network while accounting for the confidence that each agent has in its
initial model. In our second approach, agents jointly learn and propagate
their model by making iterative updates based on both their local dataset and
the behavior of their neighbors. Our algorithm to optimize this challenging
objective in a decentralized way is based on ADMM.
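A toy version of the first (label-propagation-style) approach might look as
follows; the quadratic smoothness objective and the Jacobi-style fixed-point
update are assumptions consistent with the abstract, not the paper's exact
algorithm (which is asynchronous and gossip-based).

```python
import numpy as np

def smooth_models(theta_local, W, conf, mu=1.0, T=100):
    """Toy collaborative smoothing of pre-trained local models.

    theta_local: (n, d) array of each agent's locally trained model.
    W: (n, n) nonnegative similarity weights between agents' objectives.
    conf: per-agent confidence in its local model. Iterates a fixed-point
    update for the objective (up to constants absorbed into mu)
        sum_ij W_ij ||th_i - th_j||^2 + mu * sum_i conf_i ||th_i - th_i^loc||^2
    so each model is pulled toward its neighbors and its own initial value.
    """
    theta = theta_local.copy()
    for _ in range(T):
        for i in range(len(theta)):
            num = W[i] @ theta + mu * conf[i] * theta_local[i]
            den = W[i].sum() + mu * conf[i]
            theta[i] = num / den
    return theta
```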
A Probabilistic Theory of Supervised Similarity Learning for Pointwise ROC Curve Optimization
The performance of many machine learning techniques depends on the choice of
an appropriate similarity or distance measure on the input space. Similarity
learning (or metric learning) aims at building such a measure from training
data so that observations with the same (resp. different) label are as close
(resp. far) as possible. In this paper, similarity learning is investigated
from the perspective of pairwise bipartite ranking, where the goal is to rank
the elements of a database by decreasing order of the probability that they
share the same label with some query data point, based on the similarity
scores. A natural performance criterion in this setting is pointwise ROC
optimization: maximize the true positive rate under a fixed false positive
rate. We study this novel perspective on similarity learning through a rigorous
probabilistic framework. The empirical version of the problem gives rise to a
constrained optimization formulation involving U-statistics, for which we
derive universal learning rates as well as faster rates under a noise
assumption on the data distribution. We also address the large-scale setting by
analyzing the effect of sampling-based approximations. Our theoretical results
are supported by illustrative numerical experiments.
Comment: 8 pages main paper, 22 pages with appendices, proceedings of ICML
2018
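As a concrete illustration of the pointwise ROC criterion on pairs (a sketch
under assumed conventions, not the paper's method): enumerate same-label and
different-label pairs, pick the threshold as the empirical (1 - alpha)
quantile of the negative scores, and read off the true positive rate.

```python
import numpy as np
from itertools import combinations

def pair_scores(X, y, sim):
    """Enumerate the O(n^2) pairs behind the U-statistic formulation."""
    pos, neg = [], []
    for i, j in combinations(range(len(y)), 2):
        (pos if y[i] == y[j] else neg).append(sim(X[i], X[j]))
    return np.array(pos), np.array(neg)

def tpr_at_fpr(scores_pos, scores_neg, alpha):
    """Empirical pointwise ROC value: TPR at false positive rate <= alpha.

    The threshold is the empirical (1 - alpha) quantile of the negative
    (different-label) scores; the TPR is the fraction of positive
    (same-label) pairs scored above it.
    """
    thresh = np.quantile(scores_neg, 1.0 - alpha)
    return float(np.mean(scores_pos > thresh)), float(thresh)
```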
Distributed Differentially Private Averaging with Improved Utility and Robustness to Malicious Parties
Learning from data owned by several parties, as in federated learning, raises
challenges regarding the privacy guarantees provided to participants and the
correctness of the computation in the presence of malicious parties. We tackle
these challenges in the context of distributed averaging, an essential building
block of distributed and federated learning. Our first contribution is a novel
distributed differentially private protocol which naturally scales with the
number of parties. The key idea underlying our protocol is to exchange
correlated Gaussian noise along the edges of a network graph, complemented by
independent noise added by each party. We analyze the differential privacy
guarantees of our protocol and the impact of the graph topology, showing that
we can match the accuracy of the trusted curator model even when each party
communicates with only a logarithmic number of other parties chosen at random.
This is in contrast with protocols in the local model of privacy (with lower
accuracy) or based on secure aggregation (where all pairs of users need to
exchange messages). Our second contribution is to enable users to prove the
correctness of their computations without compromising the efficiency and
privacy guarantees of the protocol. Our construction relies on standard
cryptographic primitives like commitment schemes and zero-knowledge proofs.
Comment: 39 pages
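The key noise-cancellation idea can be sketched in a few lines (a toy
centralized simulation with illustrative noise scales; the actual protocol is
fully distributed and includes the verification layer described above).

```python
import numpy as np

def private_averaging(values, edges, sigma_pair, sigma_indep, rng=None):
    """Toy averaging with pairwise-canceling correlated Gaussian noise.

    Each edge (u, v) draws one Gaussian delta; u adds +delta and v adds
    -delta, so these terms cancel exactly in the global sum. Each party
    also adds a small independent Gaussian noise, which is what remains
    in the average and provides the differential privacy guarantee.
    """
    rng = np.random.default_rng(rng)
    noisy = np.array(values, dtype=float)
    for (u, v) in edges:
        delta = rng.normal(0.0, sigma_pair)
        noisy[u] += delta
        noisy[v] -= delta
    noisy += rng.normal(0.0, sigma_indep, size=len(noisy))
    return noisy.mean()  # pairwise terms cancel; only independent noise remains
```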
Learning Fair Scoring Functions: Bipartite Ranking under ROC-based Fairness Constraints
Many applications of AI involve scoring individuals using a learned function
of their attributes. These predictive risk scores are then used to take
decisions based on whether the score exceeds a certain threshold, which may
vary depending on the context. The level of delegation granted to such systems
in critical applications like credit lending and medical diagnosis will heavily
depend on how questions of fairness can be answered. In this paper, we study
fairness for the problem of learning scoring functions from binary labeled
data, a classic learning task known as bipartite ranking. We argue that the
functional nature of the ROC curve, the gold standard measure of ranking
accuracy in this context, leads to several ways of formulating fairness
constraints. We introduce general families of fairness definitions based on the
AUC and on ROC curves, and show that our ROC-based constraints can be
instantiated such that classifiers obtained by thresholding the scoring
function satisfy classification fairness for a desired range of thresholds. We
establish generalization bounds for scoring functions learned under such
constraints, design practical learning algorithms and show the relevance of
our approach with numerical experiments on real and synthetic data.
Comment: 35 pages, 13 figures, 6 tables
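One way to instantiate an AUC-based fairness measure of the kind introduced
here (an illustrative sketch; the paper defines general families of such
constraints) is to compare ranking accuracy within each sensitive group.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Empirical AUC: fraction of positive-negative pairs ranked correctly."""
    pos = np.asarray(scores_pos)[:, None]
    neg = np.asarray(scores_neg)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def intra_group_auc_gap(scores, labels, groups):
    """One possible AUC-based fairness measure: |AUC_group0 - AUC_group1|,
    each AUC computed over positive/negative pairs within the same group.
    All inputs are 1-D numpy arrays; groups takes values in {0, 1}."""
    per_group = []
    for g in (0, 1):
        m = groups == g
        per_group.append(auc(scores[m & (labels == 1)],
                             scores[m & (labels == 0)]))
    return abs(per_group[0] - per_group[1])
```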