22 research outputs found
Mass Volume Curves and Anomaly Ranking
This paper aims at formulating the issue of ranking multivariate unlabeled
observations depending on their degree of abnormality as an unsupervised
statistical learning task. In the 1-d situation, this problem is usually
tackled by means of tail estimation techniques: univariate observations are
viewed as all the more `abnormal' as they are located far in the tail(s) of the
underlying probability distribution. It would also be desirable to have
a scalar-valued `scoring' function for comparing the degree of
abnormality of multivariate observations. Here we formulate the issue of
scoring anomalies as an M-estimation problem by means of a novel functional
performance criterion, referred to as the Mass Volume curve (MV curve in
short), whose optimal elements are strictly increasing transforms of the
density almost everywhere on the support of the density. We first study the
statistical estimation of the MV curve of a given scoring function and we
provide a strategy to build confidence regions using a smoothed bootstrap
approach. Optimization of this functional criterion over the set of piecewise
constant scoring functions is next tackled. This boils down to estimating a
sequence of empirical minimum volume sets whose levels are chosen adaptively
from the data, so as to adjust to the variations of the optimal MV curve, while
controlling the bias of its approximation by a stepwise curve. Generalization
bounds are then established for the difference in sup norm between the MV curve
of the empirical scoring function thus obtained and the optimal MV curve.
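To make the MV curve criterion concrete, here is a minimal sketch of its empirical counterpart for a given scoring function. It assumes that high scores mean "normal" (consistent with optimal scoring functions being increasing transforms of the density) and approximates volumes by Monte Carlo sampling over the bounding box of the data. The function name `empirical_mv_curve` and its parameters are hypothetical; this is not the estimator studied in the paper.

```python
import math
import random


def empirical_mv_curve(score, sample, alphas, n_mc=20000, seed=0):
    """Estimate MV(alpha) = volume of {x : score(x) >= t_alpha} for each alpha.

    t_alpha is an empirical threshold such that a fraction of about alpha of
    the sample has score >= t_alpha. Volumes are approximated by Monte Carlo
    over the bounding box of the sample (an assumption of this sketch).
    """
    rng = random.Random(seed)
    dim = len(sample[0])
    lows = [min(x[j] for x in sample) for j in range(dim)]
    highs = [max(x[j] for x in sample) for j in range(dim)]
    box_volume = 1.0
    for lo, hi in zip(lows, highs):
        box_volume *= hi - lo
    # Scores of points drawn uniformly in the bounding box.
    uniform_scores = [
        score([rng.uniform(lo, hi) for lo, hi in zip(lows, highs)])
        for _ in range(n_mc)
    ]
    sorted_scores = sorted((score(x) for x in sample), reverse=True)
    curve = []
    for alpha in alphas:
        # Threshold: score of the ceil(alpha * n)-th highest-scored point.
        k = max(1, min(len(sample), math.ceil(alpha * len(sample))))
        threshold = sorted_scores[k - 1]
        inside = sum(1 for s in uniform_scores if s >= threshold)
        curve.append(box_volume * inside / n_mc)
    return curve
```

A lower curve indicates a better scoring function: for the same mass alpha, the corresponding level set occupies less volume.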
Ranking Median Regression: Learning to Order through Local Consensus
This article is devoted to the problem of predicting the value taken by a
random permutation $\Sigma$, describing the preferences of an individual over a
set of numbered items $\{1, \ldots, n\}$ say, based on the observation of
an input/explanatory r.v. $X$ (e.g. characteristics of the individual), when
error is measured by the Kendall $\tau$ distance. In the probabilistic
formulation of the 'Learning to Order' problem we propose, which extends the
framework for statistical Kemeny ranking aggregation developed in
\citet{CKS17}, this boils down to recovering conditional Kemeny medians of
$\Sigma$ given $X$ from i.i.d. training examples $(X_1, \Sigma_1), \ldots, (X_N, \Sigma_N)$. For this reason, this statistical learning problem is
referred to as \textit{ranking median regression} here. Our contribution is
twofold. We first propose a probabilistic theory of ranking median regression:
the set of optimal elements is characterized, the performance of empirical risk
minimizers is investigated in this context and situations where fast learning
rates can be achieved are also exhibited. Next we introduce the concept of
local consensus/median, in order to derive efficient methods for ranking median
regression. The major advantage of this local learning approach lies in its
close connection with the widely studied Kemeny aggregation problem. From an
algorithmic perspective, this makes it possible to build predictive rules for ranking
median regression by implementing efficient techniques for (approximate) Kemeny
median computations at a local level in a tractable manner. In particular,
versions of $k$-nearest neighbor and tree-based methods, tailored to ranking
median regression, are investigated. Accuracy of piecewise constant ranking
median regression rules is studied under a specific smoothness assumption for
$\Sigma$'s conditional distribution given $X$.
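The local consensus idea can be illustrated with a small sketch: compute the Kendall tau distance between rankings, find a Kemeny median by exhaustive search, and predict with the median of the rankings attached to the nearest neighbours of the input. The function names and the brute-force search are assumptions made for illustration (exhaustive search is only feasible for a handful of items); the paper relies on efficient, possibly approximate, Kemeny median computations instead.

```python
from itertools import combinations, permutations


def kendall_tau_distance(sigma, pi):
    """Number of item pairs ordered differently by sigma and pi.

    sigma[i] is the rank assigned to item i.
    """
    return sum(
        1
        for i, j in combinations(range(len(sigma)), 2)
        if (sigma[i] - sigma[j]) * (pi[i] - pi[j]) < 0
    )


def kemeny_median(rankings):
    """Brute-force Kemeny median: minimizes the summed Kendall tau distance."""
    n_items = len(rankings[0])
    return min(
        permutations(range(n_items)),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )


def knn_ranking_median(x, training, k):
    """Predict a ranking for x as the Kemeny median of its k nearest examples.

    training is a list of (feature_vector, ranking) pairs; neighbours are
    chosen by squared Euclidean distance (a choice made for this sketch).
    """
    neighbours = sorted(
        training,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(x, ex[0])),
    )[:k]
    return kemeny_median([ranking for _, ranking in neighbours])
```

The resulting rule is piecewise constant in the input: all points sharing the same set of neighbours receive the same predicted ranking.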
Regular Variation in Hilbert Spaces and Principal Component Analysis for Functional Extremes
Motivated by the increasing availability of data of functional nature, we
develop a general probabilistic and statistical framework for extremes of
regularly varying random elements in $L^2[0,1]$. We place ourselves in a
Peaks-Over-Threshold framework where a functional extreme is defined as an
observation whose $L^2$-norm is comparatively large. Our goal is to
propose a dimension reduction framework resulting in finite dimensional
projections for such extreme observations. Our contribution is twofold. First,
we investigate the notion of Regular Variation for random quantities valued in
a general separable Hilbert space, for which we propose a novel concrete
characterization involving solely stochastic convergence of real-valued random
variables. Second, we propose a notion of functional Principal Component
Analysis (PCA) accounting for the principal `directions' of functional
extremes. We investigate the statistical properties of the empirical covariance
operator of the angular component of extreme functions, by upper-bounding the
Hilbert-Schmidt norm of the estimation error for finite sample sizes. Numerical
experiments with simulated and real data illustrate this work.
Comment: 29 pages (main paper), 5 pages (appendix)
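As a rough illustration of the pipeline described above, the sketch below selects the curves with the largest norm, rescales them to unit norm (their angular components) and extracts a leading principal direction by power iteration. It assumes curves sampled on a regular grid of [0, 1], a Riemann approximation of the norm, and an uncentered second-moment operator; these choices and the function names are assumptions of this sketch, not necessarily those of the paper.

```python
import math


def l2_norm(curve):
    """Riemann approximation of the L^2[0,1] norm on a regular grid."""
    return math.sqrt(sum(v * v for v in curve) / len(curve))


def extreme_angular_pca(curves, k, n_iter=200):
    """Leading principal direction of the angular parts of the k largest curves."""
    # Peaks-over-threshold step: keep the k curves with the largest norm.
    extremes = sorted(curves, key=l2_norm, reverse=True)[:k]
    # Angular component: each extreme curve rescaled to unit norm.
    angles = [[v / l2_norm(c) for v in c] for c in extremes]
    m = len(angles[0])
    # Power iteration on C u = (1/k) * sum_i <theta_i, u> theta_i,
    # applied without forming the m x m matrix explicitly.
    u = [1.0 + i / m for i in range(m)]
    for _ in range(n_iter):
        new_u = [0.0] * m
        for theta in angles:
            inner = sum(a * b for a, b in zip(theta, u)) / m
            for j in range(m):
                new_u[j] += inner * theta[j] / k
        norm = l2_norm(new_u)
        u = [v / norm for v in new_u]
    return u
```

Projecting the extreme curves onto a few such directions yields the finite dimensional summaries that the dimension reduction aims at.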
On Medians of (Randomized) Pairwise Means
Tournament procedures, recently introduced in Lugosi & Mendelson (2016),
offer an appealing alternative, from a theoretical perspective at least, to the
principle of Empirical Risk Minimization in machine learning. Statistical
learning by Median-of-Means (MoM) basically consists in segmenting the training
data into blocks of equal size and comparing the statistical performance of
every pair of candidate decision rules on each data block: that with highest
performance on the majority of the blocks is declared as the winner. In the
context of nonparametric regression, functions having won all their duels have
been shown to outperform empirical risk minimizers w.r.t. the mean squared
error under minimal assumptions, while exhibiting robustness properties. It is
the purpose of this paper to extend this approach in order to address other
learning problems, in particular for which the performance criterion takes the
form of an expectation over pairs of observations rather than over one single
observation, as may be the case in pairwise ranking, clustering or metric
learning. Precisely, it is proved here that the bounds achieved by MoM are
essentially conserved when the blocks are built by means of independent
sampling without replacement schemes instead of a simple segmentation. These
results are next extended to situations where the risk is related to a pairwise
loss function and its empirical counterpart is of the form of a $U$-statistic.
Beyond theoretical results guaranteeing the performance of the
learning/estimation methods proposed, some numerical experiments provide
empirical evidence of their relevance in practice.
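The three block-building schemes mentioned above can be sketched for plain mean estimation and for a pairwise kernel. This is a minimal illustration under stated assumptions: the function names, the equal block sizes and the handling of leftover points (ignored) are choices of this sketch, not the paper's exact procedures.

```python
import random
import statistics
from itertools import combinations


def median_of_means(values, n_blocks):
    """Median of the means of n_blocks disjoint, equal-size blocks."""
    size = len(values) // n_blocks
    means = [
        sum(values[b * size:(b + 1) * size]) / size for b in range(n_blocks)
    ]
    return statistics.median(means)


def median_of_randomized_means(values, n_blocks, block_size, seed=0):
    """Blocks drawn independently, each by sampling without replacement."""
    rng = random.Random(seed)
    means = [
        sum(rng.sample(values, block_size)) / block_size
        for _ in range(n_blocks)
    ]
    return statistics.median(means)


def median_of_u_statistics(values, n_blocks, kernel):
    """Median of block-wise U-statistics for a symmetric pairwise kernel."""
    size = len(values) // n_blocks
    block_stats = []
    for b in range(n_blocks):
        block = values[b * size:(b + 1) * size]
        pairs = list(combinations(block, 2))
        block_stats.append(sum(kernel(x, y) for x, y in pairs) / len(pairs))
    return statistics.median(block_stats)
```

With the kernel `lambda x, y: (x - y) ** 2 / 2`, each block-wise U-statistic is the unbiased variance estimate of that block, so the last function returns a median of block variances.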
Learning Fair Scoring Functions: Bipartite Ranking under ROC-based Fairness Constraints
Many applications of AI involve scoring individuals using a learned function
of their attributes. These predictive risk scores are then used to take
decisions based on whether the score exceeds a certain threshold, which may
vary depending on the context. The level of delegation granted to such systems
in critical applications like credit lending and medical diagnosis will heavily
depend on how questions of fairness can be answered. In this paper, we study
fairness for the problem of learning scoring functions from binary labeled
data, a classic learning task known as bipartite ranking. We argue that the
functional nature of the ROC curve, the gold standard measure of ranking
accuracy in this context, leads to several ways of formulating fairness
constraints. We introduce general families of fairness definitions based on the
AUC and on ROC curves, and show that our ROC-based constraints can be
instantiated such that classifiers obtained by thresholding the scoring
function satisfy classification fairness for a desired range of thresholds. We
establish generalization bounds for scoring functions learned under such
constraints, design practical learning algorithms and show the relevance of our
approach with numerical experiments on real and synthetic data.
Comment: 35 pages, 13 figures, 6 tables
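To illustrate the two families of constraints, the sketch below computes one possible AUC-based quantity (the gap between within-group AUCs) and one pointwise ROC-type quantity (the gap in false positive rates across groups at a given threshold). The function names and the specific quantities are assumptions chosen for illustration; the paper defines general families of such constraints, of which these are only simple instances.

```python
def auc(neg_scores, pos_scores):
    """Probability that a positive outranks a negative (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def intra_group_auc_gap(scores, labels, groups):
    """Largest difference between within-group AUCs, plus the AUC per group."""
    per_group = {}
    for g in set(groups):
        pos = [s for s, y, z in zip(scores, labels, groups) if z == g and y == 1]
        neg = [s for s, y, z in zip(scores, labels, groups) if z == g and y == 0]
        per_group[g] = auc(neg, pos)
    values = list(per_group.values())
    return max(values) - min(values), per_group


def fpr_gap_at_threshold(scores, labels, groups, threshold):
    """Largest difference in false positive rates across groups at a threshold."""
    rates = []
    for g in set(groups):
        neg = [s for s, y, z in zip(scores, labels, groups) if z == g and y == 0]
        rates.append(sum(1 for s in neg if s > threshold) / len(neg))
    return max(rates) - min(rates)
```

Requiring the second quantity to be small over a range of thresholds, rather than at a single one, reflects the functional nature of the ROC-based constraints.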