689 research outputs found

On Tail Index Estimation based on Multivariate Data

Author: Clémençon Stéphan
Dematteo Antoine
Publication date: 09/04/2014

This article studies tail index estimation based on i.i.d. multivariate observations drawn from a standard heavy-tailed distribution, i.e. one whose one-dimensional Pareto-like marginals share the same tail index. A multivariate Central Limit Theorem for a random vector whose components correspond to (possibly dependent) Hill estimators of the common shape index alpha is established under mild conditions. Motivated in particular by the statistical analysis of extremal spatial data, we introduce the concept of a (standard) heavy-tailed random field of tail index alpha and show how this limit result can be used to build an estimator of alpha with small asymptotic mean squared error, through a proper convex linear combination of the coordinates. Beyond asymptotic results, simulation experiments illustrating the relevance of the proposed approach are also presented.
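As a rough illustration of the ingredients named in this abstract, the sketch below computes a Hill estimate of alpha for each coordinate and averages the estimates with convex weights. The function names `hill_estimator` and `combined_hill_estimate` are hypothetical, and the uniform default weights are only a placeholder assumption: the abstract indicates that the weights should be derived from the limiting covariance structure, which is not reproduced here.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimate of the tail index alpha from the k largest observations."""
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    ordered = sorted(sample, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the threshold order statistic must be positive")
    # Mean log-excess of the k largest points over the threshold.
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    if mean_log_excess == 0:
        raise ValueError("the k largest observations all equal the threshold")
    return 1.0 / mean_log_excess


def combined_hill_estimate(columns, k, weights=None):
    """Convex combination of per-coordinate Hill estimates.

    `columns` holds one list of observations per coordinate. Uniform weights
    are an illustrative assumption, not the weights studied in the paper.
    """
    estimates = [hill_estimator(column, k) for column in columns]
    if weights is None:
        weights = [1.0 / len(estimates)] * len(estimates)
    if len(weights) != len(estimates):
        raise ValueError("one weight per coordinate is required")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    return sum(w * e for w, e in zip(weights, estimates))


if __name__ == "__main__":
    random.seed(0)
    # Two independent Pareto coordinates sharing the tail index alpha = 2.
    columns = [[random.paretovariate(2.0) for _ in range(5000)] for _ in range(2)]
    print(combined_hill_estimate(columns, k=500))
```

The choice of `k` (how many upper order statistics to use) drives the usual bias-variance trade-off of the Hill estimator and is left to the user in this sketch.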

arXiv.org e-Print Archive

Mass Volume Curves and Anomaly Ranking

Author: Clémençon Stéphan
Thomas Albert
Publication date: 01/01/2018

This paper formulates the problem of ranking multivariate unlabeled observations by their degree of abnormality as an unsupervised statistical learning task. In the one-dimensional situation, this problem is usually tackled by means of tail estimation techniques: univariate observations are viewed as all the more 'abnormal' as they are located far in the tail(s) of the underlying probability distribution. It would likewise be desirable to have a scalar-valued 'scoring' function that allows the degree of abnormality of multivariate observations to be compared. Here we formulate the issue of scoring anomalies as an M-estimation problem by means of a novel functional performance criterion, referred to as the Mass Volume curve (MV curve for short), whose optimal elements are strictly increasing transforms of the density almost everywhere on the support of the density. We first study the statistical estimation of the MV curve of a given scoring function and provide a strategy to build confidence regions using a smoothed bootstrap approach. Optimization of this functional criterion over the set of piecewise constant scoring functions is tackled next. This boils down to estimating a sequence of empirical minimum volume sets whose levels are chosen adaptively from the data, so as to adjust to the variations of the optimal MV curve while controlling the bias of its approximation by a stepwise curve. Generalization bounds are then established for the difference in sup norm between the MV curve of the empirical scoring function thus obtained and the optimal MV curve.

arXiv.org e-Print Archive

Learning Reputation in an Authorship Network

Author: Clémençon Stéphan
Dhanjal Charanpal
Publication date: 25/11/2013

The problem of searching for experts in a given academic field is hugely important in both industry and academia. We study exactly this issue with respect to a database of authors and their publications. The idea is to use Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) to perform topic modelling in order to find authors who have worked in a query field. We then construct a coauthorship graph and motivate the use of influence maximisation and a variety of graph centrality measures to obtain a ranked list of experts. The ranked lists are further improved using a Markov Chain-based rank aggregation approach. The complete method is readily scalable to large datasets. To demonstrate the efficacy of the approach, we report on an extensive set of computational simulations using the Arnetminer dataset. An improvement in mean average precision is demonstrated over the baseline case of simply using the order of authors found by the topic models.

arXiv.org e-Print Archive

Crossref

Functional Bipartite Ranking: a Wavelet-Based Filtering Approach

Author: Clémençon Stéphan
Depecker Marine
Publication date: 01/12/2013

The main goal of this article is to address the bipartite ranking issue from the perspective of functional data analysis (FDA). Given a training set of independent realizations of a (possibly sampled) second-order random function with a (locally) smooth autocorrelation structure and to which a binary label is randomly assigned, the objective is to learn a scoring function s with optimal ROC curve. Based on linear/nonlinear wavelet-based approximations, it is shown how to select compact finite-dimensional representations of the input curves adaptively, in order to build accurate ranking rules, using recent advances in the ranking problem for multivariate data with binary feedback. Beyond theoretical considerations, the performance of the learning methods for functional bipartite ranking proposed in this paper is illustrated by numerical experiments.

arXiv.org e-Print Archive

HAL-CEA

Ranking the best instances

Author: Clémençon Stéphan
Vayatis Nicolas
Publication date: 01/01/2007

We formulate the local ranking problem in the framework of bipartite ranking where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances, which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC/AROC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where the best instances would first be tentatively identified and a standard AUC criterion would then be applied. Finally, we state preliminary statistical results for the local ranking problem. Comment: 29 pages

arXiv.org e-Print Archive

Hal-Diderot

Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics

Author: Bellet Aurélien
Clémençon Stéphan
Colin Igor
Publication date: 01/01/2016

In a wide range of statistical learning problems such as ranking, clustering or metric learning among others, the risk is accurately estimated by $U$-statistics of degree $d \geq 1$, i.e. functionals of the training data with low variance that take the form of averages over $k$-tuples. From a computational perspective, the calculation of such statistics is highly expensive even for a moderate sample size $n$, as it requires averaging $O(n^d)$ terms. This makes learning procedures relying on the optimization of such data functionals hardly feasible in practice. It is the major goal of this paper to show that, strikingly, such empirical risks can be replaced by drastically computationally simpler Monte-Carlo estimates based on $O(n)$ terms only, usually referred to as incomplete $U$-statistics, without damaging the $O_{\mathbb{P}}(1/\sqrt{n})$ learning rate of Empirical Risk Minimization (ERM) procedures. For this purpose, we establish uniform deviation results describing the error made when approximating a $U$-process by its incomplete version under appropriate complexity assumptions. Extensions to model selection, fast rate situations and various sampling techniques are also considered, as well as an application to stochastic gradient descent for ERM. Finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques.

Comment: To appear in Journal of Machine Learning Research. 34 pages. v2: minor correction to Theorem 4 and its proof, added 1 reference. v3: typo corrected in Proposition 3. v4: improved presentation, added experiments on model selection for clustering, fixed minor typo
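To make the contrast between complete and incomplete U-statistics concrete, here is a minimal sketch for degree 2, assuming pairs are drawn uniformly with replacement (only one of the sampling schemes the abstract alludes to). The function names and the variance kernel are illustrative choices, not taken from the paper.

```python
import itertools
import random


def complete_u_statistic(data, kernel):
    """Average a symmetric kernel over all n(n-1)/2 unordered pairs."""
    total, count = 0.0, 0
    for x, y in itertools.combinations(data, 2):
        total += kernel(x, y)
        count += 1
    if count == 0:
        raise ValueError("at least two observations are required")
    return total / count


def incomplete_u_statistic(data, kernel, num_pairs, rng):
    """Monte-Carlo version: average the kernel over num_pairs random pairs.

    Pairs are drawn with replacement; within a pair the two indices differ.
    """
    n = len(data)
    if n < 2 or num_pairs < 1:
        raise ValueError("need at least two observations and one pair")
    total = 0.0
    for _ in range(num_pairs):
        i, j = rng.sample(range(n), 2)  # two distinct indices
        total += kernel(data[i], data[j])
    return total / num_pairs


def variance_kernel(x, y):
    """Degree-2 kernel whose complete U-statistic is the sample variance."""
    return 0.5 * (x - y) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    # The complete statistic uses about n^2 / 2 kernel evaluations,
    # the incomplete one only n.
    print(complete_u_statistic(data, variance_kernel))
    print(incomplete_u_statistic(data, variance_kernel, num_pairs=len(data), rng=rng))
```

With `num_pairs` of order $n$, the incomplete version evaluates the kernel $O(n)$ times instead of $O(n^2)$, which is the computational saving the abstract describes.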

arXiv.org e-Print Archive

HAL - Lille 3

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot