77 research outputs found
High-dimensional clustering
High-dimensional (HD) data sets are now common, driven largely by technological advances: automated variable acquisition, cheaper data storage, and more powerful standard computers for fast data management. Every field is affected by this general inflation in the number of variables; only the definition of "high" is domain dependent. In marketing this number can be of order 10^2, in microarray gene expression between 10^2 and 10^4, in text mining 10^3 or more, of order 10^6 for single nucleotide polymorphism (SNP) data, etc. Note also that many more variables can sometimes be involved, typically with discretized curves, for instance curves coming from temporal sequences. Such a technological revolution has a huge impact on other scientific fields, societal as well as mathematical. In particular, high-dimensional data management raises new challenges for statisticians, since standard (low-dimensional) data analysis methods cannot be applied directly to the new (high-dimensional) data sets. The reason is twofold, and the two aspects are sometimes linked: combinatorial difficulties, and a disastrous increase in estimate variance. Data analysis methods are essential for providing a synthetic view of data sets, allowing data summary and data exploration, for instance for future decision making. This need is even more acute in the high-dimensional setting: on the one hand, the large number of variables suggests that a lot of information is conveyed by the data; on the other hand, this information may be hidden behind its volume.
A sparse variable selection procedure in model-based clustering
Given the growing number of high-dimensional data sets, variable selection for clustering is an important challenge. In the framework of Gaussian mixture model-based clustering, we recast the variable selection problem as a general model selection problem. First, our procedure builds a sub-collection of models using an l1-regularization method. Then, the maximum likelihood estimator is computed via an EM algorithm for each model. Finally, a non-asymptotic penalized criterion is proposed to select both the number of mixture components and the set of variables relevant for clustering. From a theoretical point of view, a general model selection theorem for maximum likelihood estimation with a random collection of models is established. In particular, it justifies the form of the penalty of our criterion, which depends on the complexity of the model collection. In practice, this criterion is calibrated using the so-called slope heuristics method. The procedure is illustrated on two simulated data sets. Finally, an extension involving a more general modeling of the variables that are non-informative for clustering is proposed.
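A rough sketch of the three-step procedure above on simulated data (all names are ours; the variable-ranking step below is a simple per-variable stand-in for the paper's l1-regularization step, and the penalty constant is set BIC-like instead of being calibrated by slope heuristics):

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Simulated data: variables 0-1 carry a two-cluster structure,
# variables 2-4 are pure noise.
n = 300
z = rng.integers(0, 2, n)
X = np.hstack([
    rng.normal(loc=3.0 * z[:, None], scale=1.0, size=(n, 2)),  # informative
    rng.normal(size=(n, 3)),                                   # noise
])

def criterion(X, subset, k, kappa):
    """Penalized log-likelihood of the model 'GMM with k components on
    the variables in subset, independent Gaussians on the others'."""
    S = list(subset)
    Sc = [j for j in range(X.shape[1]) if j not in S]
    gm = GaussianMixture(n_components=k, random_state=0).fit(X[:, S])
    loglik = gm.score(X[:, S]) * n  # score() is the mean per-sample loglik
    for j in Sc:  # non-informative variables: a single fitted Gaussian
        loglik += norm.logpdf(X[:, j], X[:, j].mean(), X[:, j].std()).sum()
    d = len(S)
    n_params = (k - 1) + k * d + k * d * (d + 1) // 2 + 2 * len(Sc)
    return loglik - kappa * n_params

# Step 1 (stand-in for the l1-regularization ranking): score each variable
# by the per-variable gain of a 2-component fit over a single Gaussian.
gains = []
for j in range(X.shape[1]):
    xj = X[:, [j]]
    g2 = GaussianMixture(2, random_state=0).fit(xj).score(xj)
    g1 = GaussianMixture(1, random_state=0).fit(xj).score(xj)
    gains.append(g2 - g1)
order = np.argsort(gains)[::-1]

# Steps 2-3: EM fit on each nested subset and each k, penalized selection.
kappa = np.log(n) / 2  # BIC-like constant, an assumption made here
best_subset, best_k = max(
    ((tuple(sorted(order[:m])), k)
     for m in range(1, X.shape[1] + 1) for k in (1, 2, 3)),
    key=lambda sk: criterion(X, sk[0], sk[1], kappa),
)
```

On this example the selected model recovers both the two mixture components and the two informative variables.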
Practical use of the slope heuristics and the CAPUSHE package
Implementing the "data-driven" methods for calibrating penalized criteria derived from the slope heuristics of Birgé and Massart (2007) raises practical difficulties.
Multidimensional two-component Gaussian mixtures detection
Let X_1, ..., X_n be a d-dimensional i.i.d. sample from a distribution with density f. The problem of detection of a two-component mixture is considered. Our aim is to decide whether f is the density of a standard Gaussian random d-vector (f = phi_d) against the alternative that f is a two-component mixture f = (1 - eps) phi_d + eps phi_d(. - mu), where eps and mu are unknown parameters. Optimal separation conditions on eps, mu, n and the dimension d are established, making it possible to separate both hypotheses with prescribed errors. Several testing procedures are proposed and two alternative subsets are considered.
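As an illustration of the detection setting, here is a minimal Monte Carlo-calibrated test for a contamination alternative (a simple stand-in statistic, not one of the paper's exact procedures; the function name and example parameters are ours):

```python
import numpy as np

def max_mean_test(X, alpha=0.05, n_mc=5000, seed=1):
    """Under H0 the rows of X are i.i.d. N(0, I_d); under the alternative a
    fraction eps of rows has mean mu, so each column mean shifts by eps*mu_j.
    Statistic: max_j |sqrt(n) * mean(X[:, j])|; under H0 each scaled column
    mean is exactly N(0, 1), so the critical value is calibrated by Monte
    Carlo on the max of d independent standard Gaussians."""
    n, d = X.shape
    stat = np.max(np.abs(np.sqrt(n) * X.mean(axis=0)))
    rng = np.random.default_rng(seed)
    null_stats = np.max(np.abs(rng.standard_normal((n_mc, d))), axis=1)
    crit = np.quantile(null_stats, 1 - alpha)
    return stat, crit, stat > crit

# Alternative: 20% of the observations are shifted by mu = (3, 0, ..., 0).
rng = np.random.default_rng(0)
n, d, eps = 500, 10, 0.2
mu = np.zeros(d)
mu[0] = 3.0
contaminated = rng.random(n) < eps
X_alt = rng.standard_normal((n, d)) + contaminated[:, None] * mu
stat, crit, reject = max_mean_test(X_alt)
```

With this separation (eps * mu_1 = 0.6 and n = 500) the shifted coordinate dominates and the test rejects H0.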
Slope Heuristics: Overview and Implementation
INRIA Research Report RR-7223, version 1. Model selection is a general paradigm which includes many statistical problems. One of the most fruitful and popular approaches to carrying it out is the minimization of a penalized criterion. Birgé and Massart (2006) have proposed a promising data-driven method to calibrate such criteria whose penalties are known up to a multiplicative factor: the "slope heuristics". Theoretical works validate this heuristic method in some situations, and several papers report promising practical behavior in various frameworks. The purpose of this work is twofold. First, an introduction to the slope heuristics and an overview of the theoretical and practical results about it are presented. Second, we focus on the practical difficulties arising when applying the slope heuristics. A new practical approach is developed and compared to the standard dimension jump method. All the practical solutions discussed in this paper, in their different frameworks, are implemented and brought together in a Matlab graphical user interface called capushe.
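The minimal-penalty idea behind the slope heuristics can be sketched as follows (a toy illustration under our own naming, not the capushe implementation): for the most complex models the negative log-likelihood decreases roughly linearly in the model dimension with slope -kappa_min, and the heuristic selects the model minimizing the criterion penalized by twice that slope.

```python
import numpy as np

def slope_heuristic(dims, neg_loglik, frac=0.5):
    """Estimate the minimal-penalty slope kappa_min by a least-squares fit
    of neg_loglik against dimension on the most complex models, then select
    the model minimizing  neg_loglik + 2 * kappa_min * dim."""
    dims = np.asarray(dims, dtype=float)
    nll = np.asarray(neg_loglik, dtype=float)
    complex_part = dims >= np.quantile(dims, 1 - frac)  # largest models
    kappa = -np.polyfit(dims[complex_part], nll[complex_part], 1)[0]
    crit = nll + 2.0 * kappa * dims
    return int(np.argmin(crit)), kappa

# Synthetic example: real gains up to dimension 5, then pure overfitting
# with slope 1, so kappa_min = 1 and dimension 5 is selected.
dims = np.arange(1, 51)
nll = 500.0 - 50.0 * np.minimum(dims, 5) - 1.0 * np.maximum(dims - 5, 0)
best_idx, kappa = slope_heuristic(dims, nll)
```

The dimension jump method mentioned in the abstract estimates the same constant differently, by locating the abrupt jump in selected dimension as the penalty multiplier varies.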
Selective inference after convex clustering with l1 penalization
Classical inference methods notoriously fail when applied to data-driven test
hypotheses or inference targets. Instead, dedicated methodologies are required
to obtain statistical guarantees for these selective inference problems.
Selective inference is particularly relevant post-clustering, typically when
testing a difference in mean between two clusters. In this paper, we address
convex clustering with l1 penalization, by leveraging related selective
inference tools for regression, based on Gaussian vectors conditioned to
polyhedral sets. In the one-dimensional case, we prove a polyhedral
characterization of obtaining given clusters, which enables us to propose a test
procedure with statistical guarantees. This characterization also allows us to
provide a computationally efficient regularization path algorithm. Then, we
extend the above test procedure and guarantees to multi-dimensional clustering
with l1 penalization, and also to more general multi-dimensional
clusterings that aggregate one-dimensional ones. With various numerical
experiments, we validate our statistical guarantees and we demonstrate the
power of our methods to detect differences in mean between clusters. Our
methods are implemented in the R package poclin.
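The basic ingredient of the polyhedral approach described above is that, conditionally on the selection event, a linear statistic of the Gaussian data is a Gaussian truncated to an interval, and p-values come from that truncated distribution. A minimal sketch of this ingredient (the function name and interface are ours, not the poclin API):

```python
from scipy.stats import norm

def truncated_gaussian_pvalue(t, sigma, a, b):
    """One-sided p-value P(Z >= t | a <= Z <= b) for Z ~ N(0, sigma^2):
    the survival function of a Gaussian truncated to the selection
    interval [a, b], as in polyhedral selective inference."""
    num = norm.cdf(b / sigma) - norm.cdf(t / sigma)
    den = norm.cdf(b / sigma) - norm.cdf(a / sigma)
    return num / den
```

Without truncation (a = -inf, b = +inf) this reduces to the usual Gaussian p-value; conditioning on the selection interval [a, b] is what restores valid type-I error control when the tested clusters were chosen from the same data.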