Search CORE

487 research outputs found

Approaches to working in high-dimensional data spaces: gene expression microarrays

Author: A Dupuy
A Statnikov
AK Jain
B Efron
BJ Frey
C Lai
CF Aliferis
D J Miller
D Miller
DB Allison
DF Ransohoff
EP Xing
GV Trunk
I Guyon
J Novovicova
J Wang
JA Swets
JD Storey
KA Shedden
KY Yeung
L Ein-Dor
MW Graham
R Clarke
RO Duda
S Ramaswamy
T Lange
TR Golub
VN Vapnik
Y Wang
Z Wang
Publication venue: Nature Publishing Group

This review provides a focused summary of the implications of high-dimensional data spaces produced by gene expression microarrays for building better models of cancer diagnosis, prognosis, and therapeutics. We identify the unique challenges posed by high dimensionality to highlight methodological problems and discuss recent methods in predictive classification, unsupervised subclass discovery, and marker identification.

Crossref

PubMed Central

Simultaneous model-based clustering and visualization in the Fisher discriminative subspace

Author: A. Jain
A. Montanari
A. Raftery
C. Biernacki
C. Bishop
C. Bouveyron
C. Fraley
C. Maugis
Camille Brunet
Charles Bouveyron
D. Foley
D. Rubin
D. Scott
D.A. Clausi
E. Anderson
E. Tipping
G. Celeux
G. Golub
G. Kimeldorf
G. McLachlan
G. Schwarz
H. Akaike
I. Jolliffe
J. Baek
J. Friedman
J. Ye
K. Fukunaga
K. Liu
L. Parsons
M. Law
N. Campbell
N. Trendafilov
P. Howland
P. McNicholas
R. Agrawal
R. Bellman
R. Duda
R. Fisher
S. Boutemedjet
T. Alexandrov
T. Hastie
W. Krzanowski
Y. Hamamoto
Y.F. Guo
Z. Jin
Publication venue: Springer Science and Business Media LLC
Publication date: 19/04/2011

Clustering in high-dimensional spaces is a recurrent problem in many scientific domains, but it remains a difficult task in terms of both clustering accuracy and interpretability of the results. This paper presents a discriminative latent mixture (DLM) model which fits the data in a latent orthonormal discriminative subspace with an intrinsic dimension lower than the dimension of the original space. By constraining model parameters within and between groups, a family of 12 parsimonious DLM models is obtained, which can be fitted to a variety of situations. An estimation algorithm, called the Fisher-EM algorithm, is also proposed for estimating both the mixture parameters and the discriminative subspace. Experiments on simulated and real datasets show that the proposed approach performs better than existing clustering methods while providing a useful representation of the clustered data. The method is also applied to the clustering of mass spectrometry data.
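To make the notion of a discriminative subspace more concrete, the sketch below computes the classical two-class Fisher direction in two dimensions for groups whose labels are already known. It is not the Fisher-EM algorithm from the abstract, which also estimates the group memberships; the function name `fisher_direction` and the toy points are assumptions made purely for illustration.

```python
import math

def fisher_direction(class_a, class_b):
    """Return the unit vector w proportional to S_W^{-1} (mean_b - mean_a) for 2-D points."""
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    mean_a, mean_b = mean(class_a), mean(class_b)

    # Accumulate the within-class scatter matrix S_W = [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for points, m in ((class_a, mean_a), (class_b, mean_b)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy

    # Solve S_W w = (mean_b - mean_a) with the closed-form 2x2 inverse.
    det = sxx * syy - sxy * sxy
    gx, gy = mean_b[0] - mean_a[0], mean_b[1] - mean_a[1]
    wx = (syy * gx - sxy * gy) / det
    wy = (-sxy * gx + sxx * gy) / det

    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

# Hypothetical toy data: two groups that differ only along the first axis.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(5, 0), (5, 1), (6, 0), (6, 1)]
direction = fisher_direction(group_a, group_b)
```

Projecting the points onto `direction` gives a one-dimensional view in which the two groups are separated as well as a linear projection allows under the Fisher criterion.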

arXiv.org e-Print Archive

HAL Evry

Crossref

HAL-Paris1

PARAMETRIC LINK MODELS FOR KNOWLEDGE TRANSFER IN STATISTICAL LEARNING

Author: Beninel Farid
Biernacki Christophe
Bouveyron Charles
Jacques Julien
Lourme Alexandre
Publication venue: Nova Publishers
Publication date: 01/01/2012

When a statistical model is designed for prediction, a major assumption is that the modeled phenomenon does not evolve between the training and prediction stages. Training and future data must therefore lie in the same feature space and follow the same distribution. Unfortunately, this assumption often turns out to be false in real-world applications. For instance, biological considerations may require classifying individuals from a given species when only individuals from another species are available for training. In regression, one sometimes wants to apply a predictive model to data whose distribution is not exactly the same as that of the training data used to estimate the model. This chapter presents techniques for transferring a statistical model estimated from a source population to a target population. Three tasks of statistical learning are considered: probabilistic classification (parametric and semi-parametric), linear regression (including mixtures of regressions) and model-based clustering (Gaussian and Student). In each situation, the knowledge transfer is carried out by introducing parametric links between the two populations. Such transfer techniques can improve learning performance while avoiding much of the expensive data labeling effort.
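As a rough illustration of what a parametric link between populations could look like in the regression case, the sketch below keeps a source model fixed and estimates a single scaling factor from a handful of target observations by least squares. The one-parameter form of the link, the function name `fit_scalar_link`, and the numbers are assumptions chosen for illustration; the chapter itself may use richer link families.

```python
def fit_scalar_link(source_model, target_x, target_y):
    """Least-squares estimate of lam in: target_y ~ lam * source_model(target_x)."""
    preds = [source_model(x) for x in target_x]
    numerator = sum(p * y for p, y in zip(preds, target_y))
    denominator = sum(p * p for p in preds)
    return numerator / denominator

def source_model(x):
    # Hypothetical regression fitted on the source population: y = 1 + 2x.
    return 1.0 + 2.0 * x

# A few labelled points from the (hypothetical) target population.
target_x = [0.0, 1.0, 2.0]
target_y = [3.0, 9.0, 15.0]

lam = fit_scalar_link(source_model, target_x, target_y)

def target_model(x):
    # The transferred model reuses the source fit, rescaled by the estimated link.
    return lam * source_model(x)
```

Only one parameter is estimated on the target side, which is why such a link can work with far fewer labelled target observations than refitting the whole model would need.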

HAL Descartes

A probabilistic approach to emission-line galaxy classification

Author: Beck R.
Costa-Duarte M. V.
Dantas M. L. L.
de Souza R. S.
Feigelson E. D.
Gieseke F.
Killedar M.
Krone-Martins A.
Lablanche P. -Y.
Vilalta R.
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2017

We invoke a Gaussian mixture model (GMM) to jointly analyse two traditional emission-line classification schemes of galaxy ionization sources: the Baldwin-Phillips-Terlevich (BPT) and $\rm W_{H\alpha}$ vs. [NII]/H$\alpha$ (WHAN) diagrams, using spectroscopic data from the Sloan Digital Sky Survey Data Release 7 and SEAGal/STARLIGHT datasets. We apply a GMM to empirically define classes of galaxies in a three-dimensional space spanned by the $\log$ [OIII]/H$\beta$, $\log$ [NII]/H$\alpha$, and $\log$ EW(H$\alpha$) optical parameters. The best-fit GMM based on several statistical criteria suggests a solution around four Gaussian components (GCs), which are capable of explaining up to 97 per cent of the data variance. Using elements of information theory, we compare each GC to its respective astronomical counterpart. GC1 and GC4 are associated with star-forming galaxies, suggesting the need to define a new starburst subgroup. GC2 is associated with BPT's Active Galaxy Nuclei (AGN) class and WHAN's weak AGN class. GC3 is associated with BPT's composite class and WHAN's strong AGN class. Conversely, there is no statistical evidence, based on four GCs, for the existence of a Seyfert/LINER dichotomy in our sample. Nevertheless, the inclusion of an additional GC5 unravels it. GC5 appears to be associated with the LINER and Passive galaxies on the BPT and WHAN diagrams, respectively. Subtleties aside, we demonstrate the potential of our methodology to recover/unravel different objects inside the wilderness of astronomical datasets, without losing the ability to convey physically interpretable results. The probabilistic classifications from the GMM analysis are publicly available within the COINtoolbox (https://cointoolbox.github.io/GMM_Catalogue/). Comment: Accepted for publication in MNRA
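The idea of choosing the number of Gaussian components with a statistical criterion can be sketched as follows. This is a minimal one-dimensional EM fit scored with the Bayesian information criterion (BIC); the abstract does not say which criteria were used, so BIC, the synthetic data, and the helper names `fit_gmm_1d`, `bic` and `select_k` are all assumptions for illustration rather than the authors' pipeline.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(xs, k, iters=100):
    """Fit a k-component 1-D Gaussian mixture by EM and return its log-likelihood."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * k
    weights = [1.0 / k] * k

    def mixture_density(x):
        return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = max(sum(joint), 1e-300)
            resp.append([p / total for p in joint])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, 1e-6)  # floor avoids degenerate components

    return sum(math.log(max(mixture_density(x), 1e-300)) for x in xs)

def bic(xs, k):
    """BIC = -2 log L + p log n, with p = 3k - 1 free parameters; lower is better."""
    return -2.0 * fit_gmm_1d(xs, k) + (3 * k - 1) * math.log(len(xs))

def select_k(xs, candidates):
    return min(candidates, key=lambda k: bic(xs, k))

# Synthetic stand-in data: two well-separated groups (not the survey data).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(10.0, 1.0) for _ in range(100)]
best_k = select_k(sample, (1, 2))
```

The penalty term grows with the number of parameters, so an extra component is retained only when it raises the likelihood enough to pay for itself.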

arXiv.org e-Print Archive

Leiden University Scholary Publications

Radboud Repository

Kernel discriminant analysis and clustering with parsimonious Gaussian process models

Author: Bouveyron Charles
Fauvel Mathieu
Girard Stéphane
Publication venue: Springer Science and Business Media LLC
Publication date: 01/11/2015

This work presents a family of parsimonious Gaussian process models which make it possible to build, from a finite sample, a model-based classifier in an infinite-dimensional space. The proposed parsimonious models are obtained by constraining the eigen-decomposition of the Gaussian processes modeling each class. In particular, this allows the use of non-linear mapping functions which project the observations into infinite-dimensional spaces. It is also demonstrated that the classifier can be built directly from the observation space through a kernel function. The proposed classification method is thus able to classify data of various types, such as categorical data, functional data or networks. Furthermore, it is possible to classify mixed data by combining different kernels. The methodology is also extended to the unsupervised classification case. Experimental results on various data sets demonstrate the effectiveness of the proposed method.
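Combining kernels for mixed data can be sketched as below: a Gaussian (RBF) kernel on the numeric part of each record is added to a simple matching kernel on its categorical part. The choice of these two kernels, the unweighted sum, and the record layout are assumptions for illustration, not the combination used in the paper; the sketch relies only on the standard fact that a sum of valid kernels is again a valid kernel.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel on numeric vectors: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def match_kernel(a, b):
    """Fraction of categorical fields on which two records agree."""
    return sum(1.0 for x, y in zip(a, b) if x == y) / len(a)

def mixed_kernel(rec1, rec2):
    """Sum of a numeric kernel and a categorical kernel on (numeric, categorical) records."""
    numeric1, categorical1 = rec1
    numeric2, categorical2 = rec2
    return rbf_kernel(numeric1, numeric2) + match_kernel(categorical1, categorical2)

# Hypothetical records: (numeric features, categorical features).
first = ([0.0, 0.0], ("a", "b"))
second = ([1.0, 1.0], ("a", "c"))

similarity = mixed_kernel(first, second)
```

Any kernel-based classifier that only needs pairwise similarities can then be applied to such records without first encoding the categorical fields as numbers.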

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot