33 research outputs found

Prédiction phénotypique et sélection de variables en grande dimension dans les modèles linéaires et linéaires mixtes

Author: LAURENT Béatrice
ROHART Florian
SAN CRISTOBAL Magali
Publication date: 01/01/2012

Recent technologies have provided scientists with high-dimensional genomic and post-genomic data, that is, data in which the number of measured variables always exceeds the number of individuals on which they are measured. Such datasets usually require additional assumptions in order to be analyzed, such as a sparsity assumption under which only a small subset of the variables is supposed to be relevant. In this high-dimensional context, we worked on a real dataset from the pig species obtained with high-throughput technologies: metabolomic data measured by NMR spectroscopy and phenotypic data obtained mostly post-mortem. There are two objectives. On the one hand, we aim to obtain good predictions of phenotypes of interest for pig production; on the other hand, we want to make explicit the biological relationships between these phenotypes and the metabolome. Using the Lasso method in a linear model, we show that the metabolome has non-negligible predictive power for some phenotypes that are important for pig production, such as lean meat percentage and average daily feed consumption. The second objective is a variable selection problem. Classical methods such as the Lasso and the FDR procedure are investigated, and new, more powerful methods are developed: we propose a variable selection method for linear models based on multiple hypothesis testing. Non-asymptotic power results are given for this method under certain conditions on the signal. Since additional data are available on the animals, such as the batches in which they were raised and their family relationships, linear mixed models are also considered.
A new algorithm for fixed-effects selection is developed, and it turns out to be much faster than existing algorithms with the same objective. Thanks to its decomposition into distinct steps, the algorithm can be combined with any variable selection method developed for the classical linear model. However, its convergence properties depend on the method used. Combining this algorithm with the multiple testing method gives very good empirical results. All these methods are applied to the real dataset, and biological relationships are highlighted.

TOULOUSE-INSA-Bib. electronique (315559905) / Sudoc, France

OpenGrey Repository

Phenotypic prediction based on metabolomic data : lasso vs Bolasso, primary data vs wavelet transformation

Author: Canlet Cécile
Laurent Béatrice
Milan Denis
Molina J.
Paris Alain
Rohart Florian
San Cristobal Magali
Villa-Vialaneix Nathalie
Publication venue: HAL CCSD
Publication date: 01/08/2010

Understanding the relations between various 'omics data (such as metabolomics or genomics data) and phenotypes of interest is one of the current major challenges in biology. This question can be addressed by learning to predict the phenotype value from joint observations of the 'omics data and the phenotype. In this paper, we focus on the prediction of a phenotype related to meat quality from metabolomic data. As metabolomic data are high dimensional and the number of observations is often limited, model selection methods offer a way both to obtain a relevant solution to the prediction problem and to select the most important metabolomes related to the phenotype under study. In recent years, model selection has attracted growing interest in the statistical community: the first, and probably the best known, selection method was introduced in (Tibshirani, 1996) under the name LASSO. Several variants of this original approach have since been proposed, such as, recently, a bootstrapped LASSO named BOLASSO, introduced in (Bach, 2009). The proposal of this paper is to combine a wavelet representation of the metabolome spectra (see (Mallat, 1999) and (Antonini, 1992) for a complete introduction to wavelets) with the BOLASSO approach. We compare this methodology to more classical methods using either the original spectra as predictors (instead of the wavelet representation) or the original LASSO to select the model. The following section describes the methodology, and the next one details the experiments and results.
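
The bootstrapped LASSO idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual reading of BOLASSO (fit a LASSO on several bootstrap resamples and keep only the variables selected every time); the coordinate-descent solver, the function names, the penalty value and the toy data are all illustrative assumptions, not the authors' implementation, and the wavelet step is not covered.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=50):
    """Lasso by cyclic coordinate descent (hypothetical helper).

    Minimises (1/(2n)) * ||y - X b||^2 + lam * ||b||_1,
    where X is a list of n rows, each a list of p floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta

def bolasso_support(X, y, lam, n_boot=20, seed=0):
    """Intersect the Lasso supports obtained on bootstrap resamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    support = set(range(p))
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        support &= {j for j, b in enumerate(beta) if b != 0.0}
    return sorted(support)

def toy_data(n=60, p=8, seed=1):
    """Simulated data (assumption): only variables 0 and 3 carry signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0.0, 0.3) for row in X]
    return X, y
```

Calling `bolasso_support(*toy_data(), lam=0.2)` returns the indices that survive every resample. Taking the intersection is what makes the selection more conservative than a single LASSO fit.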

Scientific Publications of the University of Toulouse II Le Mirail

HAL-INSA Toulouse

Exploring transcriptomic diversity in muscle revealed that cellular signaling pathways mainly differentiate five Western porcine breeds

Author: A Bonnet
A Chow
A Liaw
A Ouali
A Stinckens
A Whitehead
AJ Amaral
AS Van Laere
B Guo
B Vicoso
Christine Lascor
CJ Rubin
D Bem
D Natt
D Tabas-Madrid
DA Ferrarezi
Denis Milan
DJ Glass
E Huff-Lonergan
E Jimenez-Guri
E Kristiansson
F Rohart
Florian Rohart
G Laval
GR Foxcroft
H Fraser
HC Yang
HJ Megens
HP Patel
JC Politz
JG Laing
K-A Lê Cao
KK Starheim
KS Kobayashi
L Andersson
L Atanasova
L Breiman
L Canario
L Fontanesi
L Müller
L Yang
Laurence Liaubet
Lidwine Trouilh
M Blomberg Jensen
M Mensack
M Oleksiak
M Pérez-Enciso
M SanCristobal
MA Busby
MA Egerman
Magali SanCristobal
Marcel Bouffaud
Marie-José Mercat
MB Hufford
MJ Mercat
N Buys
N Tkachuk
P Clayton
P Herpin
P Khaitovich
P Van Damme
Pascal G.P. Martin
PJ Ferre
S Rifkin
S Thorsteinsdottir
SH Lee
SY Jiang
T Giger
T Le Gall
T van der Lende
Thierry Tribout
Thomas Faraut
V Voillet
W Enard
X Su
Y Benjamini
Y Li
Y Yao
Yannick Lippi
Z Yin
Publication venue: Springer Science and Business Media LLC

Crossref

Multiple hypothesis testing for variable selection

Author: Rohart Florian
Publication venue: Wiley
Publication date: 01/01/2016

We propose two new procedures based on multiple hypothesis testing for correct support estimation in high-dimensional sparse linear models. We conclusively prove that both procedures are powerful and do not require the sample size to be large. The first procedure tackles the atypical setting of ordered variable selection through an extension of a testing procedure previously developed in the context of a linear hypothesis. The second procedure is the main contribution of this paper. It enables data analysts to perform support estimation in the general high-dimensional framework of non-ordered variable selection. A thorough simulation study and applications to real datasets using the R package mht show that our non-ordered variable procedure produces excellent results in terms of correct support estimation as well as in terms of mean square errors and false discovery rate, when compared to common methods such as the Lasso, the SCAD penalty, forward regression or the false discovery rate procedure (FDR).
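
For orientation, the false discovery rate procedure named here as a comparison method is commonly taken to be the Benjamini-Hochberg step-up rule. The sketch below shows that classical rule only. It is not the procedure proposed in the paper and not the interface of the mht package; the function name and the example p-values are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up rule.

    With m hypotheses and sorted p-values p_(1) <= ... <= p_(m), find the
    largest rank k such that p_(k) <= alpha * k / m and reject the k
    hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])

# Hypothetical per-variable p-values: one test per candidate predictor.
example_p_values = [0.039, 0.001, 0.205, 0.008]
selected = benjamini_hochberg(example_p_values, alpha=0.05)
```

In a variable selection setting, each p-value would come from a test of whether one predictor's coefficient is zero, and the returned indices form the selected support.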

HAL-INSA Toulouse

University of Queensland eSpace

Phenotypic prediction and variable selection in high dimensional linear and linear mixed models

Author: Rohart Florian
Publication date: 07/12/2012

Recent technologies have provided scientists with high-dimensional genomic and post-genomic data, that is, data in which the number of measured variables always exceeds the number of individuals on which they are measured. Such datasets usually require additional assumptions in order to be analyzed, such as a sparsity assumption under which only a small subset of the variables is supposed to be relevant. In this high-dimensional context, we worked on a real dataset from the pig species obtained with high-throughput technologies: metabolomic data measured by NMR spectroscopy and phenotypic data obtained mostly post-mortem. There are two objectives. On the one hand, we aim to obtain good predictions of phenotypes of interest for pig production; on the other hand, we want to make explicit the biological relationships between these phenotypes and the metabolome. Using the Lasso method in a linear model, we show that the metabolome has non-negligible predictive power for some phenotypes that are important for pig production, such as lean meat percentage and average daily feed consumption. The second objective is a variable selection problem. Classical methods such as the Lasso and the FDR procedure are investigated, and new, more powerful methods are developed: we propose a variable selection method for linear models based on multiple hypothesis testing. Non-asymptotic power results are given for this method under certain conditions on the signal. Since additional data are available on the animals, such as the batches in which they were raised and their family relationships, linear mixed models are also considered.
A new algorithm for fixed-effects selection is developed, and it turns out to be much faster than existing algorithms with the same objective. Thanks to its decomposition into distinct steps, the algorithm can be combined with any variable selection method developed for the classical linear model. However, its convergence properties depend on the method used. Combining this algorithm with the multiple testing method gives very good empirical results. All these methods are applied to the real dataset, and biological relationships are highlighted.

Theses.fr

Selection of fixed effects in high dimensional linear mixed models using a multicycle ECM algorithm

Author: Laurent Béatrice
Rohart Florian
San Cristobal Magali
Publication venue: Elsevier BV
Publication date: 01/12/2014

Linear mixed models are especially useful when observations are grouped. In a high-dimensional setting, however, selecting the fixed-effect coefficients in these models is mandatory, as classical tools do not perform well. By considering the random effects as missing values in the linear mixed model framework, an ℓ1-penalization on the fixed-effect coefficients of the resulting log-likelihood is proposed. The optimization problem is solved via a multicycle Expectation Conditional Maximization (ECM) algorithm, which allows the number of parameters p to be larger than the total number of observations n and does not require the inversion of the sample n×n covariance matrix. The proposed algorithm can be combined with any variable selection method developed for linear models. A variant of the proposed approach replaces the ℓ1-penalization with a multiple testing procedure for the variable selection aspect and is shown to greatly improve the False Discovery Rate. Both methods are implemented in the MMS R-package, and are shown to give very satisfying results in a high-dimensional simulated setting.
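
One way to read the construction described in this abstract is sketched below, as an illustration only. The notation (a single grouping factor with q levels, variances σ_u² and σ_e², penalty weight λ) is assumed here and may differ from the paper's parametrization and scaling.

```latex
% Assumed model: one random-effect factor with q levels.
y = X\beta + Zu + \varepsilon, \qquad
u \sim \mathcal{N}(0, \sigma_u^2 I_q), \quad
\varepsilon \sim \mathcal{N}(0, \sigma_e^2 I_n)

% Complete-data criterion, with the random effects u treated as missing data,
% plus an l1 penalty on the fixed effects:
-2 \log L(\beta, \sigma_u^2, \sigma_e^2 ; y, u) + \lambda \|\beta\|_1
= n \log \sigma_e^2 + \frac{\|y - X\beta - Zu\|_2^2}{\sigma_e^2}
+ q \log \sigma_u^2 + \frac{\|u\|_2^2}{\sigma_u^2}
+ \lambda \|\beta\|_1 + \text{const}

% With u replaced by its conditional expectation \hat{u} and the variances held fixed,
% the update of beta is a Lasso problem on the adjusted response:
\hat{\beta} = \arg\min_{\beta}\;
\frac{\|(y - Z\hat{u}) - X\beta\|_2^2}{\sigma_e^2} + \lambda \|\beta\|_1
```

Under these assumptions, the β-step is an ordinary penalized regression on y − Zû. This is consistent with the abstract's remark that the algorithm can be combined with any variable selection method developed for linear models.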

Scientific Publications of the University of Toulouse II Le Mirail

HAL-INSA Toulouse

University of Queensland eSpace

mixOmics: An R package for 'omics feature selection and multiple data integration

Author: Gautier Benoît
Lê Cao Kim-Anh
Rohart Florian
Singh Amrit
Publication venue: Public Library of Science (PLoS)
Publication date: 01/11/2017

The advent of high throughput technologies has led to a wealth of publicly available 'omics data coming from different sources, such as transcriptomics, proteomics, metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a 'molecular signature') to explain or predict biological conditions, but mainly for a single type of 'omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous 'omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple 'omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of 'omics data available from the package

Directory of Open Access Journals

University of Melbourne Institutional Repository

University of Queensland eSpace

MINT: a multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms

Author: Bougeard Stéphanie
Eslami Aida
Lê Cao Kim-Anh
Matigian Nicholas
Rohart Florian
Publication venue: BioMed Central
Publication date: 27/02/2017

Background: Molecular signatures identified from high-throughput transcriptomic studies often have poor reliability and fail to reproduce across studies. One solution is to combine independent studies into a single integrative analysis, additionally increasing sample size. However, the different protocols and technological platforms across transcriptomic studies produce unwanted systematic variation that strongly confounds the integrative analysis results. When studies aim to discriminate an outcome of interest, the common approach is a sequential two-step procedure; unwanted systematic variation removal techniques are applied prior to classification methods. Results: To limit the risk of overfitting and over-optimistic results of a two-step procedure, we developed a novel multivariate integration method, MINT, that simultaneously accounts for unwanted systematic variation and identifies predictive gene signatures with greater reproducibility and accuracy. In two biological examples on the classification of three human cell types and four subtypes of breast cancer, we combined high-dimensional microarray and RNA-seq data sets and MINT identified highly reproducible and relevant gene signatures predictive of a given phenotype. MINT led to superior classification and prediction accuracy compared to the existing sequential two-step procedures. Conclusions: MINT is a powerful approach and the first of its kind to solve the integrative classification framework in a single step by combining multiple independent studies. MINT is computationally fast as part of the mixOmics R CRAN package, available at http://www.mixOmics.org/mixMINT/ and http://cran.r-project.org/web/packages/mixOmics/ .

Springer - Publisher Connector

PubMed Central

University of British Columbia: cIRcle - UBC's Information Repository

University of Melbourne Institutional Repository

University of Queensland eSpace