
168 research outputs found

Semi-parametric estimation of the hazard function in a model with covariate measurement error

Author: Martin-Magniette Marie-Laure
Taupin Marie-Luce
Publication date: 08/06/2006

We consider a model where the failure hazard function, conditional on a covariate $Z$, is given by $R(t,\theta^0|Z)=\eta_{\gamma^0}(t)f_{\beta^0}(Z)$, with $\theta^0=(\beta^0,\gamma^0)^\top\in \mathbb{R}^{m+p}$. The baseline hazard function $\eta_{\gamma^0}$ and the relative risk $f_{\beta^0}$ both belong to parametric families. The covariate $Z$ is measured through the error model $U=Z+\epsilon$, where $\epsilon$ is independent of $Z$, with known density $f_\epsilon$. We observe an $n$-sample $(X_i, D_i, U_i)$, $i=1,\ldots,n$, where $X_i$ is the minimum of the failure time and the censoring time, and $D_i$ is the censoring indicator. We aim to estimate $\theta^0$ in the presence of the unknown density $g$. Our estimation procedure, based on a least squares criterion, provides two estimators. The first one minimizes an estimate of the least squares criterion in which $g$ is estimated by density deconvolution. Its rate depends on the smoothness of $f_\epsilon$ and of $f_\beta(z)$ as a function of $z$. We derive sufficient conditions that ensure $\sqrt{n}$-consistency. The second estimator is constructed under conditions ensuring that the least squares criterion can be directly estimated at the parametric rate. These estimators, studied in depth through examples, are in particular $\sqrt{n}$-consistent and asymptotically Gaussian in the Cox model and in the excess risk model, whatever the form of $f_\epsilon$.
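To make the observation scheme in this abstract concrete, the following is a minimal simulation sketch, not the authors' code. It draws triples $(X_i, D_i, U_i)$ under choices the abstract does not specify and that are assumed here purely for illustration: a constant baseline hazard, a Cox relative risk $f_\beta(z)=e^{\beta z}$, a standard Gaussian covariate, Gaussian measurement error, exponential censoring, and the convention that $D_i=1$ marks an observed failure. The function name `simulate_sample` and all its parameters are hypothetical.

```python
import math
import random


def simulate_sample(n, beta=0.5, gamma=1.0, noise_sd=0.3, censor_rate=0.5, seed=0):
    """Draw (X_i, D_i, U_i), i = 1..n, from one illustrative instance of the model.

    Assumed for illustration only (none of this is stated in the abstract):
      - constant baseline hazard eta_gamma(t) = gamma,
      - Cox relative risk f_beta(z) = exp(beta * z),
      - Z ~ N(0, 1) and Gaussian error epsilon ~ N(0, noise_sd^2),
      - exponential censoring times with rate censor_rate,
      - D_i = 1 when the failure is observed, 0 when it is censored.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                   # true covariate, never observed
        hazard = gamma * math.exp(beta * z)       # hazard given Z (constant in t here)
        failure = rng.expovariate(hazard)         # failure time given Z
        censoring = rng.expovariate(censor_rate)  # independent censoring time
        x = min(failure, censoring)               # observed time X_i
        d = 1 if failure <= censoring else 0      # censoring indicator D_i
        u = z + rng.gauss(0.0, noise_sd)          # error-prone covariate U_i = Z_i + eps_i
        sample.append((x, d, u))
    return sample
```

Calling `simulate_sample(200)` returns a list of 200 tuples. An estimator of $\theta^0$ would see only these tuples and the error density, never the underlying values of $Z$.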

arXiv.org e-Print Archive

HAL Evry

HAL Descartes

Hal-Diderot

Estimation of the hazard function in a semiparametric model with covariate measurement error

Author: Martin-Magniette Marie-Laure
Taupin Marie-Luce
Publication venue: EDP Sciences
Publication date: 01/01/2008

We consider a failure hazard function, conditional on a time-independent covariate $Z$, given by $R(t,\theta^0|Z)=\eta_{\gamma^0}(t)f_{\beta^0}(Z)$. The baseline hazard function $\eta_{\gamma^0}$ and the relative risk $f_{\beta^0}$ both belong to parametric families, with $\theta^0=(\beta^0,\gamma^0)^\top\in \mathbb{R}^{m+p}$. The covariate $Z$ has an unknown density and is measured with error through an additive error model $U=Z+\epsilon$, where $\epsilon$ is a random variable, independent of $Z$, with known density $f_\epsilon$. We observe an $n$-sample $(X_i, D_i, U_i)$, $i = 1, \ldots, n$, where $X_i$ is the minimum of the failure time and the censoring time, and $D_i$ is the censoring indicator. Using a least squares criterion and deconvolution methods, we propose a consistent estimator of $\theta^0$ based on the observations $(X_i, D_i, U_i)$, $i = 1, \ldots, n$. We give an upper bound for its risk, which depends on the smoothness properties of $f_\epsilon$ and of $f_\beta(z)$ as a function of $z$, and we derive sufficient conditions for $\sqrt{n}$-consistency. We give detailed examples considering various types of relative risk and various types of error density $f_\epsilon$. In particular, in the Cox model and in the excess risk model, the estimator of $\theta^0$ is $\sqrt{n}$-consistent and asymptotically Gaussian regardless of the form of $f_\epsilon$.

HAL Evry

CiteSeerX

EDP Sciences OAI-PMH repository (1.2.0)

HAL Descartes

Numérisation de Documents Anciens Mathématiques

Unsupervised Classification for Tiling Arrays: ChIP-chip and Transcriptome

Author: Aubourg Sébastien
Brunaud Véronique
Bérard Caroline
Martin-Magniette Marie-Laure
Robin Stéphane
Publication date: 01/01/2011

Tiling arrays make large-scale exploration of the genome possible, thanks to probes that cover the whole genome at very high density (up to 2,000,000 probes). The biological questions usually addressed are either the expression difference between two conditions or the detection of transcribed regions. In this work we propose to consider both questions simultaneously as an unsupervised classification problem, by modeling the joint distribution of the two conditions. In contrast to previous methods, we account for all available information on the probes as well as biological knowledge such as annotation and spatial dependence between probes. Since probes are not biologically relevant units, we propose a classification rule for non-connected regions covered by several probes. Applications to transcriptomic and ChIP-chip data of Arabidopsis thaliana, obtained with a NimbleGen tiling array, highlight the importance of precise modeling and of region classification.

arXiv.org e-Print Archive

HAL Evry

HAL Descartes

Clustering high-throughput sequencing data with Poisson mixture models

Author: Celeux Gilles
Martin-Magniette Marie-Laure
Maugis-Rabusseau Cathy
Rau Andrea
Publication venue: HAL CCSD
Publication date: 01/01/2011

In recent years gene expression studies have increasingly made use of next generation sequencing technology. In turn, research concerning the appropriate statistical methods for the analysis of digital gene expression has flourished, primarily in the context of normalization and differential analysis. In this work, we focus on the question of clustering digital gene expression profiles as a means to discover groups of co-expressed genes. We propose two parameterizations of a Poisson mixture model to cluster expression profiles of high-throughput sequencing data. A set of simulation studies compares the performance of the proposed models with that of an approach developed for a similar type of data, namely serial analysis of gene expression (SAGE). We also study the performance of these approaches on two real high-throughput sequencing data sets. The R package HTSCluster used to implement the proposed Poisson mixture models is available on CRAN.
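As a rough illustration of the clustering idea in this abstract, the sketch below runs a plain EM algorithm for a mixture of independent Poisson distributions on count vectors. It is a generic textbook version written for this listing. It is not the HTSCluster implementation and does not reproduce either of the two parameterizations proposed by the authors. The function name `poisson_mixture_em` and its arguments are hypothetical.

```python
import math


def poisson_mixture_em(counts, init_means, n_iter=50):
    """EM for a K-component mixture of independent Poissons (generic sketch).

    counts: list of equal-length count vectors, one per gene.
    init_means: K initial mean vectors (positive floats), one per cluster.
    Returns (weights, means, labels).
    """
    k = len(init_means)
    dim = len(counts[0])
    means = [list(m) for m in init_means]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities, computed in log space.
        resp = []
        for y in counts:
            logp = []
            for c in range(k):
                ll = math.log(weights[c])
                for j in range(dim):
                    # Poisson log-pmf without log(y!), which is the same for every cluster.
                    ll += y[j] * math.log(means[c][j]) - means[c][j]
                logp.append(ll)
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp.append([v / total for v in unnorm])
        # M-step: update mixing weights and cluster means (floored to stay positive).
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = max(nc / len(counts), 1e-12)
            for j in range(dim):
                weighted = sum(r[c] * y[j] for r, y in zip(resp, counts))
                means[c][j] = max(weighted / max(nc, 1e-12), 1e-12)
    # Assign each gene to its most probable cluster.
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return weights, means, labels
```

For example, `poisson_mixture_em([[1, 2], [0, 1], [50, 60], [55, 48]], [[1.0, 1.0], [50.0, 50.0]])` separates the low-count from the high-count profiles. Real sequencing data would also need to account for library size, which this sketch ignores.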

HAL Evry

Scientific Publications of the University of Toulouse II Le Mirail

INRIA a CCSD electronic archive server

HAL-INSA Toulouse

ProdInra

Hal-Diderot

Normalization for triple-target microarray experiments

Author: Aubert Julie
Bar-Hen Avner
Daudin Jean-Jacques
Elftieh Samira
Magniette Frederic
Martin-Magniette Marie-Laure
Renou Jean-Pierre
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Most microarray studies are made using labelling with one or two dyes, which allows the hybridization of one or two samples on the same slide. In such experiments, the most frequently used dyes are Cy3 and Cy5. Recent improvements in the technology (dye-labelling, scanner, and image analysis) allow hybridization of up to four samples simultaneously. The two additional dyes are Alexa488 and Alexa494. The triple-target or four-target technology is very promising, since it allows more flexibility in the design of experiments, an increase in statistical power when comparing gene expression induced by different conditions, and a reduced number of slides. However, few methods have been proposed for the statistical analysis of such data. Moreover, the lowess correction of the global dye effect is available only for two-color experiments, and even if its application can be derived, it does not allow simultaneous correction of the raw data.

Results: We propose a two-step normalization procedure for triple-target experiments. First, the dye bleeding is evaluated and corrected if necessary. Then the signal in each channel is normalized using a generalized lowess procedure to correct a global dye bias. The normalization procedure is validated using triple-self experiments and by comparing the results of triple-target and two-color experiments. Although the focus is on triple-target microarrays, the proposed method can be used to normalize p differently labelled targets co-hybridized on the same array, for any value of p greater than 2.

Conclusion: The proposed normalization procedure is effective: the technical biases are reduced, the number of false positives is under control in the analysis of differentially expressed genes, and the triple-target experiments are more powerful than the corresponding two-color experiments. There is room for improving microarray experiments by simultaneously hybridizing more than two samples.

HAL Evry

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

HAL Descartes

ProdInra

Comparing Model Selection and Regularization Approaches to Variable Selection in Model-Based Clustering

Author: Celeux Gilles
Martin-Magniette Marie-Laure
Maugis Cathy
Raftery Adrian E.
Publication venue: Société Française de Statistique et Société Mathématique de France
Publication date: 01/01/2014

We compare two major approaches to variable selection in clustering: model selection and regularization. Based on previous results, we select the method of Maugis et al. (2009b), which modified the method of Raftery and Dean (2006), as a current state-of-the-art model selection method. We select the method of Witten and Tibshirani (2010) as a current state-of-the-art regularization method. We compared the methods by simulation in terms of their accuracy in both classification and variable selection. In the first simulation experiment, all the variables were conditionally independent given cluster membership. We found that variable selection (of either kind) yielded substantial gains in classification accuracy when the clusters were well separated, but few gains when the clusters were close together. We found that the two variable selection methods had comparable classification accuracy, but that the model selection approach had substantially better accuracy in selecting variables. In our second simulation experiment, there were correlations among the variables given the cluster memberships. We found that the model selection approach was substantially more accurate in terms of both classification and variable selection than the regularization approach, and that both gave more accurate classifications than K-means without variable selection. However, the model selection approach is not usable in very high-dimensional settings. Finally, the comparison is also carried out on real data.

INRIA a CCSD electronic archive server

Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data

Author: Alain Bergeron
Arnaud Droit
Benjamin Vittrant
Marie Laure Martin-Magniette
Marie Pier Scott Boyer
Mickael Leclercq
Olivier Perin
Yves Fradet
Publication venue: Frontiers Media SA
Publication date: 01/05/2019

The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated with machine learning models that efficiently predict a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML

Directory of Open Access Journals

Genome-wide interacting effects of sucrose and herbicide-mediated stress in Arabidopsis thaliana: novel insights into atrazine toxicity and sucrose-induced tolerance

Author: Cabello-Hurtado Francisco
Couée Ivan
El Amrani Abdelhak
Gouesbet Gwenola
Martin-Magniette Marie-Laure
Ramel Fanny
Renou Jean-Pierre
Sulmon Cécile
Taconnat Ludivine
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Soluble sugars, which play a central role in plant structure and metabolism, are also involved in the responses to a number of stresses, and act as metabolite signalling molecules that activate specific or hormone-crosstalk transduction pathways. The different roles of exogenous sucrose in the tolerance of Arabidopsis thaliana plantlets to the herbicide atrazine and oxidative stress were studied by a transcriptomic approach using CATMA arrays.

Results: Parallel situations of xenobiotic stress and sucrose-induced tolerance in the presence of atrazine, of sucrose, and of sucrose plus atrazine were compared. These approaches revealed that atrazine affected gene expression and therefore seedling physiology at a much larger scale than previously described, with potential impairment of protein translation and of reactive-oxygen-species (ROS) defence mechanisms. Correlatively, sucrose-induced protection against atrazine injury was associated with important modifications of gene expression related to ROS defence mechanisms and repair mechanisms. These protection-related changes of gene expression did not result only from the effects of sucrose itself, but from combined effects of sucrose and atrazine, thus strongly suggesting important interactions of sucrose and xenobiotic signalling or of sucrose and ROS signalling.

Conclusion: These interactions resulted in characteristic differential expression of gene families such as ascorbate peroxidases, glutathione-S-transferases and cytochrome P450s, and in the early induction of an original set of transcription factors. These genes used as molecular markers will eventually be of great importance in the context of xenobiotic tolerance and phytoremediation.

HAL Evry

Springer - Publisher Connector

Directory of Open Access Journals

Sélection de variables pour la classification par mélanges gaussiens pour prédire la fonction des gènes orphelins (Variable selection for Gaussian mixture clustering to predict the function of orphan genes)

Author: Aubourg Sebastien
Celeux Gilles
Lecharny Alain
Martin-Magniette Marie-Laure
Maugis Cathy
Renou Jean-Pierre
Tamby Jean-Philippe
Publication venue: Modulad
Publication date: 01/01/2009

Biologists are interested in predicting the gene functions of organisms with sequenced genomes from microarray transcriptome data. The development of microarray technology makes it possible to study the whole genome under many experimental conditions. This abundance of information may seem to be an advantage for gene clustering. However, the structure of interest is often contained in a subset of the available variables. The currently available variable selection procedures in model-based clustering assume that the variables irrelevant for clustering are either all independent of, or all linked with, the relevant clustering variables. A more versatile variable selection model is proposed, taking into account three possible roles for each variable: the relevant clustering variables, the redundant variables and the independent variables. A model selection criterion and a variable selection algorithm are derived for this new modelling of variable roles. The interest of this new modelling for discovering the function of orphan genes is highlighted on a transcriptome dataset for the plant Arabidopsis thaliana.

HAL Evry

INRIA a CCSD electronic archive server

ProdInra

Hal-Diderot