168 research outputs found

    Semi-parametric estimation of the hazard function in a model with covariate measurement error

    We consider a model where the failure hazard function, conditional on a covariate $Z$, is given by $R(t,\theta^0|Z)=\eta_{\gamma^0}(t)f_{\beta^0}(Z)$, with $\theta^0=(\beta^0,\gamma^0)^\top\in \mathbb{R}^{m+p}$. The baseline hazard function $\eta_{\gamma^0}$ and the relative risk $f_{\beta^0}$ both belong to parametric families. The covariate $Z$ is measured through the error model $U=Z+\epsilon$, where $\epsilon$ is independent of $Z$ and has known density $f_\epsilon$. We observe an $n$-sample $(X_i, D_i, U_i)$, $i=1,\dots,n$, where $X_i$ is the minimum of the failure time and the censoring time, and $D_i$ is the censoring indicator. We aim at estimating $\theta^0$ in the presence of the unknown density $g$ of $Z$. Our estimation procedure, based on a least squares criterion, provides two estimators. The first one minimizes an estimate of the least squares criterion in which $g$ is estimated by density deconvolution. Its rate depends on the smoothness of $f_\epsilon$ and of $f_\beta(z)$ as a function of $z$. We derive sufficient conditions that ensure $\sqrt{n}$-consistency. The second estimator is constructed under conditions ensuring that the least squares criterion can be estimated directly at the parametric rate. These estimators, studied in depth through examples, are in particular $\sqrt{n}$-consistent and asymptotically Gaussian in the Cox model and in the excess risk model, whatever the form of $f_\epsilon$.
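
For orientation, the model in this abstract can be written out as a short display. This is only a sketch: the Cox and excess risk forms of $f_\beta$ shown here are the usual textbook ones, and the symbol $h$ for the density of $U$ and the superscript $*$ for the Fourier transform are notation introduced here for illustration. None of these is quoted from the abstract.

```latex
\begin{align*}
R(t,\theta^0 \mid Z) &= \eta_{\gamma^0}(t)\, f_{\beta^0}(Z), \\
f_\beta(z) &= \exp(\beta z) && \text{(Cox model, usual form)}, \\
f_\beta(z) &= 1 + \beta z && \text{(excess risk model, usual form)}, \\
U = Z + \epsilon,\ \epsilon \text{ independent of } Z
  \;&\Longrightarrow\; h = g * f_\epsilon,
  \qquad h^*(s) = g^*(s)\, f_\epsilon^*(s).
\end{align*}
```

The last line is the reason deconvolution applies: since $f_\epsilon$ is known, $g^*$ can in principle be recovered as $h^*/f_\epsilon^*$ wherever $f_\epsilon^*$ does not vanish, which is also why the smoothness of $f_\epsilon$ enters the rate.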

    Estimation of the hazard function in a semiparametric model with covariate measurement error

    We consider a failure hazard function, conditional on a time-independent covariate, whose baseline hazard function and relative risk both belong to parametric families. The covariate has an unknown density and is measured with error through an additive error model, in which the error is a random variable, independent of the covariate, with known density. We observe an $n$-sample in which each observation consists of the minimum of the failure time and the censoring time, the censoring indicator, and the error-prone measurement of the covariate. Using a least squares criterion and deconvolution methods, we propose a consistent estimator of the model parameter based on these observations. We give an upper bound for its risk, which depends on the smoothness properties of the error density and of the relative risk as a function of the covariate, and we derive sufficient conditions for $\sqrt{n}$-consistency. We give detailed examples considering various types of relative risk and various types of error density. In particular, in the Cox model and in the excess risk model, the estimator is $\sqrt{n}$-consistent and asymptotically Gaussian regardless of the form of the error density.

    Unsupervised Classification for Tiling Arrays: ChIP-chip and Transcriptome

    Tiling arrays make possible a large-scale exploration of the genome thanks to probes that cover the whole genome at very high density, with up to 2,000,000 probes. The biological questions usually addressed are either the expression difference between two conditions or the detection of transcribed regions. In this work we propose to consider both questions simultaneously as an unsupervised classification problem by modeling the joint distribution of the two conditions. In contrast to previous methods, we account for all available information on the probes as well as biological knowledge such as annotation and spatial dependence between probes. Since probes are not biologically relevant units, we propose a classification rule for non-connected regions covered by several probes. Applications to transcriptomic and ChIP-chip data of Arabidopsis thaliana obtained with a NimbleGen tiling array highlight the importance of precise modeling and of the region classification.

    Clustering high-throughput sequencing data with Poisson mixture models

    In recent years, gene expression studies have increasingly made use of next-generation sequencing technology. In turn, research concerning the appropriate statistical methods for the analysis of digital gene expression has flourished, primarily in the context of normalization and differential analysis. In this work, we focus on the question of clustering digital gene expression profiles as a means to discover groups of co-expressed genes. We propose two parameterizations of a Poisson mixture model to cluster expression profiles of high-throughput sequencing data. A set of simulation studies compares the performance of the proposed models with that of an approach developed for a similar type of data, namely serial analysis of gene expression (SAGE). We also study the performance of these approaches on two real high-throughput sequencing data sets. The R package HTSCluster used to implement the proposed Poisson mixture models is available on CRAN.
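
As a rough illustration of what clustering count profiles with a Poisson mixture involves, here is a minimal EM sketch in Python. It is not the HTSCluster implementation and it does not reproduce either of the two parameterizations proposed in the paper: it assumes a plain mixture of independent Poisson distributions with one rate per cluster and per sample, with no library-size or gene-level normalization. The function name `poisson_mixture_em` and all of its parameters are hypothetical.

```python
import numpy as np


def poisson_mixture_em(counts, n_clusters, n_iter=100, seed=0):
    """Fit a plain Poisson mixture by EM (illustrative sketch only).

    counts: array of shape (n_genes, n_samples) with non-negative counts.
    Returns hard cluster labels, per-cluster rates, and mixing weights.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(counts, dtype=float)
    n = y.shape[0]

    # Initialise rates from randomly chosen rows; the offset avoids log(0).
    rates = y[rng.choice(n, n_clusters, replace=False)] + 0.5
    weights = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # E-step: log posterior up to a constant. The log(y!) term is the
        # same for every cluster, so it cancels after normalisation.
        log_p = y @ np.log(rates).T - rates.sum(axis=1) + np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted averages; small constants guard empty clusters.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        rates = (resp.T @ y) / nk[:, None] + 1e-8

    return resp.argmax(axis=1), rates, weights
```

Calling `poisson_mixture_em(count_matrix, 2)` on a genes-by-samples matrix returns one label per gene. In practice the number of clusters would have to be chosen by a model selection criterion, which this sketch leaves out.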

    Normalization for triple-target microarray experiments

    Background: Most microarray studies are made using labelling with one or two dyes, which allows the hybridization of one or two samples on the same slide. In such experiments, the most frequently used dyes are Cy3 and Cy5. Recent improvements in the technology (dye-labelling, scanner, and image analysis) allow the hybridization of up to four samples simultaneously. The two additional dyes are Alexa488 and Alexa494. The triple-target or four-target technology is very promising, since it allows more flexibility in the design of experiments, an increase in statistical power when comparing gene expressions induced by different conditions, and a reduced number of slides. However, few methods have been proposed for the statistical analysis of such data. Moreover, the lowess correction of the global dye effect is available only for two-color experiments, and even if its application can be derived, it does not allow simultaneous correction of the raw data. Results: We propose a two-step normalization procedure for triple-target experiments. First, the dye bleeding is evaluated and corrected if necessary. Then the signal in each channel is normalized using a generalized lowess procedure to correct a global dye bias. The normalization procedure is validated using triple-self experiments and by comparing the results of triple-target and two-color experiments. Although the focus is on triple-target microarrays, the proposed method can be used to normalize p differently labelled targets co-hybridized on the same array, for any value of p greater than 2. Conclusion: The proposed normalization procedure is effective: the technical biases are reduced, the number of false positives is under control in the analysis of differentially expressed genes, and the triple-target experiments are more powerful than the corresponding two-color experiments. There is room for improving microarray experiments by simultaneously hybridizing more than two samples.

    Comparing Model Selection and Regularization Approaches to Variable Selection in Model-Based Clustering

    We compare two major approaches to variable selection in clustering: model selection and regularization. Based on previous results, we select the method of Maugis et al. (2009b), which modified the method of Raftery and Dean (2006), as a current state-of-the-art model selection method. We select the method of Witten and Tibshirani (2010) as a current state-of-the-art regularization method. We compared the methods by simulation in terms of their accuracy in both classification and variable selection. In the first simulation experiment, all the variables were conditionally independent given cluster membership. We found that variable selection (of either kind) yielded substantial gains in classification accuracy when the clusters were well separated, but few gains when the clusters were close together. We found that the two variable selection methods had comparable classification accuracy, but that the model selection approach had substantially better accuracy in selecting variables. In our second simulation experiment, there were correlations among the variables given the cluster memberships. We found that the model selection approach was substantially more accurate in terms of both classification and variable selection than the regularization approach, and that both gave more accurate classifications than K-means without variable selection. However, the model selection approach is not applicable in very high-dimensional settings. Finally, the comparison is also carried out on real data.

    Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data

    The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated with machine learning models that efficiently predict a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML

    Genome-wide interacting effects of sucrose and herbicide-mediated stress in Arabidopsis thaliana: novel insights into atrazine toxicity and sucrose-induced tolerance

    Background: Soluble sugars, which play a central role in plant structure and metabolism, are also involved in the responses to a number of stresses, and act as metabolite signalling molecules that activate specific or hormone-crosstalk transduction pathways. The different roles of exogenous sucrose in the tolerance of Arabidopsis thaliana plantlets to the herbicide atrazine and oxidative stress were studied by a transcriptomic approach using CATMA arrays. Results: Parallel situations of xenobiotic stress and sucrose-induced tolerance in the presence of atrazine, of sucrose, and of sucrose plus atrazine were compared. These approaches revealed that atrazine affected gene expression and therefore seedling physiology at a much larger scale than previously described, with potential impairment of protein translation and of reactive-oxygen-species (ROS) defence mechanisms. Correlatively, sucrose-induced protection against atrazine injury was associated with important modifications of gene expression related to ROS defence mechanisms and repair mechanisms. These protection-related changes of gene expression did not result only from the effects of sucrose itself, but from combined effects of sucrose and atrazine, thus strongly suggesting important interactions of sucrose and xenobiotic signalling or of sucrose and ROS signalling. Conclusion: These interactions resulted in characteristic differential expression of gene families such as ascorbate peroxidases, glutathione-S-transferases and cytochrome P450s, and in the early induction of an original set of transcription factors. These genes used as molecular markers will eventually be of great importance in the context of xenobiotic tolerance and phytoremediation.

    Sélection de variables pour la classification par mélanges gaussiens pour prédire la fonction des gènes orphelins

    Biologists are interested in predicting the gene functions of organisms with sequenced genomes from microarray transcriptome data. The development of microarray technology makes it possible to study the whole genome under many experimental conditions. This abundance of information may seem to be an advantage for gene clustering. However, the structure of interest is often contained in a subset of the available variables. The currently available variable selection procedures in model-based clustering assume that the variables irrelevant to the clustering are either all independent of, or all linked with, the relevant clustering variables. A more versatile variable selection model is proposed, taking into account three possible roles for each variable: the relevant clustering variables, the redundant variables, and the independent variables. A model selection criterion and a variable selection algorithm are derived for this new variable role modelling. The interest of this new modelling for discovering the function of orphan genes is highlighted on a transcriptome dataset for the plant Arabidopsis thaliana.