    Block-diagonal covariance selection for high-dimensional Gaussian graphical models

    Gaussian graphical models are widely utilized to infer and visualize networks of dependencies between continuous variables. However, inferring the graph is difficult when the sample size is small compared to the number of variables. To reduce the number of parameters to estimate in the model, we propose a non-asymptotic model selection procedure supported by strong theoretical guarantees based on an oracle inequality and a minimax lower bound. The covariance matrix of the model is approximated by a block-diagonal matrix. The structure of this matrix is detected by thresholding the sample covariance matrix, where the threshold is selected using the slope heuristic. Based on the block-diagonal structure of the covariance matrix, the estimation problem is divided into several independent problems: subsequently, the network of dependencies between variables is inferred using the graphical lasso algorithm in each block. The performance of the procedure is illustrated on simulated data. An application to a real gene expression dataset with a limited sample size is also presented: the dimension reduction allows attention to be objectively focused on interactions among smaller subsets of genes, leading to a more parsimonious and interpretable modular network. Comment: Accepted in JAS.
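
    The two-step procedure described above (thresholding the sample covariance to detect a block-diagonal structure, then running the graphical lasso independently inside each block) can be sketched as follows. This is a minimal illustration: the helper name `block_diagonal_glasso`, the threshold and the penalty are arbitrary placeholders, not the slope-heuristic calibration used in the paper.

```python
# Minimal sketch: block detection by thresholding correlations, then one
# graphical lasso per block. Threshold and alpha are illustrative placeholders.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.covariance import GraphicalLasso

def block_diagonal_glasso(X, thresh=0.3, alpha=0.1):
    """X: (n_samples, p) data matrix. Returns {block_id: (variable_indices, precision)}."""
    # Step 1: detect a block-diagonal structure by thresholding sample correlations.
    corr = np.corrcoef(X, rowvar=False)
    adj = (np.abs(corr) > thresh).astype(int)
    n_blocks, labels = connected_components(csr_matrix(adj), directed=False)

    # Step 2: run the graphical lasso independently inside each block.
    results = {}
    for b in range(n_blocks):
        idx = np.where(labels == b)[0]
        if idx.size == 1:          # singleton block: no edges to infer
            results[b] = (idx, np.array([[1.0 / X[:, idx[0]].var()]]))
            continue
        gl = GraphicalLasso(alpha=alpha).fit(X[:, idx])
        results[b] = (idx, gl.precision_)
    return results
```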

    An overview of variable selection procedures using regularization paths in high-dimensional Gaussian linear regression

    Current high-throughput technologies provide a large number of variables to describe a phenomenon, while only a few of them are generally sufficient to answer the question at hand. Identifying these variables in a high-dimensional Gaussian linear regression model is one of the most widely used statistical approaches. In this article, we describe step-by-step the variable selection procedures built upon regularization paths. Regularization paths are obtained by combining a regularization function and an algorithm. They are then combined either with a model selection procedure using penalty functions or with a sampling strategy to obtain the final selected variables. We perform a comparison study by considering three simulation settings with various dependency structures on the variables. In all the settings, we evaluate (i) the ability to discriminate between the active variables and the non-active variables along the regularization path (pROC-AUC), (ii) the prediction performance of the selected variable subset (MSE) and (iii) the relevance of the selected variables (recall, specificity, FDR). From the results, we provide recommendations on strategies to be favored depending on the characteristics of the problem at hand. We find that the Elastic-net regularization function provides better results than the ℓ1 one most of the time, and that the lars algorithm should be preferred to the GD one. ESCV provides the best prediction performances. Bolasso and the knockoffs method are judicious choices to limit the selection of non-active variables while ensuring the selection of enough active variables. Conversely, the data-driven penalties considered in this review are not to be favored. As for Tigress and LinSelect, they are conservative methods. Comment: 29 pages, 9 figures, 3 tables.
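
    As a minimal illustration of scoring a regularization path, the sketch below ranks variables by the penalty level at which they enter the Lasso or Elastic-net path and scores that ranking against the true support. The helper name `path_auc` and the toy data are assumptions, and the plain ROC-AUC is used as a stand-in for the partial ROC-AUC (pROC-AUC) reported in the paper.

```python
# Rank variables by the largest penalty at which they become non-zero along the
# path, then score the ranking against the true support (plain ROC-AUC here).
import numpy as np
from sklearn.linear_model import lasso_path, enet_path
from sklearn.metrics import roc_auc_score

def path_auc(X, y, true_support, l1_ratio=1.0):
    if l1_ratio == 1.0:
        alphas, coefs, _ = lasso_path(X, y)
    else:
        alphas, coefs, _ = enet_path(X, y, l1_ratio=l1_ratio)
    # coefs has shape (n_features, n_alphas); alphas are decreasing.
    entry_alpha = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        nz = np.nonzero(coefs[j])[0]
        entry_alpha[j] = alphas[nz[0]] if nz.size else 0.0   # earlier entry = larger alpha
    return roc_auc_score(true_support, entry_alpha)

# Toy usage: 50 samples, 100 variables, 5 of them active.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))
beta = np.zeros(100); beta[:5] = 2.0
y = X @ beta + rng.normal(size=50)
print(path_auc(X, y, true_support=(beta != 0), l1_ratio=1.0))   # Lasso path
print(path_auc(X, y, true_support=(beta != 0), l1_ratio=0.5))   # Elastic-net path
```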

    Data transformation and model comparison for clustering RNA-seq data

    High-throughput transcriptome sequencing (RNA-seq) data are made up of highly heterogeneous counts. Although they are often modeled with discrete distributions, including the Poisson and negative binomial distributions, Gaussian models on transformed data can alternatively be considered. We show how the likelihoods of these different models, which operate on different versions of the data, can be objectively compared. We focus attention on the problem of clustering gene expression profiles, where Poisson mixtures on count data are compared with Gaussian mixtures on transformed data.
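
    A minimal sketch of the comparability issue raised above: a Gaussian likelihood computed on transformed counts can be put back on the count scale by adding the log-Jacobian of the transformation. This only illustrates the standard change-of-variable correction under a toy log transformation; the specific transformations, mixture models, and discreteness corrections studied in the paper are not reproduced here, and the single-Poisson competitor below is a deliberately crude stand-in.

```python
# Compare a Gaussian mixture on log-transformed counts with a count model on
# raw counts, using the change-of-variable (Jacobian) correction for the former.
import numpy as np
from scipy.stats import poisson
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
counts = rng.poisson(lam=rng.choice([5, 50], size=(300, 1)))  # toy heterogeneous counts

# Gaussian mixture on z = log(y + 1); correct by the Jacobian dz/dy = 1/(y + 1),
# i.e. subtract sum(log(y + 1)) to express the likelihood on the count scale.
z = np.log(counts + 1.0)
gmm = GaussianMixture(n_components=2, random_state=0).fit(z)
loglik_gauss = gmm.score_samples(z).sum() - np.log(counts + 1.0).sum()

# A very crude Poisson competitor on the raw counts, for the sake of comparison.
lam = counts.mean()
loglik_pois = poisson.logpmf(counts, lam).sum()

print(loglik_gauss, loglik_pois)
```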

    A model selection criterion for model-based clustering of annotated gene expression data

    In co-expression analyses of gene expression data, it is often of interest to interpret clusters of co-expressed genes with respect to a set of external information, such as a potentially incomplete list of functional properties for which a subset of genes may be annotated. Based on the framework of finite mixture models, we propose a model selection criterion that takes into account such external gene annotations, providing an efficient tool for selecting a relevant number of clusters and clustering model. This criterion, called the integrated completed annotated likelihood (ICAL), is defined by adding an entropy term to a penalized likelihood to measure the concordance between a clustering partition and the external annotation information. The ICAL leads to the choice of a model that is more easily interpretable with respect to the known functional gene annotations. We illustrate the interest of this model selection criterion in conjunction with Gaussian mixture models on simulated gene expression data and on real RNA-seq data.
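
    The exact ICAL formula is defined in the paper and is not reproduced here. The schematic sketch below shows the familiar ICL-style construction (a BIC-penalized likelihood minus the entropy of the fuzzy partition) together with a placeholder term indicating where concordance with external annotations would enter; the helpers `icl` and `annotation_entropy` are illustrative assumptions, not the paper's definitions.

```python
# Schematic sketch only: ICL-style criterion plus a placeholder annotation term.
import numpy as np
from sklearn.mixture import GaussianMixture

def icl(gmm, X):
    tau = gmm.predict_proba(X)
    entropy = -np.sum(tau * np.log(np.clip(tau, 1e-12, None)))
    return -0.5 * gmm.bic(X) - entropy          # BIC-penalized log-lik minus partition entropy

def annotation_entropy(clusters, annotations):
    """Placeholder concordance term: entropy of annotation labels within clusters.
    `annotations` may contain None for unannotated genes (they are ignored)."""
    total = 0.0
    for k in np.unique(clusters):
        labels = [a for a, c in zip(annotations, clusters) if c == k and a is not None]
        if not labels:
            continue
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        total += -(counts * np.log(p)).sum()
    return total

# Model selection over the number of clusters K on toy data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1, size=(100, 5)) for m in (0.0, 3.0, 6.0)])
annotations = ["A"] * 100 + ["B"] * 100 + [None] * 100
scores = {}
for K in range(1, 6):
    gmm = GaussianMixture(n_components=K, random_state=0).fit(X)
    scores[K] = icl(gmm, X) - annotation_entropy(gmm.predict(X), annotations)
print(max(scores, key=scores.get))
```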

    Stable network inference in high-dimensional graphical model using single-linkage

    Stability, akin to reproducibility, is crucial in statistical analysis. This paper examines the stability of sparse network inference in high-dimensional graphical models, where selected edges should remain consistent across different samples. Our study focuses on the Graphical Lasso and its decomposition into two steps, with the first step involving hierarchical clustering using single linkage. We provide theoretical proof that single linkage is stable, evidenced by controlled distances between two dendrograms inferred from two samples. Practical experiments further illustrate the stability of the Graphical Lasso's various steps, including dendrograms, variable clusters, and final networks. Our results, validated through both theoretical analysis and practical experiments using simulated and real datasets, demonstrate that single linkage is more stable than other methods when a modular structure is present.
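
    The two-step decomposition discussed above can be sketched as follows: single-linkage clustering of the variables on a correlation-based distance, then a graphical lasso inside each cluster, with stability assessed by comparing the edge sets obtained on two subsamples. The helper name `clustered_glasso_edges`, the cut height, the penalty, and the Jaccard-based stability score are illustrative placeholders.

```python
# Single-linkage clustering of variables, graphical lasso per cluster, and a
# simple Jaccard comparison of the edge sets inferred from two subsamples.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.covariance import GraphicalLasso

def clustered_glasso_edges(X, h=0.7, alpha=0.1):
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="single")
    clusters = fcluster(Z, t=h, criterion="distance")
    edges = set()
    for k in np.unique(clusters):
        idx = np.where(clusters == k)[0]
        if idx.size < 2:
            continue
        prec = GraphicalLasso(alpha=alpha).fit(X[:, idx]).precision_
        for a in range(idx.size):
            for b in range(a + 1, idx.size):
                if abs(prec[a, b]) > 1e-8:
                    edges.add((idx[a], idx[b]))
    return edges

# Stability check: Jaccard overlap of edges inferred from two halves of the sample.
rng = np.random.default_rng(3)
cov = np.kron(np.eye(2), np.full((5, 5), 0.6)) + 0.4 * np.eye(10)  # two correlated blocks
X = rng.multivariate_normal(np.zeros(10), cov, size=200)
e1, e2 = clustered_glasso_edges(X[:100]), clustered_glasso_edges(X[100:])
print(len(e1 & e2) / max(len(e1 | e2), 1))
```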

    A comprehensive review of variable selection in high-dimensional regression for molecular biology

    Variable selection methods are widely used in molecular biology to detect biomarkers or to infer gene regulatory networks from transcriptomic data. Methods are mainly based on the high-dimensional Gaussian linear regression model, and we focus on this framework for this review. We propose a comparison study of variable selection procedures from regularization paths by considering three simulation settings. In the first one, the variables are independent, allowing the evaluation of the methods in the theoretical framework used to develop them. In the second setting, two structures of correlation between variables are considered to evaluate how the dependencies usually observed in biology affect the estimation. Finally, the third setting mimics the biological complexity of transcription factor regulations; it is the farthest setting from the Gaussian framework. In all the settings, the prediction capacity and the identification of the explanatory variables are evaluated for each method. Our results show that variable selection procedures rely on statistical assumptions that should be carefully checked. The Gaussian assumption and the number of explanatory variables are the two key points. As soon as correlation exists, the Elastic-net regularization function provides better results than the Lasso. LinSelect, a non-asymptotic model selection method, should be preferred to the commonly used eBIC criterion. Bolasso is a judicious strategy to limit the selection of non-explanatory variables. Comment: 15 pages, 5 tables.
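
    A minimal sketch of the Bolasso strategy mentioned above: run the Lasso on bootstrap resamples of the data and keep only the variables selected in (almost) every run. The helper name `bolasso`, the number of resamples, the cross-validated penalty choice, and the frequency threshold are placeholders, not the settings used in the review.

```python
# Bootstrap-Lasso (Bolasso-style) sketch: intersect the supports selected on
# bootstrap resamples. Parameters are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LassoCV

def bolasso(X, y, n_boot=30, freq=1.0, random_state=0):
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)               # bootstrap resample
        model = LassoCV(cv=5).fit(X[idx], y[idx])      # penalty chosen by CV on each resample
        counts += (model.coef_ != 0)
    return np.where(counts / n_boot >= freq)[0]        # intersection (or near-intersection)

# Toy usage: 5 active variables out of 100.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 100))
beta = np.zeros(100); beta[:5] = 2.0
y = X @ beta + rng.normal(size=100)
print(bolasso(X, y))
```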

    Data-based filtering for replicated high-throughput transcriptome sequencing experiments

    RNA sequencing is now widely performed to study differential expression among experimental conditions. As tests are performed on a large number of genes, very stringent false discovery rate control is required, at the expense of detection power. Ad hoc filtering techniques are regularly used to moderate this correction by removing genes with low signal, with little attention paid to their impact on downstream analyses.
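
    The "ad hoc filtering" referred to above typically means discarding genes whose normalized counts never exceed a fixed cutoff before differential-expression testing. The sketch below shows such a generic counts-per-million filter as an illustration only; the function name `cpm_filter` and the cutoffs are assumptions, and this is not the data-based rule proposed in the paper.

```python
# Generic ad hoc filter: keep genes with at least `min_cpm` counts-per-million
# in at least `min_samples` samples. Cutoffs are arbitrary illustrations.
import numpy as np

def cpm_filter(counts, min_cpm=1.0, min_samples=2):
    """counts: (genes, samples) raw count matrix; returns a boolean mask of kept genes."""
    cpm = counts / counts.sum(axis=0, keepdims=True) * 1e6
    return (cpm >= min_cpm).sum(axis=1) >= min_samples
```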