140 research outputs found

    Microbial Network Recovery by Compositional Graphical Lasso Under Additive Log-Ratio Transformation

    Interactions between microbial taxa have attracted great research interest in the scientific community given the microbiome data deluge. Several methods have been proposed to model and estimate the conditional dependency between microbial taxa in order to eliminate spurious correlation detections. However, these methods either do not account for the compositional count nature of microbiome data (such as graphical lasso), or are built upon the centered log-ratio transformation (such as SPIEC-EASI), which results in a degenerate covariance matrix and thus an undefined precision matrix to represent the underlying network. In addition, most existing methods ignore the potential consequences of the heterogeneity of microbiome data: the sum of the counts within each sample, termed “sequencing depth”, can vary drastically across samples. To address these issues, we propose a novel method called “compositional graphical lasso” to identify microbial interactions by adopting a logistic normal multinomial model that explicitly incorporates the sequencing depths. Unlike most existing methods, compositional graphical lasso is based on the additive log-ratio transformation, which first selects a reference taxon and then computes the log ratios of the abundances of all the other taxa with respect to that of the reference. One natural concern about the additive log-ratio transformation is whether the estimated network is invariant with respect to the choice of the reference. To address this concern, we establish the reference-invariance property of a subnetwork of interest based on the additive log-ratio transformed data, and propose a reference-invariant version of the compositional graphical lasso by modifying the penalty term in its objective function to penalize only the invariant subnetwork. We illustrate the advantages of the proposed methods over the existing ones under a variety of simulation scenarios and also demonstrate their efficacy by applying them to an oceanic microbiome data set.
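    The two transformations contrasted in the abstract above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function names `alr` and `clr` and the count vector are hypothetical, and all counts are assumed to be strictly positive so that no pseudocount is needed. The sketch shows that the centered log-ratio values always sum to zero (which is why their covariance matrix is degenerate), while the additive log-ratio drops the reference taxon and yields one fewer coordinate.

```python
import math

def alr(parts, ref=-1):
    """Additive log-ratio: log of each part over the chosen reference part."""
    ref = ref % len(parts)
    log_ref = math.log(parts[ref])
    return [math.log(x) - log_ref for i, x in enumerate(parts) if i != ref]

def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(x) for x in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical read counts for 4 taxa in one sample (all assumed positive).
counts = [120, 30, 45, 5]
depth = sum(counts)                        # sequencing depth of this sample
composition = [c / depth for c in counts]  # relative abundances

z_alr = alr(composition)  # 3 coordinates, last taxon used as the reference
z_clr = clr(composition)  # 4 coordinates constrained to sum to zero
```

    Because both transformations use only ratios of parts, applying `alr` to the raw counts or to the relative abundances gives the same values; the sequencing depth cancels out of the ratios, which is one reason a model such as the logistic normal multinomial has to bring the depth back in explicitly.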

    Statistical Methods for High Dimensional Count and Compositional Data With Applications to Microbiome Studies

    Next generation sequencing (NGS) technologies make it possible to study microbiomes on a very large scale without in vitro cultivation. One approach to sequencing-based microbiome studies is to sequence specific genes (often the 16S rRNA gene) to produce a profile of the diversity of bacterial taxa. Alternatively, the NGS-based sequencing strategy, also called shotgun metagenomics, provides further insights at the molecular level, such as species/strain quantification, gene function analysis and association studies. Such studies generate large-scale high-dimensional count and compositional data, which are the focus of this dissertation. In microbiome studies, the taxa composition is often estimated based on the sparse counts of sequencing reads in order to account for the large variability in the total number of reads. The first part of this thesis deals with the problem of estimating the bacterial composition based on sparse count data, where a penalized likelihood of a multinomial model is proposed to estimate the composition by regularizing the nuclear norm of the compositional matrix. Under the assumption that the observed composition is approximately low rank, a nearly optimal theoretical upper bound of the estimation error under the Kullback-Leibler divergence and the Frobenius norm is obtained. Simulation studies demonstrate that the penalized likelihood-based estimator outperforms the commonly used naive estimator in terms of the estimation error of the composition matrix and various bacterial diversity measures. An analysis of a microbiome dataset is used to illustrate the methods. Understanding the dependence structure among microbial taxa within a community, including co-occurrence and co-exclusion relationships between microbial taxa, is another important problem in microbiome research. However, the compositional nature of the data complicates the investigation of the dependency structure since there are no known multivariate distributions that are flexible enough to model such a dependency. The second part of the thesis develops a composition-adjusted thresholding (COAT) method to estimate the sparse covariance matrix of the latent log-basis components. The method is based on a decomposition of the variation matrix into a rank-2 component and a sparse component. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable to large covariance matrix estimation based on compositional data. The identifiability problem of the covariance parameters is rigorously characterized. In addition, the rate of convergence under the spectral norm is derived and the procedure is shown to have theoretical guarantees on support recovery under certain assumptions. In the application to gut microbiome data, the COAT method leads to more stable and biologically more interpretable results when comparing the dependence structures of lean and obese microbiomes. The third part of the thesis considers the two-sample testing problem for high-dimensional compositional data and formulates a testable hypothesis of compositional equivalence for the means of two latent log-basis vectors. A test for such compositional equivalence through the centered log-ratio transformation of the compositions is proposed and is shown to have an asymptotic type I extreme value distribution under the null. The power of the test against sparse alternatives is derived. Simulations demonstrate that the proposed tests can be significantly more powerful than existing tests that are applied to the raw and log-transformed compositional data. The usefulness of the proposed tests is illustrated by applications to test for differences in gut microbiome composition between lean and obese individuals and changes of the gut microbiome between different time points during treatment in Crohn's disease patients.
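    The idea of "thresholding the sample centered log-ratio covariance matrix" described in the second part of the abstract above can be sketched roughly as follows. This is a simplified illustration and not the COAT procedure itself: it applies a single hard threshold `tau` to every off-diagonal entry, whereas the abstract does not specify the thresholding rule or how its level is chosen. The helper names, the value of `tau`, and the four example compositions are all hypothetical.

```python
import math

def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(x) for x in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def sample_covariance(rows):
    """Unbiased sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
             for k in range(p)] for j in range(p)]

def hard_threshold(cov, tau):
    """Zero out off-diagonal entries whose magnitude is at most tau."""
    p = len(cov)
    return [[cov[j][k] if j == k or abs(cov[j][k]) > tau else 0.0
             for k in range(p)] for j in range(p)]

# Hypothetical compositions: 4 samples (rows) by 4 taxa (columns).
compositions = [
    [0.50, 0.30, 0.15, 0.05],
    [0.40, 0.35, 0.15, 0.10],
    [0.55, 0.20, 0.20, 0.05],
    [0.45, 0.25, 0.10, 0.20],
]
clr_rows = [clr(c) for c in compositions]
cov = sample_covariance(clr_rows)           # singular: each row sums to zero
sparse_cov = hard_threshold(cov, tau=0.05)  # tau is an arbitrary illustrative level
```

    Thresholding only needs entrywise comparisons on the covariance matrix, with no matrix inversion or iterative optimization, which is consistent with the abstract's remark that the approach scales to large covariance matrices.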