13 research outputs found

    Statistical Methods for High Dimensional Count and Compositional Data With Applications to Microbiome Studies

    Next generation sequencing (NGS) technologies make very large-scale studies of microbiomes possible without cultivation in vitro. One approach to sequencing-based microbiome studies is to sequence specific genes (often the 16S rRNA gene) to produce a profile of the diversity of bacterial taxa. Alternatively, the NGS-based sequencing strategy, also called shotgun metagenomics, provides further insights at the molecular level, such as species/strain quantification, gene function analysis and association studies. Such studies generate large-scale high-dimensional count and compositional data, which are the focus of this dissertation. In microbiome studies, the taxa composition is often estimated based on the sparse counts of sequencing reads in order to account for the large variability in the total number of reads. The first part of this thesis deals with the problem of estimating the bacterial composition based on sparse count data, where a penalized likelihood of a multinomial model is proposed to estimate the composition by regularizing the nuclear norm of the compositional matrix. Under the assumption that the observed composition is approximately low rank, a nearly optimal theoretical upper bound of the estimation error under the Kullback-Leibler divergence and the Frobenius norm is obtained. Simulation studies demonstrate that the penalized likelihood-based estimator outperforms the commonly used naive estimator in terms of the estimation error of the composition matrix and various bacterial diversity measures. An analysis of a microbiome dataset is used to illustrate the methods. Understanding the dependence structure among microbial taxa within a community, including co-occurrence and co-exclusion relationships between microbial taxa, is another important problem in microbiome research.
However, the compositional nature of the data complicates the investigation of the dependency structure since there are no known multivariate distributions that are flexible enough to model such a dependency. The second part of the thesis develops a composition-adjusted thresholding (COAT) method to estimate the sparse covariance matrix of the latent log-basis components. The method is based on a decomposition of the variation matrix into a rank-2 component and a sparse component. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable to large covariance matrix estimation based on compositional data. The identifiability of the covariance parameters is rigorously characterized. In addition, a rate of convergence under the spectral norm is derived and the procedure is shown to have theoretical guarantees on support recovery under certain assumptions. In the application to gut microbiome data, the COAT method leads to more stable and biologically more interpretable results when comparing the dependence structures of lean and obese microbiomes. The third part of the thesis considers the two-sample testing problem for high-dimensional compositional data and formulates a testable hypothesis of compositional equivalence for the means of two latent log-basis vectors. A test for such a compositional equivalence through the centered log-ratio transformation of the compositions is proposed and is shown to have an asymptotic type I extreme value distribution under the null. The power of the test against sparse alternatives is derived. Simulations demonstrate that the proposed tests can be significantly more powerful than existing tests that are applied to the raw and log-transformed compositional data.
The usefulness of the proposed tests is illustrated by applications to test for differences in gut microbiome composition between lean and obese individuals and changes in the gut microbiome between different time points during treatment in Crohn's disease patients.
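
The two-sample test described in this abstract can be illustrated with a rough sketch. The code below assumes strictly positive compositions and computes a max-type statistic on centered log-ratio (clr) transformed data: the largest squared mean difference across components, scaled by a pooled variance. The function names, the pooled-variance normalisation and the simulated Dirichlet inputs are assumptions made for illustration; the abstract does not give the exact statistic or its critical values, so neither is reproduced here.

```python
import numpy as np

def clr(compositions):
    """Centered log-ratio transform of an (n, p) array of strictly positive rows."""
    logs = np.log(compositions)
    return logs - logs.mean(axis=1, keepdims=True)

def max_type_statistic(x, y):
    """Largest variance-scaled squared difference between clr means (illustrative)."""
    zx, zy = clr(x), clr(y)
    n1, n2 = zx.shape[0], zy.shape[0]
    mean_diff = zx.mean(axis=0) - zy.mean(axis=0)
    # Pooled per-component variance of the clr-transformed samples (assumed form).
    ss_x = ((zx - zx.mean(axis=0)) ** 2).sum(axis=0)
    ss_y = ((zy - zy.mean(axis=0)) ** 2).sum(axis=0)
    pooled_var = (ss_x + ss_y) / (n1 + n2)
    return (n1 * n2 / (n1 + n2)) * np.max(mean_diff ** 2 / pooled_var)

# Hypothetical data: two groups of simulated compositions over 6 taxa.
rng = np.random.default_rng(1)
x = rng.dirichlet(np.ones(6), size=30)
y = rng.dirichlet(np.ones(6), size=40)
stat = max_type_statistic(x, y)
```

Because the clr transform removes any per-sample scaling, the statistic is the same whether it is fed raw positive abundances or normalised compositions, which is the property that makes a clr-based test suitable for compositional data.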

    Large Covariance Estimation for Compositional Data Via Composition-Adjusted Thresholding

    High-dimensional compositional data arise naturally in many applications such as metagenomic data analysis. The observed data lie in a high-dimensional simplex, and conventional statistical methods often fail to produce sensible results due to the unit-sum constraint. In this article, we address the problem of covariance estimation for high-dimensional compositional data and introduce a composition-adjusted thresholding (COAT) method under the assumption that the basis covariance matrix is sparse. Our method is based on a decomposition relating the compositional covariance to the basis covariance, which is approximately identifiable as the dimensionality tends to infinity. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable for large covariance matrices. We rigorously characterize the identifiability of the covariance parameters, derive rates of convergence under the spectral norm, and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some existing optimization-based estimators. We apply the proposed method to the analysis of a microbiome dataset to understand the dependence structure among bacterial taxa in the human gut.
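
The idea of "thresholding the sample centered log-ratio covariance matrix" can be sketched in a few lines. This is a simplified illustration, not the COAT estimator itself: it applies a single soft threshold to every off-diagonal entry, whereas the abstract does not specify how thresholds are chosen. The function names, the threshold value of 0.2 and the simulated log-normal basis are all assumptions.

```python
import numpy as np

def clr(compositions):
    """Centered log-ratio transform of an (n, p) array of strictly positive rows."""
    logs = np.log(compositions)
    return logs - logs.mean(axis=1, keepdims=True)

def thresholded_clr_covariance(compositions, threshold):
    """Soft-threshold the off-diagonal entries of the sample clr covariance."""
    z = clr(compositions)
    cov = np.cov(z, rowvar=False)  # (p, p) sample covariance of clr data
    shrunk = np.sign(cov) * np.maximum(np.abs(cov) - threshold, 0.0)
    np.fill_diagonal(shrunk, np.diag(cov))  # leave the variances unthresholded
    return shrunk

# Hypothetical data: a positive basis over 8 components, normalised to unit sum.
rng = np.random.default_rng(0)
basis = np.exp(rng.normal(size=(50, 8)))
comp = basis / basis.sum(axis=1, keepdims=True)
est = thresholded_clr_covariance(comp, threshold=0.2)
```

The cost is one covariance computation plus an entrywise operation, which is why a thresholding approach scales to large matrices more easily than an optimization-based estimator.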

    The complete mitochondrial genome of Ostorhinchus fleurieu (Kurtiformes: Apogonidae) and phylogenetic studies of Apogoninae

    The complete mitochondrial genome of Ostorhinchus fleurieu was determined for the first time. It was 16,521 bp in length and contained 13 protein-coding genes (PCGs), two rRNA genes, 22 tRNA genes, a putative control region and one origin of replication on the light strand. The overall base composition was C (29.2%), A (26.7%), T (26.7%) and G (17.4%). The 13 PCGs encoded 3800 amino acids in total; twelve of them used the initiation codon ATG, whereas COI started with GTG. Most PCGs ended with a complete stop codon, whereas three (COII, ND4 and Cytb) used an incomplete stop codon, represented as T. A phylogenetic tree was constructed with the Neighbour Joining method to show the relationships within Apogoninae, which could be a useful basis for management of this species.

    Characterization of the complete mitochondrial genome of Hyphessobrycon herbertaxelrodi (Characiformes, Characidae) and phylogenetic studies of Characiformes

    In this study, the complete mitochondrial genome of Hyphessobrycon herbertaxelrodi is presented and its mitochondrial characteristics are discussed. The full length of the mitochondrial genome was 17,417 bp, including 13 protein-coding genes (PCGs), 2 ribosomal RNAs (12S and 16S), 22 transfer RNA genes, 1 non-coding control region (D-loop), and 1 origin of replication on the light strand. The total nucleotide composition of the mitochondrial DNA was 29.76% A, 29.88% T, 25.35% C and 15.01% G, with an AT content of 59.64%. The phylogenetic tree suggested that H. herbertaxelrodi shared the most recent common ancestor with Astyanax giton, Grundulus bogotensis, Astyanax paranae, and Oligosarcus argenteus.

    Characterization of the complete mitochondrial genome of Chinese Konosirus punctatus (Clupeiformes, Clupeidae) and phylogenetic studies of Clupeiformes

    The Dotted Gizzard Shad (Konosirus punctatus) is one of the most important commercial fish species in China, Japan and Korea. In this study, the complete mitochondrial genome of K. punctatus is presented. The full length of the mitochondrial genome was 16,705 bp, including 13 protein-coding genes (PCGs), two ribosomal RNAs, 22 transfer RNA genes, one non-coding control region (CR) and one origin of replication on the light strand. The total nucleotide composition of the mitochondrial DNA was 25.79% A, 25.09% T, 29.05% C and 20.08% G, with an AT content of 50.88%. The mitochondrial genome provides an important resource for solving taxonomic problems and studying molecular evolution.
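
The nucleotide composition figures in this abstract follow from simple counting, which the sketch below illustrates. The helper names and the ten-base toy sequence are hypothetical; the only values taken from the abstract are the reported A and T percentages, whose sum is compared with the stated AT content.

```python
from collections import Counter

def base_composition(sequence):
    """Percentage of each of A, C, G, T among the unambiguous bases."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}

def at_content(composition):
    """AT content is the sum of the A and T percentages."""
    return composition["A"] + composition["T"]

# Hypothetical toy sequence, not taken from any genome in this listing.
toy = "ATGCATTACG"
toy_composition = base_composition(toy)
toy_at = at_content(toy_composition)

# Percentages reported in the abstract for K. punctatus.
reported = {"A": 25.79, "T": 25.09, "C": 29.05, "G": 20.08}
reported_at = round(at_content(reported), 2)  # compare with the stated 50.88%
```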