179 research outputs found
Accurate Profiling of Microbial Communities from Massively Parallel Sequencing using Convex Optimization
We describe the Microbial Community Reconstruction ({\bf MCR}) Problem, which
is fundamental for microbiome analysis. In this problem, the goal is to
reconstruct the identity and frequency of species comprising a microbial
community, using short sequence reads from Massively Parallel Sequencing (MPS)
data obtained for specified genomic regions. We formulate the problem
mathematically as a convex optimization problem and provide sufficient
conditions for identifiability, namely the ability to reconstruct species
identity and frequency correctly when the data size (number of reads) grows to
infinity. We discuss different metrics for assessing the quality of the
reconstructed solution, including a novel phylogenetically-aware metric based
on the Mahalanobis distance, and give upper-bounds on the reconstruction error
for a finite number of reads under different metrics. We propose a scalable
divide-and-conquer algorithm for the problem using convex optimization, which
enables us to handle large problems (with species). We show using
numerical simulations that for realistic scenarios, where the microbial
communities are sparse, our algorithm gives solutions with high accuracy, both
in terms of obtaining accurate frequency, and in terms of species phylogenetic
resolution.Comment: To appear in SPIRE 1
Recommended from our members
High-resolution microbial community reconstruction by integrating short reads from multiple 16S rRNA regions
The emergence of massively parallel sequencing technology has revolutionized microbial profiling, allowing the unprecedented comparison of microbial diversity across time and space in a wide range of host-associated and environmental ecosystems. Although the high-throughput nature of such methods enables the detection of low-frequency bacteria, these advances come at the cost of sequencing read length, limiting the phylogenetic resolution possible by current methods. Here, we present a generic approach for integrating short reads from large genomic regions, thus enabling phylogenetic resolution far exceeding current methods. The approach is based on a mapping to a statistical model that is later solved as a constrained optimization problem. We demonstrate the utility of this method by analyzing human saliva and Drosophila samples, using Illumina single-end sequencing of a 750 bp amplicon of the 16S rRNA gene. Phylogenetic resolution is significantly extended while reducing the number of falsely detected bacteria, as compared with standard single-region Roche 454 Pyrosequencing. Our approach can be seamlessly applied to simultaneous sequencing of multiple genes providing a higher resolution view of the composition and activity of complex microbial communities
SEK: sparsity exploiting k-mer-based estimation of bacterial community composition.
MOTIVATION: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment.
RESULTS: Using sparsity enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm solution for a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method.
AVAILABILITY AND IMPLEMENTATION: A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above Web site
Recommended from our members
SEK: sparsity exploiting k-mer-based estimation of bacterial community composition
MOTIVATION: Estimation of bacterial community composition from
a high-throughput sequenced sample is an important task in
metagenomics applications. Since the sample sequence data
typically harbors reads of variable lengths and different levels of
biological and technical noise, accurate statistical analysis of such
data is challenging. Currently popular estimation methods are
typically very time consuming in a desktop computing environment.
RESULTS: Using sparsity enforcing methods from the general sparse
signal processing field (such as compressed sensing), we derive
a solution to the community composition estimation problem by a
simultaneous assignment of all sample reads to a pre-processed
reference database. A general statistical model based on kernel
density estimation techniques is introduced for the assignment task
and the model solution is obtained using convex optimization tools.
Further, we design a greedy algorithm solution for a fast solution. Our
approach offers a reasonably fast community composition estimation
method which is shown to be more robust to input data variation than
a recently introduced related method.
AVAILABILITY: A platform-independent Matlab implementation of the
method is freely available at http://www.ee.kth.se/ctsoftware; source
code that does not require access to Matlab is currently being tested
and will be made available later through the above website.This is a pre-copy-editing, author-produced PDF of an article accepted for publication in Bioinformatics following peer review. The definitive publisher-authenticated version, Chatterjee, S., Koslicki, D., Dong, S., Innocenti, N., Cheng, L., Lan, Y., ... & Corander, J. (2014). SEK: Sparsity exploiting k-mer-based estimation of bacterial community composition. Bioinformatics, 30(17), 2423-2431. doi:10.1093/bioinformatics/btu320, is available online at: http://bioinformatics.oxfordjournals.org/content/30/17/2423. The published article is copyrighted by the Author(s) and published by Oxford University Press
Statistical Methods for Characterizing Genomic Heterogeneity in Mixed Samples
Recently, sequencing technologies have generated massive and heterogeneous data sets. However, interpretation of these data sets is a major barrier to understand genomic heterogeneity in complex diseases. In this dissertation, we develop a Bayesian statistical method for single nucleotide level analysis and a global optimization method for gene expression level analysis to characterize genomic heterogeneity in mixed samples. The detection of rare single nucleotide variants (SNVs) is important for understanding genetic heterogeneity using next-generation sequencing (NGS) data. Various computational algorithms have been proposed to detect variants at the single nucleotide level in mixed samples. Yet, the noise inherent in the biological processes involved in NGS technology necessitates the development of statistically accurate methods to identify true rare variants. At the single nucleotide level, we propose a Bayesian probabilistic model and a variational expectation maximization (EM) algorithm to estimate non-reference allele frequency (NRAF) and identify SNVs in heterogeneous cell populations. We demonstrate that our variational EM algorithm has comparable sensitivity and specificity compared with a Markov Chain Monte Carlo (MCMC) sampling inference algorithm, and is more computationally efficient on tests of relatively low coverage (27x and 298x) data. Furthermore, we show that our model with a variational EM inference algorithm has higher specificity than many state-of-the-art algorithms. In an analysis of a directed evolution longitudinal yeast data set, we are able to identify a time-series trend in non-reference allele frequency and detect novel variants that have not yet been reported. Our model also detects the emergence of a beneficial variant earlier than was previously shown, and a pair of concomitant variants. Characterization of heterogeneity in gene expression data is a critical challenge for personalized treatment and drug resistance due to intra-tumor heterogeneity. Mixed membership factorization has become popular for analyzing data sets that have within-sample heterogeneity. In recent years, several algorithms have been developed for mixed membership matrix factorization, but they only guarantee estimates from a local optimum. At the gene expression level, we derive a global optimization (GOP) algorithm that provides a guaranteed epsilon-global optimum for a sparse mixed membership matrix factorization problem for molecular subtype classification. We test the algorithm on simulated data and find the algorithm always bounds the global optimum across random initializations and explores multiple modes efficiently. The GOP algorithm is well-suited for parallel computations in the key optimization steps
Expanding the ancient DNA bioinformatics toolbox, and its applications to archeological microbiomes
The 1980s were very prolific years not only for music, but also for molecular biology and genetics, with the first publications on the microbiome and ancient DNA. Several technical revolutions later, the field of ancient metagenomics is now progressing full steam ahead, at a never seen before pace. While generating sequencing data is becoming cheaper every year, the bioinformatics methods and the compute power needed to analyze them are struggling to catch up. In this thesis, I propose new methods to reduce the sequencing to analysis gap, by introducing scalable and parallelized softwares for ancient DNA metagenomics analysis. In manuscript A, I first introduce a method for estimating the mixtures of different sources in a sequencing sample, a problem known as source tracking. I then apply this method to predict the original sources of paleofeces in manuscript B. In manuscript C, I propose a new method to scale the lowest common ancestor calling from sequence alignment files, which brings a solution for the computational intractability of fitting ever growing metagenomic reference database indices in memory. In manuscript D, I present a method to statistically estimate in parallel the ancient DNA deamination damage, and test it in the context of de novo assembly. Finally, in manuscript E, I apply some of the methods developed in this thesis to the analyis of ancient wine fermentation samples, and present the first ancient genomes of ancient fermentation bacteria. Taken together, the tools developed in this thesis will help the researchers working in the field of ancient DNA metagenomics to scale their analysis to the massive amount of sequencing data routinely produced nowadays
Advances in Forensic Genetics
The book has 25 articles about the status and new directions in forensic genetics. Approximately half of the articles are invited reviews, and the remaining articles deal with new forensic genetic methods. The articles cover aspects such as sampling DNA evidence at the scene of a crime; DNA transfer when handling evidence material and how to avoid DNA contamination of items, laboratory, etc.; identification of body fluids and tissues with RNA; forensic microbiome analysis with molecular biology methods as a supplement to the examination of human DNA; forensic DNA phenotyping for predicting visible traits such as eye, hair, and skin colour; new ancestry informative DNA markers for estimating ethnic origin; new genetic genealogy methods for identifying distant relatives that cannot be identified with conventional forensic DNA typing; sensitive DNA methods, including single-cell DNA analysis and other highly specialised and sensitive methods to examine ancient DNA from unidentified victims of war; forensic animal genetics; genetics of visible traits in dogs; statistical tools for interpreting forensic DNA analyses, including the most used IT tools for forensic STR-typing and DNA sequencing; haploid markers (Y-chromosome and mitochondria DNA); inference of ethnic origin; a comprehensive logical framework for the interpretation of forensic genetic DNA data; and an overview of the ethical aspects of modern forensic genetics
- …