Accurate Profiling of Microbial Communities from Massively Parallel Sequencing using Convex Optimization

Author: A. Amir
B.J. Paster
C. Lozupone
C.A. Lozupone
D. Hiller
D. Kessner
D.H. Haft
E.R. Mardis
I. Eskin
J.R. Cole
M. Grant
M. Hamady
N. Segata
P. Meinicke
P.B. Eckburg
S. Pavoine
T.J. Gentry
T.Z. DeSantis
Publication date: 01/01/2013

We describe the Microbial Community Reconstruction (MCR) Problem, which is fundamental for microbiome analysis. In this problem, the goal is to reconstruct the identity and frequency of species comprising a microbial community, using short sequence reads from Massively Parallel Sequencing (MPS) data obtained for specified genomic regions. We formulate the problem mathematically as a convex optimization problem and provide sufficient conditions for identifiability, namely the ability to reconstruct species identity and frequency correctly when the data size (number of reads) grows to infinity. We discuss different metrics for assessing the quality of the reconstructed solution, including a novel phylogenetically-aware metric based on the Mahalanobis distance, and give upper-bounds on the reconstruction error for a finite number of reads under different metrics. We propose a scalable divide-and-conquer algorithm for the problem using convex optimization, which enables us to handle large problems (with $\sim 10^6$ species). We show using numerical simulations that for realistic scenarios, where the microbial communities are sparse, our algorithm gives solutions with high accuracy, both in terms of obtaining accurate frequency, and in terms of species phylogenetic resolution.

Comment: To appear in SPIRE 1
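
As an illustration of the kind of convex formulation the abstract describes, the following is a minimal sketch only, not the paper's divide-and-conquer algorithm. It assumes a least-squares objective, a hypothetical matrix `A` whose entry `A[j][i]` is the probability that species `i` produces read type `j`, and observed read frequencies `y`; species frequencies are constrained to be non-negative and to sum to one. The function names and the toy data are invented for the example.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def reconstruct_frequencies(A, y, steps=2000):
    """Projected gradient descent for min ||A x - y||^2 over the simplex."""
    n_reads, n_species = len(A), len(A[0])
    # 2 * ||A||_F^2 upper-bounds the Lipschitz constant of the gradient,
    # so its inverse is a safe (conservative) step size.
    step = 1.0 / (2.0 * sum(a * a for row in A for a in row))
    x = [1.0 / n_species] * n_species  # start from the uniform community
    for _ in range(steps):
        residual = [
            sum(A[j][i] * x[i] for i in range(n_species)) - y[j]
            for j in range(n_reads)
        ]
        grad = [
            2.0 * sum(A[j][i] * residual[j] for j in range(n_reads))
            for i in range(n_species)
        ]
        x = project_to_simplex([x[i] - step * grad[i] for i in range(n_species)])
    return x


# Hypothetical toy problem: 4 read types, 3 candidate species, 2 present.
A = [
    [0.8, 0.1, 0.0],
    [0.1, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.1, 0.1],
]
true_x = [0.7, 0.3, 0.0]
y = [sum(a * f for a, f in zip(row, true_x)) for row in A]
estimate = reconstruct_frequencies(A, y)
print([round(f, 3) for f in estimate])
```

With noiseless `y` and a full-column-rank `A`, the minimizer is unique, which mirrors the identifiability question the abstract raises; with real, finite read counts the estimate would only approximate the true frequencies.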

arXiv.org e-Print Archive

CiteSeerX

Crossref

High-resolution microbial community reconstruction by integrating short reads from multiple 16S rRNA regions

Author: Amir Amnon
Elgart Michael
Shamir Ohad
Shental Noam
Soen Yoav
Stern Shay
Turnbaugh Peter J.
Zeisel Amit
Zuk Or
Publication venue: 'Oxford University Press (OUP)'
Publication date: 11/03/2014

The emergence of massively parallel sequencing technology has revolutionized microbial profiling, allowing the unprecedented comparison of microbial diversity across time and space in a wide range of host-associated and environmental ecosystems. Although the high-throughput nature of such methods enables the detection of low-frequency bacteria, these advances come at the cost of sequencing read length, limiting the phylogenetic resolution possible by current methods. Here, we present a generic approach for integrating short reads from large genomic regions, thus enabling phylogenetic resolution far exceeding current methods. The approach is based on a mapping to a statistical model that is later solved as a constrained optimization problem. We demonstrate the utility of this method by analyzing human saliva and Drosophila samples, using Illumina single-end sequencing of a 750 bp amplicon of the 16S rRNA gene. Phylogenetic resolution is significantly extended while reducing the number of falsely detected bacteria, as compared with standard single-region Roche 454 Pyrosequencing. Our approach can be seamlessly applied to simultaneous sequencing of multiple genes providing a higher resolution view of the composition and activity of complex microbial communities

Harvard University - DASH

SEK: sparsity exploiting k-mer-based estimation of bacterial community composition.

Author: Amir
D. Koslicki
E. Aurell
J. Corander
L. Cheng
L. K. Rasmussen
M. Skoglund
M. Vehkapera
N. Innocenti
S. Chatterjee
S. Dong
Y. Lan
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2014

MOTIVATION: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment. RESULTS: Using sparsity enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm solution for a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method. AVAILABILITY AND IMPLEMENTATION: A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above Web site
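
The title refers to k-mer-based estimation. As a rough sketch of what a k-mer representation looks like (an assumption for illustration, not the SEK implementation, its kernel density model, or its solver), the snippet below turns a DNA sequence into a normalized k-mer frequency vector. The function name `kmer_frequencies`, the default `k=2`, and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    Entries follow the lexicographic order of itertools.product over the
    alphabet (AA, AC, AG, AT, CA, ... for k=2). Windows containing any
    character outside the alphabet are ignored.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Hypothetical read: contains the 2-mers AC, CG and GT once each.
vector = kmer_frequencies("ACGT", k=2)
print(len(vector))
```

Vectors of this kind, computed for each reference sequence and for the pooled sample reads, could then be fed to a sparse non-negative solver; that step is not shown here.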

arXiv.org e-Print Archive

Publikationer från KTH

Crossref

Online Research @ Cardiff

ScholarsArchive@OSU

Digitala Vetenskapliga Arkivet - Academic Archive On-line

VTT Research System

White Rose Research Online

SEK: sparsity exploiting k-mer-based estimation of bacterial community composition

Author: Aurell Erik
Chatterjee Saikat
Cheng Lu
Corander Jukka
Dong Siyuan
Innocenti Nicholas
Koslicki David
Lan Yueheng
Rasmussen Lars, K.
Skoglund Mikael
Vehkaperä Mikko
Publication venue: 'Oxford University Press (OUP)'

MOTIVATION: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. Since the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically very time consuming in a desktop computing environment. RESULTS: Using sparsity enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm solution for a fast solution. Our approach offers a reasonably fast community composition estimation method which is shown to be more robust to input data variation than a recently introduced related method. AVAILABILITY: A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above website.

This is a pre-copy-editing, author-produced PDF of an article accepted for publication in Bioinformatics following peer review. The definitive publisher-authenticated version, Chatterjee, S., Koslicki, D., Dong, S., Innocenti, N., Cheng, L., Lan, Y., ... & Corander, J. (2014). SEK: Sparsity exploiting k-mer-based estimation of bacterial community composition. Bioinformatics, 30(17), 2423-2431. doi:10.1093/bioinformatics/btu320, is available online at: http://bioinformatics.oxfordjournals.org/content/30/17/2423. The published article is copyrighted by the Author(s) and published by Oxford University Press.

ScholarsArchive@OSU

Statistical Methods for Characterizing Genomic Heterogeneity in Mixed Samples

Author: Zhang Fan
Publication venue: Digital WPI
Publication date: 12/12/2016

Recently, sequencing technologies have generated massive and heterogeneous data sets. However, interpretation of these data sets is a major barrier to understanding genomic heterogeneity in complex diseases. In this dissertation, we develop a Bayesian statistical method for single nucleotide level analysis and a global optimization method for gene expression level analysis to characterize genomic heterogeneity in mixed samples. The detection of rare single nucleotide variants (SNVs) is important for understanding genetic heterogeneity using next-generation sequencing (NGS) data. Various computational algorithms have been proposed to detect variants at the single nucleotide level in mixed samples. Yet, the noise inherent in the biological processes involved in NGS technology necessitates the development of statistically accurate methods to identify true rare variants. At the single nucleotide level, we propose a Bayesian probabilistic model and a variational expectation maximization (EM) algorithm to estimate non-reference allele frequency (NRAF) and identify SNVs in heterogeneous cell populations. We demonstrate that our variational EM algorithm has sensitivity and specificity comparable to a Markov Chain Monte Carlo (MCMC) sampling inference algorithm, and is more computationally efficient on tests of relatively low coverage (27x and 298x) data. Furthermore, we show that our model with a variational EM inference algorithm has higher specificity than many state-of-the-art algorithms. In an analysis of a directed evolution longitudinal yeast data set, we are able to identify a time-series trend in non-reference allele frequency and detect novel variants that have not yet been reported. Our model also detects the emergence of a beneficial variant earlier than was previously shown, and a pair of concomitant variants.
Characterization of heterogeneity in gene expression data is a critical challenge for personalized treatment and drug resistance due to intra-tumor heterogeneity. Mixed membership factorization has become popular for analyzing data sets that have within-sample heterogeneity. In recent years, several algorithms have been developed for mixed membership matrix factorization, but they only guarantee estimates from a local optimum. At the gene expression level, we derive a global optimization (GOP) algorithm that provides a guaranteed epsilon-global optimum for a sparse mixed membership matrix factorization problem for molecular subtype classification. We test the algorithm on simulated data and find the algorithm always bounds the global optimum across random initializations and explores multiple modes efficiently. The GOP algorithm is well-suited for parallel computations in the key optimization steps

DigitalCommons@WPI

Expanding the ancient DNA bioinformatics toolbox, and its applications to archeological microbiomes

Author: Borry Maxime
Publication date: 01/01/2023

The 1980s were very prolific years not only for music, but also for molecular biology and genetics, with the first publications on the microbiome and ancient DNA. Several technical revolutions later, the field of ancient metagenomics is now progressing full steam ahead, at a never-before-seen pace. While generating sequencing data is becoming cheaper every year, the bioinformatics methods and the compute power needed to analyze them are struggling to catch up. In this thesis, I propose new methods to reduce the sequencing-to-analysis gap, by introducing scalable and parallelized software for ancient DNA metagenomics analysis. In manuscript A, I first introduce a method for estimating the mixtures of different sources in a sequencing sample, a problem known as source tracking. I then apply this method to predict the original sources of paleofeces in manuscript B. In manuscript C, I propose a new method to scale the lowest common ancestor calling from sequence alignment files, which brings a solution for the computational intractability of fitting ever-growing metagenomic reference database indices in memory. In manuscript D, I present a method to statistically estimate in parallel the ancient DNA deamination damage, and test it in the context of de novo assembly. Finally, in manuscript E, I apply some of the methods developed in this thesis to the analysis of ancient wine fermentation samples, and present the first ancient genomes of ancient fermentation bacteria. Taken together, the tools developed in this thesis will help the researchers working in the field of ancient DNA metagenomics to scale their analysis to the massive amount of sequencing data routinely produced nowadays.

Digitale Bibliothek Thüringen

Advances in Forensic Genetics

Publication venue: 'MDPI AG'
Publication date: 12/08/2022

The book has 25 articles about the status and new directions in forensic genetics. Approximately half of the articles are invited reviews, and the remaining articles deal with new forensic genetic methods. The articles cover aspects such as sampling DNA evidence at the scene of a crime; DNA transfer when handling evidence material and how to avoid DNA contamination of items, laboratory, etc.; identification of body fluids and tissues with RNA; forensic microbiome analysis with molecular biology methods as a supplement to the examination of human DNA; forensic DNA phenotyping for predicting visible traits such as eye, hair, and skin colour; new ancestry informative DNA markers for estimating ethnic origin; new genetic genealogy methods for identifying distant relatives that cannot be identified with conventional forensic DNA typing; sensitive DNA methods, including single-cell DNA analysis and other highly specialised and sensitive methods to examine ancient DNA from unidentified victims of war; forensic animal genetics; genetics of visible traits in dogs; statistical tools for interpreting forensic DNA analyses, including the most used IT tools for forensic STR-typing and DNA sequencing; haploid markers (Y-chromosome and mitochondria DNA); inference of ethnic origin; a comprehensive logical framework for the interpretation of forensic genetic DNA data; and an overview of the ethical aspects of modern forensic genetics

Directory of Open Access Books (DOAB)