
    MotifCluster: an interactive online tool for clustering and visualizing sequences using shared motifs

    MotifCluster finds related motifs in a set of sequences and clusters the sequences into families using the motifs they contain.

    Using the nucleotide substitution rate matrix to detect horizontal gene transfer

    BACKGROUND: Horizontal gene transfer (HGT) has allowed bacteria to evolve many new capabilities. Because transferred genes perform many medically important functions, such as conferring antibiotic resistance, improved detection of horizontally transferred genes from sequence data would be an important advance. Existing sequence-based methods for detecting HGT focus on changes in nucleotide composition or on differences between gene and genome phylogenies; these methods have high error rates. RESULTS: First, we introduce a new class of methods for detecting HGT based on the changes in nucleotide substitution rates that occur when a gene is transferred to a new organism. Our new methods discriminate simulated HGT events with an error rate up to 10 times lower than does GC content. Use of models that are not time-reversible is crucial for detecting HGT. Second, we show that using combinations of multiple predictors of HGT offers substantial improvements over using any single predictor, yielding as much as a factor of 18 improvement in performance (a maximum reduction in error rate from 38% to about 3%). Multiple predictors were combined by using the random forests machine learning algorithm to identify optimal classifiers that separate HGT from non-HGT trees. CONCLUSION: The new class of HGT-detection methods introduced here combines advantages of phylogenetic and compositional HGT-detection techniques. These new techniques offer order-of-magnitude improvements over compositional methods because they are better able to discriminate HGT from non-HGT trees under a wide range of simulated conditions. We also found that combining multiple measures of HGT is essential for detecting a wide range of HGT events. These novel indicators of horizontal transfer will be widely useful in detecting HGT events linked to the evolution of important bacterial traits, such as antibiotic resistance and pathogenicity.

    Short pyrosequencing reads suffice for accurate microbial community analysis

    Pyrosequencing technology allows us to characterize microbial communities using 16S ribosomal RNA (rRNA) sequences orders of magnitude faster and more cheaply than has previously been possible. However, results from different studies using pyrosequencing and traditional sequencing are often difficult to compare, because amplicons covering different regions of the rRNA might yield different conclusions. We used sequences from over 200 globally dispersed environments to test whether studies that used similar primers clustered together mistakenly, without regard to environment. We then tested whether primer choice affects sequence-based community analyses using UniFrac, our recently developed method for comparing microbial communities. We performed three tests of primer effects. We tested whether different simulated amplicons generated the same UniFrac clustering results as near-full-length sequences for three recent large-scale studies of microbial communities in the mouse and human gut, and the Guerrero Negro microbial mat. We then repeated this analysis for short sequences (100-, 150-, 200- and 250-base reads) resembling those produced by pyrosequencing. The results show that sequencing effort is best focused on gathering more short sequences rather than fewer longer ones, provided that the primers are chosen wisely, and that community comparison methods such as UniFrac are surprisingly robust to variation in the region sequenced.
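
    The abstract above relies on UniFrac to compare communities. As a rough illustration only, the unweighted form of that distance can be sketched as the fraction of branch length in a shared phylogenetic tree that leads to members of just one of the two communities. The tree encoding, the function name and the toy tree below are assumptions made for this example; this is not the UniFrac software, which also covers weighted variants and significance testing.

```python
def unweighted_unifrac(tree, community_a, community_b):
    """Unweighted UniFrac-style distance between two sets of tips.

    tree maps every non-root node to (parent, branch_length);
    this encoding is a hypothetical choice for the sketch.
    """
    def covered_branches(tips):
        # Collect every branch on the path from each tip up to the root.
        seen = set()
        for node in tips:
            while node in tree and node not in seen:
                seen.add(node)
                node = tree[node][0]
        return seen

    a = covered_branches(community_a)
    b = covered_branches(community_b)
    unique = sum(tree[n][1] for n in a ^ b)  # branches seen by one community only
    total = sum(tree[n][1] for n in a | b)   # branches seen by either community
    return unique / total if total else 0.0


# Toy tree: root -> X -> {t1, t2}, root -> Y -> {t3, t4}
toy_tree = {
    "t1": ("X", 0.5), "t2": ("X", 0.5),
    "t3": ("Y", 0.5), "t4": ("Y", 0.5),
    "X": ("root", 1.0), "Y": ("root", 1.0),
}

identical = unweighted_unifrac(toy_tree, {"t1", "t3"}, {"t1", "t3"})
disjoint = unweighted_unifrac(toy_tree, {"t1", "t2"}, {"t3", "t4"})
partial = unweighted_unifrac(toy_tree, {"t1", "t3"}, {"t2", "t3"})
```

    In this toy case identical communities give 0, communities on separate clades give 1, and the partially overlapping pair falls in between, because only the two terminal branches to t1 and t2 are unshared.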

    Accurate Profiling of Microbial Communities from Massively Parallel Sequencing using Convex Optimization

    We describe the Microbial Community Reconstruction (MCR) Problem, which is fundamental for microbiome analysis. In this problem, the goal is to reconstruct the identity and frequency of species comprising a microbial community, using short sequence reads from Massively Parallel Sequencing (MPS) data obtained for specified genomic regions. We formulate the problem mathematically as a convex optimization problem and provide sufficient conditions for identifiability, namely the ability to reconstruct species identity and frequency correctly when the data size (number of reads) grows to infinity. We discuss different metrics for assessing the quality of the reconstructed solution, including a novel phylogenetically-aware metric based on the Mahalanobis distance, and give upper bounds on the reconstruction error for a finite number of reads under different metrics. We propose a scalable divide-and-conquer algorithm for the problem using convex optimization, which enables us to handle large problems (with ~10^6 species). We show using numerical simulations that for realistic scenarios, where the microbial communities are sparse, our algorithm gives solutions with high accuracy, both in terms of obtaining accurate frequency, and in terms of species phylogenetic resolution. Comment: To appear in SPIRE 1
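
    To make the idea of recovering species frequencies by convex optimization concrete, here is a minimal sketch under stated assumptions. It minimises a least-squares objective over the probability simplex with projected gradient descent. The matrix A (probability that a read from each species falls into each read type), the observed vector y, and both function names are hypothetical toy choices; the paper's actual formulation, its divide-and-conquer algorithm and its Mahalanobis-based metric are not reproduced here.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold from the last index that passes
    return [max(x - theta, 0.0) for x in v]


def reconstruct_frequencies(A, y, steps=2000):
    """Minimise 0.5 * ||A x - y||^2 over the simplex by projected gradient."""
    n_species = len(A[0])
    # 1 / squared Frobenius norm is a safe step size (at most 1 / Lipschitz constant).
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [1.0 / n_species] * n_species  # start from the uniform community
    for _ in range(steps):
        residual = [sum(a * xj for a, xj in zip(row, x)) - yi
                    for row, yi in zip(A, y)]
        gradient = [sum(A[i][j] * residual[i] for i in range(len(A)))
                    for j in range(n_species)]
        x = project_to_simplex([xj - step * g for xj, g in zip(x, gradient)])
    return x


# Toy instance: 4 read types, 3 candidate species, only two actually present.
A = [
    [0.7, 0.1, 0.0],
    [0.2, 0.6, 0.1],
    [0.1, 0.2, 0.3],
    [0.0, 0.1, 0.6],
]
true_x = [0.5, 0.0, 0.5]
y = [sum(a * xj for a, xj in zip(row, true_x)) for row in A]  # noise-free observation
estimate = reconstruct_frequencies(A, y)
```

    With noise-free data and linearly independent columns, the estimate approaches the sparse true frequencies, which mirrors the identifiability condition the abstract describes for an unbounded number of reads.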

    Development of the preterm gut microbiome in twins at risk of necrotising enterocolitis and sepsis

    The preterm gut microbiome is a complex dynamic community influenced by genetic and environmental factors and is implicated in the pathogenesis of necrotising enterocolitis (NEC) and sepsis. We aimed to explore the longitudinal development of the gut microbiome in preterm twins to determine how shared environmental and genetic factors may influence temporal changes and compared this to the expressed breast milk (EBM) microbiome. Stool samples (n = 173) from 27 infants (12 twin pairs and 1 triplet set) and EBM (n = 18) from 4 mothers were collected longitudinally. All samples underwent PCR-DGGE (denaturing gradient gel electrophoresis) analysis and a selected subset underwent 454 pyrosequencing. Stool and EBM shared a core microbiome dominated by Enterobacteriaceae, Enterococcaceae, and Staphylococcaceae. The gut microbiome showed greater similarity between siblings compared to unrelated individuals. Pyrosequencing revealed a reduction in diversity and increasing dominance of Escherichia sp. preceding NEC that was not observed in the healthy twin. Antibiotic treatment had a substantial effect on the gut microbiome, reducing Escherichia sp. and increasing other Enterobacteriaceae. This study demonstrates related preterm twins share similar gut microbiome development, even within the complex environment of neonatal intensive care. This is likely a result of shared genetic and immunomodulatory factors as well as exposure to the same maternal microbiome during birth, skin contact and exposure to EBM. Environmental factors including antibiotic exposure and feeding are additional significant determinants of community structure, regardless of host genetics.

    Microbiome profiling by Illumina sequencing of combinatorial sequence-tagged PCR products

    We developed a low-cost, high-throughput microbiome profiling method that uses combinatorial sequence tags attached to PCR primers that amplify the rRNA V6 region. Amplified PCR products are sequenced using an Illumina paired-end protocol to generate millions of overlapping reads. Combinatorial sequence tagging can be used to examine hundreds of samples with far fewer primers than is required when sequence tags are incorporated at only a single end. The number of reads generated permitted saturating or near-saturating analysis of samples of the vaginal microbiome. The large number of reads allowed an in-depth analysis of errors, and we found that PCR-induced errors composed the vast majority of non-organism derived species variants, an observation that has significant implications for sequence clustering of similar high-throughput data. We show that the short reads are sufficient to assign organisms to the genus or species level in most cases. We suggest that this method will be useful for the deep sequencing of any short nucleotide region that is taxonomically informative; these include the V3 and V5 regions of the bacterial 16S rRNA genes and the eukaryotic V9 region that is gaining popularity for sampling protist diversity. Comment: 28 pages, 13 figures
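
    The primer saving from combinatorial tagging follows from simple counting: with tags on both primers, each sample is identified by a (forward tag, reverse tag) pair, so f forward and r reverse tagged primers can label f × r samples. The sketch below compares tagged-primer counts for the two schemes. The function names and the choice to count only tagged primers are assumptions for illustration; the abstract does not state which tag counts the authors used.

```python
def single_end_tagged_primers(n_samples):
    """One uniquely tagged primer per sample when tags sit at a single end."""
    return n_samples


def combinatorial_tagged_primers(n_samples):
    """Smallest f + r such that f forward tags x r reverse tags cover n_samples."""
    return min(
        f + -(-n_samples // f)  # -(-n // f) is integer ceil(n / f)
        for f in range(1, n_samples + 1)
    )


samples = 256
single = single_end_tagged_primers(samples)          # 256 tagged primers
combinatorial = combinatorial_tagged_primers(samples)  # 16 forward + 16 reverse
```

    For 256 samples the pairwise scheme needs 32 tagged primers instead of 256, and the gap widens as the number of samples grows, since the combinatorial count scales with roughly twice the square root of the sample count.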

    Joint Analysis of Multiple Metagenomic Samples

    The availability of metagenomic sequencing data, generated by sequencing DNA pooled from multiple microbes living jointly, has increased sharply in the last few years with developments in sequencing technology. Characterizing the contents of metagenomic samples is a challenging task, which has been extensively attempted by both supervised and unsupervised techniques, each with its own limitations. Common to practically all the methods is the processing of single samples only; when multiple samples are sequenced, each is analyzed separately and the results are combined. In this paper we propose to perform a combined analysis of a set of samples in order to obtain a better characterization of each of the samples, and provide two applications of this principle. First, we use an unsupervised probabilistic mixture model to infer hidden components shared across metagenomic samples. We incorporate the model in a novel framework for studying association of microbial sequence elements with phenotypes, analogous to the genome-wide association studies performed on human genomes: we demonstrate that stratification may result in false discoveries of such associations, and that the components inferred by the model can be used to correct for this stratification. Second, we propose a novel read clustering (also termed “binning”) algorithm which operates on multiple samples simultaneously, leveraging the assumption that the different samples contain the same microbial species, possibly in different proportions. We show that integrating information across multiple samples yields more precise binning on each of the samples. Moreover, for both applications we demonstrate that given a fixed depth of coverage, the average per-sample performance generally increases with the number of sequenced samples as long as the per-sample coverage is high enough.