9 research outputs found

    Genomic fluidity: an integrative view of gene diversity within microbial populations

    Background: The dual concepts of pan and core genomes have been widely adopted as means to assess the distribution of gene families within microbial species and genera. The core genome is the set of genes shared by a group of organisms; the pan genome is the set of all genes seen in any of these organisms. A variety of methods have provided drastically different estimates of the sizes of pan and core genomes from sequenced representatives of the same groups of bacteria. Results: We use a combination of mathematical, statistical and computational methods to show that current predictions of pan and core genome sizes may have no correspondence to true values. Pan and core genome size estimates are problematic because they depend on the estimation of the occurrence of rare genes and genomes, respectively, which are difficult to estimate precisely because they are rare. Instead, we introduce and evaluate a robust metric - genomic fluidity - to categorize the gene-level similarity among groups of sequenced isolates. Genomic fluidity is a measure of the dissimilarity of genomes evaluated at the gene level. Conclusions: The genomic fluidity of a population can be estimated accurately given a small number of sequenced genomes. Further, the genomic fluidity of groups of organisms can be compared robustly despite variation in algorithms used to identify genes and their homologs. As such, we recommend that genomic fluidity be used in place of pan and core genome size estimates when assessing gene diversity within genomes of a species or a group of closely related organisms.
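The set definitions in this abstract can be illustrated with a minimal sketch. The core and pan genome functions follow the abstract directly. The fluidity function is an assumption: the abstract only describes fluidity as gene-level dissimilarity, so the sketch uses a pairwise form in which, for each pair of genomes, the gene families found in only one of the two are divided by the total gene families in both, and the ratio is averaged over all pairs. The function names and the toy gene-family labels are hypothetical.

```python
from itertools import combinations


def core_genome(genomes):
    """Gene families present in every genome (set intersection)."""
    return set.intersection(*genomes.values())


def pan_genome(genomes):
    """Gene families present in at least one genome (set union)."""
    return set.union(*genomes.values())


def genomic_fluidity(genomes):
    """Mean pairwise gene-level dissimilarity.

    Assumed form, not taken from the abstract: for each pair, count the
    gene families unique to either genome and divide by the summed
    gene-family counts of the two genomes.
    """
    pairs = list(combinations(genomes.values(), 2))
    if not pairs:
        raise ValueError("need at least two genomes")
    total = 0.0
    for a, b in pairs:
        if not a and not b:
            raise ValueError("genomes must contain at least one gene family")
        unique = len(a ^ b)  # families found in exactly one of the two
        total += unique / (len(a) + len(b))
    return total / len(pairs)


# Hypothetical toy data: each genome is a set of gene-family labels.
genomes = {
    "isolate_A": {"g1", "g2", "g3", "g4"},
    "isolate_B": {"g1", "g2", "g3", "g5"},
    "isolate_C": {"g1", "g2", "g6", "g7"},
}

print(sorted(core_genome(genomes)))
print(len(pan_genome(genomes)))
print(genomic_fluidity(genomes))
```

In this sketch each genome has already been reduced to a set of gene-family identifiers; gene calling and homolog clustering, which the abstract notes can vary between algorithms, are outside its scope.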

    A computational genomics pipeline for prokaryotic sequencing projects

    Motivation: New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data

    Algorithm development for next generation sequencing-based metagenome analysis

    We present research on the design, development and application of algorithms for DNA sequence analysis, with a focus on environmental DNA (metagenomes). We present an overview and primer on algorithm development for bioinformatics of metagenomes; work on frameshift detection in DNA sequencing data; work on a computational pipeline for the assembly, feature prediction, annotation and analysis of bacterial genomes; work on unsupervised phylogenetic clustering of metagenomic fragments using Markov Chain Monte Carlo methods; and work on estimation of bacterial genome plasticity and diversity, potential improvements to the measures of core and pan-genomes. PhD. Committee Chair: Weitz, Joshua; Committee Co-Chair: Jordan, I. King; Committee Member: Bader, David; Committee Member: Bergman, Nicholas; Committee Member: Chernoff, Yur

    Unsupervised statistical clustering of environmental shotgun sequences

    © 2009 Kislyuk et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/10/316 DOI: 10.1186/1471-2105-10-316 Background: The development of effective environmental shotgun sequence binning methods remains an ongoing challenge in algorithmic analysis of metagenomic data. While previous methods have focused primarily on supervised learning involving extrinsic data, a first-principles statistical model combined with a self-training fitting method has not yet been developed. Results: We derive an unsupervised, maximum-likelihood formalism for clustering short sequences by their taxonomic origin on the basis of their k-mer distributions. The formalism is implemented using a Markov Chain Monte Carlo approach in a k-mer feature space. We introduce a space transformation that reduces the dimensionality of the feature space and a genomic fragment divergence measure that strongly correlates with the method's performance. Pairwise analysis of over 1000 completely sequenced genomes reveals that the vast majority of genomes have sufficient genomic fragment divergence to be amenable to binning using the present formalism. Using a high-performance implementation, the binner is able to classify fragments as short as 400 nt with accuracy over 90% in simulations of low-complexity communities of 2 to 10 species, given sufficient genomic fragment divergence. The method is available as an open source package called LikelyBin. Conclusion: An unsupervised binning method based on statistical signatures of short environmental sequences is a viable stand-alone binning method for low-complexity samples. For medium- and high-complexity samples, we discuss the possibility of combining the current method with other methods as part of an iterative process to enhance the resolving power of sorting reads into taxonomic and/or functional bins.
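The k-mer distributions that this abstract clusters on can be illustrated with a short sketch. It is not the LikelyBin implementation, and it covers neither the maximum-likelihood model nor the Markov Chain Monte Carlo fitting. It only shows one plausible way to turn a fragment into a normalized k-mer frequency vector. The function name, the default `k=4`, and the choice to skip windows containing ambiguous bases are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the relative frequency of every DNA k-mer in a sequence.

    Windows containing characters other than A, C, G, T (for example N)
    are ignored. This is an illustrative choice, not taken from the paper.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k possible k-mers so every vector has the same keys.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return {m: 0.0 for m in kmers}
    return {m: counts[m] / total for m in kmers}


# Hypothetical toy fragment; real fragments in the abstract are 400 nt or longer.
freqs = kmer_frequencies("ACGTACGT", k=2)
print(freqs["AC"], freqs["TA"])
```

Each fragment then becomes a point in a feature space with 4^k dimensions, which is the kind of space in which the abstract describes clustering fragments by taxonomic origin.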