
    High-throughput sequence alignment using Graphics Processing Units

    BACKGROUND: The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. RESULTS: This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. CONCLUSION: MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.
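The idea of matching many queries against one indexed reference can be illustrated with a small serial sketch. This is not MUMmerGPU's implementation: it assumes a naively built suffix array in place of a suffix tree, runs on the CPU, and only reports exact occurrences of each whole query. The function names `build_suffix_array` and `find_exact_matches` are hypothetical.

```python
def build_suffix_array(reference: str) -> list[int]:
    """Return start positions of all suffixes in lexicographic order.

    Naive construction (sorting full suffixes); fine for short examples,
    far too slow for genome-sized references.
    """
    return sorted(range(len(reference)), key=lambda i: reference[i:])


def find_exact_matches(reference: str, suffix_array: list[int], query: str) -> list[int]:
    """Return sorted start positions where `query` occurs exactly in `reference`."""
    m = len(query)

    def prefix(rank: int) -> str:
        # First m characters of the suffix at this rank in the sorted order.
        start = suffix_array[rank]
        return reference[start:start + m]

    # Lower bound: first rank whose prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose prefix is > query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


def align_queries(reference: str, queries: list[str]) -> dict[str, list[int]]:
    """Index the reference once, then look up every query against that index."""
    suffix_array = build_suffix_array(reference)
    return {q: find_exact_matches(reference, suffix_array, q) for q in queries}
```

Each query lookup is independent of the others once the index is built, which is the property that makes this kind of workload a candidate for parallel execution.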

    Genome re-annotation: a wiki solution?

    The annotation of most genomes becomes outdated over time, owing in part to our ever-improving knowledge of genomes and in part to improvements in bioinformatics software. Unfortunately, annotation is rarely if ever updated and resources to support routine reannotation are scarce. Wiki software, which would allow many scientists to edit each genome's annotation, offers one possible solution.

    CGAT: a comparative genome analysis tool for visualizing alignments in the analysis of complex evolutionary changes between closely related genomes

    BACKGROUND: The recent accumulation of closely related genomic sequences provides a valuable resource for the elucidation of the evolutionary histories of various organisms. However, although numerous alignment calculation and visualization tools have been developed to date, the analysis of complex genomic changes, such as large insertions, deletions, inversions, translocations and duplications, still presents certain difficulties. RESULTS: We have developed a comparative genome analysis tool, named CGAT, which allows detailed comparisons of closely related bacteria-sized genomes mainly through visualizing middle-to-large-scale changes to infer underlying mechanisms. CGAT displays precomputed pairwise genome alignments on both dotplot and alignment viewers with scrolling and zooming functions, and allows users to move along the pre-identified orthologous alignments. Users can place several types of information on this alignment, such as the presence of tandem repeats or interspersed repetitive sequences and changes in G+C contents or codon usage bias, thereby facilitating the interpretation of the observed genomic changes. In addition to displaying precomputed alignments, the viewer can dynamically calculate the alignments between specified regions; this feature is especially useful for examining the alignment boundaries, as these boundaries are often obscure and can vary between programs. Besides the alignment browser functionalities, CGAT also contains an alignment data construction module, which contains various procedures that are commonly used for pre- and post-processing for large-scale alignment calculation, such as the split-and-merge protocol for calculating long alignments, chaining adjacent alignments, and ortholog identification. Indeed, CGAT provides a general framework for the calculation of genome-scale alignments using various existing programs as alignment engines, which allows users to compare the outputs of different alignment programs. Earlier versions of this program have been used successfully in our research to infer the evolutionary history of apparently complex genome changes between closely related eubacteria and archaea. CONCLUSION: CGAT is a practical tool for analyzing complex genomic changes between closely related genomes using existing alignment programs and other sequence analysis tools combined with extensive manual inspection.

    Context-driven discovery of gene cassettes in mobile integrons using a computational grammar

    BACKGROUND: Gene discovery algorithms typically examine sequence data for low level patterns. A novel method to computationally discover higher order DNA structures is presented, using a context sensitive grammar. The algorithm was applied to the discovery of gene cassettes associated with integrons. The discovery and annotation of antibiotic resistance genes in such cassettes is essential for effective monitoring of antibiotic resistance patterns and formulation of public health antibiotic prescription policies. RESULTS: We discovered two new putative gene cassettes using the method, from 276 integron features and 978 GenBank sequences. The system achieved κ = 0.972 annotation agreement with an expert gold standard of 300 sequences. In rediscovery experiments, we deleted 789,196 cassette instances over 2030 experiments and correctly relabelled 85.6% (α ≥ 95%, E ≤ 1%, mean sensitivity = 0.86, specificity = 1, F-score = 0.93), with no false positives. Error analysis demonstrated that for 72,338 missed deletions, two adjacent deleted cassettes were labeled as a single cassette, increasing performance to 94.8% (mean sensitivity = 0.92, specificity = 1, F-score = 0.96). CONCLUSION: Using grammars we were able to represent heuristic background knowledge about large and complex structures in DNA. Importantly, we were also able to use the context embedded in the model to discover new putative antibiotic resistance gene cassettes. The method is complementary to existing automatic annotation systems which operate at the sequence level.
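The relationship between the reported sensitivity and F-score can be checked with the standard F1 definition. The sketch below assumes the F-score is the harmonic mean of precision and recall, and that "no false positives" means precision is 1; the abstract does not state which formula the authors used, and `f_score` is a hypothetical helper name.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1 definition)."""
    if precision + recall == 0:
        raise ValueError("precision and recall cannot both be zero")
    return 2 * precision * recall / (precision + recall)


# Precision is taken as 1.0 because the abstract reports no false positives.
# The recall values are the rounded mean sensitivities quoted in the abstract.
before_error_analysis = f_score(1.0, 0.86)
after_error_analysis = f_score(1.0, 0.92)
```

With these rounded inputs the formula gives roughly 0.925 and 0.958, in the neighborhood of the reported 0.93 and 0.96. Exact agreement is not expected, since the abstract quotes mean sensitivity and the inputs here are already rounded.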

    The genome and transcriptome of Trichormus sp NMC-1: insights into adaptation to extreme environments on the Qinghai-Tibet Plateau

    The Qinghai-Tibet Plateau (QTP) has the highest biodiversity for an extreme environment worldwide, and provides an ideal natural laboratory to study adaptive evolution. In this study, we generated a draft genome sequence of the cyanobacterium Trichormus sp. NMC-1 from the QTP and performed whole transcriptome sequencing under low temperature to investigate the genetic mechanism by which T. sp. NMC-1 adapted to this specific environment. Its genome sequence was 5.9 Mb with a G+C content of 39.2% and encompassed a total of 5362 CDS. A phylogenomic tree indicated that this strain belongs to the Trichormus and Anabaena cluster. Genome comparison between T. sp. NMC-1 and six relatives showed that functionally unknown genes occupied a much higher proportion (28.12%) of the T. sp. NMC-1 genome. In addition, functions of specific, significantly positively selected, expanded orthogroups, and differentially expressed genes involved in signal transduction, cell wall/membrane biogenesis, secondary metabolite biosynthesis, and energy production and conversion were analyzed to elucidate specific adaptation traits. Further analyses showed that the CheY-like genes, extracellular polysaccharide and mycosporine-like amino acids might play major roles in adaptation to harsh environments. Our findings indicate that sophisticated genetic mechanisms are involved in cyanobacterial adaptation to the extreme environment of the QTP.
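A G+C content figure such as the 39.2% quoted above is the share of G and C bases among the unambiguous bases of a sequence. The following minimal sketch shows that calculation; the function name `gc_content` is hypothetical, and ignoring ambiguous bases such as N is an assumption, not something the abstract specifies.

```python
def gc_content(sequence: str) -> float:
    """Return G+C content as a percentage of the A/C/G/T bases in `sequence`.

    Characters other than A, C, G and T (for example N) are ignored.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total
```

For a multi-contig draft assembly, the same calculation would be applied to the concatenation of all contigs rather than averaged per contig.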

    GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes

    We present 'gene prediction improvement pipeline' (GenePRIMP; http://geneprimp.jgi-psf.org/), a computational process that performs evidence-based evaluation of gene models in prokaryotic genomes and reports anomalies including inconsistent start sites, missed genes and split genes. We found that manual curation of gene models using the anomaly reports generated by GenePRIMP improved their quality, and demonstrate the applicability of GenePRIMP in improving finishing quality and comparing different genome-sequencing and annotation technologies.

    The Early Stage of Bacterial Genome-Reductive Evolution in the Host

    The equine-associated obligate pathogen Burkholderia mallei arose by reductive evolution involving a substantial portion of the genome of Burkholderia pseudomallei, a free-living opportunistic pathogen. With its short history of divergence (∼3.5 myr), B. mallei provides an excellent resource to study the early steps in bacterial genome reductive evolution in the host. By examining 20 genomes of B. mallei and B. pseudomallei, we found that stepwise massive expansion of IS (insertion sequence) elements ISBma1, ISBma2, and IS407A occurred during the evolution of B. mallei. Each element proliferated through the sites where its target selection preference was met. Then, ISBma1 and ISBma2 contributed to the further spread of IS407A by providing secondary insertion sites. This spread increased genomic deletions and rearrangements, which were predominantly mediated by IS407A. There were also nucleotide-level disruptions in a large number of genes. However, no significant signs of erosion were yet noted in these genes. Intriguingly, none of these genomic modifications seriously altered the gene expression patterns inherited from B. pseudomallei. This efficient and elaborate genomic transition was enabled largely through the formation of the highly flexible IS-blended genome and the guidance by selective forces in the host. The detailed IS intervention, unveiled for the first time in this study, may represent the key component of a general mechanism for early bacterial evolution in the host.

    Comparative genomic analysis of Vibrio parahaemolyticus: serotype conversion and virulence

    BACKGROUND: Vibrio parahaemolyticus is a common cause of foodborne disease. Beginning in 1996, a more virulent strain having serotype O3:K6 caused major outbreaks in India and other parts of the world, resulting in the emergence of a pandemic. Other serovariants of this strain emerged during its dissemination and together with the original O3:K6 were termed strains of the pandemic clone. Two genomes, one of this virulent strain and one pre-pandemic strain, have been sequenced. We sequenced four additional genomes of V. parahaemolyticus in this study that were isolated from different geographical regions and time points. Comparative genomic analyses of six strains of V. parahaemolyticus isolated from Asia and Peru were performed in order to advance knowledge concerning the evolution of V. parahaemolyticus; specifically, the genetic changes contributing to serotype conversion and virulence. Two pre-pandemic strains and three pandemic strains, isolated from different geographical regions, were serotype O3:K6 and had either toxin profile (tdh+, trh-) or (tdh-, trh+). The sixth pandemic strain sequenced in this study was serotype O4:K68. RESULTS: Genomic analyses revealed that the trh+ and tdh+ strains had different types of pathogenicity islands and mobile elements as well as major structural differences between the tdh pathogenicity islands of the pre-pandemic and pandemic strains. In addition, the results of single nucleotide polymorphism (SNP) analysis showed that 94% of the SNPs between O3:K6 and O4:K68 pandemic isolates were within a 141 kb region surrounding the O- and K-antigen-encoding gene clusters. The "core" genes of V. parahaemolyticus were also compared to those of V. cholerae and V. vulnificus, in order to delineate differences between these three pathogenic species. Approximately one-half (49-59%) of each species' core genes were conserved in all three species, and 14-24% of the core genes were species-specific and in different functional categories. CONCLUSIONS: Our data support the idea that the pandemic strains are closely related and that recent South American outbreaks of foodborne disease caused by V. parahaemolyticus are closely linked to outbreaks in India. Serotype conversion from O3:K6 to O4:K68 was likely due to a recombination event involving a region much larger than the O-antigen- and K-antigen-encoding gene clusters. Major differences between pathogenicity islands and mobile elements are also likely driving the evolution of V. parahaemolyticus. In addition, our analyses categorized genes that may be useful in differentiating pathogenic Vibrios at the species level.
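The per-species percentages of conserved and species-specific core genes can be thought of as set operations over ortholog identifiers. The sketch below is only an illustration under that assumption: it presumes each species' core genes have already been mapped to shared ortholog-group IDs, which the abstract does not describe, and the function name `core_gene_fractions` is hypothetical.

```python
def core_gene_fractions(core_sets: dict[str, set[str]]) -> dict[str, dict[str, float]]:
    """For each species, return the fraction of its core genes that are
    conserved in all species and the fraction found in that species only.

    `core_sets` maps a species name to its set of ortholog-group IDs.
    """
    shared = set.intersection(*core_sets.values())
    result = {}
    for species, genes in core_sets.items():
        # Union of the core genes of every other species.
        others = set().union(*(g for s, g in core_sets.items() if s != species))
        result[species] = {
            "conserved_in_all": len(genes & shared) / len(genes),
            "species_specific": len(genes - others) / len(genes),
        }
    return result
```

Because the denominators differ per species, the same shared gene set yields a different percentage for each species, which is consistent with the abstract reporting ranges rather than single values.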