37 research outputs found

    High-Performance Meta-Genomic Gene Identification

    Get PDF
    Computational Genomics, or Computational Genetics, refers to the use of computational and statistical analysis for understanding the structure and the function of genetic material in organisms. The primary focus of research in computational genomics in the past three decades has been the understanding of genomes and their functional elements by analyzing biological sequence data. The high demand for low-cost sequencing has driven the development of highthroughput sequencing technologies, next-generation sequencing (NGS), that parallelize the sequencing process, producing thousands or millions of sequences concurrently. Moore’s Law is the observation that the number of transistors on integrated circuits doubles approximately every two years; correspondingly, the cost per transistor halves. The cost of DNA sequencing declines much faster, which implies more new DNA data will be obtained. This large-scale sequence data, produced with high throughput sequencing technologies, needs to be processed in a time-effective and cost-effective manner. In this dissertation, we present a high-performance meta-genome gene identification framework. This framework includes four modules: filter, alignment, error correction, and gene identification. The following chapters describe the proposed design and evaluation of this pipeline. The most computationally expensive kernel in the framework is the alignment procedure. Thus, the filter module is developed to determine unnecessary alignment operations. Without the filter module, the alignment module requires 1.9 hours to complete all-to-all alignment on a test file of size 512,000 sequences with each sequence average length 750 base pairs by using ten Kepler K20 NVIDIA GPU. On the other hand, when combined with the filter kernel, the total time is 11.3 minutes. Note that the ideal speedup is nearly 91.4 times faster when new alignment kernel is run on ten GPUs ( 10*9.14). We conclude that accuracy can be achieved at the expense of more resources while operating frequency can still be maintained

    Bioinformatics

    Get PDF
    This book is divided into different research areas relevant in Bioinformatics such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics, molecular modeling and intelligent data analysis. Each book section introduces the basic concepts and then explains its application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here

    Parallelization of dynamic programming recurrences in computational biology

    Get PDF
    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays: FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms

    Evolutionary genomics : statistical and computational methods

    Get PDF
    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering of the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward

    Evolutionary Genomics

    Get PDF
    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering of the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward

    Study of protein-DNA interaction using new generation data

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Advanced Methods for Real-time Metagenomic Analysis of Nanopore Sequencing Data

    Get PDF
    Whole shotgun metagenomics sequencing allows researchers to retrieve information about all organisms in a complex sample. This method enables microbiologists to detect pathogens in clinical samples, study the microbial diversity in various environments, and detect abundance differences of certain microbes under different living conditions. The emergence of nanopore sequencing has offered many new possibilities for clinical and environmental microbiologists. In particular, the portability of the small nanopore sequencing devices and the ability to selectively sequence only DNA from interesting organisms are expected to make a significant contribution to the field. However, both options require memory-efficient methods that perform real-time data analysis on commodity hardware like usual laptops. In this thesis, I present new methods for real-time analysis of nanopore sequencing data in a metagenomic context. These methods are based on optimized algorithmic approaches querying the sequenced data against large sets of reference sequences. The main goal of those contributions is to improve the sequencing and analysis of underrepresented organisms in complex metagenomic samples and enable this analysis in low-resource settings in the field. First, I introduce ReadBouncer as a new tool for nanopore adaptive sampling, which can reject uninteresting DNA molecules during the sequencing process. ReadBouncer improves read classification compared to other adaptive sampling tools and has fewer memory requirements. These improvements enable a higher enrichment of underrepresented sequences while performing adaptive sampling in the field. I further show that, besides host sequence removal and enrichment of low-abundant microbes, adaptive sampling can enrich underrepresented plasmid sequences in bacterial samples. These plasmids play a crucial role in the dissemination of antibiotic resistance genes. However, their characterization requires expensive and time-consuming lab protocols. I describe how adaptive sampling can be used as a cheap method for the enrichment of plasmids, which can make a significant contribution to the point-of-care sequencing of bacterial pathogens. Finally, I introduce a novel memory- and space-efficient algorithm for real-time taxonomic profiling of nanopore reads that was implemented in Taxor. It improves the taxonomic classification of nanopore reads compared to other taxonomic profiling tools and tremendously reduces the memory footprint. The resulting database index for thousands of microbial species is small enough to fit into the memory of a small laptop, enabling real-time metagenomics analysis of nanopore sequencing data with large reference databases in the field

    RNA Backbone Validation, Correction, and Implications for RNA-Protein Interfaces

    Get PDF
    <p>RNA is the molecular workhorse of nature, capable of doing many cellular tasks, from genetic data storage and regulation, to enzymatic synthesis--even to the point of self-catalyzing its own replication. While RNA can act as a catalyst on its own, as in the hammerhead ribozyme, the added efficiency of proteins is often a necessity; the ribosome--the large ribozyme responsible for peptide chain formation, is aided by proteins which ensure correct assembly and structural stability. These complexes of RNA and proteins feature in many essential cellular processes, including the RISC silencing complex and in the spliceosome. Despite its enormous utility, structural determination of RNA is notoriously difficult--particularly in the backbone, since a nucleotide standardly has 12 torsion angles (including &#967;) and 12 non-hydrogen atoms, compared to 4 torsions (including &#967;1) and 4 non-H atoms in a typical amino acid. The abundance of backbone atoms, their conformational flexibility, and experimental resolution limitations often result in systematic errors that can have a significant impact on the interpretation. False trails due to structural errors can lead to significant loss of time and effort, especially with such high-profile complexes as the ribosome and the RISC complex. </p><p>My research has focused on harnessing the recently discovered ribosome structures and the Richardsons' RNA dataset to find trends in RNA backbone conformations and motifs that were then used to develop structural validation techniques and provide improved diagnosis and correction techniques for RNA backbone. Methods for fixing RNA structure have been developed for both NMR and X-ray crystallography. For NMR structures, a method for assigning RNA backbone structure based on NOE data was developed, leading to improved identification and building of RNA backbone conformation in NMR ensembles. For crystallography, our method of diagnosing the correct ribose pucker from clear observables allows reliable assessment of pucker in validation or refinement. Observed differences in bond-lengths, bond-angles, and dihedrals have been categorized by sugar pucker in the PHENIX refinement package. I have shown that this improves the refinement behavior of both pucker and geometry. </p><p>There have also been improvements in identifying structural motifs. Many previously identified structural motifs have now been defined in terms of backbone suitestrings, a series of 2-character code divisions of RNA backbone that show the best clustering of dihedral angle correlations. Combined with a BLAST-like alignment program called SuiteAlign, these suitestrings were quickly and easily identified in a number of structures, eventually leading to the discovery of multiple instances of T&#968;C-loop structures in the ribosome.</p><p>To facilitate error diagnosis and corrections in RNA-protein complexes, as well as to expand the knowledge base of the scientific community as a whole, a database of RNA-protein interaction motifs has been developed. This database is rooted in the quality-filtering, visualization, and analysis techniques of the Richardson lab, particularly those developed by Laura Murray specifically for RNA structures.</p><p>The consensus backbone conformers, pucker diagnosis, and all-atom contacts have been combined to develop first manual and then automated tools for RNA structure correction. I have applied all these techniques to improve the accuracy of a number of important RNA and RNA/protein complex structures.</p>Dissertatio
    corecore