
    Efficient String Graph Construction Algorithm

    In the field of genome assembly, where assemblers are dominated by de Bruijn graph-based approaches, the string graph-based approach is attracting more attention because of its ability to losslessly retain information from sequence data. Despite the advantages a string graph provides in repeat detection and in maintaining read coherence, the high computational cost of constructing one hinders its usability for genome assembly. Although different algorithms have been proposed over the last decade for string graph construction, efficiency remains a challenge due to the demand for processing the large amounts of sequence data generated by NGS technologies. Therefore, in this thesis, we provide a novel, linear-time, alphabet-size-independent algorithm, SOF, which uses the properties of irreducible and transitive edges to efficiently construct a string graph from an overlap graph. Experimental results show that SOF is at least two times faster than the string graph construction algorithm provided in SGA, one of the most popular string graph-based assemblers, while maintaining almost the same memory footprint as SGA. Moreover, the availability of SOF as a subprogram in the SGA assembly pipeline gives users access to the preprocessing and postprocessing steps for genome assembly provided in SGA.
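    The core idea behind string graph construction — removing transitive edges from an overlap graph so that only irreducible edges remain — can be sketched in a few lines. This is an illustrative toy, not the SOF algorithm itself: an edge u→w is transitive (and removable) when some successor v of u also reaches w.

```python
# Minimal sketch of transitive edge reduction in an overlap graph
# (illustrative only; not the actual SOF or SGA algorithm).

def transitive_reduction(graph):
    """graph: dict mapping node -> set of successor nodes."""
    reduced = {u: set(vs) for u, vs in graph.items()}
    for u, successors in graph.items():
        for v in successors:
            for w in graph.get(v, ()):
                # u->w is implied by the path u->v->w, so drop it
                if w != u and w in reduced[u]:
                    reduced[u].discard(w)
    return reduced

# Reads A, B, C where A overlaps B, B overlaps C, and A also
# overlaps C; the edge A->C is transitive and gets removed:
g = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(transitive_reduction(g))  # {'A': {'B'}, 'B': {'C'}, 'C': set()}
```

    The remaining irreducible edges form the string graph; practical implementations avoid this cubic scan by exploiting overlap lengths and index structures.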

    Alternative applications of whole genome de novo assembly in animal genomics /

    Genome sequencing is the process by which the sequence of deoxyribonucleic acid (DNA) residues that comprise the genome, or complete set of genetic material of an organism or individual, is determined. Downstream analysis of genome sequencing data requires that short reads be compiled into contiguous sequences. These methods, called de novo assembly, are based on statistical methods and graph theory. In addition to genome assembly, the research presented in this dissertation demonstrates alternative uses of these methods. Using these novel approaches, de novo assembly algorithms can be utilized to gain insight into commensal and parasitic organisms of livestock, genes containing candidate mutations for genetic defects, and population-level and species-level variation in poorly studied organisms. Dr. Jared E. Decker, Dissertation Advisor. Includes bibliographical references (pages 101-127).

    Molecular Mechanisms of Crop Domestication Revealed by Comparative Analysis of the Transcriptomes Between Cultivated and Wild Soybeans

    Soybean is one of the key crops necessary to meet the food requirements of the increasing global population. However, in order to meet this need, the quality and quantity of soybean yield must be greatly enhanced. Soybean yield advancement depends on the presence of favorable genes in the genome pool that changed significantly during domestication. To make use of those domesticated genes, this study involved seven cultivated (G. max) and four wild-type (G. soja) soybeans. Their genomes were studied from developing pods to decipher the molecular mechanisms underlying crop domestication. Specifically, their transcriptomes were analyzed comparatively against previous related studies, with the intention of contributing further to the literature. For these goals, several bioinformatics applications were utilized, including de novo transcriptome assembly, transcript abundance quantification, and discovery of differentially expressed genes (DEGs) together with their functional annotations and network visualizations. The results revealed 1,247 DEGs, 916 of which were upregulated in the cultivated soybean in comparison to the wild type. The findings mostly corresponded to the literature, especially regarding genes affecting the two domestication-related traits of focus: pod-shattering resistance and seed size. Genes for these traits were shown to be upregulated in cultivated soybeans and downregulated in the wild type. The opposite trend was observed in disease-related genes, which were downregulated or not even present in the cultivated soybean genome. Further, 47 biochemical functions of the identified DEGs at the cellular level were revealed, providing some knowledge about the molecular mechanisms of genes related to the two aforementioned traits.
    While our findings provide valuable insight into the molecular mechanisms of soybean domestication, attributed to the annotation of differentially expressed genes and transcripts, these results must be dissected further and/or reprocessed with a larger number of samples in order to advance the field.
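    The notion of calling a gene "upregulated" or "downregulated" reduces to comparing expression between the two groups. A toy sketch of DEG calling by log2 fold change follows; the gene names, values, and cutoff here are hypothetical, and real analyses (including studies like this one) rely on statistical tools such as DESeq2 or edgeR that model count variance and multiple testing rather than a bare threshold.

```python
import math

# Toy DEG caller: compare group means by log2 fold change.
# Data, gene names, and the cutoff are illustrative assumptions.

def call_degs(cultivated, wild, lfc_cutoff=1.0, pseudocount=1.0):
    """cultivated/wild: dict gene -> list of expression values (e.g. TPM)."""
    degs = {}
    for gene in cultivated:
        mean_c = sum(cultivated[gene]) / len(cultivated[gene])
        mean_w = sum(wild[gene]) / len(wild[gene])
        # Pseudocount avoids division by zero for absent genes
        lfc = math.log2((mean_c + pseudocount) / (mean_w + pseudocount))
        if abs(lfc) >= lfc_cutoff:
            degs[gene] = "up" if lfc > 0 else "down"
    return degs

cultivated = {"pod_gene": [90, 110], "disease_gene": [1, 3]}  # hypothetical
wild = {"pod_gene": [20, 30], "disease_gene": [40, 60]}
print(call_degs(cultivated, wild))  # {'pod_gene': 'up', 'disease_gene': 'down'}
```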

    Data structures and algorithms for analysis of alternative splicing with RNA-Seq data


    Focus: A Graph Approach for Data-Mining and Domain-Specific Assembly of Next Generation Sequencing Data

    Next Generation Sequencing (NGS) has emerged as a key technology leading to revolutionary breakthroughs in numerous biomedical research areas. These technologies produce millions to billions of short DNA reads that each represent a small fraction of the original target DNA sequence. These short reads contain little information individually but are produced at high coverage of the original sequence, such that many reads overlap. Overlap relationships allow the reads to be linearly ordered and merged by computational programs called assemblers into long stretches of contiguous sequence, called contigs, that can be used for research applications. Although the assembly of the reads produced by NGS remains a difficult task, it is the process of extracting useful knowledge from these relatively short sequences that has become one of the most exciting and challenging problems in bioinformatics. The assembly of short reads is an aggregative process in which critical information is lost as reads are merged into contigs. In addition, the assembly process is treated as a black box, with generic assembler tools that do not adapt to input data set characteristics. Finally, as NGS data throughput continues to increase, there is an increasing need for smart parallel assembler implementations. In this dissertation, a new assembly approach called Focus is proposed. Unlike previous assemblers, Focus relies on a novel hybrid graph, constructed from multiple graphs at different levels of granularity, to represent the assembly problem, facilitating information capture and dynamic adjustment to input data set characteristics. This work is composed of four specific aims: 1) the implementation of a robust assembly and analysis tool built on the hybrid graph platform; 2) the development and application of graph mining to extract biologically relevant features in NGS data sets; 3) the integration of domain-specific knowledge to improve the assembly and analysis process; and 4) the construction of smart parallel computing approaches, including the application of energy-aware computing for NGS assembly and knowledge integration to improve algorithm performance. In conclusion, this dissertation presents a complete parallel assembler called Focus that is capable of extracting biologically relevant features directly from its hybrid assembly graph.
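    The idea of "multiple graphs at different levels of granularity" can be illustrated with the standard de Bruijn graph construction run at different k-mer sizes over the same reads. This is a generic sketch of that building block, not the actual hybrid graph construction used in Focus, which combines the levels in a more involved way.

```python
from collections import defaultdict

# Generic de Bruijn graph builder: each node is a (k-1)-mer and an
# edge records which (k-1)-mer follows it in some read. Running it
# at several k values gives views of the data at different
# granularities (smaller k = coarser, more collapsed graph).

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

reads = ["ACGTC", "GTCA"]
print(de_bruijn(reads, 3))  # coarse view: 2-mer nodes
print(de_bruijn(reads, 4))  # finer view: 3-mer nodes, longer context
```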

    Building and Improving Reference Genome Assemblies: This paper reviews the problems and algorithms of assembling a complete genome from millions of short DNA sequencing reads

    A genome sequence assembly provides the foundation for studies of genotypic and phenotypic variation, genome structure, and evolution of the target organism. In the past four decades, there has been a surge of new sequencing technologies, and with these developments, computational scientists have developed new algorithms to improve genome assembly. Here we discuss the relationship between sequencing technology improvements and assembly algorithm development and how these are applied to extend and improve human and nonhuman genome assemblies. © 1963-2012 IEEE

    Targeted Computational Approaches for Mining Functional Elements in Metagenomes

    Thesis (Ph.D.) - Indiana University, Informatics, 2012. Metagenomics enables the genomic study of uncultured microorganisms by directly extracting the genetic material from microbial communities for sequencing. Fueled by the rapid development of Next Generation Sequencing (NGS) technology, metagenomics research has been revolutionizing the field of microbiology, revealing the taxonomic and functional composition of many microbial communities and their impacts on almost every aspect of life on Earth. Analyzing metagenomes (a metagenome is the collection of genomic sequences of an entire microbial community) is challenging: metagenomic sequences are often extremely short and therefore lack the genomic context needed for annotating functional elements, while whole-metagenome assemblies are often poor because a metagenomic dataset contains reads from many different species. Novel computational approaches are still needed to get the most out of metagenomes. In this dissertation, I first developed a binning algorithm (AbundanceBin) for clustering metagenomic sequences into groups, each containing sequences from species of similar abundances. AbundanceBin provides accurate estimates of the abundances of the species in a microbial community and of their genome sizes. Applying AbundanceBin prior to assembly results in better assemblies of metagenomes--an outcome crucial to downstream analyses of metagenomic datasets. In addition, I designed three targeted computational approaches for assembling and annotating protein-coding genes and other functional elements from metagenomic sequences. GeneStitch is an approach for gene assembly that connects gene fragments scattered across different contigs into longer genes with the guidance of reference genes.
    I also developed two specialized assembly methods: the targeted-assembly method for assembling CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats), and the constrained-assembly method for retrieving chromosomal integrons. Applications of these methods to the Human Microbiome Project (HMP) datasets show that human microbiomes are extremely dynamic, reflecting the interactions between community members (including bacteria and viruses).
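    The intuition behind abundance-based binning is that reads from an abundant species carry k-mers seen many times across the dataset, while reads from a rare species carry k-mers seen few times. The sketch below scores each read by the mean count of its k-mers and splits on a fixed threshold; AbundanceBin itself fits a mixture of Poisson distributions with an EM algorithm rather than using a hard cutoff, and the k and threshold values here are illustrative assumptions.

```python
from collections import Counter

# Simplified abundance-based binning: score each read by the mean
# dataset-wide count of its k-mers, then split on a threshold.
# (AbundanceBin proper uses an EM-fitted Poisson mixture instead.)

def bin_by_abundance(reads, k=4, threshold=2.0):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    high, low = [], []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        score = sum(counts[m] for m in kmers) / len(kmers)
        (high if score >= threshold else low).append(read)
    return high, low

# Two identical reads mimic a high-abundance species; the third read
# shares no k-mers with them and lands in the low-abundance bin:
high, low = bin_by_abundance(["ACGTACGT", "ACGTACGT", "TTTTGGGG"])
print(high, low)
```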

    The Progression in Developing Genomic Resources for Crop Improvement

    Sequencing technologies have rapidly evolved over the past two decades, and new technologies are continually being developed and commercialized. Emerging sequencing technologies aim to generate more data with fewer inputs and at lower cost. This has also translated into an increase in the number and type of corresponding applications in genomics, as well as enhanced computational capacities (both hardware and software). Alongside the evolving DNA sequencing landscape, bioinformatics research teams have also evolved to accommodate the increasingly demanding techniques used to combine and interpret data, leading many researchers to move from the lab to the computer. The rich history of DNA sequencing has paved the way for new insights and the development of new analysis methods. Understanding and learning from past technologies can help with the progress of future applications. This review focuses on the evolution of sequencing technologies, their significant enabling role in generating plant genome assemblies and downstream applications, and the parallel development of bioinformatics tools and skills, filling the gap in data analysis techniques.

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, with the potential to make otherwise infeasible computations practical.
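    At its simplest, read alignment is a seed-and-verify lookup: index every k-mer position in the reference, look up a read's leading k-mer, and check the full read at each candidate position. The sketch below shows only that indexing idea with hypothetical sequences; production aligners use compressed indexes (e.g. the BWT/FM-index) and tolerate mismatches, and the GPU/MapReduce work described above parallelizes far more sophisticated variants of this computation.

```python
from collections import defaultdict

# Seed-and-verify exact read alignment against a reference string.
# Reference and read here are made-up toy sequences.

def build_index(reference, k):
    """Map each k-mer to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align(read, reference, index, k):
    """Return positions where the read matches the reference exactly."""
    hits = []
    # Seed with the read's first k-mer, then verify the full read
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

ref = "ACGTACGTTACG"
idx = build_index(ref, 4)
print(align("ACGTT", ref, idx, 4))  # [4]
```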