210 research outputs found

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but the analyses are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and how they have the potential to make otherwise infeasible computations practical.

    ALFALFA : fast and accurate mapping of long next generation sequencing reads

    Advances in Genomic Data Compression

    The rapid growth in the number of individual whole genome sequences and metagenomic datasets is generating an unprecedented volume of genomic data. This is partly due to the continuous drop in the cost of sequencing as well as growth in the utility of sequencing for research and clinical purposes. We are now reaching a point where the lion's share of the cost is shifting from the actual sequencing to processing and storing the resulting data. With genomic datasets reaching the petabyte scale in hospitals and medium to large research groups, there is an urgent need to store the data more efficiently, not only to reduce current costs, but also to make sequencing affordable for a larger set of use cases, thereby accelerating the adoption of genomic data for a widening range of research projects and clinical applications. In Chapter 1 of this thesis, I lay the groundwork for a new approach to compressing genomic data, one that is based on an extensible software platform, which I called Genozip. This initial proof of concept allows compression of data in a widely used format, namely the Variant Call Format, or VCF (Danecek et al. 2011). In Chapter 2, I expand on the work of Chapter 1, showing how the software architecture is designed to support the addition of genomic file formats, compression methods, and codecs. Benchmarking results show that Genozip generally performs better and faster than the leading tools for compression of common genomic data formats such as VCF, SAM (Li et al. 2009) and FASTQ (Cock et al. 2010). In Chapter 3, I take a detour from compression and demonstrate how Genozip, with its detailed internal data structures for genomic file processing, could potentially be used for other types of data manipulation. As an example, I introduce DVCF, or Dual-coordinate VCF, an extension of the VCF format that allows representation of genetic variants concurrently in two coordinate systems defined by two different reference genomes (Lan 2021). A DVCF file can be used in a pipeline where each step of the pipeline accesses the data in either of the coordinate systems. I also developed novel methods for lifting over data from one coordinate system to another, and show the superiority of my methods compared to the two leading tools in that space, namely GATK LiftoverVCF (McKenna et al. 2010) and CrossMap (Zhao et al. 2014). Overall, the Genozip software package is a high quality and versatile bioinformatic tool that has already been adopted by dozens of research and clinical laboratories worldwide. By reducing the cost of processing and storing whole genome sequencing data, Genozip is likely to further encourage the use of genomics in research and clinical settings. Thesis (Ph.D.) -- University of Adelaide, School of Biological Sciences, 202

    New bounds for ternary covering arrays using a parallel simulated annealing

    A covering array (CA) is a combinatorial structure specified as a matrix of N rows and k columns over an alphabet of v symbols such that, for each set of t columns, every t-tuple of symbols is covered at least once. Given the values of t, k, and v, the optimal covering array construction problem (CAC) consists in constructing a CA (N; t, k, v) with the minimum possible value of N. Several methods have been reported for the CAC problem, including direct, recursive, greedy, and metaheuristic methods. In this paper, three parallel approaches for simulated annealing, namely the independent, semi-independent, and cooperative searches, are applied to the CAC problem. The empirical evidence, supported by statistical analysis, indicates that the cooperative approach offers the best execution times and the same bounds as the independent and semi-independent approaches. Extensive experimentation was carried out, using 182 well-known benchmark instances of ternary covering arrays, to assess its performance with respect to the best-known bounds reported previously. The results show that the cooperative approach attains 134 new bounds and equals the best-known solutions for another 29 instances. © 2012 Himer Avila-George et al. The authors thankfully acknowledge the computer resources and assistance provided by Spanish Supercomputing Network (TIRANT-UV). This research work was partially funded by the following projects: CONACyT 58554; Calculo de Covering Arrays; 51623-Fondo Mixto CONACyT; Gobierno del Estado de Tamaulipas. Avila-George, H.; Torres-Jimenez, J.; Hernández García, V. (2012). New bounds for ternary covering arrays using a parallel simulated annealing. Mathematical Problems in Engineering. 2012:1-19. doi:10.1155/2012/897027
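
The covering-array definition in the abstract above can be made concrete with a small checker. This is a minimal sketch, not the paper's method: the function name `is_covering_array`, the encoding of symbols as the integers 0..v-1, and the example array are all assumptions made for illustration. It only verifies the coverage property and does not search for a minimum N.

```python
from itertools import combinations, product


def is_covering_array(rows, t, v):
    """Check whether `rows` is a CA(N; t, k, v).

    `rows` is a list of N equal-length tuples (k columns) whose entries
    are assumed to be integers in 0..v-1. The array qualifies if, for
    every choice of t columns, every t-tuple of symbols appears in at
    least one row.
    """
    if not rows:
        return False
    k = len(rows[0])
    # All v**t tuples that each column subset must cover.
    required = set(product(range(v), repeat=t))
    for cols in combinations(range(k), t):
        # Project every row onto the chosen columns.
        seen = {tuple(row[c] for c in cols) for row in rows}
        if not required <= seen:
            return False
    return True


# Hypothetical example: a binary array with N=4, k=3, strength t=2.
example = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]

print(is_covering_array(example, t=2, v=2))      # full array
print(is_covering_array(example[:3], t=2, v=2))  # last row removed
```

The check visits all C(k, t) column subsets, so it is practical only for small instances. A search method such as simulated annealing would typically use the number of still-uncovered tuples as its cost function rather than a yes/no answer.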

    Bridging Flows: Microfluidic End‐User Solutions
