    GPU-Accelerated BWT Construction for Large Collection of Short Reads

    Advances in DNA sequencing technology have stimulated the development of algorithms and tools for processing very large collections of short strings (reads). Short-read alignment and assembly are among the most well-studied problems. Many state-of-the-art aligners, at their core, use the Burrows-Wheeler transform (BWT) as a main-memory index of a reference genome (a typical example being the NCBI human genome). Recently, the BWT has also found use in string-graph assembly, for indexing the reads (i.e., raw data from DNA sequencers). In a typical data set, the volume of reads is tens of times the size of the sequenced genome and can be up to 100 Gigabases. Note that a reference genome is relatively stable, so computing its index is not a frequent task. For reads, the index has to be computed from scratch for each given input. Efficient BWT construction therefore becomes a much bigger concern than before. In this paper, we present a practical method called CX1 for constructing the BWT of very large string collections. CX1 is the first tool that can take advantage of the parallelism given by a graphics processing unit (GPU, a relatively cheap device providing a thousand or more primitive cores), while simultaneously exploiting the parallelism of a multi-core CPU and, more interestingly, of a cluster of GPU-enabled nodes. Using CX1, the BWT of a short-read collection of up to 100 Gigabases can be constructed in less than 2 hours using a machine equipped with a quad-core CPU and a GPU, or in about 43 minutes using a cluster with 4 such machines (the speedup is almost linear after excluding the first 16 minutes for loading the reads from the hard disk). The previously fastest tool, BRC, is measured to take 12 hours to process 100 Gigabases on one machine; it is not obvious how BRC could be parallelized to take advantage of a cluster of machines, let alone GPUs. Comment: 11 pages
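
    To make the object being constructed concrete, the following is a minimal sketch of the BWT of a read collection, built by explicitly sorting all suffixes. It is not the CX1 method and would not scale to large inputs; the function name `bwt_of_collection` and the convention that each read has its own end marker (printed as `$`, sorting before every nucleotide, with ties between markers broken by read order) are assumptions made for illustration.

```python
def bwt_of_collection(reads):
    """Naive BWT of a string collection via explicit suffix sorting.

    Assumed convention: read r is terminated by its own end marker $_r,
    which is smaller than every ordinary character, and $_i < $_j for i < j.
    All markers are printed as '$' in the output.
    """
    entries = []
    for r, read in enumerate(reads):
        for p in range(len(read) + 1):
            # The suffix read[p:] is implicitly followed by $_r. Python
            # orders a proper prefix before the longer string, which matches
            # the marker being the smallest symbol; equal suffixes from
            # different reads are ordered by the read index r.
            key = (read[p:], r)
            # The BWT symbol is the character preceding the suffix; the
            # suffix starting at position 0 is preceded (cyclically) by
            # the read's end marker.
            prev = read[p - 1] if p > 0 else "$"
            entries.append((key, prev))
    entries.sort(key=lambda e: e[0])
    return "".join(prev for _, prev in entries)


if __name__ == "__main__":
    print(bwt_of_collection(["banana"]))    # single string
    print(bwt_of_collection(["AC", "GA"]))  # two short reads
```

    Materializing every suffix takes space quadratic in the read length, so this sketch only serves to define the output that a practical construction tool has to produce.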

    Highly Scalable Short Read Alignment with the Burrows-Wheeler Transform and Cloud Computing

    Improvements in DNA sequencing have both broadened its utility and dramatically increased the size of sequencing datasets. Sequencing instruments are now used regularly as sources of high-resolution evidence for genotyping, methylation profiling, DNA-protein interaction mapping, and characterizing gene expression in the human genome and in other species. With existing methods, the computational cost of aligning short reads from the Illumina instrument to a mammalian genome can be very large: on the order of many CPU months for one human genotyping project. This thesis presents a novel application of the Burrows-Wheeler Transform that enables the alignment of short DNA sequences to mammalian genomes at a rate much faster than existing hashtable-based methods. The thesis also presents an extension of the technique that exploits the scalability of Cloud Computing to perform the equivalent of one human genotyping project in hours.

    A representation of a compressed de Bruijn graph for pan-genome analysis that enables search

    Recently, Marcus et al. (Bioinformatics 2014) proposed to use a compressed de Bruijn graph to describe the relationship between the genomes of many individuals/strains of the same or closely related species. They devised an $O(n \log g)$ time algorithm called splitMEM that constructs this graph directly (i.e., without using the uncompressed de Bruijn graph) based on a suffix tree, where $n$ is the total length of the genomes and $g$ is the length of the longest genome. In this paper, we present a construction algorithm that outperforms their algorithm in theory and in practice. Moreover, we propose a new space-efficient representation of the compressed de Bruijn graph that adds the possibility to search for a pattern (e.g., an allele, a variant form of a gene) within the pan-genome. Comment: Submitted to Algorithmica special issue of CPM201

    Medium-Space Algorithms for Inverse BWT

    As the cost and throughput of second-generation sequencers continue to improve, even modestly resourced research laboratories can now perform DNA sequencing experiments that generate hundreds of billions of nucleotides of data, enough to cover the human genome dozens of times over, in about a week for a few thousand dollars. Such data are now being generated rapidly by research groups across the world, and large-scale analyses of these data appear often in high-profile publications such as Nature, Science, and The New England Journal of Medicine. But with these advances comes a serious problem: growth in per-sequencer throughput (currently about 4x per year) is drastically outpacing growth in computer speed (about 2x every 2 years). As the throughput gap widens over time, sequence analysis software is becoming a performance bottleneck, and the costs associated with building and maintaining the needed computing resources are burdensome for research laboratories. This thesis proposes two methods and describes four open source software tools that help to address these issues using novel algorithms and high-performance computing techniques. The proposed approaches build primarily on two insights. First, the Burrows-Wheeler Transform and the FM Index, previously used for data compression and exact string matching, can be extended to facilitate fast and memory-efficient alignment of DNA sequences to long reference genomes such as the human genome. Second, these algorithmic advances can be combined with MapReduce and cloud computing to solve comparative genomics problems in a manner that is scalable, fault tolerant, and usable even by small research groups.
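
    The exact string matching use of the BWT and FM Index mentioned above can be illustrated with a small backward-search sketch. This is a textbook-style illustration under stated assumptions, not the thesis software: the helper names `build_fm_index` and `count_occurrences` are hypothetical, the occurrence table is stored uncompressed, and the input is assumed to be the BWT of a text ending in a single `$` terminator.

```python
from collections import Counter


def build_fm_index(bwt):
    """Return (C, occ) for a BWT string with one '$' terminator.

    C[ch]      = number of symbols in the text that are smaller than ch
    occ[ch][i] = number of occurrences of ch in bwt[:i]
    """
    counts = Counter(bwt)
    C, total = {}, 0
    for ch in sorted(counts):  # '$' sorts before A, C, G, T in ASCII
        C[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_occurrences(pattern, C, occ, n):
    """Count exact occurrences of pattern by backward search."""
    lo, hi = 0, n  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        # Narrow the range to rows whose suffix starts with ch + (current match).
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt = "annb$aa"  # BWT of "banana$"
    C, occ = build_fm_index(bwt)
    for p in ("ana", "nan", "nab"):
        print(p, count_occurrences(p, C, occ, len(bwt)))
```

    Each pattern character costs two table lookups regardless of text length, which is why the approach suits long references; real aligners replace the full `occ` table with sampled or compressed counts to save memory.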

    Deterministic sub-linear space LCE data structures with efficient construction

    Given a string $S$ of $n$ symbols, a longest common extension query $\mathsf{LCE}(i,j)$ asks for the length of the longest common prefix of the $i$th and $j$th suffixes of $S$. LCE queries have several important applications in string processing, perhaps most notably to suffix sorting. Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015: 65-76) described several data structures for answering LCE queries that offer a space-time trade-off between data structure size and query time. In particular, for a parameter $1 \leq \tau \leq n$, their best deterministic solution is a data structure of size $O(n/\tau)$ which allows LCE queries to be answered in $O(\tau)$ time. However, the construction time for all deterministic versions of their data structure is quadratic in $n$. In this paper, we propose a deterministic solution that achieves a similar space-time trade-off of $O(\tau\min\{\log\tau,\log\frac{n}{\tau}\})$ query time using $O(n/\tau)$ space, but significantly improves the construction time to $O(n\tau)$. Comment: updated title
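
    As a point of reference for the query being defined, here is a minimal sketch of the naive LCE computation by direct character comparison. It uses no precomputed data structure and is not the construction proposed in the paper; the function name `lce` and the use of 0-based suffix positions are assumptions for illustration.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:] (0-based)."""
    n = len(s)
    k = 0
    # Extend the match one character at a time until a mismatch
    # or until either suffix runs out.
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
        k += 1
    return k


if __name__ == "__main__":
    print(lce("banana", 1, 3))  # suffixes "anana" and "ana"
    print(lce("banana", 0, 1))  # suffixes "banana" and "anana"
```

    The scan needs only constant extra space but can take time linear in $n$ per query, which is the kind of cost that the indexed data structures trade against space.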

    Indexing large genome collections on a PC

    Motivation: The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. Results: We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching models, against a collection of thousand(s) of genomes. Its unique feature is the small index size, fitting in a standard computer with 16--32\,GB, or even 8\,GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, exact matching queries are handled in an average time of 39\,$\mu$s, and queries with up to 3 mismatches in 373\,$\mu$s, on the test PC with an index size of 13.4\,GB. For a smaller index, occupying 7.4\,GB in memory, the respective times grow to 76\,$\mu$s and 917\,$\mu$s. Availability: Software and Supplementary material: \url{http://sun.aei.polsl.pl/mugi}

    Space-efficient construction of compressed suffix trees

    We show how to build several data structures of central importance to string processing by taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let n be the text length and σ be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in O(n log σ) time using just o(n log σ) bits of working space on top of the input re-writable BWT. Using these algorithms as building blocks, for any parameter 00. This improves the previous most space-efficient algorithms, which worked in O(n) bits and O(n log n) time. We also consider the problem of merging BWTs of string collections, and provide a solution running in O(n log σ) time and using just o(n log σ) bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as n bits on top of a packed representation of the input/output and processes data as fast as 2.92 megabases per second.