Lightweight Massively Parallel Suffix Array Construction
The suffix array is the array of a string's suffixes sorted in lexicographic order, each suffix represented by its starting position in the input string. It is a fundamental data structure with applications in string processing, text indexing, data compression, computational biology, and many other areas. Over the last three decades, researchers have proposed a broad spectrum of suffix array construction algorithms (SACAs). The majority of SACAs, however, were implemented using sequential and CPU-parallel programming models. The maturity of GPU programming has opened the door to massively parallel GPU SACAs that outperform the fastest suffix sorting algorithms optimized for parallel CPU computing. Over the last five years, several GPU SACA approaches were proposed and implemented, but they prioritized running time over lightweight design.
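As a quick illustration of the definition, here is a minimal Python sketch that builds a suffix array by naive comparison sorting (fine for tiny inputs; real SACAs avoid the quadratic suffix comparisons):
```python
# Suffix array of "banana": starting positions of the suffixes,
# listed in lexicographic order of the suffixes themselves.
text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(sa)                        # [5, 3, 1, 0, 4, 2]
print([text[i:] for i in sa])    # ['a', 'ana', 'anana', 'banana', 'na', 'nana']
```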
In this thesis, we design and implement a lightweight massively parallel SACA on the GPU using the prefix-doubling technique. Our prefix-doubling implementation is memory-efficient and successfully constructs the suffix array for input strings as large as 640 megabytes (MB) on a Tesla P100 GPU. On large datasets, our implementation achieves a speedup of 7-16x over libdivsufsort, the fastest highly optimized OpenMP-accelerated suffix array constructor, which leverages CPU shared-memory parallelism. The performance of our algorithm relies on several high-performance parallel primitives: radix sort, conditional filtering, inclusive prefix sum, random memory scattering, and segmented sort. We evaluate our implementation on a variety of real-world datasets with respect to runtime, throughput, memory usage, and scalability, comparing against libdivsufsort running on a Haswell compute node equipped with 24 cores. Our GPU SACA is simple and compact, consisting of fewer than 300 lines of readable and effective source code. Additionally, we design and implement a fast and lightweight algorithm for checking the correctness of the suffix array.
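The prefix-doubling idea itself fits in a few lines. The sketch below is a sequential Python rendering, with the built-in comparison sort standing in for the GPU radix and segmented sorts named above, together with a naive quadratic checker for illustration (the thesis's checker is a separate, faster algorithm):
```python
def suffix_array_prefix_doubling(s):
    """Rank suffixes by their first 2^k characters, doubling k each round.
    Two suffixes compare via a pair of previously computed ranks, so each
    round needs only a sort plus a re-ranking pass."""
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]                 # round 0: rank by first character
    k = 1
    while True:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)                       # GPU version: radix/segmented sort
        new_rank = [0] * n
        for j in range(1, n):                  # equal keys share a rank
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:              # all ranks distinct: fully sorted
            return sa
        k *= 2

def is_suffix_array(s, sa):
    """Naive O(n^2) correctness check, for illustration only."""
    return (sorted(sa) == list(range(len(s))) and
            all(s[sa[j - 1]:] < s[sa[j]:] for j in range(1, len(sa))))

assert is_suffix_array("banana", suffix_array_prefix_doubling("banana"))
```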
Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools
This dissertation focuses on two fundamental sorting problems: string sorting
and suffix sorting. The first part considers parallel string sorting on
shared-memory multi-core machines, the second part external memory suffix
sorting using the induced sorting principle, and the third part distributed
external memory suffix sorting with a new distributed algorithmic big data
framework named Thrill. (Dissertation, Karlsruher Institut für Technologie,
2018; 396 pages.)
Parallel and scalable combinatorial string algorithms on distributed memory systems
Methods for processing and analyzing DNA and genomic data are built upon combinatorial graph and string algorithms. The advent of high-throughput DNA sequencing enables the generation of billions of reads per experiment. Classical, sequential algorithms can no longer cope with these growing data sizes, which for the last 10 years have greatly outpaced advances in processor speeds. Processing and analyzing state-of-the-art genomic data sets requires the design of scalable and efficient parallel algorithms and the use of large computing clusters.
Suffix arrays and trees are fundamental string data structures that lie at the foundation of many string algorithms, with important applications in text processing, information retrieval, and computational biology. Consequently, the parallel construction of these indices is an actively studied problem. However, prior approaches lacked good worst-case run-time guarantees and exhibited poor scaling and overall performance. In this work, we present our distributed-memory parallel algorithms for indexing large datasets, including algorithms for the distributed construction of suffix arrays, LCP arrays, and suffix trees. We formulate a generalized version of the All-Nearest-Smaller-Values (ANSV) problem, provide an optimal distributed solution, and apply it to the distributed construction of suffix trees, yielding a work-optimal parallel algorithm. Our algorithms for distributed suffix array and suffix tree construction improve the state of the art by simultaneously improving worst-case run-time bounds and achieving superior practical performance.
Next, we introduce a novel distributed string index, the Distributed Enhanced Suffix Array (DESA). Based on the suffix and LCP arrays, the DESA consists of these and additional distributed data structures, and is designed to allow efficient pattern search queries in distributed memory while requiring at most O(n/p) memory per process. We present efficient distributed-memory parallel algorithms for querying, as well as for the efficient construction of this distributed index. Finally, we present our work on distributed-memory algorithms for clustering de Bruijn graphs and their application to solving a grand-challenge metagenomic dataset.
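The ANSV kernel mentioned above has a well-known sequential form; a minimal Python sketch of the left-neighbor half is below (the dissertation's contribution is the distributed-memory formulation, not this scalar loop). Applied to the LCP array, the nearest smaller values delimit the intervals that become suffix tree nodes:
```python
def nearest_smaller_to_left(a):
    """For each i, return the index of the closest j < i with a[j] < a[i],
    or -1 if none exists. A mirrored right-to-left pass completes ANSV."""
    result = [-1] * len(a)
    stack = []                                # indices of a strictly increasing run
    for i, x in enumerate(a):
        while stack and a[stack[-1]] >= x:    # pop values that cannot be answers
            stack.pop()
        result[i] = stack[-1] if stack else -1
        stack.append(i)
    return result

print(nearest_smaller_to_left([3, 1, 4, 1, 5]))   # [-1, -1, 1, -1, 3]
```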
GPU accelerating distributed succinct de Bruijn graph construction
The research and methods in the field of computational biology have grown in the last decades thanks to the availability of biological data. Applications in computational biology include genome sequencing and sequence alignment, a method of arranging sequences of, for example, DNA or RNA to determine regions of similarity between them. Applications of sequence alignment include public health purposes, such as monitoring antimicrobial resistance.
Demand for fast sequence alignment has led to the use of data structures, such as the de Bruijn graph, to store large amounts of information efficiently. De Bruijn graphs are currently among the most widely used data structures for indexing genome sequences, and different methods of representing them have been explored. One of these methods is the BOSS data structure, a special case of the Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph.
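For orientation, the object being encoded can be sketched in a few lines of Python: nodes are (k-1)-mers and edges are the k-mers observed in the reads. BOSS stores the same edge set succinctly, in a BWT-like sorted order, rather than in a hash map as in this sketch:
```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Plain node-centric de Bruijn graph: map each (k-1)-mer to the
    (k-1)-mers reachable through an observed k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge: prefix (k-1)-mer -> suffix (k-1)-mer
    return graph

print(de_bruijn_edges(["TACGT", "ACGTC"], k=3))
# edges: TA->AC, AC->CG, CG->GT, GT->TC
```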
Because genomes can occupy a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark, as well as Graphics Processing Units (GPUs), in genome data processing.
This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can speed up specific operations by as much as 8x, but for other operations it has severe limitations that restrict its usefulness for succinct de Bruijn graph index construction.
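As a point of reference, the RAPIDS accelerator is enabled through Spark configuration rather than code changes; a minimal PySpark sketch is below (the plugin class and flag are the accelerator's standard names, the application name is illustrative, and the accelerator jar is assumed to be on the cluster's classpath):
```python
from pyspark.sql import SparkSession

# Minimal sketch: route supported DataFrame/SQL operations (sorts, joins,
# aggregations) through the GPU via the RAPIDS Accelerator plugin.
spark = (SparkSession.builder
         .appName("succinct-dbg-index")               # illustrative name
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .getOrCreate())

# Unsupported expressions fall back to the CPU automatically, which is one
# source of the limitations observed in the thesis.
```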
Algorithm-Hardware Co-Design for Performance-driven Embedded Genomics
Genomics includes the development of techniques for the diagnosis, prognosis
and therapy of over 6000 known genetic disorders. It is a major driver in the
transformation of medicine from the reactive form to the personalized,
predictive, preventive and participatory (P4) form. The availability of a
genome is an essential prerequisite to genomics; it is obtained from the
sequencing and analysis pipelines of whole genome sequencing (WGS).
The advent of second-generation sequencing (SGS) significantly reduced
sequencing costs, leading to voluminous research in genomics. SGS
technologies, however, generate massive volumes of data in the form of reads,
which are fragments of the real genome. The performance requirements
associated with mapping reads to the reference genome (RG), in order to
reassemble the original genome, now stand disproportionate to the available
computational capabilities. Conventionally, the hardware resources used are
homogeneous many-core architectures employing complex general-purpose CPU
cores. Although these cores provide high performance, a data-centric approach
is required to identify alternate hardware systems more suitable for
affordable and sustainable genome analysis.
Most state-of-the-art genomic tools are performance-oriented and do not
address the crucial aspect of energy consumption. Although algorithmic
innovations have reduced runtime on conventional hardware, energy consumption
has scaled poorly. The associated monetary and environmental costs have made
energy a major bottleneck to translational genomics. This thesis is concerned
with the development and validation of read mappers for the embedded genomics
paradigm, aiming to provide a portable and energy-efficient hardware solution
to the reassembly pipeline. It applies the algorithm-hardware co-design
approach to bridge the saturation point reached by algorithmic innovations
with emerging low-power/energy heterogeneous embedded platforms.
Essential to the embedded paradigm is the ability to use heterogeneous
hardware resources. Graphics processing units (GPUs) are often available in
modern devices alongside the CPU, but state-of-the-art read mappers are
conventionally not tuned to use both together. The first part of the thesis
develops a Cross-platfOrm Read mApper using opencL (CORAL) that can
distribute the workload across all available devices for high performance.
The OpenCL framework mitigates the need to design separate kernels for the
CPU and GPU. CORAL implements a verification-aware filtration algorithm for
rapid pruning and identification of candidate locations for mapping reads to
the RG.
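CORAL's own filter is not reproduced here, but the pigeonhole-style filtration that read mappers of this kind build on is easy to sketch: splitting a read into s+1 non-overlapping seeds guarantees that any alignment with at most s mismatches contains at least one exact seed match. In the sketch below, `ref_index` is an assumed precomputed map from seed strings to their positions in the reference:
```python
def candidate_locations(read, ref_index, errors=3):
    """Pigeonhole filtration: with errors+1 seeds, any mapping of the read
    with <= `errors` mismatches must match one seed exactly, so exact seed
    hits enumerate all candidate locations (confirmed later by verification)."""
    seeds = errors + 1
    seed_len = len(read) // seeds
    candidates = set()
    for j in range(seeds):
        seed = read[j * seed_len:(j + 1) * seed_len]
        for pos in ref_index.get(seed, ()):       # exact occurrences of the seed
            candidates.add(pos - j * seed_len)    # implied start of the read
    return candidates
```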
Mapping reads on embedded platforms suffers reduced performance due to
architectural differences such as limited on-chip/off-chip memory, smaller
bandwidths and simpler cores. To mitigate this degradation, in the second
part of the thesis we propose a REad maPper for heterogeneoUs sysTEms
(REPUTE), which uses an efficient dynamic programming (DP) based filtration
methodology. Using algorithm-hardware co-design and kernel-level
optimizations to reduce its memory footprint, REPUTE demonstrates significant
energy savings on the HiKey970 embedded platform with acceptable performance.
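The exact kernels in REPUTE and PLEDGER are not reproduced here, but the classic bit-vector DP that such filters relate to is Myers' bit-parallel edit-distance algorithm, which advances one column of the DP matrix per text character using word-wide logic. A minimal Python sketch (patterns up to one machine word in a C implementation; Python integers simply grow):
```python
def bitvector_edit_distance(pattern, text):
    """Myers' bit-parallel DP: pv/mv encode the +1/-1 vertical differences of
    the current DP column; each text character is processed with a handful of
    word-wide logical operations."""
    m = len(pattern)
    peq = {}                                  # per-symbol match bitmasks
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = (1 << m) - 1, 0, m
    high = 1 << (m - 1)
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | ~(xh | pv)
        mh = pv & xh
        if ph & high:                         # bottom DP cell grew by 1
            score += 1
        elif mh & high:                       # bottom DP cell shrank by 1
            score -= 1
        ph = (ph << 1) | 1
        mh <<= 1
        pv = mh | ~(xv | ph)
        mv = ph & xv
    return score

assert bitvector_edit_distance("ACGT", "AGT") == 1   # one deletion
```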
The third part of the thesis concentrates on mapping the whole genome on an
embedded platform. We propose a Pyopencl based tooL for gEnomic workloaDs
tarGeting Embedded platfoRms (PLEDGER), which makes two novel contributions.
The first is a novel preprocessing strategy that generates
low-memory-footprint (LMF) data structures able to fit all human chromosomes,
at the cost of performance. The second is an LMF DP-based filtration method
that works in conjunction with the proposed data structures. To mitigate the
performance degradation, the kernel employs several optimisations, including
extensive use of bit-vector operations. Extensive experiments using real
human reads were carried out with state-of-the-art read mappers on 5
different platforms for CORAL, REPUTE and PLEDGER. The results show that
embedded genomics provides significant energy savings with performance
similar to conventional CPU-based platforms.
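PLEDGER's actual LMF structures are not detailed in this abstract, but the kind of memory saving at stake is easy to illustrate: packing each DNA base into 2 bits cuts a reference to a quarter of its ASCII size while staying amenable to the bit-vector operations mentioned above. A purely illustrative Python sketch:
```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base (1/4 of ASCII size)."""
    word = 0
    for i, b in enumerate(seq):
        word |= CODE[b] << (2 * i)
    return word

def unpack(word, n):
    """Recover an n-base DNA string from its 2-bit packed form."""
    return "".join(BASES[(word >> (2 * i)) & 3] for i in range(n))

assert unpack(pack("GATTACA"), 7) == "GATTACA"
```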
Efficient Storage of Genomic Sequences in High Performance Computing Systems
In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance on genomic data, hence the need for specialized compression solutions. Two trends have emerged to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors achieve higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. The effectiveness of referential compression, however, depends on selecting a good reference and on having enough computing resources available.
This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies.
Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress: suffix array construction.
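The core idea of referential read compression can be sketched in a few lines (an illustration of the principle only; UdeACompress's actual binary encoding, alignment scheme, and handling of quality values and headers are more involved). The hypothetical encoder below assumes an alignment position is already known:
```python
def encode_read(read, reference, pos):
    """Referential encoding: replace the read by its alignment position,
    its length, and the offsets/bases where it differs from the reference."""
    mismatches = [(i, b) for i, b in enumerate(read) if reference[pos + i] != b]
    return pos, len(read), mismatches

def decode_read(reference, pos, length, mismatches):
    """Copy the reference window, then patch the recorded mismatches."""
    read = list(reference[pos:pos + length])
    for i, b in mismatches:
        read[i] = b
    return "".join(read)

ref = "ACGTACGTACGT"
enc = encode_read("ACGAACG", ref, 0)          # -> (0, 7, [(3, 'A')])
assert decode_read(ref, *enc) == "ACGAACG"
```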