High-Performance Computing Frameworks for Large-Scale Genome Assembly by Goswami, Sayan
Louisiana State University
LSU Digital Commons
LSU Doctoral Dissertations Graduate School
June 2019
High-Performance Computing Frameworks for
Large-Scale Genome Assembly
Sayan Goswami
Louisiana State University and Agricultural and Mechanical College, sgoswa1@lsu.edu
Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations
Part of the Bioinformatics Commons, Computational Biology Commons, Genomics Commons,
Other Computer Sciences Commons, and the Software Engineering Commons
This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in
LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please contactgradetd@lsu.edu.
Recommended Citation
Goswami, Sayan, "High-Performance Computing Frameworks for Large-Scale Genome Assembly" (2019). LSU Doctoral Dissertations.
4942.
https://digitalcommons.lsu.edu/gradschool_dissertations/4942
HIGH-PERFORMANCE COMPUTING FRAMEWORKS FOR
LARGE-SCALE GENOME ASSEMBLY
A Dissertation
Submitted to the Graduate Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
in
The Department of Computer Science
by
Sayan Goswami
B.Tech, National Institute of Technology Durgapur, India, 2011
August 2019
Dedicated to maa, baba, mi and tubai.
ii
Acknowledgments
This work would not have been possible without the guidance, encouragement, support,
and criticism of my advisors Dr. Kisung Lee and Dr. Seung-Jong Park. I would like to express
my most sincere gratitude to my committee members Dr. Jianhua Chen and Dr. Richard Ng
for contributing their valuable time for me. I am also grateful to my colleagues and coauthors
Arghya, Shayan, and Richard, whom I had the great pleasure of working with.
I attribute my success to the unwavering support and faith of my parents, my aunt, and
my brother. Thanks also to Dr. Guhathakurta (Parag Da) of NIT Durgapur who played a de-
cisive role in my return to grad school. I am also deeply indebted (somewhat literally) to my
friends here at Baton Rouge who have made this place my home. Many thanks to Arnab,
Sujana, Saikat, Ishita, Trina, Ananya, and my past and present flatmates Arghya, Subhajit, Sa-
tadru, and Joy. Lastly, a special note of gratitude must be set aside for my friends back home
(Abhishek, Ayan, Gourav, JD, Sandeep, Saradendu, Saubhik, Souvik, Subh, Suprakash) with
whom I spent the best years of my life.
The works presented here were partially funded by NIH grants (P20GM103458-10, P30GM-
110760-03, and P20GM103424), NSF grants (MRI-1338051, IBSS-L-1620451, SCC-1737557, and
RAPID-1762600), LA Board of Regents grants (LEQSF(2016-19)-RD-A-08 and ITRS), and IBM
faculty awards. I would also like to thank NVIDIA for its generosity in allowing me to use their
computing resources.
iii
Table of Contents
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
CHAPTER
1. THE GENOME ASSEMBLY PROBLEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2. De Novo Assembly - de Bruijn vs String Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3. Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2. MEMORY-EFFICIENT DISTRIBUTED DE BRUIJN GRAPH ASSEMBLY . . . . . . . . . . . . . 7
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3. GPU-ACCELERATED STRING GRAPH ASSEMBLY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3. Methodology for GPU-Accelerated Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4. THIRD GENERATION SEQUENCING & FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1. Third Generation Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2. Advantages and Challenges of 3GS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3. Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
APPENDIX. COPYRIGHT INFORMATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
iv
List of Tables
3.1. Illumina datasets used for evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2. Single node assembly times on 128 GB host memory and 12 GB de-
vice memory (K40) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3. Single node assembly times on 64 GB host memory and 6 GB device
memory (K20) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4. Peak memory usage (in GB) on 128 GB host with K40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5. Peak memory usage (in GB) on 64 GB host with K20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6. Comparison between SGA and LaSAGNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1. Overview of datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2. Comparison of assembly times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3. Comparison of assembly contiguity. (H = Hiper3ga, M = Miniasm) . . . . . . . . . . . . . . . . . . 71
v
List of Figures
1.1. Overview of genome sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2. Overview of de Bruijn graph based assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3. String graph-based assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1. Classification of assemblers wrt. scalability and memory efficiency . . . . . . . . . . . . . . . . . 8
2.2. Partitioning de Bruijn graphs using minimum substrings. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3. Imbalance of distribution in intermediate data among partitions . . . . . . . . . . . . . . . . . . . 13
2.4. System overview of Lazer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5. Actor interactions during map-phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6. Actor interactions during reduce-phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7. Execution times of Lazer on the synthetic wheat W7984 dataset (1.1 TB) . . . . . . . . . . . . 24
2.8. Execution times of Lazer, Swap, and Ray on the Yoruban male dataset
(452 GB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.9. Distribution of intermediate key-value pairs with various cluster configurations . . . 27
3.1. Conceptual view of memory types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2. Overview of assembly pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3. The communication and computing patterns while generating prefix
fingerprints of GATACCAGTA with radix 4 and prime 13. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4. The communication and computing patterns while generating suffix
fingerprints using prefix fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5. Generating contigs from paths in string graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6. Effects of different host and device block-sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7. Effects of different GPUs and host block-sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.8. Execution times on different node configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1. Erroneous assembly due to repeats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2. Distribution of read sizes in log-normal scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
vi
Abstract
Genome sequencing technology has witnessed tremendous progress in terms of through-
put and cost per base pair, resulting in an explosion in the size of data. Typical de Bruijn
graph-based assembly tools demand a lot of processing power and memory and cannot as-
semble big datasets unless running on a scaled-up server with terabytes of RAMs or scaled-out
cluster with several dozens of nodes.
In the first part of this work, we present a distributed next-generation sequence (NGS)
assembler called Lazer, that achieves both scalability and memory efficiency by using par-
titioned de Bruijn graphs. By enhancing the memory-to-disk swapping and reducing the
network communication in the cluster, we can assemble large sequences such as human
genomes (~400 GB) on just two nodes in 14.5 hours, and also scale up to 128 nodes in 23
minutes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in
18.5 hours and on 128 nodes in 1.25 hours.
In the second part, we present a new distributed GPU-accelerated NGS assembler called
LaSAGNA, which can assemble large-scale sequence datasets using a single GPU by build-
ing string graphs from approximate all-pair overlaps in quasi-linear time. To use the limited
memory on GPUs efficiently, LaSAGNA uses a two-level semi-streaming approach from disk
through host memory to device memory with restricted access patterns on both disk and host
memory. Using LaSAGNA, we can assemble the human genome dataset on a single NVIDIA
K40 GPU in 17 hours, and in a little over 5 hours on an 8-node cluster of NVIDIA K20s.
In the third part, we present the first distributed 3rd generation sequence (3GS) assembler
which uses a map-reduce computing paradigm and a distributed hash-map, both built on a
high-performance networking middleware. Using this assembler, we assembled an Oxford
Nanopore human genome dataset (~150 GB) in just over half an hour using 128 nodes whereas
existing 3GS assemblers could not assemble it because of memory and/or time limitations.
vii
Chapter 1. The Genome Assembly Problem
1.1. Introduction
The genome of an organism is composed of its entire set of DNA including all its genes,
and the study of its structure, function, evolution and mapping is known as Genomics. Accord-
ing to a report released in 2013, the field of genomics has contributed nearly US$1 trillion to
the US economy since the start of the Human Genome Project [25]. Since a genome encodes
enough information to construct entire organisms, and thus potentially has clues to treat,
cure, or prevent various diseases, the knowledge of DNA sequences is being widely used not
only in basic biological sciences but also in various applied fields such as medicine, forensic
science, and virology [2, 30, 51].
Obtaining an organism’s DNA information consists of two essential steps - DNA sequenc-
ing and sequence assembly. Sequencing is the process of chemically marking the nucleotides
in order to find their exact order in the DNA. Since state-of-the-art sequencing machines can-
not accurately sequence the entire genome in one go, they apply a method known as shotgun
sequencing in which they create thousands of clones from the original genome, split them
at random positions generating millions of smaller fragments, and sequence these to pro-
duce short-reads. Figure 1.1 depicts the process of cloning and extracting short reads from a
genome.
The idea behind this type of sequencing is that each clone contributes one or more frag-
ments from quasi-random locations within itself, and thus there is a high probability that
fragments from different clones will have an overlap between them. The overlap information
is then used to align and merge the short-reads to recreate the original sequence. This process
is known as genome assembly.
Genome assembly can be performed either by mapping or de novo. In mapping assem-
bly, reads are aligned with some reference genome of the same or a closely related species and
the aligned reads are used to construct the original sequence. On the other hand, in de novo
assembly, the original sequence is obtained just by finding the overlaps between reads. Hence,
1
(a) Original sequence
(b) Clones of original sequence
(c) Short-reads extracted from clones
Figure 1.1. Overview of genome sequencing
often while sequencing a new species, de novo assembly is the only option available to prac-
titioners. In this work we only focus on de novo assembly since it is more compute-intensive
and data-intensive than mapping assembly.
1.2. De Novo Assembly - de Bruijn vs String Graphs
In its simplest form, de novo assembly can be viewed as solving the shortest common
superstring problem which takes a set R of strings {r1,r2,r3, · · · ,rn}, and attempts to find the
shortest possible string s such that every ri ∈ R is a substring of s. However, the shortest com-
mon superstring problem is NP-complete and therefore all approaches to de novo assembly
are heuristics that try to find suboptimal solutions in a reasonable amount of time. The main
challenge in the assembly is to create a full-text index using all reads so that the overlaps be-
tween them can be obtained efficiently, and subsequently storing this overlap information in
a graph so that overlapping reads can be merged to generate an approximation of s.
The method used to find overlaps between short-reads in most assemblers falls into one
of two categories - de Bruijn graph based and string graph based. In the former, short-reads
are further broken down into smaller overlapping sub-sequences of length k, called k-mers,
using a sliding window of size k. These k-mers are then used to build a de Bruijn graph using
the following rules:
2
1. Each unique k-mer is represented as a vertex in the graph. If the same k-mer is gener-
ated more than once from one or more short reads, they must be mapped to the same
vertex.
2. An edge exists between two vertices vi and v j if their corresponding k-mers, ki and k j ,
have an overlap of k−1 characters between them. In other words, if the k−1 suffix of ki
is equal to the k−1 prefix of k j , then there must be an edge from vi to v j in the generated
De Bruijn graph.
Using this technique, short-reads are represented as paths in the graph and overlaps (of length
k or longer) between reads are encoded in the graph due to the fact that same k-mers from
different reads are mapped to the same vertex.
The next phase in this type of assembly is a lossless compression of each vertex chain (i.e.,
a path of vertices with one in-degree and one out-degree) into a single vertex called contig,
which represents a longer substring of the original DNA sequence. The compression can be
performed by traversing the graph starting from the vertices having an unequal in-degree and
out-degree (such as forks, joins, etc). Each contig will contain the original sub-sequence(s)
from which the participating k-mers were produced. The compressed graph can then be used
in a wide variety of contig extension strategies. Figure 1.2 presents an example of the pro-
cess described above. Figure 1.2a shows a set of overlapping short reads covering the genome
followed by figure 1.2b which depicts the generation of 4-mers from short reads. Next, in fig-
ure 1.2c, we create a de Bruijn graph using the 4-mers from the previous step. The numbers
below the k-mers denote their corresponding frequencies across all reads. Note that k-mer
TAGA appears in both the second and the fifth short reads. In the second read, the k-mer fol-
lowing TAGA is AGAT, whereas in the fifth read, the following one is AGAG.Therefore, the vertex
TAGA has two outgoing edges to vertices AGAT and AGAG. Finally, in figure 1.2d we compress all
unambiguous paths (chains of vertices) into contigs.
One of the disadvantages of the de Bruijn graph based assembly technique is that it is
prone to collapsing repeated regions of the genome that are larger than k, causing informa-
3
(a) Overlapping short-reads
(b) K-mers from short-reads
(c) De Bruijn Graph using k-mers as vertices
(d) De Bruijn graph compression
Figure 1.2. Overview of de Bruijn graph based assembly
tion loss [27]. This can be avoided if assembly is performed using string-graphs which take
a more direct approach in finding the overlaps between short-reads. Given a set R of reads
and their Watson Crick (WC) complements, a string graph G in its simplest form consists of
R as vertices and a set of directed weighted edges E . An edge e = (ri ,r j , l ), where ri ,r j ∈ R
and l ∈ [1,mi n(|ri |, |r j |)), exists in E if the l-length suffix of ri matches the l-length prefix of
4
r j . Naturally, any edge e = (ri ,r j , l ) ∈ E must also have a complementary edge e′ = (r j ′,ri ′, l ),
where ri ′ is the WC-complement of ri .
In theory, there exists a tour in G that corresponds to the original sequence from which
R was generated. However, the graph contains a lot of redundant information in the form of
transitive edges. Specifically, if a read r j overlaps with rk and read ri overlaps with both r j and
rk , then the edge (ri ,rk ) can be removed from G without any information loss. Furthermore, a
read that is completely contained in another one may also be removed. A heuristic that is often
used to prune the graph is a greedy approach, where only one outgoing edge corresponding
to the read with the longest overlap is considered for assembly, and the others are ignored [18,
28].
Once the graph is simplified, paths that can be unambiguously traversed are spelled out
to obtain contigs. Fig. 1.3 shows a string graph created from short reads and the contigs gener-
ated by traversing it. The darker lines represent the greedy edges created between reads with
the longest overlap.
Figure 1.3. String graph-based assembly
1.3. Our Contribution
The last decade has witnessed huge leaps in the advancement of sequencing technology
that has led to the development of second generation sequencers, also known as Next Gener-
ation sequencers (NGS), which produce sequence datasets at a much higher throughput and
at a much lower cost than their predecessors. For instance, in a single run, an Illumina HiSeq
4000 sequencer can generate up to five billion 150-nucleotide reads at a cost of about $0.05
per million bases. This phenomenon has resulted in an explosion of raw datasets both in
5
terms of size and quantity. Although several tools have been developed to process these high-
throughput sequence datasets, assembling a real-world human genome dataset whose size is
nearly half a terabyte still requires either a scale-up machine with terabytes of RAM or a scale-
out cluster with several dozens of nodes. Even with the pay-as-you-go pricing model of cloud
services and availability of various big data frameworks, access to such high-end machines or
large-scale clusters is often too expensive for most researchers.
When processing large genomic datasets, both the de Bruijn graph and the string graph
based approaches require lots of memory as well as compute power. Our common objective
in the following works is to find new ways to reduce the memory requirements of the assembly
process as well as its execution time. To that end, we propose two new assembly frameworks
called Lazer and LaSAGNA, that are capable of processing large-scale sequence datasets using
modest computing resources in a reasonable amount of time. Lazer is a fully distributed de
Bruijn graph-based assembler than can assemble the human genome on two nodes with 64
GB memory each, but can scale up all the way to 128 nodes, and to the best of our knowledge,
is the only distributed assembler to be able to do so. On the other hand, LaSAGNA is a string
graph based assembler that is capable of assembling the large datasets on a single GPU but
can also speed up the process by using multiple GPUs across a high-performance computing
cluster. To the best of our knowledge, LaSAGNA is the only GPU-based assembler than can
assemble a 400 GB human genome dataset on a single GPU with 6 GB device memory.
The rest of the literature is organized as follows. In Chapter 2, we explain some of the
the issues with de Bruijn graph based de novo assembly, specially with respect to memory re-
quirement and scalability and demonstrate how Lazer alleviates them by swapping in and out
graph partitions from disk to memory. In Chapter 3, we focus on string graph based assembly
and propose a semi-streaming approach to utilize the high computational capabilities of gen-
eral purpose GPUs to assemble large datasets despite the limited device-memories. Finally, in
Chapter 4, we provide a preliminary exposition on third generation sequencers, their advan-
tages, challenges and opportunities and use them to motivate our proposed future work.
6
Chapter 2. Memory-Efficient Distributed De Bruijn Graph Assembly
2.1. Introduction
When it comes to de Bruijn graph based assembly, most of the widely used distributed
assemblers are designed with the assumption that users will have enough resources to deal
with all sizes of data. However, this is not always the scenario, particularly in case of terabyte-
scale datasets where these tools run out of memory on small clusters. Furthermore, the trends
in sequencing technology shows an increasing demand for processing power and memory in
order for the sequences to be assembled. Consequently, researchers will soon be unable to
assemble the larger genomes on the HPC clusters available to them. Therefore, to keep in-
tune with the decreasing cost of sequencing, there is a need for assembly tools to be more
memory efficient.
Figure 2.1 categorizes representative assemblers based on memory utilization and scal-
ability. On one hand, we have the first generation assemblers which are multithreaded ap-
plications that run on a single node but require terabytes of memory to assemble the larger
genomes. On the other hand, we have distributed assemblers that still need huge amounts
memory, but can be run on a distributed environment if there isn’t enough memory on a single
node. At the other end of the spectrum, there are a few assemblers that use succinct data struc-
tures to assemble large genomes on single nodes, but are not designed to run on distributed
systems. In this chapter, we present Lazer (large-scale genome assembly on ZeroMQ), a dis-
tributed assembly tool designed to have a low memory footprint and yet is capable of scaling
up to hundreds of nodes.
Experiments on a human genome dataset (452 GB) demonstrate that Lazer’s performance
is often better than others in terms of execution times, while having better scalability with a
much smaller memory requirement. Moreover, Lazer successfully assembles a 1.1 TB syn-
thetic wheat dataset whereas the other assemblers have failed to execute on it.
This chapter was previously published by Sayan Goswami et.al. as "Lazer: Distributed memory-efficient
assembly of large-scale genomes" in 2016 IEEE International Conference on Big Data (Big Data). Reprinted by
permission of IEEE.
7
H
ig
h 
M
em
or
y 
R
eq
ui
re
m
en
t
Velvet
SOAPdenovo
ALLPATHS-LG
ABySS
Ray
SWAP
Spaler
L
ow
 M
em
or
y 
R
eq
ui
re
m
en
t
Minia
Platanus
MEGAHIT
Lazer
Low Scalability High Scalability
Figure 2.1. Classification of assemblers wrt. scalability and memory efficiency
The rest of the chapter is organized as follows. In Section 2.2, we discuss the current
state of the art in the realm of distributed assemblers. In Sections 2.3 and 2.4, we explain
the methodology and implementation details of our proposed assembler. Section 2.5 evalu-
ates our tool on a variety of datasets and cluster configurations followed by a conclusion of
our study in Section 2.6.
2.2. Related Work
De novo assembly is a widely studied problem so there are many assembly tools to choose
from. The computation of older assemblers such as Velvet [62], ALLPATHS-LG [45], SOAP-
denovo [61], etc. are restricted to a single node, and hence these tools cannot address the
challenges involved in next-generation sequencing data. Furthermore, these assemblers are
severely limited due to inefficient memory management and are incapable of assembling ter-
abytes of data in most practical cases.
To reduce the memory required to assemble large datasets, Minia [15], Platanus [29],
MegaHit [36] make use of several succinct data structures (e.g., Bloom filter, etc.) and en-
coding schemes. However, their execution times are limited by the number of cores available
in a single machine. Moreover, given the trends of NGS dataset sizes, it is not clear if these as-
8
semblers will be able to accommodate the de Bruijn graphs in the memory of a single machine
in the future. Hence, a distributed approach is necessary to alleviate the problems involved in
large scale NGS dataset.
Assemblathon2 [11] evaluates several de novo assemblers using a variety of datasets. The
only distributed assemblers evaluated in Assemblathon2 are ABySS [56] and Ray [8], both of
which are based on MPI (Message Passing Interface). We have experimentally observed that
on large datasets, Ray could not scale up, and ABySS was unable to run at all. PASHA [41] is
another distributed assembler, also based on MPI but is severely constrained due to that fact
that some of its stages are not distributed. SWAP [47] outperforms ABySS, Ray, and PASHA
both in terms of scalability as well as execution time, but has a large memory overhead.
To tackle the scalability problem, assembly tools based on Apache Hadoop or Apache
Spark have also been proposed. Contrail [54] was the first de Bruijn graph-based assembler
built on Hadoop, followed by CloudBrush [13], which was based on string graphs. However, in
case of extremely large datasets their performances deteriorate due the disk-based computa-
tion (i.e., huge amount of disk I/O is required) of Hadoop. Some assemblers such as GiGA [32]
used a hybrid approach by using Hadoop for building the graph and Apache Giraph for com-
pressing it. Other developments include Spaler [1], which is a de novo assembler built on
Spark and GraphX. Although Spaler appears to perform well on datasets up to 100 GB, we
could not access its code to evaluate it on the terabyte-scale datasets.
2.3. Methodology
Our tool comprises of the first two phases of the assembly pipeline - de Bruijn graph cre-
ation and compression. In principle, the graph can be created simply by loading the k-mers
into an in-memory hash map, thus implicitly merging unique k-mers to the same vertex. A
more memory-efficient option would have been to use a map-reduce-based approach where
key-value pairs are partitioned into buckets and then each bucket is reduced by sorting or in-
dexing the keys. However, regardless of the method used to build the graph, it must reside in
an in-memory hash table in order to be traversed.
9
This approach poses a few significant challenges. Firstly, either there must be an enor-
mous amount of memory in a single machine, or the hash map must be distributed. Secondly,
in case the hash map is distributed, the lookup time is dominated by the network latency,
which is generally in the order of microseconds as opposed to the nanosecond latencies of
main memory. This network latency has a significant impact on the total execution time.
We alleviate the problems of scalability and memory efficiency to some extent by parti-
tioning the de Bruijn graph. Our partitioned graph has two advantages:
• It does not require the entire graph to be loaded in memory at once. The vertices within
each partition can be indexed, locally compressed, and spilled onto disk until all parti-
tions are processed. Finally, all partitions can be loaded at once and compressed one
last time.
• Each partition can be compressed by a thread or a group of threads within a single node
in the cluster. This local processing reduces the necessity of synchronization and com-
munication between threads and cluster nodes, thus enhancing the degree of paral-
lelism.
We use minimum substring partitioning (MSP) [38] to distribute the k-mers among the
different buckets. In this scheme, a smaller window of size m is slid over each k-mer Ki to
generate a set Si of k −m +1 substrings. Since |Σ| = 4, each substring in Si can be uniquely
mapped to one of 4m partitions. The partition to which Ki would be mapped is determined
by the alphabetically smallest substring in Si . The motivation behind using this approach
instead of an ordinary hash function is that overlapping k-mers (where the prefix of one is
the suffix of the another), which will be close to each other in the final de Bruijn graph, will
have a better chance of being mapped to the same bucket. In addition, this heuristic allows us
to partition the graph while it is being generated from the reads. Therefore, we can reap the
obvious benefits of working with smaller sub-graphs without having a separate partitioning
phase in the assembly pipeline.
10
In order to reduce the memory footprint even further, we propose a two-level partitioning
scheme. The first level (L1) partition of a k-mer is determined by the minimum substring
as explained earlier. The second level (L2) is determined by the position of the minimum
substring in the k-mer. Thus, for each L1 partition, there are k −m +1 L2 partitions. With this
scheme, whenever an L1 partition is locally compressed, at most two consecutive L2 partitions
are required to be loaded in memory at once, and the compression proceeds in a lockstep
fashion. Figure 2.2a depicts the partitioning scheme described above and 2.2b shows the local
compression of L1 partition Pi . In this example, the k-mer size is 17 and the substring size is
9. All the overlapping k-mers here have the same minimum substring (AAAGCTATA) and are
(a) Partitioning using minimum substring
(b) Local compression of partitions
Figure 2.2. Partitioning de Bruijn graphs using minimum substrings
thus mapped to the same L1 partition. However in each of the k-mers, the substring occurs at
different positions and are hence mapped to different L2 partitions. The compression begins
with the vertices in L2 partition Pi8 as seeds. The vertices in Pi7 are loaded in a hash map,
merged with the seeds and removed from map. This process continues until the vertices in L2
11
partitions Pi7 through Pi0 are merged with those in Pi8 . We describe our proposed techniques
for the two phases in more detail as follows.
2.3.1. De Bruijn graph creation
Map phase. In this phase, each mapper thread reads the sequences of nucleotides from files
assigned to it. For each sequence of length l , a mapper slides a window of size k to generate
l − k intermediate key-value pairs. A key is a byte-encoded k-mer, and the value is a tuple
consisting of the two nucleotides at the either end of the window. We call them "intermedi-
ate" because multiple short reads may generate the same key whose values must be merged
to obtain a final key-value pair. Conceptually, each key represents a vertex of the de Bruijn
graph, and its intermediate value represents at most one incoming edge and at most one out-
going edge. K -mers at the leftmost and rightmost end of the sequence are padded with special
characters in the value tuple to represent no incoming and outgoing edges respectively. Each
key-value pair is then placed into the appropriate partition based on the minimum substring
of the key.
Load balancing. Unlike conventional hash functions, the minimum substring partitioning
scheme is locality sensitive, and its main purpose is to maximize the probability of a collision
for similar strings. Consequently, the distribution of k-mers across different L1 and L2 parti-
tions is severely skewed. Moreover, we have observed that on a number of datasets, almost
half of the partitions remain empty. The distribution of intermediate key-value pairs among
the non-empty L1 partitions in the Yoruban male dataset can be seen in figure 2.3
There are a few partitions which are much larger than the rest. In the reduce phase where
each node handles a subset of the partitions, if the distribution is not uniform, one of the
peer nodes of the cluster might get more than its share of data consequently dominating the
total execution time, or worse, running out of memory. To tackle this problem, after all reads
are processed, the total number of intermediate key-value pairs for each partition is collected
from all the nodes, and the partitions are stored in a max-heap based on their sizes. They are
then distributed across the nodes using the Longest Processing Time (LPT) scheme.
12
0e+00
2e+08
4e+08
6e+08
8e+08
0 20000 40000 60000
L1 partition id
N
um
be
r 
of
 in
te
rm
ed
ia
te
 k
ey
−
va
lu
e 
pa
irs
Figure 2.3. Imbalance of distribution in intermediate data among partitions
Reduce phase. In this phase, each reducer thread picks up one of the L1 partitions assigned
to it and starts reducing the values corresponding to each unique key in the list. This reduce
phase has two objectives: Firstly, the same k-mer generated from different sequences can have
different nucleotides at its ends. These nucleotides which form the incoming and outgoing
edges of the vertex represented by the k-mer need to be merged in the final de Bruijn graph.
Since the maximum in-degree and out-degree of a vertex is 4, the edges can be encoded in a
single byte. Secondly, the number of times a k-mer (regardless of the outgoing edges) occurs
in the entire dataset needs to be tracked. This frequency information is used in subsequent
stages to recognize erroneous structures in the graph caused by sequencing errors. After this
phase, all subgraphs within each partition are locally compressed.
13
2.3.2. Graph compression phase
Following reduction and local compression, the vertices are distributed among the nodes
for further extension. The vertices that have exactly one incoming edge and at most one out-
going edge are loaded in a distributed hash map. All other vertices are appended to a dis-
tributed list and act as seeds for the extension. After the graph is loaded, the workers traverse
the linear chains, starting from the vertices in the list. Each worker looks up the hash map in
order to get the next vertex in the chain it is working on.
We guarantee that a single path cannot be traversed by more than one thread making our
algorithm lock-free and enhancing scalability. Clearly, there can be an imbalance in load dis-
tribution between different threads depending on the lengths of the paths. However, we have
observed that with real datasets, the imbalance is not severe enough to warrant the overhead
of synchronization.
2.4. Implementation
Our assembler is built on top of ZeroMQ (http://zeromq.org/), which is an embeddable
framework for concurrency and communication. It provides a threading library based on He-
witt’s Actor Model [24] of concurrent computation. The semantics of this model states that an
actor is an independent “universal primitive” for concurrency that can only modify its own
state and exchange atomic messages with other actors. These concepts differ from the idea
of a global state shared by multiple threads, hence actors don’t require locks, semaphores or
critical sections.
ZeroMQ also provides an intelligent transport layer for low-latency communication and
is best suited for small packets. Moreover, it provides uniform message passing semantics
between actors (threads), processes, and boxes (cluster nodes). Accordingly, a pair of ZeroMQ
sockets can talk to each other over inproc, ipc, or tcp protocols.
Figure 2.4 shows the high level system overview of Lazer. It consists of a master process,
which is responsible for the distribution of tasks, synchronization between nodes and load-
14
Master
tcp://
inproc://
Worker
tcp://
Node-manager
inproc://
inproc://
Worker
inproc://
Worker
tcp://
Node-manager
inproc://
inproc://
Worker
Distributed File System
High Speed Interconnect
Figure 2.4. System overview of Lazer
balancing. Moreover, each node runs a slave process consisting of a Node-manager actor
which spawns and manages worker actors. Since a manager and its workers reside in the
same address space, they can connect to each other via inproc protocol. The node-managers
connect to the master via tcp. All processes have access to a distributed file system and can
send/receive messages to/from each other via a high speed network.
This architecture forms the basis on which all the phases are implemented. The subsec-
tions that follow, describe the implementations and actor interactions during each phase in
more details.
2.4.1. Map phase
Figure 2.5 shows the different actors at play and their communication patterns during
the map phase. The node-manager spawns three types of workers: mappers, partitioners
and sinks. All these actors have the same parent process and can communicate via inproc.
Mappers ask node-managers for input batches and the request is relayed to the master. On
receiving a batch, mapper threads parse the reads into (k+2)-mers (i.e. the k-mer and the
nucleotides at its ends). Subsequently, each (k+2)-mer is mapped to one of the available par-
titions and sent to the partitioner actor responsible for it. The process is described in Algo-
rithm 1.
15
Master
tcp://rep
inproc://req
Mapper
inproc://push
tcp://req
Node-manager
inproc://rep
inproc://pull
Partitioner
inproc://push
inproc://req
Mapper
inproc://push
inproc://pull
Partitioner
inproc://push
tcp://dealer
inproc://pull
Sink
Local Storage
inproc://req
Mapper
inproc://push
tcp://req
Node-manager
inproc://rep
inproc://pull
Partitioner
inproc://push
inproc://req
Mapper
inproc://push
inproc://pull
Partitioner
inproc://push
tcp://dealer
inproc://pull
Sink
Local Storage
Figure 2.5. Actor interactions during map-phase
Algorithm 1 Mapper actor
1: function MAPPER
2: while mor e_batches do
3: ZMQ_SEND(manag er,GET _B ATC H)
4: r ead_batch ← ZMQ_RECV(manag er )
5: for read r in r ead_batch do
6: tupl e_ar r ay〈kmer, i n,out〉T ← PARSE_READ(r )
7: for tuple k2 in T do
8: 〈PL1 ,PL2〉← MIN_SUBSTR_HASH(k2.kmer )
9: i d ← PL1 mod |par ti t i oner s|
10: ZMQ_SEND(par ti t i oner s[i d ],k2,PL1 ,PL2 )
11: end for
12: end for
13: end while
14: end function
16
Partitions are maintained as in-memory lists and each partitioner is assigned a subset of
those lists. The amount of intermediate data generated at each node is often much more than
the memory available. Hence, a subset of the data must be spilled onto disks. Here, we answer
the following questions pertaining to data spillage: 1) How do we decide what to spill? and 2)
How is the spilled data stored on disk? Assuming there are p partitions, each node begins with
p empty lists. Partitioners receive intermediate key-value pairs from mappers and append
them to appropriate lists, according to their minimum substrings. Data is appended to lists
in fixed-size chunks, and the total number of chunks across all partitions in a node is tracked.
Whenever a new chunk is created, the total chunk count is compared with a threshold, and if
it is larger than the threshold, the chunk is spilled to disk. By configuring the threshold, one
can limit the memory-per-node that can be used to store the intermediate data. If there is
enough memory, intermediate data isn’t spilled onto disk at all, thus removing the I/O bottle-
neck during reduction.
Since the data must be spilled in a way that allows entire partitions to be read from the
disk at once, storing key-value pairs in flat files is not efficient. Instead, we use a disk-based
NoSQL database to index the spilled data. If each partition maintains a current chunk count,
the unique keys for the spilled blocks can be generated from the partition id and the chunk
id. During reduction, all the ni blocks spilled in a partition Pi can be retrieved by making GET
calls to the database using keys ranging from i .0 to i .(ni −1).
We use RocksDB (http://rocksdb.org/) as the underlying NoSQL database. It is an embed-
dable persistent key-value store based on log-structured merge trees and can provide a partial
ordering of keys based on fixed-length prefixes. Since we dont care how the data is ordered
on disk as long as they are grouped by partitions, we can assign the first log (p) bits of keys as
the prefix. During reduction of a partition Pi , we can iterate over the key-space with prefix i .
Under this scheme, insertion and retrieval perform better than those in databases with total
ordering of keys. Furthermore, RocksDB provides a number of useful features such as multi-
threaded compactions, thread-safe reads, bloom-filters for caching, and several other param-
17
eters that can be configured based on the underlying storage system. All of our experiments
were run on spinning disks, but we believe the reduce phase will get a significant performance
boost if the database is hosted on SSDs. Although RocksDB supports multi-threaded reads, all
writes to the database must be serialized. Hence write access to the database is limited to a
single actor (sink) that handles all PUT requests from the partitioners, as shown in Figure 2.5
and described in Algorithm 2.
Algorithm 2 Partitioner actor
1: function PARTITIONER
2: cache ← l i st s[|L1|][|L2|]
3: while more_kmers do
4: (k2,PL1 ,PL2 ) ← ZMQ_RECV(mapper s)
5: LIST_APPEND(cache[PL1 ][PL2 ],k2)
6: if cache_overflow then
7: ZMQ_SEND(si nk,cache[PL1 ][PL2 ])
8: end if
9: end while
10: end function
2.4.2. Reduce phase
Note that in the map phase, there is no communication between nodes. In other words,
if there are p partitions, each of the nodes will initially contain p buckets to hold the interme-
diate data. This is done in order to achieve a balanced distribution of partitions in the reduce
phase. Before the the intermediate data is merged, they must be shuffled around so that for a
partition Pi all intermediate data generated within Pi in all the nodes must be moved to the
the node responsible for reducing Pi . Figure 2.6 shows how the reducers fetch intermediate
data from the NoSQL database servers located at the different nodes.
Shuffle phase. If the intermediate data is fully contained in memory, the performance of the
shuffle phase becomes almost entirely dependent on the network throughput. We have ex-
perimentally observed that with the high-speed networks available in most HPC clusters, the
memory to memory data transfer does not create any bottlenecks in the process (i.e., the pro-
cessor utilization remains fairly high). However, this changes when the intermediate data is
18
Master
tcp://rep
inproc://req
Reducer
tcp://dealer
tcp://req
Node-manager
inproc://rep
tcp://router
Sink
inproc://req
Reducer
tcp://dealer
inproc://req
Reducer
tcp://dealer
tcp://req
Node-manager
inproc://rep
tcp://router
Sink
inproc://req
Reducer
tcp://dealer
Local Storage Local Storage
Figure 2.6. Actor interactions during reduce-phase
spilled onto disk. In such scenarios, shuffle performance is dictated by the disk read through-
put, which is typically much less than the network throughput. Moreover, if there is a single
disk per node, as is the case in most HPC clusters, multiple threads reading from the disk
increases contention and adversely affects the performance.
The impact due to disk I/O bottleneck is offset to some extent by using a block pre-
fetching scheme on the database server side and a scatter-gather approach on the client side.
Both of these features are enabled by ZeroMQ’s high-speed asynchronous messaging engine.
On the client side, each reducer starts with a set containing all the nodes and scatters requests
for a chunk of intermediate data. This scheme serves two purposes:
1. Since each reducer sends requests to the servers running on all nodes, the database
servers share the load uniformly. Moreover, the fact that each reducer only asks for a
small chunk prevents starvation of other reducers, albeit at the cost of a higher number
of messages in the network.
19
2. Since messages are transmitted and received asynchronously by ZeroMQ I/O threads,
reducers can immediately start working on the chunks as soon as they arrive. Therefore,
the latency is reduced to the time between the transmission of the last request to the
arrival for the first response.
On the database side, when a server receives a request for a chunk of partition Pi from a re-
ducer, it checks if any blocks in Pi reside in memory. Otherwise it fetches one from disk and
responds with it. As the response is being sent, the server checks if the next block is in mem-
ory. If it is not, it pre-fetches one from disk. The pseudo-code for the reducer actor is shown
in Algorithm 3
An interesting observation worth mentioning at this point is that on some datasets, a sin-
gle partition can be so much larger than the others that the reducer handling it might keep on
running long after the others are finished. This occurs more often when the program is run
on a huge number of cores, where each reducer deals with a very small number number of
partitions, most of which have a very small amount of intermediate data, and a select few are
massive.
This effect can be seen in figure 2.3, where partition id 0 clearly stands out from the rest.
This is an inherent disadvantage of using the minimum substring partitioning scheme, some-
thing that comes as a compromise for the reduced memory footprint. Although the effect of
these mega-partitions can be somewhat offset when run in a small cluster, they completely
destroy the scalability as the number of nodes is increased.
Its apparent that the size of the substring chosen for MSP will have a significant impact
of the skewness of the distribution. Thus with some trial-and-error, the substring size can be
chosen in a way such that the problem can be alleviated. However, these optimum sizes may
widely vary across different datasets. Therefore, the substring size must be fixed by the end
user for each new experiment, which is a very undesirable side effect.
To relieve the user of this burden, we only process the smaller partitions with a single
thread. All other partitions which are larger than a preset threshold are reduced and locally
20
Algorithm 3 Reducer actor
1: function REDUCER(PL1 )
2: l i st ← NU LL
3: for PL2 ←|L2|to1 do
4: hashmap ← MERGE(PL1 ,PL2 )
5: l i st ← EXTEND(l i st ,hashmap)
6: end for
7: end function
8: procedure MERGE(PL1 ,PL2 )
9: while nodeset_not_empty do
10: ZMQ_PUBLISH(nodeset ,PL1 ,PL2 )
11: for i = 1 to |nodeset | do
12: (n,chunk) ← ZMQ_RECV(nodeset )
13: if chunk_is_empty then
14: REMOVEFROM(nodeset ,n)
15: else
16: add/merge chunk to hashmap
17: end if
18: end for
19: end while
20: end procedure
21: procedure EXTEND(l i st ,hashmap)
22: for all vertex v in list do
23: if OUT(v) = 1 and v.next ∈ hashmap then
24: APPEND(v,hashmap[v.next ])
25: ERASE(hashmap, v.next )
26: end if
27: end for
28: Move all vertices from hashmap to list
29: end procedure
21
compressed by all available threads within one of the nodes. After running experiments on a
number of datasets we have observed 5×108 to be a good threshold value (it can be modified
by the user if required). The multithreaded algorithm is very similar to the one described in
Algorithm 3, barring a few modifications for synchronization, and is therefore omitted from
this text.
2.4.3. Graph compression phase
During this phase, all partitions of the graph must be loaded into memory for further
compression and be globally visible to all nodes in the cluster. Moreover, traversal of the graph
demands an O (1) lookup of vertices, which implies that they must be stored in a distributed
in-memory hash-map. Since, at this stage there are no guarantees that adjacent vertices will
reside in the same address space, the traversal may require network communication for every
other vertex in the chain which imposes strict latency constraints on the hash-map
We observed that the off-the-shelf distributed hash-tables fail to satisfy the low-latency
constraints required for graph traversal, and the low memory footprint we are trying to achieve.
Moreover, these general-purpose tools come with features such as load-distribution and fault
tolerance, which add extra complexity not necessary for our workload.
To get around these issues, we mimic a distributed hash-map, by using an array of inde-
pendent instances of hash-maps running on each node, and a pre-defined hash-function fh
known to all clients. The servers are oblivious to one another, and the clients maintain con-
nections to all servers in the cluster. For any key-value pair 〈K ,V 〉, the server id at which the
pair will reside is determined by fh(K ) mod n, where n is the number of server instances. This
scheme provides a one-hop access to the hash-map since any client would be able to GET a
value residing at any node, as long as the key is known to it.
We have chosen Sparsehash (code.google.com/p/sparsehash) as the underlying data
structure for the distributed hash-map, since it has the lowest memory footprint. Moreover,
using ZeroMQ’s low-latency message transmission and one-hop accesses to each server in-
stance, we achieve a high-performance map, even with random accesses on billions of keys.
22
To increase the throughput even further, we employ an asynchronous pipeline of GET
requests, where workers simultaneously work on a batch of b paths (simultaneously extend
b seeds). At each step, a worker requests b values from the hash-map corresponding to the
outgoing edges of paths in the current batch. Once again, since message transmissions are
asynchronous, a request returns immediately with a placeholder for a future result. The ra-
tionale here is that if the batch size is large enough, by the time the last request is sent, the
response for the first one would have arrived, thus hiding the network latency. When a path
reaches its end, it is removed from the batch and replaced with a new seed.
2.5. Evaluation
2.5.1. Overview of datasets
To analyze the performance characteristics of Lazer, we use two different datasets: 1) a
synthetic wheat (Triticum aestivum) genome dataset of size 1.1TB from Joint Genome Insti-
tute 1 and 2) a Yoruban male genome dataset of size 452GB from National Center for Biotech-
nology Information 2. During the process of assembly, the first dataset (i.e., the wheat genome)
produces more than 4 TB of temporary data, whereas the second one (i.e., the human genome)
produces almost 750GB.
2.5.2. Overview of experimental testbed
To evaluate the performance of Lazer in a distributed environment, we use QueenBeeII3
supercomputing cluster located at LSU. Each compute node of QueenBeeII has two 10-core
2.8 GHz E5-2680v2 Xeon processors, 64GB DRAM and 500GB hard disk drive. Although Queen-
BeeII has a total of 480 nodes with this configuration, only 128 of them are available for run-
ning a single job. All the nodes are connected by a 56Gb/sec InfiniBand (FDR) interconnect
with 2:1 blocking ratio.
1https://gold.jgi.doe.gov/projects?id=Gp0039864
2http://www.ncbi.nlm.nih.gov/sra/?term=SRA000271
3http://www.hpc.lsu.edu/resources/hpc/system.php?system=QB2
23
2.5.3. Demonstrating the scalability of Lazer
Assembly of wheat genome (1.1TB). To demonstrate the scalability of Lazer, we first assem-
ble the relatively larger wheat genome (1.1TB) with different number of nodes in QueenBeeII
cluster. Figure 2.7 shows the execution time of different phases of Lazer while assembling
the wheat genome. Neither Ray nor SWAP could assemble the 1.1 TB wheat dataset with 128
nodes due to insufficient memory or wall-clock timeout in the cluster.
0
20000
40000
60000
8 16 32 64 128
Number of nodes
E
xe
cu
tio
n 
tim
e 
in
 s
ec
on
ds
Phases
Map
Reduce
Compress
Figure 2.7. Execution times of Lazer on the synthetic wheat W7984 dataset (1.1 TB)
Note that the map phase is the most scalable of all phases which is not surprising since
this phase is embarrassingly parallel and there is no network communication involved in it.
It can also be observed that the compression phase makes up a small fraction of the total ex-
ecution time and is also scalable even though it involves a distributed key-value store which
in-turn is dependent on network latency. We speculate that the asynchronous pipelined ac-
cesses to the hash-map and the lock-less traversal algorithm have a significant impact on the
scalability of compression.
24
On the other hand, the reduce phase appears to scale worse than the others due to the
large amount of spilled intermediate data that must be shuffled during reduction. Since each
node has a single disk, it presents a bottleneck while reading data and prevents a high CPU
utilization.
It is worth mentioning here that we could not assemble this dataset using any other as-
sembler even when using the maximum amount resources (i.e., 128 nodes of QueenBeeII clus-
ter) available to us.
Assembly of human genome (452GB). To compare Lazer’s overall execution time and scal-
ability with other assemblers, we used a relatively smaller human genome data set (452GB).
Figure 2.8 compares the execution time of Lazer with two other genome assemblers, Ray and
Swap while assembling the 452GB human genome. Both axes are in log scale. It can be eas-
ily observed that Lazer scales almost linearly with increase in number of nodes whereas Ray
cannot complete assembly on 4 nodes and SWAP runs out of memory with 32 nodes.
Lazer Ray SWAP
2048
8192
32768
2 4 8 16 32 64 128 2 4 8 16 32 64 128 2 4 8 16 32 64 128
Number of nodes
E
xe
cu
tio
n 
tim
e 
in
 s
ec
on
ds
Figure 2.8. Execution times of Lazer, Swap, and Ray on the Yoruban male dataset (452 GB)
We were unable to run SWAP on 32 nodes and Ray on 4 nodes due to out-of-memory
errors. We observed that Ray’s scalability degrades after 32 nodes, and the execution time
increases going from 64 to 128 nodes. Lazer, on the other hand, can assemble the dataset on
just 2 nodes, and yet shows a near-linear scalability when running on different configurations.
25
2.5.4. Demonstrating memory-efficiency of Lazer
The memory efficiency of Lazer is evident from the fact that we could not run the 1.1TB
wheat genome using the other assemblers even using the maximum amount of resources
available to us i.e., 128 nodes of QueenBeeII providing 8TB (64GB × 128 nodes) of aggre-
gated DRAM. On the other hand, Lazer was able to assemble it using only 8 nodes i.e., 512GB
(64GB × 8 nodes) of memory. Figure 2.8 also demonstrates the memory efficiency of Lazer
compared to Ray and Swap 4. It is apparent that Lazer successfully assembles the human
genome using only 2-nodes i.e., 128GB (64GB ×2 nodes), whereas Ray fails to complete using
4-nodes (i.e., 64GB ×4 nodes = 256GB of DRAM) or higher. SWAP, although better than Ray in
terms of execution time, was found to be the least memory efficient in our evaluation. It re-
quires at least 16 times more aggregate memory than Lazer since it runs out on memory with
32 nodes (i.e. 64GB ×32 nodes = 2T B of DRAM).
As mentioned earlier, there is a significant imbalance in the size of the partitions can eas-
ily overwhelm the hardware resources if they are not carefully distributed among the cluster
nodes. In Figure 2.9, we show the distribution of intermediate key-value pairs among peers
after the intermediate data is shuffled. It can be observed that the larger partitions are scat-
tered across different nodes in a fairly uniform manner so that no single node is assigned more
than one. The low memory requirement can be very significant to bioinformatics researchers
who may not have access to large scale compute clusters with hundreds of nodes. In such
scenarios, Lazer can provide real time solution for assembling huge NGS data set.
2.6. Conclusion
In this chapter, we introduced Lazer, a distributed assembler built using ZeroMQ’s ac-
tor and low-latency communication framework that utilizes a smart partitioning scheme for
de Bruijn graphs to achieve both scalability and memory efficiency. Experimental results on
terabyte-scale datasets show that our framework significantly reduces the memory footprint
4In order to have a fair comparison, we only evaluate the contig generation phases of SWAP and Ray with
single end reads.
26
0e+00
1e+09
2e+09
3e+09
Nodes(32)
To
ta
l n
um
be
r 
of
 in
te
rm
ed
ia
te
 k
ey
−
va
lu
e 
pa
irs
(a) Distribution on 32 nodes
0.0e+00
5.0e+08
1.0e+09
1.5e+09
2.0e+09
2.5e+09
Nodes(64)
To
ta
l n
um
be
r 
of
 in
te
rm
ed
ia
te
 k
ey
−
va
lu
e 
pa
irs
(b) Distribution on 64 nodes
0.0e+00
5.0e+08
1.0e+09
1.5e+09
Nodes(128)
To
ta
l n
um
be
r 
of
 in
te
rm
ed
ia
te
 k
ey
−
va
lu
e 
pa
irs
(c) Distribution on 32 nodes
Figure 2.9. Distribution of intermediate key-value pairs with various cluster configurations
while ensuring fast assembly. As future work, we plan to add support for reading input data
from Hadoop distributed filesystem (HDFS) so that Lazer can be more accommodating to
commodity clusters. We also plan to include error removal and scaffolding phases to create a
complete assembly pipeline.
27
Chapter 3. GPU-Accelerated String Graph Assembly
3.1. Introduction
Amidst the extraordinary progress of accelerator technologies we are observing a grow-
ing interest in general-purpose GPUs (Graphics Processing Units) in various scientific do-
mains such as bioinformatics, molecular dynamics, computer vision, and most notably in
deep learning. We have also noticed a widening gap between the computational capabilities
of general-purpose CPUs and GPUs. For example, the recently introduced NVIDIA Tesla V100
theoretically has 15 TFLOP/s of single precision (FP32) performance and delivers 900 GB/sec
peak memory bandwidth, which is an order of magnitude higher than the 85 GB/sec found
in the latest Intel Xeon Processor (E7-8894 v4). This meteoric rise of GPUs is very timely, es-
pecially when viewed in the context of Next-Generation Sequencing (NGS) technology and
its rapid advancement. The increasing size and decreasing cost of raw datasets has made it
imperative to utilize hardware acceleration in addition to using general-purpose CPUs to ex-
pedite the processing of large genome sequences economically.
Ordinarily, a traditional CPU has a few cores optimized for sequential processing while
a single GPU has thousands of cores for massively parallel processing. Several specialized
tools have already benefited from the extreme computational capabilities of GPUs across a
wide range of bioinformatics applications. These include NVIDIA’s NVBIO [50], CUSHAW [42],
CUSHAW2-GPU [40], BarraCUDA [31], SOAP3 [39], SARUMAN [7], PUNAS [12], and GPU-
based tools with optimized BWT (Burrows - Wheeler Transformation) [14,20] and Smith - Wa-
terman algorithm [60]. However, most of these studies focus on read alignment, whereas the
problem of sequence assembly is not well-addressed due to the limited device memory sizes
in GPUs.
Although few GPU-based genome assemblers have been proposed, they are designed
for small error-free simulated datasets which inherently limits their assembly performance.
This chapter was previously published by Sayan Goswami et.al. as "GPU-Accelerated Large-Scale Genome
Assembly" in 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Reprinted by per-
mission of IEEE.
28
Specifically, the largest dataset they have assembled is in a few tens of gigabytes, which is
much smaller than typical human genome datasets. Given that the high computational ca-
pability of GPUs has not been fully utilized by existing genome assembly tools, it is crucial to
develop a new GPU-based assembly framework that uses adequate techniques to scale up to
the size of real-world human genomes without requiring large amounts of expensive comput-
ing resources.
Several studies have reported that the most challenging step in genome assembly is build-
ing a graph (either a string graph or de Bruijn graph) from the short reads since it requires a lot
of computing power and a large amount of memory [43] [10]. Although GPUs can alleviate the
computational aspect of this issue, the large memory requirement renders them incapable
of processing even the medium-sized datasets produced by sequencers today. To address
the fundamental limitation of GPUs in large-scale genome assembly, we present a new GPU-
accelerated genome assembler, called LaSAGNA, that can assemble datasets with billions of
sequences using a single GPU by building string graphs from approximate all-pair overlaps.
LaSAGNA significantly reduces the memory requirement by employing a semi-streaming ap-
proach that minimizes the number of disk accesses based on the available memory. It can also
run on multiple GPUs across multiple compute nodes to expedite the assembly pipeline.
This work makes several contributions as follows. 1) We develop a new genome assembly
framework that can build an approximate overlap graph from a real-world human genome
sequence dataset (several hundred GB in size) using a single GPU equipped with only 6 GB
device memory. 2) We present a two-level semi-streaming model (from disk to host memory
and from host memory to device memory) that effectively utilizes the memory hierarchy to
reduce the disk I/O in large-scale genome assembly. 3) We implement a distributed version of
LaSAGNA to facilitate faster operation on a cluster of compute nodes. Given that the assembly
workload is heavily I/O-intensive, LaSAGNA can benefit from the larger aggregated I/O band-
width available in distributed computing. 4) We evaluate LaSAGNA using several real-world
genome sequence datasets to exhibit its efficiency in various computing environments. Our
29
experimental results show that LaSAGNA can assemble a 400 GB human genome dataset in 17
hours using a single GPU (NVIDIA K40).
The rest of the chapter is organized as follows. In Section 3.2, we present an exposition of
the related literature followed by our proposed approach and its distributed implementation
in Section 3.3. Lastly, we show our experimental results in Section 3.4 and conclude this study
in Section 3.5.
3.2. Related Work
GPU-Euler [46] is the first genome assembler built for GPUs, followed by GAGM [26],
which proposes paired de Bruijn graph construction and contig generation using a single
GPU. Both assemblers were evaluated using small simulated error-free reads generated from
bacterial genomes. Another GPU-based de Bruijn graph construction tool [43] proposes a
staged graph construction pipeline and reports its results on a 13 GB human chromosome
dataset. GAMS [27] is the first string graph-based assembler that uses GPUs and reports the as-
sembly of a human chromosome using a 16-node GPU cluster. Some works, such as FAssem [58],
have studied the application of FPGAs in assembly, but are also evaluated on small bacterial
datasets.
There are many CPU-based assembly tools, and most of them are based on de Bruijn
graphs. Since the advent of first-generation assemblers such as Velvet [62] and SOAPden-
ovo [44], several assemblers, such as Minia [15] and MEGAHIT [36], have been proposed to
use succinct data structures (e.g., bloom filters) to store large graphs in memory. Many dis-
tributed assemblers have also emerged to expedite the assembly of high-throughput sequenc-
ing datasets. Notable among them are Ray [8] and SWAP [47] built on MPI, Lazer [22] which
uses ZeroMQ, Contrail [54] and Giga [32] based on Hadoop, and HipMer [21] based on the
global-address space model. Among the string graph-based assemblers, SGA [55] is the only
one that can process large datasets on a single node using compressed data structures. To the
best of our knowledge, CloudBrush [13] is the only distributed string graph-based assembler,
but has only been evaluated on small datasets and is still under development.
30
3.3. Methodology for GPU-Accelerated Assembly
The most expensive stage in any string graph-based assembler is to find the overlaps be-
tween all pairs of reads. In theory, one can generate all suffixes and prefixes from R, and create
two lists of key-value pairs, S and P , using suffixes and prefixes respectively as keys and their
corresponding read-IDs (unique identifiers of reads) as values. Next, for each key-value pair
< k,ri > in S, one can add an edge e = (ri ,r j , |k|) in G if < k,r j > exists in P .
Naturally, this is not practical since storing all suffixes (or prefixes) for an input of size n
will require O (n2) space, where n is the total number of bases in R. The space can be reduced
if the actual suffix (or prefix) is replaced by a tuple < l , f > where l is the length of the suffix (or
prefix) and f is its fingerprint. Therefore, for each << li , fi >,r j > in S, we can conclude that
an edge exists between r j and rk with a high probability if P contains << li , fi >,rk >.
However, even with fingerprints, the space requirement (i.e., O (n log2 q) bits where q is
the largest fingerprint value) is often too large to fit in the main memory. Indeed, this approach
has been attempted [5] but evaluated only on small simulated low-coverage datasets. This
problem intensifies in case of large high-coverage real datasets where the reads have a lot of
overlaps and require larger fingerprints to minimize the number of false positive edges.
The situation deteriorates with GPUs since their available device memories are typically
an order of magnitude smaller than the main memory of even a mid-range workstation. To
address this problem, we present a semi-streaming approach for building the string graph by
conceptually dividing the memory hierarchy into a read-only memory, a write-only memory,
and a working memory, as depicted in Fig. 3.1.
Read-only memory
Write-only memory DiskWorking 
memory
Figure 3.1. Conceptual view of memory types
31
The read-only memory can only be read sequentially, whereas the write-only memory
can only be written sequentially. Note that both the read- and write-only memories reside on
disks, but are treated as separate to suggest that a file cannot be read and written at the same
time. The working memory consists of a slower host memory and a faster device memory.
Unlike the read- and write-only memories, we allow random accesses in the host memory.
However, most of the computation is done on the faster device memory, and we minimize
the processing done on the host memory. This two-level streaming model (from disk to host
memory and from host memory to device memory) efficiently utilizes the memory hierarchy
to reduce the amount of disk I/O and enables large-scale sequence assembly. Fig. 3.2 shows
the overview of our assembly pipeline in LaSAGNA consisting of four main phases: map, sort,
reduce, and traverse. We describe each phase in detail below.
3.3.1. Map: generate pairs of fingerprints and read-IDs
In this phase, batches of reads are loaded in the GPU where their reverse complements
are obtained. Next, for each read and its complement, the fingerprints of all their prefixes and
suffixes are generated using the Rabin Karp’s rolling hash. Since short-reads in a sequence
dataset are of similar lengths, the fingerprint generation can be easily parallelized by assigning
each read to a thread. However, on GPUs, this scheme fails to perform as expected due to
excessive memory throttling, a scenario where threads block because of numerous pending
memory accesses. Moreover, this scheme does not utilize shared memory.
These issues can be addressed if a block of threads processes each read in parallel. To
do so, we express prefix fingerprint generation as a Hillis-Steele scan problem. Here, each
batch of reads is processed by a grid of thread blocks where the number of blocks equals the
number of reads in the batch, and the number of threads per block equals the read-length.
Fig. 3.3 illustrates the execution of the kernel used to generate fingerprints of all prefixes of
a read whose length is 10 with a prime modulo (q) value of 13 and a radix (σ) 4. In practice,
the radix is a small prime larger than the alphabet size, and the prime modulo is a large prime
number.
32
Figure 3.2. Overview of assembly pipeline
33
G A T A C C A G T A
3 0 2 0 1 1 0 3 2 0
3 12 2 8 1 5 4 3 1 8
3 12 11 5 7 3 7 5 0 4
3 12 11 5 8 7 2 11 11 5
3 12 11 5 8 7 2 11 7 2
1 4 3 12 9 10 1 4 3 12
40 41 42 43 44 45 46 47 48 49
M
R
E
P
f[n]= (f[n] +
f[n-1]*M[1])%13
f[n]= (f[n] +
f[n-2]*M[2])%13
f[n]= (f[n] +
f[n-4]*M[4])%13
f[n]= (f[n] +
f[n-8]*M[8])%13
Figure 3.3. The communication and computing patterns while generating prefix fingerprints
of GATACCAGTA with radix 4 and prime 13.
Before the kernel is launched, the place values (e.g., σ0, σ1) for all positions of the read are
computed, and their remainders after division by the prime number q are stored in an array M.
This step is done once for the entire program and reused for all reads. In the beginning, each
thread in the grid encodes the corresponding base in the read (R=GATACCAGTA) to the radix σ
(σ = 4) and places it in its corresponding position in array E. Next, each thread iteratively adds
its element to the product of previous element (the current offset determines the previous
element), and its place value. The offset starts at one and is doubled at every step until its
length becomes larger than the read itself. At the end of the algorithm, each position i in the
output array P will store the fingerprint of the prefix ending at position i (i.e., fingerprint of
the prefix of length i +1). For example, in Fig. 3.3, the fingerprints of G , G A, and G AT are 3,
12, and 11 respectively.
34
The suffix fingerprints are computed using the place values (M) and prefix fingerprints (P )
calculated during the previous step as shown in Fig. 3.4. As before, the position i will now store
the fingerprint of the suffix starting from position i . For instance, in Fig. 3.4, the fingerprints
of GATACCAGTA and ATACCAGTA are 2 and 5 respectively.
3 12 11 5 8 7 2 11 7 2P
2 36 36 44 5 80 63 24 33 28P’
*M[9] *M[8] *M[7] *M[6] *M[5] *M[4] *M[3] *M[2] *M[1]
2 10 10 5 5 2 11 11 7 2P’
2 5 5 10 10 0 4 4 8 0S
S[i]=
(2+13-P’[i])%13
P’[i]=
P’[i]%13
Figure 3.4. The communication and computing patterns while generating suffix fingerprints
using prefix fingerprints
This scheme reduces the number of memory transactions and raises the functional unit
utilization of GPUs. Moreover, prefixes and suffixes are processed in a single kernel using
shared memory, thus improving performance and avoiding scattered writes during suffix fin-
gerprint generation.
Partitioning. To find overlaps between suffixes and prefixes, their lengths ( j ) and source
read-IDs (ri ) are appended to their fingerprints ( fi j ) to create a set of 3-tuples. Since we need
to find matches between suffixes and prefixes of the same length, it is logical to partition these
tuples by their length. Therefore, we sort all ( fi j ,ri ) pairs by their corresponding lengths j
and count the number of occurrences of each unique j to obtain the size of each partition. In
other words, the partitioning process converts a list of ( j , fi j ,ri ) tuples to lmax lists of ( fi j ,ri )
tuples, where lmax is the length of the sequences. Out of these lists, the ones corresponding to
a length smaller than a user-defined minimum overlap-length (lmi n) are discarded. Moreover,
the partition belonging to lmax is also dropped to avoid self-loops in the graph. The remaining
ones are written to disk, each into a file corresponding to the partition.
35
3.3.2. Sort: sort read-IDs by fingerprints
To enable binary searches of suffixes in a list of prefixes, in this phase we sort read-IDs by
their fingerprints belonging to each partition corresponding to length li ∈ [lmi n , lmax). Since
the number of suffixes/prefixes is much larger than the GPU memory, we use an external-
memory sorting scheme comprising two phases. In the first phase, chunks of key-value pairs
(of size M such that M elements fit in GPU memory) are read from disk, sorted by keys, and
written back to disk. In the next phase, these chunks are iteratively merged into a single sorted
one.
Note that the size of the chunks being merged will double every iteration, and hence the
merging must also be done using external-memory. Algorithm 4 explains our approach to
combine two sorted lists on disk, by processing at least M/2 and at most M elements at a
time. It is adapted from the k-way merging scheme and operates under the guiding principle
that the input cannot be randomly accessed.
The merge algorithm consumes two sorted lists of key-value pairs (kvl A, kvl B ) and emits
a single sorted list (kvlC ). We slide windows of size M/2 on both lists (lines 3, 4) and read
chunks of data from both. If either of the windows reaches the end of the list, the function
copies the remaining elements of the other list to the output (line 19) and finishes. Otherwise,
when the largest key in one chunk is not larger than the smallest key in the other, all items in
the smaller chunk are appended to the output (lines 5, 6). On the other hand, if the chunks
cannot be ordered entirely, we resize the windows in a way that the largest key within the
current pair of windows is not larger than the smallest key in the subsequent ones. This is
done by finding the upper-bound of the smaller of the last keys in both chunks, in the chunk
containing the larger of those two keys.
The upper-bound of a key k in a sorted array A is the index i such that A[ j ] ≤ k,∀ j < i
and A[ j ] > k,∀ j > i . This can be easily obtained with a binary search in the host memory
(lines 8 to 15). Note that this step is only required if the last elements in both windows are
unequal, but this check is omitted from the pseudo-code for brevity. After the windows are
36
Algorithm 4 External memory merging
Input: A sorted list kvl A of key-value pairs.
Input: A sorted list kvlB of key-value pairs.
Input: A count M of key-value pairs that fit in memory
Output: kvl A and kvlB are merged into sorted list kvlC
1: procedure MERGE(kvl A ,kvlB )
2: repeat
3: A ← next M/2 key-value pairs from kvl A
4: B ← next M/2 key-value pairs from kvlB
5: if A ≺ B then kvlC ← A
6: else if B ≺ A then kvlC ← B
7: else
8: k ← MIN_KEY(AM/2,BM/2)
9: if k = AM/2 then
10: r ank ← UPPER_BOUND(k,B)
11: RESIZE(B ,r ank)
12: else
13: r ank ← UPPER_BOUND(k, A)
14: RESIZE(A,r ank)
15: end if
16: kvlC ← GPU_MERGE(A,B)
17: end if
18: until one of the lists is empty
19: kvlC ← any remaining elements from A or B
20: return kvlC
21: end procedure
37
equalized, the pairs in A and B are copied to device memory where they are merged by keys
and appended to the output (line 16).
Algorithm 4 takes O (r /md ) iterations where r is the number of keys to be merged, and
md is the maximum number of elements that fit in the device memory. Since each iteration
performs radix sorting in O (md ) [48], Algorithm 4 has time complexity O (r ).
Sorting in hybrid-memory. To reduce the number of disk passes during this phase, we use
host memory as a buffer between the disk and device memory. This hybrid-memory sort pro-
cedure works exactly as before, but in two levels. In the top level, it reads large chunks of size
mh (such that mh key-value pairs fit in host memory) from disk to host memory, sorts them,
and writes them back to disk. Next, these large sorted blocks are streamed from disk and iter-
atively merged (like Algorithm 4, but in host memory) in batches of mh/2 elements per block,
until a single sorted block remains.
Underneath the top level, the sorting and merging of large blocks in host memory are
done by streaming them in smaller chunks of md to device memory and sorting and merging
them on the GPU. This makes up the second level of the scheme. With this optimization, even
though the number of merge passes in device remains the same, the number of disk passes
can be reduced to 1+ log(n/mh), or by a factor of log(mh/md ), which is typically about 3-4
times.
3.3.3. Reduce: find suffix-prefix matches
This phase consumes two lists Sl and Pl containing tuples of the read-IDs and fingerprints
of their l -length suffixes and prefixes respectively, both sorted by fingerprints. For each read-
ID in Sl , it searches Pl for read-IDs having the same fingerprint, and adds an edge to the string
graph for each matching read-ID. Having sorted lists enables us to stream two windows WS
and WP from suffixes and prefixes respectively, such that they fit in memory and a fingerprint
present in WS cannot be present in any window except WP . Therefore, we can process each
partition with a single disk pass. We repeat this process for all pairs of Si and Pi for each
i ∈ [lmax , lmi n).
38
Algorithm 5 Overlap detection
Input: A sorted list kvls f x of suffix-fingerprints & read-IDs
Input: A sorted list kvlp f x of prefix-fingerprints & read-IDs
Input: A count M of key-value pairs that fit in memory
Input: A string graph G
Output: Updated string graph G
1: procedure REDUCE(kvls f x ,kvlp f x )
2: repeat
3: S ← next M/2 key-value pairs from kvls f x
4: P ← next M/2 key-value pairs from kvlp f x
5: f ← MIN_KEY(SM/2,PM/2)
6: RESIZE(S, LOWER_BOUND( f ,S))
7: RESIZE(P, LOWER_BOUND( f ,P ))
8: L ← GPU_VEC_LOWER_BOUND(S,P )
9: U ← GPU_VEC_UPPER_BOUND(S,P )
10: C ← GPU_VEC_DIFFERENCE(U ,L)
11: for rsi ∈ S do
12: if ci ∈C > 0 then
13: for j ∈ [Li ,Li + ci ) do
14: G ← (rsi ,rp j ),rp j ∈ P
15: end for
16: end if
17: end for
18: until one of the lists is empty
19: return G
20: end procedure
Algorithm 5 shows the pseudo-code of the approach described above. We stream data
from the list of suffixes and prefixes into host memory with at most M/2 key-value pairs per
window (lines 3 and 4). Next, we find the smaller of the largest fingerprints from either window
( f ) and calculate the lower-bound of f in both the windows. The lower-bound of a key k in a
sorted array A is the index i such that A[ j ] < k,∀ j < i and A[ j ] ≥ k,∀ j > i . Based on the lower-
bound values, we resize both windows (lines 6 and 7) to contain the same range of fingerprints.
Once the windows are resized, we obtain the number of occurrences of each suffix finger-
print in the prefix window. Note that, by the definition of upper- and lower-bounds, if a sorted
array A contains p consecutive occurrences of an item k starting at index i , the lower-bound
of k in A is i , and its upper-bound is i +p. On the other hand, if k is not present in A, there
must exist an index j such that A[ j ] < k < A[ j +1] and, both the upper- and lower-bounds are
39
j . Thus, the number of occurrences of an element in a sorted array is the difference between
its upper- and lower-bound in the array. If the difference is non-zero, the lower-bound yields
the position of its first occurrence in the array. Therefore, we load both the suffix and prefix
windows in the GPU and calculate the upper-bound (U ) and lower-bound (L) of each suffix
fingerprint in the array of prefix fingerprints and the difference (C ) between them (lines 8 -
10). We then iterate through C and, for all i such that ci > 0,ci ∈C , we create an edge from the
vertex corresponding to the read with suffix si ∈ S to that with prefix p j ∈ P for j ∈ [Li ,Ui ).
Our approach of building the graph is greedy, so each vertex (read-ID) will have at most
one incoming edge and at most one outgoing edge. We maintain a bit-vector to store the out-
degree information of all vertices. Upon receiving a request to add a candidate edge (u, v, l ), l ∈
[lmi n , lmax), we check the bit-vector to find out if either the vertex u or v ′ (WC complement of
v) has an outgoing edge, and if so, discards the edge. If both vertices have no outgoing edge,
we add edges (u, v, l ) and (v ′,u′, l ) to the graph and update the bit-vector.
Note that the graph is built in the host instead of the GPU since for large datasets it can
easily exceed the device memory. For instance, the graph of a human genome with 2.5 bil-
lion edges, each containing a 4-byte vertex-ID and a 1-byte overlap length, takes about 12 GB
of memory. Besides, due to of the nature of string graphs, adding an edge (u, v) to a graph
that is being updated by multiple threads involves acquiring locks for u and v ′. We have ob-
served (using CUDA’s native atomics library) that such an approach detrimentally influences
the performance when implemented in GPUs.
Like Algorithm 4, Algorithm 5 takes O (r /md ) iterations where r is the number of reads,
and md is the maximum number of elements that can be stored in the device memory. More-
over, the amortized time complexity of each iteration is O (md logmd /p +md × cav g ) where p
is the number of processors for the vectorized upper- and lower-bound calculation and cav g
is the average number of overlaps per read. Since logmd < p in practice (e.g., NVIDIA P100
has 16 GB memory and 3,584 cores), Algorithm 5 has time complexity O (r +C ) where C is the
number of overlaps between reads.
40
3.3.4. Compress: traverse paths and generate contigs
This phase consists of two stages. In the first stage, we traverse the string graph to obtain a
set of paths, where each path pi is represented as a sequence of tuples, and each tuple consists
of a read-ID (ri j ) and its overhang-length (li j ). The overhang-length of a read ru having an
overlap with another read rv is defined as (lu −ouv ) where lu is the length of ru and ouv is the
overlap between ru and rv . Since a vertex in our string graph can have at most one outgoing
edge, a read can only have a single overhang-length. Reads having no overlap with others have
overhang-lengths equal to their lengths.
Traversal begins with vertices with in-degree 0 and out-degree 1 as seeds. Next, from each
seed, we continue to extend the path by appending the read-ID and overhang-length of the
current vertex to the sequence of tuples, and stop after we encounter a vertex with no outgoing
edge. Since the graph resides in host memory, this stage is performed using multiple threads
in the host. However, even for our largest test dataset (i.e., real-world human genome), it takes
less than a minute to obtain the paths in host memory, so its effect is insignificant.
In the second stage, we convert read-IDs belonging to the paths in the string graph to their
corresponding input sequences to generate contiguous sections (contigs) of the original DNA
sequence. This is done by laying out the overhang-lengths of paths in order of their read-IDs,
streaming the original reads, and placing their overhangs in their appropriate offsets.
Fig. 3.5 shows the process in greater detail. For any path pi in the list of paths P , let its
length li be defined as the number of (ri j , li j ) pairs in it. We load P to GPU and, for each
pi ∈ P , we calculate its offset oi within P using an exclusive prefix-scan operation. Next, for all
reads ri j in path pi , we perform another scan of their overhang-lengths li j to obtain the size
of memory required to store the contig corresponding to pi , as well as the offset oi j of each
read in the set of all contigs.
Since a vertex has at most one incoming and one outgoing edge, a read can belong to
at most one path. Hence, each overhang-offset tuple is copied to the unique location corre-
sponding to its read-ID with a gather operation in GPU (i.e., using the array of read-IDs as a
41
Path lengths
Path offsets
Read ID's
Overhang lengths
Overhang offsets
Overhang lengths
Overhang offsets
Gather overhang lengths and offsets 
according to read-IDs
Figure 3.5. Generating contigs from paths in string graph
stencil). Finally, short-reads are streamed from disk and, for each read ri , we place the sub-
string ri [1 · · · lri ] into offset ori in the memory allocated for contigs. This concludes our contig
generation phase.
3.3.5. Distributed graph-building
Among all phases in the pipeline, sorting takes the longest (more than 50% of the total ex-
ecution time), closely followed by fingerprint generation (about 25% of the total time). Since
fingerprints can be independently generated for each sequence, and each partition is sepa-
rately sorted, these phases can be distributed across multiple GPUs. However, we have ob-
served that the most prominent bottleneck in the pipeline is the I/O throughput. To mitigate
this, we distribute the computation across multiple nodes for exploiting a higher aggregate
I/O bandwidth.
The distributed implementation of LaSAGNA assumes the presence of a distributed file
system that stores the input sequences and the output graph. Each node also has access to
private storage for shuffling and sorting intermediate data. The storage can be located in a
network-mounted file system, but must not be shared across nodes since the bulk of the I/O is
performed here. In practice, LaSAGNA will benefit from the use of local disks and faster media
42
such as solid-state drives. GASNet 1 active messaging library handles the remote spawning
of processes and subsequent communications. Each node runs a process that communicates
and synchronizes with peers and calls CUDA functions. One of the peers is chosen as a master
and charged with load balancing.
Map. In the map phase, each node (including the master itself) requests the master for the
address of an input block. It then processes the sequences in the blocks, generates (finger-
print, read-ID) tuples, and writes them into local disk. Like the single-node implementation,
the tuples are split into partitions based on the length of suffixes and prefixes, and each parti-
tion is maintained in a separate file.
Shuffle and sort. Each node works on separate partitions and must aggregate the data as-
signed to it from other peers before sorting. To do this, each node sends a sequence of active
messages to its peers. On reaching the destination, a message reads from the file correspond-
ing to the partition requested and responds with a chunk of data.
Reduce. Since edges are added to the graph in descending order of their lengths, and (finger-
print, read-ID) pairs are also partitioned by lengths, a node reducing partition pi with i -length
suffixes/prefixes must wait for the node reducing pi+1 to finish. Even with a different parti-
tioning scheme, if multiple nodes greedily update a graph in parallel, it would not be scalable
since adding an edge (u, v) needs locks on both u and v , which may reside on different nodes.
Nonetheless, some scalability can still be attained using the observation that most of the ex-
ecution time is spent on reading the suffixes and prefixes from disk and finding the overlaps
between them, whereas adding edges is relatively much faster.
We apply this insight to distribute the reduce phase as follows. Each node in charge of
a set of partitions processes them in descending order of their corresponding lengths. There
exists a bit-vector v that stores the number of outgoing edges of all vertices and initially resides
inside the node containing partition Plmax−1 corresponding to length lmax−1. Upon obtaining
the set of suffix-prefix overlaps for partition pi , a node waits until it receives v from the node
1https://gasnet.lbl.gov/
43
in charge of pi+1. On arrival of v , the node then uses it to create greedy edges from overlaps
and forwards v to the node processing pi−1.
Note that this scheme avoids storing the entire graph in a globally accessible data struc-
ture. Instead, it is stored as disjoint sets of edges, and the sets are distributed among the dif-
ferent nodes. On the downside, however, if to is the average time spent to find overlaps, and tg
is the average time spent to build the graph (and to send the bit-vector to the next node), the
expected total time to finish processing p partitions using an n-node cluster is to×p/n+tg ×p.
The scalability of this phase is therefore limited to nmax = to/tg nodes, after which the graph
building step dominates the computation time.
3.4. Evaluation
In this section, we report the experimental results of LaSAGNA for various real-world se-
quence datasets across multiple computing environments.
3.4.1. Datasets
Table 3.1 shows the datasets used in the evaluation of LaSAGNA. All datasets used were
obtained from Illumina sequencing machines, and the longest sequence length varies from
100 to 150 base pairs. The minimum overlap lengths used to build the string graphs are 63 for
H.Chr 14 and H.Genome (lengths of 101 and 100 respectively), 85 for Bumblebee (length of
124), and 111 for Parakeet (length of 150), as suggested by the SGA assembler.
Table 3.1. Illumina datasets used for evaluation
Dataset Length Reads Bases Size
H.Chr 14 [53] 101 45,711,162 4,559,613,772 9.2 GB
Bumblebee [53] 124 316,172,570 33,562,702,234 85 GB
Parakeet2 150 608,709,922 91,306,488,300 203 GB
H.Genome3 100 1,247,518,392 124,751,839,200 398 GB
2https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP002324
3http://www.ncbi.nlm.nih.gov/sra/?term=SRA000271
44
3.4.2. Testbeds and implementation
To evaluate LaSAGNA, we use three types of computing environments. On QueenBee II4
cluster, we use a node with two NVIDIA Tesla K40 GPUs, two 10-core 2.8 GHz E5-2680v2 Xeon
processors, and 128 GB main memory. We use only one GPU for our evaluation. On SuperMic5
cluster, we use a node with one NVIDIA Tesla K20X GPU, two 10-Core 2.8GHz Ivy E5-2680 Xeon
processors, and 64 GB main memory. For evaluating the distributed version of LaSAGNA, we
use multiple nodes of this type, connected by 56 Gb/s InfiniBand. On NVIDIA PSG6 cluster,
we use three types of nodes, each equipped with NVIDIA V100, P100, or P40 GPUs. Each
node has two 16-core Haswell E5-2698v3 processors and 256 GB main memory. Even though
this cluster provides several high-end GPUs, we could not use them for the entire assembly
pipeline because of the limited disk space per account.
LaSAGNA uses 128-bit fingerprints (two 64-bit values generated with different radixes and
primes) since we have observed that it yields zero false positive edges across all the datasets
that we have tested. LaSAGNA is built primarily with the programming primitives and data
structures provided by NVIDIA’s Thrust 7 library included with CUDA 7.5 toolkit.
3.4.3. Single-GPU genome assembly
Genome assembly times. Table 3.2 and Table 3.3 show the total assembly times with de-
tails of each phase for different datasets on a QueenBee II node (128 GB host memory and
one NVIDIA K40 with 12 GB device memory) and on a SuperMic node (64 GB host memory
and one NVIDIA K20X with 6 GB device memory) respectively. The results demonstrate that
LaSAGNA can finish the entire genome assembly pipeline for a 398 GB human genome dataset
within 16.5 hours using a single GPU (NVIDIA K40). To the best of our knowledge, this is the
first GPU-based assembler that can assemble a real-world human genome dataset on a single
node.
4http://www.hpc.lsu.edu/resources/hpc/system.php?system=QB2
5http://www.hpc.lsu.edu/resources/hpc/system.php?system=SuperMIC
6http://psgcluster.nvidia.com/
7https://developer.nvidia.com/thrust
45
Table 3.2. Single node assembly times on 128 GB host memory and 12 GB device memory
(K40)
H.Chr 14 Bumblebee Parakeet H.Genome
Map 5m 32s 33m 20s 1h 40m 58s 2h 43m 15s
Sort 9m 36s 1h 21m 0s 4h 57m 56s 11h 05m 45s
Reduce 4m 47s 26m 6s 1h 17m 31s 2h 20m 33s
Compress 6s 20s 26s 57s
Load 25s 3m 9s 5m 57s 10m 39s
Total 20m 26s 2h 23m 55s 8h 2m 48s 16h 21m 09s
Table 3.3. Single node assembly times on 64 GB host memory and 6 GB device memory (K20)
H.Chr 14 Bumblebee Parakeet H.Genome
Map 5m 59s 36m 8s 1h 47m 58s 2h 50m 28s
Sort 11m 12s 1h 35m 25s 5h 41m 23s 14h 53m 21s
Reduce 4m 26s 27m 35s 1h 14m 13s 2h 31m 43s
Compress 5s 19s 26s 56s
Load 23s 2m 51s 5m 31s 11m 48s
Total 22m 5s 2h 42m 18s 8h 49m 31s 20h 28m 16s
Even with a smaller host and device memory (64 GB and 6 GB respectively), LaSAGNA
can assemble the H.Genome dataset in 20.5 hours. Note that the execution times increase
only slightly for all other datasets except H.Genome. This is because, for H.Genome, a host
with 128 GB memory can sort an entire partition (2.5B keys) in a single disk pass, but one with
64 GB memory needs to merge two sorted lists with an additional disk pass. The run-times of
the other phases are similar on both machines because LaSAGNA performs the same amount
of I/O.
Memory usage. Table 3.4 shows the peak host and device memory usage during the various
phases of assembly for different datasets on a QueenBee II node with 128 GB host memory and
one NVIDIA K40 with 12 GB device memory. Table 3.5 shows the results when we run on a Su-
perMic node with 64 GB host memory and one NVIDIA K20 with 6 GB device memory. In both
cases, the device memory usage is almost identical for all datasets because a fixed amount of
device memory is allocated for each phase regardless of the data size, and the device memory
46
Table 3.4. Peak memory usage (in GB) on 128 GB host with K40
Dataset
Peak Host Memory Peak Device Memory
Map Sort Red. Contig Map Sort Reduce
H.Chr 14 14.48 14.92 16.87 16.78 10.74 6.46 4.89
Bumblebee 14.64 34.40 19.55 22.14 10.74 9.02 4.92
Parakeet 16.82 59.21 28.64 28.39 10.73 9.02 4.92
H.Genome 16.39 103.73 38.11 44.24 10.73 9.02 4.92
Table 3.5. Peak memory usage (in GB) on 64 GB host with K20
Dataset
Peak Host Memory Peak Device Memory
Map Sort Red. Contig Map Sort Reduce
H.Chr 14 7.23 9.71 8.99 9.01 5.41 4.54 2.47
Bumblebee 9.03 30.04 13.34 18.14 5.41 4.54 2.50
Parakeet 8.84 54.20 19.48 22.79 5.40 4.54 2.50
H.Genome 9.18 54.66 31.31 38.95 5.40 4.54 2.50
assigned is fully utilized except for H.Chr 14 on K40. On the other hand, the host memory
usage exhibits wide variations depending on the dataset because we maximize the host mem-
ory usage. Note that, during the sorting phase of H.Genome on 128 GB RAM, the maximum
available memory is used to minimize the number of disk passes although LaSAGNA succeeds
even on 64 GB RAM. The host memory used during the contig generation phase depends on
the size of the graph and contigs.
Performance comparisons. Table 3.6 compares the assembly times (in seconds) of LaSAGNA
with those of SGA (version 0.10.15), the representative string graph-based assembler. We
choose SGA since, to the best of our knowledge, it is the only string graph-based assembler
that can handle large datasets on a single node. We do not include the results of de Bruijn
graph-based assemblers because most of them are not designed for processing large datasets
on a single machine (i.e., failed with out-of-memory error). The SGA pipeline consists of mul-
tiple phases including error correction, but for fair comparisons, we only consider the prepro-
cess, index, and overlap phases. We built the index with the ropebwt algorithm, which is faster
and takes a smaller amount of memory. Our results demonstrate that LaSAGNA is 1.89x-3.05x
47
faster than SGA while also being more memory-efficient due to a better utilization of the mem-
ory hierarchy.
Table 3.6. Comparison between SGA and LaSAGNA
Dataset
SGA LaSAGNA
64 GB 128 GB 64 GB 128 GB
H.Chr 14 3081s 3039s 1325s (2.33×) 1226s (2.48×)
Bumblebee 26360s 23958s 9738s (2.71×) 8635s (2.77×)
Parakeet 93747s 88229s 31771s (2.95×) 28968s (3.05×)
H.Genome OOM 111024 73696s 58869s (1.89×)
Effects of host and device memory sizes in sorting. We also study the effects of host and
device memory sizes on the sorting phase, the most time-consuming task in our assembly
pipeline. For these experiments, we use the data generated from H.Genome, containing about
2.5 billion pairs of 128-bit keys and 32-bit values per partition. We use the term block-size to
denote the number of key-value pairs in a block. Fig. 3.6 shows the effects of using different
block-sizes in host and device on the average sorting time per partition on an NVIDIA K40
GPU. It shows that increasing the device block-size decreases the execution time since it re-
duces the number of merge passes on the GPU. More importantly, it shows that the effect of
device block-sizes is small compared to that of host block-sizes. The significant performance
improvement with a larger host block-size (because of fewer disk passes) indicates that the
I/O is the most critical factor in sorting. Note that, since the sorting can be performed in a
single disk pass with a host block-size of 2.56 billion, we do not expect any performance im-
provement beyond this point.
Effects of different GPUs. Fig. 3.7 shows the average sorting times for various host block-
sizes with a fixed device block-size of 20 million on different GPUs (K40, P40, P100, and V100).
Increasing the host block-size results in a logarithmic decrease in the execution time until it
reaches 2.56 billion. V100, NVIDIA’s latest high-end GPU, is the fastest since it has much more
CUDA cores (5,120) at higher GPU Boost clock rate (1530 MHz), and provides the highest peak
memory bandwidth (900 GB/sec). Interestingly, P40 is consistently slower than P100 even
48
20 320 640 1280 2560
6
0
0
8
0
0
1
0
0
0
1
2
0
0
Host Block Size (millions of key−value pairs)
E
xe
c
u
ti
o
n
 T
im
e
 (
s
e
c
o
n
d
s
)
Device Block Sizes
20 million
40 million
80 million
160 million
Figure 3.6. Effects of different host and device block-sizes
20 320 640 1280 2560
4
0
0
6
0
0
8
0
0
1
0
0
0
1
2
0
0
1
4
0
0
Host Block Size (millions of key−value pairs)
E
xe
c
u
ti
o
n
 T
im
e
 (
s
e
c
o
n
d
s
)
Device
K40
P40
P100
V100
Figure 3.7. Effects of different GPUs and host block-sizes
49
though P40 has a larger device memory (24 GB vs. 16 GB) and more cores (3840 vs. 3584). This
is because P40 has much smaller memory bandwidth than P100 (346 GB/s vs. 732 GB/s).
Another interesting observation is that, as the host block-size decreases (increasing the
number of disk passes), the performances of the different GPUs appear to converge since
the sorting phase increasingly becomes more I/O bound. This reinforces the efficacy of our
hybrid-memory model in saturating GPU performance during the sorting phase. We also ar-
gue that the hybrid-memory model can apply to other types of workloads (e.g., MapReduce-
like processing) that require sorting since we can process large-scale datasets for such work-
loads on a single GPU (or a few GPUs) in a scalable and efficient manner.
3.4.4. Distributed execution times
To evaluate the scalability of LaSAGNA, we report the execution times of the different
phases by varying the number of compute nodes from 1 to 8 on the SuperMic cluster, as shown
in Fig. 3.8. The results show that LaSAGNA can assemble a 398 GB human genome dataset in a
little over 5 hours on an 8-node cluster of NVIDIA K20s. LaSAGNA performs better with more
nodes because of the distribution of data and computations during the map phase, and the
larger aggregated I/O bandwidth during the sort phase.
However, LaSAGNA does not demonstrate the optimal scalability because scaling out
from a single node introduces the additional overhead of an all-to-all data transfer before the
sorting phase, as shown in Fig. 3.8. Moreover, as previously explained, the data dependency
among GPUs in the graph-building stage of the reduce phase limits the scalability of the cur-
rent implementation. To resolve this, are working on partitioning the suffixes/prefixes based
on their fingerprints rather than on lengths. We also plan on processing the string graph in
parallel using a bulk-synchronous processing model. We leave further optimizations to the
shuffle and reduce phases as our future work.
3.5. Conclusion
In this chapter, we have presented LaSAGNA, a new GPU-accelerated genome assembler,
that can assemble large-scale sequence datasets on a single GPU by constructing string graphs
50
1 2 4 8
H.Chr 14
0
2
0
0
4
0
0
6
0
0
8
0
0
1
2
0
0
1 2 4 8
Bumblebee
0
2
0
0
0
4
0
0
0
6
0
0
0
8
0
0
0
1 2 4 8
Parakeet
0
5
0
0
0
1
5
0
0
0
2
5
0
0
0
1 2 4 8
H.Genome
0
2
0
0
0
0
4
0
0
0
0
6
0
0
0
0
Map Shuffle Sort Reduce
Number of Nodes
E
xe
c
u
ti
o
n
 t
im
e
 (
s
e
c
o
n
d
s
)
Figure 3.8. Execution times on different node configurations
from approximate overlaps using fingerprints. Our approach uses a two-level semi-streaming
model that exploits the speed of GPU device memory as well as the large capacity of host
memory. Our experimental results demonstrate that LaSAGNA can assemble a 400 GB hu-
man genome dataset in 17 hours using a single GPU (NVIDIA K40). In a distributed setting,
51
LaSAGNA can finish the entire genome assembly pipeline for the dataset in a little over 5 hours
using an 8-node cluster. To the best of our knowledge, LaSAGNA is the first GPU-based assem-
bler that can assemble a real human genome dataset on a single node.
52
Chapter 4. Third Generation Sequencing & Future Work
4.1. Third Generation Sequencing
Next-generation sequencing (NGS) technology has been the workhorse for most sequenc-
ing projects in the past decade and has enabled inexpensive sequencing and assembly of
many new genomes and population-scale [35] analysis. However, the assemblies produced
by NGS machines were not as high-quality as those yielded by their more expensive predeces-
sors since they are generally quite fragmented and have large sections of the genome missing
from them, especially for mammalian datasets. Some of the issues with NGS can be allevi-
ated using a new class of sequencing techniques known as third generation sequencing (3GS).
3GS sequencers are capable of producing much longer reads than those of NGS methods and
are generally applied to a single DNA molecule as opposed to the millions or even billions of
clones used in NGS.
Examples of long-read single-molecule sequencing technologies include PacBio, Oxford
Nanopore, Illumina Tru-seq, although this list is by no means, exhaustive. PacBio, which in-
troduced itself in 2010 with the RS single-molecule real-time (SMRT) sequencer, is arguably
the most popular 3GS sequencing technology. Since then, PacBio has released several versions
of instruments and reagents that are expected to increase the throughput and reduce the cost
per base pair. However, whole genome sequencing with PacBio is orders of magnitude more
expensive than that using NGS technologies and the error rates vary from 10% to 15% which
is also much higher than NGS.
On the other end of the spectrum, we have Oxford Nanopore, which aims at making low-
cost sequencing solutions accessible to end users. Although their error rates are higher than
those of PacBio datasets, the reads produced by Nanopore technologies are longer and much
less expensive to produce than PacBio. Moreover, Oxford Nanopore seems to lay special em-
phasis on portability and this is evident from their recent product offerings. For instance,
Oxford Nanopore MinION, which has been available since 2015, is a handheld device weigh-
ing under 100g and the only computing resource it requires is a PC or a laptop with a USB
53
connection. The throughput of Nanopore is much lower than PacBio but it has been under
active development and both the sequence quality and throughput is expected to get better.
4.2. Advantages and Challenges of 3GS
One of the fundamental challenges of genome assembly is that contrary to the random-
string model, DNAs have long repetitive sections known as repeats. These can be very difficult
to resolve if the short-reads are not long enough to cover either side of it. This scenario is de-
picted in figure 4.1 where the genome has several repeated sections, shown by the purple, yel-
low and green segments. The length of the short reads in Figure 4.1a is the same as the length
of repeats although, the repeats may be several orders of magnitude longer than the reads. In
(a) Short reads from a genome with repeats
(b) Aligning short-reads to recreate original sequence
Figure 4.1. Erroneous assembly due to repeats
case of de novo assembly, since there is no reference genome to align the short reads, there
54
can be more than one way to merge them and the final assembly may not correspond to the
original sequence. As shown in Figure 4.1b, a perfectly valid arrangement of the short-reads
obtained in Figure 4.1a leads to an incorrect assembly due to the ambiguity of the placement
of repeats.
The main advantage of third generation sequencers is that the reads are orders of magni-
tude larger than NGS reads. For instance, read lengths of Illumina datasets mostly lie between
100 to 250 base pairs whereas the average length of PacBio reads can be several thousand.
Moreover, the longest reads generated by PacBio are generally in tens of thousands and those
generated by Nanopore may be in hundreds of thousands. 3GS sequence datasets are there-
fore much better at resolving repeats since the reads are able to span entire repeats and align
reads without ambiguity. Indeed 3GS datasets report much more contiguous assemblies than
their NGS counterparts.
The primary disadvantage of 3GS technology, however, is the high error rate in the se-
quences. This necessitates long read assemblers to dedicate a lion’s share of the execution
time to detect and correct errors in the dataset. Moreover, unlike NGS datasets which contain
mostly substitution errors, 3GS datasets have a much higher number of indel errors which
requires a much more computationally expensive gapped-alignment of reads in order to cor-
rect them. However, no distributed 3rd generation sequence assembler exists and as a conse-
quence, practitioners have to wait for days to get the assembly results even though they have
access to multiple machines in a supercomputing cluster.
In this chapter, we present Hiper3ga, a distributed long-read assembler that is capable
of assembling large datasets by scaling-up on multiple machines. To the best of our knowl-
edge, this the first distributed 3rd-generation assembler reported in the literature. Although
Hiper3ga does not include a base-level error correction step, the resulting contigs can be pol-
ished using standalone consensus generation tools. Using Hiper3ga, we were able to assemble
the reads in a Nanopore human genome dataset in under 30 minutes which is almost two or-
ders of magnitude faster than the existing state-of-the-art assemblers. Moreover, Hiper3ga
55
supports both high-performance and commodity clusters by using GASNet’s low-level com-
munication primitives over high-throughput Infiniband networks or generic Ethernet net-
works.
The rest of the chapter is structured as follows. In Section 4.3 we provide a background
of long read assembly and summarize the related works before stating the motivation for
this work. Next, in section 4.4 we discuss the methodology and implementation details of
Hiper3ga . Finally, in section 4.5 we evaluate our work on various datasets against the state-
of-the-art and present our conclusions in section 4.6.
4.3. Background and Related Work
Given a set of sequences S (|S | = r ) over the alphabet Σ = {A,C ,T,G} our goal is to find
a set of alignments A ⊂ S ×S ×N4, where each ai ∈ A is a 6-tuple (si , s j ,bi ,b j ,ei ,e j ) that
corresponds to an overlap between the sequences si and s j starting from positions bi and
b j respectively and ending at positions ei and e j respectively. For each sequence si ∈ S we
define a list Ki of k-mers such that a k-mer ki j ∈ Ki is the k-length substring of si starting
from position j . If two sequences si and s j have at least one overlapping region of length
larger than k, Ki ∩K j ̸= ;.
k-mers are commonly used in the seed-and-extend method of finding overlaps where
for any two sequences si and s j , we find k-mers common to both Ki and K j and align the
sequences such that these common k-mers (known as seeds) are also aligned. In order to
perform efficient all-vs-all alignment, these triplets are indexed in a fashion that allows us to
find the list of all sequences and the corresponding positions where any given k-mer occurs.
Therefore, all k-mers in K =< K1,K2, · · · ,Kr > along with their locations are used to form
triplets (x, s, p), where s ∈S , p ∈N, and x ∈K is the k-mer at the pth position if s.
Instead of indexing the triplets for all k-mers in a sequence, one can index only a subset
of triplets corresponding to the k-mers that are representative of the sequence. Such a set
of representative k-mers can be chosen in a variety of ways. BLAST [3], which is arguably the
most popular tool for sequence alignment uses k-mers at fixed intervals (i.e., every pth k-mer)
56
of the reference sequence and indexes them on a hash-table. On receiving a query sequence,
it generates all k-mers from it and searches for them in the hash-table of reference k-mers. If
there are several matches in a relatively small window, it deduces that the query and reference
sequences are aligned in that window.
On the other hand, MHAP [6] uses a set of m distinct hash-functions {h1,h2, · · · ,hm} to
choose m representative k-mers {kr 1,kr 2, · · · ,kr m} from each sequence s such that
kr i = argmin
x∈Ks
hi (x)
These representative k-mers are used to create a min-hash sketch for each sequence and the
similarity between two sequences is detected by approximating their Jaccard coefficients us-
ing their respective sketches. While this method is memory-efficient, it only provides a cluster
of similar sequences instead of a set of candidate seeds/anchors where these sequences are
aligned to each other.
A more sensitive method of selecting anchor k-mers is by using the Minimizers [52]. In
this method, a k-mer x is chosen as an anchor if the sequence s contains a region of w con-
secutive k-mers Ks i ⊂ Ks for which x = argminy∈Ks i f (y). This technique is similar to that
used in MHAP but unlike MHAP, which picks a fixed number of candidates from the entire se-
quence, this selects one candidate from each fixed-size window of the sequence. This method
is also more efficient that BLAST since it requires us to check a fraction of all the k-mers in the
query sequence. The minimizers are then used as seeds to align sequences even if they do not
have regions that exactly match by virtue of having a large number of minimizers in common.
The techniques mentioned above are generic to sequence alignment but are widely used
by 3rd generation sequence assemblers because of their high error rates. Once the erroneous
reads are aligned, these assemblers merge the overlapped regions of the reads to create longer
contiguous sequences known as contigs. Most of these assemblers can be categorized into
three types - direct, hierarchical and hybrid [33]. Direct assembly approaches try to assemble
57
erroneous reads in a single overlap-consensus pass without trying to correct the erroneous
reads. Examples of direct assemblers include Celera [49] and the Minimap + Miniasm [37]
pipeline which uses minimizers to find multiple-sequence alignment.
On the other hand, some assemblers attempt to align shorter reads to the longer reads
in the same dataset and uses a consensus of shorter reads to correct the long ones. Almost
always, the alignment and consensus step are performed multiple times to improve the qual-
ity of base calls after which the high-quality long reads are used to obtain contigs. Such type
of assemblers are called hierarchical (or self-correction) assemblers and include HGAP [16],
Falcon [17], and Canu [34] that uses the min-hash technique mentioned before.
Lastly, the third link known as hybrid assemblers work on the same principal as hierarchi-
cal assemblers but instead of using shorter reads from the same library, they use high-quality
short-reads from other (next-generation sequencing) datasets to correct the longer ones. Ex-
amples of hybrid assemblers include HybridSPAdes [4], Cerulean [19] and dbg2olc [19]. Hy-
brid strategies are more effective when the coverage of the dataset falls below 30×. On the
contrary, hierarchical approaches generally require more coverage (about 50×) to work, but
the assemblies are more reliable than those of hybrid tools.
4.3.1. Motivation and goal.
Both the hierarchical and hybrid strategies involve a computationally expensive base-
level error-correction step consisting of multiple-sequence alignment and subsequent con-
sensus generation. These steps require dynamic programming approaches such as Smith-
Waterman or Needleman-Wunsch, both of which have a quadratic time complexity. As a con-
sequence, both these approaches require long execution times. On the other hand, direct
approaches are faster but their outputs cannot be directly used by downstream applications.
However, it has been shown that even though they contain lots of base-level errors, the contigs
generated by Minimap+Miniasm can be fed back to Minimap along with the original reads,
and then the contigs can be polished using a partial order alignment (POA) based consensus
generation tool called RACON [59]. This approach was reported to produce contigs as good
58
as, or better than the existing state of the art assemblers in terms of quality while being an
order of magnitude faster than them.
Despite the rising popularity of third generation sequencing and advances in long-read
assembly, there are no distributed assemblers available today. This poses a two-pronged prob-
lem for sequencing the larger datasets. Assemblers processing these datasets not only take
days to finish but often run out of memory even on big-memory machines. Thus even with
access to a Supercomputing cluster or other distributed computing infrastructure, practition-
ers may be unable to assemble and use long-read sequence datasets. Our goal in this work is
to alleviate these issues by building a distributed long-read aligner and assembler. We choose
the Minimap+Miniash approach because they are not only quick and amenable to paralleliza-
tion, but also produce high-quality contigs with the help of RACON.
4.4. Methodology
Hiper3ga is designed to run on a cluster and assumes the existence of a distributed filesys-
tem (DFS) that is globally accessible to all nodes in the cluster. Each node i runs a process pi
that - (a) reads the input sequences from the DFS, (b) finds alignments among all sequences
read by itself and its peers, (c) builds a graph using those alignments, (d) generates contigs,
and (d) writes the contigs back to the DFS. Moreover, each node has local storage which is
used to temporarily store the sequence alignments before they are used to generate contigs.
The spawning of processes on the nodes and all subsequent communication and synchro-
nization between them is handled by GASNet [9] communication library.
Although the processes at each node are ranked, they work in a peer-to-peer as opposed
to a master-slave model during most of the execution. However, in certain scenarios involving
a gather-scatter communication pattern, the process ranked 0 (p0) acts as a master, aggregat-
ing data from its peers and broadcasting the results. The process at each node runs multiple
worker threads and has a master thread that is responsible for synchronizing them.
The assembly pipeline consists of four main stages - 1) generating fingerprints, 2) index-
ing triplets, 3) aligning sequences, and 4) generating contigs. In addition to the ones men-
59
tioned above, there is a preprocessing stage in which p0 reads the input and splits it into
chunks such that each chunk starts with a sequence and its size is less than or equal to some
user-specified limit. Chunks are represented as 3-tuples (o f ,or ,nc ) where o f is the file offset
of the chunk, or is the sequence-id of the first sequence in the chunk and nc is the number
of sequences in the chunk. The chunks are written to the DFS so that during the subsequent
phases when multiple processes need to access the input sequences in parallel, each process
can read the list of chunks and work on every j th chunk of input such that j (mod n) = i . Al-
though preprocessing is performed by a single thread, it is a one-time phase and takes a few
minutes even for the largest dataset on which we evaluate Hiper3ga.
4.4.1. Overview of GASNet
GASNet is a high-performance communication middleware that provides active messag-
ing primitives using Remote Direct Memory Access (RDMA) on a wide range of communica-
tion conduits such as Infiniband, Intel Omnipath, Ethernet, etc. It also provides an extended
API for remote memory-copy using the core active messaging primitives and a bootstrapping
interface that can be used to launch jobs on clusters. During the bootstrap process, each
node registers a user-specified amount of memory for communication. The address of this
registered memory, henceforth called GASNet-attached segment, and the size of the memory
segment for each node is made known to all other nodes.
Communication between nodes via active messaging can be done in three ways -
1. Without any payload (ShortAM) - Node A sends an active message (AM) with a handler-
ID and up to 16 32-bit integer arguments to node B. Upon reaching B, the AM runs the
handler code using the arguments and optionally replies to node A with another AM.
2. With a small payload (MediumAM) - Node A sends an AM with a handler-ID, a pay-
load of up to n bytes and up to 16 32-bit integer arguments to node B. The value of n
depends on the conduit and is about 4KB for Infiniband and about 64KB for Ethernet.
Upon reaching B, the AM runs the handler code using the payload and arguments and
60
optionally replies with another AM. The payload is neither required to be in node A’s
registered memory segment nor copied to that of B’s. The payload on node B only ex-
ists within the scope of the handler function and must be copied if it is required beyond
that. The allocation of memory on node B for the payload, the data transfer, and the
subsequent deallocation is handled transparently by GASNet.
3. With a large payload (LongAM) - Node A sends an AM with a handler-ID source address
(in node A) of a payload of up to n bytes, the destination address (in node B’s GASNet-
attached segment) where the payload is to be copied and up to 16 integer arguments.
The value of n depends on implementation, and configuration but is generally as large
as 2GB. Upon reaching B, the AM runs the handler code using the payload and argu-
ments and optionally replies with another AM. Unlike the previous case, the payload’s
lifetime on node B extends beyond that of the handler.
The handler-IDs which are provided as arguments to AMs and their function pointers are
registered to GASNet during bootstrapping. Furthermore, each AM request handler can in-
voke a single AM reply handler which in turn cannot invoke any more AMs. GASNet also pro-
vides API’s for polling the network which must be used to ensure the progress of AM-delivery.
In cases where the objective of a longAM is just to copy data from one node to another,
i.e., its handler does not perform any operation on the payload, GASNet provides simpler APIs
to get and put data. These APIs can be synchronous or asynchronous. In case of remote-gets,
the source data must lie in GASNet-attached segment of the remote node, whereas in case of
remote-puts, the memory region between the destination address and destination + payload
size must lie within the peer’s attached segment. Unlike the longAMs, the size of the data
transfer is limited only the size of the registered memory segment.
4.4.2. Map phase: generating minimizers
Each process streams chunks of input data from the DFS and for each chunk, it calculates
the number of sequences and their lengths. Next, each chunk along with the lengths and
61
the offsets of the sequences in it is sent to a thread-pool for processing. The threads in the
thread-pool work on the sequences in parallel and for each sequence, a thread generates a list
of 4-tuples f = (m,d , i , j ) where m ∈N is the minimizer, d ∈ {0,1} denotes the strand (original
or Watson-Crick complement) from where m is generated, i is the sequence-id and j is the
location in si from where m is generated.
Each tuple is stored in a 128-bit integer and after a sequence is processed, the thread re-
sponsible for it copies the list of tuples into the GASNet-attached memory segment. The des-
tination where the tuples are copied is obtained by atomically incrementing an offset shared
between the threads in the thread-pool.
Partitioning. After all the reads are processed, the minimizer tuples generated are parti-
tioned into buckets so that they can be dispatched to the appropriate process for indexing.
Partitioning is performed in two rounds. In the first round, each process pi splits the list of
minimizer tuples Fi into c regions {Fi 1,Fi 2, · · · ,Fi c } where c is the number of worker threads.
Each worker thread t j further splits F j 1 into b partitions by performing a radix sort on the least
significant logb bits of the minimizer values. Next, t j counts the number of tuples in each of
the buckets and emits a pair of b-sized arrays containing the size and offset of each bucket. In
the second round, thread t j aggregates the tuples belonging to bucket bk for all k that satisfy
k (mod c) = j . These aggregated buckets are copied back to GASNet-attached memory from
where they are dispatched to all nodes in the cluster.
4.4.3. Reduce phase: indexing minimizers
Shuffle. During the map phase, all processes partition their respective minimizer tuples into
buckets based on their minimizer values. These buckets are then mapped to different pro-
cesses which index the tuples in a hash table using the minimizers as keys. This requires each
process to fetch the data belonging to a partition assigned to it but generated by a peer node.
Each node copies the array of partition sizes to GASNet-attached memory segment and per-
forms a barrier-synchronization. Subsequently, each node copies the partition sizes from all
other nodes and calculates (1) the total sizes of the partitions belonging to itself and (2) the
62
source offsets from where it would copy the data residing in its peers. Note that the mini-
mizers are already copied to the GASNet-attached memory during the map phase, so all that
remains for each node is to copy the remote data from the offset calculated before to its local
memory.
Indexing. After the shuffle phase, the minimizer tuples are indexed in a hash-map. In the
first step, the tuples in each partition are sorted by their minimizers. Next, the number of
unique minimizers is calculated and a hash-map is pre-allocated with the exact number of
keys. Finally, the key-value pairs are inserted into the hash-map where each key is a minimizer
and its value is a pair consisting of the offset of the location (read-id, position) and the number
of such locations. Each partition is processed in parallel using multiple threads.
The hash-tables located within each node together with GASNet’s communication prim-
itives are used to simulate a read-only distributed multi-map. A client thread searching for
the occurrences of a minimizer first finds the partition it belongs to and then finds the node
responsible for that partition. Next, it sends an AM with the key to the destination node which
then replies back to the caller with the offset and count of values. The client allocates memory
and sends another AM with the offset and count of values and the destination address which
then copies the values to the specified address at the caller node. Note that in our implemen-
tation, the requests and responses are batched so that the caller sends a block of keys to each
node where the responses are buffered and sent together using two large messages - one for
the addresses and counts and the other for the actual values.
4.4.4. Aligning sequences
For a query sequence, we generate list of minimizer-location tuples (m,d , i , j ) where m
is the minimizer hash value, d is the strand from which m was generated, i is the sequence-
id and j is the position in read i from where m was obtained. For each tuple in the list, we
check for occurrences (hits) of minimizer in other sequences, i.e. check for tuples of the form
(m,d ′, i ′, j ′) in the distributed multi-map. A hit is stored as a 5-tuple (i ′, x, a, p, j ′) where i ′
is the id of the target sequence, x is the strand relative to d where the minimizer appears
63
(x = d ⊕d ′), j ′ is the position in i ′ where the minimizer appears, a is the relative alignment
of sequence i ′ with sequence i ( j ′− j if x = 0, j + j ′ otherwise), and p is the position of the
minimizer in the list of tuples generated from the query sequence i . These tuples are stored
in an array and sorted by their elements in order.
The minimizers that are common between two sequences are clustered using Hough trans-
fromation [57] in order to use them as seeds in the alignment of two sequences. Informally, let
a minimizer m occurring on two sequences si and s j in positions pi and p j can be represented
as a point (pi , p j ) in a 2-dimensional plane. If the si and s j have a significant overlap with each
other, we will have several minimizers common to both and all of them can be represented in
the same space. Furthermore, if the overlapping regions are error-free, i.e., if si and s j both
contain the same substring o, and if the minimizers generated from the o region of si and s j
are plotted on the 2D-plane, they will form a straight line. This line will be inclined at a 45°
angle since the distance between pi and p j for each unique minimizer m will be constant.
However, if there are errors the minimizers will not be in exactly a straight line, but clustered
around one. The objective of Hough Transformation is to find a straight line such that a ma-
jority of the minimizers are incident on it. Since the slope is known to be 1, the line takes the
form y = x + c =⇒ c = y − x. For each point (minimizer), we can calculate the intercept c and
use the value supported by the majority of the points. This line is used as a crude alignment
of the sequences.
Two minimizer hits between sequences i and i ′ (i ′, x, a1, p1, j ′1) and (i
′, x, a2, p2, j ′2) are
said to be approximately collinear within a bandwidth ϵ (500bp by default) if ∥a1 −a2∥ < ϵ. A
cluster of approximately collinear hits is called an interval. Intervals are stored in an array and
sorted in descending order of their sizes. For each interval starting from the largest, we find the
longest increasing sequence (LIS) of the list of positions in the target sequence which generate
the same minimizers as the query sequence. The hits in these LISes are merged to create a
region of overlap between the target and query sequences where the first and last hit positions
in the sequence denote the beginning and the end of the overlapped region respectively. Each
64
node writes the list of overlaps to disk using a key-value format where the key is the read-id of
a sequence and the value is the list of its alignments.
4.4.5. Assembly
The assembly phase can be split into two sub-phases - graph building and graph sim-
plification. The first step is to distribute the set of alignments created in the previous phase
between the different nodes. The alignments are first split among the worker threads at each
node where they are sorted by their partitions. The partition of a (read-id, alignments) pair is
defined as read-id (mod N ), where N is the number of nodes in the cluster. Each thread then
generates two N -length arrays of partition sizes and their offsets. The master thread gathers
all the partition sizes and broadcasts the total partition sizes to all other nodes.
The broadcasting is done as follows. Each node places the N -length array of partition
sizes in its GASNet-attached memory and performs a barrier synchronization. Since nodes
have knowledge of each other attached memory segment addresses, they can copy data from
a remote peer’s memory to itself. Thus, each node has knowledge of the partition sizes of all
other nodes which are stored in an N ×N sized array. Moreover, since each node is uniquely
ranked, they can calculate the offset at the remote peer at which it is supposed to copy the
alignments belonging to the peer. For instance, p0 will copy the data belonging to process pi
at node i to the address of the GASNet-attached segment Ai and p1 will copy its data to Ai+n0i
where p0i is the size of the partition i at p0. Once, the alignments are shuffled among nodes,
they are distributed among threads by sorting them according to (i /N ) (mod T ), where i is
the read-id and T is the number of threads.
Each thread sorts the (read-id, list of alignments) pairs by their read-ids and for each read
(query), it classifies the alignments with other reads (targets) into one of five categories -
1. Internal overlaps - if the target read has regions that are aligned to the query and there
are regions on either side that are not aligned (overhangs) such that they are longer
than some threshold, the overlap is classified as internal, i.e. the overlap is considered
to have arisen from repeats rather than from sequencing the same region of the genome.
65
Internal overlaps are ignored since they do not contribute to read extension.
2. Query read contained - if the query read is completely contained in one of the target
reads, i.e., the overlapping region of the query spans almost the entire read and the
overhang regions are small enough, the query read is marked for deletion. The thread
processing the alignment uses an atomic bit-vector to set the bit corresponding to the
query read-id to 1.
3. Target read contained - like the previous case, if the target read is fully contained in the
query read, the target read is marked for deletion.
4. Query-suffix target-prefix overlap - if the suffix of the query read has a significant overlap
with the prefix of the target read (such that the overhang lengths are smaller than some
threshold) we create (query, target, overlap length) triple which will correspond to a
weighted edge in the assembly graph.
5. Query-prefix target-suffix overlap - if the prefix of the query read has a significant over-
lap with the suffix of the target read, we create a triple (target, query, overlap length).
Aggregating deleted reads. Once all hits are classified by all nodes, and the contained reads
are marked for deletion, the atomic bit-vectors from all nodes are gathered by the master node,
aggregated and scattered back to the peers. The gather-scatter process works like the parti-
tioning phase mentioned before. Each node copies its bit-vector into the GASNet-attached
segment and performs a barrier synchronization. The node ranked 0 performs a remote-
memory-get of r bits (r is the number of reads) from each peer and performs a bitwise-OR
of the elements in its local vector with those of the remote vector. Once all data from all nodes
are aggregated, the master performs remote-memory-puts to all peers and performs another
barrier synchronization.
Deleting edges connecting deleted reads. Once the nodes get updated set of deleted reads,
each node removes any suffix-prefix overlaps where either the suffix or the prefix or both be-
long to one of the deleted reads. Since most of the popular third-generation sequencers have
66
a long-tail distribution of read lengths (as shown later in Section 4.5), it is highly likely most
of the smaller reads will be contained within a few longer reads. Hence, removal of contained
reads significantly reduces the number of edges in the final graph by many folds. This makes
it possible for downstream phases to process the graph on a single machine and avoid inter-
node communication and synchronization.
Assembly graph creation. After edges originating from and/or incident upon contained reads
are removed, all nodes send their remaining edge list to the master node. Each peer copies the
number of edges and the list of edges into their GASNet-attached segment which is fetched by
the master using remote-memory-gets. Next, the master creates a graph and sanitizes it us-
ing two rules - (1) if two vertices (reads) have more than one edges (overlaps) it chooses the
one with the highest weight (longest overlap) and deletes the rest of the edges; (2) if an edge
exists from vertex u to v but not from v ′ to u′, i.e. the Watson-Crick complements of u and v
respectively, the edge from u to v is deleted.
Once the graph is made consistent, its edges are transitively reduced whenever possible.
For any 3 vertices u,v and w , if there exist edges u → v , v → w and u → w , we can remove the
edge u → w and its complement edge u′ → w ′ without affecting the connectivity of the graph.
Moreover, in order to handle repeats, if some vertex v has multiple outgoing edges Ev , and
there exists a subset of edges E ′v ∈ Ev with weight less than some fraction (.5 by default) of the
weight of the heaviest edge in Ev , all edges in E ′v are removed [37]. Both the steps mentioned
above help reduce the number of branches in the graph and simplify it for the subsequent
untig-generation step.
Graph simplification and untig generation. The assembly graph G is used to create a set U
of untigs where each u ∈U is a sequence generated by traversing a path in G where each vertex
in the path has at most one incoming edge and at most one outgoing edge. Untigs are created
by concatenating the reads corresponding to the vertices in the path and merging their over-
lapping regions. Intuitively, each untig is formed by merging two or more overlapping reads
in the original dataset. In the ideal scenario, we would have a single untig representing the
67
genome from which the original reads in the dataset were sequenced. However, it is possible
that subgraphs within the graph are not reachable from one another due to low coverage re-
gions formed as a result of sequencer bias. Moreover, the graph often contains vertices that
have more than one incoming edge (joins) or outgoing edge (forks) which results in an ambi-
guity in path selection during untig generation.
Some forks and joins can form due to sequencing errors and can be categorized as either
tips or bubbles. A tip is a short path consisting of at most 4 vertices by default (the last vertex
in the path has no outgoing edge), and originating from some vertex with out-degree greater
than 1. A bubble is comprised of two or more paths between two vertices u and v such that
each path is an untig and shorter than some threshold (50k bases by default). Bubbles are
popped by keeping the longest path between the source and sink vertices and deleting all
others. The tip-removal, bubble-popping, and simplification phases are performed iteratively
until the graph cannot be simplified any further.
Note that each untig generated in this method is represented as a list p of tuples, where the
i th tuple ( ji , li ) consists of ji , the read-id and li , the length of the prefix of read ji . However, the
untig in this form is not usable for downstream applications. Instead, each ( ji , li ) tuple in each
path must be replaced by the actual li -length prefix of the read ji and concatenated in order
of their occurrence in the path [23]. In order to do this using a single disk pass, the lengths
and offsets of all paths are calculated and for each tuple in the path, we calculate and store
the prefix length its offset within the path. Next, the reads along with the lists of their prefix-
lengths and offsets are indexed using the read-id as the key. Finally, the reads are streamed
from disk and for each read, we place the prefixes at the appropriate offsets. When all reads
are processed, the resulting strings are then written back to file.
4.5. Evaluation
4.5.1. Datasets
To evaluate our assembly approach, we use publicly available PacBio and Nanopore datasets
of several species and of various sizes. An overview of the characteristics is shown in table 4.1.
68
Figure 4.2 shows the distribution of read lengths in the datasets used. It is evident that the
read lengths have a long-tail distribution with a few very long reads and a majority of shorter
reads. These shorter reads are generally contained within the longer ones and can be used for
subsequent consensus generation.
Table 4.1. Overview of datasets
Name Reads Bases Longest Average
C. elegans 740,776 8,117,663,505 55,460 10,958
A. thaliana 2,258,540 18,522,571,743 47,445 8,201
Human 15,599,452 114,380,310,980 1,537,349 7,332
4.5.2. Testbed
To evaluate Hiper3ga, we use the SuperMIC, which is a 360-node supercomputing clus-
ter. Each node has two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 Xeon 64-bit processors, 64 GB
DDR3 1866 MHz RAM and a 500 GB hard disk drive. The nodes are connected with a 56 Giga-
bit/sec Infiniband network and a 1 Gigabit/sec Ethernet network.
4.5.3. Comparison of assembly times
Table 4.2 provides a comparative analysis between Minimap and Miniasm with Hiper3ga
with respect to all-vs-all sequence alignment and subsequent assembly. On the smaller Caeno-
rhabditis elegans dataset, Minimap performs well because it manages to avoid network com-
munication as well as disk I/O. Even so, Hiper3ga manages to get a 7.5× speedup on 16 nodes.
However, on larger datasets, the performance gap between Minimap and Hiper3ga increases
because Minimap performs more disk I/O. Specifically, Minimap maintains two file pointers
to the input data, one of which is used to read chunks of data, generate minimizers and index
them, and the other is used to stream all reads and align them to the indexed minimizers. In
the case of the A.thaliana dataset, Minimap makes 5 passes over the input and in case of the
Human dataset, it makes 29 passes. On the other hand, Hiper3ga performs significantly bet-
ter by avoiding multiple passes and achieves a 28× speedup on 32 nodes for the A. thaliana
dataset and 79× speedup on 128 nodes for the Human dataset. Moreover, because it is limited
69
0 10000 20000 30000 40000 50000
Read length
100
101
102
103
104
R
ea
d 
co
un
t (
Lo
g 
sc
al
e)
C.elegans
0 10000 20000 30000 40000
Read length
100
101
102
103
104
R
ea
d 
co
un
t (
Lo
g 
sc
al
e)
A.thaliana
0 200000 400000 600000 800000 1000000 1200000 1400000
Read length
100
101
102
103
104
105
106
107
R
ea
d 
co
un
t (
Lo
g 
sc
al
e)
Human
Figure 4.2. Distribution of read sizes in log-normal scale
70
Table 4.2. Comparison of assembly times
Minimap + Miniasm Hiper3ga
Alignment Assembly N Alignment Assembly
C. elegans 15m 47s 2m 24s 16 2m 7s (7.5x) 1m 12s
A. thaliana 1h 4m 31s OOM 32 2m 18s (28x) 1m 30s
Human 1d 13h 2m 59s OOM 128 28m 0s (79x) 5m 12s
Table 4.3. Comparison of assembly contiguity. (H = Hiper3ga, M = Miniasm)
# Contigs Total length (bp) Longest (bp) N50 (bp) Median (bp)
H M H M H M H M H M
C. elegans 85 87 107,719,688 107,463,727 6,398,071 5,250,887 3,442,037 2,386,264 650,410 876,164
A. thaliana 257 – 123,690,012 – 13,840,842 – 5,267,120 – 34,364 –
Human 1,831 – 2,382,106,489 – 9,525,169 – 2,363,812 – 849,532 –
by the memory capacity of a single node, Miniasm runs out of memory for both these datasets.
4.5.4. Comparison of assembly contiguity
Table 4.3 shows a comparison of assembly contiguity between Hiper3ga and Miniasm.
As mentioned before, Miniasm couldn’t assemble A. thaliana and Human genome datasets
because it ran out of memory. However, on the C. elegans dataset, Hiper3ga’s results are com-
parable or even slightly better than that of Miniasm.
4.6. Conclusion
In this chapter, we presented Hiper3ga, a distributed long-read assembler inspired by
Minimap and Miniasm that can assemble large datasets generated from third generation se-
quences on multiple nodes in a matter of minutes. Our experimental results demonstrate that
Hiper3ga can scale up to hundreds of nodes and provide about two orders of magnitude per-
formance improvements over the state-of-the-art long-read assemblers. Moreover Hiper3ga’s
contigs are comparable in contiguity to Miniasm. As future work, we plan to improve contig
quality by including the base-level error correction phase using a distributed Partial-Order-
Alignment based consensus generation with hierarchical or hybrid strategies.
71
Appendix. Copyright Information
In reference to IEEE copyrighted material which is used with permission in this thesis, the
IEEE does not endorse any of Louisiana State University’s products or services. Internal or
personal use of this material is permitted. If interested in reprinting/republishing IEEE copy-
righted material for advertising or promotional purposes or for creating new collective works
for resale or redistribution, please go to http://www.ieee.org/publications_standards/
publications/rights/rights_link.html to learn how to obtain a License from RightsLink.
If applicable, University Microfilms and/or ProQuest Library, or the Archives of Canada
may supply single copies of the dissertation.
72
5/21/2019 Rightslink® by Copyright Clearance Center
https://s100.copyright.com/AppDispatchServlet#formTop 1/1
 
Title: Lazer: Distributed memory-efficient
assembly of large-scale genomes
Conference
Proceedings:
2016 IEEE International Conference
on Big Data (Big Data)
Author: Sayan Goswami
Publisher: IEEE
Date: Dec. 2016
Copyright © 2016, IEEE
LOGINI
If you're a copyright.com user,
you can login to RightsLink using
your copyright.com credentials.
Already a RightsLink user or
want to learn more?
 
Thesis / Dissertation Reuse
The IEEE does not require individuals working on a thesis to obtain a formal reuse license, however, you may print out
this statement to be used as a permission grant: 
  
Requirements to be followed when using any portion (e.g., figure, graph, table, or textual material) of an IEEE copyrighted paper
in a thesis:
  
1) In the case of textual material (e.g., using short quotes or referring to the work within these papers) users must give full credit to
the original source (author, paper, publication) followed by the IEEE copyright line © 2011 IEEE. 
 2) In the case of illustrations or tabular material, we require that the copyright line © [Year of original publication] IEEE appear
prominently with each reprinted figure and/or table. 
 3) If a substantial portion of the original paper is to be used, and if you are not the senior author, also obtain the senior author's
approval. 
  
Requirements to be followed when using an entire IEEE copyrighted paper in a thesis: 
  
1) The following IEEE copyright/ credit notice should be placed prominently in the references: © [year of original publication] IEEE.
Reprinted, with permission, from [author names, paper title, IEEE publication title, and month/year of publication] 
 2) Only the accepted version of an IEEE copyrighted paper can be used when posting the paper or your thesis on-line.
 3) In placing the thesis on the author's university website, please display the following message in a prominent place on the
website: In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of
[university/educational entity's name goes here]'s products or services. Internal or personal use of this material is permitted. If
interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new
collective works for resale or redistribution, please go to
http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink. 
  
If applicable, University Microfilms and/or ProQuest Library, or the Archives of Canada may supply single copies of the
dissertation.
    
 
Copyright © 2019 Copyright Clearance Center, Inc. All Rights Reserved. Privacy statement. Terms and Conditions. 
Comments? We would like to hear from you. E-mail us at customercare@copyright.com 
 
5/21/2019 Rightslink® by Copyright Clearance Center
https://s100.copyright.com/AppDispatchServlet#formTop 1/1
 
Title: GPU-Accelerated Large-Scale
Genome Assembly
Conference
Proceedings:
2018 IEEE International Parallel and
Distributed Processing Symposium
(IPDPS)
Author: Sayan Goswami
Publisher: IEEE
Date: May 2018
Copyright © 2018, IEEE
LOGINI
If you're a copyright.com user,
you can login to RightsLink using
your copyright.com credentials.
Already a RightsLink user or
want to learn more?
 
Thesis / Dissertation Reuse
The IEEE does not require individuals working on a thesis to obtain a formal reuse license, however, you may print out
this statement to be used as a permission grant: 
  
Requirements to be followed when using any portion (e.g., figure, graph, table, or textual material) of an IEEE copyrighted paper
in a thesis:
  
1) In the case of textual material (e.g., using short quotes or referring to the work within these papers) users must give full credit to
the original source (author, paper, publication) followed by the IEEE copyright line © 2011 IEEE. 
 2) In the case of illustrations or tabular material, we require that the copyright line © [Year of original publication] IEEE appear
prominently with each reprinted figure and/or table. 
 3) If a substantial portion of the original paper is to be used, and if you are not the senior author, also obtain the senior author's
approval. 
  
Requirements to be followed when using an entire IEEE copyrighted paper in a thesis: 
  
1) The following IEEE copyright/ credit notice should be placed prominently in the references: © [year of original publication] IEEE.
Reprinted, with permission, from [author names, paper title, IEEE publication title, and month/year of publication] 
 2) Only the accepted version of an IEEE copyrighted paper can be used when posting the paper or your thesis on-line.
 3) In placing the thesis on the author's university website, please display the following message in a prominent place on the
website: In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of
[university/educational entity's name goes here]'s products or services. Internal or personal use of this material is permitted. If
interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new
collective works for resale or redistribution, please go to
http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink. 
  
If applicable, University Microfilms and/or ProQuest Library, or the Archives of Canada may supply single copies of the
dissertation.
    
 
Copyright © 2019 Copyright Clearance Center, Inc. All Rights Reserved. Privacy statement. Terms and Conditions. 
Comments? We would like to hear from you. E-mail us at customercare@copyright.com 
 
References
[1] A. Abu-Doleh and Ü. V. Çatalyürek. Spaler: Spark and graphx based de novo genome
assembler. In Big Data (Big Data), 2015 IEEE International Conference on, pages 1013–
1018. IEEE, 2015.
[2] A. Acevedo, L. Brodsky, and R. Andino. Mutational and fitness landscapes of an rna virus
revealed through population sequencing. Nature, 505(7485):686–690, 2014.
[3] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman.
Gapped blast and psi-blast: a new generation of protein database search programs. Nu-
cleic acids research, 25(17):3389–3402, 1997.
[4] D. Antipov, A. Korobeynikov, J. S. McLean, and P. A. Pevzner. hybridspades: an algorithm
for hybrid assembly of short and long reads. Bioinformatics, 32(7):1009–1015, 2015.
[5] I. Ben-Bassat and B. Chor. String graph construction using incremental hashing. Bioin-
formatics, 30(24):3515–3523, 2014.
[6] K. Berlin, S. Koren, C.-S. Chin, J. P. Drake, J. M. Landolin, and A. M. Phillippy. Assembling
large genomes with single-molecule sequencing and locality-sensitive hashing. Nature
biotechnology, 33(6):623, 2015.
[7] J. Blom, T. Jakobi, D. Doppmeier, S. Jaenicke, J. Kalinowski, J. Stoye, and A. Goesmann. Ex-
act and complete short-read alignment to microbial genomes using Graphics Processing
Unit programming. Bioinformatics, 27(10):1351–1358, 2011.
[8] S. Boisvert, F. Laviolette, and J. Corbeil. Ray: simultaneous assembly of reads from
a mix of high-throughput sequencing technologies. Journal of computational biology,
17(11):1519–1533, 2010.
[9] D. Bonachea and P. Hargrove. Gasnet specification, v1.8.1. GAS-
Net Specification, v1.8.1. Lawrence Berkeley National Laboratory. Report:
LBNL-2001064, 2017. http://dx.doi.org/10.2172/1398512 Retrieved from
https://escholarship.org/uc/item/03b5g0q4.
[10] P. Bonizzoni, G. Della Vedova, Y. Pirola, M. Previtali, and R. Rizzi. FSG: Fast String Graph
Construction for De Novo Assembly. Journal of Computational Biology, 2017.
[11] K. R. Bradnam, J. N. Fass, A. Alexandrov, P. Baranay, M. Bechner, I. Birol, S. Boisvert, J. A.
Chapman, G. Chapuis, R. Chikhi, et al. Assemblathon 2: evaluating de novo methods of
genome assembly in three vertebrate species. GigaScience, 2(1):10, 2013.
[12] Y. Chan, K. Xu, H. Lan, W. Liu, Y. Liu, and B. Schmidt. PUNAS: A Parallel Ungapped-
Alignment-Featured Seed Verification Algorithm for Next-Generation Sequencing Read
Alignment. In IEEE International Parallel and Distributed Processing Symposium
(IPDPS), 2017.
75
[13] Y.-J. Chang, C.-C. Chen, C.-L. Chen, and J.-M. Ho. A de novo next generation genomic
sequence assembler based on string graph and MapReduce cloud computing framework.
BMC genomics, 13(7):S28, 2012.
[14] S. Chen and H. Jiang. An exact matching approach for high throughput sequencing based
on bwt and gpus. In IEEE 14th International Conference on Computational Science and
Engineering (CSE), 2011.
[15] R. Chikhi and G. Rizk. Space-efficient and exact de Bruijn graph representation based on
a Bloom filter. Algorithms for Molecular Biology, 8(1):22, 2013.
[16] C.-S. Chin, D. H. Alexander, P. Marks, A. A. Klammer, J. Drake, C. Heiner, A. Clum,
A. Copeland, J. Huddleston, E. E. Eichler, et al. Nonhybrid, finished microbial genome
assemblies from long-read smrt sequencing data. Nature methods, 10(6):563, 2013.
[17] C.-S. Chin, P. Peluso, F. J. Sedlazeck, M. Nattestad, G. T. Concepcion, A. Clum, C. Dunn,
R. O’Malley, R. Figueroa-Balderas, A. Morales-Cruz, et al. Phased diploid genome assem-
bly with single-molecule real-time sequencing. Nature methods, 13(12):1050, 2016.
[18] M. de la Bastide and W. R. McCombie. Assembling genomic DNA sequences with PHRAP.
Current Protocols in Bioinformatics, 2007.
[19] V. Deshpande, E. D. Fung, S. Pham, and V. Bafna. Cerulean: A hybrid assembly using high
throughput short and long reads. In International workshop on algorithms in bioinfor-
matics, pages 349–363. Springer, 2013.
[20] A. Drozd, N. Maruyama, and S. Matsuoka. A multi GPU read alignment algorithm with
model-based performance optimization. In International Conference on High Perfor-
mance Computing for Computational Science, pages 270–277. Springer, 2012.
[21] E. Georganas, A. Buluç, J. Chapman, S. Hofmeyr, C. Aluru, R. Egan, L. Oliker, D. Rokhsar,
and K. Yelick. HipMer: an extreme-scale de novo genome assembler. In International
Conference for High Performance Computing, Networking, Storage and Analysis, pages
1–11, 2015.
[22] S. Goswami, A. K. Das, R. Platania, K. Lee, and S.-J. Park. Lazer: Distributed memory-
efficient assembly of large-scale genomes. IEEE International Conference on Big Data
(Big Data), pages 1171–1181, 2016.
[23] S. Goswami, K. Lee, S. Shams, and S.-J. Park. Gpu-accelerated large-scale genome assem-
bly. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS),
pages 814–824. IEEE, 2018.
[24] C. Hewitt, P. Bishop, and R. Steiger. A universal modular actor formalism for artificial
intelligence. In Proceedings of the 3rd international joint conference on Artificial intelli-
gence, pages 235–245. Morgan Kaufmann Publishers Inc., 1973.
[25] B. M. Institute. The Impact of Genomics on the U.S. Economy, 2013.
76
[26] A. Jain, A. Garg, and K. Paul. GAGM: Genome assembly on GPU using mate pairs. In 20th
International Conference on High Performance Computing (HiPC), pages 176–185. IEEE,
2013.
[27] G. Jain, L. Rathore, and K. Paul. GAMS: Genome Assembly on Multi-GPU Using String
Graph. In IEEE 18th International Conference on High Performance Computing and Com-
munications (HPCC), 2016.
[28] W. R. Jeck, J. A. Reinhardt, D. A. Baltrus, M. T. Hickenbotham, V. Magrini, E. R. Mardis,
J. L. Dangl, and C. D. Jones. Extending assembly of short DNA sequences to handle error.
Bioinformatics, 23(21):2942–2944, 2007.
[29] R. Kajitani, K. Toshimoto, H. Noguchi, A. Toyoda, Y. Ogura, M. Okuno, M. Yabana,
M. Harada, E. Nagayasu, H. Maruyama, et al. Efficient de novo assembly of highly
heterozygous genomes from whole-genome shotgun short reads. Genome research,
24(8):1384–1395, 2014.
[30] M. Kayser and P. De Knijff. Improving human forensics through advances in genetics,
genomics and molecular biology. Nature Reviews Genetics, 12(3):179–192, 2011.
[31] P. Klus, S. Lam, D. Lyberg, M. S. Cheung, G. Pullan, I. McFarlane, G. S. Yeo, and B. Y. Lam.
BarraCUDA-a fast short read sequence aligner using graphics processing units. BMC
research notes, 5(1):27, 2012.
[32] P. Koppa, A. Das, S. Goswami, R. Platania, and S. Park. Giga: Giraph-based genome as-
sembler for gigabase scale genomes. In Proceedings of the 8th International Conference
on Bioinformatics and Computational Biology (BICOB 2016), pages 55–63, 2016.
[33] S. Koren and A. M. Phillippy. One chromosome, one contig: complete microbial genomes
from long-read sequencing and assembly. Current opinion in microbiology, 23:110–120,
2015.
[34] S. Koren, B. P. Walenz, K. Berlin, J. R. Miller, N. H. Bergman, and A. M. Phillippy. Canu:
scalable and accurate long-read assembly via adaptive k-mer weighting and repeat sep-
aration. Genome research, 27(5):722–736, 2017.
[35] H. Lee, J. Gurtowski, S. Yoo, M. Nattestad, S. Marcus, S. Goodwin, W. R. McCombie, and
M. Schatz. Third-generation sequencing and the future of genomics. BioRxiv, page
048603, 2016.
[36] D. Li, C.-M. Liu, R. Luo, K. Sadakane, and T.-W. Lam. Megahit: an ultra-fast single-node
solution for large and complex metagenomics assembly via succinct de bruijn graph.
Bioinformatics, 31(10):1674–1676, 2015.
[37] H. Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long se-
quences. Bioinformatics, 32(14):2103–2110, 2016.
77
[38] Y. Li, P. Kamousi, F. Han, S. Yang, X. Yan, and S. Suri. Memory efficient minimum substring
partitioning. In Proceedings of the VLDB Endowment, volume 6, pages 169–180. VLDB
Endowment, 2013.
[39] C.-M. Liu, T.-W. Lam, T. Wong, E. Wu, S.-M. Yiu, Z. Li, R. Luo, B. Wang, C. Yu, X. Chu,
et al. SOAP3: GPU-based compressed indexing and ultra-fast parallel alignment of short
reads. In 3th Workshop on Massive Data Algorithms, 2011.
[40] Y. Liu and B. Schmidt. CUSHAW2-GPU: empowering faster gapped short-read alignment
using GPU computing. IEEE Design & Test, 31(1):31–39, 2014.
[41] Y. Liu, B. Schmidt, and D. L. Maskell. Parallelized short read assembly of large genomes
using de bruijn graphs. BMC bioinformatics, 12(1):354, 2011.
[42] Y. Liu, B. Schmidt, and D. L. Maskell. CUSHAW: a CUDA compatible short read aligner to
large genomes based on the Burrows–Wheeler transform. Bioinformatics, 28(14):1830–
1837, 2012.
[43] M. Lu, Q. Luo, B. Wang, J. Wu, and J. Zhao. GPU-accelerated bidirected De Bruijn
graph construction for genome assembly. In Asia-Pacific Web Conference, pages 51–62.
Springer, 2013.
[44] R. Luo, B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan, Y. Liu, et al. SOAP-
denovo2: an empirically improved memory-efficient short-read de novo assembler. Gi-
gascience, 1(1):18, 2012.
[45] I. MacCallum, D. Przybylski, S. Gnerre, J. Burton, I. Shlyakhter, A. Gnirke, J. Malek,
K. McKernan, S. Ranade, T. P. Shea, et al. Allpaths 2: small genomes assembled accu-
rately and with high continuity from short paired reads. Genome biology, 10(10):1, 2009.
[46] S. F. Mahmood and H. Rangwala. Gpu-euler: Sequence assembly using gpgpu. In IEEE
13th International Conference on High Performance Computing and Communications
(HPCC), pages 153–160, 2011.
[47] J. Meng, B. Wang, Y. Wei, S. Feng, and P. Balaji. SWAP-Assembler: scalable and effi-
cient genome assembly towards thousands of cores. In BMC bioinformatics, volume 15.
BioMed Central Ltd, 2014.
[48] D. G. Merrill and A. S. Grimshaw. Revisiting sorting for gpgpu stream architectures. In
Proceedings of the 19th international conference on Parallel architectures and compilation
techniques, pages 545–546. ACM, 2010.
[49] E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A.
Kravitz, C. M. Mobarry, K. H. Reinert, K. A. Remington, et al. A whole-genome assem-
bly of drosophila. Science, 287(5461):2196–2204, 2000.
[50] J. Pantaleoni and N. Subtil. NVBIO A library of reusable components designed by NVIDIA
corporation to accelerate bioinformatics applications using CUDA, 2014.
78
[51] R. A. Power, J. Parkhill, and T. de Oliveira. Microbial genome-wide association studies:
lessons from human gwas. Nature Reviews Genetics, 18(1):41–50, 2017.
[52] M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, and J. A. Yorke. Reducing storage require-
ments for biological sequence comparison. Bioinformatics, 20(18):3363–3369, 2004.
[53] S. L. Salzberg, A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C.
Schatz, A. L. Delcher, M. Roberts, et al. GAGE: A critical evaluation of genome assemblies
and assembly algorithms. Genome research, 22(3):557–567, 2012.
[54] M. C. Schatz, D. Sommer, D. Kelley, and M. Pop. De novo assembly of large genomes
using cloud computing. In Proceedings of the Cold Spring Harbor Biology of Genomes
Conference, 2010.
[55] J. T. Simpson and R. Durbin. Efficient de novo assembly of large genomes using com-
pressed data structures. Genome research, 22(3):549–556, 2012.
[56] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol. ABySS: a parallel
assembler for short read sequence data. Genome research, 19(6):1117–1123, 2009.
[57] I. Sović, M. Šikić, A. Wilm, S. N. Fenlon, S. Chen, and N. Nagarajan. Fast and sensi-
tive mapping of nanopore sequencing reads with graphmap. Nature communications,
7:11307, 2016.
[58] B. S. C. Varma, K. Paul, M. Balakrishnan, and D. Lavenier. Fassem: Fpga based accelera-
tion of de novo genome assembly. In Field-Programmable Custom Computing Machines
(FCCM), 2013 IEEE 21st Annual International Symposium on, pages 173–176. IEEE, 2013.
[59] R. Vaser, I. Sović, N. Nagarajan, and M. Šikić. Fast and accurate de novo genome assembly
from long uncorrected reads. Genome research, 27(5):737–746, 2017.
[60] J. Wang, X. Xie, and J. Cong. Communication Optimization on GPU: A Case Study of Se-
quence Alignment Algorithms. In IEEE International Parallel and Distributed Processing
Symposium (IPDPS), 2017.
[61] Y. Xie, G. Wu, J. Tang, R. Luo, J. Patterson, S. Liu, W. Huang, G. He, S. Gu, S. Li, et al.
Soapdenovo-trans: de novo transcriptome assembly with short rna-seq reads. Bioinfor-
matics, 30(12):1660–1666, 2014.
[62] D. R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de
Bruijn graphs. Genome research, 18, 2008.
79
Vita
Sayan Goswami graduated with a B.Tech in Computer Science & Engineering from Na-
tional Institute of Technology, Durgapur, India in 2011. After working in the energy trading
and risk management industry for 1.5 years, he joined Louisiana State University for pursu-
ing a doctoral degree. His primary area of interest in big data with special emphasis on bio-
informatics pipelines.
80
