FPGA acceleration of DNA sequencing analysis and storage by Arram, James
Imperial College London
Department of Computing
FPGA Acceleration of DNA Sequencing
Analysis and Storage
James Arram
July 2017
Submitted in part fullment of the requirements for the degree of
Doctor of Philosophy in Computing of Imperial College London
and the Diploma of Imperial College London
1
Abstract
In this work we explore how Field-Programmable Gate Arrays (FPGAs) can be
used to alleviate the data processing bottlenecks in DNA sequencing. We focus our
eorts on accelerating the FM-index, a data structure used to solve the computationally
intensive string matching problems found in DNA sequencing analysis such as short read
alignment. The main contributions of this work are:
1. We accelerate the FM-index using FPGAs and develop several novel methods for
reducing the memory bottleneck of the search algorithm. These methods include
customising the FM-index structure according to the memory architecture of the
FPGA platform and minimising the number of memory accesses through both
architectural and algorithmic optimisations.
2. We present a new approach for accelerating approximate string matching using
the backtracking FM-index. This approach makes use of specialised approximate
string matching modules and a run-time recongurable architecture in order to
achieve both high sensitivity and high performance.
3. We extend the FM-index search algorithm for reference-based compression and
accelerate it using FPGAs. This accelerated design is integrated into fastqZip and
fastaZip, two new tools that we have developed for the fast and eective compres-
sion of sequence data stored in the FASTQ and FASTA formats respectively.
We implement our designs on the Maxeler Max4 Platform and show that they are
able to outperform state-of-the-art DNA sequencing analysis software. For instance,
our hardware-accelerated compression tool for FASTQ data is able to achieve a higher
compression ratio than the best performing tool, fastqz, whilst the average compression
and decompression speeds are 25 and 43 times faster respectively.
2
Declaration of Originality
I hereby declare that I am the sole author of this thesis and that the material within
has not been submitted for a degree in any other university or institution. The materials
and information used or derived from other published sources has been clearly cited and
appropriately acknowledged.
3
Copyright Declaration
The copyright of this thesis rests with the author and is made available under a
Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers
are free to copy, distribute or transmit the thesis on the condition that they attribute it,
that they do not use it for commercial purposes and that they do not alter, transform
or build upon it. For any reuse or redistribution, researchers must make clear to others
the licence terms of this work
4
Acknowledgements
First and foremost I would like to thank my supervisor Wayne Luk for all the advice
and encouragement he has given me over the past four years. His forming of the col-
laboration between myself and the Department of Chemical Pathology in the Chinese
University of Hong Kong (CUHK) gave this project real purpose and allowed me to work
on truly exciting applications. From CUHK I would like to thank Peiyong Jiang, Rossa
Chiu and Dennis Lo for their advice on DNA methylation and for their hospitality when
I visited Hong Kong. I would like to thank my colleagues, in particular Paul Grigoras,
Xinyu Niu, Pavel Burovskiy and Kuen Hung Tsoi, for the technical support they have
given me throughout this project and for the many hours of coee breaks shared. Last
but not least, I would like to thank my friends and family for lling my life with love
and support. Thank you all.
5
Contents
1 Introduction 13
2 Background and related work 20
2.1 Indexed String Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Notation and basic denitions . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Sux Tries, Trees and Arrays . . . . . . . . . . . . . . . . . . . . . 21
2.1.3 Burrows-Wheeler Transform . . . . . . . . . . . . . . . . . . . . . . 24
2.1.4 FM-index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Application Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.1 FPGA acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2 FPGA accelerator platforms . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Exact string matching 40
3.1 Base design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.1 FM-index structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.2 Hardware architecture . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.1.3 Performance model . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Algorithmic optimisations . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 k-step FM-index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.2 Precomputed intervals . . . . . . . . . . . . . . . . . . . . . . . . . 55
6
3.2.3 Oversampled Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.4 Optimisation procedure . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.2 Base design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3.3 k-step FM-index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.4 Precomputed intervals . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3.5 Oversampled index . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3.6 Short read alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4 Approximate string matching 72
4.1 Backtracking design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1.1 Backtracking strategy . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.1.2 Hardware architecture . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.1.3 Seed-and-compare . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 Design-space exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.1 Static architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.2 Run-time recongurable architecture . . . . . . . . . . . . . . . . . 88
4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.2 Backtracking design . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.3.3 Seed-and-compare . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3.4 Short read alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5 Sequence data compression 103
5.1 Reference-based compression . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.1.1 Algorithm overview . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.1.2 Hardware architecture . . . . . . . . . . . . . . . . . . . . . . . . . 107
7
5.2 FASTQ and FASTA compression . . . . . . . . . . . . . . . . . . . . . . . 109
5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.2 Reference-based compression . . . . . . . . . . . . . . . . . . . . . 117
5.3.3 FASTQ compression . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3.4 FASTA compression . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6 Conclusion 125
Bibliography 131
8
List of Figures
1.1 Sequencing throughput of Illumina NGS platforms . . . . . . . . . . . . . 14
1.2 NGS data analysis pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Trie of the strings: B, AN, and ANA . . . . . . . . . . . . . . . . . . . . . 21
2.2 Sux trie of the text BANANA . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Sux tree of the text BANANA . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Sux array of the text BANANA . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Construction of the BWT . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 FM-index of the text BANANA . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 FM-index search algorithm example . . . . . . . . . . . . . . . . . . . . . 28
2.8 Maxeler MPC-X2000 architecture . . . . . . . . . . . . . . . . . . . . . . . 33
2.9 Benchmarking system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1 Modied FM-index structure . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 The FM-index size and buckets size for the Human genome . . . . . . . . 44
3.3 Hardware achitecture for exact string matching . . . . . . . . . . . . . . . 46
3.4 Count operation steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Base design performance estimate for the Maxeler Max4 DFE . . . . . . . 50
3.6 Construction of k-step FM-index with step size of 2 . . . . . . . . . . . . 52
3.7 The k-step FM-index size and buckets size for the Human genome . . . . 53
3.8 Precomputed intervals optimisation . . . . . . . . . . . . . . . . . . . . . . 55
3.9 Oversampled FM-index construction . . . . . . . . . . . . . . . . . . . . . 57
9
3.10 Optimisation procedure for the base design . . . . . . . . . . . . . . . . . 58
3.11 Performance estimates for algorithmic optimisations . . . . . . . . . . . . 59
3.12 Memory channel bandwidth for the Max4 DFE . . . . . . . . . . . . . . . 61
3.13 Performance of D1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.14 Strong-scaling and weak-scaling performance of D1 . . . . . . . . . . . . . 64
3.15 Performance of D2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.16 Performance of D3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.17 Performance of D4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1 Approximate matching using greedy and exhaustive searching . . . . . . . 74
4.2 Search phases for approximate matching with one edit . . . . . . . . . . . 75
4.3 Search phases for approximate matching with two edits . . . . . . . . . . 76
4.4 Hardware architecture for approximate string matching . . . . . . . . . . 79
4.5 Seed-and-compare optimisation . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6 Frequency distribution of CMLs . . . . . . . . . . . . . . . . . . . . . . . . 83
4.7 Static architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.8 Alignment breakdown for NGS dataset . . . . . . . . . . . . . . . . . . . . 87
4.9 Run-time recongurable architecture . . . . . . . . . . . . . . . . . . . . . 89
4.10 Approximate matching performance . . . . . . . . . . . . . . . . . . . . . 94
4.11 Pipeline of modules for the seed-and-compare optimisation . . . . . . . . 95
4.12 Performance of seed-and-compare optimisation (one mismatch) . . . . . . 96
4.13 Performance of seed-and-compare optimisation (two mismatches) . . . . . 96
4.14 Pipeline of modules for up to two mismatches . . . . . . . . . . . . . . . . 98
5.1 Reference-based compression example . . . . . . . . . . . . . . . . . . . . 105
5.2 Hardware architecture for reference-based compression . . . . . . . . . . . 108
5.3 FASTQ format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.4 FASTA format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5 Reverse complement optimisation for FASTQ les . . . . . . . . . . . . . 111
5.6 Merge tuples optimisation for FASTA les . . . . . . . . . . . . . . . . . . 112
10
5.7 Run-length encoding of the quality scores . . . . . . . . . . . . . . . . . . 113
5.8 fastqZip compression pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.9 fastaZip compression pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.10 Reference-based compression performance . . . . . . . . . . . . . . . . . . 118
5.11 Strong-scaling performance for reference-based compression . . . . . . . . 119
5.12 Compressed le composition . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11
List of Tables
2.1 Comaprison of the NVIDIA Titan X and Nallatech 510T accelerators . . 30
2.2 Summary of software short read alignment tools . . . . . . . . . . . . . . 35
2.3 Summary of previous eorts on accelerating short read alignment with
FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 Exact string matching designs evaluated . . . . . . . . . . . . . . . . . . . 60
3.2 Resource usage for Altera Stratix V FPGA . . . . . . . . . . . . . . . . . 62
3.3 State-of-the-art alignment tools evaluated against . . . . . . . . . . . . . . 68
3.4 Exact alignment performance . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.5 Device power and energy consumption . . . . . . . . . . . . . . . . . . . . 70
4.1 Approximate string matching designs evaluated . . . . . . . . . . . . . . . 91
4.2 Resource usage for Altera Stratix V FPGA . . . . . . . . . . . . . . . . . 92
4.3 State-of-the-art alignment tools evaluated against . . . . . . . . . . . . . . 97
4.4 Approximate alignment performance . . . . . . . . . . . . . . . . . . . . . 99
4.5 Device power and energy consumption . . . . . . . . . . . . . . . . . . . . 101
5.1 Compression tools evaluated for the FASTQ format . . . . . . . . . . . . 119
5.2 FASTQ compression performance . . . . . . . . . . . . . . . . . . . . . . . 120
5.3 Compression tools evaluated for the FASTA format . . . . . . . . . . . . . 122
5.4 FASTA compression performance . . . . . . . . . . . . . . . . . . . . . . . 123
12
Chapter 1
Introduction
Over the past 15 years, the rapid improvements in Next Generation Sequencing (NGS)
has revolutionised DNA sequencing. When compared to traditional Sanger sequencing,
NGS provides orders of magnitude more data at a much lower cost. For example, the
rst draft of the Human genome required over 13 years and 3 billion dollars using Sanger
sequencing [1]. In contrast, the latest NGS platforms can perform the same feat in less
than a day for approximately 1000 dollars [2]. With the continuous advancement of the
technologies which underpin NGS and the concurrent decrease in cost, NGS is poised to
transform biomedical research and the healthcare industry [3]. One such example is the
new-found ability to analyse the methylation pattern of DNA through Bisulte sequenc-
ing. This technique has been used in several life-saving clinical procedures including:
Non-invasive prenatal diagnosis. Testing a fetus for genetic disorders, such as
Down's Syndrome, traditionally requires an invasive procedure which has a high
risk of miscarriage. Since it was revealed that traces of fetal DNA can be found
in maternal plasma, new non-invasive tests have been developed which perform
Bisulte sequencing on DNA extracted from the mother's blood [4]. Such tests
can be performed much earlier in the pregnancy and have no risk of miscarriage.
Cancer diagnosis. Cancer cells often display aberrant DNA methylation patterns
such as hypermethylation within the promoter regions of tumor-suppressor genes.
13
A variety of tests which use Bisulte sequencing for cancer detection have been de-
veloped [5]. These tests can identify a single specic cancer type using biomarkers,
or provide a general cancer diagnosis.
The work presented in this thesis has been used to improve the eciency of these pro-
cedures through accelerating Bisulte sequencing [6]. This not only has the potential to
save lives through faster diagnosis and earlier treatment, but also reduces the burden on
hospital resources by improving patient turnaround times.
A major challenge impeding the adoption of NGS is the huge amounts of data pro-
duced, which is dicult to both store and analyse in a cost eective and ecient manner.
Figure 1.1 shows the increase in sequence data produced by the Illumina NGS platforms
over the last 11 years. It can be seen that this increase far outstrips the Moore's law ad-
vances in processor and disk technology. As a result, the major challenges facing DNA
sequencing today mostly derive from the informatics required to process the massive
amounts of data available [7].
Many of the bottlenecks present in the processing of sequence data can be classied as
string matching problems where a string is located in a much larger text. Although this
a well-characterised problem, its application to DNA sequencing presents many unique
opportunities and challenges. For example, the reduced alphabet size for DNA texts, the
size of the texts involved and the requirement for approximate string matching. Below
 
 
 
 
 
 
  
 
 
 
 
 
0.1
1
10
100
1000
10000
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
G
ig
ab
as
e
s 
/r
u
n
 
Year 
GA
GAIIx
Hiseq 2000
Hiseq X
Moore's Law
...CGTGACAGTACAGTTTAGGACCCACAGAAAAT... 
...CG 
...CG 
...CGT 
...CGTGA 
ACAGTACAG 
ACAGTACAGT 
CAGTACAGTTT 
AGTACAGTTTA 
TAGGATCCACA 
GGATCCACAG 
GGATCCA 
GATCCACAGA 
AAAAT... 
AAAA... 
AAA... 
A... 
reads 
reference 
...CGTGACAGTACAGTTTAGGACCCACAGAAAAT..
. 
reference 
 
CAGTACAGTTT → <15, 11> 
10 20 30 40 
TAGGATCCACA → <25, 5>T<31, 5>  
AGTACAGTTTA → <16, 11> 
 
Figure 1.1: Sequencing throughput of Illumina NGS platforms. The data in this graph
was acquired from http://dx.doi.org/10.6084/m9.figshare.100940 (2014).
14
we describe two applications of string matching in DNA sequencing analysis which have
been the subject of much research:
Short read alignment. In short read alignment the short sequences of DNA produced
by the NGS platform (also called reads) are mapped to positions in a reference
genome. Exact matches as well as approximate matches must be considered in
order to identify genetic variations in the sequenced DNA such as single nucleotide
polymorphisms (SNPs), insertions and deletions. A typical alignment job will map
hundreds-of-millions of reads to the Human genome with multiple mismatches and
gaps permitted. For Bisulte sequencing this process can take over 6 hours when
running on a system with 24 cores and 128GB of memory, and is the largest
bottleneck.
Reference-based compression. In reference-based compression, sequences of DNA
are encoded as a mapping to a reference genome. This strategy exploits the prop-
erty that interspecies DNA is highly similar, therefore it is much more ecient
to store the dierences rather than the full sequences. Although reference-based
compression provides a higher compression ratio than general purpose tools, such
as gzip and bzip2, the compression speed can be orders of magnitude slower as
mapping the sequences to the reference genome is computationally intensive.
Figure 1.2 shows a general analysis pipeline for DNA sequencing. Here, the DNA is
sequenced using an NGS platform and the raw data is transformed into a list of genetic
variants using a series of analysis steps. The most computationally intensive steps, i.e.
those with string matching, are excellent candidates for acceleration using special pur-
pose processors such as the Graphics Processing Unit (GPU) and Field-Programmable
Gate Array (FPGA). These devices are designed to speed up the computationally in-
tensive sections of applications through massively parallel execution resources and high
memory bandwidths. Accelerating these analysis steps could have far reaching con-
sequences for the researchers and clinical scientists engaged in DNA sequencing. For
example, accelerating reference-based compression would enable sequence data archives
15
Figure 1.2: A general NGS data analysis pipeline (source [9]). Abbreviations: CNV
copy-number variant; SNV, single-nucleotide variant; SAM/BAM sequence alignment
formats; VCF variant-call format.
to store and serve sequence data more eciently [8], and accelerating short read align-
ment would greatly reduce the analysis time, allowing faster diagnosis and treatment of
diseases.
In this work we accelerate the string matching problems found in DNA sequencing
with FPGAs. This device has had limited adoption in DNA sequencing due to its high
cost and diculty in its design development. However, recent advances have made them
a far more attractive accelerator candidate. First, large cloud computing companies,
such as Amazon and Microsoft, have started including FPGAs in their infrastructure
and providing access to them [10, 11]. This greatly lowers the barrier to adoption for
users and removes the expenses associated with housing and running them. Second, the
ease-of-development for FPGAs has greatly improved as they can now be programmed
using the OpenCL framework [12] and high-level synthesis (HLS) tools [13]. This allows
users to abstract away from the traditional hardware development ow for a much faster
much faster and higher level software development ow.
Our work focuses on accelerating the FM-index [14], a data structure used to solve the
computationally intensive string matching problems found in DNA sequencing analysis
16
such as short read alignment [15, 16, 17] and de novo assembly [18]. Since the FM-index
is so widely used, accelerating it has the potential to reduce many of the data processing
bottlenecks in DNA sequencing. Although there has been previous work on accelerating
the FM-index with FPGAs [19, 20], the following key challenges still remain:
Reducing the memory bottleneck (C1) - The performance of the FM-index search
algorithm is greatly impeded by the memory access pattern, which lacks both
spacial and temporal locality. Although there has been some work on addressing
this challenge for FPGAs, such as masking the memory access latency through
hardware multithreading [19], methods which can reduce the number of memory
accesses in the search algorithm [21] and allow the data read from memory to be
reused [22] remain unexplored.
Approximate string matching (C2) - Previous work has not presented a suitable
approach for accelerating approximate string matching which can achieve both
high sensitivity (the percentage of strings successfully matched) and high perfor-
mance. For example, in [19] mismatches are accounted for by replacing any char-
acters which cannot be matched with other possible values in the text alphabet
(a greedy search). This approach achieves a high performance through minimising
the search space, however the sensitivity is lower than current software tools which
use backtracking [23] and index-assisted dynamic programming [16].
Quality and scope (C3) - Previous work on accelerating the FM-index with FPGAs
has been specically applied to short read alignment [19, 20, 24, 25]. Although
these eorts report a good speed-up relative to software, it is often at the cost of
other performance metrics such as sensitivity. Bridging the quality gap to software
and accelerating other analysis applications are essential steps in demonstrating
the benets of FPGAs in this domain.
The aim of this work is to address each of these challenges and ultimately show how
FPGAs can be used to alleviate the data processing bottlenecks in DNA sequencing
analysis. The main contributions of this work are as follows:
17
1. We address challenge C1 by developing several novel methods for reducing the
memory bottleneck of the FM-index search algorithm on FPGAs. These meth-
ods include: a) customising the FM-index structure in order to better utilise the
available memory bandwidth of the FPGA platform and b) architectural and al-
gorithmic optimisations which can reduce the number of memory accesses in the
search algorithm by over 2 times.
2. We address challenge C2 by presenting a new approach for accelerating approxi-
mate string matching which uses the backtracking FM-index. This approach uses
specialised modules to approximately match strings to a text with a specic edit
type and distance. These modules are are mapped to the FPGA using a novel
run-time recongurable architecture which oers better resource utilisation, per-
formance and design modularity compared to a statically populated design. Using
this approach we are able to achieve a 4 times speed-up compared to state-of-the-
art alignment tools whilst having an identical sensitivity.
3. We address challenge C3 by adapting the FM-index search algorithm for reference-
based compression and presenting its rst acceleration with FPGAs. This accel-
erated design is integrated into fastqZip and fastaZip, two new tools that we have
developed for the compression of sequence data stored in the FASTQ and FASTA
formats respectively. We show that both tools are able to achieve a higher compres-
sion ratio than state-of-the-art software tools whilst also having faster compression
and decompression speeds. For example, fastqZip is able to achieve a higher com-
pression ratio than the best performing tool, fastqz, whilst the average compression
and decompression speeds are 25 and 43 times faster respectively.
The rest of this work is organised as follows: in Chapter 2 we provide background
information on the FM-index and related work; in Chapter 3 we accelerate exact string
matching using the FM-index and present several novel methods to reduce the mem-
ory bottleneck of the search algorithm; in Chapter 4 we accelerate approximate string
matching using the backtracking FM-index and show how both high sensitivity and high
18
performance can be achieved; in Chapter 5 we present the rst acceleration of reference-
based compression with FPGAs and show how this accelerated design can be integrated
into tools for the fast and eective compression of sequence data in the FASTQ and
FASTA formats; nally in Chapter 6 we conclude this work with thoughts about future
work.
Selected publications
The work presented in this thesis has been published in six conferences and journals.
In addition to this, the work on genomic data compression presented in Chapter 5 was
the winner of the Maxeler design competition held at the International Symposium on
Field-Programmable Custom Computing Machines 2017. Below we provide a list of
selected publications which are mostly based on the material in this work:
 J. Arram et al, \Leveraging FPGAs for Accelerating Short Read Alignment" in
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB),
[available online].
 J. Arram et al, \Ramethy: Recongurable Acceleration of Bisulte Sequence Align-
ment" in Proceedings of the 2015 International Symposium on Field-Programmable
Gate Arrays (FPGA), ACM/SIGDA, 2015, pp 250{259.
 J. Arram et al.\FPGA acceleration of reference-based compression for genomic
data" in Proceedings of the 2015 International Symposium on Field-Programmable
Technology (FPT), IEEE, 2015, pp 9{16.
19
Chapter 2
Background and related work
This chapter presents the background information that forms the basis of this work.
Section 2.1 describes indexed string matching and the FM-index. Section 2.2.1 gives a
brief overview of FPGA acceleration and the platform used in this work. Finally, Sec-
tion 2.3 discusses previous work on accelerating string matching problems with FPGAs
including indexing and seed-and-extend approaches.
2.1 Indexed String Matching
One method for classifying string matching algorithms is using preprocessing as a cri-
terion. For example, brute-force algorithms do not require either the strings or text
to be preprocessed, whereas nite-automata methods preprocess the strings. Indexing
involves preprocessing the text in order to perform fast string searching. This is partic-
ularly useful when: a) the text is so large that algorithms which are linear in time with
its length are impractical, b) the text changes infrequently so that the cost of building
the index can be amortised with a large number of searches, and c) there is enough
memory and suitable bandwidth to store the index and provide ecient access to it.
The string matching problems in DNA sequencing analysis satisfy these criteria; conse-
quently indexing has been successfully applied to many analysis steps such as short read
alignment [15, 16, 17] and de-novo assembly [26].
20
2.1.1 Notation and basic denitions
A string S is a sequence of characters s1s2:::sn, where each character is an element of
an alphabet . In this work, the elements of the string are zero-based so that the ith
character of S is at index Si 1 or S[i   1]. Zero-based numbering is chosen to simplify
the description for the FM-index construction and its search algorithm A substring of S
is written as Si;j = sisi+1; : : : ; sj . A prex is a substring of the form S0;j , and a sux is
of the form Si;n 1. If i > j, then a substring Si;j is an empty string written as  with
length jj = 0. Lexicographical order `<' is dened as follows. If x and y are characters
and P and Q are strings, then xP < yQ if x < y, or if x = y and P < Q.
2.1.2 Sux Tries, Trees and Arrays
A trie or digital tree [27] is a data structure which stores a set of strings (Figure 2.1).
Each node in the trie represents a prex in the set of strings or one of the strings itself
(in the case of a leaf node). The root node represents , and each string is terminated
with a unique symbol $ to ensure that a string is not the prex of any other string in
the set. Tries provide string searching in time proportional to the length of the string.
To search for a string S in a trie, the path from the root node with edges labelled with
the characters of S are followed. The search has two outcomes: either a node is reached
where there is no edge label with the same character as that in the string (S cannot be
found in the set), or a leaf node is reached (S can be found in the set).
A sux trie is a data structure which provides fast substring searching operations
(Figure 2.2). It is constructed by building a trie of all the suxes of a text. For a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
A 
N 
$ 
A $ 
B 
$ 
B 
A 
N 
A 
N 
A 
$ 
0 
$ 
A 
N 
$ 
6 
A 
$ N 
A 
$ 
3 
1 
5 
N 
A 
$ N 
A 
$ 
4 
2 
Figure 2.1: Trie of the strings: B, AN, and ANA.
21
text T with length jT j = n, the sux trie for T is a trie built over all the suxes
T0;n; T1;n; : : : ; Tn;n. Every occurrence of a string in a text is a substring of the text,
therefore it follows that an occurrence of S is a prex of a sux in the text. To search
for a string S in a sux trie, the path from the root node with edges labelled with the
characters of S are followed. The search operation has two outcomes: either a node
is reached where there is no edge label with the same character as that in the string
(S cannot be found in the text), or a node is reached (S can be found in the text). To
determine the number of times the string occurs in the text, a depth-rst traversal is
used to count the number of leaf nodes in the resulting subtree. As is the case for the
trie data structure, the sux trie enables substring searching in time linear in the string
length.
For a text T with length jT j = n, the sux trie has in the worst case O(n2) nodes;
consequently for large texts the space required to store this data structure can be very
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
A 
N 
$ 
A $ 
B 
$ 
B 
A 
N 
A 
N 
A 
$ 
0 
$ 
A 
N 
$ 
6 
A 
$ N 
A 
$ 
3 
1 
5 
N 
A 
$ N 
A 
$ 
4 
2 
Figure 2.2: Sux trie of the text BANANA.
22
  
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
       
6 5 3 1 0 4 2 
0 1 2 3 4 5 6 
0 1 2 3 4 5 6 
B A N A N A $ 
$ B A N A N A0 
A0 $ B A N A N0 
A1 N A $ B A N1 
A2 N A N A $ B0 
B0 A N A N A $ 
N0 A $ B A N A1 
N1 A N A $ B A2 
(0,7) 
(6,1) 
(1,1) 
(6,1) (2,2) 
(6,1) (4,3) 
(2,2) 
(4,3) (6,1) 
6 
5 
1 3 
0 
4 2 
T =  
A =  
BANANA$ 
sorted rotations 
last column 
ANNB$AA 
BWM 
Figure 2.3: Sux tree of the text BANANA.
large. Two approaches for reducing the number of nodes are: 1) coalescing unary paths
into a single edge with a sting label, and 2) storing the text and converting the edge
labels into (oset, length) pairs with respect to the text. A sux trie utilising these
approaches is dened as a sux tree (Figure 2.3). For a text of length n, the sux
tree contains n leaves and no unary paths, therefore it has O(n) nodes. The main issue
limiting the use of sux trees in DNA sequencing analysis applications is the number
of bytes required for each node. This value is implementation specic and is typically
larger than 10 bytes in practice [28]. A sux tree of the Human genome would therefore
require over 32GB of memory.
A sux array [29] is a permutation of all the suxes of a text such that all the
suxes are sorted in lexicographical order and are represented by their start positions
in the text (Figure 2.4). The sux array can be constructed by collecting the leaves
of the sux tree from left to right. This approach assumes that the edge labels of the
sux tree nodes are ordered lexicographically from left to right. It is more convenient,
however, to construct the sux array directly by sorting the suxes of the text [30]. By
comparing the sux tree and sux array it can be observed that the set of leaf nodes
in a subtree of the sux tree is a contiguous interval in the sux array. For example,
in Figure 2.3 the sux tree node representing `a' contains the leaf nodes [5, 3, 1]. These
leaf nodes are encompassed by the interval [1, 3] in the sux array. The sux array
along with the original text have enough information to enable string searching. To
23
  
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
       
6 5 3 1 0 4 2 
0 1 2 3 4 5 6 
0 1 2 3 4 5 6 
B A N A N A $ 
$ B A N A N A0 
A0 $ B A N A N0 
A1 N A $ B A N1 
A2 N A N A $ B0 
B0 A N A N A $ 
N0 A $ B A N A1 
N1 A N A $ B A2 
(0,7) 
(6,1) 
(1,1) 
(6,1) (2,2) 
(6,1) (4,3) 
(2,2) 
(4,3) (6,1) 
6 
5 
1 3 
0 
4 2 
T =  
A =  
BANANA$ 
sorted rotations 
last column 
ANNB$AA 
BWM L F 
Figure 2.4: Sux array of the text BANANA. The interval covering the suxes which
start with `A' are highlighted.
search for a string S with length jSj = m, two binary searches are performed on the
sux array A. One binary search gives the rst position where the suxes in A have
S as a prex. The other binary search gives the last position where the suxes in A
have S as a prex. In each step of the binary search a lexicographical comparison is
performed which has in the worst case O(m) time. As a result, the search operation has
in the worst case O(m log n) time. This can be improved to O(m + log n) by storing
additional information to reduce the number of comparisons performed.
2.1.3 Burrows-Wheeler Transform
The Burrows-Wheeler Transform (BWT) [31] is a reversible permutation of the charac-
ters of a text originally used for compression. The BWT of a text T is constructed using
the following steps (Figure 2.5):
1. Terminate T with the $ symbol. This symbol is lexicographically smaller than all
the other characters in the text alphabet.
2. Generate all the distinct rotations of T and sort them lexicographically. The result
is a matrix referred to as the Burrows-Wheeler Matrix (BWM), where every row
of the BWM is a rotation of T .
3. The BWT is the characters in the last column of the BWM extracted from top to
bottom.
24
  
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
       
6 5 3 1 0 4 2 
0 1 2 3 4 5 6 
0 1 2 3 4 5 6 
B A N A N A $ 
$ B A N A N A0 
A0 $ B A N A N0 
A1 N A $ B A N1 
A2 N A N A $ B0 
B0 A N A N A $ 
N0 A $ B A N A1 
N1 A N A $ B A2 
(0,7) 
(6,1) 
(1,1) 
(6,1) (2,2) 
(6,1) (4,3) 
(2,2) 
(4,3) (6,1) 
6 
5 
1 3 
0 
4 2 
T =  
A =  
BANANA$ 
sorted rotations 
last column 
ANNB$AA 
BWM F L 
Figure 2.5: Constructing the BWT of the text BANANA. Rank information for the rst
and last column of the BWM is included.
The BWT can be understood as a permutation of the text where the characters have
been sorted according to their right-context, i.e. the characters that follow them. This
right context sorting groups identical characters together, which can compressed using
a variety of strategies including move-to-front transformations, run-length encoding,
Human coding, etc. Since T is terminated with the $ symbol, the rotations in the
BWM are in the same order as the suxes in the sux array. The BWT can therefore
be dened in terms of the sux array A:
BWT[i] =
8><>:
T [Ai   1] if Ai > 0
$ if Ai = 0
(2.1)
A key property of the BWM is Last-to-First Mapping (LF Mapping), which enables
the BWT to be reversed and used as an index. Since the characters in the text are sorted
according to their right context, the ith occurrence of a character c in the rst and last
columns of the BWM correspond to the same occurrence in the text. In other words, if
the occurrences of c in T are ranked, then the ranks will appear in the same order in the
rst and last columns of the BWM. The LF Mapping operation is used to determine the
position where a character in the last column (L) of the BWM appears in the rst column
(F ). Let c be a character which appears in L at position i. To determine its position in
25
F , the result from two functions, Count and Occ, are summed (Equation 2.2). Count(c)
returns the number of characters in the text which are lexicographically smaller than c.
The rows in F are lexicographically sorted, therefore Count(c) gives the position in F
where c rst occurs. Occ(c; i) returns the number of occurrences of c that are counted in
L from positions 0 to i. The ranks of c appear in the same order in L and F , therefore
Occ gives the number of rows in F which have c as prex but a smaller rank.
F (i) = Count(L[i]) +Occ(L[i]; i) (2.2)
2.1.4 FM-index
The FM-index [14] is a full-text compressed index built upon the BWT. It comprises
two arrays which store the precomputed values of the Count and Occ functions used in
the LF Mapping operation (Figure 2.6). For a text with length jT j = n and an alphabet
 (excluding the $ symbol), the Count array has O(jj) space, and the Occ array has
O(njj) space. In most implementations of the FM-index, e.g. [16, 23], the size of the
Occ array is reduced by only storing the positions which are a multiple of a sample
distance d. If Occ0 is the sampled array, then Occ0[c][i] = Occ[c][i  d]. This reduces the
size of the Occ array by a factor of d, however the cost is that the BWT of the text must
be additionally stored in order to reconstruct the omitted values.
To search for a string S using the FM-index, a backward search is performed (Algo-
rithm 1). First, an interval [low, high) is dened which gives the positions in the rst
column of the BWM (and the sux array) that have S as prex. low gives the rst
position where S rst occurs and high gives the position following the one where S last
occurs. These values are initialised to the rst and last positions of the Occ array re-
spectively. Next, for each character in S, moving from the last character to the rst, the
interval is updated using the LF Mapping operation. After the nal update the interval
is converted into positions in the text using the sux array. In most implementations,
the size of the sux array is reduced by sampling the values in the same way as the
Occ array. If the sampled sux array values are a constant number of positions apart
26
    
 
 
 
 
 
 
 
 
 
  
  
 
  
Step 1 
Symbol = A 
[0, 7) → [1, 4) 
Step 2 
Symbol = N 
[1, 4) → [5, 7) 
Step 3 
Symbol = A 
[5, 7) → [2, 4) 
Convert 
A → T 
[2, 4) = 3, 1 
 
 A B N 
0 0 0 0 
1 1 0 0 
2 1 0 1 
3 1 0 2 
4 1 1 2 
5 1 1 2 
6 2 1 2 
7 3 1 2 
A B N 
1 4 5 
Count array Occ array 
Substring = ANA 
Text = BANANA$ 
ANNB$AA 
FM-index 
Figure 2.6: FM-index of the text BANANA.
in the text, then each conversion can be performed in constant time. Figure 2.7 shows
a worked example of the FM-index search algorithm.
From Algorithm 1 it can be seen that in each step of the for-loop the previous values
of low and high are required to compute the updated values. This dependency between
steps limits the impact of performance optimisations such as loop pipelining and loop
unrolling. Another factor which reduces the achievable performance is that due to the
sorted nature of the BWT, the access pattern for the Occ array is random and lacks both
spatial and temporal locality. This limits the impact of memory optimisations, such as
Algorithm 1 FM-index search algorithm.
Input: string S, text T , sux array A, the Count array and the Occ array
Output: positions in text where S occurs
1: [low; high) [0; jT j) . initialise interval
2: for i jSj   1 to 0 step -1 do . update interval for each character in S
3: low  Count(S[i]) +Occ(S[i]; low)
4: high Count(S[i]) +Occ(S[i]; high)
5: if low  high then
6: end
7: for i low to high step 1 do . convert interval to positions in text
8: positions A[i] . assuming sux array A is unsampled
27
    
 
 
 
 
 
 
 
 
 
  
  
 
  
Step 1 
Symbol = A 
[0, 7) → [1, 4) 
Step 2 
Symbol = N 
[1, 4) → [5, 7) 
Step 3 
Symbol = A 
[5, 7) → [2, 4) 
Convert 
A → T 
[2, 4) = 3, 1 
 
 A B N 
0 0 0 0 
1 1 0 0 
2 1 0 1 
3 1 0 2 
4 1 1 2 
5 1 1 2 
6 2 1 2 
7 3 1 2 
A B N 
1 4 5 
Count array Occ array 
Substring = ANA 
Text = BANANA$ 
ANNB$AA 
FM-index 
Figure 2.7: FM-index search algorithm example
Example of searching for the string ANA in the text BANANA using the FM-index.
caching, and results in most implementations being memory-bound. In order to improve
the performance, algorithmic optimisations have been proposed which aim to reduce
the number of memory accesses in the search algorithm and allow a higher achievable
memory bandwidth. Examples include the k-step FM-index [21], which allows multiple
characters to be matched in each step, and a sorting approach [32], which makes the
memory access pattern sequential at the cost of sorting the intermediate results after
each step.
The FM-index search operation can be extended with backtracking to support ap-
proximate matching. In this approach, when a character in the string cannot be matched
to the text, the search backtracks to a previously matched character and introduces an
edit. These edits can be mismatches, where a character in the string is substituted for
another; insertions, where a character is added to the string; and deletions, where a
character is removed from the string. The closeness of a match is quantied by the edit
distance. This is the number of edits required to transform one string into another. As
the permitted edit distance increases, the amount of backtracking grows exponentially.
Excessive backtracking greatly reduces the performance of the search algorithm; conse-
quently all previous eorts have focused on accelerating a greedy search where edits are
only considered when a character cannot be matched [19, 20]. This minimises the search
space, however the cost is that not all the approximate matches are discovered which
28
limits the sensitivity (the percentage of strings that are sucessfully matched). In [33], a
bi-directional search is proposed to reduce the amount of backtracking in an exhaustive
search. By allowing searches in both a forward and backward direction, a large fraction
of the search space can be pruned by constraining the edit positions.
2.2 Application Acceleration
Accelerators are special purpose processors which are designed to speed-up the compute-
intensive sections of applications. Examples of accelerators include Graphics Processing
Units (GPUs), Field-Programmable Gate Arrays (FPGAs) and Digital Signal Processors
(DSPs). Each type of accelerator has dierent characteristics which make them better
suited to perform certain tasks. For example, the main advantage of GPUs is their large
number of parallel execution resources and high memory bandwidths. On the other
hand, FPGAs provide a highly customisable hardware fabric which enables both massive
parallelism and low latency. Table 2.1 compares two of the latest GPU and FPGA
accelerators. It can be seen that for applications with intensive oating point operations,
the GPU oers the best performance. However, for applications which operate on huge
data sets with complex access patterns, the FPGA can provide superior performance.
2.2.1 FPGA acceleration
FPGAs are pre-fabricated integrated circuits which can be programmed after manufac-
ture. The FPGA fabric is composed of:
 Recongurable lookup tables (LUTs) which implement combinational logic. The
output of a LUT can optionally pass through a ip-op (FF). The LUTs and FFs
are grouped into slices, which are in turn grouped into larger logic blocks.
 Programmable interconnects which route signals between the logic blocks and other
embedded blocks.
 Congurable I/O pins for transferring data to/from the FPGA device.
29
 Independently addressable multi-ported Block RAMs (BRAMs).
 Other embedded blocks such as DSPs.
These components are programmed to perform a particular algorithm or computation.
The design to be implemented is typically coded using a hardware description language
(HDL), such as VHDL and Verilog, and is mapped into an FPGA conguration us-
ing a vendor-specic place and route tool. FPGAs support run-time reconguration,
which is the ability to switch between congurations in milliseconds. This feature allows
developers to change the functionality of the FPGA on-the-y.
FPGAs have been used to accelerate a wide variety of applications from computa-
tional nance [36] to climate modelling [37]. By ooading the complex and intensive
computations to the FPGA, a speed-up of 10 to 100 times can typically be reached.
Moreover, FPGAs have a much lower power consumptions than CPU and GPUs, there-
fore the running costs for data centres are signicantly lower. There are, however, a
number of limitations which have impeded the adoption of FPGAs. The most pertinent
of these are the reduced developer productivity and high cost relative to other accel-
erator devices. Recent advances have gone a long way in minimising these limitations,
making FPGAs a more attractive accelerator candidate. First, large cloud computing
companies, such as Amazon and Microsoft, have started including FPGAs in their in-
frastructure and providing access to them [10, 11]. This greatly lowers the barrier to
adoption for users and removes the expenses associated with housing and running them.
NVIDIA Titan X Nallatech 510T
Technology GPU FPGA
Feature size (nm) 16 20
Max TFLOPS 10.2 2.8
Memory size (GB) 12 32
Memory channels 6 10
Memory bandwidth (GB/s) 480 290
On-chip memory size (MB) 3 53
Power (W) 250 50
Price ($) 1200 >10,000
Table 2.1: Comparison of the NVIDIA Titan X [34] and Nallatech 510T [35] accelerators.
30
Second, the ease-of-development for FPGAs has greatly improved as they can now be
programmed using the OpenCL framework [12] and high-level synthesis (HLS) tools [13].
This allows users to abstract away from the traditional hardware development ow for
a much faster much faster and higher level software development ow.
In the following use-cases we discuss how FPGAs can be leveraged to solve some of
the important problems faced in DNA sequencing:
Use-case: Short read alignment
Short read alignment is the rst analysis step performed on the sequence data after it has
been acquired and preprocessed. A typical alignment job will map hundreds-of-millions
of reads to the Human genome with multiple mismatches and gaps permitted. This step
is more often than not the computational bottleneck of DNA sequencing analysis due to
the large amount data processed. For example, in Bisulte sequencing the analysis step
can take over 6 hours, which is approximately 50% of the total analysis time. In Chap-
ter 4 we develop a hardware-accelerated alignment tool that is up to 5 times faster than
state-of-the-art software tools for up to two mismatches. From a user perspective this
speed-up greatly reduces the total analysis time, allowing for faster research and clinical
testing. A practical way of targeting FPGAs for short read alignment would be using
an algorithm similar to that in Soap3-dp [15]. Here, the FM-index is used for matching
strings with up to two mismatches, then for larger edit distances (including gaps) index-
assisted dynamic programming is used. The steps of this algorithm can be split into
separate FPGA congurations (exact match, one mismatch, two mismatches, seed and
Smith-Waterman), which are dynamically loaded using run-time reconguration. This
modular design allows the user to control the alignment parameters without having to
modify the software or hardware. For example, if the user only requires alignment with
up to one mismatch, then a simple change to the command line argument would enable
the reads to be processed by the exact match and one mismatch congurations in turn.
If the user requires a more bespoke alignment algorithm then changes to the software
and/or hardware would need to be made. In the best-case there already exist FPGA
31
congurations for each of the algorithm steps, therefore the user would only need to
make changes to the software interface. We have shown this can be achieved on the
Maxeler Platform using python scripts which translate a high-level description of the
alignment algorithm to the corresponding software interface. In the worst-case there are
FPGA congurations which do not exist for some of the alignment steps, therefore the
user would need to implement these themselves (using RTL, HLS tools, MaxJ, etc). In
this use-case the FPGAs could be hosted in-house, however it is far more economical
to use a cloud-based platform such as Amazon F1. This also allows integration of the
hardware-accelerated alignment tool with existing cloud-based genomics infrastructures,
such as Illumina Base-space, so that user data can be accessed and analysed seamlessly.
Use-case: Sequence data compression
In recent years the cost of storing genomic data has been an increasingly growing prob-
lem among researchers and industry. Large-scale genomics projects, such as the Million
Veterans Program [38], are estimated to produce hundreds of Petabytes of genomic data,
which will cost millions of dollars to store per year based on current pricing models. In
Chapter 5 we show how FPGAs can be used to eciently compress sequence data in the
FASTQ format (the most accumulated type of data archived). We develop a hardware-
accelerated tool, fastqZip, which can achieve a compression ratio 2 times higher than
gzip (the de-facto standard compression tool used for this data) and has faster com-
pression and decompression speeds. From a user perspective this can halve the costs
associated with storing sequence data and increase productivity through more ecient
le access. fastqZip could be oered to users as a stand-alone application, however a
more practical solution would be integrating it directly into the NGS platform infras-
tructure. For instance, in Illumina's NGS workow fastqZip could replace gzip when
converting the .bcl format to FASTQ. To further improve user productivity, transparent
compression/decompression can be supported using a FUSE le system. As noted in
the use-case for short read alignment, it is far more economical to target FPGAs in the
cloud rather than hosting them in-house. This also has the benet of interfacing easily
32
with existing cloud-based genomics infrastructures such as Illumina Base-space (which
also uses Amazon to host both their storage and compute resources).
2.2.2 FPGA accelerator platforms
There exist several FPGA accelerator platform vendors including Maxeler, Nallatech,
AlphaData, Pico Computing and Convey. Each vendor oers a number of systems, the
most popular being PCIe cards containing an FPGA and some o-chip memory. At
the time of writing one of the most advanced FPGA accelerators is the Nallatech 510T,
which comprises an Altera Aria10 FPGA (20nm feature size) with up to 32 GB of DDR4
memory. This can provide an impressive 2.8 TFLOPs of performance and 290 GB/s of
memory bandwidth. Another example is the Alpha Data ADM-9V3 [39] which comprises
a Xilinx UltraScale Plus FPGA (16nm feature size) and up to 32 GB of DRAM.
The designs presented in this work are implemented exclusively on the Maxeler Max4
Platform, however the concepts and ideas can be applied to any platform. The Maxeler
system targeted is the MPC-X2000 [40] (Figure 2.8), which supports up to 8 accelerator
boards referred to as dataow engines (DFEs). Each DFE comprises a single Altera
Stratix V FPGA (28nm feature size) with up to 96GB of DRAM. Figure 2.9 shows the
system used to evaluate both the software and hardware designs developed in this work.
The system comprises a MPC-X2000 populated with 8 DFEs which is connected to a
 
Figure 2.8: Maxeler MPC-X2000 architecture. (Source: [40])
33
  
 
  
 
 
 
DFE 1 
500MB/s 
SSD 
60GB/s MEM 
48GB 
FPGA 
infiniband 
1.5GB/s 
∙2x Intel Xeon E5-2640 
∙64 GB RAM                     
 
Host 
MPC-X2000 
x8 
Figure 2.9: The system used to evaluate the software and hardware designs developed
in this work.
host machine via inniband. Each DFE has 48GB of DRAM available, which can be
accessed using a single channel of 384 bytes, or 6 independent channels of 64 bytes.
The peak o-chip memory bandwidth per DFE is 60GB/s. The host machine contains
dual Intel Xeon E5-2640 CPUs (32nm feature size), 64GB of RAM, and a single 500GB
Crucial MX200 SSD with read and write speeds of 500MB/s.
Maxeler designs comprise a Kernel, which performs the computation, and a Manager,
which orchestrates the ow of data between the host and DFE in the form of streams.
The Kernel and Manager are implemented using the MaxJ language (an extension of
the Java language). The MaxCompiler then compiles the Kernel and Manager into an
FPGA conguration and makes it available for use in a host application. The Max-
eler Platform has been used to successfully accelerate a wide number of applications
across many dierent domains including machine learning, scientic computing, imag-
ing, and genomics. Many of the accelerated applications are available on the Maxeler
app gallery [41], an ecosystem for sharing and distributing work.
2.3 Related Work
The majority of the previous work on accelerating DNA sequencing analysis with FPGAs
has focused on short read alignment. In this step the short sequences of DNA (also
called reads) are mapped to positions in a reference genome. Exact matches as well
34
as approximate matches must be considered in order to identify genetic variations in
the sequenced DNA such as single nucleotide polymorphisms (SNPs), insertions and
deletions.
There exist several popular CPU-based alignment tools including BWA [42], Soap2 [23]
and Bowtie2 [16]. These tools use the FM-index to perform string matching and are in
some cases extended with dynamic programming algorithms to eectively handle large
numbers of mismatches and gaps in the alignment. The massive parallelism available on
GPUs has been used to accelerate short read alignment. One such example is Soap3 [43]
which uses a GPU to accelerate the FM-index. The latest version of this tool is Soap3-
dp [15], which is able to better handle large numbers of mismatches and gaps using an
index-assisted dynamic programming stage. Both of these tools are reported to be up
to an order of magnitude faster than the equivalent CPU-based tool Soap2. Table 2.2
summarises the features of the state-of-the-art software short read alignment tools used
in evaluating our work.
Much of the work on accelerating short read alignment with FPGAs targets dynamic
programming algorithms such as Smith-Waterman local alignment [44]. This algorithm
builds a scoring matrix from the string and a region of the text. The maximum value in
the matrix indicates whether the string matches within a given threshold, and the opti-
mal alignment is reconstructed using a traceback step. Notable examples of accelerating
this algorithm include [45], where a variety of cell designs are explored for systolic array
Tool Version Device Algorithm Gaps Approximate matching strategy
Soap2 [23] 2.21 CPU FM-index % Backtracking FM-index.
BWA-aln [42] 0.7.5 CPU FM-index " Backtracking FM-index.
Bowtie2 [42] 2.1.0 CPU Seed-and-extend " FM-index is used for seed location. The
matching positions are then extended us-
ing Smith-Waterman local alignment.
Soap3-dp [15] 2.3.172 GPU Backtracking &
Seed-and-extend
" The backtracking FM-index is used up to
two mismatches. For larger edit distances
the FM-index is used for seed location,
then the matching positions are extended
using Smith-Waterman local alignment.
Table 2.2: Summary of the software short read alignment tools used in evaluating our
work. The gaps column refers to whether insertions and deletions are permitted.
35
architectures, and [46] where a systolic array architecture is written in OpenCL and can
achieve 77 GCUPS on a single Xilinx Virtex-7 690T FPGA.
Algorithms based on a seed-and-extend approach have also been accelerated with
FPGAs. In this approach the reads are split into shorter sequences called seeds, which
are matched to the reference genome. The reads are then compared to the reference
genome at each matching location using Smith-Waterman local alignment. In [47] a
modied hash table is used to perform the seed lookups. The design is partitioned across
8 Pico M-503 boards, and can align 54 million reads of 76 characters in 34 seconds. In [48]
the seed lookup is improved by introducing a variable length hash function. This hash
function improves the performance by equalising the number of the candidates in each
hash slot and minimising the number of the key comparisons. The main drawback of
both designs is that the traceback step is not performed during Smith-Waterman local
alignment. As a result, these designs can conrm whether the string can be matched
within a given edit distance threshold, however they are unable to indicate the optimal
alignment (the edit operations and their locations).
Besides dynamic programming and seed-and-extend approaches, other string match-
ing algorithms have been accelerated with FPGAs. For instance, in [49] a hash table is
used to perform short read alignment. Their design comprises 2 components: a software
program to pre-process the reference genome into a hash table, and a hardware pipeline
for performing fast lookups. To reduce the hash table size, a minimal perfect hash table
is used. The design is implemented on a Convey HC-1, and can exactly align 350 million
reads of 100 characters in 1 second. The limitation of this design is that the read length
must be xed at 100 characters and it cannot support approximate string matching.
In [50] a direct comparison approach is used. The reads are stored on the FPGA in
registers and the reference genome is streamed one character at a time per stream using
shift registers. The main challenge faced in this approach is trying to t as many strings
on the FPGA as possible in order to minimise the number of times the reference genome
must be streamed.
There has been relatively little work which focuses on accelerating the FM-index
36
with FPGAs. In [24], the Count and the Occ arrays are stored directly on the FPGA
using on-chip BRAMs. The challenge here is that for large reference genomes there are
not enough BRAMs available to store the Occ array. The solution proposed is to divide
the reference genome into sections and build multiple indices, however it is found that
the cost of switching the index makes this impractical. This design is implemented on
a Xilinx Virtex 6 LX760 FPGA and can exactly align 1000 reads of 72 characters in
0.06 ms. In [19] the Occ array is moved o-chip to DRAM and approximate matching
is supported with up to two mismatches. The long access time of DRAM is mitigated
by using a multithreaded approach which allows reads to be processed whilst others
are waiting for memory. To support approximate matching the exact match engine is
replicated. For every n allowed mismatches there are n+ 1 exact match engines. When
a mismatch is encountered in an engine, the read is sent to the next engine in line and
the mismatch character is replaced by other possible characters in the reference genome
alphabet. This design is implemented on the Convey HC-1 hardware platform, and can
match 18 million reads of 101 characters in 138 seconds. The achievable sensitivity is
not reported, however we assume it must be lower than that of software alignment tools
given that a greedy search is used. In [20] the sensitivity of searches with two mismatches
is increased. The read is split in half and two additional exact match engines are used to
match them against the reference genome starting from the centre and working outwards.
Finally, in [25], the Occ array is sampled, reducing the FM-index size. In this design
the FPGA contains a module which supports exact string matching. If a character in
the string cannot be matched, the string is sent back to the host which modies one or
more characters and sends it back to the FPGA for processing.
Table 2.3 summarises the notable previous eorts on accelerating short read align-
ment with FPGAs. This work builds upon the previous FM-index designs in the following
ways:
1. We present three new methods for reducing the memory bottleneck of the FM-
index search algorithm on FPGAs. These methods include customising the FM-
index structure according to the memory architecture of the FPGA platform, and
37
Design FPGA Algorithm Approximate Read Reference CMPS
platform matching length length (G) (M)
[48] Xilinx Virtex 7 Seed-and-extend Mismatches & gaps 100 3.1 47
[47] 8x Pico M-503 Seed-and-extend Mismatches & gaps 76 3.1 120
[49] Convey HC-1 Hash table Exact match only 100 3.1 35,000
[50] Xilinx Virtex 5 Direct Comparison Three mismatches 36 3.1 0.003
[24] Xilinx Virtex 6 FM-index Exact match only 108 0.49 2
[19, 20] Convey HC-1 FM-index Two mismatches 101 0.11 13
[25] Pico M-505 FM-index
Implements BWA's
bwt match exact alt
{ { {
Table 2.3: Summary of previous eorts on accelerating short read alignment with FP-
GAs. The performance of each design is normalised using the metric Characters Matched
Per Second (CMPS). Note that the datasets used in the performance evaluations dif-
fer, as well as the edit distance permitted and what is included in the execution time
measurement. Only a speed-up relative to BWA is given for the design in [25].
architectural and algorithmic optimisations, such as the k-step FM-index, which
reduce the number of memory accesses in the search algorithm.
2. We present a new approach for approximate string matching which can achieve
both high sensitivity and high performance. This approach uses specialised ap-
proximate string matching modules (ASMs), rather than using only exact string
matching modules like in previous work. The key advantages over previous eorts
are: 1) the ASMs use an exhaustive search which enables a higher sensitivity than
the designs in [19, 20]; and 2) unlike the design in [25], the ASMs do not have to
send the strings back to the host each time an edit operation is performed.
3. We not only apply our work to short read alignment, but also present the rst
acceleration of reference-based compression with FPGAs. We show that not only
is our work faster than the state-of-the-art software tools, but can also achieve a
higher compression ratio.
2.4 Summary
In this chapter we have presented the background information that forms the basis of
this work such as the FM-index and FPGA acceleration. We have also discussed the
38
features and limitations of related work and show where our work positions itself in
relation to it. The following chapters form the main contributions of this work: 1) in
Chapter 3 we accelerate exact string matching using the FM-index and present several
novel methods to reduce the memory bottleneck of the search algorithm; 2) in Chapter
4 we accelerate approximate string matching using the backtracking FM-index and show
how both high sensitivity and high performance can be achieved; and 3) in Chapter 5 we
present the rst acceleration of reference-based compression with FPGAs and show how
this accelerated design can be integrated into tools for the fast and eective compression
of sequence data in the FASTQ and FASTA formats.
39
Chapter 3
Exact string matching
Exact string matching is an important component of many DNA sequencing analysis
applications. For example, in short read alignment the majority of sequences can be
matched exactly to the reference genome due to the similarity of interspecies DNA.
As a result, exact string matching forms the bulk of the computation, with only 20-
30% of the data requiring approximate string matching. The FM-index is a popular
method for exact string matching and is widely used in software tools for short read
alignment [15, 16, 17] and de-novo assembly [18]. Its main drawback, however, is that
its performance is greatly impeded by the memory access pattern of the search algorithm,
which lacks both spacial and temporal locality. In this chapter we accelerate the FM-
index with FPGAs. Our main contribution is addressing challenge C1 through three
novel methods:
1. The FM-index structure is customised according to the memory architecture of the
FPGA platform. This allows the FM-index of large texts to t in memory without
splitting and better utilises the available memory bandwidth.
2. The hardware architecture is optimised so that the available memory bandwidth
is saturated and the data read from memory can be reused.
3. Algorithmic optimisations which reduce the number of memory accesses in the
search algorithm are developed .
40
The rest of this chapter is organised as follows: in Section 3.1 we present a base
design for accelerating the FM-index with FPGAs; in Section 3.2, we present several
algorithmic optimisations which reduce the number of memory accesses in the search
algorithm; and in Section 3.3 we evaluate the performance of our exact string matching
designs and compare them to state-of-the-art CPU, GPU, and FPGA-based short read
alignment tools.
3.1 Base design
In this section we present a base design for accelerating the FM-index with FPGAs. The
key features of this design are:
 An FM-index structure which is customised according to the amount of memory
available on the FPGA platform and the memory burst size (Section 3.1.1).
 A hardware architecture which performs the FM-index search algorithm (Sec-
tion 3.1.2).
 A performance model for the hardware architecture (Section 3.1.3). This model is
used in our evaluation to determine the optimal FM-index structure for the target
FPGA platform and verify the performance of our implementations.
3.1.1 FM-index structure
As discussed in Section 2.1.4, the FM-index comprises two arrays which store the pre-
computed values of the Count and Occ functions. For a text T with length jT j = n and
an alphabet T , the Count array has O(jT j) space and the Occ array has O(njT j)
space. The Count array can be stored on the FPGA in BRAMs, however the Occ ar-
ray is often several gigabytes or more in size, and therefore must be stored in o-chip
memory. In previous eorts, such as [19, 20], the full Occ array is stored in DRAM.
Although this approach simplies the hardware architecture for the search algorithm,
the performance can be limited in the following ways:
41
 The size of the Occ array may exceed the amount of memory on the FPGA plat-
form. For example, the Occ array for the full Human genome is 50GB, which is
larger than the amount of memory available on most o-the-shelf FPGA platforms.
In [24], the authors propose splitting the text into sections and building multiple
FM-indices. This, however, reduces the performance as the search algorithm must
be performed separately on each FM-index.
 In cases where the FM-index is stored in o-chip memory, the Occ values are
typically much smaller than the memory burst size, therefore there can be a large
amount of bandwidth waste. For example, the FM-index of the Human genome
requires the Occ values are stored using 32 bit unsigned integers (4 bytes). If the
memory burst size of the FPGA platform is 64 bytes (as is the case for DDR3/4),
then 94% of the data read from memory is never used. The memory access pattern
of the search algorithm lacks spacial locality, therefore attempting to reuse the
adjacent data through caching has no benet.
In this work we propose an FM-index structure which can be customised according
to the amount of memory on the FPGA platform and the memory burst size. This index
structure has three features which distinguish it from previous eorts:
1. The Occ array is sampled so that only the positions which are a multiple of a sample
distance d are stored. If Occ0 is the sampled array, then Occ0[c][i] = Occ[c][i  d].
This reduces the size of the Occ array by a factor of d. The cost, however, is
that the BWT of the text must be additionally stored in order to reconstruct the
omitted values. This cost can be mitigated by encoding the BWT characters with
dlog2(jT j+ 1)e bits instead of 8 bit ASCII values. The hardware architecture can
be customised to directly operate on the compressed BWT characters, whereas in
software the characters must be decompressed rst.
2. The Count array values are added to the corresponding Occ values so that
Occ0[c][i] = Count[c] +Occ0[c][i]. This removes two additions from each step in the
search algorithm.
42
3. Buckets are created by grouping the jT j values at each position in the sampled Occ
array with the corresponding section of the BWT. For example, bucket i contains
the jT j values from Occ0[:][i] and the compressed BWT characters from position
i  d to (i  d) + d   1. This improves the spatial locality of memory accesses, as
the data required to compute the occurrence are adjacent in memory. Figure 3.1
shows the proposed FM-index structure.
 
 
 
 
 
  
 
 
 
 
 
  
 
A B N 
1 4 5 
 BWT A B N  
0 A 0 0 0 
1 N 1 0 0 
2 N 1 0 1 
3 B 1 0 2 
4 $ 1 1 2 
5 A 1 1 2 
6 A 2 1 2 
7  3 1 2 
A 1 
B 4 
N 5 
A N N B 
A 2 
B 5 
N 7 
$ A A       
Count array 
Occ array 
Occ array 
Occ vals 
Occ vals 
BWT 
BWT 
Bucket 0 
Bucket 1 
Figure 3.1: FM-index structure for the text BANANA. In this example the sample
distance is 4.
The bucket size (B) is given by Equation 3.1. The rst term is the jT j Occ values
(each stored using 4 bytes), the second term is the d compressed BWT characters, and
the nal term p is optional padding. The FM-index size (I) is given by Equation 3.2.
This is the number of buckets multiplied by the size of each bucket.
B = (4  jT j) +
&
d  dlog2(jT j+ 1)e
8
'
+ p Bytes (3.1)
I =

n
d

B Bytes (3.2)
The FM-index can be customised according to the amount of memory available on the
FPGA platform (Equation 3.3). This memory can be on-chip memory, such as BRAM,
or o-chip memory such as DRAM. In cases where the memory type has burst access,
43
the FM-index can be customised according to the memory burst size (Equation 3.4).
These customisations allow the FM-index to t in memory without splitting the text
and ensures the buckets are memory aligned. In the case where the bucket padding
is zero, the memory bandwidth is fully utilised. The graphs in Figure 3.2 show the
FM-index size and bucket size for the Human genome when used in DNA and Bisulte
sequence alignment. For Bisulte sequence alignment the text alphabet size is reduced
to 3 (all `C' characters in the genome are converted to `T'), therefore the FM-index size
and bucket size are smaller relative to standard DNA texts.
I  memory capacity (Bytes) (3.3)
B = x memory burst size (Bytes) where x 2 Z : x > 0 (3.4)
The search algorithm for this FM-index structure is shown in Algorithm 2. For a
string S with length jSj = m, the search algorithm has m steps. In each step, the
interval [low; high) is updated for a single character in the string using two buckets from
the FM-index. The update procedure comprises a select and count operation, which are
performed on the buckets. The select operation selects the Occ value corresponding to
the string character, whilst the count operation counts the number of occurrences of the
 
 
 
0
0.5
1
1.5
2
2.5
3
32 64 128 256 512
FM
-i
n
d
ex
 s
iz
e 
(G
B
) 
Sample distance 
DNA
Bisulfite
0
50
100
150
200
250
32 64 128 256 512
B
u
ck
et
 s
iz
e 
(B
yt
es
) 
Sample distance 
DNA
Bisulfite
a)
 
 
 
0
0.5
1
1.5
2
2.5
3
32 64 128 256 512
FM
-i
n
d
ex
 s
iz
e 
(G
B
) 
Sample distance 
DNA
Bisulfite
0
50
100
150
200
250
32 64 128 256 512
B
u
ck
et
 s
iz
e 
(B
yt
es
) 
Sample distance 
DNA
Bisulfite
b)
Figure 3.2: Graphs showing a) the FM-index size, and b) the bucket size as functions of
the sample distance. The full Human genome (3 billion characters) is used as the text.
For standard DNA alignment this text has an alphabet size of 4, whereas for Bisulte
sequence alignment the alphabet size is 3.
44
Algorithm 2 FM-index search algorithm.
Input: string S, text T , FM-index F with sample distance d.
Output: the nal interval [low; high).
1: [low; high) [0; jT j)
2: for i jSj   1 to 0 do . update interval for each character in S
3: low  update(S[i]; F [low=d]; low)
4: high update(S[i]; F [high=d]; high)
5: if low  high then
6: return
7: end for
8: procedure update(ch; bucket; pos)
9: sel bucket:Occ[c] . select operation
10: cnt 0
11: for i 0 to i < pos mod d do . count operation
12: if ch = bucket:BWT [i] then
13: cnt cnt+ 1
14: end for
15: return sel + cnt
16: end procedure
string character in a section of the BWT. The result of the select and count operation
are summed to give the new values of low and high. If low  high after an update, then
the string does not occur in the text and the search algorithm is terminated.
3.1.2 Hardware architecture
Figure 3.3a shows the hardware architecture for the base design. This architecture
comprises three components:
1. An FPGA which is populated with exact string matching modules (ESMs). The
ESMs comprise two blocks: 1) the compute block, which performs the FM-index
search algorithm, and 2) the command block, which sends the memory commands
for the FM-index buckets to the memory controller. These two blocks are decoupled
to make it easier to avoid deadlock scenarios and race conditions between sending
the memory commands and receiving the data from memory.
2. Memory where the FM-index is stored. In this work we target o-chip memory, in
particular DRAM, as it is the most common type of memory available on FPGA
platforms that is able to store the FM-index of large texts. The next generation
45
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
FPGA 
Compute   
block 
Command             
block 
Host interface 
Memory interface 
in out 
cmds F[low/d] F[high/d] 
DRAM 
FM-index: F 
ESM 
update update 
 /  / 
d d 
low’ high’ 
low high F[low/d] F[high/d] String 
S[i] 
to stream head to command block 
1 
 + 
i 
a)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
FPGA 
Compute   
block 
Command             
block 
Host interface 
Memory interface 
in out 
cmds F[low/d] F[high/d] 
DRAM 
FM-index: F 
ESM 
update update 
 /  / 
d d 
low’ high’ 
low high F[low/d] F[high/d] String 
S[i] 
to stream head to command block 
1 
 + 
i 
b)
Figure 3.3: a) Base design architecture. b) Compute block architecture. Note that some
data and control streams have been omitted to improve readability.
of FPGAs promise larger BRAM resources (>50 MB) and gigabytes of in-package
HBM2, therefore it may be possible to leverage these faster memories in the future.
3. A host CPU which performs the software tasks that are not accelerated.
The exact string matching procedure is as follows: rst the host CPU packages the
strings into packets composed of a unique string identier, the string length, and the
string characters. In order to mini ise host-FPGA communication, the string characters
are encoded with dlog2 jT je bits instead of 8 bit ASCII values. The string packets are
then streamed to the FPGA where they are processed in parallel by the ESMs. When a
string has been processed, i.e. all the characters have been matched or low  high, the
nal interval along with the string identier is streamed back to the host CPU. The sux
array is then used to convert the intervals into positions in the text. This conversion is
performed on the host CPU, as it is a constant time operation and not computationally
intensive enough to justify ooading to the FPGA.
The compute block architecture is shown Figure 3.3b. In this architecture, each
variable in Algorithm 2 is implemented as a stream of data. Since the previous values
46
of low and high are required to compute the updated values, there is a loop-carried
dependency. This results in a cyclical data path where the updated values are passed
back for use in the next step. Cyclical streams are implemented for each loop-carried
variable, such as low and high, whereas constant variables, such as the string characters
and string length, are stored in BRAMs. The cyclical streams have three features:
1. A head, which has a multiplexer that selects either an initial value or the updated
value carried from the previous step.
2. A body, which contains the operations to compute the new value.
3. A foot, which connects the new value of the stream to the head. The loop oset
gives the number of cycles before the new value is available to be passed back to
the head.
The interval is updated by rst selecting a character from the string using a mul-
tiplexer. The string character along with the values of low, high and the FM-index
buckets are input to the update blocks, which perform the select and count operations.
The select operation is implemented using a multiplexer with the jT j Occ values as
inputs and the string character as the select line. The count operation is implemented
using the two steps shown in Figure 3.4. In step 1, the string character is compared to
the characters found at each position in the section of the BWT. If the characters match
 
 
  
 
 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
  
 
 
  
 
 
           
1 0 1 0 1          … 0 
0 1 2 3 4      d-1 
… 
size = d 
BWT[0] ch 0 pos%d 
= < 
& 
0 1 
ch BWT[d-1] d-1 pos%d 
= < 
& 
0 1 
0 1 
+ 
1 
1 1 
+ 
2 
+ 
3 
… 64 bits 64 bits 
H 
… 
H 
+ 
Binary adder tree Hamming weight 
initial value 
+ 
carried value 
loop 
offset 
head 
body 
foot 
select 
1 
a)
 
 
  
 
 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
  
 
 
  
 
 
           
1 0 1 0 1          … 0 
0 1 2 3 4      d-1 
… 
size = d 
BWT[0] ch 0 pos%d 
= < 
& 
0 1 
ch BWT[d-1] d-1 pos%d 
= < 
& 
0 1 
0 1 
+ 
1 
1 1 
+ 
2 
+ 
3 
… 64 bits 64 bits 
P 
… 
 P 
+ 
Binary adder tree Population count 
initial value 
+ 
carried value 
loop 
offset 
head 
body 
foot 
select 
1 
b)
Figure 3.4: a) step 1, and b) step 2 of the count operation.
47
and the position tested is in a valid section of the BWT, then a 1 is output. If one of
these conditions is false, then a 0 is output. Since these comparisons are independent,
they can be performed in parallel. In step 2, the outputs are summed to give the oc-
currence. This computation can be performed using a variety of methods such as an
adder tree, or a population count. In our implementations the population count method
shown in Listing 3.1 is used as it requires fewer FPGA resources. Finally, the result of
the select and count operations are summed to give the new values of low and high.
In cases where the FM-index is stored in o-chip memory, the large access latency
and low memory eciency (due to the random memory access pattern) result in the
hardware performance being memory-bound. To reduce the memory bottleneck, the
following optimisations are included in the hardware architecture:
1. Due to the loop-carried dependency in the search algorithm, the memory com-
mands for the FM-index buckets can only be issued after the interval has been
updated. This results in a large proportion of the memory bandwidth being un-
used, as for most of the time the memory controller is idle waiting for commands.
To address this problem the operations in the compute block are pipelined so that
multiple strings are processed concurrently. The pipeline depth is then customised
so that the memory commands issued by the dierent strings are able to saturate
the memory bandwidth.
Listing 3.1: Population count for a 64-bit value
const uint64_t m1 = 0x5555555555555555;
const uint64_t m2 = 0x3333333333333333;
const uint64_t m4 = 0x0f0f0f0f0f0f0f0f;
int popcount(uint64_t x)
x -= (x >> 1) & m1;
x = (x & m2) + ((x >> 2) & m2);
x = (x + (x >> 4)) & m4;
x += x >> 8;
x += x >> 16;
x += x >> 32;
return x & 0x7f;
48
2. With each step of the search algorithm the size of the interval decreases, i.e. the
values of low and high converge. After several steps it is often the case that the
values of low and high give the same FM-index bucket address, therefore only a
single bucket is required to update the interval. To reduce the amount of data
read from memory, if low=d = high=d, then the bucket at address low=d is read
from memory and is used to update both low and high. Conversely, if low=d 6=
high=d, then both buckets are read from memory as usual. This optimisation
is implemented by adding control logic to the memory command generators and
adding a multiplexer to the bucket inputs. In the best case each step in the search
algorithm only requires a single bucket, therefore the number of memory accesses
can be reduced by up to 2 times.
3.1.3 Performance model
The execution time of the hardware architecture is modelled using Equation 3.5. Tcompute
and Tmem represent the execution time, assuming the performance is either compute-
bound or memory-bound respectively. Tother accounts for the time taken to congure
the FPGA, transfer the FM-index to memory, set scalar values, etc. For large datasets,
Tother can be ignored.
T = max(Tcompute , Tmem) + Tother (3.5)
The ESMs update the interval for a single character in each clock cycle. Tcompute
is therefore given by Equation 3.6, where m is the string length, N is the number of
strings, P is the population of ESMs on the FPGA, and F is the FPGA clock frequency.
Tcompute =
m N
F  P (3.6)
Tmem is given by Equation 3.7, where Br is the total number of bytes read from
memory and BW is the achievable memory bandwidth. The memory bandwidth is
platform specic and depends on a number of factors such as the physical layout and
49
the memory access pattern. This value can be obtained from vendor documentation,
or through measurement. The total number of bytes read from memory is given by
Equation 3.8. In the worst case (Equation 3.8a), there are no instances where the values
of low and high give the same bucket address, therefore two buckets must be read from
memory in each step. Conversely, in the best case (Equation 3.8b) low and high always
give the same address, therefore one bucket must be read from memory in each step.
Tmem =
Br
BW
(3.7)
Br (worst) = 2 m N B (3.8a)
Br (best) = m N B (3.8b)
Using the performance model, the optimal FM-index structure for the target FPGA
platform can be determined. Figure 3.5 shows the estimated performance of the base
design for the Maxeler Max4 DFE. In this estimate the FPGA is populated with 6
ESMs running at 150 MHz, and the memory bandwidths for the dierent bucket sizes
are determined empirically (Figure 3.12). The plot shows that sample distances of 32
and 64 give the best performance. In both of these designs the bucket size is 64 bytes
and the achievable memory bandwidth is 18 GB/s per DFE, which is 30% of the peak.
 
 
 
0
20
40
60
80
100
120
140
160
180
200
32 64 128 256 512
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Sample distance  
Tmem (best)
Tcompute
0
20
40
60
80
100
120
140
160
180
200
32 64 128 256 512
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Sample distance 
Tmem (best) k=2
Tmem (best) k=3
Tmem (best) base
Tcompute
Figure 3.5: Base design performance estimate for the Maxeler Max4 DFE. The execution
time is calculated for dierent sample distances when matching 100 million strings of
100 characters to the Human genome.
50
3.2 Algorithmic optimisations
In this section we present three algorithmic optimisations which reduce the number of
memory accesses in the FM-index search algorithm:
 An adaptation of the k-step FM-index for FPGAs (Section 3.2.1). This variation
of the FM-index reduces the number of memory accesses by a factor of k.
 A method to reduce the number of memory accesses by storing precomputed in-
tervals in on-chip memory (Section 3.2.2).
 An oversampling technique which increases the number of opportunities to reuse
the data read from memory (Section 3.2.3).
3.2.1 k-step FM-index
The k-step FM-index [21] is a variation of the FM-index which allows the interval to
be updated for k characters in each step of the search algorithm. The benet of this
optimisation is that the number of memory accesses in the search algorithm is reduced
by a factor of k. Although more data is read per memory access, reading larger blocks
generally increases the eective memory bandwidth. Therefore, if the additional data
can be read at the same cost, the performance can improve by up to k times. For a
text T with length jT j = n and an alphabet T (excluding the $ symbol), the k-step
FM-index is constructed using the following steps:
1. Terminate T with the $ symbol and generate its sux array, A, by lexicographically
sorting all of its suxes.
2. Using Equation 3.9, create a set of k strings from the sux array A. This step
is equivalent to extracting strings from the last k columns of the BWM. The
characters from the set of strings are then concatenated using Equation 3.10. This
forms a string L with an alphabet comprising the jT jk combinations of the text
alphabet with length k.
51
  
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
$ B A N A N A 
A $ B A N A N 
A N A $ B A N 
A N A N A $ B 
B A N A N A $ 
N A $ B A N A 
N A N A $ B A 
 AA AB AN ⋯ NN 
0 0 0 0  0 
1 0 0 0  0 
2 0 0 1  0 
⋮      
6 0 0 2  0 
7 0 0 2  0 
NA AN AN $B A$ NA BA 
AA AB AN ⋯ NN 
2 2 2  7 
BANANA$ 
sorted rotations 
BWM S1 S2 Occ array 
Count array 
2-step FM-Index 
>> 
low high F[low/d] F[high/d] string characters 
ch1 
to stream head to memory controller 
 cat 
ch2 
update update 
d 
 
low’ high’ 
str 
intervals 
… 
lowI 
highI 
 / 
sel sel 
 / 
d 
sel 
1 2 
sel = step > 1 & |S|/2 == 0  
L  
L 
Figure 3.6: Constructing the 2-step FM-index of the text BANANA.
3. From L, compute the Count and Occ arrays. Finally, construct the FM-index
structure described in Section 3.1.1 by: a) sampling the Occ array by only storing
the positions which are a multiple of a sample distance, b) adding the Count array
values to the corresponding Occ values, and c) creating buckets by grouping the
jT jk values at each position in the sampled Occ array with the corresponding
section of L.
Si[j] =
8><>:
T [A[j]  i] if A[j] > i  1
T [jT j   i+A[j]] otherwise
(3.9)
L[i] = Sk[i]Sk 1[i] : : : S1[i] (3.10)
The bucket size (Bk) is given by Equation 3.11. The rst term is the jT jk Occ
values (each stored using 4 bytes), the second term is the d characters of L, and the nal
term p is optional padding. The k-step FM-index size (Ik) is given by Equation 3.12.
This is the number of buckets multiplied by the size of each bucket.
Bk = (4  jT jk) +
&
(d  dlog2(jT j+ 1)ke)
8
'
Bytes (3.11)
52
Ik =

n
d

Bk Bytes (3.12)
As is the case in the base design, the k-step FM-index can be customised according
to the amount of memory on the FPGA platform and the memory burst size:
Ik  memory capacity (Bytes) (3.13)
Bk = x memory burst size (Bytes) where x 2 Z : x > 0 (3.14)
The graphs in Figure 3.7 show an example of the FM-index size and bucket size for
the Human genome as functions of the sample distance and step size. It can be seen
that both the index size and bucket size increase exponentially with the step size. The
relationship between the memory bandwidth and bucket size is an important factor to
consider when choosing the optimal sample distance and step size. In the best case,
the increase in eective memory bandwidth from reading larger buckets will cancel out
the increase in the number of bytes read from memory. As a result, the performance
improves by k times:
Tbase =
Br
BW
; Tk step =
x Br
k x BW =
1
k
Tbase where x 2 R : x > 0
 
 
 
0
5
10
15
20
25
30
32 64 128 256 512
FM
-i
n
d
ex
 s
iz
e 
(G
B
) 
Sample distance  
1-step 2-step
3-step
0
100
200
300
400
500
600
700
800
32 64 128 256 512
B
u
ck
et
 s
iz
e 
(B
yt
es
) 
Sample distance 
1-step 2-step
3-step
a)
 
 
 
0
5
10
15
20
25
30
32 64 128 256 512
FM
-i
n
d
ex
 s
iz
e 
(G
B
) 
Sample distance  
1-step 2-step
3-step
0
100
200
300
400
500
600
700
800
32 64 128 256 512
B
u
ck
et
 s
iz
e 
(B
yt
e
s)
 
Sample distance 
1-step 2-step
3-step
b)
Figure 3.7: Graphs showing a) the k-step FM-index size and b) the bucket size as
a function of the sample distance and step size. The full Human genome (3 billion
characters) is used as the text.
53
The search algorithm for the k-step FM-index is shown in Algorithm 2. In each
step k characters are selected from the string and their concatenated value is used in
the update procedure. Aside from the increased alphabet size, the select and count
operations are identical to that in Algorithm 2. An edge case occurs when the number
of characters in the string is not a multiple of k. In [21], this edge case is handled in the
nal step of the search algorithm by modifying the update procedure to only consider the
remaining characters. To avoid signicant amounts of control logic in the select and count
operations, we propose a new solution where the interval is updated for the remainder
characters in the rst step of the search algorithm using precomputed intervals. In
this approach, the intervals for all the combinations of T with lengths 1 to k   1 are
precomputed. If the string length is not a multiple of k, then the interval is rst updated
for the remainder characters using the precomputed intervals. For example, if k = 2
then the intervals for the all the combinations of T with length 1 are precomputed.
If the string length is not divisible by 2, then the interval is rst updated for the last
character in the string using the precomputed intervals. The interval is then updated
for the remaining characters using the k-step FM-index.
Algorithm 3 k-step FM-index search algorithm.
Input: string S, text T , k-step FM-index F with sample distance d.
Output: the nal interval [low, high).
1: [low; high) [0; jT j)
2: for i jSj   1 to 0 step -k do . assuming S is multiple of k
3: str  Si k : : : Si . concatenate k characters
4: low  update(str; F [low=d]; low)
5: high update(str; F [high=d]; high)
6: if low  high then
7: return
8: end for
9: procedure update(str; bucket; pos)
10: sel bucket:Occ[str] . select operation
11: cnt 0
12: for i 0 to i < posmod d do . count operation
13: if str = bucket:L[i] then
14: cnt cnt+ 1
15: end for
16: return sel + cnt
17: end procedure
54
3.2.2 Precomputed intervals
To reduce the number of memory accesses in the search algorithm, the intervals for
all the combinations of the text alphabet with length b can be stored. This allows
the rst b steps of the search algorithm to be performed using a single access to the
precomputed intervals, rather than 2  b accesses to the FM-index (Figure 3.8). By
adjusting the value of b, the precomputed intervals can be made to t in the CPU
cache, or in BRAMs on the FPGA. These memories provide a much lower access latency
than DRAM and can therefore greatly improve the performance of the search algorithm.
Assuming the values of low and high are stored as 32 bit unsigned integers, the size of the
precomputed intervals is 8  jT jb bytes, where T is the text alphabet. Current FPGAs
have enough BRAM blocks for a few Megabytes of storage, therefore b is typically less
than 10 characters for DNA alphabets. Since values can be read from BRAMs every
cycle, this optimisation is particularly eective when matching short strings as a large
proportion of the search algorithm steps can be performed in a single cycle. An example
use-case is the BFAST alignment tool [51] which maps seeds of 22 characters to the
reference genome. For strings where the number of characters is much greater than b,
this optimisation has less of an impact on performance and should be omitted if BRAMs
are a critical resource.
 
 Combinations low high 
0 AAAAAAA 1 13 
1 AAAAAAC 14 29 
2 AAAAAAG 30 54 
3 AAAAAAT 55 89 
4 AAAAACA 90 112 
⋮    
16383 TTTTTTT 328222 328250 
GATTAG AAAAAAA 
Initialise interval using  
precomputed values 
Update interval using 
FM-index 
[1, 13) 
Figure 3.8: Precomputed intervals optimisation for S = fA, C, G, Tg and b = 7.
55
Algorithm 4 FM-index search algorithm with precomputed intervals.
Input: string S, text T , FM-index F with sample distance d, array of precomputed intervals I
for b characters.
Output: the nal interval [low; high).
1: addr  SjSj 1 b : : : SjSj 1 . get address from last b characters in S
2: low  I[addr]:low . initialise low using precomputed interval
3: high I[addr]:high . initialise high using precomputed interval
4: if low  high then
5: return
6: for i jSj   1  b to 0 do . update interval for remaining characters using F
7: low  update(S[i]; F [low=d]; low)
8: high update(S[i]; F [high=d]; high)
9: if low  high then
10: return
11: end for
Algorithm 4 shows the search algorithm for this optimisation. Here, the values of
low and high are initialised to the precomputed interval at the address given by the last
b characters of the string. If low > high, then the string does not occur in the text and
the search algorithm is terminated, otherwise the interval is updated for the remaining
characters using the FM-index.
3.2.3 Oversampled Index
As discussed in Section 3.1.2, after several steps of the search algorithm the values of
low and high often give the same FM-index bucket address. In some cases, however,
the values of low and high are within d positions but they refer to adjacent buckets. To
prevent such occurrences, the index can be oversampled by a factor f . In this technique
the FM-index buckets are created by sampling the Occ array every d=f positions and
grouping the values with the corresponding d characters of the BWT (Figure 3.9). Now,
if the condition high  low < d=f is true, then low and high will give the same bucket
address for the remaining steps in the search algorithm. The cost of this optimisation is
that the FM-index size increases by f times (Equation 3.15), therefore a larger sample
distance may be required in order to t in the memory available on the FPGA platform.
Io = f 

n
d

B Bytes (3.15)
56
  
 
 
 
  
 
 
 
 
 
  
 
A B N 
1 4 5 
 BWT A B N  
0 A 0 0 0 
1 N 1 0 0 
2 N 1 0 1 
3 B 1 0 2 
4 $ 1 1 2 
5 A 1 1 2 
6 A 2 1 2 
7  3 1 2 
A 1 
B 4 
N 5 
A N N B 
A 2 
B 4 
N 6 
N B $ A      
Count array 
Occ array 
Occ array 
Occ vals 
Occ vals 
BWT 
BWT 
Bucket 0 
Bucket 1 
⋮ 
Figure 3.9: oversampled FM-index structure for the text BANANA. In this example the
sample distance is 4 and the oversample factor is 2.
The search algorithm for the oversampled FM-index is shown in Algorithm 5. In each
step, the interval is updated for a single character in the string using the FM-index. If
high  low < d=f , then the bucket at address low=(d=f) is read from memory and is
used to update both low and high. Conversely, if high  low  d=f then both buckets
are read from memory as usual.
Algorithm 5 Oversampled FM-index search algorithm.
Input: string S, text T , FM-index F with sample distance d and oversampling factor f .
Output: the nal interval [low; high).
1: [low; high) [0; jT j)
2: for i jSj   1 to 0 do
3: lpos low   ((low=(d=f))  (d=f))
4: low  update(S[i]; F [low=(d=f)]; lpos)
5: if high  low < d=f then . low and high give same bucket address
6: hpos high  low + lpos
7: high update(S[i]; F [low=(d=f)]; hpos)
8: else
9: hpos high  ((high=(d=f))  (d=f))
10: high update(S[i]; F [high=(d=f)]; hpos)
11: if low  high then
12: return
13: end for
57
3.2.4 Optimisation procedure
To determine which algorithmic optimisations to apply to the base design, the decision
tree shown in Figure 3.10 should be followed. The rst test condition is whether the
strings matched are xed to a given length. For instance, in the BFAST alignment tool
the short reads are split into seeds of exactly 22 characters which are then mapped to
the reference genome. In this case, both the k-step FM-index and precomputed intervals
optimisation can be applied to the base design. Since the interval is rst updated for b
characters using the precomputed intervals, the remaining m   b characters must be a
multiple of the step size k to ensure that all the characters in the string are processed.
If the string length is variable, which is true for most end-to-end alignment algorithms,
then only one of the k-step FM-index and precomputed intervals optimisation can be
applied. This is because there is no longer any guarantee that m   b is a multiple of
Figure 3.10: Optimisation procedure for the base design. Here, m is the string length, k is
the step size for the k-step FM-index optimisation and b is the length of the combinations
of jT j for the precomputed intervals optimisation.
58
k. In order to determine which of these optimisations to apply, the number of memory
accesses in the search algorithm can be used to model the performance improvement.
For the k-step FM-index there are k times fewer memory accesses (m=k), whereas for
the precomputed intervals optimisation there are b fewer (m  b). The nal step in the
decision tree is applying the oversampled index optimisation. This optimisation should
always be applied provided there is sucient memory available on the FPGA platform.
Figure 3.11a shows the estimated performance improvement of the k-step FM-index
optimisation for the Maxeler Max4 DFE. In this estimate the FPGA is populated with
6 ESMs running at 150 MHz, and the memory bandwidths for the dierent bucket sizes
are determined empirically (Figure 3.12). The plot shows that the best performing
design has a step size of 2 and sample distance of 64. This design is 2 times faster
than the base design, which is the peak improvement for a step size of 2. Figure 3.11b
compares the performance improvements for the k-step FM-index and precomputed
intervals optimisations. In this estimate, the combination length for the precomputed
intervals is set to 10. This requires that the FPGA has enough BRAMs to store at
least 8 MB of data. The plot shows that for strings with less than 20 characters the
precomputed intervals optimisation gives the best performance, whereas for strings with
greater than 20 characters the k-step FM-index does.
 
 
 
0
5
10
15
20
25
30
35
40
20 40 60 80 100
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
String length  
Tmem (best) k=2
Tmem (best) b=10
Tmem (best) base
Tcompute
0
20
40
60
80
100
120
140
160
180
200
32 64 128 256 512
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Sample distance 
Tmem (best) k=2
Tmem (best) k=3
Tmem (best) base
Tcompute
a)
 
 
 
0
5
10
15
20
25
30
35
40
20 40 60 80 100
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
String length  
Tmem (best) k=2
Tmem (best) b=10
Tmem (best) base
Tcompute
0
20
40
60
80
100
120
140
160
180
200
32 64 128 256 512
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Sample distance 
Tmem (best) k=2
Tmem (best) k=3
Tmem (best) base
Tcompute
b)
Figure 3.11: a) Performance estimate for k-step FM-index on the Maxeler Max4 DFE.
b) Performance comparison of the k-step FM-index and precomputed intervals optimi-
sations. In both graphs the execution time is calculated for matching 100 million strings
of 100 characters to the Human genome.
59
3.3 Evaluation
In this section we implement several exact string matching designs on the Maxeler Max4
platform and evaluate their performance. In Section 3.3.1, we describe the experimental
setup such as the designs evaluated and the systems used. In Sections 3.3.2 to 3.3.5
we evaluate the base design performance and measure the improvement from each algo-
rithmic optimisation. Finally, in Section 3.3.6 we evaluate the best performing design
against state-of-the-art CPU, GPU, and FPGA-based short read alignment tools.
3.3.1 Experimental setup
The FM-index designs listed in Table 3.1 are implemented on a Maxeler MPC-X2000
system which hosts 8 Max4 DFEs. Each DFE contains an Altera Stratix V FPGA
(28nm feature size) and 48GB of DRAM. The multi-channel DRAM mode is used, which
allows up to 6 independent memory channels per DFE. This mode places the following
restrictions on the FM-index: 1) the FM-index size must be less than 8GB in order to t
on the DIMMs without splitting the text, and 2) the bucket size should be a multiple of
64 bytes so that the buckets are memory aligned. Figure 3.12 shows the memory channel
bandwidth for the Max4 DFE as a function of the access size and memory footprint.
Two key results can be read from this plot:
1. The memory bandwidth increases for larger memory footprints. For instance, the
memory bandwidth for reading a single burst increases by 50% when the memory
footprint is doubled from 1GB to 2GB. This result is most likely due to increased
Index size Bucket size
Design Description d k b f (GB) (Bytes)
D1 Base design 64 1 0 1 2.6 64
D2 2-step FM-index 64 2 0 1 5.2 128
D3 Precomputed intervals 64 1 7 1 2.6 64
D4 Oversampled FM-index 64 1 0 2 5.2 64
Table 3.1: The designs evaluated and their FM-index parameters. Here, d is the sample
distance, k is the step size, b is the length of the combinations of jT j for the precomputed
intervals, and f is the oversampling factor.
60
bank-level parallelism, which allows memory commands to dierent banks on the
DIMM to be serviced in parallel.
2. Reading larger blocks of adjacent bursts is more ecient. For instance, the memory
bandwidth when reading a single burst is 3.1 GB/s per channel. When reading
two bursts, the memory bandwidth increases to 6.2 GB/s, therefore the additional
burst can be read for free. The cost of reading larger blocks gradually increases as
the memory bandwidth tends to the peak.
In each design evaluated, the FPGA is populated with 6 ESMs which are connected to
their own memory controllers. The ESMs run at 150MHz, whilst the memory controllers
run at the maximum clock frequency of 800MHz. Given that the performance of the
designs is memory-bound, increasing the clock frequency of the ESMs does not improve
the performance any further. Table 3.2 shows the FPGA resource usage for the designs.
It can be seen that approximately 35% of the FPGA resources are used in all six designs.
Although a higher population of ESMs is achievable, each memory channel is saturated
with just one ESM, therefore the performance would not increase any further. If the
FPGA platform was able to support any number of memory controllers, we estimate
 
 
 
 
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8
M
em
o
ry
 B
an
d
w
id
th
 (
G
B
/s
) 
Memory Footprint (GB) 
1 bpc (64 bytes) 2 bpc (128 bytes)
3 bpc (192 bytes) 4 bpc (256 bytes)
8 bpc (512 bytes) Peak bandwidth
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8
M
ax
 M
em
o
ry
 B
an
d
w
id
th
 (
G
B
/s
) 
Bursts Per Command  
Figure 3.12: Memory channel bandwidth for the Max4 DFE. In this plot the memory
bandwidth is shown as a function of the memory footprint and the number of bursts
read per memory command (bpc).
61
Design LUTs FFs BRAMS
name (K) (K)
D1 188 (35.8%) 172 (32.8%) 903 (35.2%)
D2 196 (37.3%) 182 (34.7%) 951 (37.1%)
D3 190 (36.2%) 173 (32.9%) 1300 (50.6%)
D4 188 (35.8%) 173 (32.9%) 903 (35.2%)
Table 3.2: Resource usage for each FM-index design when implemented on an Altera
Stratix V GS 5SGSD8 FPGA.
that up to 16 ESMs would be able to populate the FPGA. This would increase the
performance by 2.7 times given that it scales linearly with the number of ESMs.
The performance of each design is evaluated against an equivalent software version
with the same FM-index structure and algorithmic optimisations. The software versions
are written in C++11 and compiled with g++ version 4.9 with the optimisation ag -O3.
OpenMP is used to parallelise the code so that the strings are matched with multiple
threads. The software versions are run with 12 threads on the MPC-X2000 host machine.
This system has dual Intel Xeon E5-2640 CPUs (32nm feature size), 64 GB of RAM and
a single Crucial MX200 SSD with read and write speeds of 500MB/s.
In each experiment the full Human genome build 38 [52] (3.1G characters) is used
as the text. Any `N' characters are removed from the text so that its alphabet is
fA, C, G, Tg. In all of our experiments the string datasets are generated by extracting
xed-length substrings at random positions in the text. This minimises any bias in the
performance from matching strings with large runs of repeated characters. Furthermore,
since all the characters in the strings will match to the text, the performance of each
design can be accurately veried using the performance model dened in Section 3.1.3.
In Sections 3.3.2 to 3.3.5 the execution times reported only consider the raw string
matching time. This does not include the time for overheads such as le I/O, conguring
the FPGA and transferring the FM-index to DRAM. When evaluating against other
short read alignment tools (Section 3.3.6), the end-to-end execution time is reported in
order to ensure a fair comparison. In all of our experiments the execution times are
measured using the Linux time command and are averaged over 5 runs.
62
3.3.2 Base design - D1
The FM-index for D1 is constructed with a sample distance of 64. This was predicted to
be the optimal FM-index structure according to the performance estimates in Figure 3.5.
Each bucket is padded with 24 bytes so that the bucket size is a multiple of the memory
burst size. This gives an FM-index size of 2.6 GB and a bucket size of 64 bytes. The
achieved memory bandwidth for this design is 16.8 GB/s per DFE, which is 28% of
the peak. The graph in Figure 3.13 compares the performance of D1 and the software
version. Also shown are the best and worst case performances, which are calculated
using the performance model dened in Section 3.1.3. The following results can be read
from this plot:
 The performance of D1 and the software version scale linearly with the string
length. This is expected as the FM-index search algorithm is linear in time with
the string length.
 D1 is on average 4.8 times faster than the software version. The reasons for this
speed-up are twofold: 1) the CPU cache has a negligible number of hits, which
exposes the large DRAM access latency (this is why the hardware architecture does
not contain a cache); and 2) the memory architecture of the Maxeler Platform
enables a higher eective memory bandwidth than the host system (16.8 GB/s
compared to < 4 GB/s for the host system).
 
 
 
 
 
0
5
10
15
20
25
30
20 40 60 80 100 120
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
String length 
software version
D1 (best case estimate)
D1 (worst case estimate)
D1 (1 DFE)
4
4.2
4.4
4.6
4.8
5
1 2 3 4 5 6 7 8
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Number of DFEs 
Figure 3.13: The performance of D1 and the software version are measured for exact
matching 10 million strings with varying lengths.
63
 The performance of D1 is close to the best-case performance estimate. This indi-
cates that for the majority of the search algorithm steps a single bucket is accessed.
To evaluate the scalability of D1, the performance is measured for a xed workload
and an increasing number of DFEs (strong-scaling), and a xed workload per DFE and
an increasing number of DFEs (weak-scaling). The graph in Figure 3.14a evaluates the
strong-scaling of D1. The plot shows a near linear speed-up with the number of DFEs,
which is expected as the strings can be matched independently. If the number of DFEs
continues increasing, the execution time would asymptote to 2.1 seconds. This is the
time taken to stream the data to and from the DFE via inniband. The strong-scaling
eciency (as a percentage of linear) is:
eciency =
T (1 DFE)
8  T (8 DFE)  100% =
46
8  6:2  100% = 92:7%
The graph in Figure 3.14b evaluates the weak-scaling of D1. The plot shows an
almost constant execution time, which again is expected as the strings can be matched
independently. The weak-scaling eciency (as a percentage of linear) is:
eciency =
T (1 DFE)
T (8 DFE)
 100% = 4:6
4:8
 100% = 95:8%
 
 
 
 
0
5
10
15
20
25
30
35
40
45
50
1 2 3 4 5 6 7 8
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Number of DFEs 
2.3 
4.6 
11.8 
0
2
4
6
8
10
12
14
D2 (1 DFE) D1 (1 DFE) software version
Ex
ec
u
ti
o
n
 t
im
e 
(s
) 
a)
 
 
 
 
 
0
5
10
15
20
25
30
20 40 60 80 100 120
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
String length 
software version
D1 (best case estimate)
D1 (worst case estimate)
D1 (1 DFE)
4
4.2
4.4
4.6
4.8
5
1 2 3 4 5 6 7 8
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Number of DFEs 
b)
Figure 3.14: a) The performance of D1 is measured for exact matching 100 million
strings of 100 characters with a varying the number of DFEs. b) The performance of
D is measured for exact matching 10 million strings of 100 characters per DFE.
64
3.3.3 k-step FM-index - D2
In D2, the k-step FM-index is evaluated for a step size of 2 and a sample distance of 64.
This was predicted to be the optimal FM-index structure according to the performance
estimates in Figure 3.11a. Each bucket is padded with 24 bytes so that the bucket size is
a multiple of the memory burst size. This gives an FM-index size of 5.2GB and a bucket
size of 128 bytes. The achieved memory bandwidth for this design is 36GB/s per DFE,
which is 60% of the peak. Although it is possible to use a step size of 3, our performance
estimates show that the larger bucket sizes required result in a lower performance. If
only the single channel mode is supported by the DFE (the DRAM is accessed using a
single channel with a burst size of 384 bytes), then a step size of 3 and a sample distance
of 32 would give the best performance.
The graph in Figure 3.15 compares the performance of D2, the software version and
D1. The plot shows that D2 is 5.1 times faster than the software version and 2.0 times
faster than D1. The speed-up of 2 times relative to D1 is due to there being 2 times
fewer memory accesses in the search algorithm of D2. Although the bucket size for D2 is
2 times larger than that of D1, the additional burst can be read for free as the memory
bandwidth is approximately 2 times larger:
TD1 =
Br
BW
; TD2 =
2 Br
2  2 BW =
1
2
TD1
 
 
 
 
0
5
10
15
20
25
30
35
40
45
50
1 2 3 4 5 6 7 8
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Number of DFEs 
2.3 
4.6 
11.8 
0
2
4
6
8
10
12
14
D2 (1 DFE) D1 (1 DFE) software version
Ex
ec
u
ti
o
n
 t
im
e 
(s
) 
Figure 3.15: The performance of D2, the software version and D1 are measured for exact
matching 10 million strings of 100 characters.
65
3.3.4 Precomputed intervals - D3
D3 uses the same FM-index structure as D1, however the precomputed intervals for
combination lengths of 7 are stored on-chip in BRAMs. In our implementation each
ESM has its own copy of the precomputed intervals, therefore the total number of bytes
to store is 6  8  47 = 0:79MB. We found that designs with larger combination lengths
failed to be successfully placed on the FPGA as the number of BRAMs required exceeded
the total number available on the Stratix V. Given that more recent FPGAs, such as
the Xilinx Ultrascale plus, have approximately an order of magnitude more BRAMs, we
estimate that the value of b can be increased to around 10 for the latest FPGA platforms.
The graph in Figure 3.16a compares the performance of D3, the software version and
D1. The plot shows that D3 is 5.0 times faster than the software version and 1.1 times
faster than D1. The speed-up over D1 is greatest for short string lengths, as a larger
proportion of the search steps are performed using BRAMs (Figure 3.16b). For instance,
a speed-up of 1.4 times is measured for strings of 20 characters, however this drops to
less than 1.1 times for strings with 120 characters. As discussed in Section 3.2.2, this
optimisation should be applied when matching very short strings such as those in seed-
and-compare alignment algorithms. When matching larger strings the k-step FM-index
should be applied instead since it gives a larger performance improvement.
 
 
 
 
4.2 4.6 
20.8 
0
5
10
15
20
25
D3 (1 DFE) D2 (1 DFE) software version
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
1.2
1.3
1.4
1.5
1.6
1.7
1.8
20 40 60 80 100 120
Sp
ee
d
u
p
  
String length 
a)
 
 
 
 
4.2 4.6 
20.8 
0
5
10
15
20
25
D3 (1 DFE) D2 (1 DFE) software version
Ex
ec
u
ti
o
n
 t
im
e 
(s
) 
1.0
1.1
1.2
1.3
1.4
1.5
20 40 60 80 100 120
Sp
ee
d
u
p
  
String length 
b)
Figure 3.16: a) The performance of D3 the software version and D1 is measured for
exact matching 10 million strings of 100 characters. b) The speed-up of D3 relative to
D1 whilst varying the string length.
66
3.3.5 Oversampled index
In D4, the FM-index is constructed with a sample distance of 64 and an oversampling
factor of 2. Each bucket is padded with 24 bytes so that the bucket size is a multiple of
the memory burst size. This gives an FM-index size of 5.2 GB (double that of D1 due
to the oversampling factor of 2) and a bucket size of 64 bytes. The achieved memory
bandwidth for this design is 18.6 GB/s per DFE, which is 31% of the peak. In our
testing we found that further increasing the oversampling factor had a negligible impact
on performance and simply made it more dicult to t the FM-index in DRAM. For
instance, an oversampling factor of 3 increases the FM-index size to 7.8GB and we found
there was no discernible performance improvement when compared to an oversampling
factor of 2.
The graph in Figure 3.17 compares the performance of D4, the software version and
D1. The plot shows that D4 is 4.5 times faster than the software version and 1.2 times
faster than D1. There are two main reasons for the speed-up over D1: 1) the achieved
memory bandwidth for D4 is approximately 11% larger than that of D1 due to the
increase in the memory footprint of the FM-index; and 2) there are more steps where
the values of low and high give the same FM-index bucket address, therefore there are
fewer memory accesses in the search algorithm. For example, the number of memory
accesses decreases by 8% for strings with 100 characters.
 
4 
4.6 
18 
0
2
4
6
8
10
12
14
16
18
20
D4 (1 DFE) D1 (1 DFE) software version
Ex
ec
u
ti
o
n
 t
im
e 
(s
) 
Figure 3.17: The performance of D4, the software version and D1 are measured for exact
matching 10 million strings of 100 characters.
67
3.3.6 Short read alignment
In this section we evaluate our best performing exact match design (D2) against the short
read alignment tools listed in Table 3.3. This design does not include the precomputed
intervals or oversampled FM-index optimisations as: a) the performance impact of the
precomputed intervals optimisation is negligible as reads of 100 characters are matched;
and b) an oversampling factor of 2 would increase the FM-index size to 10.4 GB, therefore
the FM-index would not t on the 8GB DIMMs installed on the Max4 DFE.
Each CPU-based alignment tool is run with 12 threads on the MPC-X2000 host
system and the command line options are set such that no mismatches or gaps are
permitted, i.e. exact matches only. The GPU-based alignment tool, Soap3-dp, is run on
a system with an NVIDIA K40 GPU (28nm feature size) which has 12GB of GDDR5
memory. The execution time of the FPGA-based tool FHAST [20] is taken directly from
the publication. The value reported is for FHAST running on a single Xilinx Virtex 5
FPGA in the Convey Computer HC-1. Since a dierent dataset is used, we calculate the
characters matched per second (CMPS) in order to normalise the performance. CMPS
is dened in Equation 3.16. Here, N is the number of strings, m is the string length,
and T is the execution time.
CMPS =
N m
T
(3.16)
To ensure that our comparison to the software tools is fair, we match both the read
and its reverse complement to the Human genome and output a random alignment
location in the SAM format. In this experiment 100 million substrings of 100 characters
Tool Device Version Options Notes
BWA-aln [42] CPU 0.7.5 -n 0 -t 12 BWA samse is included.
Bowtie2 [16] CPU 2.1.0 -p 12 {very-fast Bowtie2 is run in end-to-end mode.
Soap2 [23] CPU 2.21 -M 0 -v 0 -p 12 {
Soap3-dp [15] GPU 2.3.172 -s 0 Soap3-dp uses CUDA 4.2.
FHAST [20] FPGA { { FHAST is run on the Convey HC-1.
Table 3.3: The CPU, GPU, and FPGA-based alignment tools evaluated against.
68
are extracted from random positions in the Human genome. The string dataset is in the
FASTQ format and has a size of 24 GB. Both D2 and the software version interleave the
le I/O and string matching using a batch system. Here, a single thread is responsible
for reading the next batch of strings from the le, while other threads (or the FPGA)
align the current batch of strings concurrently.
Table 3.4 compares the end-to-end performance of D2, the software version and the
alignment tools listed in Table 3.3. Three key results can be read from this table:
1. D2 running on a single DFE is 8.8 times faster than the best performing CPU-based
tool Soap2 and is 4.9 times faster than the GPU based tool Soap3-dp. The software
version of D2 is 5.2 and 2.9 times faster than Soap2 and Soap3-dp respectively. The
low performance of Soap3-dp relative to the software version is due to: a) the string
dataset and results must be transferred to and from GPU memory via PCIe. Since
a non-streaming interface is used (i.e. there is no interleaving between computation
and communication), this overhead is directly exposed in the execution time (20%
of the total); and b) Soap3-dp uses the base FM-index, with a sample distance
of 256, and does not include the k-step FM-index optimisation. As a result, it
reads approximately 4 times more data than the software version during the search
algorithm. Although the GPU has a high peak memory bandwidth of 288 GB/s,
the random memory access pattern of the search algorithm results in only a fraction
of this value being achievable (8-14% according to the NVIDIA visual proler).
Tool Feature Strings String Execution CMPS Speed-up
size (nm) (M) length time (s) (M)
BWA-aln 32 100 100 2165 4.6 1.0
Bowtie2 32 100 100 1826 5.5 1.2
Soap2 32 100 100 1538 6.5 1.4
Soap3-dp 28 100 100 865 11.6 2.5
FHAST 65 18 101 55 33.1 7.2
D2 (software version) 32 100 100 295 33.9 7.3
D2 (1 DFE) 28 100 100 175 57.1 12.4
Table 3.4: The performance of D2 and the short read alignment tools listed in Table 3.3
is measured for exact matching 100 million strings of 100 characters.
69
2. D2 is only 1.7 times faster than the software version. This result is due to the le
I/O amounting to signicant proportion of the execution time (50%), therefore
the speed-up is limited according to Amdahl's Law. In order to achieve a greater
speed-up, the le I/O time can be minimised by: a) modifying the software to read
and write compressed data, and b) reducing the disk access time by using faster
disks and RAID.
3. D2 running on a single DFE is at least 1.7 times faster than the FPGA-based
tool FHAST. This is the worst case speed-up as the dataset evaluated in [20] may
not contain only substrings of the text, and the reverse complements may not be
matched. Given that the Convey HC-1 has a peak memory bandwidth of 80 GB/s
(2.2 times higher than the Max4 DFE), it is clear that FHAST is only able to
achieve a fraction of this value in their implementation.
Table 3.5 shows the energy consumption of D2, the software version and the align-
ment tools listed in Table 3.3. The device power of the CPU and GPU are their TDP
values taken from the product documentation, whilst the device power of the DFE is
measured from the MaxOS operating system using the maxtop command. The power
measurement from maxtop was sampled every second for the duration of the hardware
execution time and averaged. When the DFE was idle, the device power was measured
as 18W. The energy consumption was calculated by multiplying the device power with
the execution time. This can be considered an upper-bound estimate as it assumes that
Tool Feature Device power Execution Energy consumption
size (nm) (W) time (s) (KJ)
BWA-aln 32 190 1969 374.1
Bowtie2 32 190 1826 350.0
Soap2 32 190 1538 292.2
Soap3-dp 28 235 826 203.3
D2 (software version) 32 190 295 56.0
D2 (1 DFE) 28 46 175 8.1
Table 3.5: Device power and energy consumption. Note that the GPU and DFE power
consumption is for the entire board (including memory), whereas the CPU power is for
the device only.
70
the given device was utilised for 100% of the execution time. Furthermore, it does not
include the energy consumed by the host system (in the case of the GPU and DFE) and
other components like main memory and the SSD. The results from Table 3.5 indicate
that the D2 running on a single DFE consumes over 6.9 times less energy than the soft-
ware tools tested. This result is due to the low operational clock frequency of the FPGA
coupled with the shorter execution time.
3.4 Summary
In this chapter we accelerate the FM-index with FPGAs. Our main contribution is devel-
oping three novel methods for reducing the memory bottleneck of the search algorithm
on FPGAs:
1. The FM-index structure is customised according to the amount of memory on
the FPGA platform and the memory burst size. This allows the FM-index to
t in memory without splitting the text and better utilises the available memory
bandwidth.
2. The hardware architecture is optimised so that the available memory bandwidth
is saturated and the data read from memory can be reused.
3. Algorithmic optimisations which reduce the number of memory accesses in the
search algorithm are developed. For example, the k-step FM-index can reduce the
number of memory access by over 2 times.
We implement several exact string matching designs on the Maxeler Max4 platform
and show that the best performing design running on a single DFE is 8.8 times faster
than the CPU-based tool Soap2, 5.0 times faster than the GPU-based tool Soap3-dp,
and at least 1.7 times faster than the FPGA-based tool FHAST. In addition to this, the
design consumes over 6.9 times less energy than the alignment tools evaluated.
In the following chapter we explore how our FM-index design can be extended with
backtracking in order to support approximate string matching.
71
Chapter 4
Approximate string matching
Approximate string matching is used in DNA sequencing analysis to identify genetic
variations such as single nucleotide polymorphisms (SNPs) and gaps. Identifying these
variations is crucial to understanding the origin of genetic diseases and providing eective
diagnosis. SNPs are positions in the DNA molecule where one nucleotide is replaced by
another (a mismatch), and gaps are regions where a section of DNA has been added to (an
insertion), or removed from (a deletion). In previous eorts approximate string matching
is supported by extending the FM-index search algorithm with a greedy search [19,
25]. These approaches achieve high performance through minimising the search space,
however the sensitivity (the percentage of strings successfully matched) is lower than
state-of-the-art software alignment tools which use backtracking [23] and index-assisted
dynamic programming [16]. In this chapter we address challenge C2 by presenting a new
approach for accelerating approximate string matching with FPGAs which can achieve
both high sensitivity and high performance. The novel aspects of our design are:
1. Specialised modules based on the backtracking FM-index are used to match strings
with a specic edit type and distance. For example, one mismatch, two mismatches,
one deletion, etc. Each type of module uses an exhaustive search to nd all the
potential matching locations and includes a custom set of search phases which
prune the search space by constraining the edit positions.
72
2. A seed-and-compare optimisation which improves the performance when only mis-
matches are permitted. This optimisation can speed-up the identication of SNPs
associated with complex diseases such a heart disease, diabetes and cancer.
3. A run-time recongurable architecture is used to map the specialised modules onto
the FPGA. This architecture can provide better resource utilisation, performance,
and design modularity than a static architecture.
The rest of this chapter is organised as follows: in Section 4.1, we present a design
for accelerating the backtracking FM-index with FPGAs; in Section 4.2 we explore the
dierent ways in which the specialised modules can be mapped onto FPGAs; and in
Section 4.3 we evaluate the performance of our approximate match designs and compare
them to state-of-the-art CPU, GPU, and FPGA-based short read alignment tools.
4.1 Backtracking design
In this section we present the rst design for accelerating the backtracking FM-index
with FPGAs. The key features of this design are:
 A method to prune the large search space for an exhaustive search (Section 4.1.1).
In this method, the edit positions are constrained so that long sections of the string
can be exact matched before edit operations are performed.
 A hardware architecture which performs the backtracking FM-index (Section 4.1.2).
In this architecture the FPGA is populated with specialised modules which match
strings to a text with a specic edit type and distance. Each type of module
is designed with a custom set of search phases which prune the search space by
constraining the edit positions.
 A seed-and-compare optimisation which improves the performance when only mis-
matches are permitted. This optimisation allows the majority of strings to be
processed using an exact search rather than with backtracking (Section 4.1.3).
73
4.1.1 Backtracking strategy
The backtracking FM-index algorithm can be represented as the traversal of a search
tree. For a string S with length jSj = m, the search tree has a depth of m. Each node
in the tree has T children, where the edges give the characters in the text alphabet.
There are two common methods for traversing the search tree (Figure 4.1):
Greedy search. In a greedy search, the search tree is traversed by following the path
from the root node with edges labelled with the string characters. If a character
cannot be matched, an edit operation is performed. For example, a mismatch
would involve visiting a child node where the edge is not labelled with the string
character. When the path cannot give a valid solution, i.e. the current character
cannot be matched and no more edit operations are permitted, it backtracks to the
previous edit position. The advantage of this method is that the search space is
small, as edit operations are only performed when a character cannot be matched.
The trade-o, however, is that some approximate matches may not be discovered,
resulting in a low sensitivity.
Exhaustive search. In an exhaustive search, the search tree is traversed by following
all the valid paths from each node. When the current character cannot be matched
and no more edit operations are permitted, the path backtracks to the parent node.
The advantage of this method is that the entire valid search space is explored,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
NN→NA NN→AN 
A 
B 
N 
A B N A B N A B N 
A 
B 
N 
A B N A B N A B N 
NN→NA 
phase 1 E 
|S|/2 
phase 2 E 
|S|/2 
|S|/3 
2 × E phase 1 
2 × E 
|S|/3 
phase 2 
E E 
|S|/3 2|S|/3 
phase 3 
NN→BN 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
NN→NA NN→AN 
A 
B 
N 
A B N A B N A B N 
A 
B 
N 
A B N A B N A B N 
NN→NA 
phase 1 E 
|S|/2 
phase 2 E 
|S|/2 
|S|/3 
2 × E phase 1 
2 × E 
|S|/3 
phase 2 
E E 
|S|/3 2|S|/3 
phase 3 
NN→BN 
Figure 4.1: Approximate matching the string NN to the text BANANA with up to one
mismatch. The search tree is traversed using a greedy search (left), and an exhaustive
search (right). The dashed lines indicate valid paths.
74
therefore all the possible matches within the permitted edit distance are discovered.
The trade-o, however, is that the search space can be very large when multiple
edits are permitted. For instance, in the worst case
m+1T  1
T 1 nodes must be visited.
One of the key dierences between this work and previous eorts is that an exhaustive
search is used rather than a greedy search. This enables a sensitivity comparable to state-
of-the-art software alignment tools. To prune the large search space associated with an
exhaustive search the edit positions are constrained [33]. For e permitted edits, the
string is matched using e+ 1 search phases. In each search phase the edit positions are
constrained so that 1e+1 of the string is exact matched before the edit operations are
performed. The benets of constraining the edit position are twofold: 1) edit operations
are performed at a lower depth in the search tree, therefore the branches at higher
depths are pruned; and 2) There are fewer valid paths at lower depths in the search tree,
therefore less time is spent exploring paths which do not have a solution.
Figure 4.2 shows the search phases for approximate matching strings with one edit.
The edit is constrained to either the rst half of the string (phase 1), or the second
half of the string (phase 2). In phase 1, the second half of the string is exact matched
using a backward search. The search is then extended to the rst half of the string
where an edit is considered at each position. In phase 2, the rst half of the string is
exact matched using a forward search. The search is then extended to the second half
of the string where an edit is considered at each position. The search phases require
the FM-index search algorithm is performed in both a forward and backward direction.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Solution 2 = NN→NA Solution 1 = NN→AN 
A 
B 
N 
A B N A B N A B N 
A 
B 
N 
A B N A B N A B N 
Solution 1 = NN→NA 
phase 1 E 
|S|/2 
phase 2 E 
|S|/2 
|S|/3 
2 × E phase 1 
2 × E 
|S|/3 
phase 2 
E E 
|S|/3 2|S|/3 
phase 3 
Figure 4.2: Search phases for matching a string S with one edit. In this diagram the
string is represented by a rectangle. The blue regions indicate where edits (E) are
considered, and the arrows indicate the search direction.
75
This is achieved by constructing the FM-index of the text, T , and its reverse T 0. For a
backward search, the FM-index of T is used, whilst for a forward search the FM-index
of T 0 is used.
Figure 4.3 shows the search phases for matching strings with two edits. The edits are
constrained to: the rst two-thirds of the string (phase 1), the last two-thirds of the string
(phase 2), and the rst third and last third of the string (phase 3). The benet from
constraining the edit positions gradually diminishes as the number of edits permitted is
increased. This is because the section of the string that is exact matched rst is smaller,
therefore less of the search space is pruned. This limitation is typically addressed using
a two-stage approximate matching strategy such as that in Soap3-dp [15]. Here, the
FM-index is used for matching strings with up to two mismatches, then for larger edit
distances (including gaps) index-assisted dynamic programming is used.
The FM-index search algorithm for one mismatch is shown in Algorithm 6. In each
step of the search algorithm the interval [low, high) is updated for a single character.
If the search position is equal to the mismatch position, then a character dierent to
that in the string is used (a mismatch character), otherwise the string character is used.
The mismatch position is initialised to the middle of the string as for one mismatch the
edit is constrained to either the rst half of the string (phase 1) or the second half of
the string (phase 2). When a mismatch character is used the current interval is stored.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Solution 2 = NN→NA Solution 1 = NN→AN 
A 
B 
N 
A B N A B N A B N 
A 
B 
N 
A B N A B N A B N 
Solution 1 = NN→NA 
phase 1 E 
|S|/2 
phase 2 E 
|S|/2 
|S|/3 
2 × E phase 1 
2 × E 
|S|/3 
phase 2 
E E 
|S|/3 2|S|/3 
phase 3 
Figure 4.3: Search phases for matching a string S with two edits. In this diagram
the string is represented by a rectangle. The blue regions indicate where edits (E) are
considered, and the arrows indicate the search direction.
76
This allows the search algorithm to backtrack by restoring the interval and moving the
search position back to the mismatch position. After updating the interval a backtrack
is performed if: a) a match is found (so the remaining paths can be explored); or b) the
permuted string does not occur in the text. When all the possible mismatch characters
have been tested the mismatch position is incremented so that in the next step the
interval is updated using the string character.
The search algorithm is terminated if a mismatch is encountered before the con-
strained position or if all the valid paths have been explored. In both cases there are
no more possible matches for the current search phase. Algorithm 6 can be modied to
Algorithm 6 FM-index search algorithm for one mismatch.
Input: string S, text T , FM-index F with sample distance d, and search phase p. For p equal
to 1 the FM-index of the text is used, else the FM-index of the reversed text is used.
Output: the nal interval, mismatch position mpos and mismatch character mch for all the
possible matches.
1: i 0 . search position
2: [low; high) [0; jT j) . initialise interval
3: mpos  jSj=2 . constrain mismatch position
4: while (1) do
5: if i = mpos then . select mismatch character
6: ch mch
7: [lows; highs) [low; high) . store current interval
8: else . select string character
9: if p = 1 then
10: ch S[jSj   1  i] . backward search
11: else
12: ch S[i] . forward search
13: low  update(ch; F [low=d]; low) . update interval
14: high update(ch; F [high=d]; high)
15: i i+ 1
16: if low < high and i = jSj then . output match
17: output(low; high;mpos;mch)
18: if (low < high and i = jSj) or low  high then . backtrack
19: [low; high) [lows; highs)
20: i mpos
21: mch  mch + 1 . next mismatch character
22: if mch =2 T then . all possible mismatch characters tested
23: mpos  mpos + 1
24: if (low  high and i < mpos) or i > jSj then
25: return . mismatch before constrained position or all paths explored
26: end while
77
support any type and number of edits. For instance, the edit operation for one deletion
would involve incrementing the search position so that the current string character is
skipped. The algorithmic optimisations presented in Section 3.2 can be used to improve
the performance of the search algorithm. It should be noted, however, that the k-step
FM-index signicantly increases the complexity of the backtracking logic as multiple
characters are matched in the same step.
4.1.2 Hardware architecture
In previous eorts approximate string matching is supported by using one or more
exact string matching modules (ESMs) and performing the edit operations separately
in hardware or software. For example, in [19] several ESMs are connected in a linear
pipeline. When a character cannot be matched a separate replace block copies the string
T 1 times, where in each copy the mismatch character is replaced by another character
in the text alphabet. The copies are then sent to the next ESM in the pipeline, which
attempts to match the remaining characters in the string. Another example is [19] where
the FPGA contains a single ESM. If a character in the string cannot be matched, the
string is sent back to the host which modies one or more characters and sends it back
to the FPGA for processing. The advantage of using just ESMs is that the hardware
architecture is relatively simple and can be easily extended to support dierent edit
types and distances. The main limitation, however, is that this approach does not scale
well for an exhaustive search as the number of times the string must be copied or sent
back to the host grows exponentially with the number of edits permitted.
In this work the FPGA is populated with specialised approximate string matching
modules (ASMs) which match strings to a text with a specic edit type and distance.
For example, ASMs can be designed for one mismatch, two mismatches, one deletion,
etc. Each type of module is based on the backtracking FM-index and uses an exhaustive
search to nd all the potential matching locations with the given edit type and distance.
The advantages of using specialised ASMs are: 1) each ASM can be designed with
a custom set of search phases which prune the search space by constraining the edit
78
positions (Section 4.1.1); and 2) the use of backtracking allows the entire search space
to be explored without having to make copies of the string or send it back to the host.
The tradeo, however, is that the architecture of the ASMs can be highly complex, with
large amounts of control logic. It is therefore much slower to design the ASMs than
simply reusing the ESM like in previous eorts.
Figure 4.4a shows the hardware architecture for approximate string matching. This
architecture comprises three components:
1. An FPGA which is populated with ASMs. The ASMs comprise two blocks: 1)
the compute block, which performs the backtracking FM-index search algorithm
for the given edit type and distance, and 2) the command block, which sends the
memory commands for the FM-index buckets to the memory controller. These
two blocks are decoupled to make it easier to avoid deadlock scenarios and race
conditions between sending the memory commands and receiving the data from
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
 
 
edit                
states 
e 
i 
S[i] 
select             
logic 
ch 
FPGA 
Compute   
block 
Command             
block 
Host interface 
Memory interface 
in 
out F[low/d] F[high/d] 
DRAM 
FM-index: F FM-index: F’ out 
ASM 
backtracking 
logic 
update update 
i 
edit                
states 
e 
low’ high’ 
i low high 
edit               
states 
string 
characters low high 
cmds 
a)
 
 
  
 
 
 
 
 
 
 
 
 
 
 
  
update update 
low 
low high F[low/d] F[high/d] 
sym 
to stream head 
edit states 
edit operation 
logic 
e 
backtracking logic 
high 
low’ high’ edit states 
e 
to command block to stream head 
sym 
String F[high/d] F[low/d] high low 
high’ low’ 
 / 
update update 
select logic 
low high 
i E 
backtracking logic 
d d 
 / 
low’ high’ E’ i’ 
E’ 
i 
b)
Figure 4.4: a) Approximate string matching architecture. b) Compute block architec-
ture. The stream labelled E comprises the interval, edit position and edit value for each
permitted edit. Note that the some of data and control streams have been omitted to
improve readability.
79
memory. In Section 4.2 we discuss the various ways of mapping the ASMs onto
the FPGA.
2. Memory where the FM-index and ASM output is stored. In this work we target
o-chip memory, in particular DRAM, as it is the most common type of memory
available on FPGA platforms that is able to store the FM-index of large texts. The
ASM output is not streamed directly to the host CPU as the number of matches per
string is non-deterministic and the host interface typically requires that the amount
of bytes to be transferred is set before the computation is initiated. To allow for
both forward and backward searches, the FM-index of the text and reversed text
are stored.
3. A host CPU which performs the software tasks that are not accelerated.
The approximate matching procedure is as follows: rst the host CPU rst packages
the strings into packets composed of a unique string identier, the string length, and the
string characters. In order to minimise host-FPGA communication, the string characters
are encoded with dlog2 jT je bits instead of 8 bit ASCII values. The string packets are
then streamed to the FPGA where they are processed in parallel by the ASMs. When an
ASM nds a match an output packet composed of the string identier, the nal interval,
and each edit is streamed to DRAM. After all the strings have been processed, the host
CPU reads the output packets from DRAM. The sux array is then used to convert
the intervals into positions in the text. This conversion is performed on the host CPU,
as it is a constant time operation and not computationally intensive enough to justify
ooading to the FPGA. For strings matched using a forward search, the sux array of
the reversed text is used to convert the intervals.
The compute block architecture is shown in Figure 4.4b. This diers from the ESM
compute block in the following ways:
 Additional cyclical streams are implemented to store the interval, position and
value for each edit permitted. For example, the ASM for one mismatch requires
four additional cyclical streams: two streams store the values of low and high at
80
the mismatch position and the other two streams store the mismatch position and
the current mismatch character.
 Backtracking is performed using two additional logic blocks. The select logic block
selects the character the interval is updated for. If the search position is equal
to an edit position, then the edit character is selected and the current interval
is stored. Conversely if the search is not at an edit position, then the string
character is selected. The backtracking logic block modies the search position
when a backtrack is performed. If low  high or a match is found, the search
backtracks to the previous edit position by restoring the interval and moving the
search position back to the edit position. The backtracking logic block is also
responsible for updating the edit position and character so that: a) the edit position
is incremented when all possible edit values have been tested, and b) the edit value
is updated so a new one is attempted each time the search algorithm backtracks.
4.1.3 Seed-and-compare
The seed-and-compare optimisation is used to improve the performance when only mis-
matches are permitted. This optimisation allows the majority of strings to be processed
using an exact search rather than with backtracking. As the name suggests, the seed-
and-compare optimisation comprises two steps which are performed before the strings
are matched by the ASMs:
Seed step. The string is split into m + 1 non-overlapping seeds with equal lengths,
where m is the number of mismatches permitted. For example, if m = 1 then the
string is split into two seeds with lengths jSj=2. The seeds are then exact matched
to positions in the text using the FM-index. These positions are referred to as
candidate matching locations (CMLs).
Compare step. The string is directly compared to the text at each CML. Since the
seeds start at dierent positions in the string, their oset is subtracted from the
CML. In order to limit the number of comparisons between the string and the text,
81
a threshold is introduced. If the number of CMLs exceeds the threshold value, then
the compare step is omitted and the string is matched using the ASMs. If there
are no CMLs, then the string contains more than m mismatches and can therefore
be omitted from any further processing.
For long strings, such as those produced by NGS platforms, there will naturally only
be a handful of CMLs per string. As a result, the majority of strings can be eciently
processed using an exact search and direct comparison rather than with backtracking.
The graphs in Figure 4.6 show the frequency distribution of CMLs for strings of 100
characters when m = 1 and 2. Both plots show that with a threshold value of 3, over
90% of the strings can be processed using the seed and compare steps. The benets of this
optimisation diminish when the string length is small and/or the number of mismatches
permitted is large. This is because the number of CMLs per string increases and it is
therefore more likely to exceed the threshold value. In the worst-case all the strings have
a number of CMLs greater than the threshold value. In this case the performance is
worse than if just using the ASMs as the seed step is also performed. The ideal threshold
value is large enough so that the majority of the strings can be processed using the seed
and compare steps, but small enough so that the performance is not adversely aected by
the number of comparisons. In our testing we nd that a threshold value of 5 meets both
 
Pos 
X 
Y 
Seed Pos 
A X 
A Y 
B X+|S|/3 
C Y+2|S|/3 
A B C 
|S|/3 2|S|/3 
text 
A A 
B 
C 
X Y 
Seed step 
Compare step 
text @ Y: 
string: 
text @ X: 
string: AAAACG ⋯TAGGAG 
AAAACG ⋯AATGAG 
AAAACG ⋯AATGAG 
AAAACG ⋯AATGAG 
Figure 4.5: Seed and compare steps. In this example a string is matched to a text with
up to two mismatches.
82
  
 
 
90%
91%
92%
93%
94%
95%
96%
1 2 3 4 5
C
u
m
u
la
ti
ve
 %
 o
f 
st
ri
n
gs
 
Number of CMLs 
86%
87%
88%
89%
90%
91%
92%
93%
1 2 3 4 5
C
u
m
u
la
ti
ve
 %
 o
f 
st
ri
n
gs
 
Number of CMLs 
a)
 
 
 
 
90%
91%
92%
93%
94%
95%
96%
1 2 3 4 5
C
u
m
u
la
ti
ve
 %
 o
f 
st
ri
n
gs
 
Number of CMLs 
86%
87%
88%
89%
90%
91%
92%
93%
1 2 3 4 5
C
u
m
u
la
ti
ve
 %
 o
f 
st
ri
n
gs
 
Number of CMLs 
b)
Figure 4.6: Frequency distribution of CMLs for 10 million strings of 100 bases with a)
one mimsatch and b) two mismatches. In both experiments the strings are matched to
the Human genome.
of these criteria. It is worth noting that the seed-and-compare optimisation is similar
to the seed-and-extend algorithm. The key dierence is that instead of using Smith-
Waterman local alignment to extend the CMLs, a direct comparison to the text is used.
As a result, this optimisation should be used when only mismatches are permitted, i.e.
no insertions or deletions.
The seed step is accelerated with FPGAs by modifying the ESM to perform the FM-
index search algorithm shown in Algorithm 7. For a string S with length jSj = m, the
search algorithm has at most m steps. In each step the interval [low; high) is updated
for a single character in the string. If after an update low  high, or the entire seed has
been matched, then the nal interval and the seed oset is output. The search position
then moves to the start of the next seed and the interval is reinitialised. As is the case for
the ESM, the algorithmic optimisations presented in Section 3.2 can be used to improve
the performance of the search algorithm.
The compare step is shown in Algorithm 8. Although this step can also be accelerated
with FPGAs, we nd that for small threshold values it is more ecient to perform it on
a CPU with multiple threads. This is because in most cases the time to congure the
FPGA and stream the data exceeds the time taken to perform the entire compare step
on a CPU.
83
Algorithm 7 FM-index search algorithm for the seed step.
Input: string S, text T , FM-index F with sample distance d, and number of mismatches
permitted m.
Output: the nal interval [low; high) and oset for each match.
1: [low; high) [0; jT j)
2: i 0
3: spos  0 . start position of seed
4: slen  min(jSj   i; djSj=(m+ 1)e) . seed length
5: while (1) do
6: low  update(S[i]; F [low=d]; low) . update interval
7: high update(S[i]; F [high=d]; high)
8: i i+ 1
9: if low < high and i = spos + slen then . seed matched
10: output(low; high; jSj   i) . output nal interval and oset
11: [low; high) [0; jT j)
12: spos  spos + slen
13: i spos
14: slen  min(jSj   i; djSj=(m+ 1)e)
15: if low  high then . seed does not match
16: [low; high) [0; jT j)
17: spos  spos + slen
18: i spos
19: slen  min(jSj   i; djSj=(m+ 1)e)
20: if i  jSj then . all seeds processed: terminate
21: return
22: end while
Algorithm 8 Compare step.
Input: string S, text T , CML c and number of mismatches permitted m.
Output: the CML and each mismatch found.
1: mis 0 . number of mismatches
2: for i 0 to jSj   1 do
3: if S[i] 6= T [c+ i] then . character does not match
4: mis mis+ 1
5: if mis > m then . number mismatches > m
6: break
7: store(i; T [c+ i]) . store mismatch position and character
8: end for
9: if mis  m then . report match
10: return c; store
84
4.2 Design-space exploration
In Section 4.1, specialised modules are used to match strings to a text with a specic
edit type and distance. In order to solve problems, such as nding the best match within
a given edit distance, a set of these modules are connected into a linear pipeline with
increasing edit distance. The strings propagate through the pipeline until successfully
matched, or the end of the pipeline is reached. In this section we explore the following
ways in which the pipeline of modules can be mapped onto FPGAs:
 A static architecture, where the pipeline of modules is mapped to a single FPGA
conguration (Section 4.2.1).
 A novel run-time recongurable architecture, where the modules in the pipeline are
mapped to separate FPGA congurations and run-time reconguration is used to
dynamically switch between them (Section 4.2.2). We show that this architecture
allows improved resource utilisation, throughput, and design modularity compared
to the static architecture.
The following table denes the symbols used in our analysis of static and run-time
recongurable architectures:
Symbol Denition
A Total resources available on the FPGA
Pi Population of a given module on the FPGA
ri Number of resources required by a given module
Ni Number of strings processed by a given module
ti Time for a given module to process a string
tr Reconguration overhead
4.2.1 Static architecture
In a static architecture the pipeline of modules is mapped to a single FPGA conguration.
When a string is unable to be matched by a given module it is forwarded to the next
module in line. This forwarding continues until the string is matched or the end of the
85
pipeline is reached. Figure 4.7 shows a static architecture for a pipeline with 3 modules.
In this example, the FPGA contains one of each type of module, however in most cases
there are sucient resources on the FPGA to replicate certain modules. For a static
architecture, the achievable population of a given module is:
Pi =
A P
j 6=i
Pj  rj
ri
(4.1)
This is the resources available after all the other modules in the pipeline have been
placed on the FPGA divided by the resources required for the module. Since the FPGA
contains several dierent types of resources, such as LUTs, FFs and BRAMs, the critical
resource for the module should be used in this equation. This is the resource type which
uses the greatest proportion of the available FPGA resources.
For the static architecture, the modules in the pipeline process the strings concur-
rently. The best-case execution time is therefore the maximum process time across all
the modules in the pipeline:
Tbest = max

N1t1
P1
;
N2t2
P2
;   

(4.2)
In the worst-case, the strings are arranged according to their edit distance so that the
amount of concurrent processing is minimised:
Tworst =
X
i
Niti
Pi
(4.3)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
module 1 
module 1 
module 1 
unmatched 
strings 
unmatched 
strings 
module 2 
module 2 
module 2 
unmatched 
strings 
module 3 
module 3 
module 3 
output 
output 
FPGA configuration 1 
FPGA configuration 2 
FPGA configuration 3 
output 
step 1 
step 2 
step 3 
time 
module 1 
module 2 
module 3 output 
FPGA Configuration 1 
unmatched 
strings 
Figure 4.7: Static architecture for a pipeline with 3 modules.
86
When applied to DNA sequencing analysis steps, such as short read alignment, the
static architecture presents several limitations which can reduce the design performance
and practicality.
 In the static architecture, the entire pipeline of modules is mapped to a single
FPGA conguration. In some cases, however, there are insucient resources
available on the FPGA to do this. As a result, a subset of the modules must
be performed in software, which reduces the performance according to Amdahl's
Law. This scenario typically occurs when a large edit distance is permitted, i.e the
pipeline contains many specialised modules.
 To improve the performance of the static architecture, the modules with the longest
process time are replicated. The modules often have very dierent throughputs,
therefore creating a balanced pipeline can be challenging given the limited resources
available on the FPGA.
 For short read data, the number of unmatched strings tails o with the edit distance
(Figure 4.8). Since only unmatched strings propagate through the pipeline, the
modules towards the end of the pipeline will only process a small fraction of the
strings. These modules are therefore idle for the majority of the execution time,
which reduces the hardware eciency.
                        
Exact match 
73% 
One 
mismatch 
12% 
Two 
mismatches 
3% 
Unmatched 
12% 
Figure 4.8: Alignment breakdown for the ERR161544 1 dataset [53]. This dataset com-
prises 74 million short reads of 100 bases sequenced by the Illumina Hiseq 2500 platform.
The short reads are aligned to the Human genome with up to two mismatches permitted.
87
 The number of mismatches and gaps permitted in alignment often change de-
pending on the quality of data and the experiment being performed. For a static
architecture there is limited control over these parameters as the design is xed.
Any substantial changes may require the pipeline of modules to be modied, which
in turn must placed and routed on the FPGA. This process can take several hours
to days to perform, which greatly reduces the practicality of the design.
4.2.2 Run-time recongurable architecture
In a run-time recongurable architecture the modules in the pipeline are mapped to
separate FPGA congurations. For a pipeline with  modules, there are  FPGA
congurations, each comprising a single type of module. Figure 4.7 shows a run-time
recongurable architecture for a pipeline with 3 modules. In this example, the modules
are mapped to separate FPGA congurations and replicated as many times as possible.
The population of a given module in a conguration is:
Pi =
A
ri
Unlike the static architecture, the congurations only contain a single type of module,
therefore the population is simply the total resources available on the FPGA divided by
the resources required for the module.
For a pipeline with  modules, the run-time recongurable architecture is executed
over  steps. In each step the conguration for the given module is loaded onto the FPGA
using run-time reconguration. The unmatched strings from the previous step are then
streamed to the FPGA where they are processed in parallel by the modules. The module
output is either stored in external memory connected to the FPGA, or in the host CPU
memory depending on the memory architecture of the FPGA platform. For platforms
where the memory controllers are located on the FPGA, the module output must be
stored in the host CPU memory as reconguring the device will remove the memory
controller and lead to undened data in memory. Conversely for platforms where the
memory controllers are hardened, the module output can be stored in external memory,
88
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
module 1 
module 1 
module 1 
unmatched 
strings 
unmatched 
strings 
module 2 
module 2 
module 2 
unmatched 
strings 
module 3 
module 3 
module 3 
output 
output 
FPGA configuration 1 
FPGA configuration 2 
FPGA configuration 3 
output 
step 1 
step 2 
step 3 
time 
module 1 
module 2 
module 3 output 
FPGA Configuration 1 
unmatched 
strings 
Figure 4.9: Run-time recongurable architecture for a pipeline with 3 modules.
which reduces the host-FPGA communication time.
The execution time for a run-time recongurable architecture is given by Equa-
tion 4.4. This is the time taken for each conguration to process the unmatched strings
as well as the reconguration overhead trec. For most FPGA platforms the recongura-
tion time takes in the order of 100ms to 1s, therefore for large datasets this overhead can
be ignored. Unlike the static architecture, the modules in the pipeline are not executed
concurrently. However, the congurations only contain a single type of module, therefore
the achievable population of each module is much higher. This can be seen as a trade-o
between task-level parallelism and data-level parallelism.
T =
X
i

Niti
Pi
+ trec

(4.4)
89
The run-time recongurable architecture is particularly well-suited to the string
matching problems in DNA sequencing analysis as the majority of the strings can be
matched exactly to the reference genome. In such cases the execution time of the static
and run-time recongurable architectures is dominated by the time taken for the ESMs
(labelled with a subscript 0) to process the strings:
Tstatic =
N0t0r0
A  P
j 6=0
Pj
Trec =
N0t0r0
A
Since the ASMs consume the majority of the FPGA resources in a static architecture,
the run-time recongurable architecture can achieve a higher population of ESMs and
therefore has a smaller execution time:
A  P
j 6=0
Pj
r0
<<
A
r0
) Trec < Tstatic
The other advantages of the run-time recongurable architecture are as follows:
 The modules in the pipeline are mapped to separate congurations, therefore it
becomes much easier to fully implement large pipelines of modules in hardware.
Furthermore, the complex task of connecting modules with dierent throughputs
is no longer required.
 The performance is not impacted by idle modules as the unmatched strings are
partitioned equally across all the modules in the conguration.
 The run-time recongurable architecture is completely modular, therefore the
pipeline can be modied without having to rebuild the design. For example, the
pipeline can be added to or rearranged simply by changing the conguration loaded
onto the FPGA in each step. This feature allows the number of mismatches and
gaps permitted in alignment to be changed without impacting the practicality of
the design.
90
4.3 Evaluation
In this section we implement several approximate string matching designs on the Maxeler
Max4 platform and evaluate their performance. In Section 3.3.1, we describe the exper-
imental setup such as the designs evaluated and the systems used. In Section 4.3.2 we
evaluate the performance of ASMs specialised for one mismatch and two mismatches, and
in Section 4.3.3 we measure the performance improvement when the seed-and-compare
optimisation is included. Finally in Section 3.3.4 we evaluate the best performing design
with up to two mismatches permitted against state-of-the-art CPU, GPU, and FPGA-
based short read alignment tools.
4.3.1 Experimental setup
The designs listed in Table 4.1 are implemented on a Maxeler MPC-X2000 system which
hosts 8 Max4 DFEs. Each DFE contains an Altera Stratix V FPGA (28nm feature size)
and 48GB of DRAM. The multi-channel DRAM mode is used, which allows up to 6
independent memory channels per DFE. This mode places the following restrictions on
the FM-index: 1) the FM-index size must be less than 8GB in order to t on the DIMMs
without splitting the text, and 2) the bucket size should be a multiple of 64 bytes so that
the memory bandwidth is fully utilised and the buckets are memory aligned. Each design
uses the same FM-index structure and optimisations as the best performing design in
Chapter 3. Here, the FM-index is constructed with a sample distance of 64 and uses
the k-step FM-index with a step size of 2. This gives an FM-index size of 5.2 GB and a
bucket size of 128 bytes. The achieved memory bandwidth for all the designs is 36GB/s
per DFE, which is 60% of the peak.
Design name Module description Modules per FPGA
D1 ESM - exact match 6
D2 ESM - seed step 6
D3 ASM - one mismatch 6
D4 ASM - two mismatches 6
Table 4.1: Approximate string matching designs evaluated.
91
In each design evaluated, the FPGA is populated with 6 ESMs or ASMs which
are connected to their own memory controllers. The ESMs/ASMs run at 150MHz,
whilst the memory controllers run at the maximum clock frequency of 800MHz. Given
that the performance of the designs is memory-bound, increasing the clock frequency
of the modules would not increase the performance any further. Table 4.2 shows the
FPGA resource usage for the designs. It can be seen that the BRAM usage is much
greater for the ASMs than the ESMs. This is due to the additional cyclical streams
required for backtracking FM-index, and the FIFOs used to buer the output streams
to memory. Although a higher population of modules is achievable in each design, the
memory channels are saturated with just one module, therefore the performance would
not increase any further.
The performance of each design is evaluated against an equivalent software version
with the same FM-index structure and algorithmic optimisations. The software versions
are written in C++11 and compiled with g++ version 4.9 with the optimisation ag
-O3. OpenMP is used to parallelise the software code so that the strings are matched
with multiple threads. The software versions are run with 12 threads on the MPC-X2000
host machine. This system has dual Intel Xeon E5-2640 CPUs (32nm feature size), 64
GB of RAM and a single Crucial MX200 SSD with read and write speeds of 500MB/s.
In each experiment the full Human genome build 38 [52] (3.1G characters) is used
as the text. Any `N' characters are removed from the text so that its alphabet is
fA, C, G, Tg. In Sections 4.3.2 and 4.3.3, the string datasets are generated by extracting
xed-length substrings from the text and inserting mismatches at random positions. This
Design LUTs FFs BRAMS
(K) (K)
D1 196 (37.3%) 182 (34.7%) 951 (37.1%)
D2 235 (44.8%) 263 (50.1%) 1030 (40.1%)
D3 280 (53.4%) 275 (52.4%) 1593 (62.0%)
D4 294 (39.2%) 302 (57.5%) 1773 (69.0%)
Table 4.2: Resource usage for each FM-index design when implemented on an Altera
Stratix V GS 5SGSD8 FPGA.
92
minimises any bias in the performance from having the mismatches concentrated at the
centre or ends of the strings. In Section 4.3.4 we use the dataset ERR161544 1 [53],
which comprises 74 million short reads of 100 bases sequenced by the Illumina Hiseq
2500 platform. We choose this dataset as it is publicly available and has been used to
evaluate other short read alignment tools [15, 23].
In Sections 4.3.2 to 4.3.3, the execution times reported only consider the raw string
matching time. This does not include the time for overheads such as le I/O, conguring
the FPGA and transferring the FM-index to DRAM. When evaluating against other
short read alignment tools, the end-to-end execution time is reported in order to ensure
a fair comparison. In all of our experiments the execution times are measured using the
Linux time command and are averaged over 5 runs.
4.3.2 Backtracking design
In D3 the FPGA is populated with ASMs specialised for one mismatch. The graph in
Figure 4.10a compares the performance of D3 and the software version. The plot shows
that D3 is on average 3.3 times faster than the software version. It can be seen that the
execution time scales linearly with the string length. This is because the search spaces
increases linearly with the string length and therefore more steps in the search algorithm
are required to nd all the possible matches.
In D4 the FPGA is populated with ASMs specialised for two mismatches. The graph
in Figure 4.10b compares the performance of D4 and the software version. The plot shows
that D4 is an average of 3.5 times faster than the software version. As is the case for
D3, the execution time of D4 scales linearly with the string length, however there is also
a large increase in the execution time for strings with 40 characters. This is because the
section of the string which is exact matched rst is small (13 characters), therefore less
of the search space is pruned. If the graph in Figure 4.10a is extended for smaller string
lengths we would expect a similar increase in the execution time. This would occur at
a string length of approximately 20, since half of the string is exact matched before the
mismatch is considered.
93
  
 
 
 
 
 
 
 
0
5
10
15
20
25
30
35
40
40 60 80 100 120
Ex
e
cu
ti
o
n
 t
im
e 
(s
) 
String length 
D3 (1 DFE)
Software version
10
20
30
40
50
60
70
80
90
100
110
40 60 80 100 120
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
String length 
D4 (1 DFE)
Software version
a)
 
 
 
 
 
 
 
 
 
0
5
10
15
20
25
30
35
40
40 60 80 100 120
Ex
ec
u
ti
o
n
 t
im
e 
(s
) 
String length 
D3 (1 DFE)
Software version
10
20
30
40
50
60
70
80
90
100
110
40 60 80 100 120
Ex
ec
u
ti
o
n
 t
im
e 
(s
) 
String length 
D4 (1 DFE)
Software version
b)
Figure 4.10: The performance of a) D3 and b) D4 are measured for matching 10 million
strings whilst varying the string length. In a) the strings contain one mismatch and in
b) the strings contain two mismatches.
The speed-up achieved by D3 and D4 is less than that observed for the ESM, which
is 5.1 times faster than the equivalent software version. This is due to: a) the memory
bandwidth is reduced as the ASM includes a write stream which uses the same memory
channel; b) the time taken to read the ASM output from DRAM is exposed in the
execution time as it is not interleaved with any processing; and c) there are additional
software tasks required for approximate string matching such as collating the ASM
output according to the string identier.
4.3.3 Seed-and-compare
The seed-and-compare optimisation is evaluated for approximate string matching with
one and two mismatches permitted. In both designs the FPGA is rst congured with
D2 which performs the seed step. If the number of CMLs for a string is less than the
threshold value, the string is compared to the text at each CML. The compare step is
performed on the host CPU using multiple threads as we nd that it is not computa-
tionally intensive enough to justify ooading it to the FPGA. We use a threshold value
of 5 as it allows the majority of the strings to be matched using the seed and compare
steps whilst limiting the number comparisons performed. After the compare step, the
FPGA is recongured with D3 (for one mismatch) or D4 (for two mismatches). Any
94
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
D1 D2 D3 D4 
Compare 
CPU 
FPGA 
0 < CML ≤ threshold 
CML > threshold 
CPU 
FPGA 
D2 D3 
CML > threshold 
Compare 
0 < CML ≤ threshold 
CPU 
FPGA 
D2 D4 
CML > threshold 
Compare 
0 < CML ≤ threshold 
a)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
D1 D2 D3 D4 
Compare 
CPU 
FPGA 
0 < CML ≤ threshold 
CML > threshold 
CPU 
FPGA 
D2 D3 
CML > threshold 
Compare 
0 < CML ≤ threshold 
CPU 
FPGA 
D2 D4 
CML > threshold 
Compare 
0 < CML ≤ threshold 
b)
Figure 4.11: Pipeline of modules for approximate matching strings with a) one mismatch
(P1) and b) two mismatches (P2).
strings with a number of CMLs greater than the threshold value are then streamed to
the FPGA where they are matched using the ASMs. The pipeline of modules for one
mismatch searches is referred to as P1 (Figure 4.11a) and the pipeline for two mismatch
searches is referred to as P2 (Figure 4.11b).
The graph in Figure 4.12 compares the performance of P1, the software version and
D3. Three key results can be read from this plot:
1. P1 is up to 2.3 times faster than the software version. The achievable speed-up is
limited by the time taken for software tasks such as the converting each CML to
a position in the text and the compare step. For strings of 100 characters we nd
that the time taken for these software tasks amounts to approximately 40% of the
execution time.
2. For strings with less than 60 characters P1 is up to 1.3 times slower than D3. This
is because: a) the seed length is smaller, therefore the number of CMLs per string
is likely to exceed the threshold value of 5; and b) more comparisons are performed
as the number of CMLs per string increases. For example, when reducing the string
length from 100 to 40, the number of strings processed by the seed-and-compare
optimisation decreases by 10% and the number of CMLs per string doubles.
95
  
 
 
 
 
0
5
10
15
20
25
30
40 60 80 100 120
Ex
e
cu
ti
o
n
 t
im
e
 (
s)
 
String length 
D3 (1 DFE)
P1 (1 DFE)
Software version
10
20
30
40
50
60
70
80
90
100
110
40 60 80 100 120
Ex
ec
u
ti
o
n
 t
im
e 
(s
) 
String length 
D4 (1 DFE)
P2 (1 DFE)
Software version
Figure 4.12: The performance of P1 is measured for matching 10 million strings with
one mismatch whilst varying the string length.
3. For strings with greater than 60 characters P1 is up to 1.2 times faster than D3.
This is because there are only a handful of CMLs per string, therefore the majority
of strings can be processed using an exact search and direct comparison rather than
with backtracking. For strings with 120 characters, 96% are processed using the
seed-and-compare optimisation and the average number of CMLs per string is 1.1.
The graph in Figure 4.13 compares the performance of P2, the software version and
D4. The plot shows that P2 is up to 3.2 times faster than the software version. Once
again it can be seen that the seed-and-compare optimisation is most eective for longer
 
 
 
 
 
 
0
5
10
15
20
25
3
40 60 80 100 120
Ex
ec
u
ti
o
n
 t
im
e 
(s
) 
String length 
D3 (1 DFE)
P1 (1 DFE)
Software version
10
20
30
40
50
60
70
80
90
100
110
40 60 80 100 120
Ex
e
cu
ti
o
n
 t
im
e 
(s
) 
String length 
D4 (1 DFE)
P2 (1 DFE)
Software version
Figure 4.13: The performance of P2 is measured for matching 10 million strings with
two mismatches whilst varying the string length.
96
strings. For example, for strings with 120 characters P2 is 1.6 times faster than D4,
however for strings of 40 characters it is 1.1 times slower. Given that current NGS
platforms produce short reads with 100-150 bases, we can conclude that the seed-and-
compare optimisation is an eective way of improving the approximate string matching
performance when only mismatches are permitted.
4.3.4 Short read alignment
In this section we evaluate our design for aligning strings with up to two mismatches
against the short read alignment tools listed in Table 3.3. The pipeline used in this design
(referred to as P3) comprises four stages: exact match (D1), seed-and-compare (D2), one
mismatch (D3), and two mismatches (D4) (Figure 4.14). The strings propagate through
the pipeline until they are successfully matched; consequently all the best matches with
up to two mismatches are found. For the seed-and compare stage, the strings are split
into three non-overlapping seeds which are matched to the text using D2. The pipeline
then branches according to the number of CMLs found:
1. If there are no CMLs, then the string has more than two mismatches and can
therefore be omitted from any further processing.
2. If the number of CMLs is less than or equal to the threshold value of 5, then the
string is compared to the text at each CML.
3. If the number of CMLs is greater than the threshold value the strings are processed
by the ASMs D3 and D4 in turn.
Tool Device Version Options Notes
BWA-aln [42] CPU 0.7.5 -n 2 -t 12 BWA samse is included.
Bowtie2 [16] CPU 2.1.0 -p 12 {very-fast No option for edit distance.
Soap2 [23] CPU 2.21 -M 0 -v 2 -p 12 {
Soap3-dp [15] GPU 2.3.172 -s 2 Soap3-dp uses CUDA 4.2.
FHAST [20] FPGA { { FHAST is run on 4 FPGAs on the
Convey HC-1.
Table 4.3: The CPU, GPU, and FPGA-based alignment tools evaluated against.
97
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
D1 D2 D3 D4 
Compare 
CPU 
FPGA 
0 < CML ≤ threshold 
CML > threshold 
CPU 
FPGA 
D2 D3 
CML > threshold 
Compare 
0 < CML ≤ threshold 
CPU 
FPGA 
D2 D4 
CML > threshold 
Compare 
0 < CML ≤ threshold 
Figure 4.14: The pipeline of modules (P3) for approximate matching strings with up to
two mismatches.
Each CPU-based alignment tool is run with 12 threads and the command line options
are set such that a maximum of two mismatches are permitted with no gaps. The GPU-
based alignment tool, Soap3-dp, is run on a system with an NVIDIA K40 GPU (28nm
feature size) which has 12GB of GDDR5 memory. The execution time of the FPGA-
based tool FHAST [20] is taken directly from the publication. The value reported is
for FHAST running on four Xilinx Virtex 5 FPGAs hosted by the Convey Computer
HC-1. Since a dierent dataset is used, we calculate the characters matched per second
(CMPS) in order to normalise the performance (Equation 3.16).
To ensure that our comparison to the software tools is fair, we match both the read
and its reverse complement to the Human genome and output a random alignment
location in the SAM format. In this experiment we align the dataset ERR161544 1 [53],
which comprises 74 million short reads of 100 bases sequenced by the Illumina Hiseq
2500 platform. When matched to the Human genome, 73% of the strings match exactly,
12% have one mismatch, 3% have two mismatches and the remaining 12% are unable to
be matched. The string dataset is in the FASTQ format and has a size of 19 GB. Both
P3 and the software version interleave the le I/O and string matching using a batch
system. Here, a single thread is responsible for reading the next batch of strings from the
le, while other threads (or the FPGA) align the current batch of strings concurrently.
To avoid having to recongure the FPGA multiple times per batch, all the unmatched
98
strings are buered and processed in a single batch. As a result, the FPGA is congured
four times during the entire computation. The tradeo, however, is that the host memory
usage increases due to storing the unmatched strings. For instance, buering 10 million
short reads with 100 bases requires approximately 3.84 GB of additional main memory.
Table 4.4 compares the performance of P3, the software version and the alignment
tools listed in Table 4.3. Four key results can be read from this table:
1. P3 running on a single DFE is 4.8 times faster than the best performing CPU-
based tool Bowtie2 and is 4.0 times faster than the GPU based tool soap3-dp.
The software version of P3 is 2.8 and 2.3 times faster than Bowtie2 and Soap3-dp
respectively. It should be noted that it is unclear whether the software alignment
tools nd all the possible matches (as is the case for P3 and the software version),
or terminate the search algorithm after the rst match is found. The speed-up
achieved by P3 and the software version can be attributed to the use of the k-
step FM-index and the seed-and-compare optimisations. To our knowledge this
is the rst time that these optimisations have been applied to the backtracking
FM-index.
2. P3 running on a single DFE is at least 2.0 times faster than the FPGA-based tool
FHAST running on 4 FPGAs. FHAST does not report the percentage of reads
aligned in their evaluation, however since it uses a greedy search we can expect
that it will be lower than that of P3. In addition to this, P3 nds all the best
Tool Feature Strings String Reads Execution CMPS Speed-up
size (nm) (M) length aligned time (s) (M)
BWA-aln 32 74.1 100 90.2% 2358 3.1 1.0
Soap2 32 74.1 100 90.2% 1470 5.0 1.6
Bowtie2 32 74.1 100 90.6% 1372 5.4 1.7
Soap3-dp 28 74.1 100 90.2% 1163 6.4 2.0
FHAST 64 18 101 { 138 13.2 4.3
P3 (software version) 32 74.1 100 90.2% 495 14.9 4.8
P3 (1 DFE) 28 74.1 100 90.2% 287 25.8 8.2
Table 4.4: The performance of P3 and the short read alignment tools listed in Table 3.3
is measured for aligning the ERR161544 1 dataset with up to two mismatches.
99
matching locations with up to two mismatches whereas FHAST terminates after
nding the rst one.
3. P3 is only 1.7 times faster than the software version. This result is due to the
software tasks, such as le I/O and buering the unmatched strings, amounting to
a signicant proportion of the execution time (55%). In order to achieve a greater
speed-up, the le I/O time can be minimised by: a) modifying the software to read
and write compressed data, and b) reducing the disk access time by using faster
disks and RAID. We also nd that the reconguration overhead for the MPC-
X2000 is quite large, which further limits the achievable speed-up. The time taken
to recongure the FPGA and transfer the FM-index to DRAM is 8 seconds,
therefore the reconguration overhead amounts to 11% of the total execution
time. This issue can be addressed in future FPGA platforms by using partial
reconguration and having hardened memory controllers so that the FM-index
does not need to be transferred to the DFE memory after each reconguration.
4. P3 has a sensitivity of 90% for this dataset, which is identical to the software
tools. This result is expected as the software tools also use an exhaustive search for
small numbers of mismatches permitted. When a larger number of mismatches and
gaps are required, P3 can be extended with an index-assisted dynamic program-
ming algorithm such as that in Soap3-dp. Since P3 uses a run-time recongurable
architecture, this can be appended to the pipeline without having to modify any
of the existing stages.
Table 4.5 shows the energy consumption of P3, the software version and the alignment
tools listed in Table 4.5. The device power of the CPU and GPU are their TDP values
taken from the product documentation, whilst the device power of the DFE is measured
from the MaxOS operating system using the maxtop command. The power measurement
from maxtop was sampled every second for the duration of the hardware execution time
and averaged. When the DFE was idle, the device power was measured as 18W. The
energy consumption was calculated by multiplying the device power with the execution
100
Tool Device power Execution Energy consumption
name (W) time (s) (KJ)
BWA-aln 190 2358 448.0
Soap2 190 1470 279.3
Bowtie2 190 1372 260.7
Soap3-dp 235 1163 273.3
P3 (software version) 190 495 94.1
P3 (1 DFE) 56 287 16.0
Table 4.5: Device power and energy consumption. Note that the GPU and DFE power
consumption is for the entire board (including memory), whereas the CPU power is for
the device only.
time. This can be considered an upper-bound estimate as it assumes that the given
device was utilised for 100% of the execution time. Furthermore, it does not include
the energy consumed by the host system (in the case of the GPU and DFE) and other
components like main memory and the SSD. The results from Table 4.5 indicate that the
DFE consumes over 5.9 times less energy than the software tools tested. As is the case
for the exact string matching design, this is due to the low operational clock frequency
coupled with the short execution time.
4.4 Summary
In this chapter we present a new approach for accelerating approximate string matching
with FPGAs which can achieve both high sensitivity and high performance. The novel
aspects of our design are:
 Specialised modules are used to match strings to a text with a specic edit type and
distance. For example, one mismatch, two mismatches, one deletion, etc. Each
type of module uses an exhaustive search to nd all the potential matches and
includes a custom set of search phases which prune the search space by constraining
the edit positions.
 A run-time recongurable architecture is used to map the specialised modules onto
the FPGA. This architecture can provide better resource utilisation, performance,
101
and design modularity than a static architecture.
 A seed-and-compare optimisation which improves the performance when only mis-
matches are permitted. This optimisation allows the majority of strings to be
processed using an exact search rather than with backtracking.
We implement several approximate string matching designs on the Maxeler Max4
Platform and show that the best performing design running on a single DFE is 4.8 times
faster than Bowtie2, 4.0 times faster than the GPU-based tool Soap3-dp, and at least 2.0
times faster than the FPGA-based tool FHAST. In addition to this, the design consumes
over 5.9 times less energy than the alignment tools tested.
In the following chapter we explore how our FM-index design can be adapted to
support the compression of sequence data.
102
Chapter 5
Sequence data compression
Over the last decade, the advances in NGS has revolutionised a multitude of scientic
disciplines including medicine, molecular biology, and forensics. Compared to traditional
Sanger sequencing, the massively parallel sequencing platforms are able to provide orders
of magnitude more data at a much lower cost. This rapid growth in the amount of
sequence data available surpasses the Moore's law advances in disk technology, and
therefore challenges our ability to store the data in a cost ecient manner. For example,
large-scale genomics programs, such as the Million Veterans Program [38], are estimated
to produce hundreds of Petabytes of genomic data, which will cost millions of dollars to
store per year based on current pricing models. In 2025 the amount of genomic data is
expected to reach up to 40 Exabytes of data [54]. At this point the computing facilities
required to store and analyse this massive amount of data will dominate the cost of
projects.
To alleviate the problem of data storage, researchers and archives have adopted the
use of general purpose compression tools to reduce the size of sequence data. Tools,
such as gzip and bzip2, are widely used and can reduce the size of sequence data by
approximately 2 to 4 times. In recent years, compression tools specialised for sequence
data have been developed. These tools exploit the characteristics of the sequence data
formats, allowing them to achieve a higher compression ratio than the general purpose
tools. One compression technique used in many specialised tools is reference-based
103
compression [55, 56, 57]. In this technique, the sequences are encoded as a mapping to
a reference genome. This exploits the property that inter-species DNA is highly similar
(over 99.9% for Humans), therefore it is much more ecient to store the dierences
instead of the full sequences. Specialised compression tools which utilise reference-based
compression have been shown to achieve a much higher compression ratio than general
purpose tools. The trade-o, however, is that these tools have a much slower compression
speed, as mapping the sequences to the reference genome is a computationally intensive
problem. This limitation has greatly reduced the impact of specialised tools in research
and clinical settings.
In this chapter we accelerate sequence data compression with FPGAs. The main
contributions of this work are:
1. The FM-index search algorithm is adapted for reference-based compression and we
present its rst acceleration with FPGAs.
2. The reference-based compression design is integrated into fastqZip and fastaZip,
two new tools that we have developed for compressing sequence data in the FASTQ
and FASTA formats respectively.
3. An evaluation of fastqZip and fastaZip against state-of-the-art compression tools.
We show that both tools are able to achieve a higher compression ratio than general
purpose whilst also having faster compression and decompression speeds.
The rest of this chapter is organised as follows: in Section 5.1 we adapt the FM-index
search algorithm for reference-based compression and accelerate it with FPGAs; In Sec-
tion 3.2, we present fastqZip and fastaZip, two new tools for compressing sequence data
in the FASTQ and FASTA formats respectively; and nally in Section 3.3 we evaluate
the performance of fastqZip and fastaZip against current state-of-the-art compression
tools.
104
5.1 Reference-based compression
In this section we accelerate reference-based compression with FPGAs. The novel aspects
of this design are:
 An adaptation of the FM-index search algorithm for reference-based compression
(Section 5.1.1). In this approach the FM-index is used to map sequences to a
reference genome, where each mapping is encoded as a group of tuples.
 A hardware architecture which performs the search algorithm for reference-based
compression (Section 5.1.2). This architecture is based on the exact string match-
ing architecture presented in Section 3.1.2.
5.1.1 Algorithm overview
In reference-based compression DNA sequences are encoded as a mapping to a refer-
ence genome. The mapping can be represented in a variety of ways such as the SAM
format [58] or tuples of position and length [56]. In this work, we encode the mapping
as a group of tuples hpos; len; symi, where pos is the position at which the sequence
matches to the reference genome, len is the length of the match, and sym is the se-
quence character following the match. An example of reference-based compression using
this representation is shown in Figure 5.1. Here, the rst sequence exactly matches to
the reference genome, therefore it is encoded with a single tuple. On the other hand,
the second sequence has a mismatch at position 4 and must be encoded with two tuples.
To decompress the sequence, each tuple is replaced by the corresponding section of the
Reference: AATGGGACGTGAGGGTTCCTCAGGCC
Sequences 3-tuples hpos; len; symi
AATGGG h0; 5; i
TTCCACA h15; 4; Ai h20; 2; i
Figure 5.1: Reference-based compression example. For matches that reach the end of
the sequence, the sym eld is set to the null value ` '.
105
reference genome. For example, the tuple h15; 4; Ai, is replaced by a section of the ref-
erence genome starting at position 15 with length 4. An `A' character is then appended
to the end of the string to complete the decompression.
In existing reference-based compression tools, the mapping procedure is performed
using a variety of methods including the BWA sequence alignment tool [58] and hash
tables [56, 57]. In this work we present the rst adaptation of the FM-index for this
purpose. The main benets of this approach are as follows:
1. The time complexity of the search algorithm is only weakly dependent on the
text length, therefore the mapping performance is not signicantly impacted when
using large reference genomes such as the Human genome (3.1G characters).
2. The FM-index only needs to be built once per reference genome, therefore the
construction time can be amortised when compressing many les. For hash-table
methods the construction cost (from collision handling, etc) is incurred each time
the compression tool is run. This is one of the main bottlenecks in previous ef-
forts [56, 57].
3. The FM-index search algorithm uses a backward search to match all the characters
in the string, therefore the mapping can be represented with the fewest number of
tuples possible. In [58], the string is split into k-mers which are used as seeds for
local alignment. This approach does not ensure the longest mapping is found, as
it is a local rather than an end-to-end alignment.
Algorithm 9 shows the FM-index search algorithm for reference-based compression.
For a string S with length jSj = m, the search has m steps. In each step, the interval
[low; high) is updated for a single character in the string using a backward search.
If low  high after an update, i.e. the character cannot be matched, then a tuple
hpos; len; symi is added to the end of the group. The interval and the length of the
match are then re-initialised so that in the next step a new search is started. In cases
where the match reaches the end of the string, a tuple with the sym eld set to null is
106
Algorithm 9 FM-index search algorithm for reference-based compression.
Input: string S, text T , FM-index F with sample distance d, sux array of the text A.
Output: a group of tuples, tup, representing the mapping.
1: [low; high) [0; jT j)
2: len 0
3: for i jSj   1 to 0 do . update interval for each character in S
4: low0  update(S[i]; F [low=d]; low)
5: high0  update(S[i]; F [high=d]; high)
6: if low  high then . append tuple and reinitialise interval
7: tup:append(A[low]; len; S[i])
8: [low; high) [0; jT j)
9: len 0
10: else
11: [low; high) [low0; high0)
12: len len+ 1
13: if i = 0 then . match reaches end of string
14: tup:append(A[low0]; len; null)
15: end for
added to the end of the group. The algorithmic optimisations presented in Section 3.2
can be used to improve the performance of the search algorithm. For example, the k-step
FM-index can be used to reduce the number of memory accesses by a factor of k. Note
that if this optimisation is used the sym eld has an alphabet size jT jk.
5.1.2 Hardware architecture
Figure 5.2a shows the hardware architecture for reference-based compression. This ar-
chitecture comprises three components:
1. An FPGA which is populated with compression modules (CMs). The CMs com-
prise two blocks: 1) the compute block, which performs the FM-index search al-
gorithm for reference-based compression, and 2) the command block, which sends
the memory commands for the FM-index buckets to the memory controller. These
two blocks are decoupled to make it easier to avoid deadlock scenarios and race
conditions between sending the memory commands and receiving the data from
memory.
2. Memory where the FM-index and CM output are stored. In this work we target
107
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Sequencing direction 
CM 
FM-index: F 
DRAM 
F[high/d] F[low/d] cmds 
in 
Memory interface 
Host interface 
Command             
block 
Compute   
block 
FPGA 
out 
CM output 
a)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
FPGA 
Compute   
block 
Command             
block 
Host interface 
Memory interface 
in out 
cmds F[low/d] F[high/d] 
DRAM 
FM-index: F 
ESM 
update update 
 /  / 
d d 
low’ high’ 
low high F[low/d] F[high/d] String 
S[i] 
to stream head to command block 
 + 
i 
 < 
0 |T| 
0 
 + 
1 
ctl 
ctl 
len 
0 
0 
0 1 1 
1 
i len low high 
b)
Figure 5.2: a) Reference-based compression architecture. b) Compute block architecture.
Note that some data and control streams have been omitted to improve readability.
o-chip memory, in particular DRAM, as it is the most common type of memory
available on FPGA platforms that is able to store the FM-index of large texts.
The CM output is not streamed directly to the host CPU as the number of tuples
per sequence is non-deterministic and the host interface typically requires that the
amount of bytes to be transferred is set before the computation is initiated.
3. A host CPU which performs the software tasks that are not accelerated.
The compression procedure is as follows: rst the host CPU packages the strings
into packets composed of a unique string identier, the string length, and the string
characters. In order to minimise host-FPGA communication, the string characters are
encoded with dlog2 jT je bits instead of 8 bit ASCII values. The packets are then
streamed to the FPGA where they are processed in parallel by the CMs. When a
character in the string cannot be matched to the text, or the match has reached the end
of the string, an output packet is streamed to DRAM. These packets are composed of
the string identier, the value of low, the length of the match, and the current string
character. When all the sequences have been processed, the host CPU reads the output
108
packets from DRAM and constructs the group of tuples for each string. This involves
iterating through the output packets and appending the corresponding tuple to the group
given by the string identier. The sux array is used to convert the value of the pos
eld in each tuple into a position in the text. This conversion is performed on the host
CPU as it is a constant time operation and not computationally intensive enough to
justify ooading to the FPGA.
The compute block architecture is shown Figure 5.2b. This diers the ESM compute
block in the following ways:
 An additional cyclical stream is implemented to store the length of the match. If
low < high following the update, i.e. the character matches, then the length of
the match is incremented, otherwise it is reset to 0.
 Instead of terminating the search algorithm if a character does not match, the
interval is reset so that in the next step a new search is started. This is achieved
by adding multiplexers after the update block with the value of low < high used
as the select lines. If this condition is true then the updated interval is used in the
next step, otherwise it is reset to the initial values.
5.2 FASTQ and FASTA compression
In this section we introduce fastqZip and fastaZip, two new tools that we have developed
for compressing sequence data in the FASTQ and FASTA formats respectively. The
FASTQ format is the defacto standard output format for NGS platforms, whilst the
FASTA format is used to store sequences without any quality score information. Both
tools have been designed with the following specication:
1. The compression ratio should be comparable with state-of-the-art specialised tools.
2. The compression and decompression speed should be comparable with that of the
general purpose tools. This ensures that the tools will not become a bottleneck
when storing and accessing the data.
109
1 @SRR001666 .1 071112 _SLXA -EAS1_s_7 :5:1:817:345 length =72
2 GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
3 +SRR001666 .1 071112 _SLXA -EAS1_s_7 :5:1:817:345 length =72
4 IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIIIIIII >IIIIII/
Figure 5.3: FASTQ format. Each sequence is stored over 4 lines. Line 1 contains the
sequence identier, line 2 contains the sequence, line 3 contains the strand information
and line 4 contains the quality scores.
1 >NM_012514 Rattus norvegicus breast cancer 1 (Brca1), mRNA
2 CGCTGGTGCAACTCGAAGACCTATCTCCTTCCCGGGGGGGCTTCTCCGGCATTTAGGCCTTTTAGGCCT
3 CGGCGTTTGGAAGTACGGAGGTTTTTCTCGGAAGAAAGTTCACTGGAAGTGGAAGAAATGACGATGCAG
Figure 5.4: FASTA format. Lines with the `>' character in the rst position are the
sequence identiers, the remaining lines contain sequences. No quality scores or strand
information are stored.
3. The compression should be completely reversible without any loss of information.
4. The tools must be robust and able to handle any valid input data. This aim is
particularly important for the FASTQ format as the sequence identier information
can have a variety of dierent elds which tend to crash existing specialised tools.
Figures 5.3 and 5.4 shows examples of the FASTQ and FASTA formats respectively.
Since these formats share many similarities, fastqZip and fastaZip are designed with
the same compression procedure: the sequence data is split into its components which
are then compressed using strategies specialised for the particular type of data. The
components for the FASTQ format are the sequence identiers, the sequences, strand
information, and quality scores, whilst the components for the FASTA format are the
sequence identiers and sequences. Below we summarise the compression strategy for
each component.
Sequence identiers
The sequence identiers are split into elds based on the delimiters `.:#/ '. The elds
are then categorised according to whether they are constant, alphanumeric, numeric, or
110
incrementing numeric. Constant elds and incrementing numeric elds are compressed
by storing only the value of the rst item. This can be done as all the subsequent items
can be reproduced from rst. Alphanumeric elds are compressed using the zlib library
(which uses the DEFLATE algorithm), whilst numeric elds are rst converted to a
binary representation, then compressed using zlib.
Sequences
The sequences are compressed using the reference-based compression algorithm pre-
sented in Section 5.1. In some cases the sequences contain `N' characters which repre-
sent an unknown character. Given that `N' characters are removed from the reference
genome when constructing the FM-index, these characters are unable to be matched
to the reference genome and therefore must be handled separately. In our approach,
the sequence is split into sections of contiguous DNA letters and `N' characters. The
DNA sections are mapped to the reference genome using the FM-index, whilst the `N'
sections are encoded using tuples of hlen; 0; xi, where len is the number of repeated `N'
characters and x is a unique symbol indicating the tuple refers to an `N' section.
To improve the compression ratio we develop format-specic optimisations which
reduce the number of tuples required to encode a sequence. For the FASTQ format,
we observe that the sequences are obtained from both strands of the DNA. Given that
the reference genome contains the sequence of one of the DNA strands, the sequences
obtained from the other strand give a better mapping when their reverse complement is
mapped instead. For example, in a typical FASTQ le 35% of the sequences can be
Reference: AATGGGACGTGAGGGTTCCTCAGGCC
Sequence 3-tuples hpos; len; symi
TCACGT h19; 3;Ci, h8; 2; i
ACGTGA h6; 6; i
Best mapping ! h6; 6; i
Figure 5.5: Reverse complement optimisation for FASTQ les.
111
Reference: AATGGGACGTGAGGGTTCCTCAGGCC
Sequences 3-tuples hpos; len; symi
AATGGG h0; 6; i
ACGTGA h6; 6; i
Merge tuples ! h0; 12; i
Figure 5.6: Merging tuples optimisation for FASTA les.
encoded with one tuple, however this percentage increases to 70% when their reverse
complement is also mapped. In fastqZip, if any of the sequences are encoded with more
than one tuple, then their reverse complement is also mapped to the reference genome.
The nal encoding is the mapping with the fewest number of tuples, along with an
additional bit used to determine if the mapping is for the original sequence or its reverse
complement. Figure 5.5 shows an example of this optimisation.
For the FASTA format, we observe that the sequence length often exceeds the max-
imum length supported by the hardware architecture (255 characters in our implemen-
tation). In this case the sequence will be split into several smaller sequences which are
matched separately. This limits the maximum achievable compression ratio of the se-
quences to d(jSj=L)e  T , where jSj is the sequence length, L is the maximum length
supported by the hardware architecture and T is the tuple size in bytes. In fastaZip, a
merging step is introduced where tuples referring to adjacent sections of the reference
genome are merged together. This improves the compression ratio by enabling tuples to
represent longer mappings. Merging can only occur for tuples with a null sym eld, oth-
erwise this information would be lost. After merging, the pos and len elds of the tuples
typically have a wide range of values. These elds are further compressed by encoding
them using variable length integers. Figure 5.6 shows an example of this optimisation.
Strand information
The strand information is by far the easiest component to compress since it is constant
throughout the le. The `+' or `-' character therefore only needs to be stored once for
112
all the values to be reproduced. An additional ag is also stored to indicate whether the
sequence identier is repeated in the strand lines or not.
Quality scores
The quality scores are compressed using a modied version of run-length encoding (Fig-
ure 5.7). In the original algorithm, consecutive appearances of the same character are
encoded with pairs of hchr; leni, where chr is the given character and len is the number
of times the character appears. In the worst case there are no consecutive characters,
therefore the data size increases by the product of the number of characters and the
number of bits used to encode the run lengths. Since the quality scores have a maxi-
mum value of 126, we address this limitation by utilising the unused bit in each byte to
signal either a single character or a character followed by a run length. If the run length
is 1, the character is stored and the most signicant bit of the byte is set. Conversely,
if the run length is greater than 1, the character is stored with the most signicant bit
of the byte unset and the run length is stored in the subsequent byte. The run length
encoded quality scores are then further compressed using the bzip2 algorithm.
Figures 5.8 and 5.9 illustrate the compression procedure of fastqZip and fastaZip re-
spectively. The output is a series of compressed les for each component which are then
archived into a single le using the tar utility. To decompress the sequence data, the
compressed les are extracted and the compression procedure is performed in reverse. In
both fastqZip and fastaZip the le I/O is interleaved with the compression and decom-
pression procedures using a batch system. Here, a single thread is responsible for reading
the next batch of data, while other threads (and the FPGA) compress/decompress the
current batch of data concurrently.
Quality scores A A A B C D D D D
Original run-length hA, 3i hB, 1i hC, 1i hD, 4i
Modied run-length A* 2 B C D* 3
Figure 5.7: Run-length encoding. The `*' indicates where the most signicant bit is set.
113
Figure 5.8: fastqZip compression pipeline.
114
Figure 5.9: fastaZip compression pipeline.
115
5.3 Evaluation
In this section we evaluate fastqZip and fastaZip against other state-of-the-art compres-
sion tools. In Section 5.3.1, we describe the experimental setup such as the hardware
design evaluated and the systems used. In Section 5.3.2 we evaluate the reference-based
compression performance. Finally, in Sections 5.3.3 and 5.3.4 we evaluate the perfor-
mance of fastqZip and fastaZip against other specialised and general-purpose compression
tools respectively.
5.3.1 Experimental setup
For this evaluation we implement two versions of fastqZip and fastaZip: 1) a software ver-
sion where the entire compression procedure is performed on a CPU, and 2) a hardware-
accelerated version where reference-based compression is ooaded to an FPGA. In the
hardware-accelerated versions, the reference-based compression design presented in Sec-
tion 5.1.1 is implemented on a Maxeler MPC-X2000 system which hosts 8 Max4 DFEs.
Each DFE contains an Altera Stratix V FPGA (28nm feature size) and 48GB of DRAM.
The multi-channel DRAM mode is used, which allows up to 6 independent memory
channels per DFE. This mode places the following restrictions on the FM-index: 1) the
FM-index size must be less than 8GB in order to t on the DIMMs without splitting
the text, and 2) the bucket size should be a multiple of 64 bytes so that the memory
bandwidth is fully utilised and the buckets are memory aligned. Both versions of each
tool use the same FM-index structure and optimisations as the best performing design
in Chapter 3. Here, the FM-index is constructed with a sample distance of 64 and uses
the k-step FM-index with a step size of 2. This gives an FM-index size of 5.2 GB and
a bucket size of 128 bytes. The achieved memory bandwidth for the reference-based
compression design is 36 GB/s per DFE, which is 60% of the peak.
In the hardware-accelerated versions the FPGA is populated with 6 CMs which
are connected to their own memory controllers. The CMs run at 150MHz, whilst the
memory controllers run at the maximum clock frequency of 800MHz. Given that the
116
performance of the designs is memory-bound, increasing the clock frequency of the CMs
does not improve the performance any further. The design uses 45.6% of the look-up
tables, 44% of the ip-ops and 56% of the BRAMs available on the Altera Stratix V GS
5SGSD8 FPGA. Although a higher population of CMs is available, each memory channel
is saturated with just one CM, therefore the performance would not increase any further.
If the FPGA platform was able to support any number of memory controllers, then we
estimate that 10 CMs would be able to populate the FPGA. This would increase the
performance by 1.7 times given that the performance scales linearly with the number of
CMs.
The software compression tools evaluated are run with 12 threads on the MPC-X2000
host machine. This system has dual Intel Xeon E5-2640 CPUs (32nm feature size), 64
GB of RAM and a single Crucial MX200 SSD with read and write speeds of 500MB/s.
The software versions of both tools are written in C++11 and compiled with g++ version
4.9 with the optimisation ag -O3. OpenMP is used to parallelise the software code so
that the data is compressed and decompressed with multiple threads. To enable a fair
performance comparison, we choose to evaluate the parallel implementations of gzip and
bzip2 (named pigz and pbzip2 respectively) rather than their single threaded versions.
When evaluating the software tools which use reference-based compression, the full
Human genome build 38 [52] (3.1G characters) is used as the reference. Any `N' char-
acters are removed from the text so that its alphabet is fA, C, G, Tg. We assess the
performance of each compression tool using three metrics: 1) the compression ratio (the
original le size divided by the compressed le size); 2) the compression time; and 3)
the decompression time. For the compression and decompression times, the end-to-end
execution time is measured using the Linux time command and averaged over 5 runs.
5.3.2 Reference-based compression
Figure 5.10 compares the reference-based compression performance of the hardware-
accelerated version and software version. Here, the execution times reported do not
include the time for overheads such as le I/O, conguring the FPGA and transferring
117
  
 
0
5
10
15
20
25
30
35
1 2 3 4 5 6 7 8
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Number of DFEs 
2.9 
13.9 
0
2
4
6
8
10
12
14
16
Ex
ec
u
ti
o
n
 t
im
e 
(s
) 
Software (12 threads) Hardware-accelerated (1 DFE) 
Figure 5.10: The reference-based compression performance is measured for mapping 10
million substrings of 100 characters to the Human genome.
the FM-index to DRAM. The plot shows that the hardware-accelerated version running
on a single DFE is 4.8 times faster than the software version. The reasons for this
speed-up are as follows:
1. The CPU cache has a negligible number of hits, which exposes the large DRAM
access latency (this is why the hardware architecture does not contain a cache).
2. The memory architecture of the Maxeler Platform enables a higher eective mem-
ory bandwidth than the host system (36 GB/s compared to 8 GB/s for the host
system).
Figure 5.11 evaluates the strong-scaling of the hardware-accelerated version. he
plot shows a near linear speed-up with the number of DFEs, which is expected as the
sequences can be mapped independently. If the number of DFEs continues increasing,
the execution time would asymptote to 3.6 seconds. This is the time taken to stream the
data to and from the DFE via inniband. It should be noted, however, that before this
limit is reached the performance bottleneck would be the le I/O and compressing the
other sequence data components such as the quality scores. The strong-scaling eciency
(as a percentage of linear) is:
eciency =
T (1 DFE)
8  T (8 DFE)  100% =
30
8  4:5  100% = 83:3%
118
  
 
0
5
10
15
20
25
30
35
1 2 3 4 5 6 7 8
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Number of DFEs 
2.9 
13.9 
0
2
4
6
8
10
12
14
16
DFE software
Ex
ec
u
ti
o
n
 t
im
e
 (
s)
 
Figure 5.11: The performance of the hardware-accelerated version is measured for com-
pressing 100 million strings of 100 characters with a varying number of DFEs.
5.3.3 FASTQ compression
In this section fastqZip is evaluated against the compression tools listed in Table 5.1.
We choose to compress three FASTQ les obtained from the PRJEB3177 project [53]
(ERR161544 1.fastq, ERR1615445 1.fastq and ERR161546 1). These les contain 74-
107 million short reads of 100 bases which are sequenced using the Illumina HiSeq
2000 platform. These datasets were used as: a) they are publicly available so these
experiments can be reproduced; and b) Illumina has an overwhelming market share in
the NGS market, therefore most of the FASTQ data accumulated is sequenced by one
of their platforms. It should be noted that the latest Illumina NGS platforms use a
dierent binning strategy for the quality scores which can improve the compression ratio
by up to 2 times [59].
Tool Version Reference Notes
based
pigz [60] 1.3.12 % Parallel implementation of gzip.
pbzip2 [61] 1.1.6 % Parallel implementation of bzip2.
fqzcomp [55] 4.6 % {
fastqz [55] 15 " Sequences with the quality score `!' are removed.
quip [62] 1.1.8 " De-novo assembly-based compression is used.
dsrc2 [63] 2.0 % {
Table 5.1: Compression tools evaluated for the FASTQ format.
119
Table 5.2 shows the compression time, decompression time and compression ratio for
each tool. Four key results can be read from this table:
1. The average compression ratio of fastqZip is 3% higher than the best performing
tool fastqz. This improvement is due to the quality scores being compressed more
compactly in fastqZip. Note that fastqZip uses a modied run-length encoding
followed by bzip2, whilst fastqz uses ZPAQ with a custom model. It is important
to highlight that the average compression time of fastqZip is 12.6 times faster than
fastqz for the software version and 24.9 times faster for the hardware-accelerated
version. When compared to the general purpose compression tools, the average
compression ratio of fastqZip is 2.0 times higher than pigz and 1.6 times higher
than pbzip2. This improvement is due to the specialised compression strategies
used for each component in the FASTQ le.
2. The compression time for the hardware-accelerated version of fastqZip is up to
1.8 times faster than pigz and pbzip2. The large speed-up achieved relative to the
specialised tools is due to the dierence in strategies used to compress the sequence
component. In particular, we nd that the highly optimised FM-index used in this
ERR161544 1 (19 GB) ERR161545 1 (28 GB) ERR161546 1 (20 GB)
Tool c.ratio c.time d.time c.ratio c.time d.time c.ratio c.time d.time
pigz 3.2 252 93 3.0 402 189 3.1 268 98
pbzip2 4.1 184 84 3.7 309 131 3.9 215 91
fqzcomp 5.4 346 349 4.9 609 537 5.1 435 379
fastqz 6.6 3689 3661 5.8 5185 5068 6.2 3877 3775
quip 5.4 633 1007 4.9 945 1391 5.1 698 698
dsrc2 4.7 102 80 4.3 148 123 4.4 121 89
fastqZip v1 6.8 292 73 5.9 416 130 6.3 304 87
fastqZip v2 6.8 146 73 5.9 219 130 6.3 149 87
Table 5.2: FASTQ le compression. Here, c.ratio denotes the compression ratio, c.time
denotes the compression time in seconds, and d.time denotes the decompression time in
seconds. fastqZip v1 runs entirely in software, whilst fastqZip v2 runs reference-based
compression on a single DFE. The original le sizes are shown next to the FASTQ le
name.
120
work is signicantly faster than ZPAQ with a custom model (fastqz) and de novo
assembly (quip). Although the hardware-accelerated version of fastqZip is 1.4 times
slower than dsrc2, this result can be oset by the fact that the average compression
ratio for fastqZip is 1.4 times higher.
3. The compression time for the hardware-accelerated version of fastqZip is only 2.0
times faster than the software version. This is due to the le I/O amounting to
a signicant proportion of the execution time (50%), therefore the achievable
speed-up is limited by Amdahl's law. In order to achieve a greater speed-up, the
le I/O time can be minimised by using faster disks and RAID. If the le I/O time
is omitted, then the hardware-accelerated version can be up to 3.8 times faster
than the software version when running on a single DFE.
4. The decompression time of fastqZip is comparable to the best performing tools
pbzip2 and dsrc2. We nd that the decompression performance of these tools is
I/O bound for the system used. This is because the decompression procedures
are not very computationally intensive, therefore writing the original FASTQ les
(19-28 GB) to disk dominates the execution time.
Figure 5.12 shows how the compressed le size is distributed among the FASTQ
le components. The plot shows that the majority of the compressed le size is from
 
quality scores 
74% 
sequences 
19% 
sequence 
identifiers 
7% 
Figure 5.12: File composition for the compressed ERR161544 1 FASTQ le.
121
the quality scores. This is expected, as the randomness of the quality scores makes
them more challenging to compress than the other components. After compression, the
sequence identiers and sequences are 21.1 and 13.9 times smaller respectively, whilst
the quality scores are only 3.6 times smaller. The large compression ratio observed for
the sequences indicates that they map well to the reference genome. We nd that for
the FASTQ les compressed, each sequence can be represented with an average of 1.2
tuples.
5.3.4 FASTA compression
In this section fastaZip is evaluated against the compression tools listed in Table 5.3.
We attempted to include results for the tools GDC2 [56] and FRESCO [57] (which
also use reference-based compression), however they failed to run on the full genome
datasets used in this evaluation. This is because both of these tools are more suited to
compressing large batches of short sequences in the FASTA format. In this evaluation
we compress three previous builds of the full Human genome (3.1G characters) obtained
from the UCSC genome archive. We choose these datasets as the latest NGS platform to
come to market (manufactured by Oxford Nanopore) is able to produce ultra long reads
(150 Kb) which may eventually span the entire Human genome. We predict that future
clinical tests which leverage this platform will produce massive quantities of full-genome
data.
Table 5.4 shows the compression time, decompression time and compression ratio for
each tool. Three key results can be read from this table:
Tool Version Reference Notes
based
pigz [60] 1.3.12 % Parallel implementation of gzip.
pbzip2 [61] 1.1.6 % Parallel implementation of pbzip2.
faTo2Bit [64] { % Encodes each DNA letter with 2 bits.
mfCompress [65] 1.01 % Uses probabilistic model for compression.
Table 5.3: Compression tools evaluated for the FASTA format.
122
1. The average compression ratio of fastaZip is 165 times larger than the best per-
forming tool MFCompress. This large increase in the compression ratio is due to
fastaZip being the only tool to use reference-based compression coupled with the
high degree of similarity between the FASTA les and reference genome. We nd
that the merge tuples optimisation improves the compression ratio by approxi-
mately 25 times. This indicates that most of the tuples refer to adjacent sections
of the reference genome and can therefore be merged into a single tuple.
2. The compression time for the hardware-accelerated version of fastaZip is only 1.7
times faster than the software version. This is due to the hardware initialisation
steps, such as conguring the FPGA and transferring the FM-index to DRAM,
amounting to a signicant proportion of the execution time, therefore the speed-
up is limited according to Amdahl's Law. If the time taken to read the FM-
index, congure the FPGA and transfer the FM-index to DRAM is omitted, then
hardware-accelerated version of fastaZip would be 3.9 times faster than the software
version.
3. The compression and decompression time for both the software and hardware-
accelerated versions of fastaZip are comparable to the best performing tools faTo2Bit
and pbzip2. Once again we nd that the decompression performance of these tools
hg19 (3.1 GB) hg18 (3.1 GB) hg17 (3.1 GB)
Tool c.ratio c.time d.time c.ratio c.time d.time c.ratio c.time d.time
pigz 3.6 61 15 3.6 63 14 3.6 62 14
pbzip2 3.9 38 16 3.9 34 16 3.9 34 16
faTo2Bit 4.1 29 14 4.1 29 14 4.1 28 14
mfCompress 5.2 144 121 5.1 150 119 5.1 147 124
fastaZip v1 1014 49 13 777 48 13 761 46 13
fastaZip v2 1014 28 13 777 28 13 761 27 13
Table 5.4: FASTA le compression. c.ratio denotes the compression ratio, c.time denotes
the compression time in seconds, and d.time denotes the decompression time in seconds.
fastaZip v1 runs entirely in software, whilst fastaZip v2 runs the sequence component
compression on an FPGA. The original le sizes are shown next to the FASTA le name.
123
is I/O bound for the system used as the decompression procedures are not very
computationally intensive.
5.4 Summary
In this chapter we present the rst acceleration of reference-based compression with
FPGAs. We integrate this design into fastqZip and fastaZip, two new tools that we have
developed for compressing the FASTQ and FASTA sequence formats respectively. In
both tools the sequence les are split into components which are then compressed using
a strategies specialised for the particular data type. We show that through hardware
acceleration our tools can achieve a much higher compression ratio than general purpose
tools whilst also having faster compression and decompression speeds. For example,
fastqZip can achieve a compression ratio up to 2 times higher than the general purpose
tools pigz and pbzip2 and has a compression speed up to 1.8 times faster. We envisage
these tools having a very important place in both research and clinical settings as a
method for addressing the data storage challenges in NGS.
124
Chapter 6
Conclusion
In this work we show how FPGAs can be used to alleviate the data processing bottle-
necks in DNA sequencing. This is achieved through accelerating the FM-index, a data
structure used to solve the computationally intensive string matching problems found in
DNA sequencing analysis. Below we summarise the main contributions of this work:
Reducing the memory bottleneck (C1)
We develop three novel methods for addressing challenge C1 (Chapter 3). First, the
FM-index structure is customised according to the amount of memory on the FPGA
platform and memory burst size. This allows the FM-index to t in memory without
splitting the text and better utilises the available memory bandwidth of the FPGA
platform. Second, the hardware architecture is optimised so that the available memory
bandwidth is saturated and the data read from memory can be reused. Finally, algo-
rithmic optimisations are developed which reduce the number of memory accesses in the
search algorithm. We nd that the most eective algorithmic optimisation for this work
is the k-step FM-index which can reduce the number of memory accesses by over 2 times.
We implement several exact string matching designs on the Maxeler Max4 platform and
show that the best performing design running on a single DFE is 8.8 times faster than
the CPU-based tool Soap2, 5.0 times faster than the GPU-based tool Soap3-dp, and at
least 1.7 times faster than the FPGA-based tool FHAST.
125
Approximate string matching (C2)
We address challenge C2 by presenting a new approach for accelerating approximate
string matching which can achieve both high sensitivity and high performance (Chap-
ter 4). In this approach specialised modules based on the backtracking FM-index are
used to match strings to a text with a specic edit type and distance. For example, one
mismatch, two mismatches, one deletion, etc. Each type of module uses an exhaustive
search to nd all the potential matching locations and includes a custom set of search
phases which prune the search space by constraining the edit positions. The modules are
mapped to the FPGA using a novel run-time recongurable architecture, which provides
better resource utilisation, performance, and design modularity compared to a static ar-
chitecture. We implement several approximate string matching designs on the Maxeler
Max4 Platform and show that the best performing design (with up to two mismatches
permitted) running on a single DFE is 4.8 times faster than Bowtie2, 4.0 times faster
than the GPU-based tool Soap3-dp, and at least 2.0 times faster than the FPGA-based
tool FHAST. In addition to this, it achieves an identical sensitivity to the state-of-the-art
software alignment tools evaluated.
Quality and scope (C3)
We address challenge C3 by adapting the FM-index search algorithm for reference-
based compression and accelerating it using FPGAs (Chapter 5). This accelerated design
is integrated into fastqZip and fastaZip, two new tools that we have developed for com-
pressing sequence data stored in the FASTQ and FASTA formats respectively. In both
tools the sequence data is split into its components which are then compressed using
strategies specialised for the particular type of data. To improve the achievable compres-
sion ratio we develop two format-specic optimisations for reference-based compression.
For FASTQ les we additionally map the reverse complement of the sequences to the
reference genome and for FASTA les we merge tuples which refer to adjacent sections of
the genome. We show that both tools can achieve a much higher compression ratio than
general purpose tools whilst also having faster compression and decompression speeds.
For example, fastqZip can achieve a compression ratio up to 2 times higher than the gen-
126
eral purpose tools pigz and pbzip2 and has a compression speed up to 1.8 times faster.
Furthermore, fastqZip is able to achieve a compression ratio comparable to the best
performing specialised tool, fastqz, whilst the average compression and decompression
speeds are 25 and 43 times faster respectively.
The work presented in this thesis has focused on accelerating two important appli-
cations in DNA sequencing analysis, namely short read alignment and sequence data
compression. We show that for both applications FPGAs can provide a large speed-up
relative to state-of-the-art software tools and at a lower energy budget. For users this
can greatly improve the eciency of DNA sequencing analysis, allowing for faster re-
search and clinical testing, and reduces the large costs associated with storing sequence
data. At the start of this research the main barriers limiting the adoption of FPGAs in
this domain were: a) the high costs associated with building a compute infrastructure
with FPGAs; and b) the diculty in developing for them. These barriers are rapidly
disappearing, making FPGAs an increasingly suitable accelerator candidate. With Ama-
zon providing access to FPGAs in their F1 instance, users can now avoid the cost of
assembling infrastructures and tools in-house. Furthermore, since the users only pay
for F1 compute capacity by the hour, leveraging FPGAs may be the most economical
option given the achieved speed-up relative to software tools. Although there have been
massive leaps in improving the productivity of FPGA development, such as OpenCL
compatibility and HLS tools, there is work still to be done in order to bridge the gap
to software. In this work we show that a run-time recongurable architecture is a good
approach for making user-friendly FPGA designs in this domain. For example, when
applied to short read alignment, it allows the user to control the alignment parameters
without having to modify either the software or hardware. With a comprehensive library
of FPGA congurations, a user can quite easily create a bespoke hardware-accelerated
alignment tool that meets their requirements. This can achieved using scripts which take
a high-level description of the alignment algorithm and generate the software which calls
the corresponding FPGA congurations in the correct order. Of course, if the required
127
FPGA conguration does not exist for some of the alignment steps, the user would need
to implement these themselves. To avoid the massive drop in productivity for this sce-
nario, a domain specic language can be created to allow users to create string matching
designs at a higher level of abstraction.
Future Work
We see ve potential avenues for future research on this topic:
1. So far we have applied our work to short read alignment [66], Bisulte sequence
alignment [6] and sequence data compression [67]. There are, however, a num-
ber of other analysis steps that would benet from acceleration such as RNA
sequence alignment [17], variant calling [68] and de-novo assembly [69]. One ex-
citing possibility is accelerating the state-of-the-art RNA sequence alignment tool
HISAT2 [17]. This tool enables rapid and accurate alignment by using both a
global FM-index of the reference genome and many smaller indices that collec-
tively cover the genome. The smaller indices represent a genomic region of only
56 Kbp in size, therefore there is potential for a large speed-up by storing them in
the faster memories available on the FPGA device.
2. To improve developer productivity, a domain specic language (DSL) can be im-
plemented so that sequence analysis applications can be accelerated at a higher
level of abstraction. For example, a DSL could enable the researchers to easily cre-
ate new string matching modules and pipelines which target the string matching
problem specic to their work. To achieve this goal we are currently writing scripts
which automatically generate optimal FM-index designs for the Maxeler Platform
based on a description of the text and the memory performance. By using knowl-
edge edge transfer techniques [70], it may be possible to automatically generate
and port these designs to dierent devices from insights derived from optimising
related designs.
128
3. We would like to explore how our accelerated tools can be integrated into existing
sequence analysis tool-kits, such as GATK [71], so users can leverage FPGAs with-
out having to signicantly modify their analysis workows. One exciting possibility
is directly integrating our compression tools into a le system, e.g. FUSE. This
would enable the sequence data to be automatically compressed and decompressed
without user interaction. There are, however, several issues that must be resolved
such as ensuring the le is decompressed with the correct reference genome and
cases where no reference genome exists for the species.
4. There are several opportunities to improve the functionality of our FPGA designs.
For instance, our approximate matching design can be extended to support larger
numbers of mismatches and gaps through an index-assisted dynamic programming
step similar to that in Soap3-dp. To achieve this, the seeds can be matched to the
reference genome using exact string matching and a Smith-Waterman module can
be used to evaluate the seed positions. Our compression tools can be extended by
supporting both lossy compression of the quality scores and selective decompres-
sion. We would also like to test dierent algorithms for compressing the sequence
identiers and quality scores and see how they aect the trade-o between the
compression ratio and the compression and decompression speeds. One option for
the quality scores is k-array grouping which can greatly reduce the entropy per
character.
5. It would be interesting to see how the performance of our FM-index designs change
when ported onto the latest FPGA platforms. These platforms contain the latest
memory technologies, such as DDR4, Hybrid Memory Cube [72] and hardened
memory controllers, therefore it would be useful to understand the challenges and
opportunities they present. We would also like to explore how our accelerated
designs can be used on cloud-based platforms such as the Amazon F1 instance.
This would greatly reduce the barrier to adoption for the researchers and clinicians
engaged in DNA sequencing. Designs which target the next-generation of Maxeler
129
devices can also run on the Amazon F1 instance, therefore evaluating this cloud-
based platform could be achievable in the near future.
Final thoughts
The advent of NGS has revolutionised how we conduct biomedical research and pro-
vide healthcare. Its many applications, such as personalised medicine and non-invasive
prenatal and cancer diagnosis, have the potential to greatly impact society, however in
order for this potential to come to fruition, signicant work must be done to bridge the
performance gap between the generation of the sequence data and its subsequent analy-
sis. We see FPGAs as an increasingly useful candidate for this role. They can provide a
large speed-up through the highly customisable hardware fabric and in addition have a
low power consumption. Furthermore, with FPGAs now being oered in cloud environ-
ments and the improvement in high-level development ows, their limitations are being
rapidly addressed. In the future we envision FPGAs being used as custom accelerators
for a wide variety of applications in the storage and analysis of sequence data. We hope
that the research presented in this thesis has provided a step towards this vision.
130
Bibliography
[1] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin,
K. Devon, K. Dewar, M. Doyle, W. FitzHugh et al., \Initial sequencing and analysis
of the human genome," Nature, vol. 409, no. 6822, pp. 860{921, 2001.
[2] \Illumina Hiseq X Ten preliminary system specication sheet," http://
www.illumina.com/documents/products/datasheets/datasheet-hiseq-x-ten.pdf, ac-
cessed: 2016-06-28.
[3] R. R. Gullapalli, K. V. Desai, L. Santana-Santos, J. A. Kant, M. J. Becich et al.,
\Next generation sequencing in clinical medicine: Challenges and lessons for pathol-
ogy and biomedical informatics," Journal of pathology informatics, vol. 3, no. 1,
p. 40, 2012.
[4] F. M. Lun, R. W. Chiu, K. Sun, T. Y. Leung, P. Jiang, K. A. Chan, H. Sun,
and Y. D. Lo, \Noninvasive prenatal methylomic analysis by genomewide bisulte
sequencing of maternal plasma DNA," Clinical chemistry, vol. 59, no. 11, pp. 1583{
1594, 2013.
[5] K. A. Chan, P. Jiang, C. W. Chan, K. Sun, J. Wong, E. P. Hui, S. L. Chan,
W. C. Chan, D. S. Hui, S. S. Ng et al., \Noninvasive detection of cancer-associated
genome-wide hypomethylation and copy number aberrations by plasma DNA bisul-
te sequencing," Proceedings of the National Academy of Sciences, vol. 110, no. 47,
pp. 18 761{18 768, 2013.
131
[6] J. Arram, W. Luk, and P. Jiang, \Ramethy: recongurable acceleration of bisul-
te sequence alignment," in Proceedings of the 2015 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 250{259.
[7] S. D. Kahn, \On the future of genomic data," science, vol. 331, no. 6018, pp.
728{729, 2011.
[8] G. Cochrane, B. Alako, C. Amid, L. Bower, A. Cerde~no-Tarraga, I. Cleland, R. Gib-
son, N. Goodgame, M. Jang, S. Kay et al., \Facing growth in the European nu-
cleotide archive," Nucleic acids research, p. gks1175, 2012.
[9] S. Roy, W. A. LaFramboise, Y. E. Nikiforov, M. N. Nikiforova, M. J. Routbort,
J. Pfeifer, R. Nagarajan, A. B. Carter, and L. Pantanowitz, \Next-generation se-
quencing informatics: challenges and strategies for implementation in a clinical
environment," Archives of pathology & laboratory medicine, vol. 140, no. 9, pp.
958{975, 2016.
[10] A. Putnam, A. M. Cauleld, E. S. Chung, D. Chiou, K. Constantinides, J. Demme,
H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., \A recongurable fabric for
accelerating large-scale datacenter services," in 2014 ACM/IEEE 41st International
Symposium on Computer Architecture (ISCA). IEEE, 2014, pp. 13{24.
[11] \Amazon f1 instance," http://aws.amazon.com/ec2/instance-types/f1/, accessed:
2016-08-02.
[12] J. E. Stone, D. Gohara, and G. Shi, \Opencl: A parallel programming standard for
heterogeneous computing systems," Computing in science & engineering, vol. 12,
no. 3, pp. 66{73, 2010.
[13] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. Anderson, S. Brown,
and T. Czajkowski, \Legup: high-level synthesis for fpga-based processor/accelera-
tor systems," in Proceedings of the 19th ACM/SIGDA international symposium on
Field programmable gate arrays. ACM, 2011, pp. 33{36.
132
[14] P. Ferragina and G. Manzini, \Opportunistic data structures with applications," in
Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on.
IEEE, 2000, pp. 390{398.
[15] R. Luo, T. Wong, J. Zhu, C.-M. Liu, X. Zhu, E. Wu, L.-K. Lee, H. Lin, W. Zhu,
D. W. Cheung et al., \SOAP3-dp: fast, accurate and sensitive GPU-based short
read aligner," PloS one, vol. 8, no. 5, p. e65632, 2013.
[16] B. Langmead and S. L. Salzberg, \Fast gapped-read alignment with Bowtie 2,"
Nature methods, vol. 9, no. 4, pp. 357{359, 2012.
[17] D. Kim, B. Langmead, and S. L. Salzberg, \HISAT: a fast spliced aligner with low
memory requirements," Nature methods, vol. 12, no. 4, pp. 357{360, 2015.
[18] J. T. Simpson and R. Durbin, \Ecient de novo assembly of large genomes using
compressed data structures," Genome research, vol. 22, no. 3, pp. 549{556, 2012.
[19] E. B. Fernandez, W. A. Najjar, S. Lonardi, and J. Villarreal, \Multithreaded FPGA
acceleration of DNA sequence mapping," in High Performance Extreme Computing
(HPEC), 2012 IEEE Conference on. IEEE, 2012, pp. 1{6.
[20] E. B. Fernandez, J. Villarreal, S. Lonardi, and W. A. Najjar, \FHAST: FPGA-based
acceleration of Bowtie in hardware," IEEE/ACM Transactions on Computational
Biology and Bioinformatics (TCBB), vol. 12, no. 5, pp. 973{981, 2015.
[21] A. Chacon, J. C. Moure, A. Espinosa, and P. Hernandez, \n-step fm-index for faster
pattern matching," Procedia Computer Science, vol. 18, pp. 70{79, 2013.
[22] J. Zhang, H. Lin, P. Balaji, and W.-c. Feng, \Optimizing Burrows-Wheeler
transform-based sequence alignment on multicore architectures," in Cluster, Cloud
and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium
on. IEEE, 2013, pp. 377{384.
133
[23] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and J. Wang, \SOAP2:
an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15,
pp. 1966{1967, 2009.
[24] E. Fernandez, W. Najjar, and S. Lonardi, \String matching in hardware using the
FM-index," in Field-Programmable Custom Computing Machines (FCCM), 2011
IEEE 19th Annual International Symposium on. IEEE, 2011, pp. 218{225.
[25] P. Draghicescu, G. Edvenson, and C. Olson, \Inexact search acceleration on fpgas
using the burrows-wheeler transform," Pico Computing, Inc., Tech. Rep, 2012.
[26] Y. Xie, G. Wu, J. Tang, R. Luo, J. Patterson, S. Liu, W. Huang, G. He, S. Gu,
S. Li et al., \Soapdenovo-trans: de novo transcriptome assembly with short rna-seq
reads," Bioinformatics, vol. 30, no. 12, pp. 1660{1666, 2014.
[27] E. Fredkin, \Trie memory," Communications of the ACM, vol. 3, no. 9, pp. 490{499,
1960.
[28] S. Kurtz, \Reducing the space requirement of sux trees," Software-Practice and
Experience, vol. 29, no. 13, pp. 1149{71, 1999.
[29] U. Manber and G. Myers, \Sux arrays: a new method for on-line string searches,"
siam Journal on Computing, vol. 22, no. 5, pp. 935{948, 1993.
[30] D. K. Kim, J. S. Sim, H. Park, and K. Park, \Constructing sux arrays in linear
time," Journal of Discrete Algorithms, vol. 3, no. 2, pp. 126{142, 2005.
[31] M. Burrows and D. J. Wheeler, \A block-sorting lossless data compression algo-
rithm," 1994.
[32] T. Gagie, \Sequential-access fm-indexes," arXiv preprint arXiv:1205.1195, 2012.
[33] T. W. Lam, R. Li, A. Tam, S. Wong, E. Wu, and S.-M. Yiu, \High throughput
short read alignment via bi-directional BWT," in Bioinformatics and Biomedicine,
2009. BIBM'09. IEEE International Conference on. IEEE, 2009, pp. 31{36.
134
[34] \NVIDIA Titan X specications," http://www.geforce.co.uk/hardware/10series/
titan-xp/#specs, accessed: 2016-08-02.
[35] \Nallatech 510T accelerator card data sheet," http://www.nallatech.com/
wp-content/uploads/Data-Sheet-510T-V1.5.pdf, accessed: 2016-08-02.
[36] G. Zhang, P. H. W. Leong, C. H. Ho, K. H. Tsoi, C. C. Cheung, D.-U. Lee, R. C.
Cheung, and W. Luk, \Recongurable acceleration for Monte Carlo based nan-
cial simulation," in Proceedings. 2005 IEEE International Conference on Field-
Programmable Technology, 2005. IEEE, 2005, pp. 215{222.
[37] F. P. Russell, P. D. Duben, X. Niu, W. Luk, and T. N. Palmer, \Architectures
and precision analysis for modelling atmospheric variables with chaotic behaviour,"
in Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd
Annual International Symposium on. IEEE, 2015, pp. 171{178.
[38] S. Deorowicz and S. Grabowski, \Data compression for sequencing data," Algo-
rithms for Molecular Biology, vol. 8, no. 1, p. 25, 2013.
[39] \Alpha Data ADM-PCIE-9V3 data sheet," http://www.alpha-data.com/pdfs/
adm-pcie-7v3.pdf, accessed: 2016-08-02.
[40] \Maxeler MPC-X2000 dataow computing platform," https://www.maxeler.com/
products/mpc-xseries/, accessed: 2016-06-28.
[41] \Maxeler app gallery," http://appgallery.maxeler.com, accessed: 2016-08-02.
[42] H. Li and R. Durbin, \Fast and accurate short read alignment with Burrows{
Wheeler transform," Bioinformatics, vol. 25, no. 14, pp. 1754{1760, 2009.
[43] C.-M. Liu, T. Wong, E. Wu, R. Luo, S.-M. Yiu, Y. Li, B. Wang, C. Yu, X. Chu,
K. Zhao et al., \Soap3: ultra-fast gpu-based parallel alignment tool for short reads,"
Bioinformatics, vol. 28, no. 6, pp. 878{879, 2012.
135
[44] T. F. Smith and M. S. Waterman, \Identication of common molecular subse-
quences," Journal of molecular biology, vol. 147, no. 1, pp. 195{197, 1981.
[45] M. Gok and C. Yilmaz, \Ecient cell designs for systolic Smith-Waterman imple-
mentations," in 2006 International Conference on Field Programmable Logic and
Applications. IEEE, 2006, pp. 1{4.
[46] A. Sirasao, E. Delaye, R. Sunkavalli, and S. Neuendorer, \FPGA based OpenCL
acceleration of genome sequencing software," System, vol. 128, no. 8.7, p. 11, 2015.
[47] C. B. Olson, M. Kim, C. Clauson, B. Kogon, C. Ebeling, S. Hauck, and W. L. Ruzzo,
\Hardware acceleration of short read mapping," in Field-Programmable Custom
Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium
on. IEEE, 2012, pp. 161{168.
[48] Y. Sogabe and T. Maruyama, \A variable length hash method for faster short read
mapping on FPGA," in 2015 25th International Conference on Field Programmable
Logic and Applications (FPL). IEEE, 2015, pp. 1{6.
[49] C. Nelson, K. Townsend, B. S. Rao, P. Jones, and J. Zambreno, \Shepard: A
fast exact match short read aligner," in Formal Methods and Models for Codesign
(MEMOCODE), 2012 10th IEEE/ACM International Conference on. IEEE, 2012,
pp. 91{94.
[50] E. Fernandez, W. Najjar, E. Harris, and S. Lonardi, \Exploration of short reads
genome mapping in hardware," in 2010 International Conference on Field Pro-
grammable Logic and Applications. IEEE, 2010, pp. 360{363.
[51] N. Homer, B. Merriman, and S. F. Nelson, \Bfast: an alignment tool for large scale
genome resequencing," PloS one, vol. 4, no. 11, p. e7767, 2009.
[52] \Human genome (assembly GRCh38)," http://hgdownload.cse.ucsc.edu/
goldenPath/hg38, accessed: 2016-08-02.
136
[53] \PRJEB3177 study," http://www.ebi.ac.uk/ena/data/view/PRJEB3177, accessed:
2016-08-15.
[54] Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron, R. Iyer,
M. C. Schatz, S. Sinha, and G. E. Robinson, \Big data: astronomical or genomical?"
PLoS biology, vol. 13, no. 7, p. e1002195, 2015.
[55] J. K. Boneld and M. V. Mahoney, \Compression of FASTQ and SAM format
sequencing data," PloS one, vol. 8, no. 3, p. e59190, 2013.
[56] S. Deorowicz, A. Danek, and M. Niemiec, \GDC 2: Compression of large collections
of genomes," Scientic reports, vol. 5, 2015.
[57] S. Wandelt and U. Leser, \FRESCO: Referential compression of highly similar se-
quences," IEEE/ACM Transactions on Computational Biology and Bioinformatics,
vol. 10, no. 5, pp. 1275{1288, 2013.
[58] Y. Zhang, L. Li, J. Xiao, Y. Yang, and Z. Zhu, \FQZip: Lossless reference-based
compression of next generation sequencing data in FASTQ format," in Proceedings
of the 18th Asia Pacic Symposium on Intelligent and Evolutionary Systems-Volume
2. Springer, 2015, pp. 127{135.
[59] \Illumina quality score binning," https://www.illumina.com/Documents/products/
whitepapers/whitepaper datacompression.pdf, accessed: 2016-08-02.
[60] M. Adler, \pigz: A parallel implementation of gzip for modern multi-processor,
multi-core machines," Jet Propulsion Laboratory, 2015.
[61] J. Gilchrist, \Parallel data compression with bzip2," in Proceedings of the 16th
IASTED international conference on parallel and distributed computing and sys-
tems, vol. 16, 2004, pp. 559{564.
[62] D. C. Jones, W. L. Ruzzo, X. Peng, and M. G. Katze, \Compression of next-
generation sequencing reads aided by highly ecient de novo assembly," Nucleic
acids research, vol. 40, no. 22, pp. e171{e171, 2012.
137
[63]  L. Roguski and S. Deorowicz, \DSRC 2|Industry-oriented compression of FASTQ
les," Bioinformatics, vol. 30, no. 15, pp. 2213{2215, 2014.
[64] \Blat suite program specications and user guide," https://genome.ucsc.edu/
goldenpath/help/blatSpec.html, accessed: 2016-08-02.
[65] A. J. Pinho and D. Pratas, \MFCompress: a compression tool for FASTA and
multi-FASTA data," Bioinformatics, vol. 30, no. 1, pp. 117{118, 2014.
[66] J. Arram, T. Kaplan, W. Luk, and P. Jiang, \Leveraging FPGAs for accelerating
short read alignment," 2016.
[67] J. Arram, M. Panzer, T. Kaplan, and W. Luk, \FPGA acceleration of reference-
based compression for genomic data," in Field Programmable Technology (FPT),
2015 International Conference on. IEEE, 2015, pp. 9{16.
[68] V. Moncunill, S. Gonzalez, S. Bea, L. O. Andrieux, I. Salaverria, C. Royo, L. Mar-
tinez, M. Puiggros, M. Segura-Wang, A. M. Stutz et al., \Comprehensive charac-
terization of complex structural variations in cancer by directly comparing genome
sequence reads," Nature biotechnology, vol. 32, no. 11, pp. 1106{1112, 2014.
[69] R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kris-
tiansen et al., \De novo assembly of human genomes with massively parallel short
read sequencing," Genome research, vol. 20, no. 2, pp. 265{272, 2010.
[70] M. Kurek, M. P. Deisenroth, W. Luk, and T. Todman, \Knowledge transfer in
automatic optimisation of recongurable designs," in Field-Programmable Custom
Computing Machines (FCCM), 2016 IEEE 24th Annual International Symposium
on. IEEE, 2016, pp. 84{87.
[71] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky,
K. Garimella, D. Altshuler, S. Gabriel, M. Daly et al., \The Genome Analysis
Toolkit: a Mapreduce framework for analyzing next-generation DNA sequencing
data," Genome research, vol. 20, no. 9, pp. 1297{1303, 2010.
138
[72] J. T. Pawlowski, \Hybrid memory cube (HMC)," in Hot Chips 23 Symposium
(HCS), 2011 IEEE. IEEE, 2011, pp. 1{24.
139
