Hardware implementation of efficient path reconstruction for the Smith-Waterman algorithm by Buhagiar, Karl et al.
Hardware Implementation of Efficient Path
Reconstruction for the Smith-Waterman Algorithm
Karl Buhagiar, Owen Casha∗, Ivan Grech, Edward Gatt and Joseph Micallef
Department of Microelectronics and Nanoelectronics
Faculty of Information and Communication Technology
University of Malta
Msida, Malta
∗Email: owen.casha@um.edu.mt
Abstract—This paper presents the hardware implementation
of a differential coded Smith-Waterman algorithm on a
field programmable gate array (FPGA). A novel path
reconstruction algorithm is proposed to overcome the need
of a dedicated memory for the storage of the similarity
matrix, thus limiting the on-chip hardware utilization. In
addition, this algorithm is also efficient in terms of
computational speed since the path reconstruction is carried
out during the computation of the similarity score, rather
then afterwards, as in the case of the original algorithm.
The implementation was done on the Xilinx Spartan-6
XC6LX16-CS324 FPGA using VHDL, consisting of 1,024
processing elements and exhibits a throughput of 204.8 BCUPS
at a data rate of 600 Mbps.
I. INTRODUCTION
Bioinformatics is a continuously growing research area
which merges biology, computational science and information
technology to uncover the genetic information hidden in
every living species. The huge advances in the fields of
molecular biology and genetics have induced an increase in
large DNA sequencing projects and a correspondingly massive
increase in the available biological data. Sequence alignment
algorithms are tools used to compare two or more sequences
so as to identify highly similar segments and the amount of
mutations present in these sequences. This is indispensable
in basic research as well as in practical applications such
as criminal forensics, disease prevention, drug discovery and
pharmaceutical development [1].
There are three types of mutations: insertions, deletions
and substitutions. The amount of mutations determines
the similarity and the possible relationship between a
pair of biological sequences. In fact, sequence alignment
algorithms determine functional, structural or evolutionary
relationships between the molecular sequences through a
process known as trace-back or path reconstruction [2]. Several
sequence alignment algorithms, such as BLAST, FASTA,
Needleman-Wunsch Algorithm, Smith-Waterman Algorithm,
PatternHunter and MUMmer [1], were developed to solve
the sequence alignment problem. The Smith-Waterman
and Needleman-Wunsch algorithms are known as dynamic
programming algorithms, where a complex problem is solved
by breaking it down into simpler sub-problems, while the rest
are all heuristic algorithms. The Smith-Waterman algorithm
(SWA) is a local sequence alignment algorithm while the
Needleman-Wunsch is a global alignment algorithm. The SWA
outperforms heuristic algorithms on scalability and accuracy
but lacks in terms of computational speed and memory
requirements. Nonetheless, the steps involved in the algorithm
are simple, therefore a high degree of parallelization is
possible, by employing state-of-the-art platforms such as Field
Programmable Gate Arrays (FPGA), Graphics Processing
Units (GPU) and multiple core CPU [1].
Generally, the path reconstruction process is not
implemented on an FPGA platform due to its iterative
nature which translates into a large amount of hardware
utilization that increases exponentially with the length of
the source and target sequences. This paper presents the
FPGA implementation of a differential coded SWA [3], in
which a novel path reconstruction algorithm is proposed to
overcome the need of a dedicated memory for the storage
of the similarity matrix, thus limiting the on-chip hardware
utilization. This allows more processing elements to be placed
on the FPGA. The proposed implementation is also efficient
in terms of computational speed since the path reconstruction
is carried out concurrently with the computation of the
similarity score.
II. SMITH-WATERMAN ALGORITHM
The SWA is a local sequence alignment algorithm while the
Needleman-Wunsch is a global alignment algorithm. Instead
of looking at the total sequence, it performs local sequence
alignment by comparing segments of all possible lengths and
optimizes the similarity measure, thus determining similar
regions between two molecular sequences [3]. The similarity
between two sequences is determined using a similarity score
table based on statistical values or else a similarity score of
a match and a mismatch penalty cost are defined a priori. In
the case of DNA sequences the similarity score is given for
the four elements A, G, C and T which represent the four
nucleobases Adenine, Guanine, Cytosine and Thymine.
A similarity matrix, also known as the edit-distance matrix
Hi,j , is built to find pairs of segments with high degree
of similarity. The similarity matrix is a two dimensional
matrix of size i by j, where i and j are the lengths of the
source and target sequences respectively. Figure 1 illustrates
the similarity matrix constructed for the following source (S)
and target (T) sequences: S: {C,G,C,C,T,G,A,G,T,G,A,T} and
T: {C,G,A,T,C,C,A,G,A,G,A,T}.
Fig. 1. The similarity matrix Hi,j including the edit-distances (dashed lines)
and the alignment path (shaded).
The SWA was introduced back in 1981 [4] and
its importance motivated many researchers to develop
simplifications in order to improve its performance and permit
easier implementation on a hardware platform [3][5][6]. As
shown in Figure 1, the edit-distances marked by the dashed
anti-diagonal line can be computed concurrently and so this
permits a systolic array implementation [3][6]. Considering a
penalty cost of 1 for both insertions and deletions and a penalty
cost of 2 for substitutions, the work in [6] has shown that the
edit-distance Hi,j can be computed using (1).
d =
{
a (Si = Tj) + (b = a− 1) + (c = a− 1)
a+ 2 (Si 6= Tj) · (b = a+ 1) · (c = a+ 1) (1)
where a = Hi−1,j−1, b = Hi,j−1, c = Hi−1,j and
d = Hi,j . As indicated by the shaded path in Figure 1, the
optimum alignment for these two sequences is given by:
S: {C G – – C C T G A G T – G A T}
T: {C G A T C C – – A G – A G A T}
where ‘–’ indicates a gap. The type of jump in the path
represents an operation: a diagonal jump is an alignment,
a left-right jump is a deletion and a top-down jump is an
insertion. A substitution is a succession of a deletion and an
insertion. In the given example, there are two deletions, two
insertions and one substitution, thus 6 penalty costs are needed
to optimally align the two sequences. The simplification
proposed in [3], known as differential coding, exploits the fact
that each cell in Hi,j can be either the value of the cell above
it add one, or the value of the cell above it minus one. Thus
a cell can be in two states which can represented by one bit
thus limiting the hardware utilization. This is summarized in
(2) and can be used to obtain the simplified similarity matrix
Ii,j (refer to Figure 2).
Ii,j =
{
1 Hi,j = Hi,j−1 + 1
0 Hi,j = Hi,j−1 − 1 (2)
Fig. 2. The simplified similarity matrix Ii,j using differential coding.
III. PATH RECONSTRUCTION ALGORITHM
The algorithm presented here was developed specifically
for the systolic array implementation of the differential coded
SWA [3]. It utilizes the edit-distances Ii,j computed by the
various processing element in the systolic array, in order to
reconstruct the path without the need of storing the similarity
matrix. This algorithm saves the memory required to store the
similarity matrix and the computational time involved, since
the path reconstruction is carried out concurrently with the
similarity score computation via dedicated hardware. On every
clock cycle, the output of the two processing elements is stored
into a 4-bit register (used to emulate a 2x2 matrix) and based
on its value, an alignment or mutation is detected accordingly.
The algorithm is explained using the example provided in
Section II. Figure 3 shows the hardware implementation of the
path reconstruction algorithm and Figure 4 shows the process
element data flow taking place in the systolic array.
Fig. 3. Hardware implementation of the path reconstruction algorithm.
The algorithm starts at position 0 and loads the R3 and
R2 with the outputs of PE1 and PE2 on clock cycle 1.
On clock cycle 2, R1 and R0 are then loaded with the
outputs of PE1 and PE2. An alignment is detected whenever
the register contains the value “0X1X”, where X indicates
a don’t care. A position pointer is used to determine which
process element outputs are to be loaded into the register. An
alignment is detected on clock cycle 2 and the position pointer
is incremented such that the R3 and R2 are now fed with
the outputs of PE2 and PE3. The process is repeated until
clock cycle 5, where a “11” pattern is detected indicating the
occurrence of an insertion. This happens also on clock cycle
6. Note that during these two clock cycles, the position pointer
is not increased. This process can be associated with that of
the path reconstruction in the original algorithm, where an
Fig. 4. Data flow within the process element (PE) for a particular source-target sequence pair.
insertion is represented by a top-down jump (refer to Figure 2).
On clock cycle 7, a “10” is detected in R3R2 indicating that a
possible alignment may result in the coming clock cycles and
so the pointer position is incremented.
Deletions are identified from the other mutations because
the 2x2 matrix corresponding to the next operation overlaps the
bottom right cell of the 2x2 matrix of the current operation
(R3 = R0). If “1000” is detected, a single deletion would
have occurred whereas if a “1011” pattern is noted, this
indicates that multiple deletions are occurring (refer to clock
cycle 12). Multiple deletions are terminated when the register
holds a value of “1000”. Once again, this can be associated
with that of the path reconstruction in the original algorithm,
where a deletion is represented by a left-right jump (refer to
Figure 2). When the value of the register is “X0X1”, this
indicates a substitution and it is distinguished from the multiple
deletions case, because there is no overlapping of the bottom
right cell of the next matrix. Note that during the last clock
cycle, the register cannot be fully loaded since the computation
would be terminated. In this case, the cell is checked and if it
is a ‘0’, then it indicates an alignment.
IV. SYSTEM IMPLEMENTATION
Figure 5 shows a simplified block diagram of the systolic
array SWA with path reconstruction implemented on an FPGA
which is clocked at 200 MHz. The FPGA is connected to a
personal computer (PC) via an RS-232 protocol implemented
by means of a Universal Asynchronous Receiver Transmitter
(UART) module which permits two-way communication. The
PC has internet access such that the nucleotide sequences can
be accessed and downloaded from open source web based
databases such as NCBI or EMBL, using a sequence accession
Fig. 5. A block diagram of the implemented system.
number, which is a unique identification assigned to each
sequence. A MATLAB program with a graphical user interface
(GUI) was implemented on the PC so that the nucleotide
sequences can be easily retrieved and transmitted to the FPGA.
This program is also used to display the results generated by
the FPGA. Once the sequences are retrieved from the online
database, the source sequence is truncated to the length of
the systolic array (1,024 processing elements), converted to a
binary stream and then divided into bytes, where each byte is
made up of four two-bit nucleotides.
Truncation is not necessary in the case of the target
sequence since its length is independent of the size of systolic
array. Each byte is transmitted to the FPGA via the UART
module and then the 8-to-2 shifter converts the data into a 2-bit
serial stream to be processed in the systolic array. Sequence
alignment takes place by first loading the source sequence in
the systolic array, in a last in first out manner, and then shifting
the target sequence through the 1,024 processing elements. The
similarity score counter and the path reconstruction blocks are
interfaced with the array in order to generate the similarity
score and the optimal alignment. This information is then
sent back to the UART via the transmission controller to
be displayed on the GUI. The following subsections briefly
describe the implementation of the processing element and the
similarity score counter.
A. Processing Element
The processing element employed in this implementation is
based on that proposed in [5] with some minor modifications
(refer to Figure 6). It is the hardware synthesis of (1) after
differential coding is applied. Signal SourceLoad is connected
to the chip enable (CE) of the flip-flops holding the 2-bit source
nucleotide such that their value can be shifted to the next
processing element on every rising edge. Similarly, an enable
signal TargetLoad is connected to the CE of the other flip-flops
to synchronize the combinational logic operations taking place
in the processing element.
Fig. 6. Hardware implementation of the processing element [5].
B. Similarity Score Counter
At the start of the alignment process, the similarity
score counter is initialized with a score equal to the
number of processing elements in the systolic array, and on
every clock edge, provided the TargetLoad enable signal is
high, the score is either increased or decreased depending
on the value of the dout of the last processing element
in the systolic array: if dout is high then the score is
incremented by one, whereas when dout is low the score is
decremented by one. Considering the 12 processing element
systolic array example given in Section II, one can note that
the sequence coming out from dout of the twelfth processing
element PE12 contains 9 zeros and 3 ones (refer to Figure 4).
So a similarity score of 12+9∗(−1)+3∗(1) = 6 is obtained,
indicating that 6 penalty costs are needed to optimally align
the two sequences.
V. EVALUATION
The proposed system was implemented on the Xilinx
Spartan-6 XC6LX16-CS324 FPGA. This FPGA contains
2,278 slices which limited the implementation to 1,024
processing elements together with the required interface
and control logic. The systolic array operates at a
clock frequency of 200 MHz and achieves a throughput
of 204.8 BCUPS (billion cell updates per second) at
a data rate of 600 Mbps. The performance of the
designed system is comparable to that reported by similar
implementations. The implementation in [3] utilised an
FPGA containing 12,288 slices to place 4,032 processing
elements operating at a clock frequency of 133 MHz,
and resulting in a performance of 136 BCUPS. A similar
work to that presented in [3], operates at a maximum
clock frequency of 202 MHz and features a performance of
814 BCUPS [5].
The results obtained in this work, show that a comparable
operating clock frequency and an improvement in FPGA slice
utilization was achieved given that the system includes also a
path reconstruction process. The performance in BCUPS was
limited by the number of available slices since theoretically,
it increases linearly with the amount of processing elements
available in the systolic array. Most often the target sequences
will be of a length comparable to the source sequence, but
this implementation allows for the target sequence to be up to
2,044 base pairs long to allow for any worst case conditions.
VI. CONCLUSION
The original path reconstruction algorithm cannot be
efficiently implemented on an FPGA platform due to its
iterative nature which translates into a large amount of
hardware utilization which increases exponentially with the
size of the source and target sequences. In addition, it requires
a dedicated memory for the storage of the similarity matrix,
whose size also scales up with the length of the source
and target sequences. Thus, this paper presented the FPGA
implementation of a differential coded SWA, in which a
hardware efficient path reconstruction algorithm is proposed
to overcome these issues. Even though the FPGA employed
in this work had a limited amount of slices when compared
to other platforms, a processing element to FPGA slice ratio
of 0.45 was achieved when compared to 0.33 in [3][5], while
also including a path reconstruction algorithm on chip. The
proposed algorithm is also efficient in terms of computational
speed since the path reconstruction is carried out during the
computation of the similarity score, rather then afterwards,
as in the case of the original algorithm. The throughput
performance could be improved by using an FPGA which has
more slices, since theoretically, throughput increases linearly
with the amount of processing elements in the systolic array.
REFERENCES
[1] L. Hasan and Z. Al-Ars, Computational Biology and Applied
Bioinformatics: An Overview of Hardware-Based Acceleration of
Biological Sequence. The Netherlands: InTech, 2011.
[2] Z. Nawaz, M. Nadeem, J. van Someren, and K. Bertels, “A parallel fpga
design of the smith-waterman traceback,” International Conference on
Field-Programmable Technology (FPT 2010), pp. 454–459, 2010.
[3] C. Yu, K. Kwong, K. Lee, and P. Leong, “A smith-waterman systolic
cell,” Field-Programmable Logic and Applications (FPL 2013), pp.
375–384, 2003.
[4] T. Smith and M. Waterman, “Identification of common molecular
subsequences,” Journal of Molecular Biology, pp. 195–197, 1981.
[5] G. Caffarena, C. Pedreira, C. Carreras, S. Bojanic, and O. Nieto-Taladriz,
“Fpga acceleration for dna sequence alignment,” Journal of Circuits,
Systems, and Computers, pp. 245–266, 2007.
[6] R. Lipton and D. Lopresti, “A systolic array for rapid string comparison,”
Chapel Hill Conference on VLSI, pp. 363–376, 1985.
