First Experiences Optimizing Smith-Waterman on Intel's Knights Landing
  Processor by Rucci, Enzo et al.
arXiv:1702.07195v1 [cs.DC] 23 Feb 2017
First Experiences Optimizing Smith-Waterman on
Intel’s Knights Landing Processor
Enzo Rucci∗1, Carlos Garcia†1, Guillermo Botella‡1, Armando De
Giusti2, Marcelo Naiouf3, and Manuel Prieto-Matias1
1III-LIDI, CONICET, Facultad de Informática, Universidad Nacional
de La Plata
2Depto. Arquitectura de Computadores y Automática, Universidad
Complutense de Madrid
3III-LIDI, Facultad de Informática, Universidad Nacional de La Plata
February 24, 2017
Abstract
The well-known Smith-Waterman (SW) algorithm is the most commonly
used method for local sequence alignment. However, SW is very
computationally demanding for large protein databases. Several
implementations take advantage of parallel computing on many-core
processors, FPGAs or GPUs in order to increase alignment throughput.
In this paper, we explore SW acceleration on the Intel KNL processor.
The novelty of this architecture requires revisiting previous
programming and optimization techniques for many-core architectures.
To the best of the authors’ knowledge, this is the first assessment of
the KNL architecture for the SW algorithm. Our evaluation, using the
well-known Environmental NR database as a benchmark, shows that
combining multi-threading and SIMD exploitation yields competitive
performance (351 GCUPS) in comparison with other implementations.
1 Introduction
Nowadays the greatest challenge of Bioinformatics is no longer data
generation but efficient information analysis and interpretation. In
fact, sequencing technologies [11] are currently considered one of the
most successful instruments in Bioinformatics, basically solved by
heuristic methods.
∗erucci@lidi.info.unlp.edu.ar
†garsanca@ucm.es
‡gbotella@ucm.es
The key aspect of the Smith-Waterman (SW) algorithm [20] is that it
always finds the optimal local alignment between two sequences. This
characteristic makes the method the basis of more sophisticated
alignment technologies, so its study and acceleration on different
platforms has attracted great interest from the scientific community.
Although many approaches, such as BLAST [1] and FASTA [13], are more
efficient in terms of execution time, they do not guarantee the
optimal alignment.
SW identifies similar regions between two DNA or protein sequences.
A score matrix must be built in order to determine the best alignment.
The matrix size depends on the sequence lengths, which in turn
determines parallel scalability. From a parallel processing
perspective, DNA alignment involves sequences of up to hundreds of
millions of nucleotides, so the huge matrix only allows a single
sequence pair to be processed at a time; the low-level parallelism
available within the alignment can then be exploited by means of the
intra-task scheme. Protein sequences, in contrast, are shorter and
require small matrices. This makes it possible to exploit
coarse-grained parallelism by computing multiple independent
alignments simultaneously, following the inter-task approach.
The computational complexity of the SW algorithm has motivated a
large amount of research aimed at reducing execution time by means of
acceleration on a great variety of architectures. In recent years, in
the context of SW protein alignment, we have witnessed the
exploitation of the SIMD vector units [4, 16, 15] available on modern
CPUs, most notably in the recently released Parasail library [3]. In
the field of heterogeneous computing, the most successful solution is
the CUDASW++ software [9] for multiple CUDA-enabled Graphics
Processing Units (GPUs) with concurrent CPU computing. For Intel’s
Xeon Phi co-processors, we highlight two hand-tuned SW
implementations, SWAPHI [10] and LSBDS [7]. Also focusing on the Intel
Xeon Phi, Rucci et al. [17] have recently studied the energy
efficiency of a hybrid implementation that exploits both CPU and
co-processors simultaneously. Using FPGAs as accelerators, we can find
linear systolic array implementations for Xilinx Virtex FPGAs [6, 12],
custom instructions [8], and the proposal of Rucci et al. [18], which
studies the behavior of the novel OpenCL paradigm on Altera’s FPGAs
and whose most relevant results show that these devices are the most
efficient from an energy footprint perspective.
Our paper proposes and evaluates a SW implementation for the latest
generation of Intel’s Xeon Phi, based on the Knights Landing (KNL)
architecture. Although there are SW studies on the older Xeon Phi with
the Knights Corner (KNC) architecture [10, 7, 17], to the best of the
authors’ knowledge there is no related work in the Bioinformatics
field on the KNL architecture, due to its recent commercialization.
The main differences between KNL and its predecessor are the
incorporation of AVX-512 extensions, a remarkable increase in the
number of vector units and new on-package high-bandwidth memory.
These aspects make it necessary to revise previous optimization
proposals for the SW algorithm.
Section 2 introduces the basic concepts of the Smith-Waterman
algorithm. Section 3 briefly introduces Intel’s Xeon Phi architecture,
and Section 4 describes our implementation of the SW algorithm.
Section 5 discusses performance results and, finally, Section 6
concludes with some ideas for future research.
2 Smith-Waterman Algorithm
Given two sequences $S_1$ and $S_2$, with sizes $|S_1| = m$ and
$|S_2| = n$, the recurrence relations for the SW algorithm with affine
gap penalties [5] are defined below.
$$H_{i,j} = \max\{0,\; H_{i-1,j-1} + SM(S_1[i], S_2[j]),\; E_{i,j},\; F_{i,j}\} \tag{1}$$
$$E_{i,j} = \max\{H_{i,j-1} - G_{oe},\; E_{i,j-1} - G_e\} \tag{2}$$
$$F_{i,j} = \max\{H_{i-1,j} - G_{oe},\; F_{i-1,j} - G_e\} \tag{3}$$
$H_{i,j}$ contains the score for aligning the prefixes $S_1[1..i]$ and
$S_2[1..j]$. $E_{i,j}$ and $F_{i,j}$ are the scores of prefix
$S_1[1..i]$ aligned to a gap and prefix $S_2[1..j]$ aligned to a gap,
respectively. $SM$ is the scoring matrix, which defines the
substitution scores for all residue pairs. Generally, $SM$ rewards
with a positive value when $S_1[i]$ and $S_2[j]$ are identical or
related, and punishes with a negative value otherwise. $G_{oe}$ is the
sum of the gap open and gap extension penalties, while $G_e$ is the
gap extension penalty. The recurrences are calculated for
$1 \le i \le m$ and $1 \le j \le n$, after initializing $H$, $E$ and
$F$ with 0 when $i = 0$ or $j = 0$. The maximum value in the alignment
matrix $H$ is the optimal local alignment score.
It is important to note that the H values cannot be computed in an
arbitrary order, due to the data dependences inherent to this problem.
To calculate the value of any cell, all the values of the previous
cells in the same row and column must be computed first, as shown in
Figure 1. These dependences restrict the ways in which H can be
computed.
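The recurrences (1)-(3) can be sketched as a scalar reference routine. The code below is a minimal illustration and not the authors' implementation: the function names sw_score and simple_sm, and the toy match/mismatch scores, are assumptions introduced here. Only two rows of H and F are kept, since each row depends only on the previous one.

```python
def simple_sm(a, b):
    """Hypothetical toy scoring function: +2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def sw_score(s1, s2, sm, goe, ge):
    """Optimal local alignment score following recurrences (1)-(3)."""
    n = len(s2)
    # Row 0 and column 0 of H, E and F are initialised to 0.
    h_prev = [0] * (n + 1)
    f_prev = [0] * (n + 1)
    best = 0
    for i in range(1, len(s1) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [0] * (n + 1)
        e = 0  # E[i][0]
        for j in range(1, n + 1):
            e = max(h_cur[j - 1] - goe, e - ge)                # Eq. (2)
            f_cur[j] = max(h_prev[j] - goe, f_prev[j] - ge)    # Eq. (3)
            diag = h_prev[j - 1] + sm(s1[i - 1], s2[j - 1])
            h_cur[j] = max(0, diag, e, f_cur[j])               # Eq. (1)
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(sw_score("ACGTACGT", "ACGTTACGT", simple_sm, goe=3, ge=1))
```

In the example call, the best alignment matches all eight residues of the first sequence and opens a single one-residue gap.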
Figure 1: Data dependences to compute H.
3 Intel’s Xeon Phi
With the Exascale challenge as a target in High Performance Computing
(HPC), accelerators seem to be the alternative for reaching such
goals, due to the power consumption constraints of general-purpose
processors. Xeon Phi (Phi) is the brand name given by Intel to a
series of massively many-core processors designed for HPC purposes.
Phi corresponds to a specialized architecture known as the Intel Many
Integrated Core (MIC) architecture, in contrast to the multi-core
architecture of general-purpose processors. The Phi architecture
derives from the defunct Larrabee project [19] and the Teraflops
Research Chip research project. In 2012, Intel launched the first Phi
generation (KNC), which features up to 61 x86 Pentium-based cores with
extended 512-bit vector units and simultaneous multithreading (four
hardware threads per core). Whereas the first Phi was attached to the
host processor via the PCI Express bus, the second generation (KNL)
can operate as a standalone processor.
As shown in Figure 2, the KNL architecture comprises up to 36 tiles
interconnected by a 2D mesh. Each tile includes 2 cores based on the
out-of-order Intel Atom micro-architecture (4 threads per core), 2
Vector Processing Units (VPUs) per core with AVX-512 support, and a
shared L2 cache of 1 MByte.
One of the main differences of the KNL architecture with respect to
its predecessor is the availability of on-package high-bandwidth
memory (HBM), known as MCDRAM. This technology offers three
configuration modes: Flat mode, Cache mode and Hybrid mode. In Cache
mode, HBM is used as a classical cache, which requires no source code
changes but yields lower performance rates; in Flat mode, HBM is used
as addressable memory, and the programmer must indicate manually which
data are allocated in HBM. It is important to note that in Flat mode
MCDRAM is exposed as a separate node of a Non-Uniform Memory Access
(NUMA) architecture, so the programmer should take special care to
achieve efficient memory access from the cores [2].
KNL supports not only Intel’s older multimedia extensions, such as
128-bit SSEx and 256-bit AVXx, but also the modern 512-bit AVX-512. In
fact, Intel will unify the SIMD instruction set of both
general-purpose processors (support has been announced for the Xeon
E5-26xx v5 in 2017) and KNL processors by means of AVX-512. AVX-512
provides 512-bit SIMD capabilities, 32 logical registers, vector
predication via eight new mask registers and gather/scatter
instructions for indirect vector accesses. The current Phi has two
VPUs per core, which provide 32 SIMD lanes for single precision
(512-bit registers / 32 bits per single-precision value × 2 VPUs = 32
lanes) and 16 SIMD lanes for double precision [21]. Although the Intel
AVX-512 instructions fall into several categories, the Xeon Phi KNL
architecture only supports four: AVX-512F (foundation instructions),
AVX-512CD (conflict detection), AVX-512ER (exponential and reciprocal)
and AVX-512PF (prefetch instructions).
Figure 2: Xeon Phi KNL architecture.
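The lane arithmetic quoted above can be checked with a few lines of code. This is only an illustrative sketch; the helper name lanes_per_core is hypothetical and does not correspond to any Intel tool or API.

```python
def lanes_per_core(register_bits, element_bits, vpus=2):
    """SIMD lanes per core: lanes in one register times VPUs per core."""
    return (register_bits // element_bits) * vpus


single = lanes_per_core(512, 32)  # single precision, 32-bit elements
double = lanes_per_core(512, 64)  # double precision, 64-bit elements
print(single, double)
```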
From a programming point of view, one of the main goals of this
platform is to support the parallel programming models traditionally
used in HPC, such as OpenMP, MPI or TBB [14]. This simplifies code
development and improves portability compared with alternatives based
on accelerator-specific programming languages such as CUDA or OpenCL.
In fact, although it may not be the most efficient option, KNL offers
binary compatibility with the Xeon families.
However, minimal programming efforts, such as introducing a few
directives to inform the compiler about pointer disambiguation, data
alignment or data dependencies, usually yield poor performance. In
fact, guided auto-vectorization is not able to achieve the best
performance in most cases, and programmers usually need to make an
extra effort to hand-tune their codes in order to exploit SIMD
capabilities. Indeed, intrinsics are currently the only option for
complex applications that suffer from data dependencies or irregular
access patterns, which can be hidden using specific code
transformations. Unfortunately, improving performance comes at the
expense of losing cross-platform portability.
4 SW Implementation
In this section we will address the optimizations performed on the Intel Xeon
Phi KNL processor. Before describing them in detail, we would like to point
out the algorithm flow which can be summarized in the following steps:
1. Pre-processing stage: database sequences are pre-processed to allow
subsequent parallel computation.
2. SW stage: alignments are carried out.
3. Sorting stage: alignment scores are sorted in descending order.
The inter-task parallelism approach is adopted in order to exploit the
SIMD vector capabilities available on the Xeon Phi KNL processor.
Accordingly, database sequences are processed in groups, and the group
size is determined by the number of SIMD vector lanes. Before
grouping, database sequences are sorted by length in ascending order
and padded with dummy symbols. This is done to favor regular memory
access patterns and minimize workload imbalance.
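The pre-processing just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name build_groups, the padding symbol "*", padding every sequence to the longest one in its group, and filling an incomplete last group with all-padding sequences are choices made here for the example.

```python
def build_groups(db, lanes, pad="*"):
    """Sort sequences by length, split them into groups of `lanes`
    sequences and pad each group to a common length."""
    ordered = sorted(db, key=len)  # ascending length
    groups = []
    for start in range(0, len(ordered), lanes):
        chunk = ordered[start:start + lanes]
        width = len(chunk[-1])  # longest sequence of the chunk
        chunk = [s.ljust(width, pad) for s in chunk]
        # Assumption: an incomplete last group is completed with
        # sequences made only of padding symbols.
        chunk += [pad * width] * (lanes - len(chunk))
        groups.append(chunk)
    return groups


for group in build_groups(["AAAA", "A", "AAA", "AA", "AAAAA"], lanes=2):
    print(group)
```

Because neighbouring sequences in the sorted order have similar lengths, little padding is needed within each group, which is the workload-balance argument made above.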
4.1 Multiple Parallelism Levels
Our implementation exploits both data-level and thread-level
parallelism. On the one hand, we use SIMD instructions by means of
hand-tuned intrinsic functions. In particular, we have explored the
SSE4.1, AVX2 and AVX-512 extensions. On the other hand, we take
advantage of the OpenMP programming model to express parallelism
across multiple cores. The database sequences are dynamically
distributed among the cores as threads become idle. Each alignment
matrix is divided into vertical blocks and computed row by row (see
Figure 3). This blocking technique improves data locality, reducing
the number of cache misses. In addition, the inner loop is fully
unrolled to increase performance.
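The dynamic distribution of work among threads can be illustrated with a thread pool, which hands the next pending group to whichever worker becomes idle. This is only an analogy to OpenMP dynamic scheduling written in Python; the names align_all and align_group are hypothetical, and Python threads are used here to show the scheduling pattern, not to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def align_all(groups, align_group, workers):
    """Process every sequence group, giving work to idle threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() queues one task per group; each idle worker picks the
        # next pending task, and results keep the submission order.
        return list(pool.map(align_group, groups))


# Stand-in for the alignment kernel: report the length of each sequence.
lengths = align_all([["A", "AA"], ["AAA"]],
                    lambda g: [len(s) for s in g], workers=2)
print(lengths)
```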
Figures 4, 5 and 6 show the core instructions of SSE4.1, AVX2 and AVX-
512 extensions, respectively. vCur is the block row being calculated while
vPrev is the previous one. After computing the current block row, vCur
and vPrev are swapped to process the next row. Besides, vSub represents
the substitution scores for the database sequence residues against the query
residue. vE and vF are the score vectors for alignments ending in a gap
in the query and the database sequence, respectively. vGoe represents the
vector for the sum of gap open and gap extension penalties while vGe is
the vector for gap extension penalty. Last, vS keeps the current optimal
alignment score.
Figure 3: Schematic representation of the inter-task matrix computation
Figure 4: SSE4.1 core instructions
Figure 5: AVX2 core instructions
Figure 6: AVX-512 core instructions
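Since the figures with the core instructions are not reproduced here, the following sketch emulates the lane-wise update with plain Python lists, one list element per SIMD lane (that is, per database sequence of a group). It is an illustration of the inter-task scheme under stated assumptions and not the intrinsic code of Figures 4 to 6: the helper names vmax, vadd, vsub and sw_inter_task are hypothetical, the variable names only mirror vCur, vPrev, vSub, vE, vF, vGoe, vGe and vS, and blocking and loop unrolling are omitted.

```python
def vmax(a, b):
    """Lane-wise maximum of two vectors."""
    return [max(x, y) for x, y in zip(a, b)]


def vadd(a, b):
    """Lane-wise addition."""
    return [x + y for x, y in zip(a, b)]


def vsub(a, b):
    """Lane-wise subtraction."""
    return [x - y for x, y in zip(a, b)]


def sw_inter_task(query, group, sm, goe, ge):
    """Score `query` against all equally long sequences in `group`."""
    lanes, n = len(group), len(group[0])
    zero = [0] * lanes
    v_goe, v_ge = [goe] * lanes, [ge] * lanes
    v_s = zero                    # best score seen so far, per lane
    v_prev = [zero] * (n + 1)     # H values of the previous row
    v_f = [zero] * (n + 1)        # F values, one vector per column
    for q in query:
        v_cur = [zero] * (n + 1)  # H values of the current row
        v_e = zero
        for j in range(1, n + 1):
            # Substitution scores of the query residue against the
            # j-th residue of every database sequence in the group.
            v_sub = [sm(q, seq[j - 1]) for seq in group]
            v_e = vmax(vsub(v_cur[j - 1], v_goe), vsub(v_e, v_ge))
            v_f[j] = vmax(vsub(v_prev[j], v_goe), vsub(v_f[j], v_ge))
            v_h = vadd(v_prev[j - 1], v_sub)
            v_h = vmax(vmax(v_h, zero), vmax(v_e, v_f[j]))
            v_cur[j] = v_h
            v_s = vmax(v_s, v_h)
        v_prev = v_cur            # the current row becomes the previous one
    return v_s


def toy_sm(a, b):
    """Hypothetical scoring: +2 match, -1 mismatch (padding never matches)."""
    return 2 if a == b else -1


print(sw_inter_task("ACGT", ["AC**", "ACGT"], toy_sm, goe=3, ge=1))
```

Each lane evolves independently, so padding symbols in a shorter sequence only add non-positive scores and do not change that lane's maximum.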
4.2 Instruction Set and Integer Range Selection
Almost all alignment scores can be represented using an 8-bit integer
range, which exposes as much SIMD parallelism as possible; however,
some alignments cannot be expressed within this range, so a wider one
must be used. KNL processors support the SSEx, AVXx and AVX-512
instruction sets. With 8-bit integers, SSE4.1 extensions allow the
computation of 16 alignments in parallel, and AVX2 instructions double
this number. Saturated arithmetic is used in addition operations to
detect overflow. When potential overflow is detected (i.e. the
alignment score is equal to the maximum value of the integer
representation employed), the alignment is recalculated using the next
wider integer range. Overflow checking verifies whether overflow
occurred in the lower half, the upper half or both halves of the score
vector, in order to avoid unnecessary recalculations. Unfortunately,
Xeon Phi KNL processors do not include the AVX-512BW subset (byte and
word versions of the AVX-512F instructions). This means that the
narrowest integer range available for AVX-512 on these devices is 32
bits (16 lanes per 512-bit register), so AVX-512 cannot compute more
alignments simultaneously than SSE4.1 or AVX2. In return, operations
for overflow detection are not required.
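The overflow handling described above can be sketched in scalar form. The code is an assumption-laden illustration rather than the authors' implementation: it uses a signed range (maximum 127 for 8 bits), clamps only the addition on the diagonal, and the names sw_score_sat and sw_score_promoting are hypothetical. An alignment whose score reaches the maximum of the current range is treated as a potential overflow and recomputed with the next wider range.

```python
def sw_score_sat(s1, s2, sm, goe, ge, bits):
    """SW score where additions saturate at the signed `bits` maximum."""
    limit = (1 << (bits - 1)) - 1
    n = len(s2)
    h_prev, f = [0] * (n + 1), [0] * (n + 1)
    best = 0
    for a in s1:
        h_cur, e = [0] * (n + 1), 0
        for j in range(1, n + 1):
            e = max(h_cur[j - 1] - goe, e - ge)
            f[j] = max(h_prev[j] - goe, f[j] - ge)
            # Saturating addition: the result never exceeds `limit`.
            diag = min(limit, h_prev[j - 1] + sm(a, s2[j - 1]))
            h_cur[j] = max(0, diag, e, f[j])
            best = max(best, h_cur[j])
        h_prev = h_cur
    return best


def sw_score_promoting(s1, s2, sm, goe, ge, widths=(8, 16, 32)):
    """Retry with the next wider range while the score hits the limit."""
    for bits in widths:
        score = sw_score_sat(s1, s2, sm, goe, ge, bits)
        if score < (1 << (bits - 1)) - 1:
            return score, bits
    return score, widths[-1]


def toy_sm(a, b):
    """Hypothetical scoring: +2 match, -1 mismatch."""
    return 2 if a == b else -1


long_seq = "A" * 100
print(sw_score_promoting("ACGT", "ACGT", toy_sm, 3, 1))
print(sw_score_promoting(long_seq, long_seq, toy_sm, 3, 1))
```

The first call stays in the 8-bit range, while the second saturates at 127 and is recomputed with 16 bits.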
4.3 Substitution scores
Our code also implements other well-known optimizations of the SW
algorithm proposed in previous works, such as the Query Profile (QP)
[16] and Score Profile (SP) [15] optimizations.
• The QP strategy is based on creating an auxiliary two-dimensional
array of size |q| × |Σ|, where q is the query sequence and Σ is the
alphabet. Each row of this array contains the scores of the
corresponding query residue against each possible residue in the
alphabet. Since each thread compares the same query residue against
different ones from the database, this optimization improves data
locality at the cost of a negligible increase in memory requirements.
• The SP technique is based on constructing an auxiliary n × L × |Σ|
score array, where n is the length of the database sequence, L is the
number of vector lanes and Σ is the alphabet. This auxiliary structure
contains the substitution scores for each query-database residue
combination and is constructed before the matrix computation. Since
each row of the SP forms an L-lane score vector, its values can be
loaded with a single vector load, reducing the number of operations in
the innermost loop. However, because the SP must be rebuilt for each
database sequence, its suitability must be evaluated, especially for
short queries.
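Both profiles can be sketched with nested lists. This is a simplified illustration, not the layout used by the authors: the function names, the toy alphabet and scoring function, and the n × |Σ| × L index order (chosen so that the innermost list is one L-lane vector) are assumptions made here.

```python
def query_profile(query, alphabet, sm):
    """|q| x |alphabet| array: row i scores query[i] against each symbol."""
    return [[sm(q, a) for a in alphabet] for q in query]


def score_profile(group, alphabet, sm):
    """For each database position and alphabet symbol, one vector with
    the score of that symbol against the residue held by every lane."""
    n = len(group[0])
    return [[[sm(a, seq[j]) for seq in group] for a in alphabet]
            for j in range(n)]


def toy_sm(a, b):
    """Hypothetical scoring: +2 match, -1 mismatch."""
    return 2 if a == b else -1


ALPHABET = "ACGT"  # toy alphabet; proteins would use the amino acids
qp = query_profile("AC", ALPHABET, toy_sm)
sp = score_profile(["AC", "GT"], ALPHABET, toy_sm)
# Scores of query residue 'T' at database position 1, for both lanes.
print(sp[1][ALPHABET.index("T")])
```

With the SP, the inner loop fetches the whole vector for a query residue by indexing, whereas with the QP it has to pick one score per lane according to each database residue.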
5 Experimental Results
5.1 Experimental Design
All tests have been performed on an Intel server running CentOS 7.2,
equipped with a 68-core Xeon Phi 7250 processor at 1.40 GHz (4
hardware threads per core and 16 GB of high-bandwidth memory) and
64 GB of main memory. The processor was run in Flat memory mode and
Quadrant cluster mode.
We have used Intel’s ICC compiler (version 17.0.1.132) with the -O3
optimization level by default. The experiments used to assess
performance are similar to those in previous work [18, 17, 9, 15]. We
have evaluated our application by searching 20 query protein sequences
against the well-known Environmental NR database (release 2016_11;
footnote 1). This database comprises 1384686404 amino acid residues in
6962291 sequences, with a maximum length of 11944. The queries have
been extracted from the Swiss-Prot database (footnote 2) (accession
numbers: P02232, P05013, P14942, P07327, P01008, P03435, P42357,
P21177, Q38941, P27895, P07756, P04775, P19096, P28167, P0C6B8,
P20930, P08519, Q7TMA5, P33450, and Q9UKN1), ranging in length from
144 to 5478. The scoring matrix selected was BLOSUM62, and gap
insertion and extension penalties were set to 10 and 2, respectively.
Footnote 1: The Environmental NR database is available online at
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/env_nr.gz
Figure 7: Performance for the different instruction sets used varying
the number of threads.
5.2 Performance Results
Cell updates per second (CUPS) is a commonly used performance measure
in the Smith-Waterman context, because it removes the dependency on
the query sequences and the databases used in the different tests. One
cell update comprises the complete computation of one cell in matrix
H, including all memory operations and the corresponding computation
of the values in the E and F arrays. Given a query sequence Q and a
database D, the GCUPS (billion cell updates per second) value is
calculated by:
$$\mathrm{GCUPS} = \frac{|Q| \times |D|}{t \times 10^{9}} \tag{4}$$
where |Q| is the total number of residues in the query sequence, |D|
is the total number of residues in the database and t is the runtime
in seconds [15].
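Equation (4) translates directly into code. In the sketch below the function name gcups is hypothetical, the database size is the one reported in Section 5.1, and the 25-second runtime is an invented value used only to show the arithmetic; it is not a measurement from this work.

```python
def gcups(query_len, db_residues, seconds):
    """Billion cell updates per second, as defined in Eq. (4)."""
    return (query_len * db_residues) / (seconds * 1e9)


DB_RESIDUES = 1384686404  # Environmental NR size from Section 5.1

# Hypothetical run: the longest query (5478 residues) finishing in 25 s.
print(round(gcups(5478, DB_RESIDUES, 25.0), 1))
```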
Figure 7 shows the performance of the different instruction sets when
varying the number of threads (footnote 3). The best performance is
achieved by the AVX2 extensions (340.3 GCUPS), followed by AVX-512
(157.8 GCUPS) and, last, SSE4.1 (97.6 GCUPS). As mentioned before,
exploiting data-level parallelism is critical to achieve maximum
performance in this application. Even though AVX-512 doubles the
vector width of AVX2 instructions, the lack of low-range integer
operations imposes a strong limit on its performance, taking into
account that almost all alignment scores can be represented using
8-bit integer data. Although the SSE4.1 version computes 16 alignments
in parallel, like its AVX-512 counterpart, it is slower than the
latter. As only one of the VPUs of each core supports a subset of byte
and word SSE instructions, codes that use these operations suffer
performance losses.
Footnote 2: The Swiss-Prot database is available online at
http://web.expasy.org/docs/swiss-prot_guideline.html
Footnote 3: SSE4.1 and AVX2 versions using the QP technique were
excluded from the analysis to improve figure readability, since we
found that the SP scheme always achieved the best performance, as in
previous work [17].
Figure 8: Performance evolution varying query length.
Regarding the number of threads, the AVX2 implementation reaches top
performance using 136 threads, although performance with 68 threads is
very close (just 1% slower). AVX-512 and SSE4.1 intrinsics behave
similarly. In the AVX-512 case, performance with 68 threads is 3%
higher than with 136 threads, while the SSE4.1 version is slightly
better (1%) with 204 threads than with 272 threads.
Lastly, Figure 7 also allows us to evaluate the performance gains
obtained by HBM usage. As the entire application fits in the MCDRAM,
we can benefit from placing all data in that memory using the numactl
utility (without source code modification). In particular, MCDRAM
exploitation achieves an average speedup of 1.04× and a maximum
speedup of 1.1×.
Figure 8 illustrates how performance evolves with query length for the
most favorable configuration of each implementation: 204, 136 and 68
threads for SSE4.1, AVX2 and AVX-512 intrinsics, respectively. In all
cases, data is placed in MCDRAM. The SSE4.1 and AVX-512
implementations show almost constant performance. As expected, this
behavior results from exploiting the inter-task parallelism scheme.
The AVX2 version shows an increasing performance trend that flattens
for larger query sequences (m ≥ 2504). For AVX-512, the behavior of QP
and SP differs: QP performs better for short sequences. This aspect,
also observed in previous research on the Xeon Phi KNC [10, 17], is
due to the additional overhead incurred by the SP construction, which
is not compensated by the indexing benefits for shorter queries. In
summary, the peak performances achieved are 351.2, 162.8, 157.2 and
98.9 GCUPS for the AVX2, AVX-512 (SP), AVX-512 (QP) and SSE4.1
implementations, respectively.
Figure 9: Performance comparison to Parasail library.
5.3 Performance Comparison to Parasail Library
Finally, we have compared our implementation with the parasail_aligner
application included in the Parasail library. As Parasail offers many
different alignment variants, we tested all of them and selected the
one that reports the best performance:
parasail_sw_striped_profile_avx2_256_sat. This variant is based on the
striped approach for intra-task parallelism with AVX2 intrinsics. It
also performs the QP optimization and uses 8-bit integer data with
overflow checking.
Figure 9 shows the performance comparison between Parasail and our SW
version on KNL. Both implementations run with 136 threads and make use
of MCDRAM. Parasail’s intra-task approach limits parallel scalability
for short alignments. In contrast, our version, which is based on the
inter-task and SP schemes, outperforms Parasail for all query lengths
considered. In particular, it runs 4.6× faster on average, with the
largest differences for shorter queries.
6 Conclusions
The SW algorithm is a critical application in bioinformatics and has
become the basis of more sophisticated alignment technologies, so its
study and acceleration on different platforms has attracted great
interest from the scientific community. In this paper, we have
explored SW acceleration on the latest generation of Intel’s Xeon Phi
processors with the KNL architecture. To the best of the authors’
knowledge, this is the first study of this kind. The main
contributions of this research can be summarized as follows:
• Exploitation of low-range integer vectors is crucial to achieve top
performance. Even though AVX-512 doubles the vector width of AVX2
instructions, the latter reach the maximal performance. The lack of
byte and word AVX-512 instructions (AVX-512BW) in Xeon Phi KNL
processors imposes a strong limit on AVX-512 performance, taking into
account that almost all alignment scores can be represented using
8-bit integer data.
• Multi-threading must be carefully evaluated. A different number of
threads produced the best results for each instruction set.
• MCDRAM usage proved to be an effective way to increase performance
with practically no programmer intervention. In particular, it
produced an average speedup of 1.04× and a maximum speedup of 1.1×.
• Peak performances are 351.2, 162.8, 157.2 and 98.9 GCUPS for the
AVX2, AVX-512 (SP), AVX-512 (QP) and SSE4.1 implementations,
respectively.
In view of the obtained results, as future work we will consider:
• Xeon Phi KNL processors offer different cluster and memory modes.
We are interested in exploring the Flat mode with larger genomic
databases that do not fit in MCDRAM, such as UniProtKB/TrEMBL. We will
also evaluate programming and optimization techniques in the other
available modes as a way to extract more performance.
• As Xeon Phi KNL processors reported competitive performance, we
plan to carry out a comparison with other accelerators, not only from
a performance perspective but also from a power efficiency point of
view.
• Future Xeon KNL processors will include the AVX-512BW set. As this
feature enables more SIMD parallelism, we see a promising opportunity
for accelerating SW database searches on these devices.
Acknowledgments
This work has been partially supported by the Spanish government
through research contract TIN2015-65277-R and the CAPAP-H5 network
(TIN2014-53522).
References
[1] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui
Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped BLAST
and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research, 25(17):3389–3402, 1997.
[2] Ryo Asai. MCDRAM as High-Bandwidth Memory (HBM)
in Knights Landing Processors: Developer’s Guide, 2016.
https://goparallel.sourceforge.net/wp-content/uploads/2016/05/Colfax_KNL_MCDRAM_Guide
[3] Jeff Daily. Parasail: SIMD C library for global, semi-global, and local
pairwise sequence alignments. BMC Bioinformatics, 17(81), 2016.
[4] Michael Farrar. Striped Smith-Waterman speeds database searches six
times over other SIMD implementations. Bioinformatics, 23(2):156–161,
2007.
[5] O. Gotoh. An improved algorithm for matching biological sequences.
Journal of Molecular Biology, 162:705–708, 1982.
[6] M.N. Isa, K. Benkrid, T. Clayton, C. Ling, and A.T. Erdogan. An
FPGA-based parameterised and scalable optimal solutions for pairwise
biological sequence analysis. In Adaptive Hardware and Systems (AHS),
2011 NASA/ESA Conference on, pages 344–351, June 2011.
[7] Haidong Lan, W. Liu, B. Schmidt, and B. Wang. Accelerating
large-scale biological database search on Xeon Phi-based
neo-heterogeneous architectures. In 2015 IEEE International Conference
on Bioinformatics and Biomedicine (BIBM), pages 503–510, Nov 2015.
[8] T I Li, W Shum, and K Truong. 160-fold acceleration of the Smith-
Waterman algorithm using a field programmable gate array (FPGA).
BMC Bioinformatics, 8:I85, 2007.
[9] Y Liu, A Wirawan, and B Schmidt. CUDASW++ 3.0: accelerating
Smith-Waterman protein database search by coupling CPU and GPU
SIMD instructions. BMC Bioinformatics, 14:117, 2013.
[10] Yongchao Liu and Bertil Schmidt. SWAPHI: Smith-Waterman protein
database search on Xeon Phi coprocessors. In 25th IEEE International
Conference on Application-specific Systems, Architectures and
Processors (ASAP 2014), 2014.
[11] David W. Mount. Bioinformatics: Sequence and Genome Analysis.
Cold Spring Harbor Laboratory Press, 2004.
[12] T. F. Oliver, B. Schmidt, and D. L. Maskell. Reconfigurable
architectures for bio-sequence database scanning on FPGAs. IEEE
Transactions on Circuits and Systems II: Express Briefs,
52(12):851–855, Dec 2005.
[13] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence
comparison. Proceedings of the National Academy of Sciences of the
United States of America, 85(8):2444–2448, April 1988.
[14] James Reinders, Jim Jeffers, and Avinash Sodani. Intel Xeon Phi Pro-
cessor High Performance Programming Knights Landing Edition. Mor-
gan Kaufmann Publishers Inc., Boston, MA, USA, 2016.
[15] Torbjørn Rognes. Faster Smith-Waterman database searches with
inter-sequence SIMD parallelisation. BMC Bioinformatics, 12(1):221,
2011.
[16] Torbjørn Rognes and Erling Seeberg. Six-fold speed-up of
Smith-Waterman sequence database searches using parallel processing on
common microprocessors. Bioinformatics, 16(8):699, 2000.
[17] Enzo Rucci, Carlos Garcia, Guillermo Botella, Armando De Giusti,
Marcelo Naiouf, and Manuel Prieto-Matias. An energy-aware performance
analysis of SWIMM: Smith-Waterman implementation on Intel’s Multicore
and Manycore architectures. Concurrency and Computation: Practice and
Experience, 27(18):5517–5537, 2015.
[18] Enzo Rucci, Carlos Garcia, Guillermo Botella, Armando De Giusti,
Marcelo Naiouf, and Manuel Prieto-Matias. OSWALD: OpenCL
Smith-Waterman Algorithm on Altera FPGA for Large Protein Databases.
International Journal of High Performance Computing Applications,
page 1094342016654215, 06 2016.
[19] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Pradeep
Dubey, Stephen Junkins, Adam Lake, Robert Cavin, Roger Espasa,
Ed Grochowski, Toni Juan, Michael Abrash, Jeremy Sugerman, and
Pat Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual
Computing. IEEE Micro, 29(1):10–21, 2009.
[20] Temple F. Smith and Michael S. Waterman. Identification of common
molecular subsequences. Journal of Molecular Biology, 147(1):195–197,
March 1981.
[21] A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod,
S. Chinthamani, S. Hutsell, R. Agarwal, and Y. C. Liu. Knights Land-
ing: Second-Generation Intel Xeon Phi Product. IEEE Micro, 36(2):34–
46, Mar 2016.
