Striped variation of the Smith-Waterman algorithm is known as extremely efficient and easily adaptable for the SIMD architectures. However, the potential for improvement has not been exhausted yet. The popular Lazy-F loop heuristic requires additional memory access operations, and the worstcase performance of the loop could be as bad as the nonvectorized version. We demonstrate the progression of the lazy-F loop transformations that improve the loop performance, and ultimately eliminate the loop completely. Our algorithm achieves the best asymptotic performance of all scan-based SW algorithms O(n/p+log(p)), and is very efficient in practice.
I. BACKGROUND
Sequence alignment is an essential component of many bioinformatics data processing workflows. Smith-Waterman algorithm (SW) produces an optimal local alignment between the two sequences. At the same time it is one of the slowest sequence alignment algorithms used, due to its quadratic computational complexity. Aligning two sequences of the lengths m and n requires O(mn) time. This makes the SW algorithm a continual focal point of optimization attempts using all kinds of the hardware accelerators.
Farrar [5] has proposed the Striped algorithm that uses the SSE vector instructions over an array of vectors containing evenly spaced data cells along the length of the input sequence. Striped approach eliminates most of the dependencies between the vectors in the array, and ignores the remaining dependencies during the first pass through the array. An additional pass, known as the Lazy-F update loop, is required to account for the inter-vector dependencies. However, it is often possible to break from the lazy-F loop early, making the Striped algorithm an excellent performer.
Algorithm 1 shows the striped algorithm with the lazy-F loop adaptation from [9] . Here operator ≪ denotes the vector shift, and operator − indicates subtraction with saturation as none of the vector values can be negative. All vector variable names are PascalCased while all scalar variable names are camelCased.
With the input sequence length n and vector size p the computational complexity of the both passes comprising the outer loop iteration is O((1 + C) * n/p) where C is a corrective factor describing the "laziness" of the second pass. The detailed analysis in [3] has shown that c could vary
then break end if end for end for end for end procedure significantly depending on the vector width, query length, and the query data. We will demonstrate the real world example where simply increasing the vector width changes C by the factor of 10.
An alternative approach to linearizing the data dependencies between the vectors has been presented in [7] for their GPU SW implementation. The correction pass is formulated in terms of the parallel scan operation requiring additional log(n) steps [1] . The overall computational complexity of the outer loop iteration is guaranteed to be O(n/p + log(n)).
In [4] the parallel scan has been translated to SIMD instructions. However, the computational complexity of that implementation is limited to O(n/p + p).
The performance comparisons of the two SIMD approaches have revealed no clear winner [3] . In tests of different query lengths and vector widths the advantage shifts from striped to scan and back. This is partially due to the fact that the lazy-F is a heuristic and, while fast for many real world inputs, gives no performance guarantees.
II. SOLUTION
We will develop a hybrid approach that uses the striped layout but replaces the correction loop with the scan pass. Our work was inspired by [7] , but instead of the multi-page arithmetic proof, we will arrive at scan by performing a series of the equivalent program transformations of the lazy-F loop.
We start with removing the early loop exit condition for clarity (Algorithm 2).
Note that in the lazy-F loop there is no dependency between the values of HStore[j], and, using the associative property of operator max, we can separate the F and H loops by storing the corrected F vector for every segment (Algorithm 3).
Algorithm 3 F and H loop separation
Operators ≪ and − are mutually associative in a sense that (A ≪ n) − cI = (A − cI) ≪ n where I is a unity vector. Using that we can swap the inner and outer loops for F (Algorithm 4).
Now it is obvious that the input vectors for every inner F loop invocation differ only by a constant vector, so the results can be calculated simply by subtracting the same constant from the result of the first and only inner loop execution. Note the change in the H loop of Algorithm 5.
Algorithm 4 Inner/outer loop inversion
for
The loop for computing F can be replaced by the parallel scan operation, further reducing the number of operations from p to log(p). Actually, this scan is very similar to one in [7] indicating that our algorithm exploits the same optimization but for the striped version of the algorithm. The remaining H update loop can be executed even lazier, in the next iteration of the main loop, thus eliminating the lazy-F loop completely save for the scan with the log(p) execution time (Algorithm 6).
The asymptotic complexity of the outer loop iteration is O(n/p + log(p)), making our scan-based algorithm the fastest asymptotically.
III. IMPLEMENTATION
We have implemented the scan algorithm using the SSE4.1, AVX2, and AVX512 instruction set. Later we have added another implementation that uses all 32 vector registers of the AVX512 architecture but issues only 256-bit wide instructions from the AVX2 and AVX512VL instruction sets. We will discuss its advantages in the results section below.
A. Experimental Setup
The computer platform is an Intel Xeon Platinum 8168 system with 16 cores running at 2.7 GHz and 32GB of RAM. To test the software performance we have run the BWA alignment tool [8] with 16 threads (-t 16) on a 30X Human genome sample NA12878 from the 1000 Genomes database[2] using hg38 as a reference. We have executed the BWA version Algorithm 6 Striped Smith-Waterman Scan procedure SMITHWATERMANSCAN segLen ← (length(query)
GapExtend end scan end for end procedure 7.15 to establish the baseline, and then replaced the Smith-Waterman mate rescue code with our SSE, AVX2 and AVX512 implementations. Additionally, we have extended the Lazy-F implementation to the AVX2 and AVX512 architectures. The total run times got collected from the BWA reports, and the percentage of time spent in the SW routine was measured via profiling. The alignment for every configuration has been run 3 times, and presented numbers are the averages of the observed run times. Figure 1 shows the times spent in the SW routine for the Scan and Lazy-F algorithm implementations on the various vector width architectures. The obvious standout is the poor performance of the AVX512 lazy-F implementation. This early observation commensurates with the results in [3] , and it has motivated us to search for better solution in the first place.
B. Results
Surprisingly, the AVX512 version of the Scan algorithm shows no improvement over the AVX2 implementation. To explain this we need to take a closer look at the microarchitecture of the vector pipeline. Every vector core in Intel class of instructions [6] . In the absence of data dependencies between the instructions, the core is capable of executing up to 4 commands per the CPU cycle so the minimum (best) achievable cycle-per-instruction ratio is 0.25. The CPI rate is a good metric of the instruction level parallelism, so we have recorded the CPI measurements for our experimental run, as presented in Figure 2 . The bulk of SW computation consists of additions and comparisons (max operations). These instructions could execute on ports 0, 1, and 5. Additionally, port 5 handles all vector shift operations of the striped algorithm. So it is convenient to model the SW instruction parallelism simply as the distribution of the arithmetic instructions across ports 0 and 1 with CPI approximately 0.5. Potential increase in CPI due to stall caused by the data dependencies is roughly compensated by the additional shuffle instructions executed on port 5. Numbers for the SSE and AVX2 implementations support this simplistic model.
The CPU in our experimental setup is build on the Skylake microarchitecture. In Skylake, port 5 is the only fully built port capable of processing 512-bit wide vectors. Ports 0 and 1 are still 256 bit wide, and for handling the 512 bit vectors they are fused together to form a single port 0+1. So in our SW execution model, transition to AVX512 simply trades executing the arithmetic instructions across two ports 0 and 1 for a single double wide port 0+1 with CPI rising to 1 as clear from the graph. Obviously, the overall throughput does not change and neither does the run time.
After such a disappointing result we have redoubled our efforts to produce better solution on the AVX512 microarchitecture. Turns out that using only 256-bit wide instructions from the AVX512VL set, along with masking and utilizing all 32 available vector registers, keeps ports 0 and 1 separate and yields the best execution times. CPI is still quite high due the masked instructions executing only on port 5 and clogging it further.
The last observation is that CPI of the scan algorithm is consistently higher that CPI of the lazy-F. Implementing vector scan requires a long sequence of shuffle and arithmetic instructions depending on the completion of the previous operation. The instruction level parallelism is underutilized, leaving the scan algorithm just marginally faster even though the total number of instructions executed is significantly lower. Interleaving two SW executions on the same core could improve the data parallelism and further improve the performance of the scan algorithm in future.
IV. CONCLUSION
We have enhanced the striped version of the Smith Waterman algorithm by achieving better asymptotic and practical performance, and by future-proofing it for the increasing width vector architectures. Our version combines the best features of scan and striped approaches making it equally useful for all sequence lengths and vector width.
Modern CPUs allow for multiple levels of parallel execution, and achieving top performance requires all levels working in coordination. We have augmented the data parallel vector approach with the simple model for instruction level parallelism that has provided practical insights and directions for further advancements.
