Sequence alignments are fundamental to bioinformatics which has resulted in a variety of optimized implementations. Unfortunately, the vast majority of them are hand-tuned and specific to certain architectures and execution models. This not only makes them challenging to understand and extend, but also difficult to port to other platforms. We present AnySeq-a novel library for computing different types of pairwise alignments of DNA sequences. Our approach combines high performance with an intuitively understandable implementation, which is achieved through the concept of partial evaluation. Using the AnyDSL compiler framework, AnySeq enables the compilation of algorithmic variants that are highly optimized for specific usage scenarios and hardware targets with a single, uniform codebase. The resulting domain-specific library thus allows the variation of alignment parameters (such as alignment type, scoring scheme, and traceback vs. plain score) by simple function composition rather than metaprogramming techniques which are often hard to understand. Our implementation supports multithreading and SIMD vectorization on CPUs, CUDA-enabled GPUs, and FPGAs. AnySeq is at most 7% slower and in many cases faster (up to 12%) than state-of-the art manually optimized alignment libraries on CPUs (SeqAn) and on GPUs (NVBio).
I. INTRODUCTION
Recent years have seen a tremendous increase in the volume of data generated in the life sciences, especially propelled by the rapid progress of next-generation sequencing (NGS) technologies. As a consequence, modern bioinformatics tools often require highly efficient implementations of core sequence analysis algorithms.
Given a pair of genomic sequences, a common operation in bioinformatics is to identify their similarity under a model of evolution which allows for certain sequence modifications. This leads to so-called sequence alignments that map characters across the sequences in an order-preserving way while potentially inserting gaps such that a mathematical model of their similarity is maximized. For pairwise alignment computation, the Smith-Waterman algorithm [1] , the Needleman-Wunsch algorithm [2] , and their variants are widely used. These compute an optimal local, global, or semi-global alignment of two sequences under a given scoring scheme by means of dynamic programming (DP). However, the associated time complexity proportional to the product of sequence lengths makes this approach a time consuming component of various bioinformatics workflows. As a consequence, these algorithms have been optimized on numerous architectures including CPUs [3] - [5] , GPUs [6] - [9] , and FPGAs [10] - [12] . Unfortunately, the majority of implementations are hand-tuned and specific to certain architectures and execution models. This not only makes them challenging to understand and extend, but also difficult to port to other platforms. For a typical bioinformatics setting it would be more desirable to use a flexible library that can exploit a variety of modern hardware. This motivates the design of reusable and extensible sequence alignment components that can ensure compatibility and performance while at the same time reducing bioinformatics application development time. Existing optimized alignment libraries have so far only targeted specific architectures using either C/C ++ for CPUs [13] - [15] or CUDA C ++ for GPUs [16] .
A. Contributions
We present AnySeq, a novel high-performance library for computing pairwise alignments of genomic sequences implemented using the AnyDSL compiler [17] , [18] . AnyDSL and our approach is based upon the concept of partial evaluation [19] - [21] , which allows the compilation of different variants of the DP algorithm that are highly optimized for specific alignment types, scoring schemes, and hardware targets with a single, uniform codebase (Section II provides necessary background to understand the rest of the paper). We show that using AnyDSL, we can design an alignment library that • separates computation into two parts (a common part and an architecture-specific part) using higher-order abstractions without sacrificing performance (see Section III),
• provides mappings for CPUs using multithreading and vectorization without resorting to architecture-specific intrinsics, GPUs, and FPGAs using high-level synthesis (HLS) (see Section IV),
• allows for the variation of algorithmic parameters (e.g. type of alignment (global, local, or semi-global), scoring scheme (substitution matrix, linear or affine gap penalties), calculating only the optimal alignment score or an actual optimal alignment) by simple function compo-sition rather than hard-to-understand metaprogramming techniques,
• is competitive in terms of performance compared to state-of-the-art manually optimized sequence alignment libraries; that is, AnySeq is at most 7% slower and in many cases (up to 12%) faster than SeqAn on CPUs and NVBio on GPUs (see Section V). In addition, we demonstrate that our proof-of-concept FPGA implementation runs more than 3 times (4 times) more energy efficient than the corresponding CPU (GPU) implementation.
II. OVERVIEW

A. Motivation
Parallelization of pairwise sequence alignment has been proposed on various architectures ranging from CPUs to GPUs and FPGAs. However, most implementations are tied to certain alignment scenarios, specific hardware, and execution models. This motivates the design of a flexible yet efficient sequence alignment library that abstracts a generic DP algorithm and specializes only hardware-specific parts for a certain architecture.
To achieve high performance it is often inevitable to instantiate the code with different algorithmic variants and parameters and tailor it towards the target hardware architecture. Ideally, changing performance-impacting parameters should require minimum effort and have a negligible impact on size and complexity of the codebase. For example, the two distinct cases of constructing an actual alignment or just computing the optimal score, should not require two independent implementations of the alignment algorithm. Parts that are the same for both cases should use the same code and avoid duplication.
Existing alignment libraries (such as NVBio [16] for GPUs, and Parasail [14] , SSW [15] , and SeqAn [13] for CPUs) are optimized for a specific architecture. However, while especially GPU implementations differ greatly from their CPU counterparts due to the fundamentally different processor architectures, many parts of the algorithm are in principle the same for both platforms. When targeting different hardware architectures more than one programming language is necessary, e. g., C ++ for the CPU code, CUDA for GPUs, or Verilog for FPGAs. This in turn leads to code duplication and increased code complexity.
We meet the challenges laid out above with the AnyDSL framework. It provides the programming language Impala, which features a partial evaluator. This approach allows for the design of the library AnySeq with high-level abstractions that are successively instantiated with parameters and hardwaredependent code parts using partial evaluation instead of providing hand-tailored implementations for specific hardware and parameter sets as opposed to existing libraries. This way, most code is generic and reusable across different hardware architectures and application scenarios to a substantial amount.
For example, DP matrices are not directly accessed as a 2D array in AnySeq but through accessor functions that act as a view on a 2D data space. This decouples the implementation of the core algorithm from the concrete choice of the data layout. In conventional settings, programmers would most likely refrain from such an implementation because it puts them into the hands of the compiler to succinctly optimize the runtime overhead of this indirection away. In AnySeq we reliably direct the partial evaluator to remove any such overhead at compile time.
B. AnyDSL Compiler Framework
AnySeq is written in Impala, an imperative and functional language, which is part of the AnyDSL framework [17] , [18] .
a) Partial Evaluation: Impala integrates a partial evaluator [22] . Programmers control the partial evaluator via filters [23] . These are Boolean expressions of the form @(expr) that annotate function signatures. Each call site instantiates the callee's filter with the corresponding argument list. If the expression evaluates to true, the call will be specialized. Additionally, the expression ?expr yields true, if expr is known at compile time; the expression $expr is never considered constant by the evaluator. For example, the following @(?n) filter will only specialize calls to pow if n is statically known at compile time:
Thus, the call pow(x, 5) produces a loop-less sequence of multiplications, pow(3, 5) evaluates to 243 while pow(x, $5) remains untouched. This is called polyvariance. As syntactic sugar, @ is available as shorthand for @(true). This causes the partial evaluator to always specialize the annotated function. What is more, Impala automatically annotates higher-order parameters for specialization. c) Intrinsic Generators: Impala exposes several forms of parallelism via built-in generators. These do not possess an implementation in Impala itself but are recognized by the compiler. The intrinsics used in our implementation are parallel for spawning threads, vectorize for SIMD vectorization on the CPU, cuda for generating GPU code, and hls for HLS.
The passed function is executed in a single program, multiple data (SPMD) context. Most notably, they are not pragmas but regular functions that can be captured and passed around like any other function (see below). The only restriction is that after partial evaluation, their higher-order arguments have to be function literals. body each time. The filter of unroll tells the partial evaluator to completely unroll the recursion if both loop bounds are statically known at a particular call site (as in the example above). The function range wraps a call to unroll but prevents the partial evaluator from unrolling because it always considers $a as dynamic.
Since generators are ordinary functions, programmers can pass them around, or combine them to build more sophisticated patterns. The following example is parametric in an arbitrary 2D generator loop2d. The function tile is a more sophisticated variant of combine that sets up a tiled 2D loop nest. Note that both combine and tile are ordinary functions that are implemented as library functions. Partial evaluation will completely remove all overhead and generate a program that only consists of the desired loop nest while the function compute is not tainted with any hardware-specific details. Programming idioms like this are typical in our code base and showcase what is possible with AnyDSL's partial evaluator.
III. ABSTRACTIONS FOR SEQUENCE ALIGNMENT
AnySeq features a modular design that allows changing performance-impacting algorithm parameters at compile time:
• hardware platform: CPU, CPU-SIMD, GPU, or FPGA • alignment reconstruction needed: yes/no • substitution function • gap penalty scheme (linear or affine) and values AnySeq achieves this via function composition, mostly in the form of providing behavior-controlling functions as arguments to higher-order functions (see Section III-B).
A. Pairwise Sequence Alignment
Consider two sequences Q = (q 1 q 2 . . . q n ) and S = (s 1 s 2 . . . s m ) of length n and m over the alphabet Σ. Their optimal alignment can be found in O(n · m) by recursively solving three smaller subproblems. For each pair (q i , s j ) of characters one has to decide if these characters should be aligned or if a gap should be inserted. The optimal alignment score H(i, j) for the prefixes (q 1 . . . q i ) and (s 1 . . . s j ) is given by the recurrence relation
where σ is a substitution function over Σ × Σ that determines the score of aligning two characters. The parameter ν is needed to distinguish between local and global alignments as we will explain below.
In case of a linear gap penalty g we set:
For affine gap penalties, a gap of length k is penalized with
where G o is the cost of opening a gap and G e is the cost of extending a gap. We then need two additional auxiliary DP matrices to determine the optimal value for E and F at position (i, j). The following functions determine the best alignment score for the prefixes (q 1 . . . q i ) and (s 1 . . . s j ) under the constraint that s j (or q i ) is aligned to a gap: Initialization of the first rows and columns of H, E, and Fas well as in what cell(s) to look for the optimal scoredepends on whether the alignment shall be global, local, or semi-global. Note that for the following equations holds 1 ≤ i ≤ n, 1 ≤ j ≤ m; the initialized rows and columns have index 0.
In the case of computing an optimal local alignment, we look for the best scoring alignment starting at any position (q i , s j ) and ending at any other position (s k , q l ) with i ≤ k ≤ n, j ≤ l ≤ m. The parameter ν in Equation 1 has to be set to 0 in order to ensure that scores will always be positive. The DP matrices are initialized as follows:
Global alignments always start at position (0, 0) and end at position (n, m). Hence, the optimal score lies in cell H(n, m). In order to model unbounded scores, the parameter ν in Equation 1 must be −∞. The DP matrices are initialized as follows:
For semi-global alignments, gaps at the beginning and at the end are not penalized. This leads to the same initialization as for local alignments. The optimal score lies in the last row or column of H.
Note that score-only computations can be performed in linear space and quadratic time with respect to the length of the alignment targets. Actual alignments producing this value can be constructed by tracing back the predecessor information in the DP matrices. In order to avoid quadratic space consumption (which is prohibitive for long DNA strings), the traceback procedure can be implemented in linear space by a divide-and-conquer approach [24] that recursively determines optimal midpoints of the DP matrix (at the cost of at most doubling the amount of computed DP cells).
Each cell of H in Equation 1 depends on three neighboring cells (see Figure 1 ) which means that relaxing them in parallel can be done along minor diagonals. When relaxing all cells in a submatrix row of H, only the subproblem scores for the row above and one column left of the current cell are needed. If we want to compute submatrices in parallel, the first row and first column of a submatrix must have been computed earlier and the last row and last column must be kept available for the computation of subsequent submatrices (to the right of and below the current one).
B. Data Access Abstractions
Efficient scalar CPU implementations need radically different iteration and blocking schemes than efficient SIMD-vectorized CPU implementations or efficient GPU code. This in turn leads to different data access patterns and data storage requirements (memory alignment, RAM vs. VideoRAM, GPU shared memory, etc.). A key idea of our design is the usage of data accessor objects for decoupling data access from the actual data storage strategy.
For example, while the resulting data indexing schemes may differ, the underlying relaxation equations (1), (4), (5) stay the same. It is therefore desirable to decouple these two aspects of the algorithm. AnySeq encodes the basic recurrence relation for one DP matrix cell update in one function. The following example showcases this for global alignments: Member functions of type PrevScores are used to access scoring information of the three ancestral subproblems. If no gap should be introduced, the scoring substitution function is used to determine the cost. In case of affine gap penalties the functions gap_s and gap_q will access the auxiliary matrix elements from E and F , for linear penalties they return a constant. Objects of type NextStep store the maximum score for the current cell and what previous subproblem was chosen as ancestor. The ancestor information encoded in res.predc is sometimes needed for the innermost level of the traceback since recursion on subsequences is only done if the subsequence sizes exceed a hardware-specific threshold. Whether any of the functions access main memory, GPU memory, or simply returns a constant is determined based on the algorithm parameterization at compile time. The relaxation order (which for the recursive traceback is reversed for half of the DP matrix) and subproblem sizes are also determined based on algorithmic parameters and the target hardware platform.
Accesses to the values of previously computed subproblems as well as to the input sequence characters are also abstracted by function calls. This makes it possible to vary intermediate result storage strategies, indexing, and blocking schemes independently of other parts of the algorithm: AnySeq resorts to generators to define indexing/blocking schemes. These are used in CPU as well as in GPU code. This is explained in more detail in the following subsections.
Since AnyDSL supports partial evaluation and the function call hierarchy is known at compile time, AnySeq can use several layers of indirection without impacting performance. This allows us to isolate hardware-specific parts from architectureindependent code such as the relaxation function in a way that is much easier to write, understand, and reason about than for C ++ templates.
C. Interface & Data Flow
The outermost interface functions build accessor objects to the sequence storage and combine different initialization, relaxation, and matrix traversal/blocking functions to realize different alignment schemes. To make interfacing with other languages easier, AnySeq provides C wrapper functions as entry points to the different algorithmic parameterization scenarios: ) -> Score // optimum score { // yields data accessor objects to input and output let input = input_view(query, subj); let output = output_view(qAlign, sAlign); // yields a struct with functions to control algo behavior let scheme = global_scheme( linear_gap_scoring(simple_subst_scoring(2,-1), -1)); construct_alignment(scheme, input, output) } In order to use it in C ++ , it just needs to be declared: extern "C" { // C linkage score_t construct_global_alignment(const char * query, const char * subj, char * qAlign, char * sAlign); } Algorithmic variants can be obtained by exchanging parameters of higher-order functions. Scoring schemes, memory access, and iteration strategies are all encapsulated in functions that can be interchanged. Most of the time programmers need to exchange several functions in order to get the desired behavior. Thus, functions are often grouped into descriptively named structs to make parameterization more convenient and the resulting code more expressive.
Because all accesses to actual data are abstracted through function calls we can simply exchange the order in which memory is accessed. If, for example, parts of the input sequences need to be reversed (this is the case for the divide-and-conquer traceback) we reverse the indexing in the sequence accessor function.
The following code example shows accessors that decouple memory access patterns. A MatrixView instance provides an interface for reading and writing data addressed by two indices. Functions like view_matrix_coal_offset create accessors that are optimized for a specific hardware architecture or algorithmic stage. Scoring schemes are built by combining a gap scoring strategy with a substitution function (see the listing in Section III-C). The substitution function takes an accessor to the scores of the ancestral problems and the next character pair. We obtain a simple scoring function with a match score of 2 and a mismatch score of −1 by calling simple_subst_scoring(2, -1). It returns the corresponding substitution function. This example also demonstrates how the ability to return functions makes parameterizations easier to use at the call site. To obtain a matrix substitution scheme, the programmer simply has to pass a substitution function that reads scores from a lookup table. AnySeq implements gap scoring functions similarly.
Since partial evaluation ensures that no machine code is generated for calls to functions that either do not contain instructions or return a compile-time constant, we can implement most algorithmic details in their most general form and rely on partial evaluation to eliminate such calls. For example, if the predecessor information or affine gap storage is not needed in a concrete algorithmic variant, the functions in the specialized accessor objects do either nothing or return compile-time constant values. The same holds for writing score data. For local and semi-global alignment computation the algorithm must also keep track of the current maximum score while this is not necessary for global alignments. For this, we just have to swap out one variant of the Scores accessor's member function update (see Section III-B) for a different (more efficient) one at compile time.
The resulting control flow for computing an alignment is as follows: 1) allocate memory for the input data, 2) read and store input data, 3) allocate storage needed for the output, 4) build accessors to input and output data storage, 5) allocate temporary storage for the alignment algorithm (RAM, GPU global and/or shared memory), 6) build accessors to the temporary storage(s), 7) run relaxation procedure for computing the DP matrix cells, 8) look up optimal score, 9) if needed, build alignment strings, 10) output results.
IV. MAPPINGS TO CPUS, GPUS, AND FPGAS
If high performance is desired, it becomes inevitable that some parts of an algorithm need to be specialized for different acceleration technologies. One way of supporting different hardware platforms, is to choose one of several implementations of the same struct or function, i. e., by writing different, specialized variants with the same name or signature, respectively. As an example, consider the following CPU version of a function inplace_map that transforms the array data within the range a to b by applying a function f: The interface of inplace_map can be made more generic by using data accessor functions instead of pointers to arrays.
We have used hardware-specific generator functions to encapsulate iteration and blocking strategies and data accessors to encapsulate memory access patterns. Therefore, supporting a new hardware platform means replacing only those iteration and data access abstractions for which benchmarks have shown that they perform sub-optimally on that platform. A breakdown of our code base-excluding supporting code like benchmarking functions, I/O, and C interfacing functions and also the FPGA-specific parts-reveals that approximately 23% of all lines of code are specifically written for the GPU, 14% are specific to CPU vectorization and less than 11% are only needed for the non-vectorized CPU version while the remaining 52% are shared among all three variants.
A. CPU Parallelization
We compute the values of different DP submatrices concurrently using CPU threads. As mentioned before, relaxing DP matrix cells in parallel can be done along (minor) diagonals. In the non-vectorized version, cells within a submatrix will be relaxed in row-major order.
Zero-overhead data access abstractions allow for the convenient separation of iteration logic and memory layout. Note that scores must be accessed following a row-major order within a tile. This is because if we want to compute the value at local position (i, j) an intra-tile cyclic buffer must always contain the previously computed values at local positions (i−1, j−1) through (i−1, m) and (i, 0) through (i, j−1) (see the right image in Figure 1 ). Furthermore, the values of the rightmost and bottommost border cells of a submatrix need to be kept as long as neighboring submatrices that depend on these values have not been computed yet (see Figure 2 ). Again, data accessors help to hide the fact that not the entire DP matrix is stored, but only such border stripes.
In our preliminary version of AnySeq [18] we used a static wavefront schedule along diagonals of submatrices and vectorization was done by computing consecutive rows within a submatrix while substitution scores were looked up from a precomputed auxiliary array. This approach did not yield satisfactory performance (see red line in Figure 6 ).
Instead of a static schedule we now use a dynamic wavefront approach where submatrices are scheduled in a thread-safe queue which allows threads to add and extract work items concurrently. Not only does this eliminate load imbalances due to a mismatch between the number of available threads and the number of submatrices that can be relaxed in parallel, this approach also leads to better load-balancing when computing several alignments of different sizes concurrently.
Vectorization is done over blocks that consist of rows from independent submatrices which also obviates the need for an auxiliary data structure for score lookup. The number of rows in each block corresponds to the number of available vector lanes l which is determined by the SIMD instruction set and the width of the data type used for computing score values.
Impala provides the generator vectorize that triggers CPU vector code generation which in turn relies on the Region Vectorizer [25] . Within the body of vectorize memory reads and writes as well as arithmetic operations are transformed into the corresponding vector instructions. Most existing vectorized alignment implementations [3] - [5] , [14] , [15] are based on intrinsics specific to certain architectures. A major advantage of our approach is that the vectorize generator supports several SIMD instruction sets.
Since only differences to the global score are relevant, we use smaller data types (e.g. 16 bits for our use cases) for scores within a block. Whether this is feasible without overor underflow, depends on the block size and the scoring scheme. The largest possible differential score value occurs if all characters in the corresponding subalignment match. The smallest possible differential score is obtained if no pair of characters in both sequences matches and either the largest possible mismatch penalty (along the alignment diagonal) or the largest possible gap penalty (along the first row or first column) is subtracted.
A thread only computes a vectorized block, if l work items are enqueued. This means that the preconditions of l independent alignment submatrices already hold. Depending on the workload, e.g., when computing alignments of several sequence pairs in parallel, or when starting new alignments, there might be less than l submatrices available per thread. In these cases threads will compute single submatrices using the scalar method (see Figure 3 ). After computation of a block has finished, all of the l associated alignment submatrices are marked as finished by the current thread. Next, for each of the completed submatrices, their neighboring submatrices are enqueued if they have neither been computed nor been enqueued yet and their preconditions have already been fulfilled, i. e., their predecessor submatrices have been computed. The completion and queuing status of all submatrices is tracked using preallocated arrays of atomic flags.
B. GPU Parallelization
Similar to the CPU code we iterate over matrix tiles in a diagonal fashion. This is done in host code that starts a GPU kernel for each diagonal. The GPU kernel uses a onedimensional grid of thread-blocks where each block computes one matrix tile. Each tile is further divided into stripes which are computed in sequence by a thread-block. Within a stripe, threads compute diagonals in parallel (see Figure 4 ). Computation of one stripe is divided into three parts in order to avoid branch divergence. This division is due to the fact that full diagonals are only computed in the middle part of a stripe but not at the start or the end.
Again, data accessors are used to manage memory buffers that can now point to global or block-local shared GPU memory. In order to enable coalesced memory access we have to use memory access patterns different from the CPU code. Since memory access is abstracted by functions, it is easy to exchange access schemes for the CPU and the GPU code. Device-independent code, like the core relaxation function, remain the same, because it only calls those memory access functions to read and write values.
Before starting a block of threads, the segments of the input sequences that are needed for the current tile are stored in block-local shared memory. The values of the row immediately above the current stripe are needed to access ancestral problems. Also the values in the bottom-most row of the current stripe will be needed by the next stripe. Since we are relaxing in diagonals along the stripe, we re-use the memory cells with the values of the uppermost row that are no longer needed and store the results of the last row to them. The buffer needed for this does also reside in shared memory. This way, we compute all stripes within a tile in succession without starting a new kernel. The last row and column of the current tile are always written back to global memory (indicated by circles in Figure 4 ). Figure 4 : Striped computation of a tile on the GPU. Values of the cells shown in gray and hatched cells are located in shared memory. Light gray indicates values still needed for the current tile. Dark gray indicates values needed for the next stripe (re-use memory previously claimed by initialization cells that are no longer needed). The circles indicate values that need to be written to global memory after the current tile has been completed.
C. FPGA Parallelization
The FPGA implementation iterates over matrix tiles in a diagonal fashion, too. We use several processing elements for this that all compute a single DP cell per clock cycle. The iteration in diagonal fashion happens in stripes of width K PE , i. e., the number of processing elements. In turn, for each clock cycle we have K PE updates in parallel, which renders K PE and the frequency crucial for performance. The shorter sequence is divided into blocks of maximum size K PE which are used to initialize the processing elements. Characters of the longer sequence are streamed one-by-one through the linear array of processing elements; that is, processing elements perform the relaxation and pass the character and their results to the next processing element. The last element emits the score to the host system.
In case the shorter sequence is longer than K PE , we buffer the rightmost DP column of a stripe with the help of a predefined hardware component in DDR memory of the host system. This component is necessary to store the results in parallel to the ongoing computation. We have similar components to stream the characters and for fetching the previously stored values from DDR.
V. PERFORMANCE EVALUATION
Experiments were run on a system with two Intel Xeon Gold 6130 (Skylake) CPUs with 192 GB of DDR4 RAM and L1, L2, and L3 caches sizes of 1 MB, 16 MB, and 22 MB respectively. Each CPU has 16 physical cores and all CPU algorithms were run with 32 threads. The operating system used was CentOS Linux release 7.6.1810. The vectorized variants were compiled for the AVX2 and AVX512 instruction sets and use 16 bit scores within a SIMD lane. The GPU experiments were performed on a Titan V.
We compared AnySeq to the well-established sequence alignment libraries SeqAn 2.4.0, Parasail 2.0, and NVBio 1.1. Any-Seq was compiled with the AnyDSL version from July 2019 and libraries were compiled with gcc 7.3.0. AnyDSL is based We have evaluated performance for two common use cases: (i) pairwise alignment for long DNA sequences and (ii) comparisons of large amounts of short Illumina reads. We mostly computed global alignments using a simple scoring scheme with +2 for a match, −1 for a mismatch and a linear gap penalty of −1 and in addition global alignments with an affine gap scoring scheme using G o = −2 and G e = −1 if supported. Parasail does not explicitly specialize the case of linear gap penalties which means that it effectively always computes affine gaps, even if G o = 0.
For the first use case we aligned three pairs of long genomic sequences of roughly similar length listed in Table I , which have been used as a benchmark before [3] , [26] . Part a) of Figure 5 shows the median performance in terms of giga cell updates per second (GCUPS) on the tested CPU and GPUs for both score-only computation and traceback. For score-only computation AnySeq is faster than both SeqAn and Parasail for the multithreaded, non-vectorized version, and for the AVX2 version while for the AVX512 version SeqAn is faster. For the traceback, AnySeq and SeqAn have roughly the same speed for all versions-both outperforming Parasail. On the tested GPU, AnySeq outperforms NVBio for both score-only computation and alignment reconstruction by a factor of up to 1.1. Using affine gap penalties does generally not change the relative performances of the tested libraries, although it is always slower than using linear gap penalties due to an increased amount of memory reads and writes. Note that alignment computation on the GPU relies on 32-bit integer arithmetic due to the lack of native support for shorter integer data types (as used by our AVX vectorization).
For the second use case we performed 12.5 million pairwise alignments of Illumina reads of length 150 base-pairs each. The set of reads was simulated with Mason using chromosome 10 of GRCH38 as reference. Part b) of Figure 5 shows the achieved performance in GCUPS on the tested CPU and GPU for both score-only computation and traceback. For score-only computation on the CPU, AnySeq outperforms both SeqAn and Parasail for the AVX2 version, while for the AVX512 version SeqAn is slightly faster. On the GPU, AnySeq consistently outperforms NVBio with a factor of up to 1.12.
We have also compared the CPU thread scalability of the proposed dynamic wavefront approach with a purely static wavefront along diagonals in Figure 6 . Our preliminary version [18] and Parasail rely on the latter strategy. This also explains the low Parasail performance in Figure 5 part a). The dynamic approach achieves an efficiency of 75% and 65% for 16 and 32 threads, respectively, while the corresponding efficiencies of the static approach are merely 15% and 8%. SeqAn is also based upon a dynamic wavefront approach but relies on low-level intrinsics for vectorization to support various instructions sets. This makes this code not only hard to port to different SIMD architectures but also requires to emulate control flow constructs such as if, while, or break with masked data flow-a time-consuming and errorprone process.
AnySeq slightly outperforms SeqAn and NVBio in several use cases due to different implementation details like the internals of the concurrent queue used for scheduling tiles or different Figure 5 for that device.
parameter choices for recursion cutoff points or tile sizes. That being said it is hard to pinpoint the exact reason that attributes to performance differences of about 5%: The code bases are very different and even slight differences in the generated code can easily impact performance.
FPGA Performance
Our FPGA implementation is not as mature as the CPU or GPU one and-at the time of writing-only supports scoreonly long genome alignment. We chose the Xilinx Zynq Ultra-Scale+ MPSoC ZCU104 Evaluation Kit for our experiments. AnySeq runs with a frequency of 187.5 MHz and achieves a median performance of about 20 GCUPS on the ZCU104 (see Figure 5 ; long genomes, scores only).
The runtime is not affected by the gap penalty scheme as the computation happens in a single clock-cycle nonetheless. Moreover, without modifying the data transfer methodology, a no-operation hardware module is as fast as our alignment core. This indicates that our implementation only improves if the data transfer rate increases.
Even though the total GCUPS of the ZCU104 are not competitive to the tested high-end CPU/GPU systems, low-end FPGAs consume significantly less power. Table II compares the power efficiency for each tested device in terms of achieved GCUPS per watt. The ZCU104 is more than 3 times more energy efficient than the corresponding CPU implementations and 4.2-4.5 times more energy efficient than the corresponding GPU implementations.
VI. RELATED WORK
Parallelization of genomic sequence alignment has been investigated mainly in the context of two types of application scenarios: (i) pairwise alignment for long DNA sequences, and (ii) comparison of large amounts NGS reads. Existing monolithic implementations and libraries have been optimized for specific architectures including CPUs, GPUs, and FP-GAs. Two widely adopted parallelization schemes are intersequence (computes DP matrix cells of a number of independent alignment tasks in parallel) and intra-sequence (computes DP matrix cells of a single alignment in parallel).
FPGA solutions using systolic arrays [10] , early approaches using SIMD registers of standard CPUs [27] , and a number of GPU approaches [7] are based on the intra-sequence method vectorizing over minor diagonals of the DP matrix.
Farrar's approach [28] is a popular intra-sequence method using a striped layout for SIMD registers. Unfortunately, its performance relies on efficient branch prediction units which are often inefficient on modern many-core architectures. Both multithreading and vectorization have been effectively employed on CPUs for the inter-sequence approach if a large number of alignments need to be computed in parallel [5] . This is typically the case when processing large-scale NGS datasets. However, for the alignment of a pair of long genomic sequences, most approaches employ a wavefront pattern for intra-sequence parallelization [3] , [4] , [6] , [8] , [12] . An additional level of coarse-grained parallelism (e.g. on CPU/GPU clusters) has been proposed based on speculative execution [9] , [29] .
These approaches have in common that they are based on a highly optimized architecture-specific kernel but are often not flexible enough to support different alignment scenarios and are not compatible to different architectures. Consequently, libraries have been introduced that expose alignment algorithms as reusable components. Among them SeqAn [13] and NVBio [16] provide the highest flexibility on CPUs and GPUs, respectively. A recent paper [26] has shown that the performance of SeqAn is superior to other CPU-based libraries including Parasail [14] and SSW [15] . Unfortunately, existing approaches are specific to certain architectures and execution models and are therefore not able to support a variety of modern hardware required for modern bioinformatics pipelines that need to deal with fast-growing biological sequence databases. AnySeq demonstrates that using partial evaluation it is possible to build an alignment library based on higher-order abstractions that can specialize on a variety of architectures (CPUs, GPUs, and FPGAs) with comparable performance to state-of-the-art libraries for two application scenarios (long sequence alignment and NGS read comparison) that are hand-optimized for a specific architecture (e.g. NVBio on GPUs, SeqAn on CPUs).
Both SeqAn and NVBio use C ++ metaprogramming in order to specialize algorithmic variants at compile time. Some projects even use scripts (e.g. written in Python or Ruby) that generate other programs (e.g. in C/C ++ or CUDA). However, metaprogramming entails severe drawbacks in terms of programmer productivity: First, the meta language is different than the core language and/or intrusively pervades the program with quoted code snippets of the program to be generated (C ++ template language vs. C ++ core). For this reason, metaprograms are hard to read, write, and understand. What is more, programmers cannot easily move code between the core and the metaprogram. Thus, they must manually implement all needed versions like pow(a, b) vs. pow<b>(a) vs. pow<a, b>().
In addition, C++ templates do not accept lambda expressions as arguments. Finally, running the metaprogram does not guarantee the well-typedness of the generated program. This leads to the infamous, difficult to understand error messages after C ++ template instantiation.
AnyDSL on the other hand uses partial evaluation and does not suffer from any of these problems. The only annotations required are filters that determine whether the evaluator should specialize a particular call site. This allows in contrast to metaprogramming for polyvariance and avoids the need to implement several variants of the very same function (see Section II-B). This paper displays in several examples that we pass functions with free variables (lambda expressions in C ++ terminology) around and later specialize these calls via partial evaluation. This is not possible with C ++ . Finally, the partial evaluator runs in contrast to template instantiation on a welltyped program and, hence, will not produce an ill-typed one. Please refer to Leißa et. al. [18, §3] for a thorough discussion of metaprogramming and partial evaluation.
VII. CONCLUSION
The continually increasing volume of genomic sequencing data poses severe challenges when developing bioinformatics methods of practical relevance. To achieve sufficient performance frequently requires the use of modern hardware architectures. One way to separate the concerns of bioinformatics method development and low-level parallelization and optimization is the use of sequence alignment libraries.
Unfortunately, existing approaches such as SeqAn, Parasail, or NVBio only provide optimized implementations for a single architecture and add significant code complexity through template metaprogramming.
We have presented a new approach for designing a high performance library for genomic sequence alignment based on partial evaluation. By using higher-order abstractions, AnySeq separates the computation into common parts implemented in a generic way and parts allowing for architecturespecific optimization. Implementations of different alignment variants by simple function composition and mappings for CPU, GPU, and FPGA-based hardware targets have been presented. Through the use of recent compiler technology that incorporates partial evaluation any overhead incurred by the utilized higher level abstractions can be effectively eliminated. As a result AnySeq achieves highly competitive performance comparable to a number of state-of-theart, hand-optimized alignment libraries on various platforms. AnySeq is open source software and can be downloaded at https://github.com/AnyDSL/anyseq.
