Sequence comparison is one of the most fundamental computational problems in bioinformatics. Pairwise sequence alignment methods align two sequences using a substitution matrix consisting of pairwise scores of aligning different residues with each other (like BLOSUM62), and give an alignment score for the given sequence-pair. This work 1 addresses the problem of accurately estimating statistical significance of pairwise alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequencespecific. Specifically, we develop algorithms for sequence-specific strategies for hardware acceleration of pairwise sequence alignment in conjunction with statistical significance estimation. Using pairwise statistical significance has been shown to give better retrieval accuracy compared to database statistical significance reported by popular database search programs like BLAST and PSI-BLAST. We provide a 'flexible array' hardware architecture, which provides a scalable systolic array suitable for both long and short sequences. The results with Xtremedata XD1000 FPGA platform show a speed-up by up to a factor of more than 200.
Introduction
Estimation of statistical significance of a pairwise local alignment is an important problem in bioinformatics. It has attracted a lot of attention in recent years [1] [2] [3] [4] , since it is a crucial step in many sequence comparison based applications in bioinformatics requiring homology detection, including BLAST, PSI-BLAST, and FASTA. Statistical significance represents the likelihood that 1 The conference version of this paper appears in BIOCOMP 2010 the similarity between two given sequences could have arisen by chance alone, which is very helpful in identifying potential homologs (biologically related sequences). In the last couple of years, pairwise statistical significance [5] [6] [7] [8] has been shown to be a promising alternative to database statistical significance (statistical significance reported by database search programs like BLAST, FASTA, etc.) for the purpose of identifying homologs, which is the central application of sequence comparison. Almost everything in bioinformatics depends on the inter-relationship between sequence, structure and function (all encapsulated in the term relatedness), which is far from being well understood. Pairwise sequence alignment (PSA) [9] is one of the most widely used techniques that aim to mine such relatedness from proteomic and genomic data. Many bioinformatics applications have been developed on the basis of pairwise sequence alignment, such as BLAST [10] , PSI-BLAST [11] , and FASTA [12] .
However, the current implementations of the variants of pairwise statistical significance estimation are too slow for them to be practically useful in many large scale applications, such as database search, because the algorithm involves generation of sequence-specific empirical score distributions by performing hundreds of alignments for a single sequence pair. But the demonstrated significant improvement in retrieval accuracy using pairwise statistical significance strongly motivates the use of high performance computing techniques to speed-up the pairwise statistical significance estimation procedure.
Usually, the sequence alignment programs report alignment scores for the alignments constructed, and related (homologous) sequences will have higher alignment scores. But the threshold alignment score T below which the two sequences can be considered unrelated depends on the probability distribution of alignment scores between random, unrelated sequences. Therefore, the biological significance of a pairwise sequence alignment should be gauged by the statistical significance rather than by alignment score alone.
This means that if an alignment score has a low probability of occurring by chance, the alignment is considered statistically significant, and hence biologically significant. It is thus possible to have two alignment scores x and y with x < y, but x more statistically significant than y.
Pairwise Statistical Significance (PSS) is the statistical significance of the specific pairwise alignment under consideration and is independent of any database. Accurate estimates of the statistical significance of pairwise alignments can be very useful to comment on the relatedness of a pair of sequences aligned by an alignment program independent of any database. And thus, it can also be used to compare different combination of alignment parameters -like the alignment program itself, substitution matrices, gap costs.
PSS generates a score for an alignment as a measure of the similarity between two sequences. PSS gives significantly better retrieval accuracy compared to BLAST and PSI-BLAST, and comparable to SSEARCH. BLAST and PSI-BLAST are heuristic based methods for database search whereas SSEARCH uses a full implementation of Smith-Waterman algorithm, and hence takes much more time. Time complexity of PSS is further reduced by hardware acceleration, as explained in this work.
In this paper, we use a large field programmable gate array (FPGA) as a coprocessor to accelerate the computationally intensive sequence alignment task performed during the pairwise statistical significance estimation procedure. An FPGA is a hardware device that can be configured at runtime to implement an arbitrary digital circuit. They excel in situations where simple operations, such as comparisons, logical operations and integer arithmetic, are performed on data streams. The regular data dependencies, abundant parallelism, and straightforward integer arithmetic of the alignment task make it suitable for implementation on an FPGA. We customize our implementation based on specific features of the pairwise statistical significance estimation procedure, which allows us to achieve significantly better performance than we would with a generic local alignment implementation. The 'flexible array' architecture, presented in this work alleviates the limitations of traditional systolic array implementations over commodity hardware for small sequence lengths without sacrificing the implementation efficiency for large sequence lengths. A speedup of 25x (for small) -200x (for large sequences) was obtained with hardware implementation on XtremeData XD1000 FPGA Development system.
Background Pairwise Statistical Significance
The statistical significance of the resulting sequence alignment score between the sequences can be presented by the probability (i.e., P-value) that random or unrelated sequences could be aligned to generate the same score. "P-value" represents the probability of an alignment with this score or higher occurring by chance alone. If an alignment score has a low probability of occurring by chance, the alignment is considered statistically significant, and hence potentially biologically significant. "Chance" here depends on the methodology of modeling randomness in the P-value estimation procedures [13] . These statistical approaches can be based upon theoretical models or upon permutation reconstructions of the observed sequences. However, it is only possible when the distribution of the alignment scores is known precisely, which can be obtained by aligning many pairs of random sequences.
Pairwise alignment has been used to develop many bioinformatics applications such as BLAST, SSEARCH, FASTA and others. They use full implementation of Smith-Waterman algorithm [14] . A score is generated for an alignment as a measure of the similarity between two sequences with a higher value indicating relatedness of the sequences. The Pairwise Statistical Significance algorithm [15] [16] [17] [18] [19] [20] [21] helps to find the statistical significance of the result, i.e. the score is measured against likelihood of its occurrence by chance.
Algorithm
The PSS estimation aligns an input sequence Seq1 with N permutations (usually N=1000) of another sequence, Seq2, using the Smith-Waterman local alignment algorithm. It then fits an Extreme Value Distribution to the resulting score distribution using censored maximum likelihood. Finally, the PSS is calculated to be the P-value of the Smith-Waterman alignment score between Seq1 and Seq2, evaluated against the generated distribution (see Figure 1 ).
The Smith-Waterman algorithm is an O(mn) dynamic programming algorithm, where m and n are the lengths of the input sequences. The processing amounts to constructing an m x n similarity matrix, H, in which each element depends on its upper, left, and upper-left neighbor.
The Extreme-Value-Distribution holds true for gapless alignment. For alignments with gap penalties, local sequence alignments methods are used. Experiments suggest that they follow Gumbel law after estimation of λ and K parameters [22] [23] [24] [25] .
As illustrated by Table 1 , scoring the local alignments accounts for the overwhelming majority of the overall execution time of the PSS estimation routine. Clearly, reducing the time required to complete the alignment task will net significant improvements to the performance of the algorithm as a whole. For this reason, we turn our attention to an architectural design capable of rapidly scoring pairwise alignments between Seq1 and many permutations of Seq2. 
High performance computing
Large scale parallel processing is achieved by efficient deployment of multiple processing units in parallel using suitable programming interfaces to solve computationally expensive problems or computationally expensive portions of a problem. In our context, alignment takes most of the computing power and it is composed to several noninterdependent tasks that can be performed in parallel.
There are many parallel programming models, most popular of which are the shared memory model using threads (OpenMP or POSIX threads), distributed memory model (MPI), and their hybrids. General-purpose graphical processing units (GPGPUs) have also become increasingly common in the last few years. Different high performance computing techniques have been HPC techniques have been successfully applied in many fields like aerospace [35] , climate science [36] , nuclear physics [37] [38] , power systems [39] [40] [41] , linear programming solvers [42] [43] [44] and others. The PSS problem has also been accelerated on multiple hardware architectures such as OpenMP [45] , MPI [48] , GPGPU [46] and NUMA [47] machines. In this work, we present hardware acceleration on FPGAs that provide orders of magnitude speedup over software counterparts owing to customization of their architecture.
General-purpose computing devices allow us to customize computation on an already-fabricated hardware and conserve area by reusing expensive active circuitry for different functions in time. Special purpose architectures have dedicated hardware to perform specific tasks, which gives them orders of magnitude speedup over general purpose computing devices (or micro-processors as we commonly call them). Reconfigurable computing is a hybrid approach where we have an array of computing elements connected by flexible interconnects, the devices are typically called as FPGAs or Field-Programmable Gate Arrays. This allows us to implement custom application-specific hardware in a configurable manner over FPGAs.
Two dominant features differentiate reconfigurable from special-purpose architectures and account for most of the area overhead associated with them: (1) instructions that tell the device how to behave, and (2) flexible interconnect which supports task dependent dataflow between operations. The principal difference when compared to using ordinary microprocessors is the ability to make substantial changes to the datapath itself in addition to the control flow. On the other hand, the main difference with custom hardware, i.e. applicationspecific integrated circuits (ASICs) is the possibility to adapt the hardware during runtime by "loading" a new circuit on the reconfigurable fabric.
There are numerous examples of FPGA-based accelerators for local alignment, including those presented in [26] [27] [28] [29] . From the operational standpoint, they differ based on the type of sequence comparison performed (DNA or protein), the supported gap model (linear or affine), and the supported sequence length. Protein alignments demand more resources than DNA alignments, and supporting an affine gap model is more complicated than a constant or linear model. The maximum supported sequence length also has a significant effect on the resource requirements of an implementation. In this work, we propose a FPGA implementation that supports long protein alignments with an affine gap model, making it very resource-intensive among FPGA implementations.
Systolic arrays are overwhelmingly the processing paradigm of choice for FPGA implementations of sequence alignment algorithms. A systolic array is an interconnected network of processing elements, or cells, in which each cell takes data in from one or more of its neighbors, performs some computation, and passes the results along to other neighbors. Systolic arrays map well to FPGAs and are useful in situations where data parallelism is abundant and data dependencies are regular.
Typically, linear arrays are used to process a single alignment at a time. Each cell holds on to a specific element of A, while B streams through the array. While the theoretical peak performance of linear systolic arrays is often impressive, performance tends to be poor for short sequences due to low utilization of processing elements. This, in addition to data transfer overhead, is the reason why FPGA implementations commonly report a large discrepancy between the actual performance for short sequences and the theoretical performance of the array as a whole. In [28] , various methods involving runtime reconfiguration are proposed to adapt the array to the length of the input query, in order to address the problem of low utilization for long arrays. Our implementation, by contrast, takes advantage of the task parallelism inherent in the pairwise statistical significance estimation procedure and uses straightforward architectural methods to minimize in order to address the problem of low utilization for long arrays. Our implementation, by contrast, takes advantage of the task parallelism inherent in the pairwise statistical significance estimation procedure and uses straightforward architectural methods to minimize transfer overhead and improve array utilization with a single FPGA configuration.
Recent advances in the programmability of Graphics Processing Units (GPUs) have also given rise to high performance local alignment implementations [30] [31] .
GPU implementations map individual alignment tasks to lightweight GPU threads or thread blocks, and process a large number of tasks concurrently. The GPU typically needs thousands of active threads in order to distribute the computation across its many processing elements and hide memory access latency. The implementation in [30] , for example, indicates that 28,800 threads are optimal for performance. These implementations thus perform well for applications like database searches, where a very large number of alignments are performed for a single query. Pairwise statistical significance estimations, by contrast, require a comparatively small number of alignments. A few highly optimized software alignment algorithms using Streaming Single Instruction Multiple Data Extensions (SSE) have also been proposed. SSE is a set of extensions to the x86 architecture that enable the processor to perform an operation on multiple data elements simultaneously. Using SSE extensions can yield sizeable performance benefits for software routines exhibiting abundant data parallelism. SSE implementations of local alignment algorithms, such as the ones described in [32] [33] [34] leverage SSE to simultaneously update multiple cells in the similarity matrix. The performance gains of SSE implementations are due in large part to a compact representation of scores. Since the value of any element in H stays smaller than 255 throughout many alignment tasks, representing scores as bytes usually poses no problem. When an overflow is detected for an alignment using byte representations, the query is re-executed with 2-byte representations of scores.
In this way, the processor is able to simultaneously perform either 16 operations when scores are represented as bytes, or 8 operations when scores are represented as 2 bytes. The efficiency gains from smaller score representations far offset the penalty of reprocessing a small number of the alignments.
Further performance benefits can be achieved by splitting the alignment tasks among multiple cores or multiple processors. SSE implementations are typically evaluated in terms of their ability to perform full database searches, but they can be applied to the processing of a smaller number of alignments as well.
"Flexible Array" Architecture
The task parallelism inherent to PSS provides an opportunity to tweak the design of the basic linear systolic array for better performance. Instead of focusing on the alignment of a single pair of sequences, we approach the design of an FPGA accelerator with hundreds of pairs in mind. We implement a pipelined systolic processing element for increased throughput, provide runtime support for partitioning the systolic processing elements into independent arrays to address utilization, and apply input and output buffering to reduce data transfer overhead between the CPU and FPGA. As a result, we are able to achieve high array utilization for all but the shortest of input sequences, and are able to support a wide range of input sequence lengths with a single FPGA configuration 0.
A linear systolic array is an effective processing method for pairwise alignment algorithms on FPGAs. The array is made up of many processing elements, or cells, that are connected in series. Each cell in the array is responsible for calculating a single column of the similarity matrix, H. To accomplish this, each cell holds onto a single element of Seq1, and a permutation of Seq2 streams through the array in its entirety.
Flexible Array:
The lengths of Seq1 and Seq2 are typically in the neighborhood of a few hundred characters, and for sequences this short, long systolic arrays tend to exhibit poor performance because many cells do not have an element of Seq1 to operate on. To address this problem, we propose a flexible accelerator, shown in Figure 2 , made up of 2 i processing blocks (i ∈ N). This array can be configured at runtime to act as a single long array for long sequences or 2 j shorter arrays (1 ≤ j ≤ i) for shorter sequences. When the processing blocks are configured as independent arrays, multiple pairs of permutations are processed concurrently.
This strategy improves the throughput of the accelerator for short sequences by increasing the number of systolic cells that can actively participate in calculations. The configuration is accomplished by multiplexing the inputs to the head and tail cells in each processing block. The select signals of these multiplexers are tied to a software-accessible register and are written before the input data is dispatched to the accelerator. By providing architectural support for dynamic allocation of systolic cells, our flexible accelerator offers increased performance for short sequences, without sacrificing the ability to process long sequences or requiring runtime reconfiguration of the device.
Systolic Cell:
The systolic cell is based on the implementations presented in [28] [29] . Since the Figure 4 Test system critical path of the accelerator lies within the systolic cell, breaking up the cell's calculations into two stages (Figure 3 ) results in a higher system clock and, consequently, greater throughput. Introducing pipelining creates a data hazard on H_out when elements of a single sequence are fed into the cell in successive cycles. This is because it takes two cycles for H_out to be calculated, but its result would be needed in one cycle.
To avoid the data hazard, two independent sequences are interleaved and fed into the cell. The tradeoff of a pipelined cell is higher resource usage associated with the increased register count. Compared to the multi-cycle implementation presented in [29] , adopting the pipelined cell design results in 33% fewer systolic cells, but an 86% increase in the operating frequency of the main clock. This equates to 24% higher peak throughput for the pipelined design.
Buffers: Typically, the number of systolic cells allocated to each alignment is less than the length of Seq1, so scoring the alignments cannot be accomplished in a single pass. In this case, the scoring process is broken up into phases. Seq1 is shifted into the arrays section by section, with the active permutations of Seq2 being streamed though the array once for every phase. Ga_out and H_out at the tail of each array are written into the multi-pass buffer, and the contents of the buffer are used as inputs to the head of each array during the subsequent phase.
The multi-pass buffer is implemented as 2 i independent FIFOs, where 2 i is the number of processing blocks. When the processing blocks are configured as a smaller number of independent arrays, the push controller interleaves writes to a subset of the 2 i FIFOs, while the pop controller interleaves reads from the same subset of FIFOs in the same order.
The input buffer A stores Seq1, while B0 and B1 each store bundles of Seq2 permutations. B0 and B1 each hold 2 j permutations of Seq2, where 2 j is the number of independent arrays that the processing blocks have been currently configured to support. B0 and B1 are configured to act together as a double buffer, allowing the contents of one buffer to be updated while the accelerator is processing the contents of the other buffer. Implementing a double buffer for the permutations helps to hide the latency of data transfers from the CPU to the FPGA.
The output scores are also buffered in an output FIFO. This FIFO is large enough to accommodate scores for the 1000 alignments typically called for during a PSS calculation.
Testing & Results
Our test system is an XtremeData XD1000 Development system (Figure 4 ). It is a Sun ULTRA 40 workstation in which an Altera Stratix II EP2S180 FPGA occupies one of the two-processor sockets. The other socket in the system holds an AMD Opteron 248 processor, which controls 4 GB of the total 8 GB DDR SDRAM on the system. The OS on the development system is Fedora Linux.
256 total systolic cells fit on device, split into 8 processing blocks. The choice of 8 as the number of processing blocks was made to provide a variety of array configurations without adversely affecting the number of cells that could be placed on the device. It operates at 125 MHz and supports sequence lengths up to 65,535.
The FPGA implementation was compiled using the Quartus II software suite, version 9.1. 256 total systolic cells fit on the device, split into 8 processing blocks. The design uses 99,309 LUTs, 70,053 registers, and 4.8 million BRAM bits. It operates at 125 MHz and supports permutation lengths up to 32,768. The processing blocks can be configured as 1, 2, 4, or 8 independent arrays. The selection of 8 as the maximum number of independent arrays was made to provide a variety of potential configurations without adversely affecting cell count and performance on the target device.
A common metric of performance in this problem domain is a Cell Update Per Second (CUPS). A cell update refers to the process of updating H(i,j) for a particular i and j. Since our implementation is comprised of 256 cells, each of which is capable of performing one cell update per clock cycle, its peak performance is 32 billion CUPS (GCUPS). To verify the functional correctness of the accelerator, we compared the scores generated by the accelerator to the scores generated by the software-only alignment task using the same common sequence and sequence of permutation blocks. Since the Smith-Waterman algorithm has a unique solution for given inputs, it is guaranteed that the quality of results for the pairwise statistical significance estimation will be unaffected by the use of this accelerator. Figure 5 plots the comparative performance of our flexible accelerator. The numbers represent performance in GCUPS for the alignment of a common sequence Seq1 against 1000 permutations of an equal-length sequence Seq2. The task of permuting Seq2 is excluded from this analysis. The line labeled "FPGA Flexible Array" plots the performance of our flexible accelerator. The processing blocks are configured as 8 arrays for sequences of length less than 4096, 4 arrays for sequences of length less than 8192, 2 arrays for sequences of length less than 16384, and 1 array otherwise. In other words, we use maximum number of arrays possible for the input length. The line labeled "Software Baseline" plots the performance of a sequential, scalar implementation of the SmithWaterman algorithm for protein sequences with support for affine gaps. The runs were conducted on the XD1000 development system. This implementation achieves at most 155 million CUPS (MCUPS). The line labeled "Software FastFlow" measures the performance of the implementation presented in [21] on a test system equipped with an Intel Core 2 Quad Q6600 processor. To our knowledge, this is the fastest software implementation available as of this writing, and is even capable of outperforming a GPU-based implementation [18] for full database searches.
Accelerator Performance
Direct comparisons between FPGA implementations are difficult for a few reasons. First off, targeting DNA sequences instead of protein sequences results in a simpler cell design with much lower BRAM requirements. The choice of gap model also has a substantial effect on the complexity of the systolic cell design, with the affine gap model requiring a more complicated cell than a constant or linear gap model. Comparing GCUPS ratings from implementations such as the ones in [13] and [14] , which operate on DNA and implement constant gap support, are therefore not telling.
Additionally, reducing the maximum supported score decreases the resource utilization of the systolic cell. This is because the width of the data values being operated on determines the resources needed to implement each addition and max operation, as well as the cell's register count. Since the systolic cells collectively account for the vast majority of resources used by the accelerator, decreasing their resource utilization allows for more cells on the device and higher peak performance. On top of all this, the resources available on the target FPGA, as well as the operating frequency allowed by the FPGA, are huge determining factors of the accelerator's performance. Qualitatively, compared to the linear array presented in [15] , our flexible array supports longer query sequences and achieves better peak performance, albeit with a larger FPGA.
The supported query (common sequence) length for our implementation is also not constrained by the number of systolic cells, so our implementation does not require runtime reconfiguration for varying query lengths. The implementation presented in [16] provides a good quantitative comparison to our flexible array. Like our accelerator, it handles protein sequences and supports the affine gap model. It also handles similar sequence lengths (65536). The cell design is highly optimized, and, most importantly, it targets the same FPGA and host platform that we do, allowing an evaluation of the design and not the FPGA capacity and platform overhead. The line labeled "FPGA linear" plots the performance of this implementation based on reported results. Though the reported numbers reflect the performance a single alignment, we assume that computing N alignments with this implementation would take N times as long as computing a single alignment.
The performance results indicate that our flexible array offers significant performance advantages over software and single linear systolic arrays. The single linear array fails to outperform the FastFlow implementation for sequences under length 4096 in this case, while our flexible array is significantly faster over the entire input space, and achieves 84% of its peak throughput for sequences as short as 256 residues.
For our flexible array, the primary performance limiter for short sequences is the large contribution of data transfer overhead. The data cannot be supplied to the FPGA fast enough when sequences are very short. Also, the overhead associated with configuring the accelerator, transferring inputs and results, and managing the state of the accelerator amount to a higher percentage of total execution time when the sequences are short.
Increasing the number of permutations, N, beyond 1000, will have no negative effect on the average throughput of our flexible array. The time required to dispatch and compute 2N permutations is twice the time required to compute N permutations. But the costs of initializing the FPGA for computation, dispatching the first permutation block to the FPGA, and reading the results back from the FPGA, are amortized over twice the computation time. Decreasing N, inversely, results in overhead being amortized over fewer cycles and causes a drop in average throughput. In the worst case, N is small enough to eliminate the task parallelism our flexible array is designed to extract. When N=1, for example, the average throughput degrades to roughly that of the linear array. Table 2 displays the end-to-end speedup for the pairwise statistical significance estimation algorithm over a sequential, scalar software implementation. For these tests, the processing blocks were again set as multiple independent arrays.
Application Speedup
In general, the results indicate substantial speedups. Shorter sequences see lower speedups because the speedup for the alignment task is lower, and because the rest of the algorithm (specifically, the permutation time) accounts for a higher percentage of the overall serial execution time. A common metric of performance in this problem domain is a Cell Update Per Second (CUPS). A cell update refers to the process of updating an element of the similarity matrix, H. Since our implementation is comprised of 256 cells, each of which is capable of performing one cell update per clock cycle, its peak performance is 32 billion CUPS (GCUPS). Figure 5 plots the comparative performance of our flexible accelerator. The numbers represent performance in GCUPS for the alignment of Seq1 against 1000 permutations of an equal-length sequence Seq2. "Software Baseline" plots the performance of a basic software implementation of Smith-Waterman running on the host CPU. "Software FastFlow" measures the performance of the highly optimized, SSE, multi-core implementation from on a system equipped with an Intel Core 2 Quad Q6600 processor. "FPGA Linear Array" plots the performance of the FPGA implementation, assuming that N alignments would take N times as long as a single alignment. "FPGA Flexible Array" represents our implementation, configured to use the most arrays possible for the given sequence length.
The single linear array fails to outperform the FastFlow implementation for sequences under length 4096 in this case, while our flexible array is significantly faster over the entire input space, and achieves 84% of its peak throughput for sequences as short as 256 residues. Figure 5 displays the endto-end speedup for the PSS estimation algorithm over a sequential, scalar software implementation.
Conclusion and Future Work
In this paper, we have presented a reconfigurable architecture to accelerate the calculation of pairwise local alignments for PSS estimations. Our hardware implementation focuses on maximizing systolic cell utilization and minimizing data transfer overhead. It is capable of providing high throughput for short queries without sacrificing performance for long queries, and its configuration can be adjusted at run-time by simply writing a register value to the FPGA. We have demonstrated measured performance as high as 31.9 GCUPS for our implementation, and have shown resulting end-to-end speedups up to 216x for the PSS algorithm. Future work can include using FPGAs to accelerate other bioinformatics applications like multiple sequence alignment [49] , exon prediction [50] , gene network identification [51] , and so on.
