Abstract
Introduction
Over the last few years, reconfigurable computing has proved to be a very attractive solution for implementing compute intensive bio-computing algorithms. FPGA implementations of Smith and Waterman [15] or BLAST [1] algorithms have exhibited impressive speedup factors, making them a very viable alternative to more expensive supercomputing infrastructures such as vector computers or PC clusters.
Profile based Hidden Markov Models (HMM) have been recently used by biologists to predict the structure and function of a protein directly from its representation as an amino-acid sequence [8] . Among existing software implementations of this model, the HMMER software package is one of the most widely used.
We propose to speed-up in hardware the most time consuming routine of the hmmsearch tool [4] of HM-MER, namely the P7Viterbi procedure which computes the score between a HMM and an observed sequence. We do so by using powerful linear space-time mappings based on the so-called polyedral model. This leads to a flexible parallel architecture template which handles the feedback loop present in the Plan7 HMM model used by HMMER. This paper is organized as follows. Section 2 briefly introduces the principles of the profiling based HMM algorithm and surveys related work on accelerating this algorithm both in hardware and software. Section 3 presents the principles of our parallelization schemes, and Section 4 details the space-time mapping refinements which lead to the final hardware architecture. Section 5 gives experimental results in terms of resource usage and performance improvement. Section 6 concludes and presents future work directions. Fig. 1 depicts the Plan7 HMM used in HMMER. It is a probabilistic model of a family of sequences (see [8] for more details) which contains three types of states: match states (square nodes), insertion states (diamond nodes) and deletion states (circle nodes). Running the hmmsearch tool consists in matching a single profile HMM against a large number of input sequences, and in finding the sequences having high similarity with this HMM.
Background

Principles
The most time consuming routine in this program is the P7Viterbi kernel. This routine computes a similarity score between the target HMM model and the sequence at hand using dynamic programming. A simplified version of P7Viterbi is given in Fig. 2 
Matching a single HMM against a protein database is a very time consuming process which is repeated many times during intensive comparisons. Profiling shows that the P7Viterbi kernel accounts for more than 97 % of the execution time. It is therefore a perfect candidate for hardware acceleration.
Related Work
There have been several attempts to accelerate HM-MER using SIMD features of modern CPUs [9] , parallel machines [16] , GPUs [7] and even Network Processors [17] .
Some authors have proposed to speed-up HMMER by using reconfigurable hardware. Maddimsetty et al. [10] have studied different hardware-software partitioning schemes, but they do not present actual hardware implementations. Oliver et al. [12] propose an FPGA implementation of an HMM profile application and claim speed-up factors up to two orders of magnitude. But both [10, 12] consider a simplified version of the HMM model without feedback loop, in order to simplify the parallelization of the algorithm. We believe that such an approach is very unlikely to match biologists needs, since this simplified HMM might miss some potentially insteresting match. Besides, experience has shown that whenever an algorithm emerges as a standard (as it is now the case for HMMER), it is very difficult to convince its users community to accept algorithmic variations even though they run much faster.
In a recent work, Oliver at al. [13] propose an architecture which handles the P7Viterbi feedback loop. Their approach is quite similar to the wavefront schedule we detail in Section 4.2. However, instead of duplicating the look-up table memory as we do, they use a crossbar network architecture to distribute the look-up memory among the PEs. This approach is not scalable with the number of processors and is therefore limited to a small number of PEs.
In addition to adhere strictly to the original algorithm, the hardware realization that we target should also be scalable so as to be easily retargetable to different FPGA hardware platforms with varying amount of resources such as logic cells and memory blocks. Such constraints can hardly be handled by a fixed manual design, and this is why we propose instead to use an architecture template which can be seamlessly specialized to suit the user's needs. To derive this template Figure 2 . Simplified P7Viterbi code kernel we rely on well-known parallelization techniques based on the so-called polyedral model which we introduce in the following section.
Parallelizing the hmmsearch Software
The P7Viterbi kernel (Fig. 2) consists of a double nested loop that implements the Viterbi dynamic programming algorithm. The loop carried dependencies in the inner loop prevents from parallelizing the execution using classical loop unrolling techniques, as noticed in [10] . Dependencies in the outer loop prevent loop interchange, which in turn forbids the parallel execution of the outer loop. Although it is still possible to take advantage of the instruction level parallelism available within the loop body, this would expose to few parallelism: in practice, FPGAs must be able to run several hundreds of operations per cycle to outperform CPUs.
For the sake of conciseness, we will use in Section 3 and 4 our compacted version of the P7Viterbi loop nest which contains exactly the same features -datadependencies, indirect addressing, and existence of a feedback loop -as the original algorithm: the following results can therefore be applied to the original algorithm without loss of generality. The remaining of this Section shows how, by expressing the loop nest as a System of Affine Recurrent Equations (SARE), it is possible to derive an efficient parallel realization of the algorithm.
Expressing P7Viterbi as a SARE
The first step toward parallelization is to express the loop nest as a System of Affine Recurrent Equations, an intermediate representation that exposes data de-pendences at the loop level. Let L denote the protein sequence length and M the HMM profile length. The SARE corresponding to our compacted P7Viterbi algorithm is given below:
This SARE is defined over the domain D given by
and we let x i,k = −∞ and y i,k = −∞ for i = 0 and/or j = 0. From (1), we can retrieve data dependencies between indexed variables. Denote p1 δ p2 if point p2 depends on point p1. These dependencies are summarized below:
Managing Arbitrary Sized Problems
We observe that several calls to P7Viterbi using the same HMM profile are done by the hmmsearch tool and we thus propose to merge all the sequences into a single macro sequence. Each sequence is delimited by a special character, the role of which is to reset the matching scores to their initial −∞ value and to indicate that a new Viterbi algorithm instance is to take place.
On the other hand, even though the length of the HMM profile M remains constant for a given execution of hmmsearch, it can be different for each hmmsearch execution instance (hmmsearch motif length vary between 50 and 650, with an average size of 200). We must therefore design an architecture capable of handling arbitrary sized HMM motifs. This can be done by simply inserting idle states in the HMM motif. The role of these idle states is to propagate the scores of the last non-idle state until it reaches the feedback loop. In the rest of the paper, we will therefore consider that constant M denotes the maximum allowed model size of our architecture.
Linear Space-Time Mappings
Given a system of recurrent equations similar to the one presented in Equ.
(1), we want to derive a spacetime mapping that is a linear transformation which gives, for any indexed variable in the SARE:
• A logical execution time instant, in the form of a linear function of the index (which we call schedule), written as
• A physical location, i.e. coordinates in a processor space. This location is also expressed in the form of a linear function of the index (that we call allocation function). In our case we are only interested in linear arrays. We therefore write
Of course, this space-time mapping must satisfy several conditions.
• First, the chosen schedule must enforce all data dependencies present in the SARE. For u, v ∈ D with u δ v, the schedule function must guarantee that s(v) > s(u).
• Then, the space-time mapping must be conflictfree: there must be no u and
Proving that a schedule is conflict-free when each PE executes computations in a one dimensional domain is relatively straightforward (see [14] ), but this is more involved for higher dimensional domains. To solve this problem, we use Darte et al. [2] results on juggling schedules, which can be summarized as follows. Given a rectangular domain P in Z m+1 defined as :
a schedule is said to be juggling if it has the following properties:
• It is conflict-free, i.e. there are no two distinct iterations (i 0 , . . . , i m ) and (i 0 , . . . , i m ) in P which are scheduled at the same time instant.
• 
Notice that we may have used well established results on multi-dimensional schedules [5, 6] to find conflictfree schedules, but this would lead to more complex proofs in our case where the domains are hyperparallelepipeds.
In the next Section, we evaluate and refine several space-time mappings in order to obtain the best possible implementation.
Design Space Exploration
Adding a Dimension to the SARE
Although loop level parallel schedules cannot be found in the P7Viterbi routine, we observe that running the hmmsearch tool consists in completely independent matchings of a single HMM against a large number of input sequences: these matchings are completely independent and can therefore be run in parallel.
We model this additional parallelism by adding to the SARE a new index which identifies instances of the kernel running in parallel. Call j this additional index. The new iteration domain D is then:
where N stands for the number of sequences that are matched in parallel during each realization of the modified SARE, which is shown below:
Denote
the scheduling function of this new SARE. The intrinsic data-dependencies remain unchanged by the transformation (all P7Viterbi instance are independent), therefore we can write the constraints on the scheduling function as:
This lead to many possible schedules, among which only two really deserve attention: a wavefront and an interlaced schedule.
Wavefront Space-Time Mapping
In this approach we use the space-time mapping given by
which obviously enforces data dependencies and is conflict-free. It is illustrated (for M = 4, N = 4 and The wavefront space-time mapping is a very natural parallelization scheme in which each PE executes a distinct P7Viterbi kernel instance, and the chosen value for parameter N directly controls the number of PEs in the architecture. Although these P7Viterbi instances share the same HMM parameters values, we must account for the fact that variable TAB in (5) is used as a lookup table with seq i,j and k as indices. At a given time instant t, all PEs share the same value of i, k, therefore they must access the same subset TAB[*][k]. Moreover, each PE accesses the whole table across time. As a consequence, all PEs must either hold a copy of the TAB variable, or they must access a shared copy of it. In the full P7Viterbi kernel the size of TAB is quite large: for an average HMM model size of 250, each processor requires a 10000 × 32 lookup table: this would severely limit the number of PEs that can be implemented.
One solution to this problem is to use a crossbarlike interconnect structure to distribute the table content (see subsection 2.2). Instead, we propose another space-time mapping which allows the TAB variable to be distributed among the PEs, thereby reducing the architecture memory footprint to its minimum, while avoiding the need for a complex crossbar interconnect structure (as opposed to Oliver et al. [13] 
Interlaced Space-Time Mapping
This improved space-time mapping is given in Equ. (8) , and is illustrated in Fig. 4 : The reader can check that this mapping enforces data-dependencies and is conflict free if and only if M = N . This results in a mapping in which N = M PEs are running in parallel. Since P E p only executes iterations for which k = p, the hardware cost of the lookup On the other hand, we have no control over the resource usage of our architecture, since the number of PEs must be equal to the HMM model size M . This is again a severe limitation, since it is very unlikely that such architecture can be implemented on a FPGA even for moderate size model (M ≈ 50).
Managing Resource Constraints
To overcome this new problem, we propose to use LSPG (Locally Parallel Globally Sequential) partitioning [11] . (LPGS -Local Parallel Globally Sequential, -partitioning [2] is of no interest here because of the feed-back loop: it would not allow partitioning along dimension k in order to reduce the number of processors in the architecture.) The LSGP partitioning transformation consists of two steps. We first consider the processors of the initial space-time mapping as being virtual, and we tile the virtual processor space. Each tile of the virtual processor space is then mapped to a single physical processor which executes in turn the calculations associated with all virtual processors of the tile.
As far as the SARE is concerned, tiling the virtual processor domain amounts to replacing the processor space index k by two new indexes (v 
To cope with this transformation, the SARE is rewritten by using indices (i, j, v, p) instead of (i, j, k) and by modifying the data-dependencies accordingly. For example, data dependency between (i − 1, j, k) and (i, j, k) now holds between points (i − 1, j, v, p) and (i, j, v, p) and is thus still uniform. When a dependency spans the k index, it leads to more complex situation. For example, consider the dependency between (i, j, k− 1) and (i, j, k). Depending on whether the values of k and k − 1 belongs to the same tile (i.e lead to the same value of p), the original dependency is then transformed into two distinct data dependencies:
The partitioned SARE thus becomes:
We now can use the following space-time mapping:
We can show that this mapping juggles for the domain D " if we choose N = M . It is illustrated in Fig. 5 where the black dots correspond to the iteration subspace allocated to the first PE. Here σ = 2, and therefore M σ = 2 iterations are executed at a given time step.
Pipelining the PE Datapath
Although the architecture of Section 4.4 allows for a resource constrained implementation, it still requires that a whole loop iteration be executed within a single clock cycle. Given that the complete P7Viterbi loop body contains more than 20 arithmetic operations among which 7 lie in the critical path, the final maximum clock frequency of our design is likely to be disappointing. One solution is to modify the scheduling so that the pipelining of the loop body execution becomes possible (see our previous work [3] ). This is achieved by applying another tiling transformation along axis j. Let us write j = λjj + l, with jj and l being the two new index. The new SARE is given in Equ. (14), and its variables are defined over the domain:
Using the following space-time mapping:
(15) we can show that this mapping juggles iff 0 ≤ jj < N and 0 ≤ l < λ when N and λ satisfy λN = M . This space-time mapping is illustrated in Fig. 6 (with σ = 2 and λ = 2).
The execution of two dependent iterations is now separated by at least λ cycles (for dependency (i, jj, l, σ, p − 1) δ (i, jj, l, v, p) ), and by at most λσM + 1 cycles (for dependency jj, l, v, p) ). By carefully selecting λ, it is therefore possible to find the best tradeoff between resource cost (each PE must implement a λM + 1 deep delayline) and the operating frequency (the data-path corresponding to the loop body can be implemented with λ pipeline stages). 
Resulting architecture
Using the mapping of section 4.5, we obtain an architecture which consists of a linear array of M processors as illustrated in Fig. 7 . All PEs in the array have local interconnect except for the last one (P E M ) which broadcasts part of its results (this broadcast corresponds to the HMM feedback loop). PEs of this array are active every cycle, except for P E M in which the variable y i,M is only updated (and broadcasted) every σ cycles.
From the schedule, we can derive the depth of the delay lines associated to each data dependency in a PE. This leads to a PE internal architecture shown in Fig.8 , where functions f 1 and f 2 are implemented as a λ stage pipeline datapath (for the sake of readability, the data pipelines for the Amino Acid sequence seq i,j and the HMM parameters hmm k pipelines are not represented. Table. 1 summarizes the characteristics of the various space-time mappings for the full P7Viterbi kernel as a function of the HMM model size (M ), the design parameters (σ,λ,N ) and the number of distinct bases in the Amino Acid set (here 25).
Experimental Results
The hardware resource usage (in terms of logic cells and memory) of our hardware accelerator is highly dependent on the size M of the HMM model at hand, on Although the original P7Viterbi uses 32 bits integer arithmetic, it is possible to reduce the datapath bitwidth to 24 bits without effect on the final results. As a consequence, this benchmark considers scores and parameters encoded as 24-bit integer values.
For a fair comparison, we implemented the architecture corresponding to the wave-front schedule. This architecture was obtained by generating an instance of our architecture in which we set σ = M and λ = 1. The results are summarized in Table 2 . We can observe that the resource bottleneck is the limited amount of embedded memory blocks on the FPGA, which strongly affects the number of PEs which can actually be implemented on the device, and the level of pipelining in the datapath.
Performance figure for hmmsearch are generally measured in Million Cells Update per Second (MCUP/s), where a cell update corresponds to an iteration of the Viterbi algorithm. In our implementation, each PE performs up to 1 iteration per cycle, whith clock frequencies ranging from 33 MHz to 66 MHz. This is to be compared to a reported software performance of 24 MCUP/s on a Intel P4 CPU (see Oliver et al. [12] ).
In short, an architecture consisting of 10 PEs should outperform a desktop CPU by a factor of almost thirty.
However, this performance gain is balanced by the following observation: when using a HMM profile of size M 0 with an architecure which accomodates a maximum model size of M , there is only a fraction M 0 /M of the computations that actually contribute to the matching score. For example for M 0 = 50 and M = 200, our architecture performs only one useful computation every four cycles. This strongly encourages the use of dynamic reconfiguration, in order to adapt at run-time the architecture to the HMM profile at hand. In our case, this would consists in using an architecture instance with a value of M suited (i.e close) to the size M 0 of the query HMM. This is made possible thanks to our flexible architectural model which can be seamlessly parameterized to obtain a set of hardware configurations with different values for M .
Conclusion
In this paper, we have presented an original parallelization scheme for the P7Viterbi algorithm, which represents the most time consuming kernel of the hmmsearch application. This parallelization scheme is based on the polyedral model and allowed us to derive a simple yet flexible parallel architecture for accelerating the hmmsearch program. Our ongoing work includes an in depth analysis of the P7Viterbi precision requirements, and a system level exploration of the trade-off between performance and Quality of Results, under the constraint that no false negative should be tolerated.
