Summary: Aligning hundreds of sequences using progressive alignment tools such as ClustalW requires several hours on state-ofthe-art workstations. We present a new approach to compute multiple sequence alignments in far shorter time using reconfigurable hardware. This results in an implementation of ClustalW with significant runtime savings on a standard off-the-shelf FPGA.
INTRODUCTION
Progressive alignment is a widely used heuristic method to compute multiple sequence alignments (MSAs). Progressive alignment basically consists of three steps. First, a distance value between each pair of input sequences is computed (pairalign). Second, a phylogenetic tree is calculated based on the distance matrix (guided tree). Finally, pairwise alignment of various profiles is carried out following the branching order in the phylogenetic tree to form the final MSA (malign). A popular tool based on this method is ClustalW (Thompson et al., 1994) . Unfortunately, progressive alignment programs suffer from high-computational complexity, for instance the alignment of a few hundred protein sequences using ClustalW requires several hours on a state-of-the-art workstation. Owing to the rapid growth of sequence databases, biologists would like to compute MSAs of an increasing number of sequences in reasonable time.
Consequently, several parallel implementations of ClustalW have been developed to speed up this time-consuming task. The solutions recently presented by Li (2003) and by Ebedes and Datta (2004) use message-passing on a PC cluster. Parallel ClustalW implementations have also been designed for more expensive shared memory machines (Mikhailov et al., 2001; Duzlevski, 2002 , http:// bioinfo.pbi.nrc.ca/clustalw-smp/).
The parallelization strategy taken in this paper is based on field programmable gate arrays (FPGAs). FPGAs provide a flexible platform for fine-grained parallel computing based on reconfigurable hardware. Since there is a large overall FPGA market, this approach has a relatively small price/unit and it also facilitates regular upgrading to FPGAs based on state-of-the-art technology. The usefulness of this * To whom correspondence should be addressed. technology for fast sequence analysis has already been recognized (Marongiu et al., 2003) . Previous work has mainly focused on high speed homology searching (see e.g. Huang, 1993; Yamaguchi et al., 2002) .
METHODS
Profiling the three stages of ClustalW for different numbers of input sequences reveals that >90% of the overall runtime is spent on the first stage (pairalign). Hence, we have decided to accelerate only this stage using reconfigurable hardware.
Our mapping strategy uses a fine-grained parallelization of the pairwise distance computation. ClustalW derives the distance score of a pair of sequences from the number of matching characters in their optimal local alignment. Sequentially, this score can be computed by locating the maximum value in the Smith-Waterman dynamic programming (DP) matrix and then performing a trace back procedure to produce the corresponding alignment. However, the trace back operation is far too control intensive for efficient FPGA acceleration. We have therefore extended the Smith-Waterman recurrence relation to include the counting of exact matches during computation of the DP matrix.
The DP calculation exhibits a high regularity and can be efficiently mapped onto a linear array of small processing elements (PEs). One PE is assigned to each character of a (primary) sequence. A (secondary) sequence is then systolically shifted through the linear chain of PEs (Fig. 1) . Each PE is capable of computing one value for the DP matrix within each clock cycle. Amino acid sequences are usually longer than the number of PEs that can fit on a reconfigurable hardware chip. Therefore, we have partitioned the computation on a fixed-sized processor array. The partitioning is implemented by means of a loop FIFO memory that connects the right PE with the left PE.
RESULTS
We have described the hardware design in Verilog. Using a Xilinx Virtex II XC2V6000, we are able to accommodate 92 PEs at a maximum allowable clock speed of 34 MHz. Our implementation achieves a sustained performance (including all data transfer) of ∼1 GCUPS (billion cell updates per second in the DP matrix), a T.Oliver et al. measure commonly used to compare parallelized Smith-Waterman implementations. A set of performance evaluation tests have been conducted using a different number of globin sequences to evaluate the processing time of the FPGA-accelerated implementation versus that of the sequential ClustalW code. The PCI based ADP-WRC-II board from Alpha-Data with a Xilinx XC2V6000 FPGA has been used in the tests. The ClustalW application is benchmarked on an Intel Pentium IV 3 GHz processor with 1 GB of RAM. The results for this are shown in Table 1 . We have also set-up a web server for ClustalW running on this hardware configuration, which is available online at http://beta.projectproteus.org. The FPGA-accelerated pairwise alignment stage achieves speedups between 45 and 50. At least the same number of PCs connected by a fast switch is required to achieve a similar speedup using the ClustalW-MPI code from Li (2003) . A comparison of these two parallelization approaches shows that reconfigurable hardware acceleration is superior in terms of price/performance. Our solution is also easily scalable to next-generation FPGAs such as the Virtex-4 family by simply increasing the number of PEs in our design.
Table 1 also shows that the progressive alignment stage of ClustalW (malign) dominates the runtime of the FPGA-accelerated ClustalW. This stage comprises computation of several profileprofile alignments based on DP. It will be interesting to investigate how this can be efficiently mapped onto reconfigurable hardware. The first stage of several other MSA tools such as T-Coffee (Notredame et al., 2000) and MUSCLE (Edgar, 2004) is also based on the computation of pairwise distances. Hence, these tools will see a similar speedup from the accelerator presented in this paper.
Conflict of Interest: none declared.
