Introduction
Scanning protein sequence databases is a common and often repeated task in molecular biology. The need to speed up this task comes from the exponential growth of the biosequence banks: every year their size grows by a factor of 1.5 to 2. The scan operation consists of finding similarities between a particular query sequence and all the sequences of a bank. This operation allows biologists to point out sequences sharing common subsequences. From a biological point of view, it helps to identify similar functionality.
Comparison algorithms whose complexities are quadratic with respect to the length of the sequences detect similarities between the query sequence and a subject sequence. One frequently used approach to speed up this time-consuming operation is to introduce heuristics in the search algorithms [2]. The main drawback of this solution is that the more time-efficient the heuristics, the worse the quality of the results [13].
Another approach to get high quality results in a short time is to use parallel processing. There are two basic methods of mapping the scanning of protein sequence databases to a parallel processor: one is based on the systolisation of the sequence comparison algorithm, the other is based on the distribution of the computation of pairwise comparisons. Systolic arrays have proven to be good candidate structures for the first approach [5, 17], while more expensive supercomputers and networks of workstations are suitable architectures for the second [10]. This paper presents two solutions to high performance database scanning on two new architectures: a hybrid parallel computer and the Fuzion 150.
Hybrid computing denotes the combination of the SIMD and MIMD paradigms within a parallel architecture, i.e. massively parallel processor boards (SIMD) are installed within the processors of a computer cluster (MIMD) in order to accelerate compute-intensive regular tasks. The driving force and motivation behind hybrid computing is the price/performance ratio. Using PC-clusters as in the Beowulf approach is currently one of the most efficient and simple ways to gain supercomputer power for a reasonable price. Installing massively parallel processor cards within each PC in addition can further improve the price/performance ratio significantly. We designed a parallel sequence comparison algorithm to fit the characteristics of the hybrid architecture for a protein sequence database scanning application. We describe its implementation on our hybrid system, which consists of Systola 1024 cards within the 16 PCs of a PC-cluster connected via a Myrinet switch.
Our second solution is based on the Fuzion 150, a parallel computer consisting of a single-chip SIMD array of 1536 processing elements (PEs). Its architecture has been designed to accelerate large-scale visualisation and graphics. We will show that this approach is also beneficial for high performance computational biology.
This paper is organised as follows. In Section 2, we introduce the basic sequence comparison algorithm for database scanning and highlight previous work in parallel sequence comparison. Section 3 provides a description of our hybrid architecture. The parallel algorithm and its mapping onto this parallel architecture are explained in Section 4. Section 5 introduces the Fuzion 150 and Section 6 discusses the corresponding application mapping. The performance of both approaches is evaluated and compared to previous implementations in Section 7. Section 8 concludes the paper with an outlook to further research topics.
Parallel Sequence Comparison
Surprising relationships have been discovered between protein sequences that have little overall similarity but in which similar subsequences can be found. In that sense, the identification of similar subsequences is probably the most useful and practical method for comparing two sequences. The Smith-Waterman (SW) algorithm [18] finds the most similar subsequences of two sequences (the local alignment) by dynamic programming. The algorithm compares two sequences by computing a distance that represents the minimal cost of transforming one segment into another. Two elementary operations are used: substitution and insertion/deletion (also called a gap operation). Through a series of such elementary operations, any segment can be transformed into any other segment. The smallest number of operations required to change one segment into another can be taken as the measure of the distance between the segments.
Consider two strings. To identify common subsequences, the SW algorithm computes a similarity value for each pair of positions in the two strings. The two segments producing this value can be determined by a backtracking procedure. Fig. 1 illustrates an example.
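The dynamic programming computation described above can be sketched as follows. This is a minimal illustration and not the formulation used in our implementations: the function name `sw_score`, the match/mismatch scores and the linear gap cost are assumptions chosen for brevity, whereas protein comparisons normally use a substitution table and separate gap-open and gap-extension costs.

```python
def sw_score(a, b, match=3, mismatch=-3, gap=2):
    """Best local alignment score of strings a and b.

    Assumed scoring (illustrative only): fixed match/mismatch scores
    and a linear gap cost of `gap` per inserted or deleted character.
    """
    prev = [0] * (len(b) + 1)          # row i-1 of the similarity matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # row i, column 0 stays 0
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                    # start a new local alignment
                          prev[j - 1] + sub,    # substitution
                          prev[j] - gap,        # gap operation in b
                          curr[j - 1] - gap)    # gap operation in a
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the matrix are kept because the sketch returns the score alone; recovering the aligned segments by backtracking would require the full matrix.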
The dynamic programming calculation can be mapped efficiently to a linear array of processing elements.
A number of parallel architectures have been developed for sequence analysis. In addition to architectures specifically designed for sequence analysis, existing programmable sequential and parallel architectures have been used for solving sequence problems.
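A sequential simulation may help to show why a linear array suits this calculation. The sketch below assumes one PE per query character and a subject sequence that enters the array one character per step, so that all cells on one anti-diagonal of the matrix are computed in the same step. The function name `sw_wavefront` and the scoring parameters (fixed match/mismatch scores, linear gap cost) are assumptions for illustration.

```python
def sw_wavefront(query, subject, match=3, mismatch=-3, gap=2):
    """Local alignment score computed anti-diagonal by anti-diagonal.

    PE j holds query[j]; in step t it handles subject[t - j].
    The inner loop stands in for PEs that would work in parallel.
    """
    n, m = len(query), len(subject)
    h_prev = [0] * n     # value each PE produced one step ago
    h_prev2 = [0] * n    # value each PE produced two steps ago
    best = 0
    for t in range(n + m - 1):
        h_new = [0] * n  # idle PEs hold 0, which is also the border value
        for j in range(n):
            i = t - j
            if 0 <= i < m:
                up = h_prev[j]                           # own previous cell
                left = h_prev[j - 1] if j > 0 else 0     # neighbour, last step
                diag = h_prev2[j - 1] if j > 0 else 0    # neighbour, two steps ago
                sub = match if query[j] == subject[i] else mismatch
                h_new[j] = max(0, diag + sub, up - gap, left - gap)
                best = max(best, h_new[j])
        h_prev2, h_prev = h_prev, h_new
    return best
```

Each PE needs only its own last value and the last two values of its left neighbour, which is why nearest-neighbour communication is sufficient. The number of steps is the sum of the two sequence lengths minus one instead of their product.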
Special-purpose systolic arrays, e.g. P-NAC, SAMBA and Bioscan [11, 5, 17], can provide the fastest means of running a particular algorithm with very high PE density. However, they are limited to a single algorithm and thus cannot supply the flexibility necessary to run the variety of algorithms required for analyzing DNA, RNA, and proteins. Reconfigurable systems are based on programmable logic such as field-programmable gate arrays (FPGAs), e.g. Splash-2 and Biocellerator [6, 7], or custom-designed arrays, e.g. MGAP [3]. They are generally slower and have far lower PE densities than special-purpose architectures. They are flexible, but the configuration must be changed for each algorithm, which is generally more complicated than writing new code for a programmable architecture.
Our first approach is based on instruction systolic arrays (ISAs). ISAs combine the speed and simplicity of systolic arrays with flexible programmability [8], i.e. they achieve a high performance/cost ratio and can at the same time be used for a wide range of applications, e.g. scientific computing, image processing, multimedia video compression, volume visualisation and cryptography [14, 15, 16]. The second approach is based on the SIMD concept. SIMD architectures achieve a high performance/cost ratio and can at the same time be used for a wide range of applications. Cost and ease of programming fall between the other two classes.
The Hybrid Architecture
We have built a hybrid MIMD-SIMD architecture from generally available components (see Fig. 3). The MIMD part of the system is a cluster of 16 PCs (Pentium II, 450 MHz) running Linux. The machines are connected via a Gigabit-per-second LAN (using Myrinet M2F-PCI32 as network interface cards and Myrinet M2L-SW16 as a switch). For application development we use the MPI library MPICH v. 1.1.2.
For the SIMD part we plugged a Systola 1024 PCI board [9] into each PC. Systola 1024 contains an instruction systolic array (ISA) of size 32 × 32. The ISA [8] is a mesh-connected processor grid, where the processors are controlled by three streams of control information: instructions, row selectors, and column selectors (see Fig. 4). The instructions are input in the upper left corner of the processor array, and from there they move step by step in horizontal and vertical direction through the array. This guarantees that within each diagonal of the array the same instruction is active during each clock cycle. The selectors also move systolically through the array: the row selectors horizontally from left to right and the column selectors vertically from top to bottom. Selectors mask the execution of the instructions within the processors, i.e. an instruction is executed if and only if both selector bits currently in that processor are equal to one. Otherwise, a no-operation is executed.
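The control flow of an ISA can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not a model of the Systola 1024 hardware: we assume that instruction k enters the upper left processor in clock cycle k and advances one processor per cycle, and that the selector bits belonging to instruction k arrive together with it. The name `isa_trace` and the data layout are hypothetical.

```python
def isa_trace(n, instructions, row_sel, col_sel):
    """Map (cycle, row, col) to the operation executed on an n x n ISA.

    row_sel[k][i] and col_sel[k][j] are the selector bits that accompany
    instruction k in row i and column j (assumed layout).
    """
    trace = {}
    for k, instr in enumerate(instructions):
        for i in range(n):
            for j in range(n):
                cycle = k + i + j    # all processors with equal i + j share a cycle
                selected = row_sel[k][i] == 1 and col_sel[k][j] == 1
                trace[(cycle, i, j)] = instr if selected else "noop"
    return trace


# Two instructions on a 2 x 2 array; instruction "B" is masked in row 1.
trace = isa_trace(2, ["A", "B"],
                  row_sel=[[1, 1], [1, 0]],
                  col_sel=[[1, 1], [1, 1]])
```

In the trace, processors (0, 1) and (1, 0) both execute "A" in cycle 1, which reflects the diagonal-wise activity of an instruction, while processor (1, 0) executes a no-operation in cycle 2 because its row selector bit for "B" is zero.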
Every processor has read and write access to its own memory. Besides that, it has a designated communication register (C-register) that can also be read by the four neighbour processors. Within each clock phase, read access is always performed before write access. Thus, two adjacent processors can exchange data within a single clock cycle in which both processors overwrite the contents of their own C-register with the contents of the C-register of their neighbour. The ISA combines the advantages of fine-grained SIMD machines with the capability of efficiently performing special communication operations, so-called aggregate functions. These are associative and commutative functions to which each processor provides an argument value. Examples of aggregate functions are broadcast and ringshift operations along the rows or columns of the processor array. These are the key operations within the algorithm presented in the next section.
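The effect of the two row operations named above can be sketched on a plain list of rows. The functions only describe the result of the operations, not how the ISA realises them with instructions and selectors; the names `row_broadcast` and `row_ringshift` and the shift direction (one position to the right) are assumptions.

```python
def row_broadcast(grid, src_col):
    """Every processor in a row receives the value held in column src_col."""
    return [[row[src_col]] * len(row) for row in grid]


def row_ringshift(grid):
    """Every row is rotated by one position; the last value wraps to the front."""
    return [[row[-1]] + row[:-1] for row in grid]


grid = [[1, 2, 3],
        [4, 5, 6]]
broadcast = row_broadcast(grid, 0)   # column 0 is copied along each row
shifted = row_ringshift(grid)        # each row rotated to the right
```

Both operations act on all rows at once, which is what allows a mesh to be used like a set of linear arrays, or like one long linear array when the rows are chained.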
In order to exploit the computation capabilities of the ISA, a cascaded memory concept is implemented on Systola 1024 (see Fig. 3, right). For fast data exchange with the ISA there are rows of intelligent memory units, called interface processors (IPs), at the northern and western borders of the array. Each IP is connected to its adjacent array processor for data transfer in each direction.
Mapping of Sequence Comparison onto the ISA
The mapping of the database scanning application on our hybrid computer combines two forms of parallelism: a fine-grained parallelisation on Systola 1024 and a coarse-grained parallelisation on the PC-cluster. While the Systola implementation parallelises the cell computation in the SW algorithm, the cluster implementation splits the database into pieces and distributes them among the PCs using a suitable load balancing strategy. We will now describe both parts in more detail.
Systolic parallelisation of the SW algorithm on a linear array is well-known. In order to extend this algorithm to a mesh architecture, we take advantage of the ISA's capability to perform row broadcast and row ringshift efficiently. Since the length of the sequences may vary (several thousands in some cases, although commonly the length is only in the hundreds), the computation must also be partitioned on the ISA. For the sake of clarity we first assume the processor array size to be equal to the query sequence length. Fig. 5 shows the data flow in the ISA for aligning the sequences.
If the query sequence is longer than the processor array, the computation proceeds in several passes, and for each pass the next block of characters of the query sequence is loaded. The data stored previously is loaded together with the corresponding subject sequences and sent again into the ISA. The process is iterated until the end of the query sequence is reached. Note that no additional instructions are necessary for the I/O of the intermediate results with the processor array, because it is integrated in the dataflow (see Fig. 5).
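The iteration over parts of the query sequence can be sketched sequentially. The sketch assumes that the intermediate results carried between passes are the similarity values of the last query position of the previous part, one per subject position, which is sufficient for a linear gap cost. The name `sw_partitioned`, the parameter `block` (standing for the number of PEs) and the scoring values are assumptions for illustration.

```python
def sw_partitioned(query, subject, block, match=3, mismatch=-3, gap=2):
    """Local alignment score with the query processed `block` characters at a time."""
    m = len(subject)
    boundary = [0] * (m + 1)     # values left of the first part are all zero
    best = 0
    for start in range(0, len(query), block):
        part = query[start:start + block]   # characters loaded for this pass
        left = boundary                      # intermediate results fed back in
        for qc in part:
            col = [0] * (m + 1)
            for i in range(1, m + 1):
                sub = match if subject[i - 1] == qc else mismatch
                col[i] = max(0,
                             left[i - 1] + sub,   # substitution
                             left[i] - gap,       # gap operation
                             col[i - 1] - gap)    # gap operation
                best = max(best, col[i])
            left = col
        boundary = left          # stored until the next part is loaded
    return best
```

The result does not depend on `block`, so a query of any length can be handled by an array of fixed size at the cost of one pass over the subject sequence per part.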
To distribute the computation among the PCs we have chosen a static split load balancing strategy: a similar-sized subset of the database is assigned to each PC in a preprocessing step. The subsets remain stationary regardless of the query sequence. Thus, the distribution has to be performed only once for each database and does not influence the overall computing time. The input query sequence is broadcast to each PC and multiple independent subset scans are performed on each Systola 1024 board. Finally, the highest scores are accumulated in one PC. This strategy provides the best performance for our homogeneous architecture, where each processing unit has the same processing power. However, a dynamic split load balancing strategy as used in [10] is more suitable for heterogeneous environments.
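A possible preprocessing step for the static split is sketched below. How the subsets are formed is not prescribed by the strategy itself; the sketch assumes a greedy rule (longest sequence first, assigned to the currently least loaded PC), which is our choice for illustration. The names `static_split` and `merge_top_scores` are hypothetical, and the final accumulation is shown as a plain function rather than as MPI communication.

```python
import heapq


def static_split(lengths, num_pcs):
    """Assign sequence indices to PCs so that the total length per PC is similar."""
    heap = [(0, pc) for pc in range(num_pcs)]    # (current load, PC id)
    heapq.heapify(heap)
    subsets = [[] for _ in range(num_pcs)]
    # Longest sequences first, each to the least loaded PC (assumed greedy rule).
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        load, pc = heapq.heappop(heap)
        subsets[pc].append(idx)
        heapq.heappush(heap, (load + lengths[idx], pc))
    return subsets


def merge_top_scores(per_pc_scores, k):
    """Combine per-PC lists of (score, sequence id) into the k best overall."""
    return heapq.nlargest(k, (s for scores in per_pc_scores for s in scores))


lengths = [5, 3, 3, 2, 2, 1]
subsets = static_split(lengths, 2)
loads = [sum(lengths[i] for i in subset) for subset in subsets]
top = merge_top_scores([[(10, "a"), (3, "b")], [(7, "c")]], 2)
```

Because the split depends only on the sequence lengths, it can be computed once per database and reused for every query.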
The Fuzion 150 SIMD Architecture
Early SIMD architectures suffered to some extent from the small amount of area available for each PE, e.g. [3, 4]. The increase in integration on ICs now allows a large SIMD array, with local PE memory and controllers, on a single die. The Fuzion 150 system architecture shown in Fig. 6 provides a general-purpose processing solution for many application areas including network processing and large-scale visualisation and graphics [1, 12]. It combines control and data processing on the same silicon, whilst taking account of the very different processing requirements of the control and data planes. The control plane and housekeeping operations are performed on the embedded processing unit (EPU), which is a 32-bit ARC core. Data plane operations utilize the PEs in the Fuzion core.
The processor array is made up of six blocks of PEs. 
Mapping of Sequence Comparison on the Fuzion 150
Unlike Systola 1024, the Fuzion 150 provides no fast ringshift operation. Hence, we use a different mapping scheme for the SW cell computation. The communication pattern in each iteration step now consists of two steps: a read-left operation in even-numbered PE blocks and a read-right operation in odd-numbered PE blocks (see Fig. 7).
For the detailed description we first assume the query sequence length to be equal to the processor array size of 1536 PEs. Fig. 7 shows the data flow for aligning the sequences.
So far we have assumed a processor array equal in size to the query sequence length. In practice, this rarely happens. If the query sequence is shorter or longer than the array, our implementation is modified accordingly. The required data transfer time is totally dominated by the computing time per iteration step on both architectures.
[...] for Systola 1024. Extrapolating to this technology, both approaches should perform equally. However, the difference between the two architectures is that the Fuzion 150 is purely a linear array, while Systola is a mesh. This makes the Systola 1024 a more flexible design, suitable for a wider range of applications, see e.g. [14, 15, 16].
For the comparison of different parallel machines, we have taken data from Dahle, Grate, Rice and Hughey [4] for a database search with the SW algorithm for different query lengths. The Fuzion 150 is three to four times faster than the much larger 16K-PE MasPar. The 1-board Kestrel [4] is six times slower than a Fuzion 150 chip. Kestrel's design is also a programmable fine-grained SIMD array. It reaches a lower performance because it has been built with older CMOS technology. SAMBA [5] is a special-purpose systolic array for sequence comparison implemented on two add-on boards, which is around five times slower than the Fuzion 150.
Conclusions and Future Work
In this paper we have demonstrated that fine-grained parallel architectures are suitable solutions for high performance scanning of biosequence databases. We have presented efficient mappings of the SW algorithm on an ISA and a SIMD array that lead to high-speed implementations on Systola 1024 and Fuzion 150. By combining the fine-grained parallelism with a coarse-grained distribution within a Beowulf PC-cluster an even higher performance can be achieved at a good price/performance ratio.
The exponential growth of genomic databases demands even more powerful parallel solutions in the future. Because the comparison and alignment algorithms favoured by biologists are not fixed, programmable parallel solutions are required to speed up these tasks. As an alternative to special-purpose systems, hard-to-program reconfigurable systems, and expensive supercomputers, we advocate the use of specialised yet programmable hardware whose development is tuned to system speed.
Our future work in hybrid computing will include identifying more applications that profit from this type of processing power consisting of a combination of fine-grained and coarse-grained parallelism, like scientific computing 
