Abstract
Introduction
The performance of proteic and nucleic sequence comparison algorithms is one of the key issues for sequence data bank scanning. Using the regular Smith-Waterman (SW) algorithm, the actual levels of performance of processors used in workstations varies from 2 to 10 million comparisons per second.
There are some specialized hardware solutions, like BIOCCELERATOR (Ltd., 1996) or SAMBA (Lavenier, 1996) , which can give performance of more than two orders of magnitude higher than workstations, but these solutions also have some drawbacks: high cost, difficult programming and debugging, hardware that cannot be re-used for other purposes. This makes the alternate, pure software implementation, INRIA, BP 105, 78153 Le Chesnay Cedex, France possibly running in parallel on many workstations, an attractive solution. This is the case of LASSAP developed at INRIA (Glemet and Codani, 1996) .
From the execution profile of LASSAP sequence comparison with the SW algorithm, we found that the program spends 99% of the time in the SW algorithm procedure. So every improvement in the speed of this procedure will give almost the same improvement at the application level.
In the following section, we describe our implementation of the algorithm; the next section presents benchmarks of real data bank scanning. The final section presents the conclusions. Figure 1 presents the basic affine-scoring SW (Smith and Waterman, 1981) algorithm expressed in C language.
Algorithm implementation
Two protein sequences, seqA and seqB, of length respectively lena and lenb, are compared using the scoring matrix matrix and two parameters gap_open and gap_ext, representing the penalties for opening and extending a gap, respectively.
One can see this algorithm as a matrix of calculi of size lena x lenb, where one letter of the sequence seqB (outer loop) is compared against all letters of the sequence seqA (inner loop). For each comparison, three intermediary results are produced in each iteration: nogap, a_gap and b_gap, i.e. perfect alignment and penalties for opening and extending the gap on seqA and seqB sequences, respectively. The final value of score is computed as the maximum of all the values found in each iteration. Only two vectors of intermediate results nogap [lena] and b_gap [lena] need to be stored. The overall complexity of the SW algorithm is O(lena x lenb).
Owing to growth of the size of databanks, scanning using the SW algorithm can take hours on a single workstation. Moreover, emerging large-scale analysis such as building families of sequences leads to inter-and intra-data bank comparisons. This process can take weeks of computation.
Execution speed-up can be obtained using multiprocessors or clusters of workstations because sequence comparisons are independent. In this article, we present a parallelization technique at the processor instruction level to speed-up execution of the SW algorithm further.
Two potential kinds of parallelization can be considered: for sequence-to-bank comparison and for sequence-tosequence comparison. The first is based on lack of dependency between two consecutive comparisons. Thus, one sequence can be compared against two or more sequences at the instruction level using the same control structure. Of course, each compared sequence has its own set of intermediate variables. An implementation using similar principles is described in Alpern et al. (1995) .
The above approach is limited to pure databank scanning. It is not possible, for example, to implement easily sequence shuffling procedures to compute Z-scores (Lipman et al., 1984; Pearson and Lipman, 1988) . Furthermore, it cannot take advantage of LASSAP's programming capabilities which allow combination 'on the fly' of pairwise comparison-based algorithms. As an example, LASSAP implements an algorithm called blsw which computes, for every pair of sequences, SW alignments depending on the results of blast (Altschul et al., 1990) .
Our approach is to parallelize at the single sequence comparison level. This preserves efficiency in every case. One can see from Figure 1 that results in the (i,;)th iteration depend on (/' -1,)), (/ -l,j -1) and (i,j -1) values. This can be visualized as vertical, diagonal and horizontal dependencies ( Figure 2 ).
If we shift the computation of the second row one place to the right, rows 1 and 2 can be executed in parallel, with only a small overhead of two steps: one on the beginning of the inner loop and the second on its end.
If the length of the sequence seqB is odd, on the last step only one row is computed. As the values of nogap[j] and b_gap(j] computed in the first row are used by the second one, we can transfer these values locally, dividing by two the memory band width for storing and loading these values.
To make the computation uniform, one can add one dummy symbol at the beginning and at the end of the sequence seqA, and a dummy symbol to sequence seqB in case its length is not even. This scheme can be extended to parallel computation of an arbitrary number of rows. vector argument and puts at the rightmost place its second integer argument. All' + ' and '-'are vector add and subtract operators, respectively.
Although implementable in standard integer instruction set, this parallelization scheme gives little performance improvement over the basic implementation (see benchmarks on Figure 6 ).
Substantial speed-up can be achieved using special, videooriented instructions like VIS (Visual Instruction Set) found in the SUN ULTRA SPARC processor (Sun Microelectronics, 1996a) .
VIS instructions are executed in specially enhanced floating point unit (FPU) and use the 64-bit registers of the FPU. Instructions operate on two 32-bit, or four 16-bit integer data packed in a 64-bit double word. These instructions include arithmetic ADD, SUBTRACT, COMPARE and MULTIPLY, logic AND, OR, XOR, NOT, and data conversion and manipulation. One can add in one instruction two sets of four 16-bit integers and get four 16-bit results, or multiply them by four 8-bit integers. Details can be found in Sun Microelectronics (1996b).
We can use VIS instructions to execute in parallel four rows of the SW algorithm. The implementation of most operations from Figure 3 using VIS inline interface is straightforward. We replace type Vint4 by type vis_d64 which is actually a vector of four 16-bit integers packet in a 64-bit word. VMAX operation is replaced by VIS_MAX_INT from Figure 5 . Add, subtract and VSHIFT are replaced, respectively, by vis_fpaddl6, vis_fpsubl6 and vis_faligndata instructions.
The ULTRA SPARC VIS compare instructions compute a 4-bit mask of result, which can be used by a selective store instruction (from four 16-bit words only these with corresponding status bits set to 1 are stored). Unfortunately, we need many comparisons with re-use of the intermediate results which implies reload of the stored values. This is time and memory band width consuming, and decreases effective execution speed.
Instead, we use maximum integer value calculation based on integer code presented on Figure 4 . It computes the maximum of two signed integer values int_max and int_val, the result is stored in int_max. This code does not use conditional branches, so it avoids potential jump misprediction and flushing of pre-decoded, partially executed or speculatively executed instructions on some CPU.
The result is correct as long as there is no overflow in the computation of the difference (int_max -int_val) so the method can be used safely on a restricted range of arguments. Figure 5 shows the implementation of the above method for packed 16-bit integers in VIS instructions. There is no arithmetical bit-shift instruction in VIS, so the mask is generated with the multiplication instruction vis_fmul8ulxl6. As this multiplication is performed on a separate unit and the graphic processor is a double-issue engine, there is little if any performance degradation.
The ULTRA SPARC graphics unit is fully pipelined and has a latency of three clock cycles for multiplication. So the result can be used by another instruction only three clock cycles later without stalling the CPU.
In our implementation, we process eight rows at once, so another multiplication can be started in the following cycle, keeping the CPU busy for maximum performance. The large set of 32 64-bit registers in the graphics unit helps keep all the computations local.
The inner loop is executed (lena+7)/8 times. As we mentioned before, we add seven dummy symbols at the beginning and at the end of the sequence A, and the sequence B is extended to a multiple of eight. Sixteen-bit signed precision is acceptable for the comparison of proteic and nucleic sequences. Indeed, it allows one to compute a score up to 32 767 which represents on average a perfect alignment of length -5000 with BLOSUM62 matrix. This length can be greater if mismatches and gaps are present.
The implementation guarantees saturation on the score of 32 767 (see below for details of the implementation). In case of saturation, one can compute the exact score with a standard integer implementation of the algorithm. In practice, such alignments with very high scores can be detected with faster algorithms such as fasta (Pearson and Lipman, 1988) and need, anyway, deeper studies due to their lengths. We believe that our implementation covers most of the needs when using the SW algorithm.
The result (i.e. score value) is correct up to 32 767 and then saturates at this value. The proof of correctness is as follows. The only case of overflow in the subtraction of signed integer values is when the two operands are of opposite sign and their difference is greater in magnitude than the maximum representable value (32 767, -32 768 for 16-bit integers).
In our case, the most negative value of a_gap and b_gap is -(gap_open + gap_ext). The value of nogap is always positive so the error condition could arise only when one argument of the maximum value calculation approaches 32 767 and the other one is negative.
From the algorithm in Figure 1 , one can see that at any point of the calculation the values of nogap, a_gap and b_gap can change only by limited steps based on scoring matrix values, gap_open and gap_ext. The absolute value of the difference between a_gap or b_gap and last_nogap arguments of the MAX function (1) is bound by the expression:
where max, /mn(matrix) are the extreme substitution costs of the scoring matrix. As this value is small (usually < 200) compared to 32 767, the values of nogap, a_gap and b_gap are strictly positive when the score is reaching the saturation value.
Benchmarks
We have implemented the SW algorithm from Figure 1 both in integer (32-bit) instructions and in VIS (16-bit) for an ULTRA SPARC machine. We have chosen four typical sequences of length 104, 319 and 52J7, respectively. The execution speed results are presented in Figure 6 .
The VIS implementation of the algorithm for a SPARC ULTRA machine is twice as fast as the integer code giving the figure of 18 Next, the code was integrated into LASSAP's software package and the real application of scanning a sequence of length 300 amino acids against SWISSPROT R33 (18531385 residues) was executed on a SUN ULTRA SPARC based Enterprise 6000 server with 12 processors running at 167 MHz. Figure 7 presents the results of the overall real execution time of the application as a function of the number of processors used.
The global performance scales quite perfectly with the number of processors used, reaching at 12 processors 200 million matrix cells per second. We believe we can get 500 million matrix cells per second on existing SUN Enterprise 30 processors servers. Similar performance can be obtained using ULTRA SPARC workstations connected by a network.
This level of performance can be compared to specialized commercially available hardware solutions such as the BIOCCELERATOR (Ltd., 1996) machine. Based on the XILINX field programmable gate arrays (FPGA), the accelerator is composed of a rack containing up to 16 accelerator chips connected to a workstation through the SCSI interface. 
Conclusion
Excellent performance has been obtained using VIS, albeit the VIS instructions were not intended initially for general purpose computation. We believe that the method is attractive and can be extended to other types of parallelizable algorithms giving more processing power to pure software implementations.
Although our solution offers less performance than specialized hardware, it offers more flexibility. Furthermore, the principle is applicable to other instruction sets such as MMX for INTEL PENTIUM and PENTIUM PRO processors which will be introduced soon (Gwennap, 1996) , enabling the use of clusters of PCs as computing power.
We expect that other processor families like DEC ALPHA, HP PA-RISC and PowerPC will follow the trend, adding VISlike instructions to their CPUs. Since the performance scales with the CPU clock speed, we expect proportionally higher performance on the announced ULTRA SPARC 250 MHz machines.
The parallelization scheme presented in this paper is not limited to VIS-like implementation On a 64-bit CPU, one can pack three 21-bit or four 16-bit integers in the machine word. Vector add and subtract can be computed using bit masks to avoid carry propagation between packed integers. The proof of lack of overflow can be used to simplify the VMAX operation. On some machines, one can expect performance gain compared to full integer implementation.
