Introduction

Motivation
In recent times, large government-sponsored scientific projects, such as the Human Genome Project in the United States, have contributed to a massive increase in the quantity of data on proteins, DNA, and RNA. When scientists search for similar sequences amongst a vast array of sequences, search speed is important, particularly when a search on a software-based system can take up to a month to complete. The Smith and Waterman (S&W) algorithm is one of the fundamental methods of analysing and searching the large amounts of data now available. However, an alignment algorithm such as this imposes a large computational load, which leads to a performance bottleneck on CPU-based computers and servers. As researchers investigate larger organisms such as humans, the need for high-performance sequence analysis and database similarity search becomes even more pressing. An application-specific hardware design that is both high-speed and cost-effective is the best solution for the required computational load. The SWASAD is an application-specific hardware system that executes the S&W DNA sequence comparison algorithm faster and with a lower engineering design cost than existing products (for example, the Kestrel chip). The SWASAD is an improved implementation of the BISP design [1]. The SWASAD design has used an improved algorithm, and has
Related Work
Algorithms for sequence comparison can be categorized into two groups: dynamic and heuristic. Dynamic algorithms give optimal solutions, and well-known searching algorithms such as S&W, Needleman & Wunsch (N&W) and Hidden Markov Models (HMM) are of the dynamic kind. Examples of heuristic algorithms are the BLAST, FASTA and Feng & Doolittle algorithms. Heuristic algorithms are statistically driven sequence searching and alignment methods, and might not be as sensitive for all protein searches as full dynamic programming algorithms such as S&W and N&W. Software implementations of the above algorithms include BLAST [2] for the BLAST algorithm, FASTA [3] for the FASTA algorithm, and HMMER [4] for the HMM algorithm.
The following are the well-known hardware approaches to the sequence matching problem. DeCypher Accelerator [5] is a reconfigurable computer that runs CPU-based and hardware methods on the same system for both heuristic and dynamic algorithms. SAMBA [6, 7] is a systolic array of 128 processors for high-performance comparisons based on the S&W algorithm. BIOCCELERATOR [8] is a reconfigurable system that operates on the basis of the N&W algorithm and uses the S&W algorithm for optimal local alignments. GeneMatcher2 [9] is a system with a parallel architecture implemented in ASIC technology. It performs similarity searches with both position-independent (S&W, N&W) and position-specific (HMM) optimal algorithms, and also performs the heuristic (non-optimal) BLAST algorithm. Kestrel [10] is a single-board programmable parallel processor with 512 Processing Elements (PEs), whose two primary target applications are the S&W and HMM algorithms. BISP [1] is a system based on the S&W algorithm and implemented in ASIC technology (more details in section 3).
The SWASAD
The SWASAD is designed to demonstrate the feasibility of incorporating sensitive searching algorithms such as S&W into application-specific hardware systems. The SWASAD executes only the S&W searching algorithm, while other similar products can generally execute multiple algorithms. However, the cost of the SWASAD is significantly lower than that of other products, in terms of both engineering design cost and product price, and the SWASAD design is retargetable, which highlights its potential to deal with the future growth in Bioinformatics. The clock speed of the SWASAD design is 50 MHz. While the design is capable of running at many times this speed, getting the data out of the chip is far more difficult, as we had to interface to a PCI bus. The SWASAD chip incorporates a new distance approach, which makes the output data smaller, and uses new design techniques to make the chip smaller and faster.
Paper Outline
The remaining sections of this paper detail the SWASAD project.
Section 2 reviews the S&W similarity approach in the BISP and then introduces the distance approach applied in the SWASAD. Section 3 discusses in detail the implementation of the SWASAD, while section 4 examines the results and compares the SWASAD to other existing products. The overall conclusions are drawn in section 5.
Algorithms
The S&W similarity approach in the BISP states that the similarity of the two sequences can be determined by finding the accumulated weights (maximum score) introduced by the steps of the best alignment.
For the purposes of this project, only deoxyribonucleic acid (DNA) sequences are analysed; A, T, G and C are the DNA bases that make up the sequences.
When the query sequence a = a_1 a_2 … a_11 = ACAGGACTACA is compared to the database sequence b = b_1 b_2 … b_11 = ACAGACTATCA, the best alignment can be shown to be:
The symbol ¨ stands for a gap, the insertion of which enables a better alignment. In the similarity approach, the higher the maximum score, the more similar the two compared sequences. Alternatively, the relationship between two compared sequences can be viewed as 'how different they are' instead of 'how similar they are'. This is referred to as a 'minimum distance approach' in [11] (the biological equivalence of the two approaches, with a complete proof, appears in [12]). When the relationship between two closely related sequences is expressed in terms of the distance approach, scores close to zero are more significant than scores close to the maximum value of the data bus. Minimum distance theory is useful for improving VLSI implementations of the S&W algorithm, since reducing the data-width of the system minimizes the chip size and optimises the performance and power consumption. Figure 1 shows the pseudo-code for the S&W distance approach. In the given pseudo-code, the constants are u = 2 and v = 1, and (s_i y_j) = 3 if the bases match and -3 otherwise. N and M are the lengths of the database and the query sequences, respectively.
The H matrix is initialised with H_{0,0} = H_{i,0} = H_{0,j} = 0, for i ≥ 0 and j ≥ 0, for the similarity approach. In the distance approach the initialisation is more complex. E and F are both intermediate scores. With the same query and database sequences, it can be shown that the distance approach yields the same alignment as the similarity approach. The trade-off between the advantage of reducing the bus-width (by using distance theory) and the disadvantage of the extra hardware logic needed for the initialisations (some adders, registers and control signals) needs to be considered. The reason why the distance approach is implemented in the SWASAD is explained in the next section.
Implementation
The design implementation is discussed in a top-down fashion, from the chip architecture down to the Processor Elements (PEs). Figure 2 is the schematic diagram of the SWASAD chip. The control logic schedules the SWASAD chip's internal operations and responds to control signals from the microprocessor. The constant registers store the SWASAD constants (such as v and u). The status outputs report the status of the chip to the microprocessor, which then responds accordingly. The clock 
Chip Architecture
-- Initialisations
for i = 1 to M do
    E_{i,0} ⇐ u_d + v_d (i - 1);
    H_{i,0} ⇐ u_d + v_d (i - 1);
end for;
for j = 1 to N do
    F_{0,j} ⇐ u_q + v_q (j - 1);
    H_{0,j} ⇐ u_q + v_q (j - 1);
end for;
-- Process
for i = 1 to M do
    for j = 1 to N do
        E_{i,j} ⇐ min{H_{i,j-1} + u_d, E_{i,j-1} + v_d}
        F_{i,j} ⇐ min{H_{i-1,j} + u_q, F_{i-1,j} + v_q}
        H_{i,j} ⇐ min{H_{i-1,j-1} + (s_i y_j), E_{i,j}, F_{i,
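The distance recurrence in the pseudo-code can be sketched in software as follows. This is only an illustrative Python model, not the SWASAD hardware. Several points are assumptions: the function name sw_distance and the sub_cost callback are hypothetical, H_{0,0} is taken to be 0, the truncated last term of the H update is taken to be F_{i,j}, and the unit substitution cost in the usage lines is chosen purely for illustration and is not the value used in the paper.

```python
def sw_distance(s, y, sub_cost, u_d, v_d, u_q, v_q):
    """Fill the H matrix of the distance recurrence (software sketch).

    s has length M (index i), y has length N (index j).
    sub_cost(a, b) plays the role of (s_i y_j); its values are
    supplied by the caller and are an assumption of this sketch.
    """
    M, N = len(s), len(y)
    INF = float("inf")
    H = [[0] * (N + 1) for _ in range(M + 1)]    # H[0][0] = 0 is assumed
    E = [[INF] * (N + 1) for _ in range(M + 1)]
    F = [[INF] * (N + 1) for _ in range(M + 1)]

    # Initialisations: the boundary grows linearly with the gap penalties.
    for i in range(1, M + 1):
        E[i][0] = H[i][0] = u_d + v_d * (i - 1)
    for j in range(1, N + 1):
        F[0][j] = H[0][j] = u_q + v_q * (j - 1)

    # Process: each cell keeps the minimum accumulated distance.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            E[i][j] = min(H[i][j - 1] + u_d, E[i][j - 1] + v_d)
            F[i][j] = min(H[i - 1][j] + u_q, F[i - 1][j] + v_q)
            H[i][j] = min(H[i - 1][j - 1] + sub_cost(s[i - 1], y[j - 1]),
                          E[i][j],
                          F[i][j])  # last term assumed to be F[i][j]
    return H


# Illustrative usage: unit_cost is a hypothetical cost function.
unit_cost = lambda a, b: 0 if a == b else 1
H = sw_distance("ACAGGACTACA", "ACAGACTATCA", unit_cost,
                u_d=2, v_d=1, u_q=2, v_q=1)
print(H[-1][-1])
```

With non-negative costs, identical sequences accumulate a distance of zero along the diagonal, which matches the statement that scores close to zero indicate closely related sequences.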
Processor Elements
The PEs are the main modules of the SWASAD chip. Because a SWASAD chip is composed of a number of identical PEs, even a small modification to the PE greatly affects the overall chip in terms of size, power consumption and performance.
The operational state of the SWASAD begins exactly one clock cycle after the initialisation finishes. All bold connections of the initialised state are disconnected, while all the dotted connections are established and ready for operation, as shown in the PE architecture in Figure 3. Each register indicated by the callout blocks needs to
[Figure 3: PE architecture]
The rest of section 3 discusses techniques to improve the PE by optimising its main components.
Initialisation Logic
The distance approach is chosen for this thesis because a methodology is given here for implementing the initialisation hardware of the distance approach without inferring any extra adder or register hardware. As shown in Figure 3, the initialised and operational states share common registers and adders by switching the wiring connections.
Moreover, the chip performance is enhanced since the initialisations are done on the fly while the chip is calculating the scores.
Comparator
Each PE contains three comparators, and each comparator takes up significant area and adds delay to the critical path, so the comparator is an excellent candidate for optimisation. From the hardware point of view, a subtractor is generally more efficient than a comparator in both area and speed [13]. Figure 4 shows two architectures implemented in VHDL code (RTL1 and RTL2) that perform the same task: each compares the two inputs data1 and data2 and signals the result of the comparison on the output port LE [13]. RTL1 infers a comparator (line 9), while RTL2 infers a subtractor (line 19). According to [13], RTL2 infers hardware with less area and less delay than RTL1. The subtractor inferred in RTL2 is 17 bits wide, although each of the input ports data1 and data2 is only 16 bits wide. The 17-bit subtractor is needed to signal whether the subtraction has underflowed, which gives the comparison result (lines 20 to 24). Because of the way the S&W algorithm scores are accumulated and compared, the additional MSB of the subtractor (for example, z(16)) can be eliminated to reduce the size and delay of the PE comparison modules without altering the algorithm. In this case, z(15) is used to signal the subtraction result instead.
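The idea of comparing by subtraction can be modelled in a few lines of Python. This is a behavioural sketch only, not the VHDL of Figure 4. It assumes that LE means data1 <= data2 and that the result is read from the top bit of data2 - data1; the function names are hypothetical. The narrow variant is valid only under the further assumption that the two scores never differ by 2^15 or more, which is how this sketch interprets the remark about the way S&W scores are accumulated.

```python
WIDTH = 16  # width of data1 and data2 in the example above


def le_via_subtractor(data1, data2, width=WIDTH):
    """data1 <= data2 using a (width + 1)-bit subtraction.

    The extra MSB, z(16) for 16-bit inputs, is set when
    data2 - data1 underflows, i.e. when data1 > data2.
    """
    mask = (1 << (width + 1)) - 1
    z = (data2 - data1) & mask
    underflow = (z >> width) & 1
    return underflow == 0


def le_via_narrow_subtractor(data1, data2, width=WIDTH):
    """Same test using only `width` bits and reading z(width - 1).

    Assumption of this sketch: |data2 - data1| < 2**(width - 1),
    otherwise the top bit no longer reflects the sign.
    """
    mask = (1 << width) - 1
    z = (data2 - data1) & mask
    return ((z >> (width - 1)) & 1) == 0


# Both versions agree while the scores stay close together.
print(le_via_subtractor(300, 305), le_via_narrow_subtractor(300, 305))
```

Dropping the extra bit therefore trades generality for a shorter carry chain: the narrow form is only safe when the bound on the score difference holds.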
Overflow Logic
The worst case for the S&W distance approach arises when an array of hundreds of PEs compares two unrelated sequences: the accumulated H score will eventually exceed the bus-width of the SWASAD adders. Overflow will then occur and incorrect scores will be reported. The associated overflow logic is implemented by introducing an extra input for each comparator in the PE, for example the constant T_1 for MIN 1 in Figure 3. This constant ensures that the largest possible score output by this comparator is always T_1. The value of T_1 is set so that the value passed from the comparator to the next PE cannot cause an overflow in that PE, and so on.
The same reasoning applies to the other two comparators, MIN 2 and MIN 3.
The overflow logic is implemented with an equator (equality checker) instead of a comparator, since a comparator generally has worse hardware usage and delay. Assume, for example, that the system bus-width is 8 bits and that the query sequence open gap and continuous gap penalties u_q and v_q are 4 bits and 2 bits wide respectively. Any score larger than '11111111' will cause an overflow in an 8-bit adder. Consider the addition of Reg1 in Figure 3 (which comes from the F score of the previous PE) and v_q to form F_{i,j}, with v_q = 01, 10 or 11. In general, for a system with a bus-width of M bits and a continuous gap penalty v_q of N bits, the constant T_1 is set to '111…10…00', with (M-N) '1's in the upper portion and N '0's in the lower portion. Suppose the 8-bit register Reg1 receives an incoming score of value T_1, as constrained by the overflow logic of the previous F score. After v_q (2 bits) is added to Reg1, the result exceeds T_1, which would cause an overflow in the next PE, and therefore the result of F_{i,j} is constrained to T_1. This event is detected by a 6-bit equator rather than one of the full 8-bit system bus-width, which saves hardware.
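The construction of T_1 and the upper-bit equality check can be illustrated with a short Python model. This is a sketch under stated assumptions rather than the SWASAD logic itself: the function names are hypothetical, the incoming Reg1 value is assumed to be already limited to T_1 by the previous PE, and only the F path with v_q is modelled.

```python
def threshold(bus_width, penalty_bits):
    """T_1: (M - N) ones in the upper bits, N zeros in the lower bits."""
    ones = (1 << (bus_width - penalty_bits)) - 1
    return ones << penalty_bits


def saturating_add(reg1, v_q, bus_width=8, penalty_bits=2):
    """Add v_q to reg1 and clamp the result to T_1.

    Assumes reg1 <= T_1 and v_q < 2**penalty_bits, so the sum
    always fits in bus_width bits.
    """
    t1 = threshold(bus_width, penalty_bits)
    total = reg1 + v_q
    # Only the upper (M - N) bits are compared for equality:
    # a 6-bit check for an 8-bit bus and a 2-bit penalty.
    if (total >> penalty_bits) == (t1 >> penalty_bits):
        return t1
    return total


print(bin(threshold(8, 2)))
print(saturating_add(threshold(8, 2), 0b01))
```

Because every value from T_1 up to the all-ones word shares the same upper bits, checking those bits for equality is enough to catch any sum that has reached the limit, without a full-width magnitude comparison.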
Results
The front-end HDL design for the SWASAD was implemented in behavioural VHDL and simulated using the ActiveHDL simulator [14] and the ModelSim simulator [15] to verify functionality and timing. The VHDL was then synthesized into netlist files using the Leonardo Spectrum synthesizer [16]. Chip area, chip performance and the schematic layout of the synthesized design were all estimated using Leonardo. The IC layout was generated automatically, with standard cells placed and routed, using the layout-editing tool ICstation [17] from Mentor Graphics. The design rule check and the layout versus schematic check were then performed. Finally, pad frames were created and the layout was manually edited to connect the I/O ports of the design to the pad cells. After all of the above procedures were completed, the design was ready to be fabricated. Figure 11 shows the layout of the SWASAD die with 32 PEs and an 8-bit system bus.

Table 1 shows the overall comparison between the specifications and performance of the SWASAD and the BISP chips. The improvements achieved by the SWASAD over the BISP are printed in bold, and the disadvantages of the SWASAD are printed in italics. If it is assumed that the BISP chip can operate at up to 50 MHz (with the help of a better fabrication process, etc.), just like the SWASAD chip, the BISP performance improves to 800 MCPS, which is still ¼ of the SWASAD performance.

Table 2.

The SWASAD is an improved version of the BISP, and its specifications are similar to those of the Kestrel design, which is produced in the same fabrication process (0.5 µm) with a similar die size. Although Kestrel performs multiple dynamic algorithms while the SWASAD is specific to the S&W algorithm, the performance of the SWASAD (MCPS per chip) is calculated to be better than that of Kestrel. The SWASAD has a faster clock speed than Kestrel, while both have the same number of PEs within a chip. The design cost of the SWASAD is 13 person-months.
Conclusion
As science progresses further into genomics, there is a pressing need for researchers to be able to swiftly convert an ever-expanding volume of data into meaningful and useful information via sequence analysis. The purpose of this thesis has been to design an application-specific hardware tool, the SWASAD, to satisfy the high computational load requirement of sequence analysis.
The performance improvements of the SWASAD chip over the BISP chip come from a reduction of the data-bus width (achieved by applying the S&W distance approach), a restructuring of the PE architecture, and the use of more advanced fabrication technology to develop the chip.
The SWASAD chip with 64 PEs is estimated to have a size of 7.11 mm by 7.11 mm in a 0.5 µm process (this could be reduced further by using a hierarchical design methodology; at present we use a flat structure). A future 0.1 µm process will make possible a SWASAD that has 1024 PEs, a controller core and a storage element all within a single chip. The essential tasks in this implementation are predicted to be the inter-communication between modules, the architecture of the controller, the format of the data frame, and the operational instructions and layout of the storage element.
