Kestrel is a high-performance programmable parallel co-processor 
Third, programmable processing elements that require several instructions to compute the basic function can be used. This solution, used by Kestrel, essentially allows the addition of programmability into the processing elements with little penalty over the single-purpose system. If, for example, data can be retrieved from disk at a sustained rate of 1-5 Mbyte per second and 33 million instructions can be processed per second, then a balanced application will execute at least 6-33 instructions per byte of data.
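The balance argument is simple arithmetic. The sketch below reproduces it, assuming 1 Mbyte = 10^6 bytes and an instruction rate of 33 million per second; the helper name `instructions_per_byte` is hypothetical.

```python
def instructions_per_byte(instr_per_sec: float, bytes_per_sec: float) -> float:
    """Instructions the array can issue for each byte arriving from disk."""
    return instr_per_sec / bytes_per_sec

INSTR_RATE = 33e6  # instructions per second (assumed from the 33 MHz clock)

# A fast disk (5 Mbyte/s) leaves the fewest instructions per byte,
# a slow disk (1 Mbyte/s) the most.
low = instructions_per_byte(INSTR_RATE, 5e6)
high = instructions_per_byte(INSTR_RATE, 1e6)
print(f"{low:.1f} to {high:.1f} instructions per byte")
```

A cell program whose inner loop is shorter than the lower bound leaves the array waiting on I/O; one much longer than the upper bound makes the array the bottleneck.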
The inner loop of one common form of sequence analysis requires 21 instructions on Kestrel. Our target single-board, 512-PE system running at 33 MHz will perform this application at a speed similar to the vastly more expensive 16 384-processing-element MasPar MP-2 [", 15].
2.2: Instruction broadcast
SIMD instruction broadcast can be slow, especially with many processing elements. The typical SIMD instruction memory is centralized for all processing elements to eliminate the overhead of local control. Sending an instruction to the array requires a broadcast of the instruction to each board, each chip, and each processing element within each chip, every clock cycle. The broadcast time can easily exceed processing time even for complex instructions.
One solution to the instruction broadcast bottleneck is to partially or completely eliminate instruction broadcast. This can be done in several ways, each with its own cost. A multiple instruction stream, multiple data stream (MIMD) machine, obviously, has no instruction broadcast (the program may be broadcast initially) with the penalty of local instruction memories and sequencers, not appropriate for the small and simple processing elements that we are proposing. A reconfigurable machine can be thought of as a SIMD machine with a single exceedingly long instruction that is executed many times [3, 4, 11]. Unfortunately, this great flexibility can cause programming to be more difficult than on a true SIMD machine, and can cause a much lower ratio of processing power to area.
Alternatively, instructions can be cached at the PE (for a MIMD machine), chip, or board level. With, for example, per-chip instruction caching, the programming model can be either SIMD or multiple-SIMD (MSIMD). For SIMD control, synchronization between the chips must still be broadcast. For MSIMD control, communication between chips will require queues or some other form of synchronization, and may degrade overall performance. Thus, we chose board-level instruction issue and broadcast.
Because instruction broadcast can take a fair amount of time, it is important to match the broadcast time with the instruction processing time. Thus, an architecture should do many things during each instruction broadcast, reducing the rate at which instructions are required. A processing element must be complex enough to be able to do useful work during the cycle time of instruction broadcast, much the same way it was necessary to balance cell program length and data I/O requirements. Our initial estimates of maximum broadcast speed were 40-50 MHz; thus we were interested in designing a processor that could process instructions at approximately this rate, and in designing an instruction set that packed as much concurrent operation as feasible and usable.
The Kestrel processing element, described in more detail below, has a number of independent functional units. Control of these units requires a 52-bit instruction broadcast to all the processing element chips each clock cycle. This instruction includes three register addresses, an 8-bit immediate, and control fields for the functional units. In a single instruction, Kestrel can perform an ALU operation, comparison, result selection, storing of the selector bit, address computation, memory access, and communication. All of these capabilities are used concurrently in our sequence comparison program's inner loop.
2.3: Inter-PE communication
In systolic algorithms, data moves between PEs as partial results are calculated. Sometimes values are required by multiple PEs to compute a result. Both of these are true for the dynamic programming solutions to sequence analysis. The ability to create a fine-grain pipeline is key to making systolic arrays an effective solution. Thus, a good inter-PE communication scheme is essential for any array processor.
Kestrel employs Systolic Shared Registers (SSRs) for inter-PE communication (Figure 1). Rather than being local to each PE, as register files usually are, SSRs reside between PEs. This allows neighboring PEs to share a register file.
With shared registers, communication and computation do not require distinct instructions but occur concurrently. When a result is stored after an instruction is executed, communication automatically takes place. The programmer can naturally think about data streaming through the array as values are computed. To pass a value from one PE to the next using SSRs, all that is necessary is to store the value in a register file. For example, if each PE is calculating a value to be used by the PE to its right, then each PE stores its result into the register file on its right. During a subsequent instruction, each PE can read from its left register file to get the previously stored value. Because all addresses for these register files are issued globally, adjacent PEs never write to the same register bank. The first machine to include SSRs was B-SYS [16], though two earlier machines included shared memory chips, one with a port for each processing element [19], the other with a combination of asynchronous local operation and synchronized shared memory operation to eliminate contention problems [6]. SSRs make for an elegant programming model and provide a low-overhead solution to the inter-PE communication problem: the cost is one bit per register address.
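The store-right, read-left pattern can be illustrated with a small simulation. This is a minimal sketch, not the Kestrel hardware model: the array size, bank size, and the names `make_banks` and `simd_step` are all assumptions made for illustration.

```python
NUM_PES = 4
NUM_REGS = 2  # registers per shared bank (hypothetical size)

def make_banks():
    # banks[i] sits between PE i-1 and PE i, so PE i sees banks[i] on its
    # left and banks[i + 1] on its right.
    return [[0] * NUM_REGS for _ in range(NUM_PES + 1)]

def simd_step(banks, reg, op):
    """One broadcast instruction: every PE reads `reg` from its left bank,
    applies `op`, and stores the result in `reg` of its right bank.
    All reads complete before any write, as in a clocked array."""
    results = [op(pe, banks[pe][reg]) for pe in range(NUM_PES)]
    for pe, value in enumerate(results):
        banks[pe + 1][reg] = value

banks = make_banks()
banks[0][0] = 10  # value injected at the left edge of the array
for _ in range(NUM_PES):
    simd_step(banks, 0, lambda pe, x: x + 1)

print([bank[0] for bank in banks])
```

No instruction in the loop is dedicated to communication: the store of each result is itself the transfer to the neighbor.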
2.4: Instruction sequencing
The final part of co-processor design is the instruction sequencer and interface to the host workstation. For the first Kestrel machine, we are building a simple board interface (Figure 2). Instructions are stored in a local memory, while data transfers to and from the host use queues to simplify addressing. The instruction sequencer is programmed using a 12-bit field appended to each array instruction, making the total instruction width a convenient 64 bits. Three of these bits have specific and constant meanings: whether or not data is read from the input queue, whether or not data is written to the output queue, and whether or not the array instruction's immediate field should be replaced with data from the controller. The remaining nine bits enable branches, jumps, calls, returns, and host interrupts to be performed concurrently with array operations. Kestrel effectively has 0-cycle jump and branch instructions. The controller has a small amount of local data and arithmetic circuitry to enable loop counting and the recirculation of data in the processor array.
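The 52 + 12 = 64 bit split can be sketched as a pack/unpack pair. The widths come from the text; the bit positions, the flag names, and the functions `pack` and `unpack` are hypothetical, since the actual field layout is not given here.

```python
ARRAY_BITS = 52  # array instruction broadcast to the PEs
SEQ_BITS = 12    # sequencer field appended by the controller

# Hypothetical positions of the three fixed-meaning bits in the 12-bit field.
READ_INPUT_QUEUE = 1 << 0
WRITE_OUTPUT_QUEUE = 1 << 1
REPLACE_IMMEDIATE = 1 << 2
FLOW_SHIFT = 3                      # remaining nine bits: branch, jump, call, ...
FLOW_BITS = SEQ_BITS - FLOW_SHIFT

def pack(array_instr: int, flags: int, flow: int) -> int:
    """Combine the array instruction and sequencer field into one 64-bit word."""
    assert 0 <= array_instr < (1 << ARRAY_BITS)
    assert 0 <= flags < (1 << FLOW_SHIFT)
    assert 0 <= flow < (1 << FLOW_BITS)
    seq = flags | (flow << FLOW_SHIFT)
    return (seq << ARRAY_BITS) | array_instr

def unpack(word: int):
    """Split a 64-bit word back into (array instruction, flags, flow control)."""
    array_instr = word & ((1 << ARRAY_BITS) - 1)
    seq = word >> ARRAY_BITS
    return array_instr, seq & ((1 << FLOW_SHIFT) - 1), seq >> FLOW_SHIFT

word = pack(0xABCDE12345678, READ_INPUT_QUEUE | REPLACE_IMMEDIATE, 0x1FF)
```

Because the flow-control bits travel in the same word as the array instruction, a branch costs no extra instruction slot, which is what makes the 0-cycle jumps possible.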

3: Kestrel Applications
After the primary components of the architecture were solidified, we turned our thoughts to applications beyond the initial domain of sequence analysis. This application analysis led us to several changes in the processor design, such as the addition of the multiplier. The following applications refined and exploit the special features of Kestrel, such as conditional execution, zero overhead for loop control, and multiprecision arithmetic operations. Because of limited local memory, the applications trade memory usage for a number of instructions wherever possible. The shared register banks allow easy linear data pipelining in both directions.
3.1: Sequence Analysis
Many sequence analysis techniques rely on aligning sequences in the database to a model or to other sequences, often with a variant of the following edit-distance computation. Given two sequences of characters, $a$ and $b$, dynamic programming determines a total cost to transform one sequence into the other through three basic operations: deletion of a character, insertion of a character, and mutation of a character. The following dynamic programming recurrences are used to compute "edit distance":

$$c_{0,0} = 0, \qquad c_{i,j} = \min \begin{cases} c_{i-1,j-1} + \mathrm{dist}(a_i, b_j) \\ c_{i-1,j} + \mathrm{dist}(a_i, \phi) \\ c_{i,j-1} + \mathrm{dist}(\phi, b_j) \end{cases}$$

where $\mathrm{dist}(a_i, b_j)$ is the cost of matching $a_i$ to $b_j$, $\mathrm{dist}(a_i, \phi)$ is the gap cost of not matching $a_i$ to any character in $b$, and $\mathrm{dist}(\phi, b_j)$ is the gap cost of not matching $b_j$ to any character in $a$. Sequence comparison with affine gap penalties and various other features, greatly preferred by biologists, involves three interconnected recurrences of a similar form [29, 35, 37].
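The three-way minimization over match or mutation, deletion, and insertion can be written directly as a sequential reference implementation. This is a minimal sketch: the constant gap cost, the unit mismatch cost, and the names `edit_distance` and `unit_cost` are assumptions for illustration, not the scoring used on Kestrel.

```python
def unit_cost(x: str, y: str) -> int:
    """Cost of matching two characters: 0 if equal, 1 for a mutation."""
    return 0 if x == y else 1

def edit_distance(a: str, b: str, match_cost=unit_cost, gap_cost: int = 1) -> int:
    """Cost of transforming `a` into `b`; c[i][j] covers a[:i] and b[:j]."""
    c = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        c[i][0] = c[i - 1][0] + gap_cost          # a_i matched to nothing
    for j in range(1, len(b) + 1):
        c[0][j] = c[0][j - 1] + gap_cost          # b_j matched to nothing
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            c[i][j] = min(
                c[i - 1][j - 1] + match_cost(a[i - 1], b[j - 1]),
                c[i - 1][j] + gap_cost,
                c[i][j - 1] + gap_cost,
            )
    return c[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))
```

Each cell depends only on its left, upper, and upper-left neighbors, so all cells on one anti-diagonal are independent of each other.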
The dynamic programming calculation easily maps to a linear array of processing elements [15].
A common mapping is to assign one PE to each character of the query string, and then to shift the database through the linear chain of PEs. Typical query strings are hundreds to thousands of characters, matching Kestrel's 512 PEs, though longer queries require storing several adjacent characters in each processing element's local memory (i.e., a virtual processor ratio greater than one).
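The one-PE-per-query-character mapping can be simulated step by step. This is a sketch under assumptions: unit mismatch and gap costs, a virtual processor ratio of one, and hypothetical names (`systolic_edit_distance` and the per-PE state lists). It models the data movement only, not Kestrel's instruction set.

```python
def systolic_edit_distance(query: str, database: str, gap_cost: int = 1) -> int:
    m, n = len(query), len(database)
    if m == 0:
        return n * gap_cost
    # Index 0 models the left edge of the array; PEs are numbered 1..m.
    out_char = [None] * (m + 1)   # character each PE passes to its right
    out_cost = [0] * (m + 1)      # c[p][j] each PE passes to its right
    own = [p * gap_cost for p in range(m + 1)]        # c[p][j-1], starts at c[p][0]
    diag = [max(p - 1, 0) * gap_cost for p in range(m + 1)]  # c[p-1][j-1]

    for t in range(1, n + m + 1):
        # Right-to-left order so each PE sees its neighbor's previous output.
        for p in range(m, 0, -1):
            ch = out_char[p - 1]
            if ch is None:            # nothing arrived this step
                out_char[p] = None
                continue
            up = out_cost[p - 1]      # c[p-1][j] from the left neighbor
            mismatch = 0 if query[p - 1] == ch else 1
            new = min(diag[p] + mismatch, up + gap_cost, own[p] + gap_cost)
            diag[p], own[p] = up, new
            out_char[p], out_cost[p] = ch, new
        # The left edge injects the next database character and c[0][j].
        if t <= n:
            out_char[0], out_cost[0] = database[t - 1], t * gap_cost
        else:
            out_char[0] = None
    return own[m]

print(systolic_edit_distance("kitten", "sitting"))
```

The whole comparison finishes in n + m steps rather than n × m, because the PEs on one anti-diagonal of the dynamic programming table work concurrently.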
Architecturally, sequence comparison can be aided by multiprecision arithmetic (the dist values can be at a lower precision than the c values), modulo arithmetic, minimization, and addition. The twin problem of sequence alignment, finding the minimizing correspondence between two sequences, requires the saving of the selector bits of the minimizations and recirculation of sequence data [13].
Kestrel's single-board sequence analysis performance is expected to be comparable to both a 16384-processor MasPar computer and a 15-board FPGA-based system (Figure 3). The one system expected to be significantly faster than Kestrel is the BioSCAN machine, a single-purpose machine that calculates scores for ungapped alignment segments ($c_{i,j} = c_{i-1,j-1} + \mathrm{dist}(a_i, b_j)$), and then performs statistical post-processing to gauge similarity [36]. Further details on parallel hardware for sequence analysis can be found in two comparative studies [15, 39].
3.2: Neural Networks
Neural networks (NNs) [28]

[Figure 3 legend: 16,536-PE MasPar [34], SAMBA (single purpose) [24], 1- and 8-board Mercury (single purpose) [4], Biocellerator (FPGA, about $50k).]

Figure 4 shows decision and input data rates as functions of the vector size and the maximum number of nodes in a layer, respectively. This performance is equal to the performance range of commercial, single-purpose NN chips [25].
Training of a neural network by backpropagation uses pipelines of registers in both directions. The input vectors and the intermediate outputs are stored in each node for later use, so that the forward part is split into a transfer and store phase followed by the local MAC operations. Only a few instructions are required to adjust the weights of the output layer. Changing the weights of one hidden layer node, however, is computationally much more intensive because each output node yields its own contribution. It seems best to accumulate partial results and shift them backwards from the output to the hidden layer, requiring $n_2 + n_3$ MACs. After multiplication with the derivative of the activation function, the weights can be adjusted for all nodes in $\max(n_1, n_2, n_3)$ steps.
Training is obviously slower than classification, and the dependence on $n_1$, $n_2$ and $n_3$ is also more complex. We estimate that the maximal example cited above allows a training input of 18000 vectors per second, and this number increases to about 50000 v/s for a more typical case of $n_1 = 8$, ...
3.3: Discrete Cosine Transform
The discrete cosine transform (DCT) and its inverse (IDCT) [18], as part of video compression and decompression standards, process 8 x 8 arrays either as 8-bit pixels I(x, y) or as 12-bit to 14-bit coefficients v(u, v). Real-time video compression and decompression leads to a continuous stream of DCT and IDCT operations which can account for 25% to 50% of CPU usage unless dedicated hardware is available. The DCT (and IDCT) resemble the discrete 2-D Fourier transforms but remain in the domain of real numbers using cosine terms as coefficients:
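For reference, the standard 8 x 8 forward DCT used in H.263-class video coding can be written as follows. The notation is the textbook one and is an assumption here: $F(u,v)$ denotes the coefficient array and $C(\cdot)$ the normalization factor.

$$F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} I(x,y)\, \cos\frac{(2x+1)u\pi}{16}\, \cos\frac{(2y+1)v\pi}{16}, \qquad C(k) = \begin{cases} 1/\sqrt{2} & k = 0 \\ 1 & \text{otherwise} \end{cases}$$

Each coefficient is a sum of 64 products, and because the two cosine factors depend on $x$ and $y$ separately, the double sum factors into a 1-D transform over rows followed by a 1-D transform over columns.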
Both the DCT and IDCT are separable into sequential 1-D transforms. According to the equation, each element would need 64 multiply and accumulate operations, but more efficient algorithms similar to FFT exist. Our implementation was derived from the Telenor H.263 software distribution [38] and averages 9.5 additions, 6 multiplications and 2 scaling operations per pixel.
It would be tempting to implement the DCT in pipelined fashion, shifting in a pixel stream and doing the MAC operations on the coefficients while shifting them toward the output. A careful analysis shows considerable overhead for this solution due to more costly internal shifts and staggered PE reinitializations. Separate phases of shifting data in and out followed by a full DCT per PE are more efficient and allow coding of the coefficients as immediate operands. Taking advantage of Kestrel's multiprecision multiply and accumulate instructions, a full DCT executes in 3768 clock cycles and the associated I/O requires 8234 cycles, leading to a rate of 176 kDCT/s. A single 64-PE chip running at 33 MHz is thus capable of handling an input rate of 11.26 x 10^6 pixels/s at a maximum latency of 364 µs. There is a potential for considerable speedup with future changes of the I/O architecture for higher bandwidth and background PE-to-PE communication.
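The quoted throughput and latency follow from the cycle counts. The sketch below reproduces the arithmetic, assuming that each of the 64 PEs completes one 8 x 8 block per batch and that the compute and I/O phases do not overlap; the variable names are hypothetical.

```python
CLOCK_HZ = 33e6
PES_PER_CHIP = 64
PIXELS_PER_BLOCK = 8 * 8
COMPUTE_CYCLES = 3768   # full DCT in each PE
IO_CYCLES = 8234        # shifting blocks in and results out

cycles_per_batch = COMPUTE_CYCLES + IO_CYCLES
latency_s = cycles_per_batch / CLOCK_HZ
dcts_per_s = PES_PER_CHIP / latency_s
pixels_per_s = dcts_per_s * PIXELS_PER_BLOCK

print(f"{dcts_per_s / 1e3:.0f} kDCT/s, "
      f"{pixels_per_s / 1e6:.2f} Mpixel/s, "
      f"{latency_s * 1e6:.0f} us latency")
```

I/O accounts for more than two thirds of each batch, which is why a higher-bandwidth I/O architecture would translate almost directly into speedup.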
3.4: Numerical Algorithms
Although Kestrel was not designed for floating-point computation, we have extensively studied the application of the Kestrel PE to floating-point arithmetic, both single and double precision, to ensure future applications would not be overly penalized by Kestrel's 8-bit architecture.
Floating-point addition and multiplication are implemented in the obvious way. In the case of addition, the multiplier is used to perform shifts. For division, we have optimized the use of k - The 512-PE, 33 MHz system will be able to perform single-precision addition, multiplication, and division at 160, 460, and 260 million floating-point operations per second (MFLOPS), and the double-precision variants at 65, 150, and 100 MFLOPS.
4: Kestrel Processing Element Architecture
The Kestrel PE architecture is the result of a number of requirements and goals. The PEs had to be fast and small to result in an inexpensive system. The PEs had to be flexible to allow for the development of new algorithms.
The Kestrel PE and its horizontal microcode format are shown in Figure 5, the components of which are described below. Each PE can operate on up to three operands per instruction: Operand A, Operand B, and Operand C. Operand A and Operand C are two independently selected registers. Operand B can come from the multiplier, the bit shifter, memory data register (MDR), the same register as Operand C, or a globally-issued immediate value. To aid in multiprecision operations when operands differ in length, Operand B can also be the sign extension of Operand C, the MDR, or MultHi (the high byte of a previously-computed multiplier result).
The relative sizes of several of the Kestrel PE functional units, as well as their usefulness to various applications, are displayed in

Each PE has an address generator and decoder for its SRAM. The SRAM has two addressing modes: absolute and indexed. For absolute addressing, the address is just the 8-bit immediate specified in the instruction. For indexed addressing the immediate is added to Operand C (typically meaning that a concurrent ALU or multiplier operation must use a different source for Operand B).
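The two addressing modes can be sketched as a single function. The wrap-around to 8 bits on overflow is an assumption, as is the name `effective_address`; the text specifies only that the immediate is used alone or added to Operand C.

```python
SRAM_BYTES = 256

def effective_address(immediate: int, operand_c: int = 0, indexed: bool = False) -> int:
    """Absolute mode uses the broadcast 8-bit immediate directly;
    indexed mode adds the PE's own Operand C to it."""
    address = immediate + operand_c if indexed else immediate
    return address % SRAM_BYTES  # wrap to 8 bits (assumed behavior)

# The immediate is the same in every PE, but Operand C is local, so indexed
# mode lets each PE read a different entry of a table stored at TABLE_BASE.
TABLE_BASE = 0x40
local_indices = [0, 3, 1, 7]  # one Operand C value per PE
addresses = [effective_address(TABLE_BASE, c, indexed=True) for c in local_indices]
print([hex(a) for a in addresses])
```

Indexed mode is what allows data-dependent table lookups, such as per-character cost tables, under a single broadcast instruction.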
The size of the SRAM is based on three factors. First, 256 bytes will allow the 512-PE Kestrel to process protein hidden Markov models of length 3000, and DNA models of about 10000 positions without host partitioning (simpler analysis methods can handle much longer sequences). Second, 256 bytes is a natural choice for a PE with 8-bit local addressing. Third, the SRAM size helps keep the PE small, enabling a high density of processing elements.
ALU
The ALU operates on Operand A and Operand B and a carry in to produce its result. The carry in can be specified as part of the instruction, or it can come from a latch that holds a previously computed carry out. The ALU is discussed in more detail in Section 5.1.
Bit Shifter
The bit shifter is an 8-bit loadable shift register that is capable of shifting left or right by one bit. The bit shifter serves two purposes: it can be used for data manipulation and for conditional processing. The bit shifter is discussed in more detail in Section 5.2.
Multiplier
The multiplier multiplies Operand A and Operand B to produce a 16-bit result split into two bytes, MultLo and MultHi. The lower byte is treated as the result of the current instruction, while the higher byte is stored in the MultHi register for future use if needed. The multiplier is discussed in more detail in Section 5.3.
Comparator
The comparator compares the output of the ALU with Operand C, and the minimum or maximum can be selected as the result of instruction execution. This is done by subtracting Operand C from the output of the ALU. The subtraction produces three flags: the borrow-out from the subtraction, the most significant bit (msb) of the subtraction, and the true sign of the subtraction (the sign that would be produced had the operands been sign-extended to nine bits, similar to a feature in the Intel i860). The three flags allow for three types of comparison: unsigned comparison (borrow-out), modulo 256 comparison (msb), and signed comparison (true sign). The comparator is an example of Kestrel's facility with multiprecision operations. With the aid of an equality test, multiprecision comparison is done top-down by byte, saving several cycles over the standard two-step process of a multiprecision subtraction followed by a multiprecision selection.
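The three flags can be modeled on 8-bit values as follows. This is a behavioral sketch only; the function names `to_signed`, `compare_flags`, and `select_min` are hypothetical, and the flag polarity (true meaning "ALU output is smaller") is an assumption.

```python
def to_signed(byte: int) -> int:
    """Interpret an 8-bit value as two's complement."""
    return byte - 256 if byte & 0x80 else byte

def compare_flags(alu_out: int, operand_c: int):
    """Flags from (alu_out - operand_c); each is True when alu_out is the
    smaller value under the corresponding interpretation."""
    diff = alu_out - operand_c
    borrow = diff < 0                       # unsigned comparison
    msb = bool(diff & 0x80)                 # modulo-256 comparison
    true_sign = to_signed(alu_out) - to_signed(operand_c) < 0  # signed
    return borrow, msb, true_sign

def select_min(alu_out: int, operand_c: int, flag_index: int) -> int:
    """Pick the minimum according to one flag: 0 unsigned, 1 modulo, 2 signed."""
    return alu_out if compare_flags(alu_out, operand_c)[flag_index] else operand_c

print(compare_flags(250, 5))   # 250 is -6 when read as signed
```

The modulo-256 flag treats values as points on a circle, so 250 counts as "behind" 5; this is the property that lets cost values wrap around without losing their relative order.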
Result Selector
The result selector chooses the one-byte result of executing an instruction. For multiply instructions, the result is always the low byte of the multiplication. Otherwise, the result is either the output of the ALU or Operand C. The result can be forced to one or the other of these values, or it can be chosen by a flag from the ALU, comparator, bit shifter, or a neighboring PE's bit shifter.
5: Kestrel Design
Kestrel's full custom VLSI layout was done using the Magic tool suite [26]. With hindsight, a combination of custom layout of the PEs and standard cell for the global logic might have saved some time without incurring high area or power costs. It is only the regularity of the design that enabled the hand layout of 1.4 million transistors.
The designs use the scalable submicron rules, and we have implemented test chips using 2 µm and 0.8 µm CMOS processes using the MOS Implementation Service (MOSIS). The final 64-PE chip will have a 0.5 µm feature size. Layout verification was performed with a functional simulator programmed on a MasPar parallel processor, a Verilog model, and irsim (part of the Magic suite).
Kestrel uses a two-phase non-overlapping clock. Most arithmetic operations occur during phase one. The SRAM access and register write of the result take place during phase two. This allows ample time for inter-chip writes to the shadow register bank (at chip boundaries, there are two adjacent and coherent register banks to minimize communication). As register writes can be disabled by the writing PE, mask setup at the register bank must be complete by the beginning of phase two. This means that inter-chip communication occurs during both phases.

Figure 6 shows a floor plan of the Kestrel PE with dimensions in scalable λ units, where λ is approximately one half the feature size of the CMOS process [27]. The height of the register bank is slightly less than half the height of the PE, so the register banks from the adjacent column can be fit into the extra space by turning the adjacent column sideways, reducing the horizontal pitch of the PE by about 500 λ. The main constraint on the layout was the vertical pitch of the PE, set by the SRAM, and the need to efficiently route local buses and global control signals. Over half of the PE's area and two thirds of the PE's transistors are used by the SRAM and register bank.
In the next sections, we take a close look at three of the most interesting Kestrel components: the integrated ALU and comparator, the bit shifter, and the multiplier.

Table 2. ALU function encoding.
, is the bit-reversal of the function code, except G is zero for increment functions, and Goo is always 0. The alternative function codes for the 8 logic functions that appear in both the decrement group and in the increment group can be used to set the Gout of the carry chain (this is not needed with Kestrel's flexible flag selection logic, but may be useful in other designs).
This compact ALU function encoding compares quite favorably to that of the classic 74181 chip.
For the same number of function bits, the Arithmetic Operations mode of the 74181 supplied the first four functions of each type, leaving out all unary functions of B, as well as B -A and -A. The 74181 could not negate at all.
The ALU, comparator, and equality test circuit have been tightly integrated (Figure 7). Because all three carry chains operate in parallel, the total delay is only slightly higher than that of the ALU alone. Since it is only 8 bits wide, carry lookahead is not needed.
The ALU and comparator pack 980 transistors in 356 428 λ², including multiprecision and flag selection hardware. The control logic for the multiprecision features and flag selection occupies a proportionally large amount of space compared to the ALU and comparator bit slice due to the number of global control signals.
5.2: Bit shifter
The bit shifter provides bit manipulation of data and conditions. The data functions for packing and unpacking bits were included to support the sequence alignment function in sequence analysis methods. The conditional processing functions greatly enhance the SIMD processing element's ability to evaluate nested conditions based on local values.
Fast evaluation of local conditions is critical in SIMD processing. Each clause of a conditional must be broadcast to the array in turn so that those PEs for which the condition is true can execute ... inactive (as with most SIMD designs, the mask register can be overridden globally). The NOR of the 8 bit shifter bits can be used to set a PE's mask register. If any of the bits are 1 (making the NOR 0), the PE will be masked, and will not execute broadcast instructions. Pushing additional bits onto the bit shifter can further refine the set of active PEs, while popping bits will restore a previous set. Bit shifter microcode fields for masking are always unconditional (i.e., they ignore the mask bit of the PE) to ensure the same number of conditions are present in all PEs.

Table 3. Bit shifter functions associated with processing nested conditionals.
Store: Save current state. Free bit shifter for other tasks. Can be used with Load, Clear, and Set to process more than 8 nested conditions.
Load: Restore a previous state.
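The push and pop masking scheme can be modeled in a few lines. This is a behavioral sketch under assumptions: a pushed 1 means "this PE failed the condition", the 'else' clause is entered by inverting the most recently pushed bit, and the names `BitShifter`, `push`, `invert_top`, `pop`, and `nested_example` are hypothetical rather than Kestrel's actual function names.

```python
class BitShifter:
    def __init__(self):
        self.bits = 0  # 8-bit shift register; all zero means the PE is active

    def push(self, condition_true: bool) -> None:
        # Shift left and record a 1 when this PE fails the condition.
        self.bits = ((self.bits << 1) | (0 if condition_true else 1)) & 0xFF

    def invert_top(self) -> None:
        # Switch to the 'else' clause of the innermost conditional.
        self.bits ^= 1

    def pop(self) -> None:
        self.bits >>= 1

    @property
    def active(self) -> bool:
        return self.bits == 0  # NOR of the eight bits

def nested_example(values):
    """if x > 5: (if x > 10: x -= 10  else: x += 100), run on every PE."""
    shifters = [BitShifter() for _ in values]
    data = list(values)

    def broadcast(op):
        for i, shifter in enumerate(shifters):
            if shifter.active:
                data[i] = op(data[i])

    # Pushes are unconditional, so every PE holds the same number of bits.
    for shifter, x in zip(shifters, data):
        shifter.push(x > 5)
    for shifter, x in zip(shifters, data):
        shifter.push(x > 10)
    broadcast(lambda x: x - 10)      # inner 'then'
    for shifter in shifters:
        shifter.invert_top()
    broadcast(lambda x: x + 100)     # inner 'else'
    for shifter in shifters:
        shifter.pop()
        shifter.pop()
    return data

print(nested_example([3, 8, 12, 20]))
```

A PE that failed the outer condition stays masked through the inner 'else' because its outer bit remains 1 after the innermost bit is inverted.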
The great advantage of Kestrel's horizontal microcode is that the bit shifter functions for conditional operations (Table 3) can in large part be done concurrently with data processing, so that there is, for example, a 0-instruction cost for the 'else' of an if-else construct, and an ability to shift 8-bit data left or right concurrently with an ALU instruction.
5.3: Multiplier
The multiplier (Figure 8 ) is implemented using the modified Booth's algorithm. It can treat either operand as signed or unsigned by considering each to be nine bits long, and manipulating the ninth bit based on how that operand should be treated. The lower eight bits of the product are the result, and the upper eight bits can be stored internally by the multiplier in the MultHi latch. The contents of the MultHi latch can then be read out on the Operand B bus or added to a subsequent multiply. Both Operand C and MultHi can be added internally to the product with no danger of overflow [22] . The ability to perform a multiply-accumulate-accumulate greatly speeds multiprecision multiplication (Figure 8 ).
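The overflow-free multiply-accumulate-accumulate and its use in multiprecision multiplication can be sketched as follows. Only the unsigned case is modeled, the signed ninth-bit handling is omitted, and the names `mult_step` and `multiprecision_multiply` are hypothetical.

```python
def mult_step(a: int, b: int, c: int = 0, mult_hi: int = 0):
    """One unsigned 8 x 8 multiply with both accumulations.
    Returns (MultLo, MultHi). The sum never exceeds 16 bits because
    255 * 255 + 255 + 255 = 65535."""
    product = a * b + c + mult_hi
    return product & 0xFF, product >> 8

def multiprecision_multiply(x_bytes, y_bytes):
    """Schoolbook product of two little-endian byte lists, one
    multiply-accumulate-accumulate per pair of bytes."""
    acc = [0] * (len(x_bytes) + len(y_bytes))
    for j, y in enumerate(y_bytes):
        hi = 0
        for i, x in enumerate(x_bytes):
            # Operand C carries the partial sum, MultHi the carry byte.
            acc[i + j], hi = mult_step(x, y, acc[i + j], hi)
        acc[j + len(x_bytes)] = hi
    return acc

result = multiprecision_multiply([0x34, 0x12], [0x78, 0x56])  # 0x1234 * 0x5678
print(hex(int.from_bytes(bytes(result), "little")))
```

Folding both additions into the multiply means each byte pair costs a single step, with no separate carry-propagation instructions between partial products.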
The multiplier packs 2900 transistors in 1 162 075 λ², including bus drivers, operand latches, and control logic. The major design concern was finding an acceptable balance between speed and power consumption. To reduce dynamic power consumption, operand latches were introduced to prevent operands from changing when a multiply is not being performed. Also, as the upper eight bits do not have to be computed until the end of phase two, the upper carry chain could be slowed.
6: Chip Fabrications
Multiplier Test Chip
The multiplier test chip (Figure 9) was fabricated in the Orbit 2.0 µm process using a TinyChip standard pad frame. We received four chips and they all worked as expected, except for one problem not found in simulation. The clock and data lines were reversed on the nMOS part of the dynamic latches that controlled the addition of MultHi into the product of a subsequent multiply. The resulting charge sharing problem caused the addition of erroneous values into subsequent multiplications. The rest of the design was carefully scrutinized to ensure this mistake was not made again.
SRAM and Register Bank Test Chip
The memory test chip (Figure 9) measured 2.49 mm by 2.49 mm and had 18996 transistors. Of the 25 chips, one had a defect in two adjacent bits of the SRAM, and another had a defect that caused unpredictable behavior in the register bank. The SRAM could perform a read or write in approximately 7.5 ns, and the register bank could perform a read or write in 4 ns or less. Our ability to fully test the register bank speed was limited by our tester due to the arrangement of on-chip control lines. SPICE simulations of the SRAM in the HP 0.5 µm process indicated that it would be fast enough to operate with a system clock speed up to 66 MHz, with a read/write time of about 5.0 ns. SPICE simulations of the register bank in 0.5 µm indicated a read time of about 3 ns.
Kestrel Test Chip
The Kestrel PE test chip (Figure 10) has 51 966 transistors in 4.29 mm by 4.05 mm, and was fabricated in the HP 0.8 µm process. The test chip has two processing elements, three shared register banks, and instruction latches and decoders. This chip could be used interchangeably with the final chip, as it uses the same packaging, pin assignment, and functionality.
We fabricated 25 chips in the HP 0.8 µm process. ... in size, 80 percent larger than our original estimate. We estimated a need for 7 nF of on-chip distributed bypass capacitors, and the heat dissipation estimates indicate the chips may require heat sinks and fans. We expect to receive the chips in August, at which point we will test them and assemble the complete system.
7: Architectural Comparison
There is a great diversity among SIMD processing chips. In this section, we attempt a comparison of several typical architectures, both recent and planned. For simplicity, comparisons are based on performance per chip; it could be argued that performance per gate is more appropriate because this considers pure logic rather than access to the latest technology.
As with Kestrel, each of the machines described below has its target application areas, advantages, and disadvantages. This comparison does not highlight Kestrel's independent functional units, conditional evaluation, or programming model. Table 4 shows estimated performance of several chips on integer addition, multiplication, and MAC operations of various sizes in millions of operations per second (MOPS). The peak I/O bandwidth data relates to transfer between local buffer memory or data cache.

MGAP
The MGAP-2 is a very fine grained architecture. Each chip has 1536 1-bit PEs connected in an octagonal mesh. Each PE has a 16-bit dual-port local storage and two 3-input, 1-output function multiplexers for calculation. Based on published layout descriptions, the chips may have on the order of one half to two thirds of a million transistors. The configuration is stored in a local register but memory access is globally controlled. The system runs at 50 MHz. Performance is 6.8 kDCT/s (8 x 8 blocks of 8-bit pixels) for the MGAP-2 chip [20, 21].

RaPiD
The RaPiD system is the most hardware-oriented among these examples: it is a "coarse grained FPGA," with 16 cells per chip arranged in a linear array. Each cell consists of three 16-bit ALUs, one multiplier, six datapath registers, three 32 by 16-bit memory blocks, and some glue logic. The interconnection between these functional units, as well as between the cells, uses ten 16-bit segmented buses. A separate control path allows dynamic scheduling of the pipeline operations and some data path flow control. Transistor estimates are not yet available. The reference's estimate of fitting 16 cells on a chip running at 100 MHz will allow close to 1.6 million DCTs/sec (8 x 8 blocks of 8-bit pixels) [8].
8: Conclusions
Kestrel steers a course between special-purpose co-processors, reconfigurable-logic systems, and general-purpose supercomputers. Our design grew from an initial desire to build a small, fast, and inexpensive sequence analysis machine into a general purpose parallel accelerator. The design does not exist in a vacuum, however. Its evolution has been shaped in particular by several architectures including B-SYS [16], MasPar [30], Blitzen [2], and the unbuilt MISC machine [33], as well as MGAP and PIM [3, 12]. Contemporaneous sets of design choices can be found with the CNAPS [14], RaPiD [8], Rapid-2 (not related) [9], Samba [24], and SIMPil [10] architectures, among others.
The resulting design provides high performance at low cost on its prime target application, biological sequence analysis, as well as on other applications amenable to SIMD parallelization.
The design has been the result of a constant interplay between algorithm development, system design requirements, processing element architectures, and VLSI and board design constraints.
