Abstract: The Warp machine is a systolic array computer of linearly connected cells, each of which is a programmable processor capable of performing 10 million floating-point operations per second (10 MFLOPS). A typical Warp array includes ten cells, thus having a peak computation rate of 100 MFLOPS. The Warp array can be extended to include more cells to accommodate applications capable of using the increased computational bandwidth. Warp is integrated as an attached processor into a Unix host system. Programs for Warp are written in a high-level language supported by an optimizing compiler.

[...] accessed by a procedure call on the host, or through an interactive, programmable command interpreter called the Warp shell [8]. A high-level language called W2 is used to program Warp; the language is supported by an optimizing compiler [12], [23].

The Warp project started in 1984. A two-cell system was completed in June 1985 at Carnegie Mellon. Construction of two identical ten-cell prototype machines was contracted to [...]
There are three major components in the system: the Warp processor array (Warp array), the interface unit (IU), and the host, as depicted in Fig. 1. The Warp array performs the computation-intensive routines such as low-level vision routines or matrix operations. The IU handles the input/output between the array and the host, and can generate addresses (Adr) and control signals for the Warp array. The host supplies data to and receives results from the array. In addition, it executes those parts of the application programs which are not mapped onto the Warp array. For example, the host may perform decision-making processes in robot navigation or evaluate convergence criteria in iterative methods for solving systems of linear equations.

The Warp array is a linear systolic array with identical cells, called Warp cells, as shown in Fig. 1. Data flow through the array on two communication channels (X and Y). Those addresses for cells' local memories and control signals that are generated by the IU propagate down the Adr channel. The direction of the Y channel is statically configurable. This feature is used, for example, in algorithms that require accumulated results in the last cell to be sent back to the other cells (e.g., in back-solvers), or require local exchange of data between adjacent cells (e.g., in some implementations of numerical relaxation methods).

Each Warp cell is implemented as a programmable horizontal microengine, with its own microsequencer and program memory for 8K instructions. The Warp cell data path, as depicted in Fig. 2, consists of a 32-bit floating-point multiplier (Mpy), a 32-bit floating-point adder (Add), two local memory banks for resident and temporary data (Mem), a queue for each intercell communication channel (XQ, YQ, and AdrQ), and a register file to buffer data for each floating-point unit (AReg and MReg). All these components are connected through a crossbar. Addresses for memory access can be computed locally by the address generation unit (AGU), or taken from the address queue (AdrQ).

The Warp cell data path is similar to the data path of the Floating Point Systems AP-120B/FPS-164 line of processors [9], which are also used as attached processors. Both the Warp cell and any of these FPS processors contain two floating-point units, memory and an address generator, and are oriented towards scientific computing and signal processing. In both cases, wide instruction words are used for a direct encoding of the hardware resources, and software is used to manage the parallelism (that is, to detect parallelism in the application code, to use the multiple functional units, and to pipeline instructions). The Warp cell differs from these earlier processors in two key aspects: the full crossbar of the Warp cell provides a higher intracell bandwidth, and the X and Y channels with their associated queues provide a high intercell bandwidth, which is unique to the Warp array architecture.

The host consists of a Sun-3 workstation that serves as the master controller of the Warp machine, and a VME-based multiprocessor external host, so named because it is external to the workstation. The workstation provides a Unix environment for running application programs. The external host controls the peripherals and contains a large amount of memory for storing data to be processed by the Warp array. It also transfers data to and from the Warp array and performs operations on the data when necessary, with low operating system overhead.

Both the Warp cell and IU use off-the-shelf, TTL-compatible parts, and are each implemented on a 15 x 17 in² board. The entire Warp machine, with the exception of the Sun-3, is housed in a single 19 in rack, which also contains power supplies and cooling fans. The machine typically consumes about 1800 W.

III. WARP ARRAY ARCHITECTURE

In the Warp machine, parallelism exists at both the array and cell levels. This section discusses how the Warp architecture is designed to allow efficient use of the array level parallelism. Architectural features to support the cell level parallelism are described in the next section.

The key features in the architecture that support the array level parallelism are simple topology of a linear array, powerful cells with local program control, large data memory for each cell, and high intercell communication bandwidth. These features support several program partitioning methods important to many applications [21], [22]. More details on the partitioning methods are given in Section VII-B, and a sample of applications using these methods is listed in Section VIII.
A linear array is easier for a programmer to use than higher dimensional arrays. Many algorithms in scientific computing and signal processing have been developed for linear arrays [18]. Our experience of using Warp for low-level vision has also shown that a linear organization is suitable in the vision [...]

[...] of operating independently; it has its own program sequencer and program memory of 8K instructions. Moreover, each cell has 32K words of local data memory, which is large for systolic array designs. For a given I/O bandwidth, a larger data memory can sustain a higher computation bandwidth for some algorithms [20].

Systolic arrays are known to be effective for local operations, in which each output depends only on a small corresponding area of the input. The Warp array's large memory size and its high intercell I/O bandwidth enable it to perform global operations in which each output depends on any or a large portion of the input [21]. The ability of performing global operations as well significantly broadens the applicability of the machine. Examples of global operations are fast Fourier transform (FFT), image component labeling, Hough transform, image warping, and matrix computations such as LU decomposition or singular value decomposition (SVD).

Because each Warp cell has its own sequencer and program memory, the cells in the array can execute different programs at the same time. We call computation where all cells execute the same program homogeneous, and heterogeneous otherwise. Heterogeneous computing is useful for some applications. For example, the end cells may operate differently from other cells to deal with boundary conditions. Or, in a multifunction pipeline [13], different sections of the array perform different functions, with the output of one section feeding into the next as input.

IV. WARP CELL ARCHITECTURE

This section describes the design and the evolution of the [...]

[...] significantly revised when we reimplemented Warp on PC boards. For the wire-wrapped prototype, we omitted some architectural features that are difficult to implement and are not necessary for a substantial fraction of application programs [1]. This [...]

When a cell tries to read from an empty queue, it is blocked until a data item arrives. Similarly, when a cell tries to write to a full queue of a neighboring cell, the writing cell is blocked until some data are removed from the full queue. The blocking of a cell is transparent to the program; the state of all the computational units on the data path freezes for the duration the cell is blocked. Only the cell that tries to read from an empty queue or to deposit a data item into a full queue is blocked. All other cells in the array continue to operate normally. The data queues of a blocked cell are still able to accept input; otherwise, a cell blocked on an empty queue will never become unblocked.

The implementation of run-time flow control by hardware has two implications. First, we need two clock generators: one for the computational units whose states freeze whenever a cell is blocked, and one for the queues. Second, since a cell can receive data from either of its two neighbors, it can block as a result of the status of the queues in either neighbor, as well as its own. This dependence on other cells adds serious timing constraints to the design since clock control signals have to cross board boundaries. This complexity will be further discussed in Section V.

The intercell communication mechanism is the most revised feature on the cell; it has evolved from primitive programmable delay elements to queues without any flow control hardware, and finally to the run-time flow-controlled queues. In the following, we step through the different design changes.

1) Programmable Delay: In an early design, the input buffer on each communication channel of a cell was a programmable delay element. In a programmable delay element, data are latched in every cycle and they emerge at the output port a constant number of cycles later. This structure [...]

[...]
output (X);

In this program, the first cell removes a data item from the X queue (dequeue (X)) and sends it to the second cell on X (output (X)). The first cell then removes a second item, and forwards the result to the second cell after two cycles of computation. For this program, the second cell needs to be delayed by three cycles to ensure that the dequeue of the second cell never overtakes the corresponding output of the first cell, and the compiler will insert the necessary nops, as shown in Fig. 3.

In the above discussion of the example of Fig. 3, it was assumed that the control for the second cell to latch in input was sent with the output data by the first cell. If the second cell were to provide the input control signals, we would need to add an input operation in its microprogram for every output operation of the first cell, at exactly the cycle the operation takes place. Doing so, we obtain the following program for the second cell:

nop
input (X)
[...]

[...] space left over for other improvements as well.
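The three-cycle delay quoted for the two-cell example can be checked with a short calculation. The sketch below is illustrative only: the cycle-by-cycle schedule of the cell program is an assumed reading of the prose description (one operation per cycle, two compute cycles before the second output), the name `minimum_skew` is hypothetical, and it assumes that an item output in cycle t can be dequeued by the neighbor in cycle t at the earliest.

```python
def minimum_skew(sender_ops, receiver_ops):
    """Smallest start delay (in cycles) for the receiver such that its
    i-th dequeue never happens before the sender's i-th output."""
    out_cycles = [t for t, op in enumerate(sender_ops) if op == "output"]
    deq_cycles = [t for t, op in enumerate(receiver_ops) if op == "dequeue"]
    # Pair the i-th output with the i-th dequeue and take the worst case.
    gaps = [out - deq for out, deq in zip(out_cycles, deq_cycles)]
    return max([0] + gaps)


# Assumed schedule for both cells (one operation per cycle).
cell_program = ["dequeue", "output", "dequeue", "compute", "compute", "output"]

skew = minimum_skew(cell_program, cell_program)
# The compiler-inserted nops are modeled as a prefix on the second cell.
second_cell = ["nop"] * skew + cell_program
print(skew, second_cell)
```

Under these assumptions the first output trails its dequeue by one cycle and the second by three, so the worst case, three cycles, sets the delay of the second cell.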
[...] contains the Input operations to match the Output operations of the first cell, and the second column contains the original program.

Since the input sequence follows the control flow of the sender, each cell is logically executing two processes: the input process, and the original computation process of its own.

5) Queue Size: The size of the queues is an important factor in the efficiency of the array. Queues buffer the input for a cell and relax the coupling of execution in communicating cells. Although the average communication rate between two communicating cells must balance, a larger buffer allows the cells to receive and send data in bursts at different times.
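The effect of queue size on blocking can be illustrated with a toy cycle-level model of two neighboring cells. This is a minimal sketch under simplifying assumptions, not a model of the Warp hardware: each cell executes one operation per cycle, queue occupancy is sampled at the start of the cycle, and the names `run`, `producer`, and `consumer` are hypothetical. A cell that sends to a full queue or receives from an empty one simply retries in the next cycle, which mirrors the run-time flow control described in this section.

```python
from collections import deque


def run(producer, consumer, capacity, max_cycles=10_000):
    """Simulate two cells joined by one queue.

    Returns (total_cycles, producer_blocked_cycles, consumer_blocked_cycles).
    """
    queue = deque()
    p = c = 0                      # program counters of the two cells
    p_blocked = c_blocked = 0
    cycles = 0
    while (p < len(producer) or c < len(consumer)) and cycles < max_cycles:
        occupancy = len(queue)     # state seen by both cells this cycle
        if p < len(producer):
            if producer[p] != "send":
                p += 1             # local work, never blocks
            elif occupancy < capacity:
                queue.append(p)
                p += 1
            else:
                p_blocked += 1     # full queue: sender freezes
        if c < len(consumer):
            if consumer[c] != "recv":
                c += 1
            elif occupancy > 0:
                queue.popleft()
                c += 1
            else:
                c_blocked += 1     # empty queue: receiver freezes
        cycles += 1
    return cycles, p_blocked, c_blocked


# Both cells move 8 items in 16 operations, but in bursts at different times.
producer = (["send"] * 4 + ["work"] * 4) * 2
consumer = (["work"] * 4 + ["recv"] * 4) * 2

for capacity in (1, 2, 4):
    print(capacity, run(producer, consumer, capacity))
```

With a queue that holds a full burst (capacity 4 here) neither cell blocks and the run takes 16 cycles; with smaller queues both cells spend cycles blocked even though their average communication rates are identical.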
The long queues allow the compiler to adopt a simple code optimization strategy [23]. The throughput for a unidirectional array is maximized by simply optimizing the individual cell programs provided that sufficient buffering is available between each pair of adjacent cells. In addition, some algorithms, such as two-dimensional convolution mentioned above, require large buffers between cells. If the queues are [...]

These two processes must be merged into one since there is only one sequencer on each cell. If the programs on communicating cells are different, the input process and the cell's own computation process are different. Even if the cell programs are identical, the cell's computation process may need to be delayed with respect to the input process because of compile-time flow control as described above.

4) Randomly Accessible Queues: The queues in all the prototype machines are implemented with RAM chips, with hardware queue pointers. Furthermore, there was a feedback path from the data crossbar back to the queues, because we intended to use the queues as local storage elements as well [1]. Since the pointers must be changed when the queue is accessed randomly, and there is only a single pair of queue pointers, it is impossible to multiplex the use of the buffer as a communication queue and its use as a local storage element. Therefore, the queues in the production machine are now implemented by a FIFO chip. This implementation allows us [...]

[...]ing efficiently. In SIMD machines, branching is achieved by masking. The execution time is equivalent to the sum of the execution time of the then-clause and the else-clause of a branch. With local program control, different cells may follow different branches of a conditional statement depending on their individual data; the execution time is the execution time of the clause taken.

The Warp cell is horizontally microcoded. Each component in the data path is controlled by a dedicated field; this orthogonal organization of the microinstruction word makes scheduling easier since there is no interference in the schedule of different components.
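The difference between branching by masking and branching under local program control can be made concrete with a small cost model. The sketch below is illustrative: the function names and the cycle counts are hypothetical, and it encodes only the two rules stated in this section (a SIMD machine pays for both clauses, while a cell with its own sequencer pays for the clause it takes).

```python
def simd_branch_cycles(then_cycles, else_cycles):
    """Masked execution: every processor steps through both clauses."""
    return then_cycles + else_cycles


def local_control_branch_cycles(conditions, then_cycles, else_cycles):
    """Local program control: each cell runs only the clause it selects.

    Returns the per-cell cycle counts, one entry per condition.
    """
    return [then_cycles if taken else else_cycles for taken in conditions]


# Hypothetical branch: a 12-cycle then-clause and a 30-cycle else-clause,
# evaluated on five cells whose data select different clauses.
conditions = [True, False, True, True, False]
per_cell = local_control_branch_cycles(conditions, 12, 30)

print("SIMD:", simd_branch_cycles(12, 30))
print("local control, per cell:", per_cell)
```

In this example the masked version costs 42 cycles on every processor, whereas no locally controlled cell spends more than 30 cycles on the branch.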
C. Data Path

1) Floating-Point Units: Each Warp cell has two floating-point units, one multiplier and one adder, implemented with commercially available floating-point chips [35]. These floating-point chips depend on extensive pipelining to achieve high performance. Both the adder and multiplier have five-stage pipelines. General purpose computation is difficult to implement efficiently on deeply pipelined machines because data-dependent branching is common. There is less data depen-[...]

[...] [28]. In the interconnection chip designed for polycyclic architectures [28], a "queue" is associated with each cross point of the crossbar. In these storage blocks, data are always written at the end of the queue; however, data can be read, or removed, from any location. The queues are compacted automatically whenever data are removed. The main advantage of this design is that an optimal code schedule can be readily derived for a class of inner loops [27]. In the Warp cell architecture, we chose to use a conventional crossbar with data buffers only for its outputs (the AReg and MReg register files in Fig. 2), because of the lower hardware cost. Near-optimal schedules can be found cheaply using heuristics [23].

3) Data Storage Blocks: As depicted by Fig. 2, the local memory hierarchy includes a local data memory, a register file for the integer unit (AGU), two register files (one for each floating-point unit), and a backup data memory. Addresses for both data memories come from the address crossbar. The local data memory can store 32K words, and can be both read and written every (200 ns) cycle. The capacity of the register file in the AGU unit is 64 words. The register files for the floating-point units each hold 31 usable words of data. (The register file is written to in every cycle so that one word is used as a sink for those cycles without useful write operations.) They are five-ported data buffers and each can accept two data items from the crossbar and deliver two operands to the functional units every cycle. The additional ports are used for connecting the register files to the backup memory. This backup memory contains 2K words and is used to hold all scalars, floating-point constants, and small arrays. The addition of the backup memory increases memory bandwidth and improves throughput for those programs operating mainly on local data.

4) Address Generation: As shown in Fig. 2, each cell contains an integer unit (AGU) that is used predominantly as a local address generation unit. The AGU is a self-contained integer ALU with 64 registers. It can compute up to two addresses per cycle (one read address and one write address).

The local address generator on the cell is one of the enhancements that distinguish the PC Warp machine from the prototype. In the prototype, data independent addresses were generated on the IU and propagated down the cells. Data [...] is a new VLSI component (IDT-49C402), which combines the 64-word register file and ALU on a single chip. The large number of registers makes the backup table unnecessary for most addressing patterns, so that the AGU is much smaller and can be replicated on each cell of the production machine.

The prototype was designed for applications where all cells execute the same program with data independent loop bounds. However, not all such programs could be supported due to the size of the address queue. In the pipelining mode, where the cells implement different stages of a computation pipeline, a cell does not start executing until the preceding cell is finished with the first set of input data. The size of the address queue must at least equal the number of addresses and control signals used in the computation of the data set. Therefore, the size of the address queues limits the number of addresses buffered, and thus the grain size of parallelism.

For the production machine, each cell contains an AGU and can generate addresses and loop control signals efficiently. This improvement allows the compiler to support a much larger class of application. We have preserved the address generator and address bank on the IU (and the associated Adr channel, as shown in Fig. 1). Therefore, the IU can still support those homogeneous computations that demand a small set of complicated addressing patterns that can be conveniently stored in the address bank.

V. WARP CELL AND IU IMPLEMENTATION

The Warp array architecture operates on 32-bit data. All data channels in the Warp array, including the internal data path of the cell, are implemented as 16-bit wide channels operating at 100 ns. There are two reasons for choosing a 16-[...]
operating at 100 ns. There are two reasons for choosing a 16- Control of the external host is strictly centralized: the interfaces to other computers. Moreover, standard boards workstation, the master processor, issues commands to the provide a growth path for future system improvements with a cluster and support processors through message buffers local minimal investment of time and resources. During the to each of these processors. The two clusters work in parallel, transition from prototype to production machine, faster each handling a unidirectional flow of data to or from the processor boards (from 12 to 16 MHz) and larger memories Warp processor through the IU. The two clusters can have been introduced, and they have been incorporated into exchange their roles in sending or receiving data for different the host with little effort. phases of a computation, in a ping-pong fashion. An arbitration mechanism transparent to the user has been implemented A. Host I/O Bandwidth to prohibit simultaneous writing or reading to the Warp array when the clusters switch roles. The support processor controls
The Warp array can input a 32-bit word and output a 32-bit peripheral I/O devices and handles floating-point exceptions word every 200 ns. Correspondingly, to sustain this peak rate, and other interrupt signals from the Warp array. These each cluster must be able to read or write a 32-bit data item interrupts are serviced by the support processor, rather than by every 200 ns. This peak I/O bandwidth requirement can be the master processor, to minimize interrupt response time. satisfied if the input and output data are 8-bit or 16-bit integers Afte servicing the interrupt, the support processor notifies the that can be accessed sequentially. master processor.
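The peak rate stated here translates into a simple bandwidth figure, and the same arithmetic shows why 8-bit and 16-bit data ease the load on a cluster. The sketch below is a back-of-the-envelope calculation, not part of the Warp software: the function names are hypothetical, and it assumes that small integers are packed several to a 32-bit word on the host side and expanded to one 32-bit value each before reaching the array.

```python
CYCLE_NS = 200    # one 32-bit word enters (and leaves) the array per cycle
WORD_BITS = 32


def peak_bytes_per_second(word_bits=WORD_BITS, cycle_ns=CYCLE_NS):
    """Peak rate in one direction at one word per cycle."""
    return (word_bits // 8) * 1_000_000_000 // cycle_ns


def host_word_interval_ns(sample_bits, word_bits=WORD_BITS, cycle_ns=CYCLE_NS):
    """Time between 32-bit host transfers when samples are packed.

    Each packed word carries word_bits // sample_bits samples, and the
    array consumes one sample per cycle.
    """
    samples_per_word = word_bits // sample_bits
    return samples_per_word * cycle_ns


print(peak_bytes_per_second())                  # bytes per second, one direction
for bits in (32, 16, 8):
    print(bits, host_word_interval_ns(bits))    # ns between host word transfers
```

One 32-bit word per 200 ns is 20 million bytes per second in each direction. Under the packing assumption, a cluster supplying 16-bit data needs to deliver a word only every 400 ns, and every 800 ns for 8-bit data.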
In signal, image, and low-level vision processing, the input and output data are usually 16- or 8-bit integers. The data can be packed into 32-bit words before being transferred to the IU, which unpacks the data into two or four 32-bit floating-point numbers before sending them to the Warp array. The reverse operation takes place with the floating-point outputs of the Warp array. [...]

The external host is built around a VME bus. The two clusters and the support processor each consist of a standalone MC68020 microprocessor (P) and a dual-ported memory (M), which can be accessed either via a local bus or via the global VME bus. The local bus is a VSB bus in the production machine and a VMX32 bus for the prototype; the major [...]

Except for the switch, all boards in the external host are off-the-shelf components. The industry standard boards allow us to take advantage of commercial processors, I/O boards, memory, and software. They also make the host an open system to which it is relatively easy to add new devices and [...]

The Warp host has a run-time software library that allows the programmer to synchronize the support processor and two clusters and to allocate memory in the external host. The run-time software also handles the communication and interrupts between the master and the processors in the external host.

As mentioned in the Introduction, Warp is programmed in a language called W2. Programs written in W2 are translated by an optimizing compiler into object code for the Warp machine. W2 hides the low-level details of the machine and allows the user to concentrate on the problem of mapping an application onto a processor array. In this section, we first describe the language and then some common computation partitioning techniques.

A. The W2 Language

The W2 language provides an abstract programming model of the machine that allows the user to focus on parallelism at the array level. The user views the Warp system as a linear array of identical, conventional processors that can communicate asynchronously with their left and right neighbors. The semantics of the communication primitives is that a cell will block if it tries to receive from any empty queue or send to a full one. This semantics is enforced at compile time in the prototype and at run time in the PC Warp, as explained in Section IV-A-2.

The user supplies the code to be executed on each cell, and the compiler handles the details of code generation and scheduling. This arrangement gives the user full control over [...]

[...] for describing the cell code is Algol-like, with iterative and conditional statements. In addition, the language provides receive and send primitives for specifying intercell communication. The compiler handles the parallelism both at the system and cell levels. At the system level, the external host and the IU are hidden from the user. The compiler generates code for the host and the IU to transfer data between the host and the array. Moreover, for the prototype Warp, addresses and loop control signals are automatically extracted from the cell programs; they are generated on the IU and passed down the address queue. At the cell level, the pipelining and parallelism in the data path of the cells are hidden from the user. The compiler uses global data flow analysis and horizontal microcode scheduling techniques, software pipelining and hierarchical reduction to generate efficient microcode directly from high-level language constructs [12], [23].

Fig. 7 is an example of a 10 x 10 matrix multiplication program. Each cell computes one column of the result. We first load each cell with a column of the second matrix operand, then we stream the first matrix in row by row. As each row passes through the array, we accumulate the result for a column in each cell, and send the entire row of results to the host. The loading and unloading of data are slightly complicated because all cells execute the same program. Send and receive transfer data between adjacent cells; the first parameter determines the direction, and the second parameter selects the hardware channel to be used. The third parameter specifies the source (send) or the sink (receive). The [...]

[...] The system is solved by repeatedly combining the current values of u on a two-dimensional grid using the following [...]

[...] [13], [22]. The application effort has been increased since April 1987 when the first PC Warp machine was delivered to Carnegie Mellon.

The applications area that guided the development of Warp most strongly was computer vision, particularly as applied to robot navigation. We studied a standard library of image processing algorithms [30] and concluded that the great majority of algorithms could efficiently use the Warp machine. Moreover, robot navigation is an area of active research at Carnegie Mellon and has real-time requirements where Warp can make a significant difference in overall performance [32], [33]. Since the requirements of computer vision had a significant influence on all aspects of the design of Warp, we contrast the Warp machine with other architectures directed towards computer vision in Section VIII-B.

Our first effort was to develop applications that used Warp for robot navigation. Presently mounted inside of a robot vehicle, Warp has been used to perform road following and obstacle avoidance. We have implemented road following using color classification, obstacle avoidance using stereo vision, obstacle avoidance using a laser range-finder, and path planning using dynamic programming. We have also implemented a significant image processing library (over 100 programs) on Warp [30], to support robot navigation and vision research in general. Some of the library routines are listed in Table IV.

[...] use of it, including most of the low-level vision programs, the discrete cosine transform (DCT), singular value decomposition [2], connected component labeling [22], border following, and the convex hull. The last three algorithms mentioned also transmit information in other ways; for example, connected components labeling first partitions the image by rows among the cells, labels each cell's portion separately, and then combines the labels from different portions to create a global labeling.

2) Output Partitioning: In this model, each Warp cell processes the entire input data set or a large part of it, but produces only part of the output. This model is used when the input to output mapping is not regular, or when any input can influence any output. Histogram and image warping are examples of such computations. This model usually requires a lot of memory because either the required input data set must be stored and then processed later, or the output must be stored in memory while the input is processed, and then output later. Each Warp cell has 32K words of local memory to support efficient use of this model.

3) Pipelining: In this model, typical of systolic computation, the algorithm is partitioned among the cells in the array, and each cell performs one stage of the processing. The Warp array's high intercell communication bandwidth and effectiveness in handling fine-grain parallelism make it possible to use this model.
For some algorithms, this is the only method of achieving parallelism.

A simple example of the use of pipelining is the solution of elliptic partial differential equations using successive overrelaxation [36]. Consider the following equation:

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y). \]

A second interest was in using Warp in signal processing and scientific computing. Warp's high floating-point computation rate and systolic structure make it especially attractive for these applications. We have implemented singular value decomposition (SVD) for adaptive beam forming, fast two-dimensional image correlation using FFT, successive overrelaxation (SOR) for the solution of elliptic partial differential equations (PDE), as well as computational geometry algorithms such as convex hull and algorithms for finding the shortest paths in a graph.

[Table fragment: rows "Convert real image to integer using max, min linear scaling." and "Average 512x512 image to produce 256x256."; the numeric columns are not recoverable.]

Statistics have been gathered for a collection of 72 W2 programs in the application areas of vision, signal processing, and scientific computing [23]. Table IV gives an upper bound on the achievable performance and the achieved performance. The upper bound is obtained by assuming that the floating-point unit that is used more often in the program is the most used resource, and that it can be kept busy all the time. That is, this upper bound cannot be met even with a perfect compiler if the most used resource is some other functional unit, such as the memory, or if data dependencies in the computation prevent the most used resource from being used all the time.

Many of the programs in Tables III and IV [...]

[...] performance in several systems for robot navigation, signal processing, scientific computation, and geometric algorithms, [...] while it will no longer be the bottleneck. Third, since restrictions on using the Warp cells in a pipeline are removed in PC Warp as explained in Section IV-B-4, it will be possible to implement [...]

[...]sor arrays, such as the Connection Machine [34] and MPP [7]. We chose these architectures because they have also been used extensively for computer vision and image processing, and because the design choices in these architectures were made significantly differently than in Warp. These differences help exhibit and clarify the design space for the Warp architecture.

We attempt to make our comparison quantitative by using benchmark data from a DARPA Image Understanding ("DARPA IU") workshop held in November 1986 to compare various computers for vision [29]. In this workshop, [...]

[...] problems where a complex global computation is performed on a moderate-sized data set. In these problems, not much data parallelism is "available." For example, the DARPA IU benchmarks included the computation of the two-dimensional convex hull [26] of a set of 1000 points. The CM-1 algorithm used a brush-fire expansion algorithm, which led to an execution time of 200 ms for the complete computation. The same algorithm was implemented on Warp, and gave the 18 ms figure reported in Table III [...]

2) Processor I/O Bandwidth and Topology: Systolic arrays have high bandwidth between processors, which are organized in a simple topology. In the case of the Warp array, this is the simplest possible topology, namely a linear array. The interconnection networks in the Connection Machine allow flexible topology, but low bandwidth between communicating processors.

Bit-serial processor arrays may suffer from a serious [...] time. In many cases it is necessary to process an image which is stored in a frame buffer or host memory, which is easier in [...]

The floating-point processors in Warp aid the programmer in eliminating the need for low-level algorithmic analysis. [...]

[...]l's performance at this level exceeded Warp's by two orders of magnitude. However, specialized hardware must be used to eliminate a severe I/O bottleneck to actually observe this performance. The use of the router in the Connection Machine allows it to do well also at higher levels of vision, such as border following. We also see that the more general class of programming models and use of floating-point hardware in Warp give it good actual performance in a wide range of algorithms, especially including complex global computations on moderately sized data sets.

IX. CONCLUSIONS

The Warp computer has achieved high performance in a [...] cells' high degree of programmability and large local memory make up for the lack of higher dimensional connectivity. The high computation rate on each cell is matched by an equally high inter- and intracell bandwidth. The host system provides the Warp array with high I/O bandwidth. The [...]

The Warp machine has demonstrated the feasibility of programmable, high-performance systolic array computers. The programmability of Warp has substantially extended the machine's application domain. The cost of programmability is limited to an increase in the physical size of the machine; it does not incur a loss in performance, given appropriate architectural support. This is shown by Warp, as it can be programmed to execute many well-known systolic algorithms as fast as special-purpose arrays built using similar technology.

ACKNOWLEDGMENT

We appreciate the contributions to the Warp project by our colleagues and visitors at Carnegie Mellon: D. Adams, F. [...] Yam, and A. Zobel. We thank our industrial partners GE and Honeywell for their contribution towards the construction of the wire-wrapped prototypes. We appreciate the continued collaboration with GE for the development of the production Warp machine. In particular, [...]
