Using transformations that formalize the systolic designer's "bag of tricks," a prototype system converts nested-loop algorithms into efficient functional-level systolic designs.
Systolic arrays have been proposed as a cost-effective solution to many computation-intensive problems. They consist of simple cells operating synchronously, each communicating only with nearby cells. Their applications range from numeric tasks, such as signal and image processing and matrix arithmetic, to symbolic tasks, such as searching and sorting, graph algorithms, and relational databases.
The design process for systolic arrays is modeled here as a series of transformations, expressed in a language devised for clearly and concisely describing systolic arrays. A prototype program called Sys (for systolic design system) accepts a software algorithm and some advice and applies a series of transformations to produce a functional-level circuit description of a systolic design.
Systolic arrays
Systolic architectures are typically large, regular arrays of processor elements (Figure 1). Such an architecture is called systolic because data is pumped steadily through the array of cells, much like blood in the body. Unlike architectures that broadcast data to many points, systolic architectures can easily be scaled up to handle large problems. Their local interconnection schemes avoid the clock skew that arises when data is broadcast over paths of differing lengths. Also, the low fanout in a systolic array allows the signal drivers to be independent of the size of the array, so the size of a systolic array can be increased without altering other design parameters. These properties (low external memory bandwidth, simplicity, regularity, and local communication) make systolic architectures especially well-suited to implementation in VLSI. Systolic design is providing new or improved hardware solutions to many computation-intensive problems. Two solutions to the simple problem of polynomial evaluation will serve as examples of systolic arrays.
Two systolic designs for polynomial evaluation. Suppose we have the following polynomial:

$$P(x) = A_m x^m + A_{m-1} x^{m-1} + \cdots + A_0$$

We wish to evaluate $P(x_i)$ at points $x_i$, $1 \le i \le n$. By Horner's rule, the polynomial can be reformulated from a sum of powers into an alternating sequence of multiplications and additions:

$$P(x) = (\cdots((A_m x + A_{m-1})x + A_{m-2})x + \cdots)x + A_0$$
The value of $P(x_i)$ for each $x_i$, denoted $p_i$, is computed by an algorithm whose inner loop is

    for i from 1 to n do
      for j from m to 0 do
        p_i := p_i * x_i + A_j;

One systolic implementation is shown in Figure 2. In this design, referred to as Poly-I, there is one cell for each $x_i$. Each cell holds its $x_i$ value in one register, x, and accumulates the value of the polynomial in another.

Our methodology models cell behavior explicitly. We can manipulate the cell programs at the same time we design the data flow and interconnection of the array. This manipulation allows us to design complex systolic algorithms with data-dependent cell actions and variable data-flow speeds.
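As a minimal sketch, the doubly nested Horner loop given earlier can be written in Python as follows. The function name `evaluate_polynomial`, the list layout `coeffs[j] = A_j`, and the initial value `p = 0` are assumptions made for illustration; the article gives only the inner-loop statement.

```python
def evaluate_polynomial(coeffs, points):
    """Evaluate P at every point by Horner's rule.

    coeffs[j] holds A_j for j = 0..m (an assumed layout).
    Returns the list [P(x_1), ..., P(x_n)].
    """
    m = len(coeffs) - 1
    results = []
    for x in points:                  # for i from 1 to n
        p = 0                         # assumed initial value of p_i
        for j in range(m, -1, -1):    # for j from m to 0
            p = p * x + coeffs[j]     # p_i := p_i * x_i + A_j
        results.append(p)
    return results


# P(x) = 3x^2 + 2x + 1 evaluated at x = 0, 1, 2
print(evaluate_polynomial([1, 2, 3], [0, 1, 2]))  # prints [1, 6, 17]
```

Every iteration of the inner loop is the same multiply-and-add step, which is why a single cell type suffices in the systolic designs.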
A transformational approach
We model the design process as a series of transformations applied to an initial abstract algorithm for a computation. Instead of embarking on the formidable task of automating the process, we rely on the human designer to decide which transformations to try, and we let the machine perform the laborious manipulation required to apply them. This transformational paradigm was first suggested as a solution to the high cost of software maintenance [14]. By using a computer-aided transformational process to map a high-level specification into an implementation, we can capture all the design decisions in the machine. Changes in the design strategy or in the specification itself can be implemented simply by editing the recorded history of design decisions and replaying the modified history [15].

Our motivation for using the transformational paradigm is to help people create systolic designs. The "bag of tricks" produced by research in systolic design can be encapsulated as machine-applicable transformations. Moreover, the transformational approach offers a solution to the verification problem. A systolic design is typically remote from the abstract computation it implements, so its verification can be quite complicated.
However, if the design has been derived by applying a sequence of transformations known to preserve correctness, the derivation itself serves as a constructive proof of correctness with respect to the initial specification.

The algorithm must first undergo high-level transformations to be prepared for systolic implementation. For example, Horner's rule was used to transform polynomial evaluation into a series of multiply-and-add operations, which can be performed by the same type of cell. Some of these transformations are similar to those that adapt sequential programs for execution on general multiprocessor architectures [16, 17]. In systolic designs, further consideration must be given to the unique characteristic of data flowing through a regular interconnect rather than residing in a global memory. Failure to anticipate such concerns before translating the algorithm into a lower level design can force the designer to backtrack and revise the high-level algorithm. While the transformational paradigm cannot substitute for design insight in selecting a good algorithm initially, it can facilitate design evolution by assisting the transformation of successive versions of an algorithm. Sys assumes that the initial algorithm has undergone the necessary software transformations.
An algorithm ready for systolic implementation typically consists of highly repetitive computations, expressed in terms of nested loops and begin-end blocks. Sys knows how to map such constructs onto hardware; some transformations for mapping other constructs are described elsewhere [18]. Sys takes the algorithm and goes through a bottom-up process to design a systolic array. For each loop, it allocates hardware, schedules the computation, and optimizes.
The allocation phase generates a description of the necessary primitive functional units and allocates them to operations in the algorithm. The user guides this phase by annotating each begin-end block and loop. The annotation in place tells Sys to use the same hardware for all the statements in the block or iterations of the loop, while the annotation in parallel tells Sys to use separate hardware for each one.
The scheduling phase decides when each operation can safely be performed.
The optimization phase makes the design truly systolic by inserting registers, adding data connections, and adjusting the schedule. The user can guide this phase by selecting which optimizing transformations to apply.
Description of systolic circuits. To facilitate machine transformations, we divide the description of a systolic circuit into its structure and its driver. The structure describes the hardware cells and how they interconnect; the driver describes the format of the data streams to and from the circuit.
Structure. The notation for describing the structure is similar to a programming language. A hardware module corresponds to a procedure. The description is hierarchical; composite modules are made up of submodules, which can themselves be composite modules. The submodules are analogous to local procedures. Local registers correspond to "own" variables, which retain their states between calls.
A module's interface to the external world consists of input and output ports. A scalar port inputs or outputs ...

Consider, for example, the specification of the Poly-II module, as shown in Figure 4. (The italicized information is the driver specification.) This figure shows that the Poly-II module has two input ports, x_in and p_in; an output port, p_out; and an array of m + 1 primitive Multiply-Add modules, each of which has two input ports, x_in and p_in; two output ports, x_out and p_out; and one local register, A.
The body of the module description specifies the behavior of the module in each clock cycle. For a primitive module, the implementation of its behavior is not important to the aspects of systolic design addressed here. Therefore, the computation performed in each clock cycle is defined abstractly by an Algol-like notation relating the new values (outputs and new register contents) to the inputs and old register contents. We assume a two-phase clocking scheme such that, in phase 1, a module reads from its registers and input ports and, in phase 2, it updates its registers and writes into its output ports. Each call to the module corresponds to a clock cycle.
In the example of Poly-II, Multiply-Add is a primitive module. In phase 1 of each clock cycle or procedure call, p_in * x_in + A is computed, and the result is written into p_out in phase 2. The value of x_in is also read in phase 1 and written into x_out in phase 2. All output ports retain their values until they are written into again in the second phase of the next clock cycle, when the next procedure call occurs.
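The two-phase behavior of the Multiply-Add primitive can be sketched in Python as below. This is an illustrative model, not the notation Sys uses: the class name, the `phase1`/`phase2` method names, the private `_latched` field, and the initial port values of 0 are all assumptions.

```python
class MultiplyAdd:
    """Sketch of a primitive cell with local register A and two output ports."""

    def __init__(self, a):
        self.A = a          # local register, retained between clock cycles
        self.x_out = 0      # output ports (initial values are assumed)
        self.p_out = 0
        self._latched = (0, 0)

    def phase1(self, x_in, p_in):
        # Phase 1: read the input ports and the register, compute new values.
        self._latched = (x_in, p_in * x_in + self.A)

    def phase2(self):
        # Phase 2: write the computed values into the output ports.
        self.x_out, self.p_out = self._latched


cell = MultiplyAdd(a=5)
cell.phase1(x_in=2, p_in=3)
print(cell.p_out)              # still 0: outputs hold until phase 2
cell.phase2()
print(cell.x_out, cell.p_out)  # prints 2 11
```

Separating the two phases keeps the old output values visible while every cell reads its inputs, which is what allows all submodules of a composite module to be "called" in parallel.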
A composite module is an encapsulation of a group of submodules. The function of the entire module can be modeled as invoking, in parallel, all the local procedures corresponding to the submodules. In each clock cycle, all submodules receive data from their input ports simultaneously in phase 1 of the common clock and write data into their output ports in phase 2. The interconnections among submodules are modeled as parameter bindings. Each input port of a submodule corresponds to a formal parameter of a local procedure; whatever is connected to the input port corresponds to the actual parameter.
The specification of Poly-II serves as an example of the syntax for describing composite modules. Parallel invocation of submodules is specified by putting the corresponding procedure calls inside a "Parallel begin . . . end" construct. The specification states that the inputs of the leftmost module, Multiply-Add[m], come directly from the outside and that the input ports of every other submodule are connected to the output ports of the submodule to its left.
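The interconnection just described can be simulated with a short Python sketch. It assumes a driver that is not shown in this text: one point x_i enters per clock cycle together with p_in = 0, and zeros are fed afterwards to flush the array. The function name `poly2_simulate` and the list-based representation of ports are likewise hypothetical.

```python
def poly2_simulate(coeffs, points):
    """Simulate a linear array Multiply-Add[m] ... Multiply-Add[0].

    coeffs[j] holds A_j. Cell k (counting from the left) stores A_{m-k}.
    """
    m = len(coeffs) - 1
    a_regs = [coeffs[j] for j in range(m, -1, -1)]   # left to right
    x_out = [0] * (m + 1)
    p_out = [0] * (m + 1)
    results = []
    for t in range(len(points) + m):
        # Phase 1: all cells read at once. The leftmost cell reads the
        # external ports; every other cell reads its left neighbor's outputs.
        x_ext = points[t] if t < len(points) else 0   # assumed flush value
        x_in = [x_ext] + x_out[:-1]
        p_in = [0] + p_out[:-1]                       # assumed driver: p_in = 0
        # Phase 2: all cells write their output ports.
        x_out = x_in
        p_out = [p * x + a for p, x, a in zip(p_in, x_in, a_regs)]
        if t >= m:                                    # pipeline is full
            results.append(p_out[-1])
    return results


print(poly2_simulate([1, 2, 3], [0, 1, 2]))  # prints [1, 6, 17]
```

Under these assumptions the first result emerges after m + 1 cycles, and one further result emerges in each cycle after that.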
The notation used in the definition of a module follows standard programming language conventions.

Sequential, parallel, and pipelined scheduling, illustrated in Figure 6, are the three most common schemes in systolic design.
Sequential. Consider the following loop:

    for i from 1 to n in place do y_i := F(y_i);
The annotation in place indicates that this loop is to be implemented on a single hardware cell that computes function F. The data collision constraint dictates that the loop must be computed sequentially, since the cell can compute F for only one input value at a time. As Figure 6a shows, the cell inputs the elements $y_1, \ldots, y_n$ in successive clock cycles and outputs the updated value for each element in the second phase of the same clock cycle.
Parallel. If we change the in place annotation to in parallel, Sys generates n modules, each of which computes one iteration of the loop. The data collision constraint is trivially satisfied. Since the iterations are independent of each other, they can be scheduled simultaneously without violating the data dependency constraint (Figure 6b). The ith cell inputs $y_i$ and outputs its updated value.
Pipelined. Consider the following loop:

    for i from 1 to n in parallel do x := F(x);

Although n modules are allocated for computing this loop, the data dependency constraint precludes computing the n loop iterations simultaneously, since the value of x at the beginning of each iteration depends on the value computed during the previous iteration. However, a pipelined implementation is possible, as illustrated in Figure 6c. The initial value of x is input to the leftmost cell. Each cell applies the function F to its input and outputs the result to its right. After n steps, the rightmost cell outputs the final value of x. The delay is the same as for the sequential implementation; however, the pipelined structure permits overlapping execution of the loop for different initial values of x, accepting a new input as often as it can compute the function F. Pipelining, therefore, increases throughput by a factor of n relative to the sequential in place implementation.
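The throughput claim for pipelining can be illustrated with a small Python simulation. This is a sketch under stated assumptions: each cell computes F in exactly one clock cycle, `None` marks an empty pipeline slot (so F itself must never return `None`), and the function name `pipeline_run` is hypothetical.

```python
def pipeline_run(f, n, inputs):
    """Push each input through n cells that each apply f once per cycle.

    Returns (outputs, cycles_used).
    """
    stages = [None] * n          # output port of each cell; None = empty
    pending = list(inputs)
    outputs = []
    cycles = 0
    while len(outputs) < len(inputs):
        feed = pending.pop(0) if pending else None
        # Each cell reads the output of the cell to its left.
        stage_inputs = [feed] + stages[:-1]
        stages = [f(v) if v is not None else None for v in stage_inputs]
        cycles += 1
        if stages[-1] is not None:
            outputs.append(stages[-1])
    return outputs, cycles


outs, cycles = pipeline_run(lambda v: v + 1, 3, [10, 20])
print(outs, cycles)  # prints [13, 23] 4
```

With k initial values and n cells, the sketch finishes in n + k - 1 cycles, whereas a single cell reused in place would need n * k cycles under the same one-cycle-per-F assumption. For large k the ratio approaches n, matching the factor stated above.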
Optimization. After scheduling, Sys has a correct, but inefficient, functional-level implementation of the original algorithm. It next tries to improve the design by applying correctness-preserving transformations suggested by the user.
In the optimization phase, Sys tries to reduce external memory accesses and, in general, to minimize communication between each module and its environment. Sometimes external data access can be replaced with local storage; for example, if an input value is used in many operations, it can be input into the system once and kept until it is no longer needed.
Preload-repeated-value:
If an input stream consists of a repeated value, replace it with a preloaded register.
Similarly, if a computed value is only a temporary result to be used in some other computation, the system can save the value until it is used, rather than outputting it and reading it back in later.
Replace-feedback-with-register:
... Figure 7b. This process is then repeated for a total of m + 1 cycles.
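The effect of the two storage transformations named above, Preload-repeated-value and Replace-feedback-with-register, can be sketched on a single multiply-add cell. The Python below is illustrative only: it assumes the coefficients arrive one per cycle on a port called `a_in`, that the accumulator starts at 0, and all function and class names are hypothetical.

```python
def run_with_external_streams(x, coeffs_high_to_low):
    """Before the transformations: the driver re-supplies x on every cycle
    and feeds the cell's p output back into its p input."""
    p_feedback = 0                       # assumed initial value
    for a in coeffs_high_to_low:
        x_in, p_in = x, p_feedback       # two external accesses per cycle
        p_feedback = p_in * x_in + a
    return p_feedback


class RegisterCell:
    """After the transformations: x is preloaded into a register and the
    fed-back value lives in an internal register p."""

    def __init__(self, x):
        self.x = x      # Preload-repeated-value
        self.p = 0      # Replace-feedback-with-register (assumed initial 0)

    def clock(self, a_in):
        self.p = self.p * self.x + a_in
        return self.p


coeffs = [3, 2, 1]                       # A_2, A_1, A_0
cell = RegisterCell(x=2)
for a in coeffs:
    cell.clock(a)
print(run_with_external_streams(2, coeffs), cell.p)  # prints 17 17
```

Both versions compute the same value; the transformed cell simply needs one external input per cycle instead of three.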
Retime-to-eliminate-broadcasting:
If an input stream is broadcast to several unconnected cells, propagate each data item in the stream from one 
Optimize inner loop. Since the output from p_out is fed back into p_in, the Replace-feedback-with-register transformation applies here. As Figure 7c shows, this rule modifies Multiply-Add by replacing the p_in and p_out ports with an internal register, p, and by deleting the streams for the eliminated ports. Next, Sys identifies the data stream Sequence [j from m to 0] of $x_i$ as a repeated value by noticing that $x_i$ is independent of the loop index j. Therefore, the Preload-repeated-value transformation applies. This rule modifies the structure by replacing the input port x_in with an internal register, x, and deletes the input stream for x_in from the driver. Figure 7d shows the optimized implementation of the inner loop. At this point, the design description is
Module Multiply-Add;

Finally, Sys applies the Retime-to-eliminate-broadcasting rule. The input stream is unchanged, but it now goes only to the leftmost cell instead of being broadcast to all the cells. Figure 2 shows the final Poly-I design:
Module Poly-I;
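The behavior of a retimed Poly-I array can be approximated in Python as follows. This is a sketch, not the Sys module description: it assumes the coefficient stream enters the leftmost cell highest-order first and moves one cell to the right per clock cycle, that each accumulator starts at 0, and that `None` marks cycles in which no coefficient has reached a cell. The function name `poly1_simulate` is hypothetical.

```python
def poly1_simulate(coeffs, points):
    """One cell per point; coefficients ripple left to right.

    coeffs[j] holds A_j. Returns the accumulator of every cell.
    """
    m = len(coeffs) - 1
    n = len(points)
    stream = [coeffs[j] for j in range(m, -1, -1)]   # A_m enters first
    p = [0] * n                   # accumulator register of each cell
    a_out = [None] * n            # coefficient each cell passes rightward
    for t in range(m + n):
        a_ext = stream[t] if t < len(stream) else None
        # Cell 0 reads the external stream; cell i reads cell i-1's output.
        a_in = [a_ext] + a_out[:-1]
        for i, a in enumerate(a_in):
            if a is not None:
                p[i] = p[i] * points[i] + a   # x_i sits in a preloaded register
        a_out = a_in
    return p


print(poly1_simulate([1, 2, 3], [0, 1, 2]))  # prints [1, 6, 17]
```

Compared with broadcasting, cell i sees the same coefficient sequence delayed by i cycles, so the last result is ready after m + n cycles rather than m + 1, while every connection stays local.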
Derivation of Poly-II. The Poly-II design shown in Figure 3 is the result of starting with different annotations in the same algorithm:

    for i from 1 to n in place do
      for j from m to 0 in parallel do
        p_i := p_i * x_i + A_j;
Implement loop body. The first step, implementing the body of the inner loop, is the same as for Poly-I. The resulting Multiply-Add module is shown in Figure 7a.

The major contribution of this work is a transformational model of systolic design. In our design model, software transformations are first applied to put the algorithm to be implemented into a regular form conducive to systolic implementation. The steps of allocating operations to hardware, scheduling their execution, and optimizing the design are then performed bottom-up, starting with the innermost blocks of the algorithm. We have successfully used this model to rederive several published designs, and it appears suitable for designing complex systolic arrays. This model may help guide manual design, explain systolic algorithms, or capture the design process in the machine where it can benefit from effective automated support.
Transformations offer a convenient way to formalize systolic design expertise. They are typically very simple, and their effects are close to the designer's intent. New ones can readily be added to incorporate new design techniques.
An equally important contribution of this research is a natural notation for describing systolic designs. It preserves the structure of the original algorithm by showing clearly which operations are algorithmically related, even though they may be radically redistributed in time and space in the final design. Together, the Wave, Sequence, skewed, and delayed constructs make it easy to express the standard systolic communication patterns. Similarly, the quantified notation for arrays of ports and modules allows succinct specification of replicated structure.
Splitting the description into structure and driver has proved advantageous. Factoring transformations into their effects on structure and driver makes the transformations easy to represent and implement. Describing the hardware separately from how it is to be used makes structure descriptions context-independent and hence easier to combine into composite structures. The hierarchical representation of structure makes complex designs more manageable.
Our transformational model and notation are the basis of our prototype system, Sys. While Sys itself does not purport to be a practical tool for systolic design, it demonstrates the feasibility of a transformational design process that combines human creativity with the capability of machines to perform detailed manipulations.

... and their colleagues at CMU and ISI and to Mario Barbacci, Allan Fisher, and Peter Highnam for reading early drafts. This research was supported in part by DARPA Contract MDA-903-81-C-0335.
