Abstract
Introduction
Like other application-specific architectures, the proposed reconfigurable array architecture serves as a medium for mapping algorithms onto hardware. Statement blocks in the inner loops of high-performance applications should be evaluated as fast as possible. Mono-processor systems need some kind of reconfigurable coprocessor to accelerate such blocks. Reconfigurable wavefront arrays are well suited to such applications since their data-driven computation is self-timed and pipelining can easily be performed [5]. The highly parallel I/O requirements of a wavefront array make it difficult to connect such an array to a bus-oriented mono-processor system. This leads to the idea of integrating the bus as an auxiliary structure into the array. To distinguish this architecture from conventional wavefront arrays, it is called reconfigurable datapath architecture (rDPA). In this architecture the decentralized control of a systolic array is combined with centralized control for the I/O operations.
Mapping statement blocks or algorithms onto wavefront arrays is done, for example, by simulated annealing with a trade-off between minimizing area (the number of processing elements) and performance [7]. These methods map the data flow graph onto the target array and have to consider local data dependencies. In the rDPA all statements can be mapped separately, since variables without local dependencies can be routed to different places in the array using the bus. The bus introduces a performance drawback since the I/O operations cannot be performed in parallel. Optimized scheduling must ensure that the input data arrives in an optimal sequence for high performance.
Although the proposed rALU can be used for any bus-oriented host-based system, it was originally built for the Xputer prototype Map-oriented Machine 3 (MOM-3). Many applications require the same data manipulations to be performed on a large amount of data, e. g. statement blocks in nested loops. Xputers are especially designed to reduce the von Neumann bottleneck of repetitive decoding and interpreting of address and data computations. Xputers require memory accesses for data only, whereas von Neumann computers require memory accesses for instructions as well. In contrast to von Neumann machines, an Xputer architecture strongly supports the concept of the "soft ALU" (rALU).
First, the paper gives an overview of the hardware and software environment. The rDPA array is described in section 3. Section 4 explains the data-driven rALU which uses the rDPA. The programming environment allows operands and conditions to be mapped to the rDPA automatically (section 5). The scheduling algorithm is described in detail. Finally, some benchmark results are shown and the paper is concluded.
1063-686W94 $4.00 © 1994 IEEE
Hardware and software environment
The hardware of an Xputer consists of three main parts: the data memory, the data sequencer and the reconfigurable ALU (rALU). The data memory contains all the data necessary for the computation. It is organized as a two-dimensional data map. The data sequencer provides access to the data. The term data sequencing derives from the fact that the sequence of data, instead of a von Neumann instruction sequence, triggers the operations in the rALU. The most essential parts of the data sequencer are the generic address generators (GAGs). Each GAG can produce address sequences which correspond to loops under hardware control, and each GAG updates one scan window. The scan windows contain all the data which are accessed or modified by the subnets. They are a kind of window onto the data memory. The reconfigurable ALU is the data manipulator of the Xputer. It consists of several subnets. Each subnet has parallel access to the scan window. The complex operator in the subnet is reconfigurable. Subnets need not be of the same type. Subnets can be configured for arithmetic or bit-level operations (figure 1).
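As a rough illustration of an address sequence that corresponds to a nested loop, the following sketch enumerates the positions of a scan window moved row by row over a two-dimensional data map. The function names, the row-wise scan order, and the step parameters are assumptions made for this example; the actual GAG hardware may support other scan patterns.

```python
def scan_positions(width, height, step_x=1, step_y=1):
    """Yield (x, y) scan window positions for a row-wise scan.

    Hypothetical model: equivalent to two nested loops that a GAG
    would execute under hardware control.
    """
    for y in range(0, height, step_y):
        for x in range(0, width, step_x):
            yield (x, y)


def to_linear(x, y, row_length):
    """Map a position in the 2-D data map to a linear memory address (assumed row-major)."""
    return y * row_length + x


positions = list(scan_positions(3, 2))
addresses = [to_linear(x, y, row_length=3) for x, y in positions]
```

In this model no instruction fetch is involved: the sequence of positions alone determines which data reach the rALU.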
Configuration of the rDPA array and its support chip is done with the rALU programming language. The syntax of the statements follows the C programming language syntax (see also figure 6). In addition, the language specifies the size of the scan windows used and the next handle position, which is the lower left corner of the boundary of the scan window. The handle position gives the information necessary for pipelining the complete statement block in the rALU.
Programs in the rALU programming language can be generated by a compiler from the C programming language or written by hand. The C compiler [9] arranges the way data is distributed in the data memory, so that data can be accessed efficiently. It also generates the code for the data sequencer to perform the loops.
Reconfigurable array architecture
The reconfigurable rDPA array architecture has been designed for the evaluation of any arithmetic and logic expression from a high-level description. It consists of a regular array of identical processing elements called datapath units (DPUs). Each DPU has two input and two output registers. Data flows only from west and/or north to east and/or south. The operation of the DPUs is data-driven, which means that an operation is evaluated as soon as the required operands are available. The communication between two neighbouring DPUs is synchronized by a handshake, as in wavefront arrays. This avoids problems of clock skew, and each DPU can have a different computation time for its operator. A problem occurs with the integration of multiple DPUs into an integrated circuit because of the high I/O requirements of the processing elements. To reduce the number of input and output pins, a serial link is used for data transfer between neighbouring DPUs on different chips, as shown in figure 2. The 32 bits of the parallel internal communication path are converted into a series of two-bit nibbles. The DPUs belonging to the converters are able to perform their operations independently of the conversion. Using a serial link reduces the speed of the communication, but simulation results showed that with pipelining the latency is increased whereas the throughput of the pipeline is decreased only slightly. Internally the full datapath width is used. For the user this serial link is completely transparent. A global I/O bus has been integrated into the rDPA array, permitting the DPUs to write from the output registers directly to outside the array and to read directly from outside. This means that input data for expressions mapped onto the rDPA do not need to be routed through the DPUs.
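The conversion between the 32-bit internal path and the serial link can be pictured with a small sketch. It only models the data format: a 32-bit word becomes 16 two-bit nibbles and is reassembled on the other chip. The transmission order (least significant nibble first) and the function names are assumptions; the text does not specify them.

```python
WORD_BITS = 32    # width of the internal communication path
NIBBLE_BITS = 2   # width of one serial transfer


def serialize(word):
    """Split a 32-bit word into two-bit nibbles, least significant first (assumed order)."""
    mask = (1 << NIBBLE_BITS) - 1
    return [(word >> shift) & mask for shift in range(0, WORD_BITS, NIBBLE_BITS)]


def deserialize(nibbles):
    """Reassemble the word from its nibbles on the receiving chip."""
    word = 0
    for index, nibble in enumerate(nibbles):
        word |= nibble << (index * NIBBLE_BITS)
    return word


nibbles = serialize(0xDEADBEEF)
restored = deserialize(nibbles)
```

Under this model each word crossing a chip boundary takes 16 transfers, which is why the link adds latency while a filled pipeline can still keep the throughput close to that of the internal path.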
A traditional wavefront array has decentralized control, whereas the rDPA array has centralized control for the I/O operations and decentralized control for the execution of the expressions. A single bus is sufficient for our prototype implementation since it is targeted at a bus-oriented host system with a single memory bus. The communication between the control unit and the DPUs is synchronized by a handshake, like the internal communications. Each DPU which has to communicate over the bus gets an address during configuration. The rDPA control unit can access a DPU directly using the bus. The DPU whose address matches performs the handshake with the control unit and receives or sends the data. Most of the communication occurs during the computations and uses the internal communication paths. It is therefore performed completely in parallel.
The operators of the DPUs are configurable. A DPU is implemented using a fixed ALU and a microprogrammable control, as shown in figure 3. This means that operators such as addition, subtraction, or logical operators can be evaluated directly, whereas multiplication and division are implemented sequentially. New operators can be added by using a microassembler.
In addition to expressions, the rDPA can also evaluate conditions. Each communication channel has an additional condition bit. If this bit is true, the operation is computed, otherwise it is not. In both cases the condition bit is routed with the data using the same handshake. The 'false' path is evaluated very quickly, because only the condition bit has to be routed. As mentioned before, the array is extendible by using several chips of the same type. The DPUs have no address before configuration since all chips are identical. The DPUs are identified by their location in the rDPA array. Consequently each DPU has an x- and a y-address, like the elements of a matrix. A configuration word contains a configuration bit, which distinguishes configuration data from computational data, the x- and the y-address, the address of the DPU's configuration memory, and the data for this memory.
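A minimal sketch of the condition bit mechanism follows. Each value travelling between DPUs is modelled as a token that carries the data word together with its condition bit. How the condition bits of two operands are combined is not stated in the text; the sketch assumes that the operator is evaluated only if both bits are true. The names `Token` and `fire` are hypothetical.

```python
from operator import add
from typing import NamedTuple


class Token(NamedTuple):
    value: int          # data word on the communication channel
    cond: bool = True   # condition bit routed with the data


def fire(op, a, b):
    """Evaluate op if the condition holds, otherwise only forward the condition bit."""
    if a.cond and b.cond:          # assumed combination rule
        return Token(op(a.value, b.value), True)
    return Token(0, False)         # 'false' path: no computation, bit is passed on


taken = fire(add, Token(2), Token(3))
skipped = fire(add, Token(2, False), Token(3))
```

The skipped case does no arithmetic at all, which mirrors why the 'false' path is fast.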
Each time a configuration word is transferred to a DPU, the DPU checks the x- and the y-address. Four cases can occur:
- the y-address is larger than zero and the x-address is larger than zero
- the y-address is larger than zero and the x-address is zero
- the y-address is zero and the x-address is larger than zero
- both the y-address and the x-address are zero
In the first case the DPU checks whether the neighbouring DPUs are busy. If the neighbouring DPU in y-direction is not busy, the y-address is decreased by one and the resulting configuration word is transferred to this DPU. If the DPU in y-direction is busy and the DPU in x-direction is not busy, the x-address is decreased by one and the resulting configuration word is transferred to this DPU. If both neighbouring DPUs are busy, the DPU waits until one of them finishes. With this strategy an automatic load distribution for the configuration is implemented.
Internally the configuration words are distributed over the whole array, and several serial links are used to configure the rest of the chips. An optimal sequence of the configuration words can be determined since these can be interchanged arbitrarily. In the second case, the y-address is decreased by one and the configuration word is transferred to the next DPU in y-direction. In the third case, when the y-address is zero and the x-address is larger than zero, the x-address is decreased by one and the configuration word is transferred in x-direction. In the last case, when both addresses are zero, the target DPU has been reached, and the address of the DPU's configuration memory gives the place where the data will be written.
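The four routing cases described above can be summarized in a short sketch. It models only the decision a single DPU takes for one configuration word; the function names and the returned tuples are illustrative assumptions. Busy neighbours are considered only in the first case, as in the text.

```python
def route_step(x_addr, y_addr, y_busy=False, x_busy=False):
    """Decide what a DPU does with a configuration word.

    Returns (action, new_x_addr, new_y_addr), where action is one of
    'y', 'x', 'wait' or 'deliver'.
    """
    if y_addr > 0 and x_addr > 0:      # case 1: choose a free neighbour
        if not y_busy:
            return ("y", x_addr, y_addr - 1)
        if not x_busy:
            return ("x", x_addr - 1, y_addr)
        return ("wait", x_addr, y_addr)
    if y_addr > 0:                     # case 2: only y remains
        return ("y", x_addr, y_addr - 1)
    if x_addr > 0:                     # case 3: only x remains
        return ("x", x_addr - 1, y_addr)
    return ("deliver", 0, 0)           # case 4: target DPU reached


def trace(x_addr, y_addr):
    """Follow a word through an idle array and list the actions taken."""
    actions = []
    while True:
        action, x_addr, y_addr = route_step(x_addr, y_addr)
        actions.append(action)
        if action == "deliver":
            return actions


path = trace(2, 1)
```

Because the addresses are relative and are decremented hop by hop, no DPU needs to know its absolute position before configuration.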
Because of the load distribution in the rDPA array, one serial link at the array boundary is sufficient to configure the complete array. The physical chip boundaries are completely transparent to the user. The communication structure allows dynamic in-circuit reconfiguration of the rDPA array. This implies partial reconfigurability during runtime [6]. Partial reconfigurability is provided since all DPUs can be accessed individually. Configurability during runtime is supported because each DPU gives forwarding a configuration word higher priority than starting its next operation. The load distribution ensures that most of the configuration words avoid the part of the rDPA array which is in normal operation. Furthermore, the configuration technique allows designs to be migrated from a smaller array to a larger array without modification. Even newer-generation rDPA chips with more DPUs integrated do not need a recompilation of the configuration data. The configuration is data-driven, and therefore special timing does not have to be considered.
With the proposed model for the rDPA, the array can also be expanded across printed circuit board boundaries, e. g. with connectors and flexible cable. It is therefore possible to connect the outputs of the east (south) array boundary with the west (north) one to build a torus.
The rDPA array used as a reconfigurable ALU
With the rDPA, a programmable support chip for bus-oriented systems is provided. Together they form a data-driven reconfigurable ALU (rALU). The support chip consists of a control unit, a register file, and an address generation unit for addressing the DPUs (figure 5).
Figure 5. The reconfigurable data-driven ALU
The register file is useful for optimizing memory cycles, e. g. when one data word of a statement will be used later on in another statement. Then the data word does not have to be read again over the external bus. In addition, the register file makes it possible to use each DPU in the rDPA for operations by using the internal bus for routing. If different expressions have a common subexpression, this subexpression has to be computed only once. If the rDPA does not provide the routing capacity for this reduction, e. g. if a subexpression is common to three or more expressions, the interim result can be routed through the register file.
The address generation unit delivers the address for the DPU registers before each data word is written into the rDPA over the bus. The addresses of the DPU registers are configured in such a way that the address usually only has to be increased by one, but it can also be loaded directly from the rDPA control unit. The rDPA control unit holds a program to control the different parts of the data-driven rALU. The instruction set consists of instructions for loading data from the external units into a specific DPU of the rDPA array, for receiving data from a specific DPU, and for branching on a special control signal from the host. The rDPA control unit supports context switches between three control programs, which allows the use of three independent virtual rALU subnets, all configured in the same rDPA array. The control program is loaded at configuration time. The reconfigurable data-driven ALU also allows pipelined operations.
A status can be reported to the host to inform about overflows, or to force the host to deliver data dependent addresses. The input FIFO is currently only one word deep for each direction. The datapath architecture is designed for an asynchronous bus protocol, but it can also be used on a synchronous bus with minor modifications of the external circuitry.
Programming the rDPA array
Statements which can be mapped to the rDPA array are arithmetic and logic expressions, and conditions. The input language for programming the rALU including the rDPA array is the rALU programming language. The syntax of the statements follows the C programming language syntax. Part of an rALU programming language example is shown in figure 6. First the rALU program is parsed. A data dependency analysis is performed to recognize possible parallelization and to find dependencies between the statements. The statements are combined into larger expressions, and a data structure which is a kind of abstract program tree is built (figure 7). Then the data structure is mapped onto the rDPA array structure. The mapping algorithm starts at the leaf cell nodes of the data structure for each expression. These nodes are assigned to DPUs in a first line of the rDPA array. A second line starts if there is a node of higher degree in the data structure. The degree of a node increases if both sons of the node are of the same degree. After that the mapped structure is shrunk by removing the nodes which are used only for routing. There are several possibilities for the mapping of each expression. Finally the mapped expression with the smallest size is chosen. Figure 7 shows an example of the mapping. Now the mapped expressions are allocated in the rDPA array, starting with the largest expression. If the expressions do not fit onto the array, they are split up using the global I/O bus for routing. If the number of required DPUs is larger than the number of DPUs provided by the array, the array has to be reconfigured during operation. Although this allocation approach gives good results, future work will optimize this algorithm to incorporate the scheduling process for advance timing forecasts.
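The degree rule used by the mapping algorithm can be illustrated with a small sketch. An expression is represented as a nested tuple `(operator, left, right)` with variable names as leaves. The text only states that the degree increases when both sons have the same degree; the sketch additionally assumes that leaves have degree 0, that all operators are binary, and that a node otherwise inherits the larger degree of its sons.

```python
def degree(node):
    """Return the degree of a node in a binary expression tree."""
    if not isinstance(node, tuple):
        return 0                      # assumption: leaves have degree 0
    _, left, right = node
    d_left, d_right = degree(left), degree(right)
    if d_left == d_right:
        return d_left + 1             # both sons of equal degree: degree increases
    return max(d_left, d_right)       # assumption: otherwise keep the larger degree


# (a + b) * c: the sons of '*' have different degrees, so no increase
unbalanced = ("*", ("+", "a", "b"), "c")
# (a + b) * (c - d): both sons have degree 1, so the root has degree 2
balanced = ("*", ("+", "a", "b"), ("-", "c", "d"))
```

Under these assumptions a balanced expression needs more lines of the array than a chain with the same number of leaves, which is one reason several mappings per expression are compared.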
Due to the global I/O bus of the rDPA array, the loading and the storing of data are restricted to one operation at a time. An optimal sequence of these I/O operations has to be
Figure 9a shows the final schedule of the program example.
In time step 10 no I/O operation is performed. If the statement block of the example is evaluated several times, the global I/O bus can be fully used by pipelining the statement block. The pipeline is loaded up to step 9. Then the variable d from the next block is loaded before the output variables a, i and f are written back. The statement block is computed several times (steps 10 to 21, figure 9c) until the host signals the rALU control to end the pipeline.
Step 22 to the end is performed, and the next operators can be configured onto the rDPA array.
The rDPA configuration file is computed from the mapping information of the processing elements and a library with the microprogram code of the operators. 
Results
The prototype implementation of the rDPA array works with 32-bit fixed-point and integer input words. Currently the memory of the host computer is very slow. The clock frequency of the system is 25 MHz. In many applications the coefficients, e. g. in filter implementations, are set up in such a way that shift operations are sufficient and multiplications are not necessary. If high throughput is needed, the DPU processing elements can be linked together as shown in figure 11. The 'h' are the coefficients and 'x[ ]' is the input data stream. The DPUs for the coefficients 'h0' to 'hN-1' first perform a multiplication and route the result to the south, and then route the 'x' input word to the next DPU to the east. At the beginning all DPUs with multiplications are preloaded with zero via the I/O bus to fill the pipeline. Then the 'x' input words have to be provided at the west input only. The filter produces a new valid output word every 500 ns by using shifts instead of multiplications. The bubblesort works with a linear chain of 'scan-max' operators, producing a new sorted data word every 240 ns. The speed of examples 2 to 5 does not depend on the order of the filter as long as the necessary hardware (number of DPUs) is provided. The same is true for example 6.
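The arithmetic of such a filter, with shifts in place of multiplications, can be sketched as follows. The sketch assumes that every coefficient is a power of two, h[k] = 2^(-shifts[k]), so that each product becomes a right shift of an integer word. It models only the computed values, not the handshake timing or the physical layout of the DPUs; `fir_shift` and `shifts` are hypothetical names.

```python
def fir_shift(samples, shifts):
    """FIR filter with power-of-two coefficients, h[k] = 2**-shifts[k]."""
    taps = [0] * len(shifts)          # DPUs are preloaded with zero to fill the pipeline
    outputs = []
    for x in samples:
        taps = [x] + taps[:-1]        # each DPU passes its 'x' word to the east
        # each DPU shifts its word; the partial results are summed (routed south)
        outputs.append(sum(t >> s for t, s in zip(taps, shifts)))
    return outputs


# two taps: h0 = 1 (shift 0), h1 = 1/2 (shift 1)
result = fir_shift([8, 16, 0], [0, 1])
```

As the text notes, adding taps adds DPUs rather than time per output word, so in this model the order of the filter affects only the length of `shifts`.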
Bubblesort, length n   scan-max   n-1   2   240 ns/data word
Conclusions
A reconfigurable wavefront array, the rDPA (reconfigurable datapath architecture), for the evaluation of any arithmetic and logic expression has been presented. Pipelining is supported by the architecture. The word orientation of the datapath and the increased granularity of the basic operations greatly simplify the automatic mapping onto the architecture. The extendible rDPA provides parallel and pipelined evaluation of the compound operators. The rDPA array was originally built for the Xputer prototype MOM-3, but it can also be used as a reconfigurable ALU for bus-oriented host-based systems as well as for rapid prototyping of high-speed datapaths. The architecture is in-circuit dynamically reconfigurable, which also implies partial reconfigurability at runtime.
A prototype chip with standard cells has been completely specified in the hardware description language Verilog and will be submitted for fabrication soon. It has 32-bit datapaths and provides arithmetic resources for integer and fixed-point numbers. The programming environment is specified and is being implemented on Sun SPARCstations.
Future work is the implementation of a pipelined version of the processing elements of the rDPA and the optimization of the mapping algorithm.
