Abstract - High-performance computation on a large array of cells has been an important feature of systolic arrays. To achieve an even higher degree of concurrency, it is desirable to make the cells of a systolic array systolic arrays themselves. A systolic array whose cells each consist of another systolic array is called a super-systolic array.
I. Introduction
VLSI has made the implementation of system hardware, and even of highly parallel array processors, economically feasible and technically realizable [1]. A systolic array [2] [3] [4] [5], formed by interconnecting a set of identical data-processing cells in a uniform manner, is a combination of an algorithm and a circuit that implements it, and is conceptually closely related to the arithmetic pipeline. In a systolic array, data words flow from external memory in a rhythmic fashion, passing through many cells before the results emerge from the array's boundary cell and return to external memory. Upon receiving data words, each cell performs the same operation and synchronously transmits the intermediate results and data words to adjacent cells.
High-performance computation over a large array of cells has been an important feature of systolic arrays. To achieve an even higher degree of concurrency, it is desirable to make the cells of a systolic array systolic arrays themselves. We refer to this hierarchical architecture, in which a systolic array consists of other systolic arrays, as a super-systolic array.
In this paper we propose a scalable super-systolic array architecture that delivers high performance and can be adopted directly in VLSI design, since it retains the regular, local interconnections and functional primitives that are typical of a systolic architecture. A cell of the systolic array derived from the given behavior through projection and scheduling of the dependence graph (DG) [1] can itself be designed as another systolic array. This systolization procedure can be applied repeatedly, in a hierarchical manner, until a cell containing only primitive operators is obtained.
We choose convolution as the example for demonstrating the derivation of a super-systolic array and for checking the resulting performance improvement, because it is a simple problem with a variety of enlightening systolic solutions and, more importantly, is representative of a wide class of computations suited to systolic designs.
The systolic array for convolution can be thought of as a logical systolic array, in the sense that the array assumes every operation completes in one unit delay so that a rhythmic data flow is maintained. However, the difference in the delay required by different operators may not be negligible, and forcing the unit delay to equal the longest operator delay results in a low-performance systolic array. To make the assumption reasonable, that is, to transform the logical systolic array into a virtual systolic array, a time-consuming operation such as multiplication should be implemented with more primitive operations. This strategy calls for implementing, inside each cell of the logical systolic array, another systolic array that performs the complex operation. We refer to the virtual systolic array that results from this strategy as a super-systolic array.
The derived super-systolic array for convolution is modeled and simulated at the RT level using VHDL, then synthesized to a schematic, and finally implemented using the cell library based on m 1-poly 4-metal CMOS technology.
II. Super-Systolic Array for Convolution

A. Systolic Array for Convolution
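The problem of convolution is defined as follows [1]: Given two sequences u(i) and w(i), i = 0, 1, …, N-1, the convolution of the two sequences is
$$y(i) = \sum_{k=0}^{N-1} u(k)\, w(i-k), \quad i = 0, 1, \ldots, 2N-2,$$
where w(i-k) is taken as zero whenever i-k lies outside the range 0, 1, …, N-1.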
The convolution problem can be viewed as a problem of combining two data streams, w(i)'s and u(i)'s, in a certain manner to form a resultant data stream of y(i)'s.
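As a reference point for the array designs that follow, the definition can be sketched directly in Python. This is only an illustrative sketch of the standard linear convolution; the function name `convolve` is hypothetical and does not come from the paper.

```python
def convolve(u, w):
    """Direct linear convolution of two length-N sequences.

    Returns y(0..2N-2), where y(i) is the sum over k of u(k) * w(i - k)
    and terms with i - k outside 0..N-1 are skipped (treated as zero).
    """
    n = len(u)
    if len(w) != n:
        raise ValueError("u and w must have the same length N")
    y = [0] * (2 * n - 1)
    for i in range(2 * n - 1):
        for k in range(n):
            if 0 <= i - k < n:
                y[i] += u[k] * w[i - k]
    return y


if __name__ == "__main__":
    print(convolve([1, 2, 3], [4, 5, 6]))
```

Each output y(i) combines one pass over the u stream with the w stream in reverse order, which is the stream-combining view described above.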
The rectangular DG for convolution is shown in Fig. 1(a). Note that the w(i) coefficients along the columns remain unchanged. This means that the coefficient w(i) may be a stored constant in the ith processor, as shown in Fig. 1(b). We first apply the systolization procedure to the convolution signal flow graph (SFG) [1] in Fig. 1(b). In Fig. 2(a) the convolution SFG is shown along with the cut-sets. If we scale the delay by a factor of two, i.e., D → 2D΄, then one delay can be transferred from the left-going edges to the right-going edges in the cut-sets, leading to the systolic array for convolution shown in Fig. 2(b). The pipeline period, α, is two for this convolution array. The systolic array consists of identical, linearly connected processing elements, or cells, as depicted in Fig. 2(b). The internal structure of each cell is shown in Fig. 2(c). Each cell contains a multiplier and an adder. In a high-complexity system the area restriction is crucial, which leads to a need for a systolic-array-based implementation of area-consuming operators such as the multiplier [6] [7] [8]. The multiplier in each cell of the systolic array for convolution is a natural candidate for systolization and, as proposed in this paper, should be implemented as a systolic array.
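The behavior of such a linear array can be illustrated with a small cycle-level simulation. The sketch below is written under assumptions, since the figures are not reproduced here: the weights stay in the cells, the u stream moves right and the y stream moves left, each through one register per cell, and a new input is fed every second cycle (pipeline period α = 2). The direction assignment and the name `systolic_convolve` are assumptions made for illustration only.

```python
def systolic_convolve(u, w):
    """Cycle-level sketch of a linear systolic convolution array.

    Assumed data flow: w[k] stays in cell k, u moves right, y moves left,
    one register delay per cell in each direction, one input every 2 cycles.
    """
    cells = len(w)
    n_out = len(u) + cells - 1
    x_reg = [0] * cells  # registered u value leaving each cell (to the right)
    y_reg = [0] * cells  # registered partial sum leaving each cell (to the left)
    out = []
    for t in range(2 * n_out):
        # External input arrives on even cycles only; odd cycles carry a zero.
        x_ext = u[t // 2] if t % 2 == 0 and t // 2 < len(u) else 0
        new_x = [0] * cells
        new_y = [0] * cells
        for k in range(cells):
            x_in = x_ext if k == 0 else x_reg[k - 1]
            y_in = 0 if k == cells - 1 else y_reg[k + 1]
            new_x[k] = x_in                  # pass the data word on unchanged
            new_y[k] = y_in + w[k] * x_in    # multiply-accumulate in the cell
        x_reg, y_reg = new_x, new_y          # all cells latch synchronously
        if t % 2 == 0:
            out.append(y_reg[0])             # a result leaves cell 0 every 2 cycles
    return out


if __name__ == "__main__":
    print(systolic_convolve([1, 2, 3], [4, 5, 6]))
```

Because the two streams move toward each other, a given u word meets a given partial sum only on every other cycle, which is why valid outputs appear once per two cycles in this sketch.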
B. Systolic Array Multiplier
In a high-complexity system the area restriction is crucial and affects the final performance of the system. The systolic multiplier allows us to obtain high processing speed with limited resource consumption.
The DG for an M-bit multiplier performing p(i) = u(i)w(i) is shown in Fig. 3 for the case M = 4. This DG can be safely projected in the ij-direction, [1 1], and the default schedule is used. The data flow pattern divides the DG into an upper and a lower part, as shown by the dashed line in Fig. 3, and each part results in a systolic array with a different interconnection. At the same time, output data are produced at every node, which results in a large number of output ports in the SFG generated by the ij-projection. To circumvent this problem, the index space [9] of the DG can be extended so that the outputs occur only at points that are mapped to the boundary nodes of the SFG. The DG modified by this index space extension is shown in Fig. 4. When the modified DG is projected along the ij-direction, an SFG with input and output ports on the boundary nodes only is obtained. The resulting systolic array for the 4-bit multiplier, with one input port and one output port, is shown in Fig. 5. Output data emerge from the rightmost node, and the array uses M cells. The multiplicands stay in the cells, while the multipliers and the results move in opposite directions.
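To make the bit-level decomposition concrete, the following sketch evaluates an M-bit product on an M × M grid of nodes, each consisting of an AND gate and a full adder. It models only the unprojected dependence structure of a ripple-carry array multiplier and is an assumption for illustration; the exact node arrangement of Fig. 3, the index space extension, and the projected M-cell array are not reproduced. The names `full_adder` and `dg_multiply` are hypothetical.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum bit, carry-out bit)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def dg_multiply(u, w, m=4):
    """Multiply two unsigned m-bit integers on an m x m grid of bit-level nodes.

    Node (i, j) forms the partial-product bit u_i AND w_j and adds it, with a
    carry rippling along row j, into the running sum bit at position i + j.
    """
    u_bits = [(u >> i) & 1 for i in range(m)]
    w_bits = [(w >> j) & 1 for j in range(m)]
    acc = [0] * (2 * m)              # running product bits, LSB first
    for j in range(m):               # one row per multiplier bit
        carry = 0
        for i in range(m):           # one node per multiplicand bit
            pp = u_bits[i] & w_bits[j]
            acc[i + j], carry = full_adder(acc[i + j], pp, carry)
        acc[j + m] = carry           # row carry-out lands in a still-empty position
    return sum(bit << k for k, bit in enumerate(acc))


if __name__ == "__main__":
    print(dg_multiply(13, 11))
```

Every node uses only primitive operators, which is the property that allows the multiplier to be mapped onto an array of simple cells.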
C. Super-Systolic Array
Making the cells of a systolic array systolic arrays themselves results in an even higher degree of concurrency and even lower resource consumption. We refer to the resulting array as a super-systolic array.
An example of a super-systolic array for convolution is depicted in detail in Fig. 6. Each cell of the systolic array for convolution contains a multiplier and an adder. To obtain higher processing speed and smaller area, the multiplier is in turn designed as a systolic array, which makes the systolic array for convolution a super-systolic array. A cell of a super-systolic array that consists of another systolic array is referred to as a super-cell. The internal structure of every cell is identical and is shown in Fig. 6.
To compare the performance of the systolic array for convolution shown in Fig. 2 with that of the super-systolic array for convolution shown in Fig. 6, we implement each of them on an XCV200 device, which has an approximate gate count of 220,000 and 2352 slices [10]. The inputs, u(i) and w(i), are each set to 16 bits in this implementation. Fig. 7 and Fig. 8 show the implementation of each design, and the implementation reports are listed in Table 1.
From the results, we can see that the design using the super-systolic array utilizes chip resources more efficiently and shows even higher performance than the design using the systolic array.
III. Simulation, Synthesis, and Implementation
The systolic array multiplier and the super-systolic array for convolution were each modeled and simulated at the RT level using VHDL [11], and synthesized to a schematic using Synopsys Design Compiler [12] [13].
The simulation result for the 4-bit systolic multiplier, obtained with the Synopsys VHDL simulator, is shown in Fig. 9. The schematic of the super-systolic array for convolution synthesized with Synopsys Design Compiler is shown in Fig. 11, and the internal structure of its cell, i.e., the super-cell, is shown in Fig. 12. Note that the super-cell of the super-systolic array for convolution contains another systolic array, namely the systolic multiplier in this design. The schematic of the systolic multiplier and the internal structure of its cell are shown in Fig. 13 and Fig. 14.
Fig. 11. Synthesized schematic for super-systolic convolution array
IV. Conclusions
This paper demonstrates a super-systolic array performing convolution. Making the cells of a systolic array systolic arrays themselves yields high performance by bringing about a high degree of concurrency. We refer to this hierarchical architecture, in which a systolic array consists of other systolic arrays, as a super-systolic array.
High-performance real-time signal processing calls for enhanced concurrent computational capability. The systolic array offers a promising solution to this computational need and presages a technological breakthrough in signal/image processing applications. The super-systolic array approach is valuable for signal/image processing applications, in which data rates are usually very high and the computational requirements are extremely demanding, because fundamental DSP operations such as convolution/correlation, FFTs, and FIR or IIR filters can be implemented using systolic arrays.
Designs based on the super-systolic array architecture are simple, modular, and expandable, and they yield high performance. Research on this architecture is particularly worthwhile given that VLSI makes the implementation of systolic array chips feasible.
