Abstract: A technique for mapping systolic FIR filter banks onto fixed-size processor arrays is presented. It is based on the time-sharing properties of c-slow circuits. The technique can be developed further into a formalism and holds high potential for automatic realization. It has been applied to the mapping of systolic filter banks onto a fixed-size array of Transputers.
Introduction
Since the explicit formulation of the notion 'systolic array' [1-6], a great number of systolic algorithms have been proposed for different computational problems. (The reader is referred to the monograph [7] for an extensive introduction to systolic algorithms and arrays, and for many examples.) In particular, convolution and its major application, finite impulse response (FIR) filtering, have been among the problems most exhaustively studied. Numerous systolic algorithms have been given for these problems, operating both on the word and on the bit level. Since we are primarily concerned with the implementation of systolic algorithms on (appropriate) parallel computers, we focus here on the word-level algorithms [4-18].
One of the major problems in implementing systolic algorithms is that they usually require a processor array whose size depends on the size of the problem to be solved. For instance, the convolution of an input digital signal with a set of N coefficients would typically require a systolic array of N/2 or N cells/processors. However, unless a special purpose multiprocessor system is built for a specific application, it should be considered as a mere coincidence, if the number of processors required by a systolic array model of the algorithm equals the number of processors of the parallel computer which is available for implementation. In most practical cases, the former number is (much) greater than the latter one. The problem becomes even more complex when a whole filter bank with many channels having coefficient sets of different size needs to be implemented in a processor array of fixed size and fixed topology.
In this paper, it is shown how systolic FIR filter banks of arbitrary size and structure can be efficiently implemented in a processor array of fixed size and fixed topology. The approach is based on the theory of so-called c-slow circuits and gives a discipline for mapping systolic FIR filter banks onto a linear processor array of fixed size [16, 17]. Since this approach is based on a set of well-defined rules, it is very suitable for automatic realization. The paper is organized as follows: Elements of the theory of c-slow circuits are presented in the next section. Then the technique is illustrated on several examples of filter banks. Some characteristics of the technique are summarized in the last section.
Using C-slow Circuits
The mapping technique proposed here is actually a time-sharing method for the use of logic circuits. It will be illustrated on automata consisting of combinatorial circuits and registers (delay elements). The automata abstraction should, however, not be considered as a proposal for hardware implementation, but rather as a model of an algorithm. This kind of algorithm representation is widely used in the literature, and it is especially useful and comprehensible for the representation of parallel systolic algorithms [7]. We illustrate the technique on a simple example, since it might be more illuminating than giving a general theory.

Figure 2 shows how the method described above can be applied to array structures. The array shown in Figure 2a consists of three independent accumulators. These three parts of the array are active on three independent accumulation processes: $y_t = \sum_{i \le t} x_i$, $y'_t = \sum_{i \le t} x'_i$, and $y''_t = \sum_{i \le t} x''_i$, respectively. A 3-slow version of one accumulator can execute all three accumulation processes (Figure 2b). It is active on the first, second, and third accumulation process in the clock cycles for which $t \bmod 3 = 0$, $t \bmod 3 = 1$, and $t \bmod 3 = 2$ holds, respectively. In this way, a c-slow version of an appropriate (1/c)-th part of a homogeneous array can execute the task of the whole array. (In doing this, it is, however, c times slower than the original circuit.) The method can be generalized for the case in which the parts of the circuit are interconnected. To take account of the interconnections, feedbacks and multiplexers should be added to the model. For this case and for further details on the technique, the reader is referred to [17].

Figure 3 shows a three-cell systolic array for implementing the convolution (FIR filter) $y_k = \sum_{i=0}^{N-1} a_i x_{k-i}$ of an input digital signal $x_k$, $k = \ldots, -1, 0, 1, \ldots$, with a set $\{a_i \mid i = 0, 1, \ldots, N-1\}$ of three coefficients ($N = 3$). The (combinatorial) function of one cell is specified in Figure 3a.
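The time sharing of the 3-slow accumulator (Figure 2b) can be sketched in a few lines of Python. This is a minimal behavioural model, not the authors' implementation: the names `c_slow_accumulator` and `interleave` are hypothetical, the register chain is modelled with a `deque`, and the input is assumed to arrive with the sample of process j in every clock cycle for which t mod c = j.

```python
from collections import deque


def interleave(*streams):
    """Merge equally long streams so that cycle t carries stream t mod c."""
    return [x for group in zip(*streams) for x in group]


def c_slow_accumulator(interleaved, c):
    """One adder whose single register is replaced by a chain of c registers.

    The value leaving the chain in cycle t was written in cycle t - c,
    so it belongs to the same accumulation process (t mod c).
    """
    chain = deque([0] * c)          # chain of c registers, all initially 0
    outputs = []
    for x in interleaved:
        total = chain.popleft() + x  # running sum of process t mod c
        chain.append(total)
        outputs.append(total)
    return outputs


x0, x1, x2 = [1, 2, 3], [10, 20, 30], [100, 200, 300]
y = c_slow_accumulator(interleave(x0, x1, x2), c=3)
print(y[0::3], y[1::3], y[2::3])  # running sums of x0, x1 and x2
```

Every third output belongs to the same accumulation process, which is why the single adder needs three times as many clock cycles as the three independent accumulators of Figure 2a.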
The implementation of this algorithm on the target (three-processor) system is straightforward: cells are mapped onto processors, and delay elements are implemented as locations in the private memory or local communication memory of the respective processing nodes. In the case of Transputer implementation, cells and delay elements are realized as sequentially replicated (OCCAM) processes which execute concurrently and communicate through channels.

Figure 4 shows a 4-slow version (c = 4) of the array shown in Figure 3. Each register of the array in Figure 3 has been replaced by a chain of four registers. The number of clock cycles between consecutive input/output operations is increased four times. The 4-slow systolic array shown in Figure 4 is capable of executing four different convolutions concurrently. That is, by time sharing, it can do the work of four systolic arrays like the one in Figure 3 (in doing this, it is, however, four times slower). This property will next be used for the realization of systolic filter banks.
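The effect of replacing each register by a chain of c registers can be illustrated with a small simulation. The sketch below is an assumption-laden simplification: it models a direct-form (tapped delay line) filter rather than the particular cell arrangement of Figures 3 and 4, which is not reproduced here, and it assumes that the coefficients used by the cells are selected by t mod c. The function names are hypothetical.

```python
def direct_fir(x, a):
    """Reference convolution y_k = sum_i a_i * x_(k-i), with x_k = 0 for k < 0."""
    return [sum(a[i] * x[k - i] for i in range(len(a)) if k - i >= 0)
            for k in range(len(x))]


def interleave(*streams):
    """Merge equally long streams so that cycle t carries stream t mod c."""
    return [x for group in zip(*streams) for x in group]


def c_slow_fir(interleaved, coeff_sets):
    """Tapped delay line in which every register is a chain of c registers.

    coeff_sets[j] holds the coefficients used in cycles with t mod c == j.
    """
    c, n = len(coeff_sets), len(coeff_sets[0])
    line = [0] * (c * (n - 1) + 1)   # line[d] is the sample from d cycles ago
    outputs = []
    for t, x in enumerate(interleaved):
        line = [x] + line[:-1]       # shift the whole delay line by one cycle
        a = coeff_sets[t % c]
        # taps sit c registers apart, so they only see samples of one stream
        outputs.append(sum(a[i] * line[i * c] for i in range(n)))
    return outputs


streams = [[1, 2, 3, 4], [5, 6, 7, 8], [0, 1, 0, 1], [2, 2, 2, 2]]
coeffs = [[1, 2, 3], [1, 0, -1], [2, 2, 2], [0, 0, 1]]
y = c_slow_fir(interleave(*streams), coeffs)
for j in range(4):
    print(y[j::4] == direct_fir(streams[j], coeffs[j]))
```

Because the taps are c registers apart, samples of different streams never meet in the same multiply-accumulate, so the four convolutions stay independent while sharing the same three multipliers.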
The first example is a filter bank of just one channel/filter. The number of coefficients of this filter is, however, greater than the number of processor nodes available in the target system, so that a one-to-one mapping of coefficients onto processors is not possible. Figure 5a shows a block diagram of a 12-cell systolic array for the convolution with 12 coefficients. The systolic array itself is assumed to be of the kind shown in Figure 3 and to consist of 12 cells, one cell for each coefficient. Such an array can be decomposed into four connected parts (subarrays), each of which is identical with the array shown in Figure 3. Figure 5b illustrates how a 4-slow version of one 3-cell array can do the work of the 12-cell array by executing the operations of subarrays 0, 1, 2, and 3 in the clock cycles for which $t \bmod 4 = 0$, $t \bmod 4 = 1$, $t \bmod 4 = 2$, and $t \bmod 4 = 3$ holds, respectively. The multiplexers in front of the array transfer input data when the control bit is 1. Otherwise, the feedbacks are enabled. In this way, a 12-coefficient systolic convolver is realized by a 3-cell systolic array. The transformed algorithm model uses only three cells and can thus be directly mapped onto an array of three processor nodes. (In practice, it is more convenient to use one more processor node which carries out the function of the multiplexers used in the model and which is also used as an interface to the host computer. Thus a ring configuration is actually used.)

Figure 5: Using a 3-cell, 4-slow systolic array for a 12-coefficient convolver
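The interplay of multiplexers and feedbacks in the 12-coefficient example can be sketched behaviourally as follows. The model is an assumption, since the cell-level timing of Figure 5b is not available here: in every cycle the 3-cell array is treated as one block that adds three products to an incoming partial sum and passes on the input sample delayed by three slot-local steps. The per-slot lists stand in for the chains of four registers, and all names are hypothetical.

```python
def direct_fir(x, a):
    """Reference convolution y_k = sum_i a_i * x_(k-i), with x_k = 0 for k < 0."""
    return [sum(a[i] * x[k - i] for i in range(len(a)) if k - i >= 0)
            for k in range(len(x))]


def time_shared_fir(x, a, cells=3):
    """Run a len(a)-coefficient filter on `cells` cells by time sharing."""
    c = len(a) // cells                       # slowdown, assumes len(a) % cells == 0
    delay = [[0] * cells for _ in range(c)]   # per-slot state (register chains)
    fb_x = fb_y = 0                           # values on the feedback lines
    outputs = []
    for t in range(c * len(x)):
        slot = t % c                          # subarray executed in this cycle
        control = 1 if slot == 0 else 0       # 1: take input data, 0: feedback
        x_in, y_in = (x[t // c], 0) if control else (fb_x, fb_y)
        window = [x_in] + delay[slot][:-1]    # newest sample first
        coeffs = a[slot * cells:(slot + 1) * cells]
        y_out = y_in + sum(q * w for q, w in zip(coeffs, window))
        x_out = delay[slot][-1]               # sample delayed by `cells` steps
        delay[slot] = window
        fb_x, fb_y = x_out, y_out             # fed back for the next slot
        if slot == c - 1:
            outputs.append(y_out)             # complete result of the filter
    return outputs


x = list(range(1, 21))
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(time_shared_fir(x, a) == direct_fir(x, a))
```

The control bit is 1 only in the first of every four cycles, which matches the description that the multiplexers transfer input data when the control bit is 1 and enable the feedbacks otherwise.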
Another example is shown in Figure 6 . It is a filter bank with two channels, each of them with six coefficients. A straightforward implementation would require two systolic arrays, each of six cells. We can, however, decompose the original model arrays into four 3-cell subarrays which are schematically shown in Figure 6a . A 4-slow version of one 3-cell subarray can do the work of all four subarrays, Figure 6b . Thus two 6-coefficient systolic convolvers are realized with a single 3-cell systolic array. The transformed algorithm model shown in Figure 6b can now easily be mapped onto the target array of three processing nodes.
One more example is shown in Figure 7. Figure 7a shows schematically three systolic convolvers with six, three, and three coefficients, respectively. The whole system consists again of four 3-cell subarrays whose work can be done by a 4-slow version of one 3-cell subarray (Figure 7b). The clock cycles for which the condition $t \bmod 4 = 0$ or $t \bmod 4 = 1$ holds are used for the tasks of the 6-cell array, and the clock cycles for which $t \bmod 4 = 2$ and $t \bmod 4 = 3$ holds are used for the two 3-cell arrays, respectively. In general, a c-slow version of an N-cell systolic convolution array can be used to concurrently execute the tasks of n systolic arrays (filter channels) with $p_1 N, p_2 N, \ldots, p_n N$ coefficients, respectively, where 
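The assignment of clock cycles to channels in Figures 5 to 7 can be written down as a small scheduling sketch. It rests on two assumptions that go beyond the text: that the slowdown c equals the total number of N-cell subarrays (which is what all three examples use), and that the control bit is 1 exactly in the cycle that starts a new channel. The function names are hypothetical.

```python
def slot_schedule(coeff_counts, cells):
    """Map each slot j (cycles with t mod c == j) to (channel, subarray)."""
    schedule = []
    for channel, count in enumerate(coeff_counts):
        if count % cells != 0:
            raise ValueError("coefficient count must be a multiple of the cell count")
        for subarray in range(count // cells):   # p_i consecutive slots per channel
            schedule.append((channel, subarray))
    return schedule                              # c == len(schedule)


def control_bits(schedule):
    """Assumed rule: 1 (take input data) on the first subarray of a channel."""
    return [1 if subarray == 0 else 0 for _, subarray in schedule]


for bank in ([12], [6, 6], [6, 3, 3]):           # Figures 5, 6 and 7
    plan = slot_schedule(bank, cells=3)
    print(bank, plan, control_bits(plan))
```

All three banks yield four slots for the same 3-cell array. Only the sequence of control bits differs, which is the sense in which the structure of the bank is encoded in the control signals.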
Concluding Remarks
The major features of the mapping technique presented can be summarized as follows:
The regularity of the algorithms is retained. The fixed-size systolic array to be used for a whole filter bank is as regular as the original problem-size dependent systolic array which realizes just one filter.
The additional structures represented in the model by feedbacks and multiplexers do not depend on the size of the filter bank to be implemented. The size of the bank and its particular structure are encoded in the control signals for the multiplexers.
The absence of transfer of intermediate results to the host is retained. Thus minimal communication with the host is guaranteed.
The whole mapping process is well-defined and can be carried out in a highly automatic fashion.
The mapping technique presented above can be used for any multiprocessor system of appropriate structure (ring or linear array with nearest-neighbor communication). The processor nodes to be used should, however, have enough private or communication memory to realize the chains of delay elements that are required to implement c-slow versions. In some processor arrays, for example the Carnegie-Mellon WARP machine, neighboring processors communicate via hardware-supported register chains of programmable length. The mapping technique proposed here is, therefore, especially suitable for such processor arrays. In other cases, such as the Transputer array we use, these data structures should be organized in the local memory of each processor node.
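One common way to organize a chain of delay elements in local memory is a circular buffer, so that a shift costs one read and one write regardless of the chain length. The sketch below is only an illustration of that idea. The class name `RegisterChain` is hypothetical and the code is not taken from the Transputer implementation.

```python
class RegisterChain:
    """Chain of `length` delay elements kept in a fixed block of memory."""

    def __init__(self, length):
        self.cells = [0] * length   # all registers start at 0
        self.head = 0               # index of the oldest stored value

    def shift(self, value):
        """Clock the chain once: store `value`, return the value leaving it."""
        leaving = self.cells[self.head]
        self.cells[self.head] = value
        self.head = (self.head + 1) % len(self.cells)
        return leaving


chain = RegisterChain(4)            # one register of a 4-slow array
print([chain.shift(v) for v in [1, 2, 3, 4, 5, 6]])  # inputs reappear 4 cycles later
```

Changing c then only changes the buffer length, which mirrors the programmable-length register chains mentioned for hardware-supported arrays.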
Currently, a program system is being developed which automatically maps systolic FIR filter banks onto two fixed-size processor arrays: an array of Transputers and the DIRMU reconfigurable multiprocessor kit built at IMMD (III) in Erlangen.
The work presented in this paper focuses on the implementation of systolic algorithms in parallel computers. Note, however, that this mapping technique can also be used for the realization of flexible systolic filter banks in a VLSI chip. The systolic array shown in Figures 5, 6, and 7 can, for instance, be implemented in a VLSI chip. As shown by these figures, one and the same systolic array is used to realize different filter banks. The information on the kind of filter bank to be realized is encoded in the flow of control bits for the multiplexers.
