Systolic array designs are presented for decimation filters with an infinite input signal. The first design is based on an existing design for convolution and is shown to be time optimal with respect to our criteria. Two new designs are derived by reducing the number of processing cells to the theoretical minimum while retaining optimal timing. EDICS number SP 5.1.4.
I. Introduction
There has been a great deal of interest in the use of systolic arrays for signal processing applications, and many designs have been suggested. A survey of designs can be found in [4]. While few systolic designs ever become specific-purpose devices, they provide parallel algorithms that may be directly implemented on general-purpose parallel computers [7], [9]. Decimation filters are a form of convolution, and existing designs for convolution may be used as decimation filters. While such designs are arguably time optimal, the number of processing cells in the array may be reduced to form space-time optimal designs, which produce output in minimal time using minimal processing cells. This is possible due to some specific characteristics of the problem as defined below. Consider the following generalized decimation filter, where the continuous input signal $X_i$ is convolved with the window $H_j$. The output $Y_i$ is
where K is called the window size and M is the decimation factor. This problem can be expressed as a system of recurrence equations:
$h(i, k) = h(i - 1, k)$
with appropriate boundary conditions (see [3]). Equation (2) imposes an ordering on the partial sums for each output variable. The order is $y(i, 0), y(i, 1), \ldots, y(i, K - 1)$. However, it is clear from equation (1) that any ordering is acceptable. An alternative, though equivalent, system of recurrence equations is:
with appropriate boundary conditions (see [3]). Equation (5)

The designs to follow are derived from these two expressions of the problem. Delay is the number of systolic steps for which an output variable remains in the array.

Definition: Timing for a systolic array is delay-minimal when the number of steps required to produce an output variable is minimized.

Definition: A systolic array is delay-minimal if it has delay-minimal timing.

Definition: A systolic array is processor-delay-minimal when it uses as few processors as any delay-minimal design for the given problem.
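As a concrete reference for the problem that the designs must solve, the following is a minimal Python sketch of the decimation filter, computed once as a direct sum and once by accumulating partial sums in an arbitrary order. Equation (1) is not reproduced in this text, so the indexing $Y_i = \sum_{k=0}^{K-1} H_k X_{iM+k}$ is an assumption, and the function names are hypothetical.

```python
def decimate_direct(x, h, M):
    """Direct sum. ASSUMED indexing: Y_i = sum_k h[k] * x[i*M + k]."""
    K = len(h)
    n_out = (len(x) - K) // M + 1  # outputs fully covered by the input
    return [sum(h[k] * x[i * M + k] for k in range(K)) for i in range(n_out)]


def decimate_partial_sums(x, h, M, order=None):
    """Accumulate the K partial sums of each output one at a time.

    `order` is the sequence in which the index k is visited. The
    default 0, 1, ..., K-1 mirrors the ordering the text attributes
    to equation (2); any permutation yields the same sum.
    """
    K = len(h)
    ks = list(range(K)) if order is None else list(order)
    n_out = (len(x) - K) // M + 1
    out = []
    for i in range(n_out):
        y = 0
        for k in ks:
            y = y + h[k] * x[i * M + k]  # one partial sum operation
        out.append(y)
    return out
```

With integer data the two orderings agree exactly; with floating-point data they may differ by rounding error.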
III. Designs
A. Delay-Minimal Timing
We will use the notation $t_w$ to refer to the timing function for the array variable $w$. Since the network is to consist of simple cells, we assume their arithmetic capability consists of only the primitive operation of equation (1), namely one partial sum operation, per step.
There are $K$ partial sums, indexed by $k$, of each output term $Y_i$, and each computation must occur at a different step since the recurrence equations give a chain of partial sum dependencies.
Therefore the delay cannot be less than $K$. We assume the $x$ terms arrive in ascending order at a fixed and predetermined rate of one every step, which implies maximum throughput is one output per $M$ steps. The timing functions 1 schedule partial sums in the order imposed by equations (2)
Fig. 1. Network and cell operations for a delay-minimal design
We can apply the timing 1 and the allocation function $a(i, k) = k$ (see [2]) to the network of Figure 1 to complete the design of a systolic array that computes the decimation filter problem as expressed by equations (2) to (4). The design is delay-minimal since timing 1 is a delay-minimal timing. Table I

The minimum number of processing elements necessary to produce outputs requiring $K$ computations, at a throughput of one output per $M$ steps, is $K/M$. From Table I

The timing 1 can be used with the new network. The allocation function is now $a'(i, k) = (k \text{ div } M)$, and the components, buffering and logic of the processing cells are somewhat more complicated (see [3]). Processing involves a slightly different operation at each step in an $M$-step cycle. At each step, a physical processor actually performs just one of the operations of the $M$ virtual processors it represents. This is possible because each group of $M$ virtual processors performs $M - 1$ unwanted computations at each step. The cell operations could be directly implemented using a buffer through which $x$ values are shuffled one place per step, with the front value output and a new value input. A multiplexor is needed to select the correct $x$ value for the partial sum operation.

A different design is obtained by using the timing 2, the allocation function $a'$ and the new network. The timing function $t'_y$ has the effect of aligning each partial sum operation with the entry of the necessary $x$ variable according to $t'_x$. This means that $x$ is always processed in the step in which it is input. This is significant, as the implementation requires no multiplexing to select the correct $x$ value from the buffer. Therefore the step size is smaller than for the previous design. As the timings are delay-minimal and the number of processing cells is $K/M$, both designs are processor-delay-minimal.
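The grouping of virtual processors under the allocation $k \text{ div } M$ can be checked with a small Python sketch. The timing functions themselves are not reproduced in this text, so the schedule used here, in which partial sum $k$ of output $i$ is computed at step $iM + k$, is an assumption chosen to respect the partial sum order and the one-output-per-$M$-steps throughput. All function names are hypothetical.

```python
from collections import Counter


def schedule(K, M, n_outputs, alloc):
    """Map every partial-sum operation (i, k) to (step, cell)."""
    ops = {}
    for i in range(n_outputs):
        for k in range(K):
            step = i * M + k  # ASSUMED timing, not taken from the paper
            ops[(i, k)] = (step, alloc(i, k, M))
    return ops


def one_cell_per_k(i, k, M):
    """Allocation a(i, k) = k: one cell per window position."""
    return k


def grouped(i, k, M):
    """Allocation a'(i, k) = k div M: M virtual cells per physical cell."""
    return k // M


def max_load(ops):
    """Largest number of operations any cell is given in a single step."""
    return max(Counter(ops.values()).values())


def cells_used(ops):
    """Number of distinct cells that appear in the schedule."""
    return len({cell for _step, cell in ops.values()})


def delay(ops, i, K):
    """Steps between the first and last partial sum of output i, inclusive."""
    steps = [ops[(i, k)][0] for k in range(K)]
    return max(steps) - min(steps) + 1
```

Under these assumptions, with $K = 8$ and $M = 3$, the grouped allocation never assigns two operations to one cell in the same step, keeps the delay at $K$ steps, and uses 3 cells (that is, $K/M$ rounded up) instead of 8.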
