Achieving additional speedup in rank order and stack filter architectures requires the use of parallel processing techniques such as pipelining and block processing. Pipelining is well understood, but few block architectures have been developed for rank order and stack filtering. Block processing is essential when the architecture reaches the throughput limits imposed by the underlying technology. A trivial block structure repeats a single-input, single-output structure to generate a multiple-input, multiple-output structure and can achieve speedups equal to the block size (or the number of multiple outputs). Unlike linear filters, the rank order and stack filter outputs are calculated using comparisons. It is possible to share these comparisons within the block structure and thus substantially reduce the size of the block structure. In this paper we introduce a systematic method for applying block processing to the rank order and stack filters.
Introduction
Rank order filters are non-linear filters which choose an output based on its rank within a window of sample inputs, determined by sorting the inputs. The output of the rank order filter is defined as

$$y(n) = r\text{th largest of } [\,x(n-N_1),\ x(n-N_1+1),\ \ldots,\ x(n),\ \ldots,\ x(n+N_2-1),\ x(n+N_2)\,] \tag{1}$$

where r is the rank of the filter and $W = N_1 + N_2 + 1$ is the size of the input window. Rank order filters can be built with sorting networks [1-3]. In this paper we will concentrate on building efficient block structures for the rank order filter structures using Batcher's odd/even merge sorter [1] and Pitas' running order merge sorter [3], which function using the merge sort algorithm [4]. The merge sort algorithm repeatedly merges pairs of sorted subsets until the entire input set is sorted. The odd/even merge sorter consists of several levels of merge units, built using an NxN odd/even merge unit [1], whose basic processing element is the compare and swap unit. An example of the merge algorithm is shown in Fig. 1(a) and the implementation using compare and swap units is shown in Fig. 1(b).

International Conference on Application-Specific Array Processors

In the running order merge sorter, merge units whose inputs are all separated by the same time steps are mapped onto a single merge unit by using extra memory units. In Pitas' original discussion each level of merging generated a serial output stream. We use a technique similar to Pitas', but generate all the sorted elements in parallel. For example, the merge sort in Fig. 1(a) can be built using a running order merge sorter as shown in Fig. 1(c). Here the two 2x2 mergers are mapped onto a single 2x2 merger and the four 1x1 mergers are replaced by a single 1x1 merger.

Stack filters [5, 6] are an extension of the rank order filtering class. Stack filters can be built as binary filters or by using maximum of minimum structures. In this paper we will generate block structures using the MAX-MIN structure [6]. The MAX-MIN structure is defined as

$$y(n) = \mathrm{MAX}[\,\mathrm{MIN}(S_1),\ \mathrm{MIN}(S_2),\ \ldots,\ \mathrm{MIN}(S_F)\,] \tag{2}$$

where F defines the number of MIN functions, $S_j$ is a subset defined over the input sample window from eq. (1), and j = 1 to F. In this paper we will use MIN (MAX) functions based on the cells shown in Fig. 2 [6].
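As a small illustration of the rank order definition in eq. (1) and of the MAX-MIN form, the sketch below computes a rank order output by direct sorting and checks that a 3-point median can be written as a MAX of pairwise MINs. The function names and the particular choice of subsets are assumptions made for this example; they are not taken from the paper's figures.

```python
from itertools import combinations

def rank_order(x, n, r, n1, n2):
    """r-th largest sample in the window x[n-n1 .. n+n2] (r = 1 is the maximum)."""
    window = x[n - n1 : n + n2 + 1]
    return sorted(window, reverse=True)[r - 1]

def max_min(window, subsets):
    """MAX over j of MIN(S_j); each S_j is a tuple of positions in the window."""
    return max(min(window[i] for i in s) for s in subsets)

# Illustrative subsets (an assumption): all pairs of a 3-sample window,
# which realizes the 3-point median as MAX[MIN(a,b), MIN(a,c), MIN(b,c)].
MEDIAN3_SUBSETS = list(combinations(range(3), 2))

x = [7, 2, 9, 4, 1, 8]
median_at_2 = rank_order(x, n=2, r=2, n1=1, n2=1)   # window is [2, 9, 4]
same_via_max_min = max_min(x[1:4], MEDIAN3_SUBSETS)
```

Both computations return the same value for this window, which is the property the MAX-MIN structure relies on.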
The structures for rank order and stack filters generate a single output for each new input entered into the structure. In many cases, a single output structure may not support the operating speeds required by today's high bandwidth systems. Pipelining and block processing are two of several parallel processing techniques which can be used to exploit the hidden concurrency within an algorithm. Pipelining [7-9] increases the speed of a circuit by overlapping the computations of the data path. Pipelining is limited by the I/O bandwidth and inherent clock speed of the underlying technology. Block processing [8, 10, 11] transforms a single-input single-output system into a multiple-input multiple-output system. Block processing overcomes I/O and clock speed limits by producing multiple outputs at once. With block processing, the sampling rate can be increased by as much as the block factor times the original sample rate of the single-output structure, where the block factor is determined by the number of parallel outputs. However, the size of the block structure increases by as much as the block factor times the original filter size. We introduce a method for applying block processing to rank order and stack filters. This method systematically finds shared substructures within the block structures to efficiently reduce the size of these structures and is an extension of the method introduced in [12].
Block processing can also be used for low power designs [13]. With block processing, the clock period of the block structure can be increased by the block factor to achieve a new structure with the identical sample rate of the sequential structure. Such a structure can be operated at a lower supply voltage and therefore requires less power. By reducing the size of the block structure through shared substructures, the capacitance of the block structure is reduced. Therefore, a block structure with shared substructures requires even less power than the original block structure. In addition, with shared substructures, it is possible to achieve arbitrarily low power architectures, which is not possible with trivial block structures. Of course, these low power designs come at the expense of the corresponding increase in area associated with the block structure.
Block Processing Structures for Rank Order Filtering
The block processing structure for a rank order filter augments the initial input samples in the window with L-1 additional samples to generate L outputs simultaneously, where L is the block factor. The L outputs of a block rank order filter are defined as

$$y(Lk+i) = r\text{th largest of } [\,x(Lk+i-N_1),\ \ldots,\ x(Lk+i),\ \ldots,\ x(Lk+i+N_2)\,], \quad i = 0, 1, \ldots, L-1 \tag{3}$$

The simplest approach to building a rank order filter with this structure is to build a separate sorting network for each output, as shown in Fig. 3(a) for a window of size W = 5 and a block factor of L = 2. This method is far from efficient, and by identifying inputs shared between the multiple outputs, it is possible to reduce the overall size of the block structure. For example, in Fig. 3, four inputs are shared between the two outputs. Instead of using two complete sorting networks, it is possible to first merge the overlapping inputs as shown in Fig. 3(b). The merge blocks can be further reduced by mapping merge units related by time onto a single merge unit as shown in Fig. 3(c), at the cost of extra memory.

It is possible to develop a systematic method to identify these shared input sets, regardless of the window size or the block size. This technique is adapted to the underlying merge sorting algorithm [4] and attempts to maximize the amount of input sharing. The shared input sets are identified by repeatedly recognizing shared pairs of input sets, beginning with pairs of inputs. We first identify shared pairs of single inputs. Then we identify shared pairs of sets containing two inputs, and so on. These shared input sets are used to build shared merging networks.
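The trivial block structure and the overlap between its windows can be sketched as follows. This is a minimal illustration that assumes output i of block k is y(Lk+i) with a window centred as in eq. (1); the helper names are hypothetical.

```python
def block_windows(k, n1, n2, block):
    """Input index ranges for the outputs y(L*k + i), i = 0..L-1 (assumed indexing)."""
    base = block * k
    return [range(base + i - n1, base + i + n2 + 1) for i in range(block)]

def trivial_block_rank_order(x, k, r, n1, n2, block):
    """One independent sort per output: the unshared (trivial) block structure."""
    return [sorted((x[j] for j in w), reverse=True)[r - 1]
            for w in block_windows(k, n1, n2, block)]

def shared_inputs(k, n1, n2, block):
    """Input indices that appear in every window of the block."""
    windows = [set(w) for w in block_windows(k, n1, n2, block)]
    return set.intersection(*windows)

x = [5, 1, 8, 3, 9, 2, 7, 4, 6, 0]
# W = 5 (n1 = n2 = 2), L = 2, median rank r = 3, block index k = 2
outputs = trivial_block_rank_order(x, k=2, r=3, n1=2, n2=2, block=2)
overlap = shared_inputs(k=2, n1=2, n2=2, block=2)
```

For W = 5 and L = 2 the two windows overlap in four samples, which matches the sharing described for Fig. 3.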
Since the shared input sets are built up in the same recursive form as used in the merge sort algorithm, the shared merging networks can be built directly from the common input sets.
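The recursive merge form referred to here can be illustrated with a generator for Batcher's odd/even merge sorting network, where each emitted pair is one compare and swap unit. This is the textbook construction for a power-of-two number of inputs, given as a sketch; it does not include the input sharing or time mapping developed in this paper.

```python
from itertools import product

def oddeven_merge(lo, hi, r):
    """Compare-and-swap pairs merging two sorted halves of positions lo..hi (inclusive)."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)        # merge even-indexed elements
        yield from oddeven_merge(lo + r, hi, step)    # merge odd-indexed elements
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)

def oddeven_merge_sort(lo, hi):
    """Network sorting positions lo..hi; the count hi - lo + 1 must be a power of two."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)

def apply_network(network, values):
    """Run the values through the compare-and-swap units (ascending order)."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

net4 = list(oddeven_merge_sort(0, 3))
net8 = list(oddeven_merge_sort(0, 7))
```

The length of each list is the number of compare and swap units in the corresponding unshared sorter, which is the quantity the sharing method tries to reduce across outputs.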
We first explain this method assuming that the block structure is implemented with a structure similar to Batcher's odd/even merge sorter. We then extend the explanation to support time mapping using a variant of Pitas' method. For each level of merging we find the common input sets. This is done in an iterative manner beginning with single inputs. At each level of merging we identify the shared input sets using the following systematic approach.
First we identify all pairs of input sets which may be shared by more than one output. The next step is to identify the maximum number of pairs of input sets which are common to the most outputs. This is a modified form of the set covering problem [4]. We solve this problem using a greedy heuristic which chooses the pairs of input sets one by one, depending on which one covers the maximum number of outputs. This identifies the shared inputs and therefore the shared mergers at each level in the block structure. This method can be formalized as follows.

Algorithm 2.1 For each level of merging:
1) Identify all pairs of input sets common to more than one output.
2) Correlate the pairs with the outputs.
3) Choose the pair which is common to the most outputs. If more than one pair is common to the most outputs, then choose the first such pair.
4) Remove all pairs with an input in common with the pair chosen in step 3.
5) If there are still some pairs remaining, go to step 3.
6) Replace the inputs corresponding to each output with the sets identified for this level of merging.

Example 2.1 Given $N_1 = 3$, $W = 7$, and $L = 4$. Then

$$y(4k+i) = r\text{th largest of } [\,x(4k+i-3),\ \ldots,\ x(4k+i),\ \ldots,\ x(4k+i+3)\,], \quad i = 0, 1, 2, 3$$
For the first level of merging, there are 25 pairs of single inputs, P1-P25, which are common to more than one output, identified in step 1. These pairs are shown in Fig. 4(a) and are correlated with the outputs, according to step 2, in Fig. 4(b). In Fig. 4(b) there are six input pairs which cover all the outputs. In step 3, input pair P11 is chosen as the first shared pair. Input pair P11 contains inputs x(4k) and x(4k+1). In step 4, all the pairs containing one of these two inputs are removed, as shown in Fig. 4(c). We return to step 3 and next choose pair P20, since it covers the most outputs. We continue in this fashion until we have chosen the input pairs P11, P20, P1, and P25 for the first level of merging. These input pairs are merged and shared among the corresponding outputs. In step 6, we replace the inputs corresponding to the outputs with the sets identified above. Output y(4k) depends on x(4k-3), P1, P11, and P20; output y(4k+1) depends on P1, P11, P20, and x(4k+4); output y(4k+2) depends on x(4k-1), P11, P20, and P25; and output y(4k+3) depends on P11, P20, P25, and x(4k+6).
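The first level of merging in Example 2.1 can be reproduced with a short sketch of the greedy selection. Inputs are written as offsets relative to 4k, and the pairs are numbered in lexicographic order; that numbering is an assumption, although it yields the same labels as the example. The function names are hypothetical, and only the first level (pairs of single inputs) is handled.

```python
from itertools import combinations

def windows(n1, w, block):
    """Offsets (relative to L*k) of the inputs feeding each output y(L*k + i)."""
    return [set(range(i - n1, i - n1 + w)) for i in range(block)]

def shared_pairs(wins):
    """Steps 1-2: pairs common to more than one output, with the outputs they cover."""
    items = sorted(set().union(*wins))
    table = {}
    for pair in combinations(items, 2):       # lexicographic order: P1, P2, ...
        covered = [i for i, win in enumerate(wins) if set(pair) <= win]
        if len(covered) > 1:
            table[pair] = covered
    return table

def greedy_select(table):
    """Steps 3-5: pick the pair covering the most outputs, drop overlapping pairs, repeat."""
    remaining = dict(table)
    chosen = []
    while remaining:
        best = max(remaining, key=lambda p: len(remaining[p]))  # first maximum on ties
        chosen.append(best)
        remaining = {p: c for p, c in remaining.items() if not set(p) & set(best)}
    return chosen

def leftover_inputs(wins, chosen):
    """Step 6: single inputs of each output not absorbed by a chosen pair."""
    result = []
    for win in wins:
        used = set().union(*(set(p) for p in chosen if set(p) <= win))
        result.append(sorted(win - used))
    return result

wins = windows(n1=3, w=7, block=4)
table = shared_pairs(wins)
labels = {pair: idx + 1 for idx, pair in enumerate(table)}
chosen = greedy_select(table)
chosen_labels = [labels[p] for p in chosen]
singles = leftover_inputs(wins, chosen)
```

Here `chosen_labels` lists the selected pairs in the order they are picked, and `singles` gives the one unpaired input left for each output, for example offset -3, that is x(4k-3), for y(4k).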
For the second level of merging, the input sets are identified and correlated with the outputs as shown in Fig. 4(d). For the third level of merging, there are two pairs of shared input sets. We identify and correlate the pairs of input sets as shown in Fig. 4(e). The process of identifying all the input set pairings yields the shared merge structure shown in Fig. 5.

Table 1 contains the number of compare and swap units, the number of memory units, and the number of parallel stages for merging block structures with shared substructures. Block structures with and without time mapping are shown in the table. In both cases, the number of compare and swap units is reduced with input sharing by up to 40%. It is assumed that all the sorted elements are generated with the block structures. If only a single ranked output is necessary, the number of compare and swap units may be reduced [14]. Merge sorting with time mapping requires fewer compare and swap units and more memory units.
Occasionally the shared merge structure requires an additional level of merging and thus the number of stages in the shared structures increases, as shown in Table 1. For a non-pipelined design, the critical path (the longest series of compare and swap units) increases when the stage delays increase. It is essential to pipeline these designs to achieve the same clock periods as the 
Block Structures for MAX/MIN Based Stack Filters
Block processing can also be applied to stack filters and the block structure can be reduced by using shared substructures. The MAX/MIN structure for a block stack filter is described as

$$y(Lk+i) = \mathrm{MAX}[\,\mathrm{MIN}(S_{1i}),\ \mathrm{MIN}(S_{2i}),\ \ldots,\ \mathrm{MIN}(S_{Fi})\,] \tag{4}$$

where i = 0 to L-1, L is the block factor, F defines the number of MIN terms, and $S_{ji}$ is a subset defined over the corresponding input window.
The trivial way to construct a block structure for a stack filter is simply to repeat the MAX/MIN structure L times. Each MIN function requires $|S_{ji}| - 1$ compares. Each MAX function requires F-1 compares. This is not efficient. As in the rank order filters, there are common inputs between consecutive outputs. This implies that it is possible to share comparisons within the MIN functions, to share MIN functions, and to share comparisons within the MAX functions.

The straightforward implementation of the block stack filter would require 28 compares for the MIN and MAX functions. However, it is possible to share compares in several places within the block structure. First, three MIN functions are repeated between the two outputs. It is not necessary to calculate these twice. Second, the remaining MIN functions have five compares in common. Third, the MAX functions have two compares in common. Therefore it is possible to implement this block structure with the MAX/MIN structure shown in Fig. 6, which requires only 17 compares.

To implement the three MIN functions ..., MIN[x(2k), x(2k+1), x(2k+2)], MIN[x(2k), x(2k+1), x(2k+3)], we could simply repeat the cells in Fig. 2 three times. However, as shown in Fig. 6, it is possible to share one compare between the three MIN cells. Therefore, it is possible to implement a single bit cell of the shared MIN functions as shown in Fig. 7. This implementation requires only 28 gates versus the 30 required for the trivial implementation. With more shared comparisons, the gate reduction is higher. In fact, for this implementation there will be a reduction in the number of gates as long as the number of shared comparisons is greater than or equal to three.
Finding all the shared comparisons requires two steps. First, identify the shared MIN functions among the outputs. These shared MIN terms need only be calculated once, and if there are multiple shared terms among the outputs then the MAX of these can be shared among the outputs. Next, identify the shared comparisons within the MIN functions. These also need to be calculated only once and can be shared among the MIN functions. Identifying the shared MIN subfunctions requires a technique similar to the one outlined in Section 2 to identify the common merge subfunctions in the sorting networks. The shared input sets are identified, in a similar manner, by repeatedly recognizing shared input sets. We begin by identifying the pairs of inputs which are common to more than one MIN function. The next step is to identify the maximum number of input pairs which are common to the most MIN functions. This is again a modified form of the set covering problem [4]. Algorithm 2.1 can be used to choose the input pairs that cover the MIN functions. Each level of merging is replaced with a level of MIN functions which operate on a subset of the inputs common across the MIN functions. This identifies the shared comparisons at each level in the block structure.
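The compare counts and the first of the two steps (finding MIN terms that recur across outputs) can be sketched as below. The example filter is hypothetical: a 3-point median with block factor L = 2, written as a MAX of pairwise MINs. It is not the filter of Example 3.1, and the sketch does not cover sharing inside MIN functions or sharing of MAX compares.

```python
def trivial_compares(outputs):
    """Compares when every output's MAX/MIN structure is built independently."""
    total = 0
    for terms in outputs:
        total += sum(len(s) - 1 for s in terms)   # |S_ji| - 1 compares per MIN
        total += len(terms) - 1                   # F - 1 compares per MAX
    return total

def shared_min_terms(outputs):
    """MIN terms (input subsets) that appear in more than one output."""
    counts = {}
    for terms in outputs:
        for s in set(terms):
            counts[s] = counts.get(s, 0) + 1
    return {s for s, c in counts.items() if c > 1}

def compares_with_shared_terms(outputs):
    """Compares when each distinct MIN term is computed only once."""
    unique = set().union(*(set(terms) for terms in outputs))
    mins = sum(len(s) - 1 for s in unique)
    maxes = sum(len(terms) - 1 for terms in outputs)
    return mins + maxes

def median3_terms(start):
    """Hypothetical example: pairwise MIN terms of a 3-sample window at offset start."""
    a, b, c = start, start + 1, start + 2
    return [frozenset({a, b}), frozenset({a, c}), frozenset({b, c})]

# Offsets are relative to 2k: y(2k) uses x(2k-1..2k+1), y(2k+1) uses x(2k..2k+2).
outputs = [median3_terms(-1), median3_terms(0)]
before = trivial_compares(outputs)
shared = shared_min_terms(outputs)
after = compares_with_shared_terms(outputs)
```

In this small case only the term MIN[x(2k), x(2k+1)] is common to both outputs, so a single compare is saved; larger windows and block factors expose more shared terms.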
Implementing the shared comparison structure as in Fig. 7 increases the critical path (the maximum number of gate delays) through the MAX/MIN structure. Ensuring a critical path equivalent to a structure with no sharing requires adding pipeline latches to each level of shared comparisons. The number of pipelining latches is equal to the number of shared input sets. For example, in Fig. 7 one input set is shared across three MIN functions, so a single pipeline latch is required, as shown in Fig. 7.

This technique can be extended to support time mapping as well. First recognize that all compares within the minimum functions may be mapped to a single minimum subfunction if the inputs to the compare unit are separated by mL time steps, where m is a positive integer and L is the block factor. The extra memory needed to store the results of the single compare unit depends on m. Again, to identify the common input sets, it is necessary to associate the inputs separated by mL time steps.

Example 3.2 Given the stack filter described in Example 3.1. The straightforward implementation of the block stack filter required 28 compares for the MIN functions and MAX functions. By sharing minimum and maximum functions it was possible to implement the block structure using only 17 compares, as shown in Fig. 6. By mapping the compare units that can be shared in time, it is possible to implement the MAX/MIN structure using only 13 compares and three additional latches, as shown in Fig. 8.

We have applied these techniques to reduce the number of comparisons for some common stack filters. For our examples we utilize stack filters to build various rank order filters. We have translated these architectures into the MIN/MAX implementation shown in Fig. 7 and calculated the corresponding gate reduction. The results are shown in Table 2 for the MAX/MIN structure. The final two columns show the number of latches necessary for the shared designs.
The first set of latches are those necessary to maintain the critical path length for the shared design as compared to a non-shared design. The last set of latches are those necessary to maintain the critical path length plus those necessary to store the time mapped comparisons.

For low power applications, the supply voltage can be reduced by increasing the clock period. The power, P, of a CMOS circuit depends on the supply voltage and the clock period. With block processing, the clock period of the block structure can be increased by the block factor, while maintaining the identical sample period for both the original and the block structure. Then the block structure can operate at a lower supply voltage and thus a lower power than the original structure. The propagation delay of a CMOS circuit is approximated by

$$T_d = \frac{C_L V_0}{k\,(V_0 - V_t)^2} \tag{5}$$

where $C_L$ is the load capacitance, $V_0$ is the supply voltage, k is dependent on the transistor parameters, and $V_t$ is the CMOS threshold voltage. Typically $V_0$ is 5 volts and $V_t$ is 0.6 volts.
Session 3 Applications I Signal Processing

For a parallel design with block factor L,

$$T_p = L\,T_0 \tag{6}$$

where $T_p$ is the clock period of the block structure and $T_0$ is the clock period of the original structure. By substituting eq. (5) into eq. (6), the supply voltage is reduced by

$$V_p = \beta V_0 \tag{7}$$

where $V_p$ is the supply voltage of the block structure and $V_0$ is the supply voltage of the original structure. $\beta$ is less than one and is dependent on the size of the block factor. A larger block factor generates a smaller $\beta$ value. The power of the original structure is described by $P_0 = C_0 V_0^2 / T_0$, where $C_0$ is the capacitance of the original structure [13]. The power of the block structure is $P_p = C_p V_p^2 / T_p$, where $C_p$ is the capacitance of the block structure. For trivial block structures $C_p = L C_0$, since the block structure grows linearly with the block factor. However, by using our technique to reduce the number of compare and swap units, the capacitance of the block structure does not grow linearly with the block factor. Instead, $C_p = \alpha L C_0$, where $\alpha < 1$ and is due to the shared substructures. Therefore

$$P_p = \frac{\alpha L C_0\,(\beta V_0)^2}{L\,T_0} = \alpha \beta^2 P_0 \tag{8}$$
So by reducing the size of the block structure through shared substructures we can further reduce the power consumption of the block structure over that provided by a block structure with no sharing. For rank order filters, $\alpha$ is the ratio of the number of compare and swap units with sharing to the number of compare and swap units with no sharing. For stack filters, $\alpha$ is the ratio of the number of gates with sharing to the number of gates with no sharing.

The power calculation in Eq. (8) assumes that the supply voltage can be reduced without limit by increasing the block factor. Of course, in reality there is some limit on the reduction of the supply voltage. At some point, even by increasing the block factor, it is not possible to further decrease the supply voltage. If we assume the power supply voltage is limited to some value (we use 1.8 V in our examples), then the power reduction for the trivial block structure will level out as $\beta$ reaches its lower limit. However, with the shared substructure idea, there is no limit on $\alpha$, the ratio of the number of compare and swap units with sharing to the number of compare and swap units with no sharing. Therefore it is possible to obtain arbitrarily low power designs by increasing the block factor and using shared substructures, at the expense of increased area. This is shown in Fig. 9 for the sorting networks with shared substructures. The power reduction possible for the trivial block structures without input sharing (shown as circles in Fig. 9) is independent of the window size. The power reduction for the shared block structures (shown as squares) is dependent on the window size. Therefore we show only a single example for a window of size W = 9. (The number of compare and swap structures for the block structures, without time mapping, are calculated as in Table 1.)
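The dependence of the voltage scaling factor on the block factor is not written out explicitly in the text. As a sketch, assuming the usual first-order CMOS delay model $T = C_L V / (k (V - V_t)^2)$ with the same $C_L$, $k$, and $V_t$ for both structures, equating the block clock period to L times the original clock period gives a quadratic in $\beta$. Of its two roots, only the larger one keeps the scaled supply above the threshold voltage.

```latex
\begin{align*}
\frac{C_L\,\beta V_0}{k\,(\beta V_0 - V_t)^2} &= L\,\frac{C_L V_0}{k\,(V_0 - V_t)^2}
  \;\Longrightarrow\; L\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2, \\
0 &= L V_0^2\,\beta^2 - b\,\beta + L V_t^2,
  \qquad b = 2 L V_0 V_t + (V_0 - V_t)^2, \\
\beta &= \frac{b + \sqrt{b^2 - 4 L^2 V_0^2 V_t^2}}{2 L V_0^2}.
\end{align*}
```

For L = 1 this root reduces to $\beta = 1$, as expected for the original structure.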
The amount of power reduction for the shared block structures varies slightly with the block size. It is possible to generate more shared substructures with an even block size than an odd block size. Therefore the power reduction falls faster with even block sizes. With sharing it is possible to continue reducing the power consumption even after the supply voltage limit is reached by increasing the block factor.
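The levelling-off behaviour can be illustrated numerically. The sketch below assumes the first-order delay model $T = C_L V / (k (V - V_t)^2)$, the typical values of 5 V and 0.6 V, and the 1.8 V supply limit used in the examples; the function names and the sharing ratio passed as `alpha` are hypothetical and are not values from Table 1.

```python
import math

def beta(block, v0=5.0, vt=0.6):
    """Voltage scaling factor: larger root of L*(b*V0 - Vt)**2 == b*(V0 - Vt)**2.

    Follows from setting the block clock period to L times the original one
    under the assumed first-order CMOS delay model.
    """
    a = block * v0 ** 2
    b = 2 * block * v0 * vt + (v0 - vt) ** 2
    c = block * vt ** 2
    return (b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

def power_ratio(block, alpha=1.0, v0=5.0, vt=0.6, v_min=1.8):
    """P_p / P_0 = alpha * (V_p / V_0)**2, with V_p clamped at the supply limit v_min."""
    vp = max(beta(block, v0, vt) * v0, v_min)
    return alpha * (vp / v0) ** 2

# Trivial block structure (alpha = 1) versus a hypothetical shared one (alpha = 0.5).
trivial = [power_ratio(L) for L in (1, 2, 8, 64)]
shared = [power_ratio(L, alpha=0.5) for L in (1, 2, 8, 64)]
```

Once the scaled supply hits the 1.8 V limit, the trivial structure's ratio stays at (1.8/5)^2, while any further reduction in `alpha` continues to lower the shared structure's ratio.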
Conclusion
In this paper we have presented several new parallel structures for rank order and stack filtering. These structures are based on shared substructures which substantially reduce the area requirements for these designs. The block structures can also be used for low power designs. Additional bit-serial designs could be generated as in [2] .
