This paper presents high sample rate semi-systolic array architectures for computing 1-D and 2-D non-recursive and recursive median filters. A high sample rate is obtained by pipelining the computations in each processor. While the non-recursive filters are pipelined by placing latches in the feed-forward paths, the recursive filters are restructured to create additional delays in the feedback paths, and then pipelined using the delays as latches.
Introduction
The median filter is a non-linear digital filter with an interesting combination of properties: it can reduce high-frequency and impulsive noise without destroying edge information. These properties make it a very popular filter in speech processing and image processing schemes.
There are two types of median filters: non-recursive and recursive. In 1-D non-recursive median filtering, a window of size $(2N + 1)$ is moved across the entire signal, and the center sample of each window is replaced by the median of the samples in the window. The idea is that the median of the samples in the window is the best guess for the center sample. Let $W_i$ be a window centered on the $i$th sample, that is, $W_i = \{x_{i-N}, \ldots, x_i, \ldots, x_{i+N}\}$. Then $y_i$ is the median of the samples in $W_i$, $y_i = \mathrm{median}\{W_i\}$. In 2-D median filtering, a window of size $(K \times K)$ is moved across the image, $K = 2N + 1$. Let $W_{i,j}$ be the window centered on $(i, j)$, then $y_{i,j} = \mathrm{median}\{W_{i,j}\}$. In recursive median filtering, the window consists of the recent median samples as well as the input samples.
For instance, the $i$th window for 1-D recursive filtering is $W_i = \{y_{i-N}, \ldots, y_{i-1}, x_i, \ldots, x_{i+N}\}$, and $y_i = \mathrm{median}\{W_i\}$. By including the recent median samples, the features of the input signal can be extracted faster than by non-recursive median filters [8]. Rank order filters are a generalized form of median filters where the output is the element with rank $R$, that is, the output is the $R$th smallest element. Thus a median filter with window size $2N + 1$ is a rank order filter with $R = N + 1$.
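As a point of reference for the definitions above, the following is a minimal software sketch of a 1-D non-recursive rank order filter, with the median filter as the special case $R = N + 1$. The function name `rank_order_filter` and the choice to emit outputs only for positions with a full window are assumptions made for this illustration; the text does not specify boundary handling.

```python
def rank_order_filter(x, N, R=None):
    """Non-recursive rank order filter with window size 2N + 1.

    Returns the R-th smallest sample of each full window.
    R defaults to N + 1, which gives the median filter.
    Assumption: positions without a full window produce no output.
    """
    if R is None:
        R = N + 1
    return [sorted(x[i - N:i + N + 1])[R - 1] for i in range(N, len(x) - N)]


# An isolated impulse is removed, while a step edge is preserved.
impulse = rank_order_filter([0, 0, 100, 0, 0], 1)
edge = rank_order_filter([0, 0, 0, 5, 5, 5], 1)
```

Sorting every window from scratch costs far more work per output than the rank-update schemes discussed in the rest of the paper; the sketch is only meant to pin down the input/output behaviour.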
The existing architectures for median filters can be broadly classified into two classes [1]: ones in which the samples are stored in order of arrival, and others in which samples are stored in sorted order. There are also stack filter based architectures, where the median is obtained by examining the bits of the samples [4]. An excellent survey of the existing architectures is given in [11].
The architectures belonging to the first class (samples stored in order of arrival) are either sorting network based architectures [9, 2] or linear systolic array architectures [7, 6, 3, 10]. While the sorting networks can be pipelined to any level, they consist of at least $O(K \log^2 K)$ compare-swap units for window size $K$, $K = 2N + 1$. The linear systolic array architectures, on the other hand, consist of only $2N + 1$ processors, but have a larger sample period. In the linear array architectures, the sample period is a function of the time taken to update the ranking of the old window based on comparison results with the new and oldest samples. The architectures for the case when samples are stored in sorted order [1, 5] also consist of a linear array of $2N + 1$ processors. Here the sample period is a function of the time taken to update the sorted list by discarding the oldest element and inserting the new element. In [3] we developed array architectures with lower sample periods than existing array implementations for the case when samples are stored in order of arrival as well as when samples are stored in sorted order.
In this paper we extend our previous results [2, 3] and develop high sample rate architectures for 1-D and 2-D non-recursive and recursive median filters. The architectures consist of a semi-systolic array of processors which store the samples in order of arrival. We first describe non-pipelined designs for 1-D and 2-D non-recursive median filters with lower sample periods than existing architectures [6, 7, 10], and then show how to pipeline these designs to any level by placing latches in the feed-forward paths. We next describe a design with low sample period for 1-D recursive median filters. We show how to pipeline the design by restructuring the algorithm to create additional delays in the feedback paths to be used as pipeline latches. In Sections 2, 3, and 4, we describe the architectures and algorithms for 1-D and 2-D non-recursive median filters, and 1-D recursive median filters.
1-D non-recursive median filters
In this section we describe a highly pipelined semi-systolic linear array architecture for 1-D non-recursive median filtering. The algorithm is based on the fact that the ranking of the present window can be obtained by updating the ranking of the old window (rather than re-evaluating it from scratch) [7]. The systolic linear array architectures proposed by Kung [7] and Hwang [6] obtain the ranking of the present window by updating the ranking of the old window twice: once by comparing with the sample being discarded and once by comparing with the new sample. The sequential update of the ranks, and the purely systolic nature of the design, make the sample period of the designs in [6, 7] quite large (see Table I).
We propose an architecture consisting of a semi-systolic linear array of $(2N + 1)$ processors (see Figure 1) which has a lower sample period than the existing architectures [6, 7, 10]. The processors store the samples in the order of arrival; $P_0$ stores the new sample and $P_{2N}$ the oldest sample. Each processor updates its rank based on the comparison results with the new and the discarded samples.
Processor $P_0$ computes the rank of the new sample by counting the number of samples which are smaller than its own. All processors (except $P_0$) consist of (i) a data register $a$ to store the sample value, (ii) a rank register $r$ to store the rank of that sample, (iii) a comparator to compare with the new sample, (iv) a shift register $s$ to store the comparison results with all samples that arrived before it, (v) a rank update unit to update the old rank, and (vi) an output unit to output the median value.
By storing the comparison results with older samples in register $s$, each processor no longer needs to compare with the discarded sample. The comparison result with the discarded sample is in the rightmost cell of register $s$. Figure 2 describes the design of such a processor. Processor $P_0$ consists of registers $a$, $r$, $s$, a carry look-ahead adder tree and an output unit.
We next describe the operations that take place during the computation of the $i$th output. For the $i$th window, the new sample is $x_{new} = x_{i+N}$, and the discarded sample is $x_{old} = x_{i-N-1}$.
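Step 2.1. Update rank: Each processor, except $P_0$, updates its rank register $r$ by $r = r + u - v$, where $u$ is the 1-bit result of the comparison with $x_{new}$ and $v$ is the stored comparison result with the discarded sample $x_{old}$ (the rightmost cell of register $s$).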
Step 2.2. Compute rank of new sample: Processor $P_0$ computes the rank of $x_{new}$ by counting the number of samples that are less than $x_{new}$. This is achieved by computing $r = 1 + \sum_{j=1}^{2N} u(P_j)$ by a carry look-ahead adder tree.
Step 2.3. Update comparisons with other samples: All processors, except $P_0$, shift their register $s$ right. Processor $P_0$ loads the contents of $u(P_i)$ into $s[i]$, $1 \le i \le 2N$.
Step 3. Match rank: Any processor whose rank matches $R = N + 1$ outputs the contents of data register $a$. For rank order filters, $R$ is set to the rank of the filter. Registers $a$, $r$, and $s$ of processor $P_j$ are shifted to processor $P_{j+1}$, $0 \le j < 2N$.
The sample period of our design is given by $T_{sample} = T_{comp}(n) + \max\{T_{add}(b), T_{cla}(2N + 1)\} + T_{comp}(b)$, where $n$ is the number of bits to represent a sample value, $b$ is the number of bits to represent the rank of the sample ($b = \lceil \log(2N + 1) \rceil$), $T_{comp}(s)$ is the time required to compare two $s$-bit numbers, $T_{add}(s)$ is the time required to add two $s$-bit numbers, and $T_{cla}(s)$ is the time required to add $s$ 1-bit numbers using a tree of carry look-ahead adders. Figure 4 describes the above sequence of operations for median computation in an example with window size 5.
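The rank-update procedure above can be mimicked in software. The sketch below is a behavioural model, not a description of the hardware: the function name `rank_update_median`, the dictionary layout, and the handling of the start-up phase (before the window is full) are assumptions of this illustration. The polarity of the comparison bits is also an assumption, since the text does not pin it down: here $u = 1$ means the new sample is strictly smaller than the stored one, the bits kept in $s$ are the complements, and equal values are ordered by arrival so that ranks always form a permutation.

```python
def rank_update_median(x, N):
    """Sliding median of window size 2N + 1 via incremental rank updates."""
    K = 2 * N + 1
    procs = []   # procs[0] models P_0 (newest sample), procs[-1] the oldest
    out = []
    for x_new in x:
        if len(procs) == K:
            procs.pop()                          # discard the oldest sample
            # v: stored comparison with the discarded sample (rightmost cell of s)
            v = [p["s"].pop() for p in procs]
        else:
            v = [0] * len(procs)                 # start-up: nothing discarded yet
        # Step 1: compare every stored sample with the new sample
        u = [1 if x_new < p["a"] else 0 for p in procs]
        # Step 2.1: r = r + u - v
        for p, u_j, v_j in zip(procs, u, v):
            p["r"] += u_j - v_j
        # Step 2.2: rank of the new sample = 1 + number of stored samples below it
        r_new = 1 + sum(1 - u_j for u_j in u)
        # Step 2.3: the new sample records its comparisons with all older samples
        s_new = [1 - u_j for u_j in u]
        procs.insert(0, {"a": x_new, "r": r_new, "s": s_new})
        # Step 3: the processor whose rank equals N + 1 drives the output
        if len(procs) == K:
            out.append(next(p["a"] for p in procs if p["r"] == N + 1))
    return out


medians = rank_update_median([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5], 2)
```

Each new sample triggers one comparison per stored sample and no sorting, which is the property the architecture exploits; the comparison with the discarded sample is never recomputed because it was stored when the younger sample arrived.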
Recently Lucke and Parhi [10] independently proposed an architecture for 1-D median filtering. In their architecture [10], the rank of the new element is computed first, and the ranks of the remaining elements are updated by comparing with the ranks of the new and the oldest elements. In comparison, in our architecture [3], the rank of the new element is computed in parallel with the rank updates of the remaining elements, resulting in a lower sample period. Table I compares the sample period and the hardware requirements of our architecture with those in [10] and [6].
Pipelining: The sample period of our design can be reduced by pipelining the computations in each processor. For example, the sample period can be reduced by a factor of 3 by adding latches in the feed-forward paths as shown in Figure 2. The trade-off is an increase in area (due to the latches) and an increase in latency. Figure 3 describes an alternate implementation which results in fewer pipeline registers. Here the samples of the window are stored in a centralized register bank of size $2N + \lambda$, for $\lambda$ levels of pipelining. Now if in Step 3 of the procedure, processor $P_i$ has rank $N + 1$, then the contents of the $(i + \lambda - 1)$th register are sent to the output bus.
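To make the sample period expression and the effect of pipelining concrete, the following sketch evaluates the 1-D formula for caller-supplied delays. All names (`rank_bits`, `sample_period_1d`, `pipelined_period`) are hypothetical, the logarithm is assumed to be base 2, and the pipelining model is idealised: it assumes perfectly balanced stages plus an optional latch overhead, which a real layout would only approximate. The numbers in the usage lines are arbitrary placeholders, not measured delays.

```python
import math


def rank_bits(N):
    """b = ceil(log2(2N + 1)): bits needed to hold a rank (base 2 assumed)."""
    return math.ceil(math.log2(2 * N + 1))


def sample_period_1d(t_comp_n, t_add_b, t_cla, t_comp_b):
    """T_sample = T_comp(n) + max{T_add(b), T_cla(2N + 1)} + T_comp(b)."""
    return t_comp_n + max(t_add_b, t_cla) + t_comp_b


def pipelined_period(t_sample, levels, t_latch=0.0):
    """Idealised period after pipelining into `levels` balanced stages."""
    return t_sample / levels + t_latch


# Placeholder delays in arbitrary units, for illustration only.
t = sample_period_1d(t_comp_n=4, t_add_b=2, t_cla=3, t_comp_b=2)
t3 = pipelined_period(t, levels=3)
```

The `max` term reflects that the rank update in the other processors and the adder tree in $P_0$ run in parallel, so only the slower of the two contributes to the period.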
2-D non-recursive median filters
The linear array architecture proposed for 1-D median filtering can be easily extended to compute 2-D median filtering as in [6, 10]. The array consists of $K^2$ processors, and a new output is determined every $K$ cycles. In this section we first propose a high-throughput architecture for 2-D median filtering where an output is determined every cycle, and then show how this architecture can be modified for the case when an output is determined every $T$ cycles, $1 \le T \le K$. The area complexity of the high-throughput architecture is $O(K^3 n)$ compared to $O(K^2 \log^2 K \cdot n)$ for odd-even merge sorting network based architectures.
The proposed architecture consists of a $(K \times K)$ array of processors, where processors $P_0$ through $P_{K-1}$ (in the leftmost column) store the $K$ samples of the new column (see Figure 5). Processors $P_K$ through $P_{K^2-1}$ update their ranks based on the comparison results with the $K$ new samples and the $K$ discarded samples, while processors $P_0$ through $P_{K-1}$ compute their ranks by counting the number of samples that are smaller than their sample values. Processors $P_K$ through $P_{K^2-1}$ contain $K$ comparators to compare with the $K$ new samples, data register $a$, rank register $r$, register array $s$ (footnote 2) to store the comparison results with all samples that arrived before it, a rank update unit and an output unit. Processors $P_0$ through $P_{K-1}$ contain $(K - 1)$ comparators, registers $a$ and $r$, register array $s$ of size $(K \times K - 1)$, a carry look-ahead adder tree, and an output unit.
The operations for the computation of the median of a 2-D window are summarized as follows.
Step 1. Compare with $K$ new samples: Each processor, except $P_0$ through $P_{K-1}$, compares the $K$ new samples with the contents of its register $a$, and stores the comparison results in 1-bit registers $\{u_m\}$, $0 \le m \le K - 1$. The cells in the rightmost column of the register array $s$ contain the comparison results with the discarded samples. These are loaded into 1-bit registers $\{v_m\}$.
Step 2.1. Update rank: Each processor, except $P_0$ through $P_{K-1}$, updates its rank register by $r = r + \sum_{m=0}^{K-1} u_m - \sum_{m=0}^{K-1} v_m$.
Step 2.2. Compute rank of $K$ new samples: Each processor $P_m$, $0 \le m < K$, computes the rank of its sample by a tree of carry look-ahead adders, $r_m = 1 + \sum_{i=0,\, i \ne m}^{K^2-1} u_m(P_i)$.
Step 2.3. Update comparisons with other samples: Each processor $P_K$ through $P_{K^2-1}$ shifts its register array $s$ right. Processor $P_m$, $0 \le m < K$, loads the contents of $u_m(P_i)$ into $s$, $K \le i \le K^2 - 1$.
Step 3. Match rank: Any processor whose rank matches $R = \lceil K^2/2 \rceil$ outputs the contents of register $a$. Registers $a$, $r$, and $s$ of processor $P_j$ are shifted to processor $P_{j+K}$, $0 \le j < K^2 - K$.
The sample period of this design is $T_{sample} = T_{comp}(n) + \max\{T_{cla}(2K) + T_{add}(b), T_{cla}(K^2)\} + T_{comp}(b)$. The sample period can be reduced by pipelining the computations in each processor. As in the case of 1-D filtering, pipelining is achieved by placing latches in the feed-forward paths. Figure 6 traces the above sequence of operations for 2-D median computation in an example with a $(3 \times 3)$ window. The new column of samples is $(3, 2, 0)^T$, and the discarded column of samples is $(10, 3, 1)^T$.
Footnote 2: The size of the register array decreases from $(K \times K - 1)$ in the leftmost column of processors to $(K \times 1)$ in the rightmost column of processors.
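The 2-D steps can be modelled in the same spirit as the 1-D case. The sketch below slides a $(K \times K)$ window along one band of rows, one column per step. It is a behavioural illustration under stated assumptions: the name `median_2d_band` is hypothetical, ties are broken by arrival order, the start-up phase is handled by simply accumulating columns, and the comparison with the discarded column is recomputed directly rather than read from a stored register array $s$ (the hardware stores these bits to avoid the extra comparators).

```python
def median_2d_band(img, row, K):
    """Medians of all full K x K windows covering rows row .. row + K - 1."""
    R = (K * K + 1) // 2               # R = ceil(K^2 / 2)
    window = []                        # entries [key, rank], oldest column first
    t = 0                              # arrival counter, used to break ties
    out = []
    for col in range(len(img[0])):
        new = []
        for m in range(K):             # the K samples of the new column
            new.append((img[row + m][col], t))
            t += 1
        old = window[:K] if len(window) == K * K else []
        kept = window[len(old):]
        for e in kept:
            u = sum(n < e[0] for n in new)       # new samples below this one
            v = sum(o[0] < e[0] for o in old)    # discarded samples below this one
            e[1] += u - v                        # Step 2.1
        # Step 2.2: each new sample counts all other samples that lie below it
        fresh = [[n, 1 + sum(e[0] < n for e in kept) + sum(o < n for o in new)]
                 for n in new]
        window = kept + fresh
        if len(window) == K * K:                 # Step 3: match rank
            out.append(next(e[0][0] for e in window if e[1] == R))
    return out


image = [[3, 1, 4, 1, 5],
         [9, 2, 6, 5, 3],
         [5, 8, 9, 7, 9],
         [3, 2, 3, 8, 4]]
band0 = median_2d_band(image, 0, 3)
```

Keys are `(value, arrival)` pairs, so tuple comparison gives a strict total order and exactly one entry holds rank $R$ at any time.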
Area-time trade-off
The above architecture can be generalized for the case when an output is determined every $T$ sample periods, $1 \le T \le K$. The number of processors is still $K^2$; however, in one sample period the ranks of $K/T$ new elements are now determined, and the ranks of $K^2 - K/T$ samples are updated. Each processor now consists of only $K/T$ comparators to compare with the $K/T$ new elements. The number of carry look-ahead adders is also reduced to $K/T$. Thus when the sample period is increased by a factor of $T$, the area of the resulting processor array is reduced to $O(K^3 n / T)$.
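The area figure can be read off as a product of three factors. The short derivation below is a sketch that assumes the comparators dominate the processor area and that an $n$-bit comparator occupies $O(n)$ area; neither assumption is stated explicitly in the text.

```latex
\[
\underbrace{K^2}_{\text{processors}}
\;\times\;
\underbrace{\frac{K}{T}}_{\text{comparators per processor}}
\;\times\;
\underbrace{O(n)}_{\text{area per comparator}}
\;=\;
O\!\left(\frac{K^3 n}{T}\right),
\qquad 1 \le T \le K .
\]
```

Setting $T = 1$ recovers the $O(K^3 n)$ area of the high-throughput design.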
1-D recursive median filters
Architecture without pipelining
The architecture that we propose for 1-D recursive median filtering consists of $(2N + 1)$ processors arranged in 2 rows as shown in Figure 7. The first row stores the $N$ y-samples in processors $Q_0$ through $Q_{N-1}$, and the second row stores the $(N + 1)$ x-samples in processors $P_0$ through $P_N$. The output and the input samples are stored in the order of arrival. In every sample period, 2 new samples are written into the leftmost processors of the array, and 2 samples are discarded from the rightmost processors of the array (see Figure 7). All processors (except $P_0$) update their ranks based on the comparison results with the 2 new and the 2 discarded samples. Processor $P_0$ computes its rank by counting the number of sample values that are smaller than its own. Figure 8 illustrates the design of a single processor. All processors except $P_0$ consist of a comparator, data register $a$, rank register $r$, register $s_y$ to store comparisons with all the y-samples, register $s_x$ to store comparisons with all the x-samples, a rank update unit and an output unit. Processor $P_0$ contains registers $a$, $r$, $s_x$, $s_y$, a carry look-ahead adder tree and an output unit.
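Step 2.1. Update rank: Each processor, except $P_0$, updates its rank register $r$ by $r = r + u_0 + u_1 - v_0 - v_1$, where $u_0$, $u_1$ are the comparison results with the 2 new samples and $v_0$, $v_1$ are the comparison results with the 2 discarded samples.
Step 2.2. Compute rank of new sample: Processor $P_0$ computes its rank by a tree of carry look-ahead adders, $r = 1 + \sum_{i=0}^{N-1} u_0(Q_i) + \sum_{j=1}^{N} u_1(P_j)$.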
Step 3. Match rank: Each processor compares its rank $r$ with $R = N + 1$. If $r > R$, then it sets $u_0 = 1$, otherwise it sets $u_0 = 0$. If $r = R$, then it gates the contents of its register $a$ to the output bus, and the contents of registers $a$, $r$, $s_x$ and $s_y$ to processor $Q_0$. All processors transfer the contents of registers $a$, $r$, $s_x$, $s_y$, $u_0$ to their right adjacent processors.
The sample period of our architecture is given by $T_{sample} = T_{comp}(n) + \max\{T_{cla}(4) + T_{add}(b), T_{cla}(2N + 1)\} + T_{comp}(b)$.
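A behavioural model of the recursive rank update is sketched below. It is an illustration under several assumptions that do not come from the text: the function name `recursive_median_rank_update` is hypothetical, the $N$ past outputs before the start of the signal are initialised to `x[0]`, the initial ranks are computed from scratch, ties are broken by arrival order, and comparisons with discarded samples are recomputed directly instead of being read from $s_x$ and $s_y$. The sketch also compares against the fed-back output directly rather than deriving $u_0$ from the test $r > R$ used in Step 3.

```python
def recursive_median_rank_update(x, N):
    """1-D recursive median, window {y[i-N..i-1], x[i..i+N]}, via rank updates."""
    R = N + 1
    t = 0

    def key(value):
        nonlocal t
        t += 1
        return (value, t)              # (value, arrival) gives a strict total order

    ys = [key(x[0]) for _ in range(N)]         # assumed start-up outputs, oldest first
    xs = [key(v) for v in x[:N + 1]]           # x[0..N], oldest first
    rank = {k: 1 + sum(o < k for o in ys + xs) for k in ys + xs}
    out = []
    for i in range(len(x) - N):
        y_val = next(k[0] for k in ys + xs if rank[k] == R)   # match rank
        out.append(y_val)
        if i + N + 1 >= len(x):
            break
        new = [key(y_val), key(x[i + N + 1])]  # the 2 new samples: y_i and x_{i+N+1}
        old = [ys.pop(0), xs.pop(0)]           # the 2 discarded samples
        kept = ys + xs
        for k in kept:
            u = sum(n < k for n in new)        # u0 + u1
            v = sum(o < k for o in old)        # v0 + v1
            rank[k] += u - v
        for n in new:
            rank[n] = 1 + sum(o < n for o in kept + new)
        for o in old:
            del rank[o]
        ys.append(new[0])
        xs.append(new[1])
    return out


y = recursive_median_rank_update([0, 4, 8, 2, 6, 1], 1)
```

Every sample period touches two incoming and two outgoing samples, which is why the rank update term has four 1-bit operands.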
The recursive median filter architecture recently proposed by Lucke and Parhi [10] has a larger sample period than our architecture. Their algorithm consists of (i) computing the rank of $x_{new}$, (ii) using the rank of $x_{new}$ to compute the rank of $y_{new}$, (iii) updating the ranks of the remaining $2N - 1$ elements by $+2$, $+1$, $0$, $-1$ or $-2$ depending on the ranks of $y_{old}$, $x_{old}$, $y_{new}$, $x_{new}$ and the current rank, and (iv) outputting the median. The complexity of the rank update logic (in Step (iii)) and the sequential computation of Steps (i), (ii) and (iii) make their design more complex than ours. Table II compares the sample period and hardware requirements of both architectures.
Architecture with pipelining
Recursive median filters cannot be pipelined by only adding pipeline latches in the feed-forward paths (as in non-recursive median filters). For instance, if a set of latches is added to the feed-forward path of the processor in Section 4.1, the processor array would implement the function $y_i = \mathrm{median}\{y_{i-N-1}, \ldots, y_{i-2}, x_i, \ldots, x_{i+N}\}$ instead of $y_i = \mathrm{median}\{y_{i-N}, \ldots, y_{i-1}, x_i, \ldots, x_{i+N}\}$.
In this section we describe modifications in the recursive median filtering algorithm in order to incorporate pipelining. We first describe an architecture with 2 levels of pipelining, and then show how this idea can be extended for higher levels of pipelining.
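The difference between the two functions can be seen on a small example. In the sketch below, `extra_delay=0` evaluates the intended recursive filter and `extra_delay=1` evaluates the function that results when the fed-back outputs arrive one sample late. The function name, the parameter `extra_delay`, and the initialisation of outputs before the start of the signal to `x[0]` are assumptions of this illustration.

```python
def recursive_median(x, N, extra_delay=0):
    """y[i] = median{y[i-N-d], ..., y[i-1-d], x[i], ..., x[i+N]}, d = extra_delay."""
    y = []

    def past(k):
        # Outputs before the start of the signal are assumed equal to x[0].
        return y[k] if k >= 0 else x[0]

    for i in range(len(x) - N):
        feedback = [past(i - j - extra_delay) for j in range(1, N + 1)]
        window = feedback + list(x[i:i + N + 1])
        y.append(sorted(window)[N])
    return y


signal = [0, 4, 8, 2, 6, 1]
intended = recursive_median(signal, 1)                 # [0, 4, 4, 4, 4]
latched = recursive_median(signal, 1, extra_delay=1)   # [0, 4, 2, 4, 2]
```

The two outputs diverge as soon as the most recent output differs from the one before it, so the extra latch changes the filter rather than merely delaying it.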
Pipelining to 2 levels
In order to pipeline the recursive median filter design to 2 levels, the algorithm for the computation [...] $y_i$ is fed back to the processor array for the computation of $y_{i+2}$. Figure 9 traces the computations for an example with window size 5, where $\hat{W}_i = \{9, 9, 12, 2, 5, 6\}$, $x_{old} = 10$ and $y_{old} = 9$. In order to reduce the sample period of the proposed architecture by a factor of 2, latches could be added in the rank update unit in all processors (except $P_0$), and in the carry look-ahead adder tree in processor $P_0$.
Pipelining to 3 levels
In order to pipeline the recursive filter to 3 levels, $y_i$ should be a function of the $N$ outputs $y_{i-3}$ through $y_{i-N-2}$. Let $\tilde{W}_i$ be a window of size $(2N + 3)$ such that $\tilde{W}_i = \{y_{i-N-2}, \ldots, y_{i-3}, x_{i-2}, \ldots, x_{i+N}\}$. It can be shown that $y_i = \mathrm{median}\{W_i\}$ is the element of rank $(N + 1)$, $(N + 2)$ or $(N + 3)$ of $\tilde{W}_i$, depending on the ranks of $y_{i-N-2}$, $y_{i-N-1}$, $x_{i-2}$, $x_{i-1}$, $x_{i+N-1}$ and $x_{i+N}$. An implementation of this algorithm results in a complex output unit, making reduction of the sample period by a factor of 3 impossible.
In general, in order to pipeline the recursive filter to $M$ levels, $y_i$ should be a function of $N$ outputs, $y_{i-M}$ through $y_{i-N-M+1}$, and $(N + M)$ inputs, $x_{i-M+1}$ through $x_{i+N}$. For $M > 2$, the match rank step of the algorithm becomes increasingly complex, and the delay in the output unit dominates the delay of the design.
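The index ranges in the general case are easy to get wrong, so the following small helper lists which outputs and inputs enter the restructured window for a given $M$. The function name `pipelined_window_indices` is hypothetical, and the helper only enumerates indices; it does not implement the match rank logic.

```python
def pipelined_window_indices(i, N, M):
    """Indices of the outputs and inputs that y_i depends on for M pipeline levels."""
    y_idx = list(range(i - N - M + 1, i - M + 1))   # y[i-N-M+1] .. y[i-M]
    x_idx = list(range(i - M + 1, i + N + 1))       # x[i-M+1]   .. x[i+N]
    return y_idx, x_idx


# M = 1 is the unpipelined recursive filter; M = 3 gives the window of size 2N + 3.
y1, x1 = pipelined_window_indices(10, 2, 1)
y3, x3 = pipelined_window_indices(10, 2, 3)
```

The window always contains $N$ outputs and $N + M$ inputs, so its size grows as $2N + M$ while the feedback distance grows to $M$ samples.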
Conclusion
In this paper we have described high sample rate array architectures for computing non-recursive and recursive median filters. We have developed non-pipelined architectures with lower sample periods than existing array architectures, and then reduced the sample periods further by pipelining the processors. The non-recursive filters have been pipelined by placing latches in the feed-forward paths. The trade-off was an increase in area (due to pipeline latches) and an increase in latency. The recursive filters could not be pipelined as easily. The recursive algorithm was restructured to create additional delays in the feedback paths, which could be used as pipeline latches. The complexity of the algorithm increased considerably with the increase in the number of levels of pipelining.