This paper presents sorting-network-based architectures for computing non-recursive and recursive median filters. The proposed architectures are highly pipelined and consist of fewer compare-swap units than existing architectures. The reduction in the number of compare-swap units is achieved by minimizing computational overlap between successive outputs, and also by using Batcher's odd-even merge sort (instead of bubble sort). The latency of these networks is reduced by building them with sorting units that sort 2 elements (sort-2) as well as 3 elements (sort-3) in 1 time unit.
Introduction
The median filter is a non-linear digital filter with an interesting combination of properties: it can smooth the signal by removing noise while preserving edge information. These properties make it a very popular filter in speech processing and image processing schemes.
There are two types of median filters: non-recursive and recursive. In non-recursive median filtering, a window is moved along the sampled values of the image, and the center value of each window is replaced by the median of the values in the window. For instance, in 2-D non-recursive median filtering, the $(i, j)$th window of size $(K \times K)$, $W_{i,j}$, is centered at $(i, j)$, and the $(i, j)$th output is $y_{i,j} = \mathrm{median}\{W_{i,j}\}$.

In [...] merging sequences of equal sizes, and as a result cannot be used to efficiently sort windows of all sizes. Batcher's odd-even merge sort, on the other hand, does not have any restriction on the size of the merging sequences, and is highly suitable for sorting windows of all sizes.
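As a point of reference for the definition $y_{i,j} = \mathrm{median}\{W_{i,j}\}$, the following is a minimal software sketch of 2-D non-recursive median filtering. It is not one of the proposed architectures. The function name `median_filter_2d` and the border policy (pixels whose window would leave the image are copied unchanged) are assumptions made only for this illustration.

```python
from statistics import median


def median_filter_2d(image, K):
    """Non-recursive 2-D median filter with a K x K window (K odd).

    Assumption of this sketch: border pixels, whose window would fall
    outside the image, are copied to the output unchanged.
    """
    N = K // 2
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]  # outputs never feed back into inputs
    for i in range(N, rows - N):
        for j in range(N, cols - N):
            # Gather the K*K samples of the window centered at (i, j).
            window = [image[r][c]
                      for r in range(i - N, i + N + 1)
                      for c in range(j - N, j + N + 1)]
            out[i][j] = median(window)
    return out
```

Because every output is computed from the original samples only, each window is sorted independently here; the architectures in this paper aim to avoid exactly that repeated work.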
In this paper we develop sorting-network-based architectures based on Batcher's odd-even merge sort [2] for both non-recursive and recursive median filters. The proposed architectures are highly pipelined and have fewer compare-swap units than existing architectures [9, 11, 3]. We reduce the number of compare-swap units further by minimizing the number of overlapping computations among consecutive windows, and also by processing consecutive windows at the same time. When the number of elements to be sorted, M, is not a power of 2, we reduce the latency and increase the regularity of the sorting network by building it with sorting units that sort 2 elements as well as 3 elements in 1 time unit. In the next two sections, we describe sorting-network-based architectures for non-recursive and recursive median filters.

Footnote 1: A compare-swap unit compares two elements and sorts them in decreasing or increasing order.
Non-recursive median filtering
In this section we describe two sorting network architectures that are based on Batcher's odd-even merge sort [2] for non-recursive median filtering. The first network sorts all the elements of a window from scratch every sample period, while the second network exploits the overlap between consecutive windows, and sorts a subset of the window elements from scratch every sample period.
Batcher's odd-even merge sort: The odd-even merge algorithm to sort $M$ elements [2] consists of sorting the top $\lceil M/2 \rceil$ elements and the bottom $\lfloor M/2 \rfloor$ elements in parallel, in decreasing (or increasing) order, and then merging the two sorted sequences. When $M$ is not a power of 2, the latency of the odd-even merge network can be reduced by introducing sorting units which sort 3 elements in 1 time unit. This method of adding circuitry to reduce latency could be extended to sort any number of elements in 1 time unit. However, the circuitry required to support such an operation would increase drastically with increasing $M$. The number of 2-element comparators in this sorting network is given by

$$C_{sort}(M) = C_{sort}(\lceil M/2 \rceil) + C_{sort}(\lfloor M/2 \rfloor) + C_{merge}(\lceil M/2 \rceil, \lfloor M/2 \rfloor),$$

where $C_{merge}(\lceil M/2 \rceil, \lfloor M/2 \rfloor)$ is the number of comparators required to merge two sorted sequences of sizes $\lceil M/2 \rceil$ and $\lfloor M/2 \rfloor$.

Network 2: This network is based on the fact that in 2-D median filtering, the window shifts by 1 column every sample period, and that two consecutive windows share $(K - 1)$ columns. In this network, the elements of a column are sorted only once, and the sorted column is stored for $(K - 1)$ cycles, so that it can be used for the computation of $K$ consecutive outputs. Thus the $K$ columns of a $(K \times K)$ window are sorted during $K$ different cycles, and the sorted columns are merged using Batcher's odd-even merge sort for every median output.

A $(3 \times 3)$ median filter operating in bit-serial mode, with least significant bit (LSB) first and 8-bit precision, was designed using VTITools on an Apollo workstation. The clocking scheme was 2-phase non-overlapping. The technology was 2 micron twin-tub CMOS. According to the simulations that were run using Berkeley SPICE and VTISim, this network can be clocked at 50 MHz, resulting in a throughput of $6 \times 10^6$ medians/second.
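The recurrence for $C_{sort}(M)$ can be explored with a small software model. The sketch below follows the textbook formulation of odd-even merging for sequences of arbitrary lengths and tallies 2-element compare-swaps. The function names and the `count` tally are assumptions of this illustration. It models neither the sort-3 units nor any pruning of comparators that do not contribute to the median, so its counts describe a full sort only.

```python
def odd_even_merge(a, b, count):
    """Merge two increasing sequences of arbitrary lengths.

    count is a one-element list; count[0] tallies compare-swap operations.
    """
    if not a:
        return list(b)
    if not b:
        return list(a)
    if len(a) == 1 and len(b) == 1:
        count[0] += 1
        return [min(a[0], b[0]), max(a[0], b[0])]
    v = odd_even_merge(a[0::2], b[0::2], count)  # 1st, 3rd, ... elements
    w = odd_even_merge(a[1::2], b[1::2], count)  # 2nd, 4th, ... elements
    out = [v[0]]
    for i, wi in enumerate(w):
        if i + 1 < len(v):
            # Final rank of compare-swaps: w[i] against v[i + 1].
            count[0] += 1
            out += [min(wi, v[i + 1]), max(wi, v[i + 1])]
        else:
            out.append(wi)  # no partner when both inputs have even length
    out += v[len(w) + 1:]   # leftover element when both inputs have odd length
    return out


def odd_even_merge_sort(x, count):
    """Sort x by splitting into ceil(M/2) and floor(M/2) elements, then merging."""
    M = len(x)
    if M <= 1:
        return list(x)
    half = (M + 1) // 2
    top = odd_even_merge_sort(x[:half], count)
    bottom = odd_even_merge_sort(x[half:], count)
    return odd_even_merge(top, bottom, count)


def comparator_count(M):
    """Number of compare-swaps used to sort M elements (data independent)."""
    count = [0]
    odd_even_merge_sort(list(range(M)), count)
    return count[0]
```

The tally depends only on the sequence lengths, never on the data, which is what allows the same sequence of operations to be laid out as a fixed network of compare-swap units.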
Block processing
The number of comparators per output can be reduced even more by processing consecutive windows at the same time. This is because consecutive windows have common elements, so if they are processed simultaneously, the common elements need to be sorted only once. Processing multiple outputs at the same time (block processing) has many advantages, including a reduction in power dissipation per output (see footnote 2) and fast processing of stored signals. In this section, we first develop efficient ways of block processing 1-D median filters, and then show how this can be extended to 2-D filters.
In [10], Lucke and Parhi developed a block processing algorithm which identifies the elements that are common to all the outputs, and also identifies pairs of elements that are common to more than one output. The common elements are presorted and then merged to generate all the block outputs. Here we propose a new algorithm for 1-D block processing that is based on merging sequences of approximately equal size in every stage. Our algorithm differs from [10] in the way the common elements are identified, as well as in the way the sorted sequences are merged. Our algorithm is summarized as follows. Let the 1-D window be of size K = 2N + 1, and the block be of size B.
Step 1: Identify disjoint groups of N + 1 consecutive elements which occur with the maximum frequency. Sort these groups of N + 1 elements.
This involves (i) creating groups of (N + 1) consecutive elements from the list of B outputs, and listing the frequency of occurrence of each group; (ii) forming sets of such groups, such that the groups in a set do not contain any common elements and occur with high frequency; (iii) choosing the set which contains the minimum number of groups and, in case of conflict, choosing the set whose groups occur approximately the same number of times; and (iv) sorting the (N + 1) elements of a group by odd-even merge sort as described in Section 2.
Step 2: Sort the remaining N elements per output.
The remaining N elements for every output are sorted by applying this algorithm recursively.
Step 3: Merge sorted sequences of sizes N + 1 elements (from Step 1) and N elements (from Step 2).
Since we are interested only in the median, this merge unit is modified as described in Section 2.
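The three steps can be illustrated in software for one block of outputs. The sketch below assumes that a single shared group S of N + 1 indices, contained in every window of the block, has already been chosen in Step 1. It uses Python's built-in `sorted` and `heapq.merge` as stand-ins for the sorting and merging networks, and it sorts the remaining N elements directly rather than by applying the algorithm recursively. The function name `block_medians` and the 1-based index dictionary `x` are assumptions of this illustration.

```python
from heapq import merge


def block_medians(x, K, B, shared):
    """Compute B consecutive 1-D medians that share one presorted group.

    x maps a 1-based sample index to its value, K = 2N + 1 is the window
    size, and shared lists the N + 1 indices of the common group S.
    """
    N = K // 2
    sorted_shared = sorted(x[i] for i in shared)       # Step 1: sorted once
    medians, remaining_groups = [], []
    for b in range(B):
        window = range(b + 1, b + K + 1)               # window of output b + 1
        rest = [i for i in window if i not in shared]  # Step 2: N leftovers
        remaining_groups.append(rest)
        sorted_rest = sorted(x[i] for i in rest)
        merged = list(merge(sorted_shared, sorted_rest))  # Step 3
        medians.append(merged[N])                      # middle of K elements
    return medians, remaining_groups
```

For K = 9, B = 3 and S = {x4, ..., x8}, the remaining index groups returned are [1, 2, 3, 9], [2, 3, 9, 10] and [3, 9, 10, 11]. The full merge in Step 3 is kept for clarity; a median-only merge unit would produce only the middle element.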
The above algorithm is used to generate a network that computes multiple median outputs, referred to as the multiple-median network. The number of comparators in such a network can be reduced by identifying common elements between the $(N + 1)$ elements of a group in Step 1 and the $N$ elements of a group in Step 2. For instance, if $\{x_3, x_4, x_5, x_6\}$ is one of the groups in Step 1, and $\{x_2, x_3, x_4\}$ is one of the groups in Step 2, then $\{x_3, x_4\}$, which is common to both groups, is presorted and then merged once with $\{x_5, x_6\}$ and once with $\{x_2\}$ to generate the two sorted groups.

Another way of reducing the number of comparators in the multiple-median network is to identify groups of elements whose indices differ by $B$ units, and to sort only the groups with larger indices. For instance, if the multiple-median network has groups $\{x_i, x_{i+1}\}$ and $\{x_{i+B}, x_{i+B+1}\}$, then $\{x_i, x_{i+1}\}$ need not be sorted, since the sorted sequence corresponding to $\{x_i, x_{i+1}\}$ is the sorted sequence corresponding to $\{x_{i+B}, x_{i+B+1}\}$ delayed by $B$ units. Figure 3 describes the derivation of a multiple-median network when $K = 9$ and $B = 3$. The number of sort-2 units is 48, compared to the $22 \times 3 = 66$ units that would be required if there were no block processing.

Figure 3: Multiple-median network for $K = 9$ and $B = 3$, with outputs $y_1$, $y_2$, $y_3$.
Step 1: Groups $\{x_4, x_5, x_6, x_7, x_8\}$ and $\{x_5, x_6, x_7, x_8, x_9\}$ occur with maximum frequency. Choose $S = \{x_4, x_5, x_6, x_7, x_8\}$.
Step 2: Remaining elements: $\{x_1, x_2, x_3, x_9\}$, $\{x_2, x_3, x_9, x_{10}\}$, $\{x_3, x_9, x_{10}, x_{11}\}$. Remaining elements: $\{x_1, x_2\}$, $\{x_2, x_{10}\}$, $\{x_{10}, x_{11}\}$. Note that the sorted sequence corresponding to $\{x_1, x_2\}$ is obtained by delaying the sorted sequence corresponding to $\{x_4, x_5\}$ by $B = 3$ cycles.

Footnote 2: In block processing, if $B$ outputs are computed every cycle, then the power dissipation is proportional to the number of comparators in the network, since all the comparators are active. If $B$ outputs are computed every $B$ cycles, then the power dissipation is proportional to the number of comparators that are active in every $B$th stage of the network (the remaining comparators are deactivated). In both cases, the power dissipation per output is less, at the expense of a larger number of comparators on chip.

Design of insertion stage $S_j$: In insertion stage $S_j$, $1 \le j \le N$, the output value $y$ is inserted in a sorted sequence $u_{j,1}, \ldots, u_{j,N+j}$ to create another sorted sequence $u_{j+1,1}, \ldots, u_{j+1,N+j+1}$. The hardware consists of $(N + j)$ comparators followed by output logic units, as shown in Figure 5b.
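The insertion stage $S_j$ can be modeled in software as one rank of parallel comparators followed by per-position selection logic. The sketch below is only one plausible reading: the increasing sort order, the function name `insertion_stage`, and the exact selection rule used for the output logic are assumptions of this illustration, since the text refers to Figure 5b for the actual design.

```python
def insertion_stage(u, y):
    """Insert y into the increasing sequence u, returning len(u) + 1 values.

    One comparator per element of u produces a flag; each output position
    then selects u[k], y, or u[k - 1] from the flags of its two neighbours.
    """
    # Comparator rank: flags[k] is True when u[k] <= y (all in parallel).
    flags = [uk <= y for uk in u]
    out = []
    for k in range(len(u) + 1):
        here = flags[k] if k < len(u) else False  # u[k] stays in place
        left = flags[k - 1] if k > 0 else True    # everything before k is <= y
        if here:
            out.append(u[k])
        elif left:
            out.append(y)          # first position whose element exceeds y
        else:
            out.append(u[k - 1])   # elements above y shift up by one
    return out
```

Since u is sorted, the flags form a run of True values followed by a run of False values, so exactly one output position selects y. With `len(u) == N + j`, the model uses N + j comparisons, matching the comparator count stated for stage $S_j$.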
