This paper presents an area-efficient IC for high-throughput median filtering applications. The IC implements a modified delete-and-insert sorting algorithm that is very efficient for running order-statistics applications. In the hardware design, we first map the algorithm onto a regular PE structure, where each PE consists of a shift register, a comparator, and some control gates. We then carry out a full-custom circuit and layout design of the PE to meet the performance requirement. A prototype chip for 64 input samples has been implemented and tested. Results show that a clock rate of up to 50 MHz can be achieved using a 1.2 μm double-metal CMOS technology. Two outstanding features of this IC are: (1) any specified order of the input patterns can be produced within one clock cycle; (2) each chip can handle at most 64 data and can be cascaded when the number of data to be sorted exceeds 64. This IC thus relieves the bottleneck of median search in hardware realization for many system designs, making real-time performance achievable.
INTRODUCTION
Median filtering is used in many consumer-electronics applications, such as those in [1,2]. In the past, several algorithms and hardware designs [3] were reported to overcome the complexity of median search, such as bubble sort, quick-sort, etc. However, these solutions become impractical when real-time performance and hardware cost are considered simultaneously as the amount of data increases. In this paper, we present an optimal hardware solution for median search whose complexity increases only linearly with the number of input data. Moreover, any specified order of data can be obtained right after the data set has been input.
We briefly discuss the modified delete-and-insert sorting algorithm in section 2, focusing on the transformation that yields a parallel architecture. Section 3 describes the VLSI architecture and circuit design of this median filtering IC. Evaluation and some discussion of this high-throughput IC are given in section 4.
THE MODIFIED DELETE-AND-INSERT ALGORITHM
Median search is a special case of sorting: once the input samples have been sorted into a given sequence, any specified order within the data set can be obtained. The sorted sequence can be produced by many available sorting algorithms; among them, the delete-and-insert algorithm is selected. As shown in Figure 1, the delete and insert operations can be regarded as a set of shift operations working on the sorted data sequence. If a descending sequence is assumed, the delete and insert operations act in the following way: (1) for deletion: find the position of the input sample and shift left those sorted data which are less than the input sample, i.e. the identified data item is replaced by its right neighbour, and hence the delete operation is done; (2) for insertion: first find the position of the inserted item, i.e. shift right those sorted data which are less than the input sample, then load the input sample into the vacated position. The sorting process can thus be replaced by a set of shift operations that are conditionally performed according to the contents of the sorted data items and the input sample. This implies that parallel compare and shift operations can be exploited to speed up performance.
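The delete and insert operations described above can be illustrated with a short sequential sketch in Python. This is only a software model under stated assumptions, not the circuit itself: the function names (`delete_sample`, `insert_sample`, `running_medians`), the FIFO used to track the oldest sample, and the choice of the middle index as the median are all assumptions introduced here for illustration.

```python
from collections import deque


def delete_sample(sorted_desc, sample):
    """Delete one occurrence of `sample` by shifting left everything to its right."""
    pos = sorted_desc.index(sample)              # find the position of the item
    for i in range(pos, len(sorted_desc) - 1):
        sorted_desc[i] = sorted_desc[i + 1]      # replace item by its right neighbour
    sorted_desc.pop()                            # rightmost slot is now vacant


def insert_sample(sorted_desc, sample):
    """Insert `sample`, shifting right every stored item that is less than it."""
    sorted_desc.append(None)                     # open one vacant slot at the right end
    i = len(sorted_desc) - 1
    while i > 0 and sorted_desc[i - 1] < sample:
        sorted_desc[i] = sorted_desc[i - 1]      # shift right
        i -= 1
    sorted_desc[i] = sample                      # load the sample into the vacated slot


def running_medians(samples, window):
    """Running median over a sliding window (assumed usage, odd `window`)."""
    fifo = deque()       # arrival order, so the oldest sample can be deleted
    sorted_desc = []     # window contents in descending order
    medians = []
    for x in samples:
        if len(fifo) == window:
            delete_sample(sorted_desc, fifo.popleft())
        fifo.append(x)
        insert_sample(sorted_desc, x)
        if len(fifo) == window:
            medians.append(sorted_desc[window // 2])
    return medians
```

For example, `running_medians([5, 1, 4, 2, 3], 3)` visits the windows (5, 1, 4), (1, 4, 2) and (4, 2, 3). Every shift in the two loops depends only on a comparison between a stored item and the input sample, which is what allows the hardware to perform all of them in parallel.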
ARCHITECTURE AND CIRCUIT DESIGN FOR THE ODI CHIP
This area-efficient solution is based on the optimized delete-and-insert (ODI) sorting algorithm described above, where a pre-shift strategy is exploited to speed up performance.
As already shown in Figure 1, the insert and delete operations can be regarded as a set of shift-right and shift-left operations, where extra load operations are needed to place the input pattern in the exact position. Figure 2 shows how the ODI algorithm is mapped onto a regular PE-based architecture, where each PE mainly consists of one sort-cell, one comparator, and some control gates. Figure 3 shows the circuit design of each shift register cell and the shift-left operations that implement the delete operation. The number of PEs is determined by the maximum amount of data to be processed (this test chip includes 64 PEs). However, the design can be cascaded when the number of input patterns exceeds the specified maximum. This makes the design more flexible and capable of handling different applications.
In addition to the above-mentioned hardware, we have also included a data selection unit to meet the goal of selecting any specified order. This unit exploits a dynamic big-OR circuit to meet the specified timing constraint. As a result, the achievable clock rate can be up to 50 MHz based on a 1.2 μm double-metal CMOS technology. The complete chip plot of this IC is shown in Figure 4. In addition, we have cascaded two chips to test the performance. The results remain the same as those obtained from each single chip.
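A behavioural sketch in Python may help to show how one comparator per PE is enough to steer the shifts. It is a simplified model built on assumptions, not a description of the actual circuit: the names `EMPTY`, `pe_insert` and `pe_delete` are hypothetical, empty cells are assumed to compare as smaller than any input, the deleted sample is assumed to be present in the array, and the pre-shift strategy of the ODI algorithm is not modelled. Each call computes the next state of every cell from the current state only, as a clocked register array would.

```python
EMPTY = None  # hypothetical marker for a cell that holds no sample yet


def pe_insert(cells, x):
    """One insert step on a descending PE array; returns the next cell contents."""
    # Comparator outputs: every PE compares its own cell with the broadcast input.
    less = [c is EMPTY or c < x for c in cells]
    nxt = []
    for i, c in enumerate(cells):
        if not less[i]:
            nxt.append(c)              # hold: stored value is not less than the input
        elif i == 0 or not less[i - 1]:
            nxt.append(x)              # load: first cell whose value is less than the input
        else:
            nxt.append(cells[i - 1])   # shift right: take the left neighbour's value
    return nxt                         # if the array was full, the last value is shifted out


def pe_delete(cells, x):
    """One delete step; assumes `x` is currently stored in the array."""
    # Flag the cell holding x and every occupied cell to its right.
    flagged = [c is not EMPTY and c <= x for c in cells]
    nxt = []
    for i, c in enumerate(cells):
        if not flagged[i]:
            nxt.append(c)              # hold
        elif i + 1 < len(cells):
            nxt.append(cells[i + 1])   # shift left: take the right neighbour's value
        else:
            nxt.append(EMPTY)          # last cell has no right neighbour
    return nxt
```

In this model the control decision of each PE depends only on its own comparator output and that of one neighbour, so the amount of logic grows linearly with the number of PEs.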
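The role of the wide OR in the data selection unit can be pictured with the following Python sketch. It shows one plausible reading only: the function name `select_rank`, the one-hot select lines, and the 8-bit word width are assumptions made for illustration, since the text does not specify how the selection unit is organized or how wide the data words are.

```python
def select_rank(cells, rank, width=8):
    """Return the value stored at position `rank` of the sorted cells.

    Every cell is gated by its own select line and all gated words are
    combined by one wide OR, so no chain of multiplexers is needed.
    Values are assumed to be non-negative integers that fit in `width` bits.
    """
    all_ones = (1 << width) - 1
    bus = 0
    for i, value in enumerate(cells):
        select = all_ones if i == rank else 0   # one-hot select, replicated per bit
        bus |= value & select                   # contribution to the wide OR
    return bus
```

With the cells sorted in descending order, `select_rank(cells, len(cells) // 2)` would return the median of an odd-sized data set, and any other rank can be read out in the same way.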
EVALUATION AND DISCUSSIONS
It should be noted that dynamic circuit design is exploited in the data selection unit to speed up performance. If static design is used, then the target performance cannot be reached.
CONCLUSION
In conclusion, this high-performance IC can meet both the high-speed data-sorting and the real-time median search requirements of many image/video applications. However, the area of each PE can be reduced if communication and layout are considered together. This will be done in follow-up designs, where we have found that this high-speed sorter can be exploited to enhance system performance.
