The median filter plays an important role in image processing for noise removal. To increase operating speed, hardware implementations of the median filter on field programmable gate arrays (FPGA) or application specific ICs (ASIC) are necessary. This paper presents a low-cost, high-throughput hardware design for a two-dimensional (2D) median filter. Two techniques are employed to reduce circuit area. One is the parallel three-valued sorting method, which reduces the number of pipelined registers at constant speed; the other is the functional sharing method, which implements median value sorting with fewer comparisons. The proposed 2D median filter is implemented in two ways: synthesized by Synopsys Design Compiler with the TSMC 90nm cell library (ASIC design), and synthesized for the Xilinx Virtex7 XC7VX690T FPGA (FPGA design). Experimental results show that the area cost of the proposed design is reduced by more than 30% on average in comparison to previous designs.
I. INTRODUCTION
The median filter plays an important role in image processing for noise removal [1]-[4] and upsampling [5]. The goal of a median filter is to find the ((M+1)/2)th smallest value among M candidate values, where M is odd. To achieve higher computation speed and meet real-time requirements, hardware implementations of median filters on field programmable gate arrays (FPGA) or application specific ICs (ASIC) are necessary [6]-[11].
Previous median filter architectures can be divided into two types: single-input designs [6], [7] and multiple-input designs [8]-[11]. A single-input median filter processes one input value in each cycle. In [6], Chen et al. proposed a sorting-based, area-efficient 1D architecture. Inspired by [6], Nikahd et al. [7] proposed a high-speed architecture whose operating speed remains almost the same as the window size of the median filter grows. In contrast, multiple-input designs can update multiple input values and produce a median value every cycle. Thus, when filtering an image with an N×N mask, a multiple-input design completes all operations in a shorter time. Smith and John [8] first proposed a streamlined systolic array for finding the median value of a 3×3 mask. Inspired by [8], J. Subramaniam et al. proposed a median-finding method [9] with fewer pipelined stages.
The associate editor coordinating the review of this manuscript and approving it for publication was Yuhao Liu.
When median filtering an image with a 3×3 mask, some input values processed in the current cycle are reused in the next cycle. By exploiting this reuse property, Vega-Rodríguez et al. [10] proposed a high-throughput 2D median filter architecture that produces four median values in each cycle. Inspired by [10], J. Subramaniam et al. [11] proposed a 2D median filter architecture with fewer accesses.
However, these 2D median filters [10], [11] need a large number of pipelined registers and many comparison operations to complete 2D image median filtering. As the bit-width of the input values increases, the area cost grows considerably. Therefore, it is essential to design a 2D median filter that not only provides high throughput and few accesses but also meets a low area-cost requirement.
This paper proposes a low-cost design for a 2D median filter. In our design, a special parallel sorting module reduces the number of pipelined registers, and redundant comparison operations are removed by adopting the functional sharing concept. According to the experimental results, our architecture reduces area cost by more than 30% on average compared with the previous designs [8]-[11] at the same operating speed. The remainder of this paper is organized as follows. Section II describes the proposed algorithm in detail. The hardware architecture of the proposed 2D median filter is presented in Section III. The experimental results and comparisons are shown in Section IV, and the conclusions are provided in Section V.
II. PROPOSED 2D MEDIAN FILTER ALGORITHM
This section has two parts. Part A briefly reviews the algorithm of the latest design [11]. Part B describes the proposed algorithm in detail.
A. AN OVERVIEW OF MEDIAN FILTERING ALGORITHM [11]
The algorithm proposed in [11] is designed to filter images with a 3×3 filtering mask. The scheme is shown in Fig. 1. The input values of every cycle are two column vectors of four values each ($I_0$ to $I_3$ and $I_4$ to $I_7$). The eight input values are divided into four parts and each part is sorted separately. All of them are stored in the first two columns of the register array (Col.1 and Col.2). In the next cycle, the values in Col.1 and Col.2 are transferred to Col.3 and Col.4, respectively. The register array can be divided into four filtering masks, M1, M2, M3 and M4. Each mask produces its own median value independently, so the design produces four median values in each cycle after an initial delay.
Fig. 2 shows the hardware algorithm of the 9-stage median filter proposed in [11]. It contains three types of modules: the three-valued sorting module (TVS$_i$), the minimum searching module (MINS$_i$) and the maximum searching module (MAXS$_i$). The designs of TVS and MINS/MAXS are shown in Fig. 3 and Fig. 4, respectively. To realize the whole median filtering procedure, a total of 12 TVS, 4 MAXS and 4 MINS modules are employed. As shown in Fig. 3, TVS is composed of three Compare-and-Swap (CAS) operations, each implemented with one comparator. To achieve a better operating speed, pipelined registers are inserted between the CAS units, so TVS has three pipelined stages in total. Similarly, the MAXS/MINS module contains two CAS operations and requires two pipelined stages, as shown in Fig. 4.
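The module structure described above can be modelled in software as a minimal sketch. The function names `cas`, `tvs`, `maxs` and `mins` are hypothetical, and the order in which the value pairs are compared is an assumption, since Fig. 3 and Fig. 4 are not reproduced here; only the counts (three CAS per TVS, two CAS per MAXS/MINS) come from the text.

```python
def cas(a, b):
    """Compare-and-swap: one comparison, returns (smaller, larger)."""
    return (b, a) if a > b else (a, b)

def tvs(a, b, c):
    """Three-valued sort from three chained CAS steps, one per pipeline stage.

    The pairing order below is an assumption; any valid 3-input sorting
    network uses the same number of CAS operations.
    """
    a, b = cas(a, b)  # stage 1
    b, c = cas(b, c)  # stage 2: c now holds the highest value
    a, b = cas(a, b)  # stage 3: a holds the lowest, b the median
    return a, b, c    # (lowest, median, highest)

def maxs(a, b, c):
    """Maximum search from two chained CAS steps."""
    _, m = cas(a, b)
    _, m = cas(m, c)
    return m

def mins(a, b, c):
    """Minimum search from two chained CAS steps."""
    m, _ = cas(a, b)
    m, _ = cas(m, c)
    return m
```

With 12 TVS, 4 MAXS and 4 MINS modules, this model accounts for 12 × 3 + 4 × 2 + 4 × 2 = 52 comparators.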
B. THE PROPOSED ALGORITHM
Our study of the filter proposed in [11] identified two main drawbacks. First, the pipeline has 9 stages, so many pipelined registers and a long latency are required. Second, the algorithm contains many redundant comparison operations. For example, the comparison of $I_1$ and $I_2$ is performed in both TVS$_1$ and TVS$_2$, as shown in Fig. 2. To address these drawbacks, we propose a new median-finding algorithm for a low-cost, high-throughput 2D median filter with parallel sorting and functional sharing.
As Fig. 3 shows, three pipelined stages are required to perform three-valued sorting. In this paper, a parallel three-valued sorting module (PTVS), shown in Fig. 5, is proposed to reduce the number of pipelined stages from three to two at the same operating speed. Let N represent the number of input values to be sorted; then the total number of possible sorting results, denoted T, is given by

$$T = C^{N}_{1} \times C^{N-1}_{1} \times \cdots \times C^{1}_{1} = N! \tag{1}$$
When performing three-valued sorting, N is 3 and the total number of cases is $C^{3}_{1} \times C^{2}_{1} \times C^{1}_{1} = 6$. As shown in Fig. 5, three comparison operations, $D_i > D_{i+1}$, $D_{i+1} > D_{i+2}$ and $D_i > D_{i+2}$, are performed in parallel. The sorting results and the corresponding comparison conditions of the six cases are listed in TABLE 1, where the one-bit conditional signal $s_i$ is set to 1 if its corresponding condition is true. Hence, the highest, median and lowest results are given by Eq. (2) to (4), respectively. Finally, a transformed circuit (denoted TC) is developed to realize Eq. (2) to (4). PTVS therefore requires 2 pipelined stages to complete its function, as shown in Fig. 5.
Another special design in our filter is functional sharing. In Fig. 2, several comparisons are repeated during processing, such as $I_1$ and $I_2$ in TVS$_1$ and TVS$_2$, and $I_5$ and $I_6$ in TVS$_3$ and TVS$_4$. The goal of functional sharing is to merge two PTVS modules that contain a repeated comparison into a single module that performs both operations simultaneously. As shown in Fig. 6, this merged module, denoted dual parallel three-valued sorting (DPTVS), processes four input values and outputs the sorting results of both the first three and the last three values. The redundant comparison operations are thereby removed. Similarly, two MINS/MAXS modules that contain a repeated comparison can be merged into a single module (DMINS/DMAXS), as shown in Fig. 7. The proposed 2D median filter with parallel sorting and functional sharing is shown in Fig. 8. In total, 4 DPTVS, 2 DMAXS, 2 DMINS and 4 PTVS modules are used.
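The parallel three-valued sort can be sketched in software as follows. The three comparisons are the ones named in the text; the selection logic is one possible realization written for illustration and is an assumption, since Eq. (2) to (4) and TABLE 1 are not reproduced here. The names `ptvs` and `ptvs_matches_sorted` are hypothetical.

```python
from itertools import product

def ptvs(d0, d1, d2):
    """Parallel three-valued sort: three independent comparisons, then selection.

    Returns (highest, median, lowest). The selection expressions are an
    illustrative assumption, not the paper's Eq. (2) to (4).
    """
    # Stage 1: the three comparisons do not depend on each other,
    # so hardware can evaluate them in the same cycle.
    s0 = d0 > d1
    s1 = d1 > d2
    s2 = d0 > d2

    # Stage 2: pick each output from the one-bit condition signals only.
    highest = d0 if (s0 and s2) else (d1 if (not s0 and s1) else d2)
    median = d0 if (s0 != s2) else (d1 if (s0 == s1) else d2)
    lowest = d0 if (not s0 and not s2) else (d1 if (s0 and not s1) else d2)
    return highest, median, lowest

def ptvs_matches_sorted(max_value=3):
    """Exhaustively compare ptvs with a reference sort, ties included."""
    return all(
        ptvs(*t) == tuple(sorted(t, reverse=True))
        for t in product(range(max_value), repeat=3)
    )
```

Because the second stage reads only the three condition bits and the three registered inputs, the data path between the two stages does not grow beyond three values plus three bits.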
As mentioned in the previous subsection, [11] needs nine pipelined stages to finish the median sorting process. The proposed filter uses only six pipelined stages, so fewer pipelined registers are required. In addition, [11] needs 52 comparison operations while our design needs only 44, a saving of 15.38% in comparison operations.
III. VLSI ARCHITECTURE OF PROPOSED 2D FILTER
By employing the pipeline scheduling approach, we develop a low-cost, 6-stage pipelined VLSI architecture based on Fig. 8. Fig. 9 shows the block diagram of our design, which is composed of four DPTVS modules, two DMINS modules, two DMAXS modules and four PTVS modules. Eight values are input in each cycle, and four median values are generated in each clock cycle after a latency of six cycles.
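As a functional illustration of how eight sorted column parts can feed four 3×3 medians, the sketch below uses the classical column-sort identity: the median of nine values equals the median of (maximum of the column minima, median of the column medians, minimum of the column maxima). The mapping of this identity onto the modules of Fig. 9, the 4×4 window layout and the mask ordering are assumptions, since the figures are not reproduced here; `four_medians` and `reference_medians` are hypothetical names, and `sorted`, `max` and `min` stand in for the sorting and searching modules.

```python
from statistics import median

def sort3(a, b, c):
    """Stand-in for a three-valued sorting module: (lowest, median, highest)."""
    lo, mid, hi = sorted((a, b, c))
    return lo, mid, hi

def four_medians(win):
    """Medians of the four 3x3 masks inside a 4x4 window (list of 4 rows).

    Assumed mask order: (rows 0-2, cols 0-2), (rows 1-3, cols 0-2),
    (rows 0-2, cols 1-3), (rows 1-3, cols 1-3).
    """
    # Column sorts: each column yields one sorted triple for rows 0-2
    # and one for rows 1-3 (eight triples in total).
    top = [sort3(win[0][c], win[1][c], win[2][c]) for c in range(4)]
    bot = [sort3(win[1][c], win[2][c], win[3][c]) for c in range(4)]

    def combine(cols):
        lows, mids, highs = zip(*cols)
        # max of minima, median of medians, min of maxima, then their median
        return sort3(max(lows), sort3(*mids)[1], min(highs))[1]

    return [combine(top[0:3]), combine(bot[0:3]),
            combine(top[1:4]), combine(bot[1:4])]

def reference_medians(win):
    """Brute-force reference: median of each 3x3 mask, same order as above."""
    return [
        median(win[r][c] for r in range(r0, r0 + 3) for c in range(c0, c0 + 3))
        for c0 in (0, 1) for r0 in (0, 1)
    ]
```

Masks that share two columns also share the partial maximum, minimum and median comparisons over those columns, which is the kind of repetition that functional sharing targets.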
Based on Fig. 5, we design a 2-stage pipelined PTVS circuit as shown in Fig. 10. In the first stage, three comparators generate three conditional signals ($s_0$, $s_1$ and $s_2$). In the second stage, the transformed circuit that realizes Eq. (2) to (4) generates three sorted values according to $s_0$, $s_1$ and $s_2$ from the previous stage. The PTVS therefore requires three comparators, and the total size of its pipelined registers is 6×N + 3 bits, where N represents the bit-width of the input values.
Similarly, Fig. 11 shows the 2-stage pipelined circuit design of our DPTVS based on Fig. 6 . In the first stage, five comparators will generate five conditional signals in parallel. Two transformed circuits will generate two sets of the three sorted values simultaneously in the second stage. The DPTVS requires five comparators and the size of total pipelined registers is 10×N + 5 bits.
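The five-comparison sharing can be sketched as follows. Which five pairs are compared is an assumption inferred from the text (two overlapping triples that share the middle comparison); the selection logic is illustrative rather than the paper's transformed circuit, and `select3` and `dptvs` are hypothetical names.

```python
def select3(d0, d1, d2, s0, s1, s2):
    """Select (highest, median, lowest) from condition bits.

    s0 = d0 > d1, s1 = d1 > d2, s2 = d0 > d2. Illustrative logic only.
    """
    highest = d0 if (s0 and s2) else (d1 if (not s0 and s1) else d2)
    median = d0 if (s0 != s2) else (d1 if (s0 == s1) else d2)
    lowest = d0 if (not s0 and not s2) else (d1 if (s0 and not s1) else d2)
    return highest, median, lowest

def dptvs(d0, d1, d2, d3):
    """Dual parallel three-valued sort over (d0, d1, d2) and (d1, d2, d3)."""
    # Stage 1: five comparisons instead of 2 x 3 = 6,
    # because d1 > d2 is needed by both triples and computed once.
    s01 = d0 > d1
    s12 = d1 > d2  # shared comparison
    s02 = d0 > d2
    s23 = d2 > d3
    s13 = d1 > d3

    # Stage 2: two selection circuits work on the same condition bits.
    first = select3(d0, d1, d2, s01, s12, s02)
    last = select3(d1, d2, d3, s12, s23, s13)
    return first, last
```

For example, `dptvs(4, 1, 3, 2)` returns `((4, 3, 1), (3, 2, 1))`.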
Based on the algorithm shown in Fig. 7, the 2-stage pipelined circuit of DMINS/DMAXS is shown in Fig. 12. The first stage contains one CAS circuit (CAS$_1$), whose output is shared by the two CAS circuits in the next stage. In the second stage, the two CAS circuits generate two maximum/minimum values simultaneously. A single DMINS/DMAXS module requires three comparators, and the total size of its pipelined registers is 5×N bits.
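A minimal software sketch of the shared search follows. It assumes the two merged searches cover the overlapping triples (a, b, c) and (b, c, d), which is an inference from the text rather than a transcription of Fig. 7; `cas`, `dmaxs` and `dmins` are hypothetical names.

```python
def cas(a, b):
    """Compare-and-swap: one comparison, returns (smaller, larger)."""
    return (b, a) if a > b else (a, b)

def dmaxs(a, b, c, d):
    """Dual maximum search: returns (max(a, b, c), max(b, c, d))."""
    _, shared = cas(b, c)            # stage 1: one CAS on the shared pair
    _, max_first = cas(a, shared)    # stage 2: two CAS reuse the shared result
    _, max_last = cas(shared, d)
    return max_first, max_last

def dmins(a, b, c, d):
    """Dual minimum search: returns (min(a, b, c), min(b, c, d))."""
    shared, _ = cas(b, c)            # stage 1
    min_first, _ = cas(a, shared)    # stage 2
    min_last, _ = cas(shared, d)
    return min_first, min_last
```

Under this assumption the register count is consistent with the text: the first stage holds a, the shared result and d (3×N bits), and the second stage holds the two outputs (2×N bits), giving 5×N bits.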
Since the proposed architecture is composed of 4 DPTVS, 2 DMINS, 2 DMAXS and 4 PTVS modules plus 12 data registers (see Fig. 9), the total register size is 4 × (10 × N + 5) + 2 × (5 × N) + 2 × (5 × N) + 4 × (6 × N + 3) + 12 × N = 96 × N + 32 bits, and the total number of comparators is 4 × 5 + 2 × 3 + 2 × 3 + 4 × 3 = 44. Compared with the similar design [11], which requires 128 × N register bits and 52 comparators, these counts give a register saving of (32 × N − 32)/(128 × N), which is about 21.9% for N = 8 and approaches 25% as N grows, and a comparator saving of 15.38%. Both our design and [11] output 4 median values every cycle. In other words, our design needs 11 comparators on average to produce a single median value, whereas [11] needs 13. Since [8], [9] and [10] need 19, 19 and 13 comparators, respectively, to produce a single median value, our design has the lowest comparator cost.
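The resource counts above can be checked with a short script. The per-module figures are taken from the text; the function and constant names are hypothetical.

```python
def proposed_register_bits(n):
    """Register bits of the proposed design for input bit-width n."""
    dptvs = 4 * (10 * n + 5)
    dmins = 2 * (5 * n)
    dmaxs = 2 * (5 * n)
    ptvs = 4 * (6 * n + 3)
    data = 12 * n
    return dptvs + dmins + dmaxs + ptvs + data  # equals 96 * n + 32

def reference_register_bits(n):
    """Register bits reported in the text for the design in [11]."""
    return 128 * n

# Comparator counts: module count x comparators per module.
PROPOSED_COMPARATORS = 4 * 5 + 2 * 3 + 2 * 3 + 4 * 3   # DPTVS, DMINS, DMAXS, PTVS
REFERENCE_COMPARATORS = 12 * 3 + 4 * 2 + 4 * 2         # TVS, MAXS, MINS in [11]

def saving_percent(reference, proposed):
    """Relative saving of `proposed` against `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference

if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        regs = saving_percent(reference_register_bits(n), proposed_register_bits(n))
        print(f"N={n}: {proposed_register_bits(n)} bits, register saving {regs:.2f}%")
    comps = saving_percent(REFERENCE_COMPARATORS, PROPOSED_COMPARATORS)
    print(f"comparators: {PROPOSED_COMPARATORS} vs {REFERENCE_COMPARATORS}, saving {comps:.2f}%")
```

Dividing both comparator totals by the four medians produced per cycle gives the per-median figures of 11 and 13.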
IV. EXPERIMENT RESULTS
To compare the proposed design with the similar designs [8]-[11], we implemented all of them and synthesized them under the same conditions. Two synthesis flows are used. In the first, the designs are written in Verilog HDL and synthesized by Synopsys Design Compiler with the TSMC 90nm cell library (ASIC design); in the second, they are implemented on a Xilinx Virtex7 XC7VX690T FPGA (FPGA design). To evaluate the effectiveness of our architecture, we use four input bit-widths as test patterns: 8-bit, 16-bit, 32-bit and 64-bit. The power performance of the ASIC design is evaluated by Energy Per Sample (EPS), calculated as power (mW) × clock period (ns) × latency. Regarding the maximum frequency, because the critical paths of the similar designs [8]-[11] and of the proposed design are all determined by the CAS circuit, they achieve the same maximum operating frequency.
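The EPS metric defined above can be written as a small helper. The function name and the sample numbers in the usage note are hypothetical and are not measurements from this work; latency is assumed to be expressed in clock cycles.

```python
def energy_per_sample(power_mw, clock_period_ns, latency_cycles):
    """EPS = power (mW) x clock period (ns) x latency.

    Since 1 mW x 1 ns = 1 pJ, the result is in picojoules when the
    latency is given as a number of clock cycles (an assumption here).
    """
    return power_mw * clock_period_ns * latency_cycles
```

For instance, a hypothetical design drawing 2.0 mW at a 2.5 ns clock period with a 6-cycle latency would give `energy_per_sample(2.0, 2.5, 6)`, that is 30.0 pJ.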
The experimental results of the ASIC design are given in TABLE 2 and those of the FPGA design in TABLE 3. The frequency in each table refers to the maximum simulation frequency that the designs achieve in ASIC and FPGA, respectively. Compared to the other architectures, our architecture requires less area at the same operating speed. In addition, as the bit-width of the test patterns increases, our architecture has the smallest growth rate in area, because the comparison condition signals used for median value finding have a fixed bit-width. As for throughput, both [8] and [9] produce a single median result every clock cycle, whereas [10], [11] and our design output four median results simultaneously. To compare the hardware resources needed to generate a single median value, the hardware resources of [10], [11] and our design are divided by four; the results in brackets are the resources required to produce a single median value in [10], [11] and our design, respectively. The simulation results in TABLE 2 and TABLE 3 show that the proposed architecture has the lowest cost among the similar architectures.
V. CONCLUSION
The two-dimensional median filter is widely applied in image processing for noise removal. To increase operating speed, many hardware implementations have been proposed. This paper proposes a low-cost design of a 2D median filter. We employ two techniques to reduce the area cost. One is the parallel three-valued sorting method, which reduces the number of pipelined registers; the other is the concept of functional sharing, which reduces the number of required comparators. The simulation results show that the area cost is reduced by more than 30% compared with the latest design [11]. In addition, as the bit-width of the test patterns increases, our architecture has the smallest growth rate in area.
