Abstract: This paper presents an FPGA implementation of a high performance rank filter for video and image processing. The architecture exploits the features of current FPGAs and offers a tradeoff between complexity and clock speed. By maximizing the operating frequency, the complexity of the filter structure can be considerably reduced compared to previous 2D architectures.
I. INTRODUCTION
Rank order filtering is a non-linear filtering technique that selects an element from an ordered list of TAP samples. In the two-dimensional (2D) case, filtering takes place on the contents of a rectangular window (or, more generally, a window of arbitrary shape) that slides across the image. Every time the window moves by one pixel column, a set of obsolete elements is discarded and a set of new elements is inserted. The samples within the window are sorted, and the element with the specified rank replaces the output element of the window. The most typical ranks are median, minimum and maximum, but the selection can easily be tailored to the needs of any application. Compared to other filters, such as FIR, Laplacian or blur filters, rank filters can effectively remove impulse-like noise while preserving the edges of the original image. This can be very useful in various applications, for instance removing certain types of transmission noise, or pre-processing for edge detection.
This paper presents a hardware architecture that is tailored for high performance color video processing but can be used in various applications as an IP block by taking advantage of design-time parameterization. The paper concentrates on the timing-driven architecture selection, which exploits the high operating frequency of recent FPGAs and thus reduces hardware resource requirements.
Bit-serial approaches [1], [2] provide the lowest complexity, but do not lend themselves well to high sample rate implementations, as the processing time per output is proportional to the precision (bit width) of the input data. However, the processing rate typically does not depend on the number of samples that change between processing cycles.
Insert-delete or sorting network based architectures [3] explicitly order the incoming samples. In every cycle, the least recent sample is discarded and the most recent input is inserted into the magnitude sorting structure at the appropriate location. While these solutions require relatively few comparators, the feedback nature of the algorithm hinders pipelining.
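The insert-delete principle can be illustrated with a minimal software sketch. It is a behavioral model only, not the hardware structure of [3]; the function name and the use of a FIFO alongside the sorted list are assumptions made for illustration.

```python
from bisect import bisect_left, insort
from collections import deque


def insert_delete_rank_filter(samples, taps, rank):
    """1D rank filter that keeps the window in a magnitude-sorted list."""
    fifo = deque()   # window samples in order of arrival
    ordered = []     # the same samples, sorted by magnitude
    out = []
    for x in samples:
        if len(fifo) == taps:
            oldest = fifo.popleft()
            # Discard the least recent sample from the sorted structure.
            del ordered[bisect_left(ordered, oldest)]
        fifo.append(x)
        insort(ordered, x)  # insert the most recent sample at its place
        if len(fifo) == taps:
            out.append(ordered[rank])
    return out
```

Each new output depends on the sorted list left behind by the previous step, which is the feedback that makes such structures hard to pipeline.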
Another set of architectures stores the samples in the order of arrival and selects the appropriate output sample by calculating its location dynamically. These architectures are easier to pipeline and still require few comparators.
III. PROPOSED ARCHITECTURE
When filtering images or video, the filter window slides horizontally across the input image, as shown in Fig. 1. In the case of a simple rectangular window, WV new input samples (WV being the vertical size of the filter window) must be processed to generate a valid output. Therefore, for non-bit-serial implementations, an important classification criterion is the level of input parallelization.
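As a software reference for the sliding window described above, the following sketch filters one output row by discarding the oldest column and inserting WV new samples at each step. It assumes the image is a list of rows, that the window lies fully inside the image (no border handling), and that sorting is an acceptable stand-in for the ranking hardware; the function and parameter names are illustrative only.

```python
def rank_filter_row(image, top, wv, wh, rank):
    """Rank-filter one output row; the window covers rows top..top+wv-1."""
    width = len(image[0])

    def column(c):
        # The WV samples that one window column contributes.
        return [image[top + r][c] for r in range(wv)]

    # Window contents stored column by column, oldest column first.
    window = [column(c) for c in range(wh)]
    out = [sorted(v for col in window for v in col)[rank]]
    for c in range(wh, width):
        window.pop(0)             # discard the obsolete column
        window.append(column(c))  # insert WV new samples
        out.append(sorted(v for col in window for v in col)[rank])
    return out
```

For a 3x3 window, rank 4 selects the median, rank 0 the minimum and rank 8 the maximum.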
Word-serial architectures can process one input sample per clock cycle. This is the typical structure for filtering 1D inputs, but it is also applicable to 2D filtering. In this case the filter must operate at WV times the input pixel frequency, and it generates a valid output sample every WV-th clock cycle.
The other extreme is the full-parallel approach: these filters can generate a valid output every clock cycle, irrespective of the number of input samples required to achieve this. Consequently, such filters process WV new samples in a single clock cycle. Hence the required operating frequency equals the input pixel frequency, while at the same time hardware resource requirements are greatly increased. Previous papers typically considered fully parallel architectures for 2D filters; however, as this paper shows, with recent FPGA technologies this solution is sub-optimal due to its inefficient resource utilization.
Multi-word architectures are hybrid solutions: in one cycle they can handle more than one input sample, but fewer than the fully parallel implementation (from now on, let NI denote the number of new input samples in a single cycle). This solution allows finding a good balance between operating frequency and hardware complexity. Using a given filter window and input pixel frequency, the required operating frequency can be computed:
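Assuming that the core needs ⌈WV/NI⌉ clock cycles to load one window column, one form of this relation that agrees with both the word-serial case (NI = 1) and the full-parallel case (NI = WV) is given below. The symbols f_op (core operating frequency) and f_pixel (input pixel frequency) are names introduced here for illustration.

```latex
f_{op} = \left\lceil \frac{W_V}{N_I} \right\rceil \cdot f_{pixel}
```

With NI = 1 this gives WV times the pixel frequency, and with NI = WV it reduces to the pixel frequency itself.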
When processing color images, using the full per-pixel information (e.g. full RGB or YCbCr values) is not a convenient solution. Filtering these components independently not only increases computational requirements, but may also introduce blur effects, as it may generate new color values that did not exist in the input image. A better solution is to use a magnitude-like value, e.g. luminosity. If the input format does not contain such a component, it can be generated within the filter.
A. Global Filter Architecture
The proposed architecture consists of five main components (illustrated in Fig. 2). Henceforward only the Filter Core and its extensions are discussed in detail, as this is the essential part of the filter.
B. Word-serial Filter Core
The operation of the Filter Core is based on observations introduced in [5]. As a first assumption, the filter contains TAP different samples. For each sample an index value is generated, which equals the number of samples that are smaller than the given sample. This results in TAP distinct values for the TAP samples, ranging from 0 (smallest sample) to TAP-1 (largest sample). The ranked sample is the one whose index value equals the required rank. The block diagram in Fig. 3 illustrates the hardware implementation of the algorithm for TAP=5.
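The index-based selection can be sketched in a few lines of software. This is a behavioral illustration under the same assumption of distinct samples, not a description of the circuit in Fig. 3; the function name is hypothetical.

```python
def rank_by_index(samples, rank):
    """Select the sample whose index (count of smaller samples) equals rank."""
    # Assumes all samples are distinct, as stated in the text.
    # Each "other < s" corresponds to one comparator result; the sum
    # corresponds to counting the ones among those results.
    index = [sum(other < s for other in samples) for s in samples]
    # Exactly one sample has an index equal to the requested rank.
    return samples[index.index(rank)]
```

For the five samples 40, 10, 50, 20, 30 the index values are 3, 0, 4, 1, 2, so rank 2 (the median) selects 30. The samples themselves never move; only their index values are evaluated.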
If WV is not an integer multiple of NI, the bandwidth of the filter core input exceeds that of the input stream, so in some clock cycles the number of valid new samples is going to be less than NI. The simplest solution to make the filter capable of processing a varying number of new samples is to insert multiplexers into the appropriate data paths, in front of the D[], ND[], CR[] and CN[] registers.
Two-to-one multiplexers are sufficient, because during the operation of the filter there are only two different scenarios: either all NI inputs are valid, or there are only (WV mod NI) valid values (see Fig. 4). Thus the size of the multiplexers is limited to 2:1, but numerous multiplexers are still required.
Another solution is to insert padding samples as necessary, so that NI new samples are entered in every clock cycle, thus creating a virtual filter (from now on referred to as the virtual filter kernel). Fig. 4 illustrates such a kernel for the WV=3, NI=2 case. Valid samples in the window are marked with light grey; padding samples are marked with dark grey (the actual value of the padding samples is irrelevant). Obviously, this method makes the virtual kernel larger than the real filter window and hence requires more hardware resources, as parts of the Filter Core scale with the virtual kernel size.
Fig. 5 presents the contents of the data registers clock by clock, using the numbering of Fig. 4, as new inputs are inserted and the filter window is moved horizontally. Valid and invalid (padding) samples are marked just as in the previous figure. Samples on the right are the input samples. As most of the data registers contain both valid and invalid samples during operation, comparisons are done using all required data registers, irrespective of the validity of the actual sample. As a result, the number of comparators required scales with the size of the virtual filter kernel. Padding samples are masked after the comparator result registers (CR[], CN[]), but before the 1CNT blocks. For each older sample, masking is done on 2*NI bits: NI bits mask the comparison results with the NI new samples, and another NI bits mask the comparison results of the oldest NI samples. The output ranking part is the same as in the single-word case.
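The effect of masking can be shown with a small behavioral sketch: comparisons are computed for every position of the virtual kernel, and the results that involve padding are masked before the ones are counted. The flat list layout, the validity flags and the function name are assumptions made for illustration; valid samples are assumed to be distinct.

```python
def masked_rank(samples, valid, rank):
    """Rank selection over a virtual kernel that contains padding samples."""
    n = len(samples)
    # Comparisons run on every register, regardless of sample validity.
    smaller = [[samples[j] < samples[i] for j in range(n)] for i in range(n)]
    for i in range(n):
        if not valid[i]:
            continue  # a padding sample can never be the output
        # Mask the results against padding before counting the ones.
        index = sum(smaller[i][j] and valid[j] for j in range(n))
        if index == rank:
            return samples[i]
    return None
```

For the WV=3, NI=2 case, each virtual column holds three valid samples and one padding sample. With the kernel [1, 5, 9, P, 2, 6, 10, P, 3, 7, 11, P], the median (rank 4) is 6 whatever value P takes, which reflects the remark that the value of the padding samples is irrelevant.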
The number of required equality comparators scales with the size of the real filter window, as it is sufficient to select the appropriate output when all samples in a new column have been inserted into the filter. In these cycles the locations of the valid samples are well defined.
D. Multi-word filter with multiple outputs
If real samples are used for padding, the virtual filter kernel can be viewed as NP+1 real filter windows joined together, where NP is the number of padding lines added to the filter window to form the virtual filter kernel. For example, the 3x4 virtual kernel in Fig. 4 can be viewed as two 3x3 filter windows joined together. The Filter Core presented in the previous section already computes all the comparison results required to generate valid outputs for both filter windows; however, the mask generator, the one-counters and the output address generator must be replicated. The advantage is that the relation between the operating frequency and the number of new inputs processed in a single cycle becomes even better:
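A plausible form of this relation can be derived under two assumptions: that WV+NP is an integer multiple of NI, and that each pass over a virtual column yields NP+1 outputs. The core then spends (WV+NP)/NI clock cycles per NP+1 output pixels. The symbols f_op and f_pixel are names introduced here for the core operating frequency and the input pixel frequency.

```latex
f_{op} = \frac{W_V + N_P}{N_I \, (N_P + 1)} \cdot f_{pixel}
```

For the WV=3, NI=2, NP=1 example this gives an operating frequency equal to the pixel frequency, instead of twice the pixel frequency when only one output is generated per virtual column.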
The drawback is that the Line Buffer should store WV lines of the input image instead of WV-1, and in case of real-time video filtering an output buffer is also required.
IV. IMPLEMENTATION RESULTS
The following implementation results were obtained using 24-bit RGB input, while the FVG was set to sum the three color components and output the 10-bit result. Table II summarizes the obtainable operating frequency of the word-serial architecture for different Xilinx FPGA families and different TAP numbers. As the most demanding commercial video format (HDTV 1920*1080p) has a pixel frequency of 75 MHz, the required filter architecture can easily be selected based on Table II. For example, a Virtex-4 device can perform real-time filtering on an HDTV source using a 49 TAP filter by employing a multi-word Filter Core configuration with 2 input samples per clock cycle.
Fig. 6 summarizes the resource requirements of a 49 TAP rank filter using different Filter Core configurations (configuration: WVxWH/NI). LUT and FF denote the number of LUTs and flip-flops, respectively, in Virtex-4 and Virtex-5 devices. As can be seen in Fig. 6, there are multi-word configurations (such as 7x7/5 and 7x7/6) that require more resources than the full-parallel architecture (7x7/7). The reason is that the virtual filter kernel becomes much larger than the real filter window due to the large number of padding samples. Obviously, these configurations should not be used; however, as can be calculated from Table II, they are not required even in the slowest FPGAs.
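The trade-off between configurations can be explored with a short calculation. The sketch below assumes that one window column takes ceil(WV/NI) clock cycles to load and that padding fills each column up to a multiple of NI; the function name and return format are illustrative and do not reproduce the measured results of Table II or Fig. 6.

```python
from math import ceil


def core_config(wv, wh, ni, pixel_mhz):
    """Return (required core clock in MHz, virtual kernel size in samples)."""
    cycles_per_column = ceil(wv / ni)      # clock cycles to load one column
    virtual_rows = cycles_per_column * ni  # real rows plus padding rows
    return cycles_per_column * pixel_mhz, virtual_rows * wh


for ni in range(1, 8):
    clock, size = core_config(7, 7, ni, 75)
    print(f"7x7/{ni}: {clock} MHz, virtual kernel of {size} samples")
```

Under these assumptions, 7x7/2 needs a 300 MHz core clock with a 56-sample virtual kernel. The 7x7/5 and 7x7/6 configurations need the same 150 MHz as 7x7/4, but with virtual kernels of 70 and 84 samples instead of 56, which is consistent with the observation that they cost more than the 49-sample full-parallel case.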
V. CONCLUSION
An efficient architecture for a high performance two-dimensional rank filter was presented. Rank order filters, especially median filters, are used extensively for removing non-Gaussian (salt and pepper) noise from images and video feeds. Targeting FPGA implementations for video applications, a parametrizable structure was proposed, which delivers efficient solutions tailored to different pixel clock rates, available resources, and operating speeds. Compared to previous 2D architectures, the size and complexity of the filter structure were considerably reduced by balancing the number of new input samples entered into the core against the available operating frequency of the filter. The proposed solution is independent of the input data type: it can either generate magnitude information from RGB data or use preexisting magnitude information if such data is already available. The presented architecture can be further generalized to use arbitrarily shaped filter kernels and to perform weighted filtering.
