Median filtering is a well-known method used in a wide range of application frameworks as well as a standalone filter, especially for salt-and-pepper denoising. It is able to highly reduce power of noise while minimizing edge blurring. Currently, its existing algorithms and implementations have proved quite efficient but leave room for improvement as far as processing speed is concerned, which has lead us to further investigate the specificities of modern GPUs. In this paper, we propose the GPU implementation of fixed-size kernel median filters, able to output up to 1.8 billion pixels per second. Based on a Branchless Vectorized Median class algorithm and implemented through memory fine tuning and the use of GPU registers, our median drastically outperforms existing implementations, resulting, as far as we know, in the fastest median filter to date.
Introduction
First introduced by Tukey in [8] , median filtering has been widely studied since then, and many researchers have proposed efficient implementations of it, adapted to various hypothesis, architectures and processors. Originally, its main drawbacks were its compute complexity, its non linearity and its data-dependent runtime. Several researchers have adressed these issues and designed, for example, efficient histogram-based median filters featuring predictable runtimes [4, 9] . More recently, authors managed to take advantage of the newly opened perspectives offered by modern GPUs, to develop such CUDA-based filters such as the Branchless Vectorized Median filter (BVM) [5, 2] which allows very interesting runtimes and the histogram-based, PCMF median filter [6] which was the fastest median filter implementation known to us.
The use of a GPU as a general-purpose computing processor raises the issue of data transfers, especially when kernel runtime is fast and/or when large data sets are processed. In certain cases, data transfers between GPU and CPU are slower than the actual computation on GPU, even though global GPU processes can prove faster than similar ones run on CPU.
In the following section, we propose the overall code structure to be used with our median kernels. For more concision and readability, our coding will be restricted to 8 or 16 bit graylevel input images whose height (H) and width (W ) are both multiples of 512 pixels. Let us also point out that the following implementation, targeted on Nvidia Tesla GPU (Fermi architecture, compute capability 2.x), may easily be adapted to other models e.g. those of compute capability 1.3.
General structure
Algorithm 1 describes how data is handled in our code. Input image data is stored in the GPU's texture memory so as to benefit from the 2-D caching mechanism. After kernel execution, copying output image back to CPU memory is done by use of pinned memory, which drastically accelerates data transfer.
Algorithm 1: Global memory management on CPU and GPU sides. 3 Implementing a fast median filter
Basic principles
Designing a 2-D median filter basically consists in defining a square window H(i, j) for each pixel I(i, j) of the input image, containing n = k×k pixels and centered on I(i, j). The output value I (i, j) is the median value of the gray level values of the k×k pixels of H(i, j). Figure  1 illustrates this principle with an example of a 5x5 median filter applied on pixel I(5, 6). Obviously, one key issue is the selection method that identifies the median value, which can be done using either histogram-based or sorting methods. But, as shown in figure 1, since two neighboring pixels share part of the values to be sorted, a second key issue is how to rule redundancy between consecutive positions of the running window H(i, j). 
Using registers
As register access is at least 20 times faster than all the other memory types available on the GPU, it is natural to turn to the use of registers as a means to store temporary data, keeping in mind that on the fermi architecture, each individual thread can use a maximum of 63 registers within the limit of 32K per thread block. Consequently, it is important to use registers sparingly in order to preserve high pixel throughput values: to do so, we use the iterative forgetfull selection algorithm. Its principle is, at each iteration, to identify then to eliminate (forget) both elements showing the maximum and the minimum values among the current list, then to add the next candidate element left in the original list, till none is left; the last value remaining actually is the global median value. Figure 2 illustrates the forgetfull selection process applied to a 3 × 3 pixel median filter. For clarity reasons, the nine values have been represented in a row. The minimum register count R n needed to start this iterative selection process among k × k values is given by:
The selection of both extrema is implemented through an basic 2-element swapping function. This ensures that the GPU kernel code is free of divergent branches liable to severely impact performance.
Hiding Latencies
Optimizing a GPU kernel also means hiding latencies. The offered massive thread parallelism helps in doing so transparently but, considering the actual computation performed by each thread, optimization may be taken a few steps further: First, we maximized the Instruction Level Parallelism inside the forgetfull selection method by re-arranging the instruction sequence of an incomplete bitonic sort [1, 3] so as to reduce the data dependency of consecutive instructions. second, we divided thread block size by 2, having each thread perform the computation of two neighbor input pixels instead of just one, thus keeping grid size unchanged while reducing the effect of global memory access latency. Additionally, window overlapping also had to be managed, as illustrated by Figure 1 , in order to minimize the increase of register count per thread brought by the new computing organization. Two k × k consecutive neighboring windows share S n = n − √ n pixels, which is more than the number of registers needed to perform the median selection, i.e. R n (or equal for 3 × 3 median filter). The (S n − R n + 1) first selection stages can then be considered common to both windows, leaving only the k nonshared pixels of each window to be processed separately. The above technique saves k + 1 registers for each pair of input pixels, which means that each thread block now uses fewer registers while processing the same pixel count, thus allowing a higher level of parallelism. 
Compute complexity
Arithmetic instructions and texture fetches are easy to count but the number of element swaps needed to select the median value is data dependent and only its maximum can be evaluated. The incomplete sorting (forgetful selection) and the redundancy management performed when outputing two pixels per thread lead to the instruction count below:
• 5 integer multiplications and 5 integer additions to compute thread indexes and ouput pixel coordinates.
• 2n−1 additions to compute neighbor pixel coordinates.
• n − √ n texture fetches to load gray-level values.
• within a m-element vector and according to our selection method, the maximum Figure 3 : Reducing register count in a 5×5 register-only median kernel processing 2 input pixels. The first 7 forgetful selection stages are common to both processed center pixels.
number of element swaps needed to move the minimum value at first position and the maximum at last is given by:
Consequently, the number of element swaps needed by the entire selection of two median values performed by one thread is inferior or equal to:
The above sum equals 42,156 and 474 for typical window sizes of 3 × 3, 5 × 5 and 7 × 7. The total instruction count is thus kept near a O(nlog(n)) rule.
Experiments
Runtimes have been obtained by averaging 1000 executions on a C2070 GPU card hosted by a system with one Xeon E56202.40GHz processor running a linux kernel 2.6.18x86 64 and CUDA v4.0. Each kernel has been run on images of sizes 512×512, 1024×1024, 2048×2048 and 4096×4096. Like many authors, we computed the pixel throughput value as our main performance indicator. It includes kernel runtime as well as transfer times to and from the GPU. We also measured the maximum effective pixel throughput that our GPU/host couple is able to achieve, by running an identity kernel that fetches the gray-level of each pixel in texture memory and outputs it into global memory exclusive of any other instruction. Knowing this peak value allows us to evaluate the potential for improvement of each kernel and helps in deciding on further investigation. Those peak values are detailed in Table  2 , which shows that the larger the image, the higher the expected throughput.
Results
The main contribution of this work is to show how to tune a CUDA GPU implementation in order to achieve highest pixel throughput values. Runtimes, as well as pixel throughput values are gathered in Table 3 . Figure 4 Gray level format 
