Multi-scale 2-D Gaussian filters have been widely used in feature extraction (e.g., SIFT, edge detection), image segmentation, image enhancement, image noise removal, multi-scale shape description, and so on. However, their computational complexity remains an issue for real-time image processing systems. To address this problem, we propose an FPGA-based framework for multi-scale 2-D Gaussian filtering in this paper. Firstly, a full-hardware architecture based on a parallel pipeline is designed to achieve a high throughput rate. Secondly, in order to save multipliers, the 2-D convolution is separated into two 1-D convolutions. Thirdly, a dedicated first-in first-out memory named CAFIFO (Column Addressing FIFO) is designed to avoid the error propagation induced by glitches on the clock. Finally, a shared memory framework is designed to reduce memory cost. As a demonstration, we realized a 3-scale 2-D Gaussian filter on a single ALTERA Cyclone III FPGA chip. Experimental results show that the proposed framework can compute a multi-scale 2-D Gaussian filtering within one pixel clock period and is therefore suitable for real-time image processing. Moreover, the main principle can be extended to other convolution-based operators, such as the Gabor filter and the Sobel operator.
INTRODUCTION
A Gaussian filter is a filter whose impulse response is a Gaussian function (or an approximation to it). Gaussian filters have the property of producing no overshoot in response to a step input while minimizing the rise and fall time. This behavior is closely connected to the fact that the Gaussian filter has the minimum possible group delay. It is considered the ideal time-domain filter [1]. The multi-scale 2-D Gaussian filter is a filter bank composed of two or more 2-D Gaussian filters that differ in standard deviation (σ). It has been widely used in feature extraction (e.g., SIFT, edge detection) [2][3][4][5], image segmentation [6], image enhancement [7], image noise removal, and so on. However, the computational complexity of the multi-scale 2-D Gaussian filter remains a challenge for real-time image processing systems.
In recent years, many fast Gaussian filter implementation methods have been proposed. Robinson [8] proposed a method for multidimensional Gaussian filtering that uses an efficient one-pass cascade of overlapping local-average windows driven by prefix sums; the method allows fast approximate Gaussian filtering in any number of dimensions. Imajo [9] proposed a spline-based fast Gaussian filtering algorithm. Using an nth-order spline and a pre-computed integrated input image, the method can calculate a Gaussian-filtered pixel value with several multiplications and summations, in constant time with respect to the size of the source image and the size of the Gaussian filter. Khorbotly and Hassan [10] use linear programming techniques with a curve fitting method to approximate Gaussian filters with recursive exponential filters. In [11], an efficient constant-time Gaussian filter was proposed that provides high accuracy at a low cost over a wide range of scales σ. Its key idea is to derive a second-order shift property of the DCT-5 to compute short-time DCT coefficients, which achieves a lower computational cost than existing algorithms.
In [12], a General Purpose Computing on GPU (GPGPU) framework is discussed to accelerate 2-D Gaussian filtering. This framework takes advantage of the GPU's parallel computing ability and achieves better data efficiency without reducing the amount of computation, while maintaining the filtering quality. Although the above methods can speed up Gaussian filtering to some degree, they still cannot operate in-place and are not suitable for hard real-time systems. Laurent, Nicolas and Benoît [13] proposed an FPGA-based 2-D Gaussian filter in which the two-dimensional convolution is separated into two one-dimensional convolutions. The 2-D Gaussian filter is then designed as a full-hardware architecture deployed onto an FPGA device, and it can process a pixel in-place. However, because the line buffers are placed behind the multipliers, the filters cannot share the line buffers, so a filter bank consumes more memory. In this paper, we propose a new framework for a real-time multi-scale 2-D Gaussian filter. The proposed filter is implemented with a full-hardware architecture based on a parallel pipeline. Experimental results show that the proposed filter can process a pixel in-place and is therefore suitable for real-time image processing.
DISCRETE GAUSSIAN FILTER
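The impulse response of the 2-D Gaussian filter in the spatial domain is defined as follows:

$$G(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \tag{1}$$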
where x is the distance from the origin in the horizontal axis, y is the distance from the origin in the vertical axis, and σ is the standard deviation of the Gaussian distribution.
The Gaussian function is non-zero for x ∈ (−∞, +∞) and would theoretically require an infinite window length. However, since it decays rapidly, it is often reasonable to truncate the filter window and implement the filter directly for narrow windows, in effect by using a simple rectangular window function [1].
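As a rough illustration of why truncation is acceptable, the following sketch estimates how much of the area under a 1-D Gaussian falls inside a truncated window. It assumes that an n-tap window covers the continuous interval [−n/2, n/2] in pixel units; this convention and the function name `mass_inside` are illustrative assumptions, not part of the original design. The three (σ, taps) pairs are the scales and kernel sizes used in this paper.

```python
import math

def mass_inside(sigma, taps):
    """Fraction of the area under a 1-D Gaussian inside a window of `taps` pixels.

    Assumption: the window covers the continuous interval [-taps/2, taps/2].
    """
    half_width = taps / 2.0
    return math.erf(half_width / (sigma * math.sqrt(2.0)))

# Scales and kernel sizes used for the three filters in this paper.
retained = {(sigma, taps): mass_inside(sigma, taps)
            for sigma, taps in ((0.5, 3), (1.0, 5), (1.5, 7))}

for (sigma, taps), fraction in retained.items():
    print(f"sigma={sigma}, taps={taps}: {fraction:.4f} of the mass retained")
```

Under this assumption, each of the three windows retains more than 98% of the mass, which is consistent with treating the truncation error as small.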
The filter function is said to be the kernel of an integral transform. The Gaussian kernel is continuous. Most commonly, its discrete equivalent is the sampled Gaussian kernel, which is produced by sampling points from the continuous Gaussian, so equation (1) can be rewritten in discrete form as

$$G(i, j) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{i^{2} + j^{2}}{2\sigma^{2}}\right) \tag{2}$$

In order to preserve the image brightness, the Gaussian kernel coefficients should be normalized as follows:

$$\sum_{i}\sum_{j} G(i, j) = 1$$
and equation (2) can be simplified to

$$G(i, j) = \frac{\exp\left(-\frac{i^{2} + j^{2}}{2\sigma^{2}}\right)}{\sum_{i}\sum_{j} \exp\left(-\frac{i^{2} + j^{2}}{2\sigma^{2}}\right)}$$

A multi-scale Gaussian filter is a filter bank composed of two or more Gaussian filters that differ in standard deviation (σ). In this paper, we introduce a real-time implementation method for the multi-scale 2-D Gaussian filter, assuming that the filter bank is composed of 3 Gaussian filters whose standard deviations are 0.5, 1.0, and 1.5, respectively. Fig. 1 shows the main principle of the proposed filter. Because the Gaussian function is separable, the 2-D convolution is separated into two 1-D convolutions in order to save multipliers. In Fig. 1, 'img' denotes the input image, and 'out_0.5', 'out_1.0', and 'out_1.5' denote the outputs of the Gaussian filters whose standard deviations are 0.5, 1.0, and 1.5, respectively. 'Line buffer' is a dedicated FIFO named CAFIFO, which is used to buffer one row of the image, and 'D' is a shift register that is used to buffer one pixel of the image. '3 points convolution' is a 1-D convolution whose window size is 3×1; it is used to perform both the vertical and the horizontal convolution. Similarly, '5 points convolution' and '7 points convolution' are 1-D convolutions whose window sizes are 5×1 and 7×1, respectively.
Figure 1. The schematic of the multi-scale 2-D Gaussian filter
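The separation into two 1-D convolutions can be checked numerically. The sketch below is a software illustration only, not the hardware design: it compares a vertical pass followed by a horizontal pass against a direct 2-D convolution with the outer-product kernel. All function names and the synthetic test image are hypothetical, and border pixels are simply left at zero because no boundary extension is assumed. For kernel sizes 3, 5, and 7, a direct 2-D implementation needs 9 + 25 + 49 = 83 multiplications per pixel, whereas the separable form needs 2 × (3 + 5 + 7) = 30.

```python
import math

def gaussian_1d(sigma, size):
    """Normalized, sampled 1-D Gaussian kernel with an odd number of taps."""
    r = size // 2
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-r, r + 1)]
    total = sum(w)
    return [v / total for v in w]

def conv_vertical(img, k):
    """1-D convolution along each column; border rows are left at zero."""
    r, h, w = len(k) // 2, len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(w):
            out[y][x] = sum(k[i + r] * img[y + i][x] for i in range(-r, r + 1))
    return out

def conv_horizontal(img, k):
    """1-D convolution along each row; border columns are left at zero."""
    r, h, w = len(k) // 2, len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(r, w - r):
            out[y][x] = sum(k[j + r] * img[y][x + j] for j in range(-r, r + 1))
    return out

def conv_2d(img, k):
    """Direct 2-D convolution with the outer-product kernel k[i] * k[j]."""
    r, h, w = len(k) // 2, len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r, w - r):
            out[y][x] = sum(k[i + r] * k[j + r] * img[y + i][x + j]
                            for i in range(-r, r + 1)
                            for j in range(-r, r + 1))
    return out

def max_interior_difference(img, k):
    """Largest absolute difference between both methods away from the border."""
    r = len(k) // 2
    separable = conv_horizontal(conv_vertical(img, k), k)
    direct = conv_2d(img, k)
    return max(abs(separable[y][x] - direct[y][x])
               for y in range(r, len(img) - r)
               for x in range(r, len(img[0]) - r))

# Synthetic 12x12 test image (hypothetical data, values in 0..255).
image = [[(5 * y * y + 3 * x * x + x * y) % 256 for x in range(12)]
         for y in range(12)]
differences = [max_interior_difference(image, gaussian_1d(sigma, size))
               for sigma, size in ((0.5, 3), (1.0, 5), (1.5, 7))]
print(differences)
```

The differences are at the level of floating-point rounding, so the two forms agree on all interior pixels.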

Design of CAFIFO
Generally, a line buffer is implemented with a traditional FIFO whose depth equals the image width. However, a traditional FIFO often malfunctions when a glitch appears on the clock, and the resulting error propagates. To solve this problem, we designed a new FIFO named CAFIFO (Column Addressing FIFO); its main principle is shown in Fig. 2. Fig. 2 a) is the schematic of the CAFIFO, and Fig. 2 b) is its operation sequence.
The CAFIFO is composed of a counter and a RAM (Random-Access Memory), where the counter is a modulo-N counter and N equals the CAFIFO depth. In Fig. 2 a), 'input' denotes the video data stream, 'HB' denotes the horizontal blanking signal of the video, and 'pclk' denotes the pixel clock. The output of the counter acts as the access address of the RAM. The counter is cleared to zero when HB is low; when HB is high, it is incremented by 1 at each rising edge of pclk. That is to say, the access address of the RAM is synchronized with the rising edge of HB, so the proposed CAFIFO stops the error propagation induced by glitches on the clock. According to the operation sequence, the CAFIFO reads the old data from the RAM when pclk is high and writes the new data to the RAM when pclk is low. Hence, the output is delayed relative to the input.
Proc. of SPIE Vol. 9301 930104-3
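The re-synchronizing behavior described above can be illustrated with a small behavioral model. This is a software sketch under stated assumptions, not the actual hardware description. The class and function names are hypothetical, the read and write phases of one pclk period are collapsed into a single call, and a clock glitch is modeled as one spurious extra clock event at the end of a line. For comparison, a conventional FIFO is modeled with a queue, under the assumption that the glitch advances only its write side.

```python
from collections import deque

class CAFIFO:
    """Line buffer whose RAM address comes from a counter cleared by HB."""

    def __init__(self, depth):
        self.depth = depth
        self.ram = [0] * depth
        self.addr = 0

    def clock(self, hb, data):
        if not hb:
            self.addr = 0                         # counter cleared while HB is low
            return None
        old = self.ram[self.addr]                 # read old data (pclk high phase)
        self.ram[self.addr] = data                # write new data (pclk low phase)
        self.addr = (self.addr + 1) % self.depth  # modulo-N counter
        return old

    def glitch(self, data):
        self.clock(True, data)                    # one spurious extra clock event

class PlainFIFO:
    """Conventional FIFO used as a line delay, with no HB re-synchronization."""

    def __init__(self, depth):
        self.queue = deque([0] * depth)

    def clock(self, hb, data):
        if not hb:
            return None
        self.queue.append(data)
        return self.queue.popleft()

    def glitch(self, data):
        self.queue.append(data)                   # assumed fault: write side only

def stream(buffer, lines, glitch_after_line):
    """Feed image lines through a buffer and collect the output of each line."""
    outputs = []
    for index, line in enumerate(lines):
        buffer.clock(False, 0)                    # horizontal blanking
        outputs.append([buffer.clock(True, pixel) for pixel in line])
        if index == glitch_after_line:
            buffer.glitch(line[-1])
    return outputs

WIDTH = 8
lines = [[100 * row + col for col in range(WIDTH)] for row in range(5)]
cafifo_out = stream(CAFIFO(WIDTH), lines, glitch_after_line=1)
plain_out = stream(PlainFIFO(WIDTH), lines, glitch_after_line=1)

print("CAFIFO realigned on line 3:", cafifo_out[3] == lines[2])
print("Plain FIFO realigned on line 3:", plain_out[3] == lines[2])
```

In this model the glitch corrupts at most the line in which it occurs for the CAFIFO, because the address counter restarts from zero at every line, whereas the queue-based FIFO stays misaligned by one pixel on all later lines.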
Design of filter kernels
The proposed filter is composed of three Gaussian filters whose standard deviations are 0.5, 1.0, and 1.5, respectively, so three filter kernels must be determined accordingly. Reference [1] points out that a Gaussian kernel requires 6σ − 1 values; according to this rule, the kernel sizes should be 2, 5, and 8. Considering the odd-size restriction on the kernel and the memory cost, we set the kernel sizes to 3, 5, and 7, respectively. Although the third kernel is smaller than the theoretical size and may cause some error, a small kernel reduces the memory cost, and the error is small enough to be ignored in actual applications. In order to reduce the multiplier cost while achieving high precision, the proposed filter runs in integer arithmetic and the sum of the coefficients is normalized to 512, so the coefficients of the 1-D kernel can be computed by:
Finally, we obtain the filter kernel coefficients listed in Table 1.
Table 1. Coefficients of filters
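A minimal sketch of how integer 1-D kernels that sum to 512 might be generated is given below. The rounding scheme is an assumption: each tap is rounded to the nearest integer and any rounding residue is absorbed by the center tap. The function name `integer_kernel` is hypothetical, and the values produced are not claimed to match Table 1.

```python
import math

def integer_kernel(sigma, size, total=512):
    """Integer 1-D Gaussian kernel whose taps sum exactly to `total`.

    Assumption: round each tap, then correct the center tap so that the
    sum is exact. The original design may use a different scheme.
    """
    r = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-r, r + 1)]
    scale = total / sum(weights)
    taps = [round(w * scale) for w in weights]
    taps[r] += total - sum(taps)   # absorb the rounding residue in the center
    return taps

kernels = {sigma: integer_kernel(sigma, size)
           for sigma, size in ((0.5, 3), (1.0, 5), (1.5, 7))}

for sigma, taps in kernels.items():
    print(sigma, taps, sum(taps))
```

Because 512 = 2^9, the result of each 1-D pass can be renormalized with a 9-bit right shift instead of a divider, which is one plausible reason for choosing this sum.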
EVALUATION
In order to evaluate the proposed multi-scale 2-D Gaussian filter, we realized it on an ALTERA Cyclone III FPGA. Details are shown in Table 1, and the implementation results are shown in Table 2. With this implementation, only one pixel clock period is required to compute the 3-scale 2-D Gaussian filter. We applied co-simulation using MATLAB and Quartus II to evaluate the proposed filter with the following steps:
1) Gaussian noise or salt-and-pepper noise is added to the test image in the MATLAB environment to obtain a noisy image.
2) The noisy image is converted to a memory initialization file (.mif file) using MATLAB. In the design of the multi-scale 2-D Gaussian filter, this file is used to initialize a ROM that simulates the input of the filter.
3) Using Quartus II, the design of the proposed filter is compiled and simulated, and a simulation report is generated.
4) The simulation report is a compressed vector waveform file (.cvwf file), which cannot be processed by MATLAB, so it is saved as a vector table output file (.tbl file) by selecting the 'Save Current Report Section As…' item under the 'File' menu in the Quartus II environment.

The filters achieved rather high precision: apart from some spikes at the border of the error surface, the error is approximately 0 elsewhere. We suppose that the spikes at the border of the error surface arise because no boundary extension is applied in the filter. In order to evaluate the proposed filter quantitatively, the PSNR (peak signal-to-noise ratio) of the FPGA-based Gaussian filtering results was compared with that of the MATLAB-based Gaussian filtering results; the 'Lena' image with different noise patterns added was used as the test image set. Suppose I(m,n) is the pixel value at location (m, n) of the original image and I'(m,n) is the pixel value at location (m, n) of the filtered image; then the PSNR can be computed by equation (5), and the results are listed in Table 3. Here, the columns named 'FPGA' are the results of the FPGA-based filter, and the columns named 'MATLAB' are the results of the MATLAB-based filter. Table 3 shows that the FPGA-based filter is close to the MATLAB-based filter: the difference stays in the range between 0.01 dB and 0.21 dB, which is small enough to be ignored in actual applications.
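For reference, the PSNR comparison can be sketched as follows. The sketch assumes the standard definition for 8-bit images, PSNR = 10·log10(255² / MSE), where MSE is the mean squared difference between I(m,n) and I'(m,n); whether equation (5) takes exactly this form is an assumption. The function name `psnr` and the tiny test images are hypothetical.

```python
import math

def psnr(original, filtered, peak=255.0):
    """PSNR in dB between two equally sized images given as lists of rows.

    Assumption: standard definition with an 8-bit peak value of 255.
    """
    rows, cols = len(original), len(original[0])
    mse = sum((original[m][n] - filtered[m][n]) ** 2
              for m in range(rows) for n in range(cols)) / (rows * cols)
    if mse == 0:
        return math.inf            # identical images
    return 10.0 * math.log10(peak * peak / mse)

# Hypothetical 2x2 images that differ by one gray level at every pixel.
reference = [[10, 20], [30, 40]]
shifted = [[11, 21], [31, 41]]
example_db = psnr(reference, shifted)
print(f"{example_db:.2f} dB")
```

A difference of one gray level at every pixel gives an MSE of 1 and therefore a PSNR of 20·log10(255), about 48.13 dB, which gives a sense of scale for the 0.01 dB to 0.21 dB differences reported above.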
CONCLUSIONS
This paper presents a real-time implementation framework for the multi-scale 2-D Gaussian filter and realizes a 3-scale 2-D Gaussian filter on a single ALTERA Cyclone III FPGA chip. As a result, only one pixel clock period is required to perform a multi-scale 2-D Gaussian filtering, and the filtering can be operated in-place. Experimental results show that the proposed Gaussian filter is suitable for real-time image processing. Moreover, the main principle can be extended to other convolution-based operators, such as the Gabor filter and the Sobel operator.
