This paper presents an efficient video filtering scheme and its implementation in a field-programmable logic device (FPLD). Since the proposed nonlinear, spatiotemporal filtering scheme is based on order statistics, its efficient implementation benefits from a bit-serial realization. The utilization of both the spatial and temporal correlation characteristics of the processed video significantly increases the computational demands on this solution, and thus, implementation becomes a significant challenge. Simulation studies reported in this paper indicate that the proposed pipelined bit-serial FPLD filtering solution can achieve speeds of up to 97.6 Mpixels/s and consumes 1700 to 2700 logic cells for the speed-optimized and area-optimized versions, respectively. Thus, the filter area represents only 6.6 to 10.5% of the Altera STRATIX EP1S25 device available on the Altera Stratix DSP evaluation board, which has been used to implement a prototype of the entire real-time vision system. As such, the proposed adaptive video filtering scheme is both practical and attractive for real-time machine vision and surveillance systems as well as conventional video and multimedia applications.
INTRODUCTION
Computer vision methods are becoming increasingly important for the development of novel commercial devices such as wireless phones, vision-based pocket devices, sensor networks, and surveillance and automotive apparatus [1, 2, 3, 4] . This increases the demand for hardware-based implementations of new, relatively complex video processing algorithms [5] . It is not difficult to see that incorporating recent advances in the fields of computer and machine vision, hardware, software, digital signal/image processing, graphics, and telecommunications into integrated intelligent from human-made sources (switching and interference) to signal representation (bit errors) and natural (atmospheric lightning) ones. The resulting noisy samples, so-called outliers, differ significantly in magnitude from the noise-free signal elements. It has been widely observed that outliers, as well as noise in general, affect the perceptual quality of the image, decreasing not only the appreciation of the image but also the performance of the task for which the image was intended [14] . Therefore, image filtering is of paramount importance [6, 7, 14] .
Nonlinear filters have replaced linear filters in many image processing applications [15, 16, 17, 18, 19] , since (i) they can operate effectively in various noisy conditions and potentially preserve the structural information of the image, (ii) image signals are nonlinear in nature, and (iii) images are perceived via the human visual system which has strong nonlinear characteristics [20] . Among numerous nonlinear filters, the most popular filtering schemes are based on robust order statistics [6] due to excellent robust properties and simplicity of design. The order-statistics-based filters utilize algebraic ordering of a windowed set of data to compute the output signal [7] . They constitute a rich class of nonlinear filtering operators whose comprehensive overview can be found in [15, 16] . The most interesting examples include the well-known median filter [21] , rank-order filters [22] , combination filters [23, 24] , multistage filters [25] , weighted median filters [26, 27] , weighted order-statistic filters [28, 29] , lower-upper-middle (LUM) filters [30, 31] , morphological filters [6, 32] , and stack filters [33, 34, 35, 36] .
In the last decade, technological advances in hardware and software have allowed the extension of nonlinear filtering algorithms to multidimensional image processing. Therefore it is not a surprise that nonlinear filters are successfully used today in color image [7, 14, 37, 38, 39, 40] and/or video processing [9, 41, 42, 43, 44, 45] applications. Since many applications require the processing of signals under real-time constraints, significant research efforts have been dedicated to the hardware implementation and embedding of nonlinear filtering techniques in various circuits [46, 47, 48, 49, 50] .
This paper introduces the field-programmable logic device (FPLD) implementation of the simplified variant [51, 52] of the nonlinear adaptive video filtering (NAVF) technique [53]. Utilizing an adaptive design based on order statistics and incorporating both temporal correlation existing among the frames and spatial correlation of the samples within the frames, the three-dimensional (3D), so-called spatiotemporal scheme under consideration is capable of tracking video nonlinearities. Due to the extra degrees of freedom achieved through the use of three different smoothing levels (as opposed to the conventional fixed smoothing operator), the proposed scheme is capable of tracking varying image and noise statistics. At the same time, the scheme produces an excellent tradeoff between noise attenuation and signal-detail preservation, resulting in reconstructed video with impressive visual quality. Based on the simplified structure of [51, 52], the NAVF complexity has been reduced, without significant loss in performance, to a level useful for practical implementation in FPLDs. The use of the FPLD-based reconfigurable technology makes the proposed filtering system sufficiently adaptable and flexible for different types of camera and video processing applications; modification of the architecture may be required due to changes in the nature/representation of the signal, varying image and noise statistics, and various end-user needs.
The rest of this paper is organized as follows. The video filtering scheme under consideration is briefly described and analyzed in Section 2. Performance comparisons with relevant filtering schemes are provided in terms of commonly used image quality measures. An overview of the prior art in filter implementation is provided. The proposed implementation approach is outlined in Section 3, with motivation and design characteristics discussed in detail. In Section 4, experimental results corresponding to the implementation of the proposed method using the selected Altera FPLD are analyzed in terms of maximum usable clock frequency, hardware resources, and power consumption for both the stand-alone filter and complete 3D filtering solution. Finally, this paper concludes in Section 5.
MOTIVATION AND BACKGROUND

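Consider a $K_1 \times K_2 \times K_3$ image sequence (video), where $K_1 \times K_2$ is the spatial size of each frame and $K_3$ denotes the number of image frames. Each image pixel $x(p, q, t)$, for $p = 1, 2, \ldots, K_1$, $q = 1, 2, \ldots, K_2$, and $t = 1, 2, \ldots, K_3$, is a function of the spatial coordinates $(p, q)$ and time $t$.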
The most common approach to the problem of noise reduction is the utilization of some kind of smoothing operation which filters out random fluctuations due to noise. This approach is based on a sliding window
$$W(p, q, t) = \{x(p + i, q + j, t + l);\ (i, j, l) \in S\}$$
with $S$ denoting the 3D support of the window [9] of finite size $N$. Assuming, for simplicity, that the index $i$, for $i = 1, 2, \ldots, N$, denotes the position of the samples within the filtering window (Figure 1), the data population of $W(p, q, t)$ can be equivalently rewritten as $W = \{x_1, x_2, \ldots, x_N\}$. The filtering procedure replaces the center $x^* = x(p, q, t)$ of the window by some function $f(\cdot)$ of the local neighborhood samples $\{x_1, x_2, \ldots, x_N\}$. Thus, the value of the estimated sample $y(p, q, t) = y = f(W)$ depends on the values of the image samples in its neighborhood $W$. The window operator slides over the image to affect all the image pixels individually. This is based on the assumption that the processes generating the image are stationary within the window and the probability of a particular behavior does not depend on the image coordinates.
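As an illustration, the population of a cubic 3 × 3 × 3 window can be gathered as in the following minimal Python sketch. The nested-list layout `video[t][p][q]`, the zero-based indices, the frame-major sample ordering, and the restriction to interior pixels are assumptions made here for brevity; they are not prescribed by the text.

```python
def window_3x3x3(video, p, q, t):
    """Collect the N = 27 samples of a cubic window centred at (p, q, t).

    Assumed layout: video[t][p][q] with zero-based indices. The caller must
    keep (p, q, t) at least one sample away from every border.
    Samples are listed frame by frame (t-1, t, t+1) and row by row, so the
    window centre x* ends up at list index 13, i.e. position (N + 1) / 2.
    """
    return [
        video[t + dt][p + dp][q + dq]
        for dt in (-1, 0, 1)      # previous, current, and next frame
        for dp in (-1, 0, 1)      # rows of the 3 x 3 spatial neighbourhood
        for dq in (-1, 0, 1)      # columns of the 3 x 3 spatial neighbourhood
    ]


# Tiny synthetic sequence whose values encode their own coordinates.
video = [[[100 * t + 10 * p + q for q in range(3)] for p in range(3)]
         for t in range(3)]
window = window_3x3x3(video, 1, 1, 1)
center = window[13]               # x* = x(p, q, t)
```

With this ordering, the centre sample is always found at the same list position, which keeps the later filtering sketches independent of the image coordinates.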
Algorithm description
Following the robust theory of order statistics [6], most popular filtering algorithms developed to suppress impulsive noise in images operate on the ordered values within the observation window. Using sample ordering, both correlation and time information are ignored and the estimate is constituted purely on the basis of magnitude information [9]. The fact that the noisy samples usually correspond to the extreme ranks in the ordered sequence makes the samples occupying the middle ranks favorable for completing the filtering task. Let $W = \{x_1, x_2, \ldots, x_N\}$ be the set of the input samples within the observation window (Figure 1). Based on magnitude information, the ordering of $x_1, x_2, \ldots, x_N$ results in the ordered set commonly defined as follows [6, 15, 17, 19]:
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(N)} \tag{1}$$
where $x_{(i)} \in W$, for $i = 1, 2, \ldots, N$, represents the $i$th order statistic. Using the smoothing parameter $k = 1, 2, \ldots, (N + 1)/2$, the comparison of the lower and upper order statistics $x_{(k)}$ and $x_{(N-k+1)}$ of (1), respectively, along with the middle sample $x^*$ located at position $(N + 1)/2$ in $W$ forms a lower-upper-middle (LUM) smoothing function [30, 31], defined as follows:
$$y_k = \operatorname{med}\{x_{(k)}, x^*, x_{(N-k+1)}\} \tag{2}$$
where $y_k$ denotes the LUM smoother output and med is a median operator. If $x^*$ lies in the range formed by $x_{(k)}$ and $x_{(N-k+1)}$, it is not modified. If $x^*$ lies outside this range, it is replaced with the one of the two extremes that lies closer to the sample median $x_{((N+1)/2)}$. By varying the filter parameter $k$, the amount of smoothing done by the LUM smoother can range from no smoothing, equivalent to the identity operation ($k = 1$), to the maximum amount of smoothing provided by the median ($k = (N + 1)/2$). It is evident that the LUM smoothing capability increases with $k$. However, with large $k$, the smoothing operation often results in image blurring [30]. Therefore, depending on varying image and noise statistics, the adaptive choice of $k$ is of paramount importance.
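The LUM smoother can be sketched in a few lines of Python. The function name and the list-based window are illustrative choices; the window centre is assumed to sit at position (N + 1)/2 of the list.

```python
def lum_smoother(window, k):
    """Return the LUM smoother output y_k = med{x_(k), x*, x_(N-k+1)}.

    window: odd-length sequence of samples, centre x* at the middle position.
    k:      smoothing parameter, 1 <= k <= (N + 1) / 2.
    """
    n = len(window)
    if n % 2 == 0:
        raise ValueError("window size N must be odd")
    if not 1 <= k <= (n + 1) // 2:
        raise ValueError("k must lie in 1 .. (N + 1) / 2")
    center = window[n // 2]       # x* (zero-based middle index)
    ordered = sorted(window)      # x_(1) <= x_(2) <= ... <= x_(N)
    lower = ordered[k - 1]        # x_(k)
    upper = ordered[n - k]        # x_(N-k+1)
    # The median of three values with lower <= upper equals x* clamped
    # into the interval [lower, upper].
    return min(max(center, lower), upper)


# A 9-sample window whose centre (200) is an outlier.
samples = [10, 12, 11, 13, 200, 12, 11, 10, 13]
identity = lum_smoother(samples, 1)   # k = 1 leaves x* unchanged
median = lum_smoother(samples, 5)     # k = (N + 1) / 2 yields the median
```

Clamping avoids a second sort of three values and makes the two limiting cases (identity and median) easy to see.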
In order to track the changes in local image statistics and provide the best balance between the smoothing and detail-preserving LUM characteristics, the NAVF scheme has been introduced [53]. Its adaptive behavior is achieved by the comparison of the absolute differences $|y_k - x^*|$ with associated thresholds $\xi_k \ge 0$, for $k = 1, 2, \ldots, (N + 1)/2$. Since $k = 1$ denotes the identity filter $y_1 = x^*$ whose filtering operation does not affect the input, and the smoothing capability of the LUM smoother increases with $k$, it is reasonable to say that $|y_k - x^*|$ increases in magnitude as follows:
$$|y_1 - x^*| \le |y_2 - x^*| \le \cdots \le |y_{(N+1)/2} - x^*| \tag{3}$$
where $|y_1 - x^*| = 0$. Following the terms of (3), the associated thresholds should be set according to
$$\xi_1 \le \xi_2 \le \cdots \le \xi_{(N+1)/2} \tag{4}$$
with $\xi_1 = 0$. The zero value of $\xi_1$ corresponds to the use of the identity operator $y_1$ which keeps the central sample $x^*$ unchanged. Based on (3) and (4), the NAVF output $y$ is equivalent to $y_\eta$, with $\eta = \sum_{k=1}^{(N+1)/2} \lambda_k$ defined via the parameters
$$\lambda_k = \begin{cases} 1 & \text{if } |y_k - x^*| \ge \xi_k, \\ 0 & \text{otherwise,} \end{cases} \tag{5}$$
where $\xi_1, \xi_2, \ldots, \xi_{(N+1)/2}$ are the thresholds used to control the accuracy of the NAVF estimates. In the case of the 3 × 3 × 3 spatiotemporal filter window ($N = 27$) shown in Figure 1, the NAVF scheme requires the calculation of $(N + 1)/2 = 14$ different $y_k$'s. For this spatiotemporal processing commonly used in video filtering applications and a conventional 8-bit-per-pixel image representation, the recommended setting of $\xi_1, \xi_2, \ldots, \xi_{(N+1)/2}$ found through a genetic algorithm optimization is defined as follows [53]:
$$\{\xi_1, \xi_2, \ldots, \xi_{14}\} = \{0, 4, 5, 7, 9, 12, 15, 16, 22, 23, 38, 43, 48, 52\}. \tag{6}$$
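The full NAVF decision can be illustrated with the following Python sketch for N = 27. It assumes that λ_k equals 1 whenever |y_k − x*| reaches or exceeds ξ_k (which makes ξ_1 = 0 always select at least the identity operator) and that the threshold list starts with ξ_1 = 0; the function and variable names are hypothetical.

```python
# Thresholds xi_1 .. xi_14 for the 3 x 3 x 3 window (xi_1 = 0 listed first).
XI = (0, 4, 5, 7, 9, 12, 15, 16, 22, 23, 38, 43, 48, 52)


def navf_full(window, thresholds=XI):
    """Sketch of the full NAVF output y = y_eta for an odd-length window.

    Assumption: lambda_k = 1 if |y_k - x*| >= xi_k, else 0, and
    eta is the sum of all lambda_k.
    """
    n = len(window)
    levels = (n + 1) // 2
    if n % 2 == 0 or len(thresholds) != levels:
        raise ValueError("need odd N and (N + 1) / 2 thresholds")
    center = window[n // 2]                 # x*
    ordered = sorted(window)                # order statistics

    def lum(k):
        # y_k = med{x_(k), x*, x_(N-k+1)} via clamping
        return min(max(center, ordered[k - 1]), ordered[n - k])

    eta = sum(
        1 for k in range(1, levels + 1)
        if abs(lum(k) - center) >= thresholds[k - 1]
    )
    return lum(eta)                         # eta >= 1 because xi_1 = 0


impulse = [100] * 13 + [255] + [100] * 13   # isolated outlier at the centre
detail = [100] * 13 + [102] + [100] * 13    # small, legitimate deviation
restored = navf_full(impulse)
preserved = navf_full(detail)
```

In this example the outlier exceeds every threshold and is replaced by the median, whereas the small deviation stays below ξ_2 and passes through unchanged.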
These values are sufficiently robust for a wide range of test videos with various image statistics and/or motion complexity [53]. During optimization of the filter parameters $\xi_1, \xi_2, \ldots, \xi_{14}$, it has been observed that larger threshold values spoil the noise attenuation characteristics of the NAVF scheme and result in unremoved outliers in the filter output. While smaller thresholds improve this situation, the detail-preserving characteristics are, in turn, negatively impacted. It is clear that the NAVF scheme requiring the determination of 14 different $y_k$'s is computationally demanding and cannot be used as a cost-effective solution for real-time video and multimedia applications. Therefore, the NAVF structure controlled by (6) has been reduced as follows [51]:
where $\xi_1$, $\xi_7$, and $\xi_{14}$ are associated with the identity operation $y_1$, the balanced LUM smoothing $y_7 = \operatorname{med}\{x_{(7)}, x^*, x_{(N-7+1)}\}$, and the median (the maximum smoothing) operation $y_{14} = \operatorname{med}\{x_{(14)}, x^*, x_{(N-14+1)}\}$, respectively. The robustness of the setting defined in (7) has been verified, through the use of linear and evolutionary optimization tools, in [52]. The results indicate that the filter is sufficiently robust to relatively large deviations from the conditions assumed during training. In the case of a substantial qualitative difference in terms of noise characteristics, reoptimization of the filter parameters may be recommended. It should be emphasized that both the full (original) and reduced NAVF solutions are primarily geared to address the problem of impulsive noise removal. For such a task, the proposed solution results in excellent performance.
The algorithmic steps performed by the reduced NAVF scheme are summarized, in pseudocode format, in Algorithm 1. The scheme requires in each processing location $(p, q, t)$: (i) to determine the window center $x^*$ and the input set $W = \{x_1, x_2, \ldots, x_N\}$, (ii) to order the input samples according to their magnitude, (iii) to determine the outputs of the two LUM smoothers $y_7$ and $y_{14}$, (iv) to compare the absolute differences $|y_7 - x^*|$ and $|y_{14} - x^*|$ with the corresponding thresholds $\xi_7$ and $\xi_{14}$, respectively, and (v) based on these comparisons, to set one of $x^*$, $y_7$, and $y_{14}$ as the filter output $y$.
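Steps (i) to (v) can be sketched in Python as follows. The order in which the two comparisons are resolved is an assumption of this sketch (the strongest smoother is tested first, then the balanced one, then the identity); the thresholds 15 and 52 are the default values of ξ_7 and ξ_14, and all names are hypothetical.

```python
XI_7, XI_14 = 15, 52      # default thresholds for y_7 and y_14


def lum(window, k):
    """LUM smoother y_k = med{x_(k), x*, x_(N-k+1)}."""
    n = len(window)
    center = window[n // 2]
    ordered = sorted(window)
    return min(max(center, ordered[k - 1]), ordered[n - k])


def navf_reduced(window, xi_7=XI_7, xi_14=XI_14):
    """Sketch of the reduced NAVF for the 3 x 3 x 3 window (N = 27).

    ASSUMED decision order: y_14 is selected when its difference from x*
    reaches xi_14; otherwise y_7 is selected when its difference reaches
    xi_7; otherwise x* is kept.
    """
    if len(window) != 27:
        raise ValueError("this sketch is written for N = 27")
    center = window[13]                   # (i) window centre x*
    y_7 = lum(window, 7)                  # (ii)-(iii) balanced smoothing
    y_14 = lum(window, 14)                # (ii)-(iii) median
    if abs(y_14 - center) >= xi_14:       # (iv)-(v) strong outlier
        return y_14
    if abs(y_7 - center) >= xi_7:         # (iv)-(v) moderate deviation
        return y_7
    return center                         # (v) identity operation


impulse = [100] * 13 + [255] + [100] * 13
detail = [100] * 13 + [105] + [100] * 13
restored = navf_reduced(impulse)          # outlier replaced by the median
preserved = navf_reduced(detail)          # small deviation left untouched
```

Only two order statistics pairs are needed here, which is the source of the complexity reduction compared with the 14 smoothers of the full scheme.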
Filtering efficiency
Experimentation with a number of test videos corrupted by impulsive noise [15] showed that the reduced NAVF scheme of (7) is sufficiently robust and operates without significant loss in performance [52]. This is demonstrated here using test image sequences consisting of 30 frames with an 8-bit-per-sample representation and 256 × 256 spatial resolution. For better illustration, Figures 2a, 2b, and 2c depict the 5th frame of the test videos. An example of a noisy frame is shown in Figure 2d. This image corresponds to a video frame contaminated by 10% random-valued impulsive noise [15, 53], with the rate denoting the fraction of corrupted pixels and the noise magnitude independent from pixel to pixel.
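For readers who wish to reproduce a comparable test condition, random-valued impulsive noise can be simulated as in the sketch below. The exact noise model of [15, 53] is not restated here, so the sketch assumes that each sample is replaced, with probability equal to the noise rate, by a value drawn uniformly from the 8-bit range; the function name and the flat list representation of a frame are illustrative.

```python
import random


def add_impulsive_noise(frame, rate, rng, levels=256):
    """Return a copy of `frame` corrupted by random-valued impulses.

    frame:  flat sequence of integer samples in 0 .. levels - 1
    rate:   probability that a sample is hit by an impulse (e.g. 0.10)
    rng:    random.Random instance, passed in for reproducibility
    Assumed model: a hit sample is replaced by a uniformly distributed
    value, independently from pixel to pixel.
    """
    noisy = []
    for value in frame:
        if rng.random() < rate:
            noisy.append(rng.randrange(levels))   # impulse
        else:
            noisy.append(value)                   # sample left intact
    return noisy


# One 256-sample row of a flat grey frame corrupted at a 10% rate.
row = add_impulsive_noise([128] * 256, rate=0.10, rng=random.Random(0))
```

Passing the generator explicitly keeps experiments repeatable across filters.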
The method is applied to the test videos degraded by 5% and 10% noise, and performance is measured via the mean absolute error (MAE) and mean square error (MSE) measures commonly used in the image processing community. Using these error criteria, the reduced NAVF scheme [51, 52] is compared to other video filtering techniques, including the previously mentioned full NAVF scheme [53], median filter (MF) [6], standard LUM smoothers [30], and multistage median filters (MMFs) [41] as well as some spatiotemporal switching median filters with the switching control based on the averaging operations defined over the middle-ranked samples (ICM) [54], local contrast probability (LCP) [55], center-weighted median switching filter (CWMSF) [56], variance of the input set (VSMF) [57], and advanced multilevel processing (ASM) [58]. Tables 1 and 2 summarize the objective, numerical results corresponding to the test videos shown in Figures 2a, 2b, and 2c. The results indicate that the full NAVF scheme achieves the best performance in terms of MAE and MSE among all the tested filtering schemes. Moreover, it can be seen that the reduced NAVF scheme also produces excellent results, although its filtering structure has been simplified from 14 to 3 smoothing levels $y_k$ compared to the full NAVF scheme. Therefore, it can be concluded that the reduced NAVF scheme is useful for video filtering purposes and, because of its simplicity, is also suitable for cost-effective applications.
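The two error criteria can be computed as in the following sketch, which uses their standard definitions (the mean of the absolute differences and the mean of the squared differences between the original and the filtered samples). Frames are assumed to be flattened into equally long sequences; the function names are illustrative.

```python
def mae(original, filtered):
    """Mean absolute error between two equally long sample sequences."""
    if len(original) != len(filtered) or not original:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(o - f) for o, f in zip(original, filtered)) / len(original)


def mse(original, filtered):
    """Mean square error between two equally long sample sequences."""
    if len(original) != len(filtered) or not original:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((o - f) ** 2 for o, f in zip(original, filtered)) / len(original)


reference = [0, 10, 20]
estimate = [0, 12, 16]
error_abs = mae(reference, estimate)   # (0 + 2 + 4) / 3
error_sq = mse(reference, estimate)    # (0 + 4 + 16) / 3
```

MSE penalizes large residual impulses more strongly than MAE, which is why both values are usually reported together.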
Figures 2c, 2d, 2e, and 2f show the visual comparison of the original frame, the contaminated frame, and the filtered outputs produced using the MF technique and the reduced NAVF scheme. Figure 2e illustrates that the MF scheme blurs image edges, structural content, and fine details. On the other hand, the reduced NAVF scheme preserves the image details and removes outliers (Figure 2f). Due to this impressive performance, the reduced NAVF produces a denoised image similar to the original depicted in Figure 2c. Figures 2g and 2h show the estimation errors of the standard MF scheme compared to the reduced NAVF scheme. It can be seen that the MF approach is characterized by large estimation error, which corresponds to edge blurring and destruction of fine details (Figure 2g). The reduced NAVF scheme tends to avoid the blurring of structural content and excellently preserves the desired signal features. This results in very small estimation error, as depicted in Figure 2h.
Filter implementation: prior art
Apart from the numerical behavior (actual performance) of any proposed algorithm, its computational complexity is a realistic measure of its practicality and usefulness. The (full and reduced) NAVF filtering schemes belong to the class of filters based on order statistics. To determine the output based on their rank within a group of inputs, various techniques have been proposed for implementing these kinds of filters [46, 47, 48, 49, 50] . Based on the amount of information processed concurrently, implementation approaches can be classified into two main groups [47] : word-based and bit-based techniques. Word-based architectures (or bit-parallel architectures) process the bits of the input samples in parallel, but the samples are usually processed sequentially. On the contrary, bit-based filters (or bit-serial architectures) process input samples bitwise, but the samples included in the window are processed in parallel. In contrast to bit-parallel algorithms, the bit-serial algorithms often enable the creation of efficient pipelined structures. Another kind of classification recognizes nonrecursive and recursive algorithms [50] . Since recursive algorithms use the same piece of hardware in an iterative manner, they are usually more area-efficient, but slower. Because of the existing loop, the pipelining cannot be applied. On the other hand, nonrecursive algorithms enable speeding up the filtering process via pipelining techniques and block processing [59] .
The architectures of the rank-order-based filters can be divided into three main categories [50]: array architectures, sorting network architectures, and stack filter-based architectures. In array architectures [50], each element in the window is associated with a rank, and with each window shift, the ranks of the elements are updated. The array architectures with window size N consist of a semisystolic linear array of N processors. These architectures are suited for both bit-parallel and bit-serial implementations. Furthermore, they can be easily pipelined, thereby supporting high-throughput applications. However, this kind of architecture is not suitable for large processing windows such as those used in the spatiotemporal (3D) video filters considered in this paper.
The sorting network architectures implement the rank-order filter by first sorting the samples and then selecting the sample of corresponding rank [46, 50, 59, 60, 61, 62]. The filtering can be faster if the sorting of samples from the previous position of the sliding window is maintained and only the incoming sample is positioned to the correct rank. Sorting network architectures with presorted values can be relatively efficient for one-dimensional (1D) filters, where only one sample has to be classified in each sample period. However, these architectures are not suitable for 3D filters, since multiple samples (in our case, 9 samples for a cubic 3 × 3 × 3 window) arrive at each new sample period.
Probably the most efficient implementation approach related to the use of rank-order-based filters for image processing applications is based on stack filters. A stack filter translates the filtering operation to the binary domain through the use of threshold decomposition [49, 63]. The bit-parallel realization of the stack filter decomposes the input sample to $2^B - 1$ bit levels [49, 50], where $B$ is the sample word length. Each level is processed separately. It is clear that if $B$ is high, the bit-parallel architecture of the stack filter is not suitable for a cost-effective application, since the number of processing levels depends on $B$ exponentially. In the bit-serial version of the stack filter [50, 64, 65, 66], the input samples are processed bitwise in only $B$ bit levels using (i) a majority function [67, 68, 69], (ii) a positive Boolean function (PBF) [64, 70], or (iii) a polarizing function [71, 72]. Since the area of the bit-serial stack filters depends linearly on the number of bit levels, these stack filters usually permit the most area-efficient implementations [50, 65].
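Threshold decomposition can be illustrated with the median filter, whose binary-domain counterpart is the majority function. The sketch below is a plain software model of the principle and not a description of any of the cited architectures; the function name is hypothetical.

```python
def stack_median(samples, bits):
    """Median of B-bit samples computed by threshold decomposition.

    The input is decomposed into 2^B - 1 binary signals (one per threshold
    level), a majority function is applied at every level, and the binary
    outputs are summed (stacked) to recover the multilevel output.
    """
    n = len(samples)
    if n % 2 == 0:
        raise ValueError("window size N must be odd")
    output = 0
    for level in range(1, 2 ** bits):              # 2^B - 1 threshold levels
        ones = sum(1 for x in samples if x >= level)
        if ones >= (n + 1) // 2:                   # binary median = majority
            output += 1
    return output


values = [3, 7, 1, 5, 6]                           # 3-bit samples
result = stack_median(values, bits=3)
reference = sorted(values)[len(values) // 2]       # conventional median
```

The loop over $2^B - 1$ levels makes the exponential cost of the bit-parallel realization explicit: 255 levels are needed for 8-bit samples, whereas a bit-serial realization needs only 8.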
While there are several implementations of rank-order-based filters in field-programmable gate arrays (FPGAs) published in the literature [62, 73, 74], we did not find any FPGA implementation of 3D rank-order-based filters suitable for video processing applications. In fact, due to the significant growth in time delays and hardware requirements in spatiotemporal video filtering such as the considered reduced NAVF scheme, very few algorithms are suitable for hardware implementation. Based on the aforementioned facts, we have selected the nonrecursive, bit-serial stack architecture based on a PBF as the best candidate. Besides its area-efficient implementation, it enables the use of a pipelined structure and thus an increase in the speed of the filtering process. The proposed hardware structure is presented in the next section.
PROPOSED HARDWARE IMPLEMENTATION
The FPLD target technology is selected in this paper to improve adaptability and flexibility of the filtering system for different types of cameras and video processing applications. The scalability of the filter together with the reconfigurable technology used for filter implementation should enable easy modification of the proposed architecture for video signals differing in parameters such as sample frequency and frame size, as well as the number of bits used for sample representation. Since not all filter implementations are directly exploitable in FPLD technology, the utilization of FPLD devices in video signal filtering sometimes needs a special approach. Our goal is to (i) propose a cost-effective and flexible solution using FPLD devices, and (ii) ensure its suitability for real-time spatiotemporal (3D) video filtering.
Bit-serial structure implementation
The bit-serial approach of [64] reduces the filtering procedure to binary calculations and simplifies the smoothing operators to become PBFs. In this way, the designer avoids implementing time-consuming ordering operations, which make the filtering algorithm significantly slower and difficult for realization especially for large window shapes such as the employed 3×3×3 spatiotemporal window shown in Figure 1 .
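Algorithm 2 summarizes the steps performed using the bit-serial realization of the PBF-based LUM smoother. Note that the LUM smoother is required to process the set of $N$ image samples coded with $B$ bits per sample. Each input sample $x_i \in W = \{x_1, x_2, \ldots, x_N\}$ is expressed in binary form as a $B$-bit word, and the smoother processes the samples one bit level $j$ at a time. The most important part of the procedure summarized in Algorithm 2 corresponds to the LUM-PBF expression in [75]. This simplifies the LUM smoother defined by the smoothing parameter $k$ and the window size $N$ into the PBF defined as follows:
$$f_{k,N}(x^*_j, W^*_j) = \begin{cases} 1 & \text{if } x^*_j = 1 \text{ and } \sum W^*_j \ge k - 1, \\ 1 & \text{if } x^*_j = 0 \text{ and } \sum W^*_j \ge N - k + 1, \\ 0 & \text{otherwise,} \end{cases} \tag{8}$$
where $x^*_j$ is the $j$th bit of the central sample and $\sum W^*_j$ denotes the number of 1's in the set $W^*_j$. It has been proven [75] that the output bit of the LUM smoother at the $j$th bit level can be simply determined using the $j$th bit of the central input sample $x^*$ and the 1's in the set $W^*_j$ of neighboring binary samples associated with the $j$th bit. This results in the fast, binary LUM smoothing defined in (8).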
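The binary LUM decision can be checked with a short Python sketch. It derives the two comparison values directly from the binary form of the LUM smoother: with s ones among the N − 1 neighbours, the output bit is 1 when the central bit is 1 and s ≥ k − 1, or when the central bit is 0 and s ≥ N − k + 1. The function names are hypothetical, and the reference routine exists only to confirm the equivalence.

```python
def lum_pbf(center_bit, neighbor_bits, k):
    """Binary LUM smoother expressed as an adder followed by a comparator.

    center_bit:    bit of the central sample x* (0 or 1)
    neighbor_bits: the N - 1 remaining bits of the window
    k:             LUM smoothing parameter
    """
    n = len(neighbor_bits) + 1
    ones = sum(neighbor_bits)                        # adder: count the 1's
    threshold = k - 1 if center_bit else n - k + 1   # depends on central bit
    return 1 if ones >= threshold else 0             # comparator


def binary_lum_reference(bits, k):
    """Direct evaluation of med{x_(k), x*, x_(N-k+1)} on binary data."""
    n = len(bits)
    ordered = sorted(bits)
    return sorted((ordered[k - 1], bits[n // 2], ordered[n - k]))[1]


# Exhaustive comparison for a small window (N = 5) and every valid k.
mismatches = 0
for k in (1, 2, 3):
    for pattern in range(2 ** 5):
        bits = [(pattern >> i) & 1 for i in range(5)]
        neighbours = bits[:2] + bits[3:]
        if lum_pbf(bits[2], neighbours, k) != binary_lum_reference(bits, k):
            mismatches += 1
```

Because only a population count and one comparison are involved, no sorting is needed at the bit level.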
For illustrative purposes, Table 3 summarizes the computational steps of the bit-serial LUM-PBF approach. We consider the window size N = 9, the smoothing parameter k = 4, the word length B = 8, and the input set Table 4 .
The hardware implementation of the conventional LUM smoothing algorithm consists of two basic types of blocks [76]: (i) $N \times B$ LUM propagation cells, and (ii) $B$ combinatorial blocks implementing the LUM-PBF defined by (8). The combinatorial block representing the PBF implementation according to (8) is composed of the adder and the comparator (see Figure 4). The adder counts the number of 1's (high bits) present at the 26 binary inputs (included in $W^*_j$) of the block. The comparator produces the output bit of one binary LUM smoother by comparing the result of the addition with the value $k - 1$ when the central bit is 1, or with $N - k + 1$ when it is 0. New samples enter the structure through the data inputs (samples $x_1$, $x_2$, $x_3$, $x_{10}$, $x_{11}$, $x_{12}$, $x_{19}$, $x_{20}$, and $x_{21}$ from Figure 1). Vertical propagation flags and propagated data bits coming from upper levels are updated and transmitted to lower levels. The PBF has 26 equivalent inputs and one special input for the central sample. The output of the PBF represents the bit-level filter output.
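A software model may help to clarify how the propagation cells and the PBF blocks cooperate. The sketch below follows the usual bit-serial principle for PBF-based filters: bit levels are processed from the most significant bit downwards, and once a sample's bit differs from the output bit, that bit is propagated to all lower levels. It is a behavioural model only; the flag representation and all names are assumptions and do not reproduce the cell structure of the figures.

```python
def lum_pbf(center_bit, neighbor_bits, k):
    """Adder plus comparator implementing the binary LUM smoother."""
    n = len(neighbor_bits) + 1
    threshold = k - 1 if center_bit else n - k + 1
    return 1 if sum(neighbor_bits) >= threshold else 0


def bit_serial_lum(window, k, bits=8):
    """Behavioural model of a bit-serial LUM smoother for B-bit samples."""
    n = len(window)
    c = n // 2                              # position of the central sample
    propagated = [None] * n                 # None: no propagation flag set
    output = 0
    for j in range(bits - 1, -1, -1):       # most significant bit first
        level = [
            propagated[i] if propagated[i] is not None
            else (window[i] >> j) & 1
            for i in range(n)
        ]
        y = lum_pbf(level[c], level[:c] + level[c + 1:], k)
        output = (output << 1) | y          # append the output bit
        for i in range(n):
            # A sample whose bit differs from the output bit is already
            # known to be larger or smaller than the output, so its bit
            # is frozen and passed on to the lower levels.
            if propagated[i] is None and level[i] != y:
                propagated[i] = level[i]
    return output


window = [3, 7, 1, 5, 6]                    # centre sample is 1
median = bit_serial_lum(window, k=3, bits=3)
lum_2 = bit_serial_lum(window, k=2, bits=3)
```

For this window the model returns 5 for k = 3 (the median) and 3 for k = 2, the same values as the sort-based definition, while using only B evaluations of the PBF.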
Parallel and pipelined filter structure
In the parallel version of the LUM filter, all the bit levels of the input set are processed in parallel. However, one of the main advantages of the bit-serial structure is that the bit levels can be processed independently and thus faster. This feature can be successfully used in the pipelined version of the filter, in which each bit level of the filter processes the corresponding bit of one of B subsequent samples. The differences between these two implementations of the filter will now be discussed.
The complete LUM filter with parallel structure contains B identical levels. The critical data path (the longest data path between any two registers, which determines the maximum clock frequency of the filter) starts at the highest bit-level cells and passes horizontally through the PBF at the same level, coming back to the input of the XNOR gate of the bit cell. It continues vertically to the next lower level and so on until the lowest level of the filter. Thus, the parallel version of the filter has B propagation cells and B PBFs in the data path.
Figure 6 shows the pipelined version of the LUM filter propagation cell. Comparing Figures 3 and 6, it can be observed that the standard structure shown in Figure 3 has been extended with two pipeline registers. Due to the bit-serial nature of the algorithm, the bit-level pipelining allows concurrent processing of B subsequent bits corresponding to different bit levels using a set of B identical bit-slices from Figure 5. Since the bits of the input sample are not processed concurrently, they have to be delayed in input/output delay lines composed of triangles of shift registers (Figure 7). The critical data path of the pipelined version includes only one propagation cell and one PBF at the same level. It is therefore up to B times faster than the standard parallel version. However, the pipelined version is larger, because each propagation cell has two additional registers, and it introduces a latency of B + 1 clock periods.
Because of its higher speed, the pipelined scheme is much more appropriate for real-time video applications. Therefore, the pipelined version is used throughout this paper. Figure 8 presents the complete structure of the pipelined reduced NAVF scheme. Two pipelined B-level LUM smoothers (for k = 7 and k = 14) process B levels of input samples concurrently. Since nine new samples appear at the input of the proposed filter for each updated location (p, q, t), nine input shift register blocks are necessary. The use of two LUM filters necessitates the implementation of two output shift register blocks. Because the central sample is used for computing the output, it has to be delayed by B + 1 clock periods in a B-bit shift register. This delay corresponds to the sum of delays of input and output shift registers. The complete reduced NAVF scheme has a latency of B + 3 clock periods, because the subtraction block and the comparators each contain one pipeline register, too.
Area/cost reduction
The basic FPLD logic cell (bit slice) usually contains one 16-bit lookup table (LUT) followed by a configurable register. We could expect that the propagation cell from Figure 6 will be implemented in three such logic cells. However, the combinatorial function at the output of the multiplexer represents the pipeline register input and at the same time represents the propagation cell output. Therefore, the propagation cell will occupy four logic cells instead of three: one cell where only a register is employed (data bit register), two cells where both LUTs and registers (pipeline registers) are utilized, and one cell with only a LUT employed (output to PBF). Some FPLD technologies enable the output of both the combinatorial function and its registered version from the same logic cell (register packing). This option would lead to a filter-area reduction of up to 20%. The fact that the reduced NAVF filter from Figure 8 contains two similar LUM smoother structures can be used to further reduce the filter area through efficient resource sharing.
This possibility is based on the observation that the most important part of the PBF area is occupied by the adder from Figure 4. As a consequence, the PBF area changes only slightly with the parameter k (about 53 ± 1 logic cells per PBF). For the same reason, the double PBF block for the two parameters $k_1$ and $k_2$, based on the sharing of the adder by two comparators from Figure 9, increases the size of the block by an insignificant amount (about 55 ± 1 logic cells per double PBF). The price to be paid for this efficient resource sharing is a reduction in speed by a factor of 2, as the binary outputs corresponding to $k_1$ and $k_2$ must be multiplexed in time.
Unfortunately, the sharing of the propagation cell is not as successful as it is in the case of the PBF. Since the propagated values of the shared smoothers are not the same, two pairs of pipeline registers will be necessary (see Figure 10) to propagate the bit and flag for both of them. Two new multiplexers on the vertical outputs of the propagation cell are also added. However, the data register on the input of the propagation cell, the multiplexer, and the XNOR and AND gates remain sharable. Therefore, we can expect that the size of the double propagation cell will grow from three to at least five logic cells. The clock-enable signal employed in the double propagation cell does not influence the overall cell size, because it is a standard signal available in all logic cells in the FPLD device.
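The cell counts quoted above allow a rough, first-order estimate of the smoother core size. The sketch below multiplies out only the N × B propagation cells and the B PBF blocks per smoother; it deliberately ignores the input/output shift registers, the subtraction block, the comparators, and the control logic, and it treats the "at least five" logic cells of the double propagation cell as exactly five. The function name and the choice of four logic cells for the unshared propagation cell (no register packing) are assumptions of this sketch.

```python
def smoother_core_cells(n, bits, cell_size, pbf_size, copies=1):
    """First-order logic-cell count of `copies` LUM smoother cores.

    Each core contains n * bits propagation cells of `cell_size` logic
    cells and `bits` PBF blocks of `pbf_size` logic cells.
    """
    return copies * (n * bits * cell_size + bits * pbf_size)


N, B = 27, 8    # 3 x 3 x 3 window, 8-bit samples

# Two independent smoothers (k = 7 and k = 14), no resource sharing.
separate = smoother_core_cells(N, B, cell_size=4, pbf_size=53, copies=2)

# One shared structure built from double propagation cells and double PBFs.
shared = smoother_core_cells(N, B, cell_size=5, pbf_size=55)

saving = 1 - shared / separate   # fraction of core cells saved by sharing
```

Under these assumptions, the propagation cells dominate both totals, so the area benefit of sharing comes mainly from replacing two four-cell propagation cells with one five-cell double cell rather than from the shared adder.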
Implementation issues of a complete video processing system
We have used the Stratix EP1S25 DSP board from Altera [77] to verify the overall area and performance of a complete video system based on the proposed filter. Besides the Stratix EP1S25 FPLD in the fastest speed grade, the board features other components which have been used in our design: one of two 14-bit 165 MHz D/A converters, two synchronous 1 MB SRAM blocks with independent data and control buses and a common address bus, an expansion prototype connector (EPC), an oscillator, and so forth. We have placed a video amplifier and a double 12-bit A/D converter with a maximum rate of 50 megasamples per second on a small expansion board connected to the EPC, because the two A/D converters (ADCs) available on the DSP board were not suitable for the video signal conversion. All other hardware blocks were implemented in the STRATIX device (see Figure 11): the detector of synchronization impulses (DSI), 12-bit-to-8-bit data width converter (DWC), input line buffer (ILB), three triple line buffers (TLBs) containing three line shift registers (LSRs) each, two data bus demultiplexers, output line buffer (OLB), control unit, and PLL-based clock generator.
Buffers ILB, OLB, and the first LSR of each TLB block use the dual-port feature of the embedded memory blocks. They can therefore have independent read and write frequencies. The control frequency of the external memory blocks can thus be much higher than the input/output filter speed. Since the memory data bus is four times wider than the input/output data stream, all the memory accesses can be realized during the inactive portion of the line. In fact, this interval is divided into two halves: in the first half, input data from the camera and output data from the reduced NAVF filter are written simultaneously to two external memory blocks; in the second half, two lines of the two previous images are read into the TLB buffers. Since each line buffer can store up to 1024 pixels, any camera having up to 60 images of 1024×1024 pixels can be connected to the system. Please note that the detailed description of the system exceeds the scope of this paper. The reader can find more information in [78].
RESULTS
The basic structures and blocks utilized in the reduced NAVF have been described in very high-speed integrated circuit hardware description language (VHDL). Filter components have been synthesized using the Altera Quartus II v. 4.1 VHDL compiler. We have chosen the Quartus II development system to implement the complete filtering scheme because it enables good control over compilation, placement, and routing parameters, namely, register packing and shift-register implementation in embedded memory. The VHDL output generated by the fitter was used as simulation input for the ModelSim v. 5.8c VHDL simulator. Output values have been compared with Matlab-generated test values in an automatic test bench procedure.
The reduced NAVF scheme has been mapped into the Altera STRATIX EP1S25 device. This device is also used in the Altera STRATIX EP1S25 DSP development board, which has been used to implement the whole video system including the filter. We have used the power calculator for Stratix devices, version 3.0 from Altera, to estimate the typical power consumption of various filter versions for a toggle rate of 12.5%. We have selected a typical power estimation instead of a worst-case one, because the filter structure occupies only a small part of the device (up to 13%) and the standby power consumption is higher (typical standby power consumption of the device is about 135 mW) or much higher (worst-case standby power consumption is about 450 mW) than that of the filter. The precision of the standby power consumption estimate thus has a dominant influence on the overall precision of the method.
Most of the blocks are realized as parameterized modules. Top-level parameters include the window size corresponding to the spatiotemporal filtering window (default value N = 27), the word-length B (8 bits by default), smoothing parameters (k = 7 and k = 14 by default), and the associated thresholds (ξ 7 = 15 and ξ 14 = 52 by default).
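To make the role of these parameters concrete, the following Python sketch gives a software reference model of the kind that could generate test values for a test bench. It is a minimal sketch, not the authors' implementation. The LUM smoother is taken in its usual form (the median of the k-th smallest sample, the central sample, and the k-th largest sample). The switching rule, namely replacing the central sample by the window median as soon as one of the differences reaches its threshold, and the use of "greater than or equal" in the comparison are assumptions made for illustration; the function names are hypothetical.

```python
def lum_smoother(window, center, k):
    """LUM smoother output: median of x_(k), the central sample and x_(N-k+1)."""
    ordered = sorted(window)
    n = len(ordered)
    lower = ordered[k - 1]   # k-th smallest sample, x_(k)
    upper = ordered[n - k]   # k-th largest sample, x_(N-k+1)
    return sorted((lower, center, upper))[1]

def reduced_navf(window, center, ks=(7, 14), thresholds=(15, 52)):
    """Assumed switching rule of the reduced scheme (illustration only).

    If |center - y_k| reaches the threshold for any smoothing level k,
    the central sample is treated as an outlier and replaced by the
    window median; otherwise it is passed through unchanged.
    """
    n = len(window)
    for k, xi in zip(ks, thresholds):
        if abs(center - lum_smoother(window, center, k)) >= xi:
            return lum_smoother(window, center, (n + 1) // 2)
    return center

if __name__ == "__main__":
    flat = [100] * 27
    flat[13] = 255                          # impulse at the window centre
    print(reduced_navf(flat, flat[13]))     # impulse is replaced
    ramp = list(range(100, 127))
    print(reduced_navf(ramp, ramp[13]))     # noise-free sample is kept
```

With the default values N = 27, k = 7 and k = 14, the second smoother coincides with the median of the window, since (N + 1)/2 = 14.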
LUM-PBF function implementation results
Since the LUM-PBF function represents a relatively complex N-input combinatorial function and is included in the critical path of the smoother, it was important to analyze the impact of the smoothing parameter k on the overall complexity and speed of the PBF function realization. Fortunately, the function size and speed change insignificantly with k, and both are dominated by the adder block present at the function entry, as explained in the previous section. As can be seen in Table 5, the LUM-PBF function f_{7,27}(·) occupies 54 logic cells (note that the carry chain must be enabled). Table 5 also lists the input/output point-to-point delay (t_pd) of the LUM-PBF implementation in the selected STRATIX [79] device. We can conclude that the low number of logic cells and the small point-to-point delays of the LUM-PBF function demonstrate the cost efficiency of (8). This is a very important fact, because these parameters determine, to a great extent, the size and especially the clock frequency of the reduced NAVF scheme.
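The structure of such a function can be illustrated in software. The Python sketch below is an assumption-laden illustration rather than a transcription of (8): it derives a positive Boolean function directly from the usual LUM smoother definition, in which the binary output is 1 when at least N − k + 1 inputs are 1, or when the central bit is 1 and at least k inputs are 1. The population count plays the role of the adder at the function entry. The function names are hypothetical, and the threshold decomposition used for the check is a textbook device, not the bit-serial architecture of the paper.

```python
def lum_pbf(bits, center_bit, k):
    """Binary LUM function built on a population count (the entry adder)."""
    n = len(bits)
    ones = sum(bits)
    return int(bool(ones >= n - k + 1 or (center_bit and ones >= k)))

def lum_by_threshold_decomposition(window, center, k, levels=256):
    """Rebuild the multilevel output by summing the binary outputs per level."""
    return sum(
        lum_pbf([int(x >= t) for x in window], int(center >= t), k)
        for t in range(1, levels)
    )

def lum_direct(window, center, k):
    """Multilevel LUM smoother: median of x_(k), the centre and x_(N-k+1)."""
    ordered = sorted(window)
    return sorted((ordered[k - 1], center, ordered[len(ordered) - k]))[1]

if __name__ == "__main__":
    window = [(i * 37) % 256 for i in range(27)]  # arbitrary 8-bit test data
    center = window[13]
    for k in (1, 7, 14):
        same = lum_by_threshold_decomposition(window, center, k) == lum_direct(window, center, k)
        print(k, same)
```

Because only the two comparison constants depend on k, a change of k leaves the adder untouched, which is consistent with the observation that size and speed vary little with the smoothing parameter.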
LUM smoother implementation results
Given N = 27, B = 8, and k = 7, the first two lines of Table 6 allow a quantitative comparison of the standard and pipelined LUM in terms of logic cell count, static timing analysis frequency, and estimated power dissipation. In both of these mappings, register packing was not employed. The pipelined realization achieves a sevenfold increase in processing speed at the cost of a 50% increase in area compared to the parallel version. The higher count of logic cells in the pipelined version is caused by the use of the shift registers, namely, ten input and one output register blocks of [B · (B + 1)/2] registers.
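The register count per block follows directly from the expression B · (B + 1)/2. The short Python sketch below evaluates it; the function names are hypothetical, and the mapping of one register to one logic cell (no register packing) is an assumption.

```python
def registers_per_block(word_length):
    """Registers in one shift register block: B * (B + 1) / 2."""
    return word_length * (word_length + 1) // 2

def shift_register_cells(word_length, blocks):
    """Logic cells used by the blocks, assuming one register per logic cell."""
    return blocks * registers_per_block(word_length)

if __name__ == "__main__":
    print(registers_per_block(8))        # registers in one block for B = 8
    print(shift_register_cells(8, 10))   # cells for ten such blocks
```

For B = 8 each block holds 36 registers, so the figure of 360 logic cells quoted in this section for the input/output shift registers corresponds to ten such blocks.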
The LUM cells matrix area remains very similar in both standard and pipelined filter versions (for 8 bit levels and 27 input elements, the LUM area is realized by less than 27 × 8 × 4 = 864 logic cells). Since 8 PBFs occupy approximately 400 LCs, the total area estimation of the LUM smoother without input/output shift registers (1048) is close to the values obtained for both implementations (subtracting the 360 logic cells necessary for the implementation of the input/output shift registers in the pipelined version).
However, the size of the pipelined version can be significantly reduced using register packing, as shown in the third line of the table. The small reduction in speed is insignificant. Additional area reduction can be obtained by implementing the shift registers in the embedded memory. Although Xilinx devices allow the implementation of up to 16-element shift registers in one LUT, this is not available in Altera devices. Some limited functionality (concerning the minimal length of the chain) exists for the implementation of small shift registers in the M4K or M512 embedded memory blocks available in the STRATIX family [77, 79] . The fourth line of Table 6 presents results obtained using this method of shift register implementation. Using the aforementioned techniques, the pipelined LUM smoother's size becomes comparable to the size of its parallel version, with the pipelined smoother being about six times faster.
It can be observed in Table 6 that the parallel structure (PRS) has the smallest power consumption (note that the typical standby power consumption is 135 mW). However, if we reduce the clock frequency of the pipelined version of the smoother with register packing and embedded shift registers (PPS + RP + ESR) to 14.3 MHz, we obtain a very similar result (145 mW instead of 142 mW).
Reduced NAVF scheme implementation results
Table 7 allows for the comparison of the proposed method with the efficient adaptive switching ASM video filtering solution. It can be seen that the proposed architecture consumes significantly fewer hardware resources than the ASM scheme. This advantage is obtained using the unique binary operations required in the LUM smoothers, which are effectively utilized in the proposed reduced NAVF structure from Figure 8. The second line of the table presents the results obtained for the pipelined version of the reduced NAVF scheme without register packing and without implementation of the shift registers in the embedded memory. The overall LC count is, in this case, slightly lower than that of two pipelined LUM smoothers from the second line of Table 6. This is because nine input shift register fields can be shared by the LUM smoothers. Absolute value computation and comparison blocks are realized using the standard Library of Parameterized Modules (functions lpm_abs and lpm_compare). Since the outputs of these modules are registered, they do not influence the final filter speed. The speed is therefore limited mostly by the PBF implementation and can be as high as 97.6 Mpixels per second. Because the obtained speeds are much higher than required in common video applications, we have concentrated our effort on the reduction of the filter area. As can be observed in the third line of Table 7, register packing remains an efficient method for LUM smoother size reduction (about 20%) while preserving the speed of the filter.
The use of embedded shift registers in the STRATIX family can further reduce the logic area (see the fourth line of Table 7). Another significant reduction in area proposed in this paper (about 16%) can be obtained using the LUM function sharing in a double LUM smoother structure described in the previous subsection. However, the LUM smoother sharing slightly increases the complexity of the PBF function and thus decreases the clock speed. An even more important fact is that the use of double structures necessitates time multiplexing. The overall speed of the reduced NAVF structure based on the shared LUM smoother is thus two times lower (as indicated by the parentheses in the fifth and sixth lines of Table 7). While this reduced filtering speed is still higher than the speed of most conventional video cameras, the proposed reduced NAVF scheme with LUM smoother sharing is the most area- and energy-efficient.
Complete video filtering system implementation results
The proposed reduced NAVF scheme needs 9 pixels to be available at the filter input at each video sample period. The set of input/output line buffers and the data bus-width converter, together with the control logic described in Section 3.4, were implemented in the same reconfigurable device. As can be seen in Table 8, this additional logic occupies few logic cells (about 2% of the cells available in the selected device) and a small amount of RAM bits (about 4% of all available bits). The frequency of 123.1 MHz given in the table specifies the maximum clock frequency of the 32-bit data bus. Thanks to the use of the true dual-port embedded memory blocks, this frequency can be independent of the video signal sampling rate. This memory access frequency is high enough to store incoming lines and to read two lines of the past images from the external memories during the inactive portion of the video line. The speed of the complete filtering solution is thus limited only by the reduced NAVF scheme used. Table 8 also presents the area, speed, and power dissipation estimates of the complete systems using the fastest version of the filter (the reduced NAVF filter) and the most economical one (the reduced NAVF filter with LUM sharing, register packing, and embedded shift registers). As can be seen, the unused part of the device (about 90%) is still large enough to provide the resources necessary for implementing additional image processing functions, such as compression, analysis, and others utilized in computer vision.
When considering the speed of the system, two parameters have to be taken into account [47]: the delay T_d (sometimes called the latency) is the time from the presentation of a set of inputs until the output of the results, and the period T_p is the time between successive presentations of problem instances.
The period T_p of the proposed solution corresponds to the minimum usable sampling period and is limited by the speed of the reduced NAVF scheme (10.2 nanoseconds and 24.3 nanoseconds for the fast and economic solutions, respectively). Since both the fast and economic solutions are faster than the output pixel rate of a common video camera, the data can be filtered in real time. The latency of the proposed solution is defined by the principle of the sliding window and its dimension (3 × 3 × 3 pixels). The output of the system is therefore delayed by two frames, two lines, and two pixel periods because of the window size, plus B + 3 pixel periods imposed by the pipelining principle applied in the reduced NAVF filter.
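The two timing parameters can be evaluated numerically. The Python sketch below converts a pixel rate into a period and adds up the latency terms listed above. It is only a sketch: the function names are hypothetical, and the line and frame lengths passed to it must be total lengths in pixel periods (including inactive intervals), which are not specified in the text; the 1024 × 1024 values in the example are placeholders.

```python
def period_ns(pixel_rate_mpixels_per_s):
    """Period T_p in nanoseconds for a given pixel rate in Mpixels/s."""
    return 1e3 / pixel_rate_mpixels_per_s

def latency_pixel_periods(pixels_per_line, lines_per_frame, word_length=8):
    """Latency T_d in pixel periods.

    Two frames, two lines and two pixels for the 3 x 3 x 3 window,
    plus B + 3 pixel periods for the pipelined filter.
    """
    frame = pixels_per_line * lines_per_frame
    window_delay = 2 * frame + 2 * pixels_per_line + 2
    pipeline_delay = word_length + 3
    return window_delay + pipeline_delay

if __name__ == "__main__":
    print(round(period_ns(97.6), 1))            # fast solution, in ns
    print(round(1e3 / 24.3, 1))                 # economic solution, in Mpixels/s
    print(latency_pixel_periods(1024, 1024))    # placeholder frame geometry
```

The first value reproduces the 10.2 nanosecond period of the fast solution from its 97.6 Mpixels/s rate; the second shows that the 24.3 nanosecond period of the economic solution corresponds to roughly 41 Mpixels/s.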
We did not specify power consumption estimates for the filter control and buffers (first line of Table 8), because they would be dominated by the standby power, and because the control unit and buffers do not represent a stand-alone part of the filtering scheme. However, the consumption estimate of this block is included in the next two lines of the table.
CONCLUSIONS
In this paper, an efficient video filtering technique useful for real-time computer vision applications was introduced. The behavior of the filtering scheme under consideration was analyzed in detail with respect to the parameters used. Experimentation with a wide range of test videos and noise intensities showed that the reduced NAVF structure produces excellent results. Moreover, its simple structure allows implementation as a cost-effective FPLD solution, leaving the majority of available resources free for the implementation of a compact, modern, integrated computer vision system. Recent FPLD devices have capacity and performance comparable to application-specific integrated circuits (ASICs), while maintaining flexibility and low development costs. The main disadvantages of FPLD devices, namely a higher unit price in high-volume applications and higher power consumption, can be successfully resolved using their mask-programmed equivalents (e.g., the HardCopy version of FPLD devices for Altera). Although the choice of the Altera Stratix EP1S25 device was motivated by the use of the Altera STRATIX DSP development board, the filter area is so small that it can be mapped into almost any low-cost FPLD device (e.g., the smallest Cyclone device [80]). The flexibility of the complete video filtering system structure is only limited by the architecture of the development board and the size of the memory blocks (embedded and external memory used for line and frame buffers). The proposed system structure enables modification of the pixel and frame frequency, and is able to process videos with different frame spatial dimensions. Thus, the proposed solution allows for easy adaptation to the camera chosen by the end user. The reduced NAVF filtering structure is itself even more flexible. It can be easily adapted to the window size (by the use of the parameter N), to the number of bits per pixel (parameter B), and to the statistics of the processed video (smoothing parameters k 1 and k 2), and thus reused in a large variety of image processing applications. It can therefore be concluded that the efficiency and versatility of the proposed solutions make our video filtering system well suited for a new generation of advanced and intelligent vision systems.
ACKNOWLEDGMENT
The work of R. Lukac is partially supported by a NATO/NSERC Science Award.
