Most adaptive image and signal processing tasks are performed on specialist digital signal processing chips. These devices are highly optimised for efficient computation of the core multiply and accumulate operations required by current algorithms. Attempts to synthesise these types of algorithms on FPGAs have resulted in few competitive implementations. FPGAs generally fail to realise efficient arithmetic functions except in the most constrained cases such as constant coefficient multipliers [1]. The approach adopted in this paper is based on the use of stack filters, which avoid these difficulties by employing logical algorithms that do not rely on any arithmetic functions.
Introduction
Stack filters for adaptive applications are suited to a hardware implementation on reconfigurable FPGAs. These filters are an ideal framework for the implementation of rank order and morphological operators [2, 3, 4, 5].
The two primitive morphological operators, erosion and dilation, can be easily performed by a stack filter. This allows the more sophisticated morphological filters to be built up using these base blocks.
In general a rank order filter operates by sorting an input window and then deriving its output from a function of these values. It is interesting to note that it is possible to derive the erosion and dilation operators from the rank order filter by setting the output to the maximum or minimum value.
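The relationship between the rank order filter and the morphological primitives can be sketched in a few lines of Python. This is a minimal software illustration, not the hardware architecture discussed in this paper; the function names and the edge-replication border policy are assumptions made for the example.

```python
def rank_order_filter(signal, window, rank):
    """Return, for each position, the sample of the given rank
    (0 = smallest) within a sliding window of odd length."""
    half = window // 2
    # Assumption: replicate the edge samples so the output keeps the input length.
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [sorted(padded[i:i + window])[rank] for i in range(len(signal))]


def erosion(signal, window):
    # Erosion with a flat window is the minimum of the window.
    return rank_order_filter(signal, window, 0)


def dilation(signal, window):
    # Dilation with a flat window is the maximum of the window.
    return rank_order_filter(signal, window, window - 1)


def median(signal, window):
    # The median is the middle rank of the sorted window.
    return rank_order_filter(signal, window, window // 2)
```

Selecting a different rank is the only change needed to move between the three operators, which is why a single rank order framework can serve as the base block for the more sophisticated morphological filters.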
The median filter is the best known of the rank order filters. It has excellent noise reduction characteristics and yet preserves the detail and form of the original signal. Unlike many noise reduction systems, the median filter will preserve edges without blurring the details. Figure 1 illustrates a median filter with a window size of five samples. The input window values are sorted into rank order, then the median of the five values is produced. The output value of any simple rank order filter is therefore guaranteed to be equal to one of the input window samples.
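Stack filters operate on a threshold decomposition of the input. Each input sample is decomposed into a set of binary threshold signals:

$$\text{threshold}_t(n) = \begin{cases} 1 & \text{if } \text{signal}(n) \ge t \\ 0 & \text{otherwise} \end{cases} \qquad t = 1, \ldots, 2^i - 1 \tag{1}$$

where signal(n) is the nth sample in the input stream and i is the number of bits per sample. Each of these threshold signals is then processed by one of a set of $2^i - 1$ stackable Boolean functions before being summed or, more simply, stacked to obtain the output value, as shown in Figure 2.

For a vector of threshold values to be stacked, the vector must satisfy the stacking property, which can be defined as:

$$\text{threshold}_t \ge \text{threshold}_{t+1} \qquad \text{where } 1 \le t < 2^i - 1 \tag{2}$$

It can be shown that positive Boolean functions (PBFs) always exhibit this stacking property [4]. It is not a requirement for these PBFs to be identical, only that they must stack. Stack filters built using identical PBFs are often called homogeneous stack filters.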
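Threshold decomposition and the stacking property can be illustrated with a short Python sketch. It assumes that level t of the decomposition is 1 when the sample is at least t, and that a vector stacks when no 1 appears above a 0; the function names are hypothetical.

```python
def threshold_decompose(sample, bits):
    """Decompose a sample into 2**bits - 1 binary threshold signals.
    Entry t - 1 of the result is 1 when sample >= t."""
    return [1 if sample >= t else 0 for t in range(1, 2 ** bits)]


def stacks(levels):
    """Check the stacking property: each level is >= the level above it."""
    return all(lower >= upper for lower, upper in zip(levels, levels[1:]))


def stack(levels):
    """Stack (sum) a vector of binary levels back into a sample value."""
    return sum(levels)
```

For a 3-bit sample of value 5, the decomposition yields seven binary signals, the lowest five of which are 1. Summing them recovers the original value, which is what makes the stacking step at the output so simple.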
The PBF used to perform the median function in Figure 2 is given as follows:
where the nth window element is defined as $w_n$.
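As an illustration of how a PBF yields the median, the sketch below applies the majority function to every threshold level and stacks the results. It assumes a three-sample window, which may differ from the window used in Figure 2, and the names `majority3` and `stack_filter` are hypothetical.

```python
def majority3(w1, w2, w3):
    """Positive Boolean function for the median of three binary inputs:
    true when at least two of the inputs are 1."""
    return (w1 & w2) | (w2 & w3) | (w1 & w3)


def stack_filter(window, pbf, bits):
    """Threshold decomposition stack filter.
    Evaluates the same PBF at each of the 2**bits - 1 threshold levels
    (a homogeneous stack filter) and sums the binary outputs."""
    total = 0
    for t in range(1, 2 ** bits):
        binary = [1 if w >= t else 0 for w in window]
        total += pbf(*binary)
    return total
```

Replacing the majority function with a three-input AND or OR gives the minimum or maximum of the window, that is, erosion or dilation. In hardware every level is evaluated in parallel; the loop here only models the result.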
This gives a very simple architecture whose operation is very fast due to its high degree of parallelism. Unfortunately, it is also very large, and lacks scalability because the number of thresholds doubles for every additional bit in the input sample.
This architecture has the advantage of being able to implement all stack filters, unlike other architectures which assume a homogeneous stack filter is desired [5].
The range compression architecture [6] was originally proposed to reduce the overall size of the filter. The aim was to reduce the filter's complexity by processing only the threshold signals that correspond to the values of the individual samples in the window. This has the effect of reducing the number of PBFs from $2^i - 1$ to just the number of samples in the input window.
This architecture is much slower, as it requires the magnitude of each window sample to be compared with every other sample in the window. The filter should be more scalable, as it avoids the exponential growth of the threshold decomposition architecture. The reduction in the number of threshold signals is offset, however, by the increased complexity of the comparators, so the reduction in complexity does not translate well to hardware.
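One reading of the range compression idea can be sketched as follows. The sketch assumes that each window sample is used in turn as a threshold, that the PBF is evaluated once per sample, and that the output is the largest sample for which the PBF returns 1. It also assumes a non-trivial PBF that returns 1 when all inputs are 1. The function name is hypothetical, and the details of the architecture in [6] may differ.

```python
def range_compression_filter(window, pbf):
    """Evaluate a PBF only at thresholds equal to the window samples.
    Needs len(window) PBF evaluations and len(window)**2 comparisons,
    independent of the number of bits per sample."""
    best = None
    for candidate in window:
        # Compare every sample against the candidate threshold.
        binary = [1 if w >= candidate else 0 for w in window]
        # Keep the largest candidate that the PBF accepts.
        if pbf(*binary) and (best is None or candidate > best):
            best = candidate
    return best
```

The inner comparison of every sample against every other sample corresponds to the comparators at the input window, and the final selection corresponds to the multiplexing switch at the output.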
The binary refinement filter [7] uses a binary search algorithm to accelerate the filtering process. This removes a degree of concurrency from the problem, but pipelining can be used to maintain the throughput, albeit at the expense of increased latency.
A binary search algorithm maps well to the threshold signals as they are, by definition, sorted into order. The objective is simply to find the value of the highest threshold to obtain a '1' from the PBF. This approach can be seen in Figure 3.

This process can now be repeated for the less significant bits. First, however, we must evolve the window elements such that they do not lose the information held in the upper bits. This is a simple process of locking the element to a minimum value if it is below the valid threshold range, and to a maximum value if it is above the valid range.
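The bit-by-bit search and the locking of window elements can be sketched in Python as shown here. This is a sequential software model under stated assumptions, not the pipelined XC6216 design: the PBF is assumed to take a list of bits, and the names `binary_refinement`, `LOCKED_LOW` and `LOCKED_HIGH` are hypothetical.

```python
IN_RANGE, LOCKED_LOW, LOCKED_HIGH = 0, -1, 1


def binary_refinement(window, pbf, bits):
    """Resolve the filter output one bit at a time, most significant first."""
    state = [IN_RANGE] * len(window)
    output = 0
    for b in reversed(range(bits)):
        # Build the binary input vector for this bit position.
        vector = []
        for w, s in zip(window, state):
            if s == LOCKED_HIGH:
                vector.append(1)             # above the valid range
            elif s == LOCKED_LOW:
                vector.append(0)             # below the valid range
            else:
                vector.append((w >> b) & 1)  # still in range: use its own bit
        bit = pbf(vector)
        output |= bit << b
        # Lock any in-range element whose bit disagrees with the output bit.
        for k, w in enumerate(window):
            if state[k] == IN_RANGE and ((w >> b) & 1) != bit:
                state[k] = LOCKED_HIGH if (w >> b) & 1 else LOCKED_LOW
    return output


def majority(vector):
    # PBF for the median of an odd-length window.
    return 1 if 2 * sum(vector) > len(vector) else 0
```

Only one PBF evaluation is needed per bit of the sample, rather than one per threshold level, which is where the saving over full threshold decomposition comes from.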
The binary refinement cannot easily map to the input window as the range compression architecture does. Under extreme conditions, where the size of the input window greatly exceeds the number of bits in each sample, it is possible to perform a binary refinement of a compressed input range.
Hardware Implementation
The range compression and the binary refinement filters were implemented on the XC6216 FPGA [8]. Figure 4 shows the floorplan of a pipelined implementation of the binary refinement algorithm for a minimum filter. It operates on a 3x3 window and can process a 256-level greyscale image at an estimated speed of 15 million samples per second. It occupies less than 20% of the device, or fewer than about 5,000 logic gates. The design is highly regular, makes extensive use of pipelining and scales linearly with either window size or data width. Of the architectures considered, it is by far the fastest filter. The critical path of the pipelined architecture exhibits the delay of the PBF with an additional overhead of just two gate delays on the XC6216. The latency of the pipeline, measured in system clock cycles, is equal to the width of the input sample window.
In comparison, the threshold decomposition architecture exhibits the PBF delay and also the delays resulting from the decomposition of the input signals and the stacking of the output signal.
The input range compression architecture is the slowest of the three algorithms due to the combined delays of the PBF, the comparators at the input window, and the multiplexing switch at the output.
Future Work
The use of reconfigurable FPGAs allows the exploration of adaptive image processing algorithms. The FPGA may be used within the training phase to compute the optimum PBF that can be configured onto the device for run-time execution. Additionally, multiple PBFs can run concurrently and be dynamically adapted to the characteristics of the image data. The potential for adapting the shape and size of the image window in real time makes the implementation of stack filters on FPGAs a highly novel area of ongoing research.
Conclusions
Stack filters may be implemented efficiently on FPGAs because they rely on Boolean functions rather than arithmetic functions. Of the three algorithms considered in this paper, the binary refinement filter offers the best speed-area characteristics with modest latency.
The use of dynamically reconfigurable FPGAs allows considerable scope for the implementation of real-time, adaptive stack filter algorithms.
