Abstract
Introduction
Run-time reconfiguration (RTR) typically alternates phases of computing and reconfiguring, trading the time cost of hardware reconfiguration for the time benefit of computing with specialized hardware [CoH02, VaT03]. It exploits the fast reconfiguration possible with commercially available field-programmable gate arrays (FPGAs) or specially designed reconfigurable hardware [BrR96, Hau98]. RTR has found fruitful applications in a number of areas, particularly image and video processing [GaCL02, JeTY+99, Kre00, NaMN99, TaBW00], networking [Bre02, DiTJ00, GuLD00, HoK02, MeMM02], and cryptography [DaPR00, ImSN+01]. The applications for which RTR is most useful usually display a large number of repeated operations, so the solution approach computes in each configuration for a substantial amount of time before reconfiguring for the next phase [BoP98]. This compensates for the reconfiguration overhead.
In image processing, a common approach to removing noise from an image is to scan a filtering window across the image. When a pixel is at the center of the window, the filter calculates a new value for that pixel by taking the inner product of the filtering window coefficients and the pixel values in the surrounding window-size neighborhood. The window coefficients could be fixed or could vary over the image. Fixed coefficients, however, have the drawback of filtering uniform regions, edges, and regions of quick transitions all in the same way. One way to overcome this drawback is to use adaptive coefficients whose values depend on local conditions around the center pixel [GoW02, Tek95].
This paper applies RTR to solve this adaptive image filtering problem. Filtering exhibits regular, repeated operations, taking an inner product among the same number of elements at each pixel position. If window coefficients were fixed, then we could design multipliers tied to these constant values. In the adaptive case, however, window coefficients can change at each new pixel location. On the other hand, each pixel appears (at different locations) in several windows at regular, predetermined intervals.
Our solution exploits this observation to build an efficient RTR solution for adaptive image filtering, featuring a collection of compact, simple modules and a regular reconfiguration cycle.
Section 2 reviews background concepts from RTR and image filtering. Section 3 describes our implementation and demonstrates the merits of our approach.
Background concepts
This section sketches the underlying concepts relevant to understanding the problem and our approach. Section 2.1 deals with RTR, then Section 2.2 deals with image filtering.
Run-time reconfiguration
As described earlier, the distinguishing mark of RTR is the active role of both computation and reconfiguration. These may interact in different ways [CoH02]. We can implement RTR using fully reconfigurable or partially reconfigurable hardware.
With fully reconfigurable hardware, loading a new configuration requires the application to stop while the configuration loads and then restart. Partially reconfigurable hardware, on the other hand, permits loading a new configuration to part of the hardware while the remainder continues to operate. Partial reconfiguration thus uses hardware resources more effectively: overlapping the computation and reconfiguration phases can significantly reduce the running time of an application.
A reconfigurable computing unit prominent in many RTR implementations is the constant coefficient multiplier (KCM). An n×m KCM multiplies an n-bit input by an m-bit constant, k, to produce their (n+m)-bit product. Look-up tables (LUTs) form the heart of a KCM. Each LUT holds one or more bits of the product of k and a range of input values. Replacing constant k with a new constant involves replacing values in these LUTs [Xil99].
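As an illustration of the LUT-based structure, the following Python sketch models an 8×8 KCM in software. The split of the input into two 4-bit halves and the names KCM, reconfigure, and multiply are assumptions made for illustration; the text specifies only that LUTs hold product bits for a range of input values and that changing k means rewriting those LUTs.

```python
class KCM:
    """Toy software model of an 8x8 constant coefficient multiplier."""

    def __init__(self, k: int = 0) -> None:
        self.lo = [0] * 16  # products of k with the low 4 input bits
        self.hi = [0] * 16  # products of k with the high 4 input bits
        self.reconfigure(k)

    def reconfigure(self, k: int) -> None:
        """Replace the constant by rewriting the LUT contents."""
        if not 0 <= k <= 0xFF:
            raise ValueError("k must fit in 8 bits")
        self.lo = [k * n for n in range(16)]
        self.hi = [k * n for n in range(16)]

    def multiply(self, x: int) -> int:
        """Return k * x using two look-ups, one shift, and one add."""
        if not 0 <= x <= 0xFF:
            raise ValueError("x must fit in 8 bits")
        return self.lo[x & 0xF] + (self.hi[x >> 4] << 4)
```

In this model, multiplication never uses a general multiplier at run time; only reconfigure computes products, which mirrors the idea that changing the constant is a matter of loading new table contents.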
Some RTR research on image processing applications holds relevance to the current work. Gause et al. [GaCL02] developed FPGA solutions for shape-adaptive template matching, exploring designs that are fully dynamic, configuring for a given template and size, and partially dynamic, configuring for maximum size and given template data. Their fully dynamic design attained a speedup of almost 7000 times over a software solution for a sufficiently large template.
Kreuger [Kre00] implemented a finite impulse response (FIR) filter in a Xilinx Virtex-EM FPGA, employing the reconfigurability of the FPGA as well as the on-chip memory to store multiple lines of pixels. Nagano et al. [NaMN99] used partially reconfigurable FPGAs toward fractal image compression. They used KCMs in many cases where multiplications could be regarded as multiplications by a constant, and they used reconfiguration to specialize the design to eliminate computations useless to the particular data set at hand.
Wojko and ElGindy [WoE99] devised an RTR solution to adaptive FIR filtering. The problem involves an n-tap filter with n filter coefficients and computing the inner product of these coefficients with each n-element window of data. The conventional approach leaves the filter coefficients in fixed locations and streams data past them; Wojko and ElGindy inverted this arrangement. They recognized that each data element participates in n multiplications, while the number of multiplications for an adaptive filter coefficient is uncertain, so they configured a KCM for each data element and cycled the coefficients past them.
The approach performs all multiplications contributing to a single inner product at the same time. Our approach extends the idea of fixing data to KCMs while letting filter coefficients flow past them to the two-dimensional case of an image. Additional obstacles arise in our two-dimensional case, including the fact that window data are no longer all present at the same time.
Our implementation will use a Xilinx Virtex-E FPGA as its partially reconfigurable hardware. The features of this FPGA that lend themselves to this purpose are fast reconfiguration of KCMs and on-chip blocks of RAM. Because the design processes pixels by rows, but filter windows span several rows, we generate a number of intermediate results. The size and number of RAM blocks are more than adequate and suit our modular design well.
Image filtering
This section describes the image filtering algorithm that this paper implements. An image filtering algorithm removes noise from an image. It works on the principle that any pixel with an intensity value very much different from its surrounding pixels is noisy. The image filter under consideration moves a filtering window over an image pixel by pixel in row-major order starting from the top left corner of the image. Filtering windows are typically of size 3×3, 5×5, or 7×7. (In this paper, we use a 3×3 window, though the method applies to other sizes.) To produce the new value of a pixel at the center of the window, the algorithm computes the inner product of the values of pixels overlapped by the filtering window and the coefficients of the filtering window. Such a filter is termed a linear smoothing filter.
A linear smoothing filter of size w × w, for odd integer w, computes each new pixel value as the inner product of the w² filter coefficients with the w² pixel values under the window centered on that pixel. If each of the w² filter coefficients remains the same for all window positions over the image, then the filter is spatially invariant. This filter removes noise from an image, but can blur the image by smoothing sharp edges and softening step variations to gradual changes. A second type of filter is spatially variant, or adaptive: the filter coefficients adapt to the varying nature of the image and differ for different window positions over the image. Such a filter can adjust the values of its coefficients to perform less smoothing near edges and more smoothing in areas where the image is largely uniform, thus preserving the details in the image [GoW02, Tek95]. This paper implements an adaptive linear smoothing filter.
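In symbols, a linear smoothing filter of this kind can be written as follows. The coefficient symbol c is an assumption introduced here for illustration; v and newv follow the notation this paper uses for old and new pixel values, and g and h are the vertical and horizontal offsets within the window.

```latex
\[
  \mathit{newv}(i,j) \;=\;
  \sum_{g=-(w-1)/2}^{(w-1)/2} \;
  \sum_{h=-(w-1)/2}^{(w-1)/2}
  c(i,j,g,h)\, v(i+g,\, j+h)
\]
```

For a spatially invariant filter, c does not depend on the window position (i, j); for an adaptive filter it may.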
Tekalp [Tek95] describes one approach to generating coefficients for an adaptive filter applied to video images or two-dimensional gray scale images. The coefficient values depend on the uniformity of the image, so that the coefficients have equal weights when the image is uniform under a window. When intensity values differ widely across a window, the coefficient values are greater for pixels with intensity values nearer to that of the center pixel. This requires optimizing a criterion function, which depends upon the intensity values of the pixels overlapped by the smoothing filter. Here the window offsets g and h range from −(w−1)/2 to (w−1)/2. In this paper, we assume that coefficient generation takes place outside the FPGA.
To accommodate maximum adaptivity, we assume that each coefficient can change arbitrarily from one window position to the next.
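For reference, a direct software version of an adaptive linear smoothing filter might look like the sketch below. The callback coeff(i, j, g, h) is a hypothetical stand-in for the externally generated coefficients, and copying border pixels unchanged is an assumption; the text does not state how borders are handled.

```python
from typing import Callable, List


def adaptive_filter(
    v: List[List[int]],
    coeff: Callable[[int, int, int, int], float],
    w: int = 3,
) -> List[List[float]]:
    """Direct reference: one inner product per interior pixel.

    coeff(i, j, g, h) returns the coefficient at window offset (g, h) for
    the window centered at (i, j). Border pixels are copied unchanged
    (an assumption of this sketch).
    """
    half = (w - 1) // 2
    rows, cols = len(v), len(v[0])
    newv = [[float(p) for p in row] for row in v]
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            newv[i][j] = sum(
                coeff(i, j, g, h) * v[i + g][j + h]
                for g in range(-half, half + 1)
                for h in range(-half, half + 1)
            )
    return newv
```

Because coeff receives the window position as well as the offset, every coefficient may change arbitrarily from one window position to the next.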
Implementation
The implementation that we describe assumes a 256×256 image that represents pixel intensities in an 8-bit gray scale and a 3×3 filtering window with 8-bit fixed-point filtering coefficients. The approach readily extends to other values for these parameters.
We now give an overview of our solution approach. The hardware comprises 16 modules that act upon 16 contiguous pixels from a row at a time. The actions of each module on one pixel value follow three sets of three steps each, corresponding to the three rows in a 3×3 window and the three positions in each row. For each of these nine steps, a module contributes to one of the nine window computations in which its pixel participates. Each module includes two KCMs, one active and one reconfiguring. (The time to reconfigure one KCM overlaps the computation time of the other.) The external inputs (from outside the FPGA) to the active KCMs are the window coefficients. A KCM output feeds into an adder that receives its other input from a neighboring module or from memory; these are partial results of inner product computations for filtering windows. The adder output goes to a neighboring module, to memory, or to I/O pins. Section 3.1 describes a module, then Section 3.2 details the computation steps of a module while configured for one pixel value. Finally, Section 3.3 evaluates the size of the implementation and the time to filter an image.
Module
A module comprises a number of separate entities (Figure 1). Its primary components are two 8×8 KCMs and an adder. The presence of two KCMs is the key to RTR, as one KCM provides data to the module adder while the system is reconfiguring the other. Each KCM receives the filtering window coefficient as its input (assume that the KCM is already configured with a value of an image pixel) and produces a 16-bit value that it feeds to the module adder. The other input to the module adder comes from the module mux. The module mux has three inputs, the first with a constant input of 0, the second receiving data from its previous module, and the third receiving data from a memory block. The input selection depends on whether the current position in a window is the first element of the window, a middle or end element of the window, or the first element on a new row of the window, respectively.
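The datapath of one module can be sketched in Python as follows. The selection names ZERO, PREV, and MEM and the function names are hypothetical, and a plain integer multiplication stands in for the active KCM.

```python
ZERO, PREV, MEM = 0, 1, 2  # hypothetical names for the three mux selections


def module_mux(select: int, prev_module: int, memory: int) -> int:
    """Pick the second adder input for the current window position."""
    if select == ZERO:  # first element of the window
        return 0
    if select == PREV:  # middle or end element of a window row
        return prev_module
    if select == MEM:   # first element on a new row of the window
        return memory
    raise ValueError("select must be ZERO, PREV, or MEM")


def module_step(
    pixel: int,
    coefficient: int,
    select: int,
    prev_module: int = 0,
    memory: int = 0,
) -> int:
    """One multiply-accumulate step: KCM product plus the mux output."""
    product = pixel * coefficient  # stands in for the active 8x8 KCM
    return product + module_mux(select, prev_module, memory)
```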
The circuit comprises 16 modules. All the modules work in parallel, and data moves simultaneously along the same path within each module.
Every KCM alternates between computation and reconfiguration phases. At any time, the set of 16 KCMs in their computation phase (one per module) is called the active set, while the other set of 16 KCMs in their reconfiguration phase (one per module) is called the reconfiguring set. We call the circuit with the active set of KCMs configured for a particular set of 16 pixel values an active state. When the system has changed the contents of the KCM LUTs in the reconfiguring set and this set switches to computation mode (while the KCMs in the active set switch to reconfiguring mode), a new active state begins. We refer to this as a state switch. The circuit undergoes a state switch after every 16 clock cycles (the maximum of nine computation steps and 16 steps to reconfigure a KCM [Xil00]).
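A minimal sketch of this alternation, for one module, appears below. The class and method names are hypothetical, and each KCM is reduced to the constant it holds; the step counts are those quoted in the text.

```python
COMPUTE_STEPS = 9    # window positions handled per pixel value
RECONFIG_STEPS = 16  # steps to reconfigure one KCM, as quoted in the text
STATE_CYCLES = max(COMPUTE_STEPS, RECONFIG_STEPS)  # cycles per active state


class PingPongModule:
    """Two constant multipliers per module: one active, one reconfiguring."""

    def __init__(self, first_pixel: int) -> None:
        self.constants = [first_pixel, 0]  # constant held by each KCM
        self.active = 0                    # index of the KCM in computation mode

    def load_next(self, pixel: int) -> None:
        """Write the next pixel value into the reconfiguring KCM."""
        self.constants[1 - self.active] = pixel

    def state_switch(self) -> None:
        """Swap roles: the reconfigured KCM becomes the active one."""
        self.active = 1 - self.active

    def multiply(self, coefficient: int) -> int:
        """Multiply a window coefficient by the active KCM's constant."""
        return self.constants[self.active] * coefficient
```

Loading the next pixel does not disturb the active constant, so computation and reconfiguration can overlap within one active state.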
A set of 16 filtering window coefficients, one for each module, arrives at each of the first nine clock cycles of each active state. A set of 16 pixel values, one for each module, arrives at each state switch as new multiplier constants to be loaded into the reconfiguring KCMs.
Computation steps
Because a 3×3 window includes nine elements, each pixel participates in nine windows, so the computation phase for a module comprises nine steps. Because a 3×3 window spans three rows, the nine steps partition into three cells of three steps each, one cell per row.
Due to many simultaneously active computation flows during a computation phase, we describe the computation flow from two vantage points: (1) from the point of view of a module which interacts with nine windows that overlap the pixel position assigned to the module; and (2) from the point of view of a pixel for which one window applies to compute an inner product leading to a new value for that pixel.
From a module's viewpoint, in each computation phase, its active KCM maintains a pixel value v(i, j) as constant multiplier. (We call the portion of a filtering inner product corresponding to one row of a window a window-row sum.) A module executes nine steps in three cells of three steps each: in one cell it contributes to top window-row sums for three new pixel values; in a second cell it contributes to middle window-row sums for three new pixel values; and in the third cell it contributes to bottom window-row sums for three new pixel values. Figure 2 presents procedure THREEPIX for one module, which is a three-step cell that contributes to three window-row sums on the same (top, middle, or bottom) window row for three adjacent pixels, where in and out specify the source of one adder input and the destination of the adder output. Figure 3 gives pseudocode for all modules over the entire image.
Procedure THREEPIX(r, in, out)
Step 0: Adder(r) ← KCM(r) + in

Let π(i, j, g, h) denote the term at window offset (g, h) of the window centered at pixel position (i, j), that is, the product of a pixel value and a window coefficient. Let rsum(g)(i, j) denote the window-row sum for row g of the window centered at pixel position (i, j), for g ∈ {−1, 0, 1}. The column labeled v(6, 25) of Figure 4 shows the π values computed and top window-row sum participation for one module during the first cell of three steps. Each column in the figure denotes a module, and each row denotes a multiply-accumulate computation step. Despite the many streams of computations, the flow of data through the modules is regular. Table 1 details the steps of one module configured for v(i, j) as multiplier constant.
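The nine products that a module forms while configured for v(i, j) can be enumerated with a short sketch. The function name is hypothetical, and the ordering (top, middle, then bottom window row, each from left offset to right offset) is inferred from the π entries for the first cell in Figure 4 rather than stated outright.

```python
from typing import List, Tuple


def module_schedule(i: int, j: int) -> List[Tuple[int, int, int, int]]:
    """List (ci, cj, g, h) for the nine products formed with pixel v(i, j).

    Each tuple names pi(ci, cj, g, h): the window centered at (ci, cj)
    uses v(i, j) at offset (g, h), so ci = i - g and cj = j - h.
    """
    return [(i - g, j - h, g, h) for g in (-1, 0, 1) for h in (-1, 0, 1)]
```

For example, module_schedule(6, 25) begins with (7, 26, −1, −1), (7, 25, −1, 0), and (7, 24, −1, 1), the three top window-row terms for that module.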
From a pixel's viewpoint, its window spans three rows. Since the inputs (image pixels) arrive at the modules in row-major order, the circuit deals with the three rows of a window at separate times. Figure 4 shows the three steps in the computation of rsum(−1)(7, 25), that is, the top window-row sum for newv(7, 25). In Step 0, the module configured for v(6, 24) makes the first contribution of π(7, 25, −1, −1), then passes its result to the module configured for v(6, 25). In Step 1, that module adds its π(7, 25, −1, 0) term, then passes the sum to the module configured for v(6, 26). In Step 2, that module adds its π(7, 25, −1, 1) term to complete rsum(−1)(7, 25), which it sends to memory.
When the data for the corresponding positions of the next row (row 7) arrive, the modules configure themselves for that data, and, among other things, the same three modules compute the middle window-row sum rsum(0)(7, 25). Similar actions occur for row 8 to compute the bottom window-row sum rsum(1)(7, 25) and complete determining newv(7, 25).
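The row-by-row flow described above can be checked in software with the following sketch. It is a simplification under stated assumptions: one module per image column (so the grouping into 16 modules and the wrap-around between groups are ignored), a dictionary standing in for the memory blocks, windows that overhang the image border skipped, and a hypothetical callback coeff(i, j, g, h) supplying coefficients.

```python
from typing import Callable, Dict, List, Tuple

Coeff = Callable[[int, int, int, int], int]  # hypothetical: coeff(i, j, g, h)
Sums = Dict[Tuple[int, int], int]            # keyed by window center (i, j)


def row_cell(pixels: List[int], coeff: Coeff, center_row: int, g: int,
             carry_in: Sums) -> Sums:
    """One three-step cell: add window row g for windows centered on center_row."""
    n = len(pixels)
    adder = [0] * n
    for step, h in enumerate((-1, 0, 1)):
        prev, adder = adder, []
        for j in range(n):
            c = j - h  # center column this module serves at this step
            if step == 0:
                acc = carry_in.get((center_row, c), 0)  # 0, or a sum from memory
            else:
                acc = prev[j - 1] if j >= 1 else 0      # sum from the left neighbor
            inside = 1 <= c <= n - 2                    # skip border windows
            term = pixels[j] * coeff(center_row, c, g, h) if inside else 0
            adder.append(acc + term)
    # After Step 2 the module for column j holds the sum for center column j - 1.
    return {(center_row, j - 1): adder[j] for j in range(2, n)}


def stream_filter(v: List[List[int]], coeff: Coeff) -> Sums:
    """Filter interior pixels while visiting image rows once, in order."""
    rows = len(v)
    memory: Sums = {}
    newv: Sums = {}
    for i, pixels in enumerate(v):
        if i + 1 <= rows - 2:   # top window row for windows centered on row i + 1
            memory.update(row_cell(pixels, coeff, i + 1, -1, {}))
        if 1 <= i <= rows - 2:  # middle window row for windows centered on row i
            memory.update(row_cell(pixels, coeff, i, 0, memory))
        if i - 1 >= 1:          # bottom window row finishes windows centered on row i - 1
            newv.update(row_cell(pixels, coeff, i - 1, 1, memory))
    return newv
```

Each image row is read once, and a window's three window-row sums are accumulated across three consecutive rows, with the partial result held in memory between them.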
The memory block access pattern is also regular. Each module writes to one memory block and reads from another. Specifically, module r writes to memory block r and reads from memory block (r+2) mod 16. The reason for the offset by two modules is that a module r handles a pixel in the rightmost position of a window when it writes a window-row sum rsum. On the next row, the module (r−2) mod 16 handles the pixel in the leftmost position of the window (two places to the left) and needs to read the previously written window-row sum rsum from module r. Addressing individual locations within a memory block also follows a regular sequence, cycling through 64 locations of memory.
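The block assignment can be stated in two lines of Python; the function names are illustrative. The check in the usage note below confirms that the module two places to the left reads exactly the block that module r wrote.

```python
NUM_MODULES = 16


def write_block(r: int) -> int:
    """Memory block to which module r writes a finished window-row sum."""
    return r % NUM_MODULES


def read_block(r: int) -> int:
    """Memory block from which module r reads a previously written sum."""
    return (r + 2) % NUM_MODULES
```

For every r from 0 to 15, read_block((r − 2) mod 16) equals write_block(r), including the wrap-around cases r = 0 and r = 1.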
Evaluation
Our design implements 16 modules (Figure 1) simultaneously. The overall circuit uses 492 slices of a Xilinx Virtex-E FPGA to perform adaptive image filtering. The Xilinx Virtex-E device XCV200E has 2352 slices and 28 SelectRAM blocks, each with 4096 bits, which is more than sufficient for the size and number requirements for memory blocks.
This design runs at a clock frequency of 101.9 MHz. To filter a 256×256 image, it will execute 16 iterations (of three instruction cells) per row, with each iteration taking 16 steps. This leads to an overall time of 643 µs to filter an image. This improves on the 20 ms time for a pure software solution (as executed on an 866 MHz Pentium III system), giving a speedup of 31.
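The timing figures above follow from simple arithmetic, which the sketch below reproduces. The constant names are illustrative; the values come from the text.

```python
CLOCK_HZ = 101.9e6               # reported clock frequency
ROWS = 256                       # image rows
ITERATIONS_PER_ROW = 256 // 16   # 16 pixels handled per iteration
CYCLES_PER_ITERATION = 16        # one active state
SOFTWARE_SECONDS = 20e-3         # reported software time

cycles = ROWS * ITERATIONS_PER_ROW * CYCLES_PER_ITERATION
filter_seconds = cycles / CLOCK_HZ
speedup = SOFTWARE_SECONDS / filter_seconds
print(f"{cycles} cycles, {filter_seconds * 1e6:.0f} us, speedup {speedup:.0f}")
```

The count of 65,536 cycles at 101.9 MHz gives about 643 µs, and 20 ms divided by that time gives a speedup of about 31.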
Figure 4: π values computed and top window-row sum participation during the first cell of three steps. Each column denotes a module, labeled by its configured pixel value; each row denotes a multiply-accumulate computation step.

Step     v(6,23)          v(6,24)          v(6,25)          v(6,26)          v(6,27)
Input    0                0                0                0                0
0        π(7,24,−1,−1)    π(7,25,−1,−1)    π(7,26,−1,−1)    π(7,27,−1,−1)    π(7,28,−1,−1)
1        π(7,23,−1,0)     π(7,24,−1,0)     π(7,25,−1,0)     π(7,26,−1,0)     π(7,27,−1,0)
2        π(7,22,−1,1)     π(7,23,−1,1)     π(7,24,−1,1)     π(7,25,−1,1)     π(7,26,−1,1)
Output                                     rsum(−1)(7,24)   rsum(−1)(7,25)   rsum(−1)(7,26)

Bibliography
