Abstract
Introduction
Modern medical image processing has come a long way since the first hazy X-ray images produced by Dr. Roentgen in 1895. The state-of-the-art Computerized Tomography (CT), Magnetic Resonance (MR), Angiography equipments produced by medical imaging companies testify the colossal progress. In this paper, we study applications from angiography which deal with interactive catheter guidance (passing to all kinds of vessels) and minimal-invasive vessel treatments (stenosis, aneurism, embolization, etc.). The major problem still facing medical images is the poor signal to noise ratio (SNR) due to limited dosage and exposure for health reasons [4] . Limited dosage especially is an issue if fluoroscopic images are acquired. Fluoro images are not used for diagnostic purpose but for guiding a catheter, placing a device (e.g. a stent) and other interactive interventional procedures taking up to several hours. Therefore, the use of image processing algorithms to reduce the present noise along with preservation of visual structures is a major field of research. Usually, a sequence of algorithms within an imaging pipeline is considered to tackle the different types of noise. Important among them are 2D filtering algorithms. Basic pre-processing image filters such as mean or other linear filters fail to preserve * Supported in part by the German Science Foundation (DFG) in project under contract TE 163/13-1 and Siemens AG, Medical Solutions. the visually important structures as edges and lines. Bilateral filtering as introduced by Tomasi and Manduchi is the state-of-the-art adaptive filtering algorithm which offers much better edge-preserving smoothing [13] . The second major challenge facing the medical imaging community is real-time medical image pre-processing with futuristic data rates of up to 120 Mpixel/sec and latency constraint of 100 ms (clinically relevant in order to get sufficient feedback for interactivity). Therefore, for real-time processing hardware acceleration is unavoidable. There are numerous well known software/hardware alternatives from which to choose, including digital signal processors (DSPs), custom application-specific integrated circuits (ASICs), application specific instruction processors (ASIPs) including graphic processors and field programmable gate arrays (FPGAs) and also lesser known ones such as coarse-grained processor arrays (CGAs). These alternatives offer varying degrees of performance benefits that must be weighed against other factors including cost, power consumption, and design time. The current imaging systems mostly use multi-DSP solutions for pre-processing of the images. However, there is a growing trend of combining multi-DSP based platforms with FPGAs to keep in pace with the next generation algorithms and technical advancements in equipments. Today, FPGAs offer millions of programmable gates and hundreds of hardwired multipliers for custom implementation of next generation number crunching dataflow algorithms. The most of the custom implementation of such algorithms are realized as processor array architectures which provide an optimal platform for the parallel execution. However, there is a lack of tools which map algorithm descriptions (given as for or while loops in C language) onto a hardware target (e.g. FPGA) subject to performance constraints in latency, area, or power. The current state-of-the-art HLS tools in the industry and academia are PICO Express from Synfora [10] , Catapult C from Mentor Graphics [7] , MMalpha [2] , and PARO [8] . These tools convert algorithm descriptions to RTL along with analysis of micro-architecture implementation possibilities. Some of these tools are based on the polytope model [5] which is an intuitive methodol- 
VHDL

Figure 1. PARO design flow.
ogy for loop parallelization and mapping of loop nests onto massively parallel architectures.
The major contributions of this paper are an efficient and methodic realization of a computational intensive algorithm from medical imaging as hardware accelerator on an FPGA. But first in Section 2, we present the basic concepts in our mapping methodology for generation of hardware accelerators. Afterwards in Section 3, a concise description of the bilateral filter algorithm in the context of a reference imaging pipeline is given. Subsequently in Section 4, an efficient architecture realization of a bilateral filter is derived. Finally, in Section 5 and 6, the results and the conclusion of our mapping methodology and implementation are presented.
Background
In this section we give an overview of our mapping methodology for generating synthesizable descriptions of massively parallel processor arrays from nested loop programs. The design flow of our approach is depicted in Fig. 1 . The input to the compiler is a class of algorithmic descriptions (of loop nests). This class of algorithms is formally defined as Piecewise Linear Algorithm (PLA) [11] . A set of transformations is applied to the algorithmic descriptions like localization, partitioning, . . . , spacetime mapping (see Fig. 1 ) in order to obtain a parallel architecture description. Localization is a transformation which converts affine data dependencies into uniform data dependencies by propagation of variables from one index point to neighboring index points [11] . The transformation enables maximum data reuse within the processor array, and henceforth minimizes the amount of external I/O communication (with peripheral memory). Partitioning is a transformation which enables generation of array architectures under resource constraints. Space-time mapping defines the placement of processing elements (PEs) and scheduling of the iterations. The algorithmic description obtained after space-time mapping can be translated into a hardware description using a code generator shown by the white ellipse in Fig. 1 . The hardware generation is responsible for both data path and control path generation. The successive refinement of algorithmic descriptions for hardware generation has also been done for applications from the field of neural networks, digital filtering [9] , and image analysis [6] , etc. In our case study, we derived not only a new efficient architecture but also incorporated some of the hand tuned architecture optimizations in the design flow.
Imaging Pipeline
The aim of the medical imaging community is to reduce the noise associated with the images as much as possible through the use of better algorithms. Meanwhile, due to different sources of errors, there is no single algorithm for noise removal but a sequence of algorithms usually called imaging pipeline. The algorithms in the imaging pipeline are responsible for the following functions.
• Conditioning: Conditioning algorithms handle defects produced due to sensor-introduced pixel errors, e.g., defect pixel interpolation (DPI), etc.
• Pre-Processing: This step involves the application of a series of algorithms for image enhancement through noise removal, e.g., filtering.
A reference imaging pipeline (see Fig. 2 ) constituted of conditioning and pre-processing steps is selected for our study with respect to algorithm, architecture and implementation complexity. The algorithm defect pixel interpolation (DPI) and gain-offset (G-O) correction constitute the conditioning phase of the imaging pipeline. The DPI algorithm corrects the value of defect pixels (stored in defect map) by replacing a pixel value with an average pixel value of the neighborhood (3 × 3 or 5 × 5 window). G-O correction is used to correct each pixel as they are associated with particular noise due to dark current and their own peculiar sensitivities. The G-O correction is done by multiplying the incom- ing image with the values in gain map and then offsetting it with the values in an offset map. The maps are obtained during the calibration step. The algorithms bilateral filter and multi-resolution filter constitute the pre-processing phase of the imaging pipeline. The bilateral filter is a 2D adaptive filter which removes the low frequency noise and preserves vessels and other structures. If in an adaptive filter the kernel size is large (say 9 × 9), then although low frequency noise is removed, however, structures may not be preserved. Also if the kernel size is small then the structures in the image are preserved better, but, with reduced removal of low frequency noise. This dilemma is solved by undertaking a multi-resolution representation where the adaptive filter works on Gaussian (the images with subsequent reduced resolutions (g 0 (1024×1024), g 1 (512×512), g 2 (256 × 256), . . . )) and Laplacian (difference of images at different resolutions) pyramids. The basic concept is that on going to a lower resolution, it is easier to filter low frequency noise and preserve the structures.
To use the complete parallelism available in the algorithms one needs to ascertain the required number of operations per cycle, number of memory accesses, and latency. A close study of the imaging pipeline (see Fig. 3 ) reveals that the algorithmic requirements of the conditioning algorithms are comparatively lower than for the pre-processing algorithms. The requirements of the conditioning algorithms can easily be met by software implementations. Whereas for a 5×5 mask, an efficient parallel implementation of a bi- lateral filter requires 25 multipliers, 24 adders, and 25 parallel memory accesses for processing one pixel per clock cycle. Furthermore, the multi-resolution algorithm uses a gradient adaptive filter [4] (similar to bilateral filter) at different resolutions. In order to accelerate these pre-processing algorithms one needs parallel implementations for this class of adaptive filter algorithms. Besides, conception of dedicated architectures for a generic class of adaptive filter algorithms is subjected to constraints and algorithmic change calls for an automatic mapping methodology. Therefore, we take the bilateral filter as running example throughout the paper to exemplify our mapping methodology. In the following section we describe bilateral filtering in detail.
Bilateral Filtering
Filtering is perhaps the most fundamental operation of image processing and computer vision. The term filtering means that the value of the filtered image at a given location is a function of the values of the input image in a small neighborhood of the same location. E.g., Gaussian low-pass filtering computes for each pixel a weighted average of pixel values in a neighborhood where the weights decrease as a Gaussian function of distance from the center pixel. The problem with Gaussian filtering is that low spatial variation of the nearby pixels fails at edges, where we have abrupt changes in the pixel values. So, these edges are consequently blurred by low pass filtering. To prevent the blurring of edges, the local image variation needs to be measured at each pixel, and the pixel values are averaged with mask coefficients whose values depend also on the local variation. This is the main idea behind adaptive filtering. A bilateral filter does filtering in the range (i.e., gray values or RGB values) and spatial domain
where c(m, n) = e
of an image whereas traditional filters do filtering only in the spatial domain. In the discrete domain, the bilateral filter is described by Eq. (1), where y(i, j) is the processed output pixel, u(i − m, j − n) describes the neighborhood of input pixels required for the computation. The convolution is carried out with closeness mask, c and similarity mask, s. The closeness mask corresponds to a Gaussian filtering mask as it enforces closeness by weighting pixel values with coefficients that fall off with the distance as an exponential function (see Eq. (3)). The coefficients of the similarity mask depend on the difference of center pixel and corresponding neighborhood pixel. E.g., the similarity mask as shown in Fig. 4 can be 0,1,1,0,1,1,0,1,1 substituting the "?" with zeros corresponding to maximum pixel value difference and 1 corresponding to 0 pixel difference as input pixel neighborhood (as 64-64=0). The following algorithm represents the functioning of a bilateral filter. 
ALGORITHM: Bilateral Filter The window operator or the mask is given by a(m, n).
(i, j)+=a(m, n) · u(i − m, j − n) sum w (i, j)+=a(m, n) END FOR END FOR y(i, j)= prod(i, j) sumw(i, j)
END FOR END FOR
The values in the mask are determined by the geometric spread, σ d and photometric spread, σ d . The mask coefficients conveniently stored in a LUT as exponential functions as they are difficult to calculate at run-time. prod contains the intermediate weighted sum, whereas sum w contains sum of mask coefficients.
Boundary Conditions
The pixels on the border strip do not have enough neighboring pixels to process the filter algorithms. For example, in case of a 3 × 3 filter operation, the pixels on the left, right, top and bottom strip of width 1 pixel (denoted by x) cannot be processed (see Fig. 4 ). The following approaches summarize the way the boundary pixels can be handled. a) The boundary pixel values have the same values after the filtering step as before, b) Zero padding: Additional strips of pixels are pasted around the boundary, the pixels in the new strip have zero or maximum gray value, and c) Symmetric extension: An additional strip of pixels are pasted around the boundary. However, the value of pixels in the strip are mirrored values of pixels across the boundary. The same boundary handling method is also carried out for DWT based encoders [1] .
Design Methodology
In this section, the PARO methodology is applied for synthesis of a massively parallel architecture from algorithmic description. Starting point of the methodology is the PLA description which explicitly represents the full data parallelism in the algorithm. The variables in the PLA description of the nested loop algorithm with different number of indices needs to be embedded into an index space with common dimensions. This allows the scheduling and placement of the variables in the polytope model. Therefore, variables u, a and y out , i.e., inputs, weights and output image are embedded in the common 4D index space. The following PLA is equivalent to the algorithm in Section 3.1.
The index space is given by Localization enables maximum data reuse within the processor array and henceforth minimizes the amount of external I/O communication (with a line buffer as will be seen later). After the transformation Localization, the equivalent equations for input pixel, u and center pixel, u c , are obtained as follows: Although our dependence graph is a 4D graph, the idea of localization of u can be illustrated as in Fig. 5(a) and (b) by neglecting index variables j and n. By means of localization the broadcast of input data as shown in Fig. 5(a) is converted to propagation of data by introduction of equivalent uniform data dependencies (see Fig. 5(b) ). The computations of the above PLA which has only uniform dependencies may be represented by a dependence graph (DG). However, the four dimensional nature of the iteration makes it difficult to visualize the DG. However, the DGs can be represented in a reduced form, the so called reduced dependence graphs (RDGs). All the dependence vectors are associated with corresponding edges of the graph. The RDG of the algorithm can be seen in Fig. 6 .
Space-time mapping is a linear transformation that assigns index vectors (i, j, m, n) to processors (p 1 , p 2 ) and time t. In other words, it determines on which processor and at what time an iteration is to be executed. The reason for using linear allocation and scheduling functions is local and regular data flow required between the processing elements (PEs). We chose the following transformation to describe the processor array structure and mapping:
L is the number of pixels in a single image line. The interpretation is that p1 = m and p2 = n means all iterations (i, j) corresponding to same coefficient position (same m and n) takes place on the same PE. Furthermore the set of operations defined at index points λ · (i, j, m, n) T = const, where λ = (1 L 1 1) denotes all iteration scheduled at the same time step. In our approach, the scheduling vector λ is obtained by the formulation of a latency minimization problem as a mixed integer linear program (MILP) similar as in [12, 3] . Assuming constant j and n, the dotted line in Fig. 5(b) denotes the isotemporal hyperplanes (i.e. all the points on the dotted line are executed at same time step). The following description is obtained after space-time mapping:
ALGORITHM: PLA after space-time mapping
The above program description can be parsed to obtain the final architecture. The delays can be modeled either as shift registers (t − 2 represents a shift register of size 2). Furthermore, the difference in processor index in the same equation indicates that the data is obtained from the neighboring processor. In the following section, we discuss the obtained architecture.
Architecture
The core of the obtained design is a 2D processor array (PA) of P × P PEs. The value P × P corresponds to the size of the filter mask. The PA architecture is shown for P = 3 in Fig. 7 . Corresponding to the RDG, each PE contains one fixed point, full precision multiply and accumulate unit (MAC). The subtract unit calculates the difference of the center pixel and the neighbor pixel in order to calculate the mask coefficients. The mask coefficient is an exponential function of the difference. Instead of calculating the exponential function which is expensive in hardware, a LUT is used to store the pre-calculated function values corresponding to the difference. Fig. 7 ). The pipelined divider is used to divide the pixel sum with the mask sum to obtain the final output pixel. The pipelined divider has a latency depending on the size of divisor and dividend and delivers the output pixel. The latency (i.e. the time from reading the first input pixel to obtain the output pixel) of the shown processor array architecture with a 3 × 3 filter mask for an image of 512 × 512 is 1060 cycles. The factors deciding the total latency are latency of the buffer memory, latency of the processor array (e.g., the time the output pixel takes to wander from PE(0,0) to PE(2,2)), and the latency of the pipelined divider. Control generation is required in order not to calculate the border pixels.
Results
The above derived PA architecture was implemented on a Xilinx Virtex XC2V6000 based board. Table 1 gives the resource usage of the processor array architecture without the input and output system. Of the 144 Block RAMs, only 9 and 2 BRAMs were needed by the processor array and the pixel pipeline respectively. However, the whole design included buffering of image data therefore leading to high consumption of buffer memory. This recommends usage of on-board SRAMs as intermediate video memory. Secondly, for this FPGA (which is considered relatively large by any standards), the resources would not be enough to handle bilateral filtering for a mask bigger than 11 × 11. In case of bigger masks, the architecture design must be further partitioned. This aspect was not studied as filter masks not bigger than 7 × 7 were considered in the imaging pipeline. A maximal clock speed of 87.65 Mhz was obtained during the synthesis of the processor array design.
Conclusion
In this paper, we introduced a methodological derivation of a massively parallel architecture for a bilateral filter algorithm. The fine tuning of memory access is formalized by the PARO methodology. Our future work entails inclusion of other high level transformations. For example, although the processor array can attain maximal clock rate of 100Mhz, the input data frequency for current medical imaging applications is only 20-30 Mhz. In Fig. 7 , for the input rate of 25Mhz, a processor array clocked at 75Mhz can implement the algorithm with only 3 PEs and by switching the input data with the help of a multiplexer. This optimization 
