Abstract-This paper presents the design and implementation of the finite Radon transform (FRAT) on a field programmable gate array (FPGA). To reduce the implementation time, Xilinx AccelDSP, a software tool for generating hardware description language (HDL) from a high-level MATLAB description, has been used. FPGA-based architectures with three design strategies have been proposed: direct implementation of pseudo-code with a sequential or pipelined description, and a block random access memory (BRAM)-based approach. Various medical image modalities have been used for both software simulation and hardware implementation. An analysis of image de-noising using the FRAT is presented and demonstrates a promising capability for medical image de-noising. Moreover, the impact of different block sizes on the reconstructed images has been analysed. Furthermore, a performance analysis in terms of area, maximum frequency and throughput is presented.
I. INTRODUCTION
The contribution of transform-domain methods to a wide range of operations, including image de-noising, enhancement and compression, is well established. As an example, the wavelet transform has been extensively used to overcome the limitations of the short-time Fourier transform (STFT), and excels at isolating discontinuities and spikes [1]. However, the wavelet transform suffers from inflexible directionality, as it does not isolate the smoothness along edges.
These shortcomings of wavelets are well addressed by the ridgelet and curvelet transforms, which extend the functionality of wavelets to higher dimensional singularities and have proven to be effective tools for sparse directional analysis. The basic building block of these transforms is the finite Radon transform (FRAT).
In medical imaging applications, three-dimensional (3-D) modalities have been widely exploited for more accurate assessment of pathological changes [2]. Although 3-D modalities can contribute remarkably to screening and diagnosis, their algorithms are computationally demanding. Since medical images contain several objects and curves, the ridgelet and curvelet transforms, with the FRAT as their basic building block, play a significant role in medical image analysis. The FRAT algorithm is inherently serial and iterative, and has long latency. By off-loading the intensive processing procedures onto a suitable hardware platform, computational acceleration can be achieved whilst maintaining the quality of the outcomes. Field programmable gate arrays (FPGAs), with massive parallelism options, multimillion gate counts and special low-power packages, appear an attractive solution for the FRAT's hardware implementation.
A survey of the literature indicates that research on FPGA implementations of the FRAT is still in its infancy: contributions are limited and have only been investigated in the context of general image processing [1], [3]-[6]. From the programming and hardware implementation perspectives, various approaches have been used: Verilog HDL [4], Handel-C [5], [6] and, most recently, a combination of Handel-C and a parameterisable VHDL core (Coregen) by Chandrasekaran et al. [1].
This paper presents the design and implementation of an FPGA-based FRAT architecture for medical image de-noising using the Xilinx AccelDSP tool. Xilinx AccelDSP has been selected since it can convert automatically from high-level languages (HLLs) to register transfer level (RTL) HDL, and even directly to an FPGA configuration bitstream [7]. This feature is important for reducing the design cycle and for allowing more optimisation to be carried out at the algorithmic and architectural levels. The proposed architectures will be deployed in a reconfigurable environment for adaptive medical image compression. Three design strategies have been proposed: direct implementation of pseudo-code with a sequential or pipelined description, and a block random access memory (BRAM)-based method. Analysis of both software simulation and hardware implementation with different medical image modalities has been carried out and discussed. An evaluation of the FRAT's capability for medical image de-noising is also presented.
978-1-4244-7456-1/10/$26.00 ©2010 IEEE
The structure of this paper is as follows. The design methodology, including an overview of the proposed system's applications, the Xilinx AccelDSP [8] design flow and the proposed FRAT architecture, is presented in Section II. Experimental results, comparison and analysis for medical image de-noising, software simulation and hardware implementation are discussed in Section III. Finally, concluding remarks are given in Section IV.
II. DESIGN METHODOLOGY

A. Proposed systems applications
Figure 1 illustrates an overview of the proposed 3-D compression system, including the transform, quantisation and entropy coding blocks together with the pre-processing block. In each block, buffers are used to store intermediate results awaiting processing. This research aims at developing an adaptive compression system for medical images in which all blocks are reconfigurable. It is well known that noise in medical images results in low image quality and hence limits diagnostic effectiveness. Therefore, noise reduction is vital in the pre-processing stage before compression. This paper focuses on the FPGA-based architecture of the FRAT for the image de-noising block in the pre-processing stage.
B. Xilinx AccelDSP design flow
To ease the process of transforming a MATLAB [9] floating point design into a hardware module, Xilinx introduced the AccelDSP software for rapid prototyping of a MATLAB algorithm in hardware. The main features of Xilinx AccelDSP can be summarised as follows:
• A synthesisable RTL design can be obtained from the floating point M-code;
• A set of test benches can be automatically generated; and
• HDL simulation, synthesis and implementation tools can be invoked.
The M-code has two main parts: a script file and a function file [8]. The script file creates stimuli, feeds the stimuli to the function in a streaming loop and verifies the output from the function. Moreover, the script file also serves as a source file for automatic test bench generation.
The function file contains the actual function to be translated into HDL, and it is written as an ordinary MATLAB function with an interface of input and output variables. Xilinx AccelDSP verifies at each step that the generated module matches the previous one, or is subjectively acceptable with minor differences arising from the conversion from a floating point to a fixed point design [7], [8].
C. Proposed FRAT architecture
Figure 2 shows the architecture for the FRAT, which is based on the pseudo-code in [1] for hardware implementation. To exploit the available hardware resources, the operations of the various counters used to track the addresses of the output vectors are parallelised and pipelined (by suitably changing the rollover conditions and count limits). The number of counters required remains the same; only the triggering conditions, order and reset logic are modified. It must be highlighted that whilst the algorithm is still serial and cycles through $p \cdot (p + 1)$ iterations, the number of steps in the algorithm has been reduced, thereby improving latency. The architecture has serial inputs/outputs (I/Os) and a serial core. The total latency of the core is $O(p^2 (p + 1))$. The input section consists of a one-dimensional (1-D) random access memory (RAM) with a width of eight bits and a depth of $p^2$. Although each input image block is a square tile of side $p$, buffering it in a 1-D RAM reduces the complexity of the control logic associated with data access. This is because a two-dimensional (2-D) RAM is implemented on an FPGA as a number of 1-D RAMs and uses additional multiplexing logic to dereference the address locations.
The FRAT operation requires reading from and writing to the same memory location within a single clock pulse. This is compactly and effectively implemented using a dual-port RAM at the output section instead of an array-based buffer. The output buffer is a 1-D dual-port RAM of width $\log_2(p \cdot 255)$ and depth $p$. Only a single FRAT vector is buffered, and the final values are written to the output port in serial fashion at the end of each iteration. At the end of $(p + 1)$ iterations, the entire image block has been transformed to the FRAT domain.
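The data flow described above can be sketched in software as follows. This is a minimal illustrative model, not the AccelDSP M-code or the pseudo-code of [1]: the function names `frat` and `output_width_bits`, the row-major addressing of the 1-D input buffer, the line-indexing convention and the rounding of the output width up to a whole number of bits are all assumptions made for this example. The sums are left unnormalised and the block side $p$ is assumed to be prime.

```python
import math


def frat(block, p):
    """Unnormalised FRAT of a p*p tile stored in a flat (1-D) buffer.

    block: sequence of p*p pixel values; pixel (i, j) is assumed to sit at
           index i*p + j (row-major addressing, an assumption of this sketch).
    Returns p + 1 vectors, each holding p line sums.
    """
    if len(block) != p * p:
        raise ValueError("block must hold p*p samples")
    projections = []
    for k in range(p + 1):                 # one pass per direction
        vector = [0] * p                   # the single buffered FRAT vector
        for i in range(p):
            for j in range(p):
                if k < p:
                    line = (j - k * i) % p # line through (i, j) with slope k
                else:
                    line = i               # the extra direction k = p
                # read-modify-write on one location of the output buffer
                vector[line] += block[i * p + j]
        projections.append(vector)         # vector is emitted, buffer reused
    return projections


def output_width_bits(p, max_pixel=255):
    """Bits needed to hold a sum of p pixels (rounded up: an assumption)."""
    return math.ceil(math.log2(p * max_pixel))


tile = list(range(1, 10))                  # 3x3 example tile, values 1..9
coefficients = frat(tile, 3)
```

Each of the $p + 1$ vectors sums every pixel exactly once, so all vectors share the same total; this gives a simple sanity check when comparing a software model against hardware output.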
Based on the FRAT architecture, three design strategies have been proposed, as shown in Figure 3 (a)-(c), where 'R', 'E' and 'W' refer to the 'Read', 'Enable' and 'Write' processes, respectively.
...
A. Medical image de-noising
To analyse the effectiveness of the FRAT in medical image de-noising, Gaussian noise with zero mean (µ) and various variances (σ) has been added to the experimental images. The results obtained show that the FRAT is effective in reducing Gaussian noise. Table I shows quantitative results for MRI images, whilst Figure 4 (a)-(c) present a significant achievement for 13.25% Gaussian noise de-noising of the MRI image using the FRAT. It is worth noting that filtered back projection (FBP) is a mathematically perfect inversion of the FRAT, and the peak signal to noise ratio (PSNR) depends only on the accuracy required [1]. The truncation or rounding step that follows the FRAT determines the PSNR values.
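The noise model and the PSNR metric used in this evaluation can be sketched as follows. This is a hedged illustration only: the helper names `add_gaussian_noise` and `psnr` are hypothetical, the `sigma` argument is taken to be the standard deviation of the noise, an eight-bit pixel range with a peak value of 255 is assumed, and the clipping of noisy pixels to that range is a choice made for this example rather than something stated in the paper.

```python
import math
import random


def add_gaussian_noise(pixels, sigma, seed=0):
    """Add zero-mean Gaussian noise and clip to the 8-bit range.

    sigma is assumed to be the standard deviation of the noise; the seed
    only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    noisy = []
    for value in pixels:
        sample = value + rng.gauss(0.0, sigma)
        noisy.append(min(255, max(0, round(sample))))
    return noisy


def psnr(reference, test, peak=255):
    """Peak signal to noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf                    # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


clean = [(37 * n + 11) % 256 for n in range(49)]   # synthetic 7x7 tile
noisy = add_gaussian_noise(clean, sigma=10.0)
quality_db = psnr(clean, noisy)
```

A de-noising stage would then be judged by how far it raises the PSNR of its output above `quality_db`, measured against the same clean reference.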
As the FRAT is usually used as a sub-block in other transforms such as the finite ridgelet transform (FRIT) and curvelets, where it is followed by a wavelet stage, the rounding or truncation process can easily be incorporated into the wavelet block with no extra computational effort by suitably modifying the wavelet coefficients.
However, to illustrate the effect of bit-width limitations on the PSNR, Figure 6 shows the PSNR values of the reconstructed medical images for various block sizes (p). The results show that the PSNR of the reconstructed image drops by 7.93, 21.70 and 21.80 dB for MRI, CT and PET, respectively, when the block size increases from p = 7 to p = 31. This is because the rounding error becomes more significant as p increases. Using a divider with greater precision can reduce the rounding error.
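The influence of divider precision on the reconstruction can be illustrated with a small software model. The sketch below is an assumption-laden example, not the divider used in the proposed hardware: it uses unnormalised integer FRAT sums, recovers each pixel by back projection as (sum of the $p + 1$ line sums through the pixel, minus the block total) divided by $p$, and emulates a limited-precision divider by multiplying with a rounded fixed-point reciprocal of $p$. The names `frat`, `inverse_frat`, `max_abs_error` and the parameter `frac_bits` are hypothetical, and $p$ is assumed to be prime.

```python
def frat(block, p):
    """Unnormalised FRAT of a row-major p*p tile: p + 1 vectors of p sums."""
    projections = [[0] * p for _ in range(p + 1)]
    for i in range(p):
        for j in range(p):
            value = block[i * p + j]
            for k in range(p):
                projections[k][(j - k * i) % p] += value
            projections[p][i] += value
    return projections


def inverse_frat(projections, p, frac_bits=None):
    """Back projection; frac_bits models a fixed-point reciprocal of p.

    With frac_bits=None the division by p is exact. Otherwise 1/p is
    approximated by round(2**frac_bits / p) / 2**frac_bits.
    """
    total = sum(projections[0])            # every vector sums all pixels
    if frac_bits is not None:
        reciprocal = round((1 << frac_bits) / p)
        half = 1 << (frac_bits - 1)        # rounding offset before the shift
    restored = []
    for i in range(p):
        for j in range(p):
            acc = projections[p][i]
            for k in range(p):
                acc += projections[k][(j - k * i) % p]
            acc -= total                   # acc now equals p * pixel
            if frac_bits is None:
                restored.append(acc // p)
            else:
                restored.append((acc * reciprocal + half) >> frac_bits)
    return restored


def max_abs_error(reference, test):
    return max(abs(a - b) for a, b in zip(reference, test))


p = 7
tile = [(37 * n + 11) % 256 for n in range(p * p)]
coefficients = frat(tile, p)
errors = {bits: max_abs_error(tile, inverse_frat(coefficients, p, bits))
          for bits in (4, 8, 16)}
```

In this model the worst-case pixel error shrinks as `frac_bits` grows and vanishes once the reciprocal is precise enough for the eight-bit pixel range, which mirrors the remark above that a more precise divider reduces the rounding error.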
C. Hardware implementation
For all three hardware implementations (sequential, pipelined and BRAM-based), the pseudo-code has been implemented in MATLAB, and Xilinx AccelDSP has been used for architecture and synthesis exploration. The designs have been implemented on Virtex-5 (XC5VSX50T) FPGA devices. As the prime aim of this paper is to identify the best hardware implementation for medical image de-noising, the results for both medical image de-noising and software simulation justify the hardware implementation with p = 7. A comparison of performance metrics for the proposed FRAT architectures with existing work is presented in Table II.
The hardware implementation results demonstrate various trade-offs, with the sequential and pipelined descriptions yielding better maximum frequency and throughput, respectively. Moreover, the BRAM-based method occupies less area and also achieves a better maximum frequency.
A detailed comparison of hardware implementation and software simulation with the test medical images has been carried out. As shown in Table III, software simulation achieved better PSNR than hardware implementation, with percentage differences of 12.92%, 21.47% and 33.09% for p = 7, 17 and 31, respectively. This is due to the use of floating point arithmetic in MATLAB, which yields better PSNR values than the fixed point model in the hardware implementation.
IV. CONCLUSIONS
In conclusion, an FPGA-based architecture with three different design strategies has been proposed, and an analysis with various medical imaging modalities has been conducted. Image de-noising using the FRAT is effective in reducing Gaussian noise in medical images. An evaluation of the implementation results indicates promising trade-offs in terms of maximum frequency, throughput and area.
