3 research outputs found
Efficient architectures of heterogeneous fpga-gpu for 3-d medical image compression
The advent of development in three-dimensional (3-D) imaging modalities have generated a massive amount of volumetric data in 3-D images such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and ultrasound (US). Existing survey reveals the presence of a huge gap for further research in exploiting reconfigurable computing for 3-D medical image compression. This research proposes an FPGA based co-processing solution to accelerate the mentioned medical imaging system. The HWT block implemented on the sbRIO-9632 FPGA board is Spartan 3 (XC3S2000) chip prototyping board. Analysis and performance evaluation of the 3-D images were been conducted. Furthermore, a novel architecture of context-based adaptive binary arithmetic coder (CABAC) is the advanced entropy coding tool employed by main and higher profiles of H.264/AVC. This research focuses on GPU implementation of CABAC and comparative study of discrete wavelet transform (DWT) and without DWT for 3-D medical image compression systems. Implementation results on MRI and CT images, showing GPU significantly outperforming single-threaded CPU implementation. Overall, CT and MRI modalities with DWT outperform in term of compression ratio, peak signal to noise ratio (PSNR) and latency compared with images without DWT process. For heterogeneous computing, MRI images with various sizes and format, such as JPEG and DICOM was implemented. Evaluation results are shown for each memory iteration, transfer sizes from GPU to CPU consuming more bandwidth or throughput. For size 786, 486 bytes JPEG format, both directions consumed bandwidth tend to balance. Bandwidth is relative to the transfer size, the larger sizing will take more latency and throughput. Next, OpenCL implementation for concurrent task via dedicated FPGA. Finding from implementation reveals, OpenCL on batch procession mode with AOC techniques offers substantial results where the amount of logic, area, register and memory increased proportionally to the number of batch. It is because of the kernel will copy the kernel block refer to batch number. Therefore memory bank increased periodically related to kernel block. It was found through comparative study that the tree balance and unroll loop architecture provides better achievement, in term of local memory, latency and throughput
High performance reconfigurable architectures for biological sequence alignment
Bioinformatics and computational biology (BCB) is a rapidly developing
multidisciplinary field which encompasses a wide range of domains, including genomic
sequence alignments. It is a fundamental tool in molecular biology in searching for
homology between sequences. Sequence alignments are currently gaining close attention due
to their great impact on the quality aspects of life such as facilitating early disease diagnosis,
identifying the characteristics of a newly discovered sequence, and drug engineering. With
the vast growth of genomic data, searching for a sequence homology over huge databases
(often measured in gigabytes) is unable to produce results within a realistic time, hence the
need for acceleration. Since the exponential increase of biological databases as a result of the
human genome project (HGP), supercomputers and other parallel architectures such as the
special purpose Very Large Scale Integration (VLSI) chip, Graphic Processing Unit (GPUs)
and Field Programmable Gate Arrays (FPGAs) have become popular acceleration platforms.
Nevertheless, there are always trade-off between area, speed, power, cost, development time
and reusability when selecting an acceleration platform. FPGAs generally offer more
flexibility, higher performance and lower overheads. However, they suffer from a relatively
low level programming model as compared with off-the-shelf microprocessors such as
standard microprocessors and GPUs. Due to the aforementioned limitations, the need has
arisen for optimized FPGA core implementations which are crucial for this technology to
become viable in high performance computing (HPC).
This research proposes the use of state-of-the-art reprogrammable system-on-chip
technology on FPGAs to accelerate three widely-used sequence alignment algorithms; the
Smith-Waterman with affine gap penalty algorithm, the profile hidden Markov model
(HMM) algorithm and the Basic Local Alignment Search Tool (BLAST) algorithm. The
three novel aspects of this research are firstly that the algorithms are designed and
implemented in hardware, with each core achieving the highest performance compared to the
state-of-the-art. Secondly, an efficient scheduling strategy based on the double buffering
technique is adopted into the hardware architectures. Here, when the alignment matrix
computation task is overlapped with the PE configuration in a folded systolic array, the
overall throughput of the core is significantly increased. This is due to the bound PE
configuration time and the parallel PE configuration approach irrespective of the number of
PEs in a systolic array. In addition, the use of only two configuration elements in the PE optimizes hardware resources and enables the scalability of PE systolic arrays without
relying on restricted onboard memory resources. Finally, a new performance metric is
devised, which facilitates the effective comparison of design performance between different
FPGA devices and families. The normalized performance indicator (speed-up per area per
process technology) takes out advantages of the area and lithography technology of any
FPGA resulting in fairer comparisons.
The cores have been designed using Verilog HDL and prototyped on the Alpha Data
ADM-XRC-5LX card with the Virtex-5 XC5VLX110-3FF1153 FPGA. The implementation
results show that the proposed architectures achieved giga cell updates per second (GCUPS)
performances of 26.8, 29.5 and 24.2 respectively for the acceleration of the Smith-Waterman
with affine gap penalty algorithm, the profile HMM algorithm and the BLAST algorithm. In
terms of speed-up improvements, comparisons were made on performance of the designed
cores against their corresponding software and the reported FPGA implementations. In the
case of comparison with equivalent software execution, acceleration of the optimal
alignment algorithm in hardware yielded an average speed-up of 269x as compared to the
SSEARCH 35 software. For the profile HMM-based sequence alignment, the designed core
achieved speed-up of 103x and 8.3x against the HMMER 2.0 and the latest version of
HMMER (version 3.0) respectively. On the other hand, the implementation of the gapped
BLAST with the two-hit method in hardware achieved a greater than tenfold speed-up
compared to the latest NCBI BLAST software. In terms of comparison against other reported
FPGA implementations, the proposed normalized performance indicator was used to
evaluate the designed architectures fairly. The results showed that the first architecture
achieved more than 50 percent improvement, while acceleration of the profile HMM
sequence alignment in hardware gained a normalized speed-up of 1.34. In the case of the
gapped BLAST with the two-hit method, the designed core achieved 11x speed-up after
taking out advantages of the Virtex-5 FPGA. In addition, further analysis was conducted in
terms of cost and power performances; it was noted that, the core achieved 0.46 MCUPS per
dollar spent and 958.1 MCUPS per watt. This shows that FPGAs can be an attractive
platform for high performance computation with advantages of smaller area footprint as well
as represent economic ‘green’ solution compared to the other acceleration platforms. Higher
throughput can be achieved by redeploying the cores on newer, bigger and faster FPGAs
with minimal design effort