












EFFICIENT ARCHITECTURE OF HETEROGENEOUS FPGA-GPU FOR
3-D MEDICAL IMAGE COMPRESSION
AZLAN BIN MUHARAM 
A thesis submitted in 
fulfillment of the requirement for the award of the 
Degree of Doctor of Philosophy in Electrical Engineering
Faculty of Electrical and Electronic Engineering 
Universiti Tun Hussein Onn Malaysia 

































I gratefully acknowledge everyone who has supported me throughout my Ph.D. work
and the preparation of this thesis. In particular, I would like to thank my
supervisor, Assoc. Prof. Dr. Afandi Ahmad, for accepting me as a student, for his
continuous support of my Ph.D. study and related research, and for his patience,
motivation, and immense knowledge.  His guidance helped me throughout the
research and the writing of this thesis.  I could not have imagined having a better
advisor and mentor for my Ph.D. study.
I would also like to thank Dr. Mohammad Hairol Jabbar and Dr. Ariffudin
Joret for their insightful guidance of my research studies.  I am also grateful to my
fellow labmates in the Reconfigurable Computing for Analytic Acceleration Focus
Group (ReCAA), Noor Huda, Musa, Zarina Tukiran, Muzakir and Faris, for their
patience and support in overcoming the numerous obstacles I faced during my
research.
I must also thank Kolej Komuniti Masjid Tanah (KKMT), especially my
colleagues in the Department of Engineering and Skill, for their support.
I would have achieved nothing without the encouragement and compassion I
received from my understanding wife, Nor Izazaya Mat Desa, my beloved children
Iwana, Uwais and Miqdad, and all of my family.  This thesis is dedicated to them.
Their love,













Advances in three-dimensional (3-D) imaging modalities such as magnetic
resonance imaging (MRI), computed tomography (CT), positron emission
tomography (PET), and ultrasound (US) have generated a massive amount of
volumetric data.  Existing surveys reveal a large gap for further research in
exploiting reconfigurable computing for 3-D medical image compression.  This
research proposes an FPGA-based co-processing solution to accelerate such medical
imaging systems.  The HWT block is implemented on an sbRIO-9632 FPGA board, a
prototyping board built around a Spartan 3 (XC3S2000) chip.  Analysis and
performance evaluation of the 3-D images have been conducted.  Furthermore, a
novel architecture is presented for the context-based adaptive binary arithmetic
coder (CABAC), the advanced entropy coding tool employed by the main and higher
profiles of H.264/AVC.  This research focuses on a GPU implementation of CABAC
and on a comparative study of 3-D medical image compression systems with and
without the discrete wavelet transform (DWT).  Implementation results on MRI and
CT images show that the GPU significantly outperforms a single-threaded CPU
implementation.  Overall, CT and MRI images compressed with DWT outperform
those without DWT in terms of compression ratio, peak signal-to-noise ratio (PSNR)
and latency.  For heterogeneous computing, MRI images of various sizes and
formats, such as JPEG and DICOM, were used.  Evaluation results show that, for each
memory iteration, transfers from GPU to CPU consume more bandwidth.  For a
786,486-byte JPEG image, the bandwidth consumed in both directions tends to
balance.  Bandwidth depends on the transfer size: larger transfers incur more
latency and throughput.  Next, OpenCL is implemented for concurrent tasks on a
dedicated FPGA.  The implementation reveals that OpenCL in batch processing
mode with AOC techniques offers substantial results, where the amount of logic,
area, registers and memory












the kernel block according to the batch number.  Therefore, the number of memory
banks increases periodically with the kernel block.  The comparative study found that
the balanced-tree and loop-unrolling architecture provides better performance, in terms of













The emergence of developments in three-dimensional (3-D) medical imaging has
generated a large amount of volumetric data in the form of 3-D images such as
magnetic resonance imaging (MRI), computed tomography (CT), positron emission
tomography (PET), and ultrasound (US).  A survey of the literature shows a large
gap for further research in exploiting reconfigurable computing for 3-D medical
image compression.  This project proposes an FPGA-based solution to accelerate
such medical imaging systems.  The HWT block is implemented on an sbRIO-9632
FPGA board, a prototyping board with a Spartan 3 (XC3S2000) chip.  Analysis and
performance evaluation of the 3-D images have been carried out.  In addition, a
novel architecture is presented for the context-based adaptive binary arithmetic
coder (CABAC), the advanced entropy coding tool used by the main and higher
profiles of H.264/AVC.  To shorten the execution time, an NVIDIA GeForce 820M
graphics processing unit (GPU) was used.  This study focuses on the use of the GPU
for CABAC and on a comparative study of 3-D systems with and without the
discrete wavelet transform (DWT).  Implementation results on MRI and CT images
using the GPU and the central processing unit (CPU) are presented, showing that
the GPU far exceeds the performance of the CPU.  Overall, the CT and MRI
modalities with DWT perform better in compression ratio, peak signal-to-noise
ratio (PSNR) and latency than images without the DWT process.  For heterogeneous
computing, MRI images of various sizes and formats, such as JPEG and DICOM,
were used.  The evaluation results show that, for each memory iteration, transfers
from GPU to CPU use more bandwidth.  For a size of 786,486 bytes in JPEG format,
both directions use nearly balanced bandwidth.  Bandwidth is relative to the
transfer size; larger sizes take more latency and throughput.  Next,












proves that OpenCL in batch processing mode with the AOC technique offers better
results, where the amount of logic, area, registers and memory increases in
proportion to the number of batches.  This is because the kernel copies the kernel
block according to the batch number.  In parallel, the memory banks increase
periodically in proportion to the kernel block.  The comparative study found that
the balanced-tree and loop-unrolling architecture provides performance



















TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS
LIST OF APPENDICES
CHAPTER 1 INTRODUCTION
1.1 Overview
1.2 Problem Statement
1.3 Three-Dimensional (3-D) Medical Image Processing
1.3.1 Computed Tomography (CT)
1.3.2 Positron Emission Tomography (PET)
1.3.3 Magnetic Resonance Imaging (MRI)
1.3.4 Ultrasound (US)
1.4 High-Performance Solution for Medical Image Processing Application












1.4.2 Special Purpose Application-Specific Integrated Circuit (ASIC) Hardware
1.4.3 Graphics Processing Unit (GPU)
1.4.4 Field-Programmable Gate Array (FPGA)
1.5 Design and Implementation Strategies
1.6 Motivation and Research Objectives
1.7 Overall Contributions
1.8 Thesis Organization
CHAPTER 2 RELATED WORK
2.1 Overview
2.2 3-D Medical Image Compression
2.3 Reconfigurable Architecture
2.3.1 FPGA-based Implementation of 3-D HWT using Haar Filter
2.3.2 GPU-based Implementation of a 3-D Medical Images Compression System using CABAC
2.3.3 Efficient Implementation of a Heterogeneous Computing with OpenCL for Medical Image Modalities
2.3.4 Efficient FPGA Implementation of OpenCL on Concurrent Task via High-Level Synthesis
2.4 Limitation of Existing Work and Research Opportunities
2.5 Summary
CHAPTER 3 FPGA-BASED IMPLEMENTATION OF 3-D HAAR WAVELET TRANSFORM
3.1 Overview
3.2 Mathematical Background and Design Methodology
3.2.1 3-D HWT Algorithm
3.2.2 Discrete Wavelet Transform (DWT) with Haar Filter
3.3 Proposed System Architecture












3.4 Results and Analysis
3.4.1 Functional Simulation
3.4.2 Hardware Implementation
3.5 Summary
CHAPTER 4 GPU-BASED IMPLEMENTATION OF CABAC FOR 3-D MEDICAL IMAGE COMPRESSION
4.1 Overview
4.2 Algorithm and Methodology
4.2.1 3-D HWT
4.3 Proposed System Architectures
4.3.1 Context-based Adaptive Binary Arithmetic Coding (CABAC) Block
4.3.2 GPU Programming in MATLAB
4.3.3 Parallelization on CPU-GPU Computing in MATLAB
4.3.4 Experimental Procedure
4.4 Results and Analysis
4.4.1 Compression Efficiency
4.4.2 Speed up
4.4.3 Comparison with Previous Work
4.4.4 Haar Filters Performance
4.4.5 Gray Scale Performance
4.5 Summary
CHAPTER 5 HETEROGENEOUS COMPUTING ON TRANSFER SIZE FOR 3-D MEDICAL IMAGE MODALITIES WITH OpenCL
5.1 Overview
5.2 Algorithm and Design Methodology
5.2.1 Hardware Configuration
5.2.2 OpenCL Algorithm and GPU Programming
5.3 Proposed System Architecture
5.3.1 Experimental Setup












5.3.3 GPU Computing
5.3.4 Bandwidth
5.3.5 Theoretical Bandwidth Calculation
5.3.6 Effective Bandwidth Calculation
5.3.7 Data Transfer between Host and Device
5.4 Result and Analysis
5.4.1 CPU-GPU Implementation
5.4.2 Small Memory Iteration
5.4.3 Large Memory Iteration
5.4.4 3-D Medical Images Memory Iteration
5.5 Summary
CHAPTER 6 EFFICIENT FPGA IMPLEMENTATION OF OpenCL ON CONCURRENT TASK VIA HIGH-LEVEL SYNTHESIS
6.1 Overview
6.2 Architecture and Design Methodology
6.2.1 Architecture of the Intel FPGA SDK for OpenCL FPGA Programming Flow
6.2.2 OpenCL FPGA Programming Flow
6.3 Proposed System Architecture
6.3.1 Proposed System Architecture
6.3.2 Optimizing Floating-Point Operations
6.3.3 Unrolling Loops
6.3.4 Experimental Setup and Design Test
6.4 Experimental Results and Analysis
6.4.1 OpenCL on FPGA Implementation
6.4.2 Discussion
6.5 Summary
CHAPTER 7 CONCLUSION AND FUTURE WORK
7.1 Overview
7.2 Achievements
7.3 Limitations

























LIST OF TABLES 
1.1 Possible processing function of GPU and FPGA.
2.1 Summary of 3-D medical image compression.
3.1 Overall proposed architecture’s performance and comparison with previous works.
4.1 CPU configuration.
4.2 GPU configuration.
4.3 Overall performance and comparison of different modalities with DWT.
4.4 Overall performance and comparison with different modalities without DWT.
4.5 GPU and CPU performance on Time, GPU speedup ratios, and comparison with different size (in seconds).
4.6 Comparison of the proposed method and previous work.
5.1 GPU and CPU specification.
5.2 Transfer bandwidths for CPU/GPU as a function of small size.
5.3 Transfer bandwidths for GPU/CPU as a function of small size.
5.4 Transfer bandwidths for CPU/GPU as a function of large size.
5.5 Transfer bandwidths for GPU/CPU as a function of large size.
5.6 3-D MRI images (JPEG format).
5.7 3-D MRI images (DICOM format).
6.1 CPU configuration.












6.3 Resources utilisation and overall proposed architectures performance towards batch processing mode.
6.4 Comparison of total area analysis of kernel system and source towards tree balance and unroll loop.
6.5 Comparison of throughput and latency analysis of kernel system towards tree balance and unroll loop.
6.6 Comparison and analysis of total usage of memory in local memory and source towards tree balance and












LIST OF FIGURES 
1.1 System architecture for 3-D HWT with transpose-based computation.
1.2 Medical image features.
1.3 Electron beam tomography [17].
1.4 Annihilation reaction and the subsequent coincidence detection [21].
1.5 MRI hardware overview.
1.6 Basic component of US.
1.7 GPGPU programming model, consisting of the host (CPU) and device (GPU) [34], [64].
1.8 GPGPU processor and memory hierarchy [34].
1.9 Basic FPGA structure.
1.10 Generic design and implementation strategies.
1.11 Overall research strategies with potential contribution to be achieved.
2.1 Structure of related research issues.
2.2 JPEG block diagram [70].
2.3 Block diagram of new transformation method [82].
2.4 Block diagram of hybrid approach [83].
2.5 Block diagram of the method [23].
2.6 The overall flow of the proposed image compression approach [84].
2.7 Block diagram of vector quantisation [86].
2.8 (a) General procedure for binary AC. (b) CABAC procedure [105].
2.9 General block diagram of CABAC [106].
2.10 Trend of implementation of software and hardware.
3.1 Matrix for 8 × 8 image.












3.3 DWT tree with low and high filter.
3.4 Proposed system architectures. (a) Compression system overview (b) Architecture for 3-D Haar with transpose-based computation (c) Input data for sub-images (d) Output data for sub-images.
3.5 The front panel of the host system application for the proposed architectures.
3.6 The block diagram for the host system application.
3.7 FPGA VI block diagram for VHDL configuration via CLIP for HWT process.
3.8 Implementation result with (a) MRI image (b) CT image.
4.1 (a) Framework of compression system (b) Architecture for 3-D Haar with transpose-based computation (c) Input data for sub-images (d) Output data for sub-images.
4.2 The algorithm of encoding and decoding for CABAC without DWT.
4.3 The algorithm of encoding and decoding for CABAC with DWT.
4.4 Proposed system architecture.
4.5 Typical CABAC block diagram with three (3) stages of process.
4.6 Binary AC example.
4.7 Comparison of the number of cores on a CPU system and a GPU.
4.8 Overview of thread batching.
4.9 MATLAB code on parallelization scheme.
4.10 Comparison of PSNR with different modalities on GPU and CPU where G/S means grey scale.
4.11 Comparative study of original and reconstructed CT and MRI images for the first slices with DWT.












5.1 Define project, data transfer size and iteration.
5.2 GPU context, command queue and device id.
5.3 Define platform and devices.
5.4 Run test code (Host and Device).
5.5 Direct API access to device buffer.
5.6 Calculated bandwidth in second.
5.7 Command queue for selected device.
5.8 Get and log the device info.
5.9 Released memory or clear memory on host and GPU.
5.10 OpenCL heterogeneous platform.
5.11 Latency concealing procedure embraced by GPUs.
5.12 Performance of small CPU to GPU transfer.
5.13 Performance of GPU to CPU for small transfer.
5.14 Performance of large CPU to GPU transfer.
5.15 Performance of large GPU to CPU transfer.
5.16 Performance of 3-D medical images transfer size for JPEG format.
5.17 Performance of 3-D medical images transfer size for DICOM format.
6.1 Heterogeneous model: devices receive input data and instructions to compute results.
6.2 Example of kernel generation and command queues code.
6.3 Altera OpenCL platform.
6.4 AOCL based FPGA design programming flow.
6.5 Proposed system architecture on concurrent task on FPGA.
6.6 Task to write the buffer on DDR3.
6.7 Task to execute the kernel code on FPGA.
6.8 Task to send the resultant kernels to Host.
6.9 Task for concurrent process: execute kernels on FPGA and send results to Host.
6.10 Default floating-point implementation.












6.12 Kernel program without unroll loop.
6.13 Balanced tree floating-point implementation.
6.14 Developer command prompt for VS 2017 during the execution of AOC command.
6.15 Results in HTML file.
6.16 System viewer for the full computation.












LIST OF SYMBOLS AND ABBREVIATIONS 
1-D - One-dimensional
2-D - Two-dimensional
3-D - Three-dimensional
ASIC - Application specific integrated circuit
AES - Advanced encryption standard
ALU - Arithmetic logic unit
ALCM - Activity level classification model
AC - Arithmetic coding
API - Application programming interface
bpp - bits per pixel
CR - Compression ratio
CLIP - Component level IP
CABAC - Context-based adaptive binary arithmetic coder
CAVLC - Context-adaptive variable length coding
CT - Computed tomography
CPU - Central processing unit
CAD - Computer-aided design
CUDA - Compute unified device architecture
DWT - Discrete wavelet transform
DA - Distributed arithmetic
DCT - Discrete cosine transform
DFT - Discrete fractional Fourier transform
DICOM - Digital imaging and communications in medicine
DSP - Digital signal processor
DMA - Direct memory access
Daub4 - Daubechies 4-tap
EDT - Enhanced DPCM transformation
FFT - Fast Fourier transform












FPGA - Field programmable gate array
GPU - Graphics processing unit
GPGPU - General purpose GPU
HDL - Hardware description language
HWT - Haar wavelet transform
HPC - High performance computing
IOB - Input-output block
IDWT - Inverse discrete wavelet transform
i/o - input/output
JPEG - Joint photographic experts group
LabVIEW - Laboratory virtual instrumentation engineering workbench
LUT - Look-up table
LOR - Line of response
LPS - Least probable symbol
MRI - Magnetic resonance imaging
MPS - Most probable symbol
MSE - Mean square error
MAC - Multiply-accumulate
MBI - Multi frequency bi-planar inter-imaging
NPP - NVIDIA performance primitive
OpenCL - Open computing language
PET - Positron emission tomography
PSNR - Peak signal to noise ratio
PCM - Prediction coding model
PCI-e - Peripheral component interconnect express
PE - Processing element
RAM - Random access memory
rLPS - Range least probable symbol
RH - Reconfigurable hardware
ROM - Read-only memory
ROI - Region of interest
RTL - Register transfer level












SIMT - Single instruction, multiple thread
SP - Streaming processor
SPIHT - Set partitioning in hierarchical trees
SEC - System error compensation
SA - Systolic array
SP - Symmetry processor
SIMD - Single instruction, multiple data
SPMT - Single program, multiple threads
US - Ultrasound
VHDL - VHSIC Hardware Description Language
VI - Virtual instrument
Xilinx ISE - Xilinx integrated software development



























LIST OF APPENDICES 
APPENDIX TITLE
A Xilinx ISE and FPGA program
B List of Publication


















Recently, the tremendous development of three-dimensional (3-D) imaging
modalities such as computed tomography (CT), positron emission tomography
(PET), magnetic resonance imaging (MRI), and ultrasound (US) has produced an
enormous amount of volumetric data.  This has given rise to various applications,
specifically in telemedicine and teleradiology.  Accordingly, efficient storage of this
data and its transmission over high-bandwidth digital communication lines are
significant concerns in medical image compression [1], [2].
1.1 Overview 
Most 3-D medical imaging algorithms belong to the transform-based methods and
rely on matrix transformations and other common fundamental operations.
Therefore, high-performance systems are needed, while the architectures must
remain flexible enough to allow quick upgrades for real-time applications [3].  An
efficient implementation of these operations is therefore important for obtaining
efficient solutions for large volumes of medical data.
Previous work has focused on the fundamental matrix and vector operations
of the algorithms used in real-time medical image processing [1], [4].  Most of
these operations are matrix transforms, including the discrete wavelet transform
(DWT), the fast Fourier transform (FFT) and some recently developed transforms
such as the curvelet, finite Radon, and ridgelet transforms, which are used in













Wavelets are localized waves.  The DWT permits a signal to be localized in both
time and frequency [5], so the wavelet transform provides a time-frequency
representation of the signal.  The DWT was developed to overcome the
shortcomings of the short-time Fourier transform (STFT), and it can also be used to
analyze non-stationary signals.  Wavelets have their energy concentrated in time or
space and are suited to the analysis of transient signals.  The Fourier transform and
the STFT use waves to analyze signals, whereas the wavelet transform uses
wavelets of finite energy.  At high frequencies, the wavelet transform gives good
time resolution and poor frequency resolution, while at low frequencies it gives
good frequency resolution and poor time resolution.
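The time localization described above can be illustrated with a minimal sketch of a single-level, one-dimensional Haar DWT in Python.  This is an illustration only, not the architecture developed in this thesis: the function names `haar_dwt_1d` and `haar_idwt_1d` and the test signal are assumptions introduced here, and the orthonormal scaling by 1/sqrt(2) is one common convention among several.

```python
import math

def haar_dwt_1d(signal):
    """One level of the orthonormal Haar DWT on an even-length sequence.

    Returns (approx, detail): scaled pairwise sums (low-pass) and
    scaled pairwise differences (high-pass).
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: rebuilds the original samples pair by pair."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * s)
        out.append((a - d) * s)
    return out

# Hypothetical test signal: constant except for one short transient.
x = [4.0, 4.0, 4.0, 4.0, 4.0, 9.0, 4.0, 4.0]
approx, detail = haar_dwt_1d(x)
reconstructed = haar_idwt_1d(approx, detail)
```

In this sketch only the detail coefficient covering the pair that contains the transient is non-zero, which is the sense in which the wavelet representation localizes a transient in time, and the inverse transform recovers the input up to floating-point rounding.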
  Presently, other approaches such as networks of computers have been used to
implement 3-D transforms, but a chip dedicated to the transform is expected to give
much better results.  Despite its complexity, there has been interest in implementing
the 3-D DWT on many platforms.  Existing surveys show that the research can be
classified into three (3) categories: architecture development [6], architecture
with field programmable gate array (FPGA) implementation [7], and
architecture implemented on other silicon platforms [8].
As the existing research shows [8], a gap remains for further exploration of
reconfigurable computing for 3-D medical image compression, and two (2) major
limitations can be identified:
i. Current 3-D DWT implementations have not been thoroughly exploited for
medical image compression.  The Daubechies filter has been widely used in
several implementations [6], [9], while the Haar, Symlet, Coiflet and
Biorthogonal filters remain open for further experimentation; and
ii. Image compression is a well-established research area.  However,
medical image compression, especially for 3-D modalities, is still
considered an immature research area.  Moreover, many new compression
methods have been proposed, but very few hardware implementations of
3-D medical image compression have been explored [1], [3].
Based on these limitations of existing work, this project is concerned with
developing an efficient architecture for 3-D DWT that will be used in a reconfigurable













[1] A. Arthur and V. Saravanan, “Efficient medical image compression technique 
for telemedicine considering online and offline application,” 2012 Int. Conf. 
Comput. Commun. Appl. ICCCA 2012, 2012. 
[2] D. U. Shah and C. H. Vithlani, “Efficient implementations of discrete wavelet 
transforms using FPGAs,” Int. J. Adv. Eng. Technol., vol. 1, no. 4, pp. 100–
111, 2011. 
[3] A. Ahmad, B. Krill, A. Amira, and H. Rabah, “Efficient architectures for 3D 
HWT using dynamic partial reconfiguration,” J. Syst. Archit., vol. 56, pp. 
305–316, 2010. 
[4] A. Ahmad and A. Amira, “Efficient Reconfigurable Architectures for 3D 
Medical Image Compression,” Int. Conf. Field-Programmable Technol., pp. 
472–474, 2009. 
[5] P. Nicholl and A. Ahmad, “Optimal Discrete Wavelet Transform (DWT)
Features for Face Recognition,” no. December, pp. 6–9, 2010. 
[6] M. Weeks and M. Bayoumi, “3D discrete wavelet transform architectures,” 
ISCAS ’98. Proc. 1998 IEEE Int. Symp. Circuits Syst. (Cat. No.98CH36187), 
vol. 4, 1998. 
[7] B. Das and S. Banerjee, “A memory efficient 3-D DWT architecture,” in 16th 
International Conference on VLSI Design, Proceedings., 2003. 
[8] S. M. Ismail, A. E. Salama, and M. F. Abu-ElYazeed, “FPGA Implementation 
of an Efficient 3D-WT Temporal Decomposition Algorithm for Video 
Compression,” in 2007 IEEE International Symposium on Signal Processing 
and Information Technology, 2007. 
[9] R. M. Jiang and D. Crookes, “FPGA implementation of 3D discrete wavelet 
transform for real-time medical imaging,” in 2007 18th European Conference 
on Circuit Theory and Design, 2007. 
[10] P. Govindan, T. Gonnot, S. Gilliland, and J. Saniie, “3D ultrasonic signal 
compression algorithms for high signal fidelity,” in Midwest Symposium on 













[11] B. Wang, P. Govindan, T. Gonnot, and J. Saniie, “Acceleration of ultrasonic 
data compression using OpenCL on GPU,” in IEEE International Conference 
on Electro Information Technology, 2015. 
[12] M. A. M. Salem, M. Appel, F. Winkler, and B. Meffert, “FPGA-based smart 
camera for 3D wavelet-based image segmentation,” in 2008 2nd ACM/IEEE 
International Conference on Distributed Smart Cameras, ICDSC 2008, 2008. 
[13] G. Z. G. Zhang, M. Talley, W. Badawy, M. Weeks, and M. Bayoumi, “A low 
power prototype for a 3D discrete wavelet transform processor,” ISCAS’99. 
Proc. 1999 IEEE Int. Symp. Circuits Syst. VLSI (Cat. No.99CH36349), vol. 1, 
1999. 
[14] Y. Kuang, G. Pratx, M. Bazalova, B. Meng, J. Qian, and L. Xing, “First 
demonstration of multiplexed X-Ray Fluorescence Computed Tomography 
(XFCT) imaging,” IEEE Trans. Med. Imaging, vol. 32, no. 2, pp. 262–267, 
2013. 
[15] M. Alemán-Flores, L. Alvarez, P. Alemán-Flores, and R. Fuentes-Pavon,
“Segmentation of Computed Tomography 3D Images Using Partial 
Differential Equations,” 2011 Seventh Int. Conf. Signal Image Technol. 
Internet-Based Syst., no. 3, pp. 345–349, 2011. 
[16] L. W. Goldman, “Principles of CT and CT Technology,” no. September, pp.
115–129, 2009. 
[17] S. Tabakov, “Basic Principles of CT scanners and image reconstruction,” The 
Abdul Salam International Centre for Theoretical Physics, 2010. 
[18] A. L. R. Monteiro, A. M. C. Machado, and M. H. M. Lewer, “A multicriteria 
method for cervical tumor segmentation in Positron Emission Tomography,” 
Proc. - IEEE Symp. Comput. Med. Syst., pp. 205–208, 2014. 
[19] C. Lartizien, S. Marache-Francisco, and R. Prost, “Automatic detection of 
lung and liver lesions in 3-D positron emission tomography images: A pilot 
study,” IEEE Trans. Nucl. Sci., vol. 59, no. 1 PART 1, pp. 102–112, 2012. 
[20] D. B. Keator and A. Ihler, “Crystal Identification in Positron Emission 
Tomography Using Probabilistic Graphical Models,” vol. 62, no. 5, pp. 2102–
2112, 2015. 
[21] A. A. M. Van Der Veldt, E. F. Smit, and A. A. Lammertsma, “Positron 
emission tomography as a method for measuring drug delivery to tumors in 












[22] X. Qi and J. M. Tyler, “A progressive transmission capable diagnostically
lossless compression scheme for 3D medical image sets,” Inf. Sci. (Ny)., vol. 
175, no. 3, pp. 217–243, Oct. 2005. 
[23] S. Amraee, N. Karimi, S. Samavi, and S. Shirani, “Compression of 3D MRI 
images based on symmetry in prediction-error field,” Proc. - IEEE Int. Conf. 
Multimed. Expo, 2011. 
[24] Y. C. Guo, L. Q. He, and S. B. Wang, “Soft-Threshold De-noising Method of 
Medical Ultrasonic Image Based on PCNN,” Artif. Intell. Comput. Intell. 
2009. AICI ’09. Int. Conf., vol. 3, no. 1, pp. 554–558, 2009. 
[25] V. Radha, “A Comparative Study on ROI-Based Lossy Compression 
Techniques for Compressing Medical Images,” vol. I, 2011. 
[26] L. Santos, E. Magli, R. Vitulli, J. F. Lopez, and R. Sarmiento, “Highly-parallel 
gpu architecture for lossy hyperspectral image compression,” IEEE J. Sel. 
Top. Appl. Earth Obs. Remote Sens., 2013. 
[27] D. Signal, “Design of the Assembler,” pp. 1483–1497, 1981. 
[28] J. Boddie, “A Brief History of AT&T’s First Digital Signal Processor,” pp. 
14–18, 2017. 
[29] M. Muthulakshmi, J. R. Heath, K. L. Calvert, and J. Griffioen, “ESP: A
Flexible, High-Performance, PLD-Based Network Service,” vol. 0, no. c, pp.
1014–1018, 2004. 
[30] R. Kumar, “Evolution of Management Processes for ASIC Development and 
Implementation,” 2006 IEEE Int. Eng. Manag. Conf., pp. 292–296, 2006. 
[31] L. Ronghua, Z. Xiaoyang, H. Jun, G. Yehua, and M. Lang, “Design and VLSI 
implementation of a security ASIP,” ASICON 2007 - 2007 7th Int. Conf. ASIC 
Proceeding, pp. 866–869, 2007. 
[32] J. P. Farrugia, P. Horain, E. Guehenneux, and Y. Alusse, “GPUCV: A 
framework for image processing acceleration with graphics processors,” 2006 
IEEE Int. Conf. Multimed. Expo, ICME 2006 - Proc., vol. 2006, pp. 585–588, 
2006. 
[33] A. Peternier, M. Defilippi, P. Pasquali, and A. Cantone, “Performance 
Analysis of GPU-based SAR and Interferometric SAR image processing,” 














[34] M. Cavus, D. Sumerkan, O. S. Simsek, and H. Hassan, “GPU based Parallel 
Image Processing Library for Embedded Systems,” Int. Conf. Comput. Vis. 
Theory Appl., pp. 234–241, 2014. 
[35] A. Asaduzzaman, A. Martinez, and A. Sepehri, “A Time-Efficient Image 
Processing Algorithm for Multicore/Manycore Parallel Computing.” 
[36] S. Mittal and J. S. Vetter, “A Survey of CPU-GPU Heterogeneous Computing 
Techniques,” ACM Comput. Surv., vol. 47, no. 2, pp. 1–36, 2015. 
[37] J. G. C. Panappally and M. S. Dhanesh, “Design of graphics processing unit 
for image processing,” First Int. Conf. Comput. Syst. Commun., no. 12, pp. 
299–302, 2014. 
[38] L. Fang, M. Wang, H. Ying, and F. Hu, “Multi-GPU based near real-time 
preprocessing and releasing system of optical satellite images,” Int. Geosci. 
Remote Sens. Symp., pp. 2467–2470, 2014. 
[39] Y. Liu, B. Chen, H. Yu, Y. Zhao, Z. Huang, and Y. Fang, “Applying GPU and 
POSIX thread technologies in massive remote sensing image data processing,” 
in Proceedings - 2011 19th International Conference on Geoinformatics, 
Geoinformatics 2011, 2011. 
[40] J. Bodily, J. Chase, B. Nelson, D.-J. Lee, and Z. Wei, “A Comparison Study 
on Implementing Optical Flow and Digital Communications on FPGAs and 
GPUs,” ACM Trans. Reconfigurable Technol. Syst., vol. 3, no. 2, pp. 1–22, 
2010. 
[41] A. Ruiz, M. Ujaldón, J. A. Andrades, J. Becerra, K. Huang, T. Pan, and J. 
Saltz, “The GPU on biomedical image processing for color and phenotype 
analysis,” Proc. 7th IEEE Int. Conf. Bioinforma. Bioeng. BIBE, pp. 1124–
1128, 2007. 
[42] Y. Tan and K. Ding, “A Survey on GPU-Based Implementation
of Swarm Intelligence Algorithms,” IEEE Trans. Cybern., pp. 1–14, 2015. 
[43] F. Xu and K. Mueller, “Towards a unified framework for rapid 3D computed 
tomography on commodity GPUs,” IEEE Nucl. Sci. Symp. Med. Imaging 
Conf., vol. 4, pp. 2757–2759, 2003. 
[44] S. Ha, M. Ispiryan, S. Matej, and K. Mueller, “GPU-Based spatially variant 
SR kernel modeling and projections in 3D DIRECT TOF PET 













[45] K. Mueller and F. Xu, “Practical considerations for GPU-accelerated CT,” 3rd 
IEEE Int. Symp. Biomed. Imaging Nano to Macro, 2006, pp. 1184–1187, 
2006. 
[46] S. Kinouchi, T. Yamaya, E. Yoshida, H. Tashima, H. Kudo, H. Haneishi, and 
M. Suga, “GPU-Based PET Image Reconstruction Using an Accurate 
Geometrical System Model,” IEEE Trans. Nucl. Sci., vol. 59, no. 5, pp. 1977–
1983, 2012. 
[47] M. Xiao, H. Deng, F. Wang, and K. Ji, “A survey on GPU techniques in 
astronomical data processing,” Proc. - 2013 Int. Conf. Comput. Sci. Appl. CSA 
2013, pp. 206–209, 2013. 
[48] H. L. L. Khor, S. C. Liew, and J. M. Zain,
“A review on parallel medical image processing on GPU,” 2015 4th
Int. Conf. Softw. Eng. Comput. Syst. ICSECS 2015 Virtuous Softw. Solut. Big 
Data, pp. 45–48, 2015. 
[49] P. P. Shete, P. P. K. Venkat, D. M. Sarode, M. Laghate, S. K. Bose, and R. S. 
Mundada, “Object oriented framework for CUDA based image processing,” 
Proc. - 2012 Int. Conf. Commun. Inf. Comput. Technol. ICCICT 2012, pp. 1–
6, 2012. 
[50] Z. Juhasz and G. Kozmann, “A GPU-based simultaneous real-time EEG 
processing and visualization system for brain imaging applications,” in 2015 
38th International Convention on Information and Communication 
Technology, Electronics and Microelectronics, MIPRO 2015 - Proceedings, 
2015. 
[51] J. R. Ferreira, M. C. Oliveira, and A. L. Freitas, “Performance Evaluation of 
Medical Image Similarity Analysis in a Heterogeneous Architecture,” 2014 
IEEE 27th Int. Symp. Comput. Med. Syst., pp. 159–164, 2014. 
[52] C. Dai and J. Yang, “Research on orthorectification of remote sensing images 
using GPU-CPU cooperative processing,” in 2011 International Symposium 
on Image and Data Fusion, ISIDF 2011, 2011. 
[53] S. Philip, B. Summa, V. Pascucci, and P. T. Bremer, “Hybrid CPU-GPU 
solver for gradient domain processing of massive images,” Proc. Int. Conf. 














[54] V. Q. Dang, E. El-Araby, L. H. Dao, and L.-C. Chang, “Accelerating 
nonlinear diffusion tensor estimation for medical image processing using high 
performance GPU clusters,” IEEE 24th Int. Conf. Appl. Syst. Archit. Process., 
pp. 265–268, 2013. 
[55] B. Goossens, J. De Vylder, and W. Philips, “Quasar: a new heterogeneous 
programming framework for image and video processing algorithms on CPU 
and GPU,” Proc. IEEE Int. Conf. Image Process., pp. 2183–2185, 2014. 
[56] K. Mueller, “Accelerating regularized iterative CT reconstruction on 
commodity graphics hardware (GPU),” in IEEE International Symposium 
on Biomedical Imaging: From Nano to Macro, 2009, pp. 1287–1290. 
[57] G. V. Stoica, R. Dogaru, and E. C. Stoica, “Speeding-up Image Processing in 
Reaction-Diffusion Cellular Neural Networks using CUDA-enabled GPU 
Platforms,” Int. Conf. – 6th Ed. Electron. Comput. Artif. Intell., vol. 2, no. 2, 
2014. 
[58] Z. Zheng and K. Mueller, “Cache-aware GPU memory scheduling scheme for 
CT back-projection,” in Nuclear Science Symposium Conference Record 
(NSS/MIC), 2010, pp. 2248–2251. 
[59] R. Shams, P. Sadeghi, R. A. Kennedy, and R. I. Hartley, “A Survey of Medical 
Image Registration on Multicore and the GPU,” IEEE Signal Process. Mag., 
vol. 27, no. March, pp. 50–60, 2010. 
[60] F. Xu, W. Xu, M. Jones, B. Keszthelyi, J. Sedat, D. Agard, and K. Mueller, 
“On the efficiency of iterative ordered subset reconstruction algorithms for 
acceleration on GPUs,” Comput. Methods Programs Biomed., 2010. 
[61] S. Ha, J. Pi, and K. Mueller, “GPU-Accelerated First-Order Scattering 
Simulation for X-Ray CT Image Reconstruction,” 2nd Int. Conf. Image Form. 
X-ray Comput. Tomogr., pp. 1–5, 2012. 
[62] A. Gregerson, “Implementing Fast MRI Gridding on GPUs via CUDA,” 2010. 
[63] P. Meng, G. R. Cutter Jr., R. Kastner, and D. A. Demer, “GPU Accelerated 
Post-Processing for Multifrequency Biplanar Interferometric Imaging,” OCEANS - 
San Diego, 2013. 
[64] L. Shi, W. Liu, H. Zhang, Y. Xie, and D. Wang, “A survey of GPU-based 
medical image computing techniques,” Quant. Imaging Med. Surg., vol. 2, no. 













[65] Y. Kui-Ying, J. Lin, Y. Jun-Peng, and X. Lu-Ping, “Processing piecewise 
autoregressive model image interpolation algorithm on GPU with CUDA,” in 
2011 International Conference on Wireless Communications and Signal 
Processing (WCSP), 2011, pp. 1–4. 
[66] J. Bert, H. Perez-Ponce, S. Jan, Z. El Bitar, P. Gueth, V. Cuplov, H. Chekatt, 
D. Benoit, D. Sarrut, Y. Boursier, D. Brasse, I. Buvat, C. Morel, and D. 
Visvikis, “Hybrid GATE: A GPU/CPU implementation for imaging and 
therapy applications,” in IEEE Nuclear Science Symposium Conference 
Record, 2012. 
[67] Y. Han, K. Chakraborty, S. Roy, and V. Kuntamukkala, “Design and 
Implementation of a Throughput-Optimized GPU Floorplanning Algorithm,” 
ACM Trans. Des. Autom. Electron. Syst., vol. 16, no. 3, pp. 1–21, 2011. 
[68] D. Hallmans, K. Sandström, M. Lindgren, and T. Nolte, 
“GPGPU for Industrial Control Systems,” pp. 1–4, 2013. 
[69] M. Sundaresan and E. Devika, “Image compression using H.264 and deflate 
algorithm,” Int. Conf. Pattern Recognition, Informatics Med. Eng., pp. 242–
245, Mar. 2012. 
[70] S. J. Pinto and J. P. Gawande, “Performance analysis of medical image 
compression techniques,” in 3rd Asian Himalayas International Conference 
on Internet (AH-ICI), 2012, pp. 5–8. 
[71] Z. Zuo, X. Lan, L. Deng, S. Yao, and X. Wang, “An improved medical image 
compression technique with lossless region of interest,” Optik (Stuttg)., vol. 
126, no. 21, pp. 2825–2831, 2015. 
[72] M. S. Manikandan and S. Dandapat, “Wavelet-based electrocardiogram 
signal compression methods and their performances: A prospective review,” 
Biomed. Signal Process. Control, vol. 14, no. 1, pp. 73–107, 2014. 
[73] P. Suapang, M. Thongyoun, and S. Chivapreecha, “Medical image 
compression and quality assessment,” Proc. SICE Annu. Conf., pp. 841–846, 
2013. 
[74] H. Zaineldin, M. A. Elhosseini, and H. A. Ali, “Image compression algorithms 
in wireless multimedia sensor networks: A survey,” Ain Shams Eng. J., vol. 6, 














[75] N. Karimi, S. Samavi, S. Shirani, A. Banaei, and E. Nasr-Esfahani, “Real-time 
lossless compression of microarray images by separate compaction of 
foreground and background,” Comput. Stand. Interfaces, vol. 39, pp. 34–43, 
2015. 
[76] X. Cheng, H. Long, W. Chen, J. Xu, Y. Huang, and F. Li, “Three-dimensional 
alteration of cervical anterior spinal artery and anterior radicular artery in rat 
model of chronic spinal cord compression by micro-CT,” Comput. Methods 
Programs Biomed., vol. 37, no. 2, pp. 838–848, 2015. 
[77] B. Koc, Z. Arnavut, and H. Koçak, “The pseudo-distance technique for 
parallel lossless compression of color-mapped images,” Comput. Electr. Eng., 
vol. 46, pp. 456–470, 2015. 
[78] A. M. Rufai, G. Anbarjafari, and H. Demirel, “Lossy Medical Image 
Compression Using Huffman Coding and Singular Value Decomposition,” 
21st Signal Process. Commun. Appl. Conf., 2013. 
[79] Y. Nian, M. He, and J. Wan, “Distributed near lossless compression algorithm 
for hyperspectral images,” Comput. Electr. Eng., vol. 40, no. 3, pp. 1006–
1014, 2014. 
[80] T. G. Shirsat and V. K. Bairagi, “Lossless medical image compression by IWT 
and predictive coding,” in 2013 International Conference on Energy Efficient 
Technologies for Sustainability, 2013, pp. 1279–1283. 
[81] Y. Nian, M. He, and J. Wan, “Lossless and near-lossless compression of 
hyperspectral images based on distributed source coding,” J. Vis. Commun. 
Image Represent., vol. 28, pp. 113–119, 2015. 
[82] F. Sepehrband, M. Mortazavi, S. Ghorshi, and J. Choupan, “Simple lossless 
and near-lossless medical image compression based on enhanced DPCM 
transformation,” IEEE Pacific RIM Conf. Commun. Comput. Signal Process. - 
Proc., pp. 66–72, 2011. 
[83] R. Pizzolante, B. Carpentieri, and A. Castiglione, “A secure low complexity 
approach for compression and transmission of 3-D medical images,” Proc. - 
2013 8th Int. Conf. Broadband, Wirel. Comput. Commun. Appl. BWCCA 
2013, pp. 387–392, 2013. 
[84] K. T. Kumari and P. Vasanthi, “A Secure Fast 2D-Discrete Fractional Fourier 
Transform Based Medical Image Compression Using Hybrid Encoding 












[85] M. Satti and S. Kak, “Multilevel indexed quasigroup encryption for data and 
speech,” IEEE Trans. Broadcast., vol. 55, pp. 270–281, 2009. 
[86] T. Phanprasit, “Compression of Medical Image Using Vector Quantization,” 
no. Clc, pp. 0–3, 2013. 
[87] C. July, “Accelerating Haar wavelet transform with CUDA-GPU (July 2017),” 
in 2017 13th International Conference on Natural Computation, Fuzzy 
Systems and Knowledge Discovery (ICNC-FSKD), 2017, no. July, pp. 791–
796. 
[88] S. D. Thepade, “Partial Energy of Hybrid Wavelet Transformed Videos for 
Content Based Video Retrieval with various Similarity measures using Cosine, 
Haar and Walsh Transforms,” in 2015 Global Conference on Communication 
Technologies (GCCT), 2015, pp. 261–266. 
[89] G. Saldana and M. Arias-Estrada, “Compact FPGA-based systolic array 
architecture suitable for vision systems,” J. High Perform. Syst. Archit., pp. 3–
8, 2007. 
[90] S. L. Pinjare, K. Mudnaf, and S. Kumar, “Distributed arithmetic multiplier 
based artificial neural network architecture for image compression,” 2nd IEEE 
Int. Conf. Parallel, Distrib. Grid Comput., pp. 135–140, 2012. 
[91] A. N. Sazish and A. Amira, “An efficient architecture for HWT using sparse 
matrix factorisation and DA principles,” in IEEE Asia-Pacific Conference on 
Circuits and Systems, Proceedings, APCCAS, 2008, pp. 1308–1311. 
[92] M. Vucha, “Design and FPGA Implementation of Systolic Array Architecture 
for Matrix Multiplication,” vol. 26, no. 3, pp. 18–22, 2011. 
[93] Y. Zhou and P. Shi, “Distributed arithmetic for FIR filter implementation on 
FPGA,” 2011 Int. Conf. Multimed. Technol. ICMT 2011, vol. 1, no. 4, pp. 
294–297, 2011. 
[94] L. Wenna, G. Yang, Y. Yufeng, and G. Liqun, “Medical image coding based 
on wavelet transform and distributed arithmetic coding,” 2011 Chinese 
Control Decis. Conf., pp. 4159–4162, 2011. 
[95] M. Martina, G. Masera, M. R. Roch, and G. Piccinini, “Result-biased 
distributed-arithmetic-based filter architectures for approximately computing 














[96] A. M. Al-Haj, “Fast Discrete Wavelet Transformation Using FPGAs and 
Distributed Arithmetic,” Int. J. Appl. Sci. Eng., vol. 1, no. 2, pp. 160–171, 
2003. 
[97] M. Nagabushanam, “Design and FPGA implementation of modified 
distributive arithmetic based DWT-IDWT processor for image compression,” in 
International Conference on Communications and Signal Processing 
(ICCSP), 2011, pp. 1–4. 
[98] A. Otero, Y. E. Krasteva, E. De La Torre, and T. Riesgo, “Generic systolic 
array for run-time scalable cores,” Lect. Notes Comput. Sci. (including Subser. 
Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 5992 LNCS, pp. 4–
16, 2010. 
[99] P. C. Chandrasekhar and S. N. Reddy, “FPGA Implementation of Systolic 
Array Architecture for 3D-DWT Optimizing Speed and Power,” IOSR J. 
Eng., vol. 2, no. 10, pp. 39–50, 2012. 
[100] M. M. Azadfar, “Implementation of A Optimized Systolic Array Architecture 
for FSBMA using FPGA for Real-time Applications,” Int. J. Comput. Sci. Netw. Secur., vol. 8, no. 3, p. 
46, 2008. 
[101] B. K. Saptalakar, M. Rachannavar, and M. K. Pavankumar, “Design and 
Implementation of VLSI Systolic Array Multiplier for DSP Applications,” Int. 
J. Sci. Eng. Technol., vol. 2, no. 3, pp. 156–159, 2013. 
[102] Y. Sun, P. Li, G. Gu, Y. Wen, Y. Liu, and D. Liu, “Accelerating HMMer on 
FPGAs using Systolic Array Based Architecture,” IPDPS 2009 - 2009 IEEE 
Int. Parallel Distrib. Process. Symp., 2009. 
[103] S. Whitty, H. Sahlbach, R. Ernst, and W. Putzke-Roming, “Mapping of a film 
grain removal algorithm to a heterogeneous reconfigurable architecture,” in 
Design, Automation & Test in Europe Conference & Exhibition, 2009. DATE 
’09., 2009, pp. 27–32. 
[104] J. Wu, M. Wang, J. Jeong, and L. Jiao, “Adaptive-distributed arithmetic 
coding for lossless compression,” in 2010 2nd IEEE International Conference 
on Network Infrastructure and Digital Content, IC-NIDC 2010, 2010, pp. 
541–545. 
[105] R. R. Osorio and J. D. Bruguera, “High-Throughput Architecture for 
H.264/AVC CABAC Compression System,” IEEE Trans. Circuits Syst. Video 












[106] W. Wang, B. Guo, S. Zhang, and Q. Ye, “A CABAC accelerating algorithm 
based on adaptive probability estimation update,” Proc. 2009 2nd Int. Congr. 
Image Signal Process. CISP’09, 2009. 
[107] H. Shojania and S. Sudharsanan, “A high performance CABAC encoder,” 3rd 
Int. IEEE Northeast Work. Circuits Syst. Conf. NEWCAS 2005, vol. 2005, pp. 
315–318, 2005. 
[108] “Optimization of the critical loop in Renormalization CABAC decoder,” pp. 
199–203, 2015. 
[109] N. Zhang, J. Wang, and Y. Chen, “Image parallel processing based on GPU,” 
in 2nd International Conference on Advanced Computer Control, 2010, pp. 
367–370. 
[110] L. M. Russo, E. C. Pedrino, E. Kato, and V. O. Roda, “Image convolution 
processing: A GPU versus FPGA comparison,” SPL 2012 - 8th South. 
Program. Log. Conf., no. 2010, 2012. 
[111] R. R. Osorio and J. D. Bruguera, “An FPGA Architecture for CABAC 
Decoding in Manycore Systems,” pp. 293–298, 2008. 
[112] G. Pastuszak, “High-speed architecture of the CABAC probability modeling 
for H.265/HEVC encoders,” 2016, pp. 143–146. 
[113] X. Si and H. Zheng, “High Performance Remote Sensing Image Processing 
Using CUDA,” in Third International Symposium on Electronic Commerce 
and Security, 2010, pp. 121–125. 
[114] A. Karantza, S. L. Alarcon, and N. D. Cahill, “A comparison of sequential and 
GPU-accelerated implementations of B-spline signal processing operations for 
2-D and 3-D images,” in 2012 3rd International Conference on Image 
Processing Theory, Tools and Applications, IPTA 2012, 2012. 
[115] H. Liu, T. Ma, S. Chen, Y. Liu, S. Wang, and Y. Jin, “Development of GPU 
based image reconstruction method for clinical SPECT,” IEEE Nucl. Sci. 
Symp. Med. Imaging Conf. Rec., pp. 3415–3418, 2012. 
[116] A. J. Maier and B. F. Cockburn, “Optimization of Low-Density 
Parity Check decoder performance for OpenCL designs synthesized to 
FPGAs,” J. Parallel Distrib. Comput., vol. 107, pp. 134–145, 2017. 
[117] J. C. Peng, Hai, Letian Huang, “An Efficient FPGA Implementation for Odd-Even 
sort based KNN Algorithm using OpenCL,” in ISOCC 2016: 13th 












[118] W. Van Ranst and J. Vennekens, “An OpenCL implementation of a forward 
sampling algorithm for 
CP-logic,” Int. J. Approx. Reason., vol. 67, pp. 60–72, 2015. 
[119] S. J. Parker and V. A. Chouliaras, “An OpenCL software compilation 
framework targeting an SoC-FPGA VLIW chip multiprocessor,” J. Syst. 
Archit., vol. 68, pp. 17–37, 2016. 
[120] J. S. Lee and T. Ebrahimi, “Perceptual video compression: A survey,” IEEE J. 
Sel. Top. Signal Process., vol. 6, no. 6, pp. 684–697, 2012. 
[121] A. Amselem, T. Hatsui, and M. Yamaga, “Real-Time Embedded Lossless 
Compression for Sparse Signal Data Optimized for X-Ray Free-Electron 
Laser Experiments,” in IEEE Nuclear Science Symposium Conference Record, 
2011, pp. 2180–2182. 
[122] I. Chiuchisan, “Implementation of medical image processing algorithm on 
reconfigurable hardware,” IEEE Int. Conf. E-Health Bioeng., pp. 4–7, 2013. 
[123] V. Sanchez, R. Abugharbieh, and P. Nasiopoulos, “3-D scalable medical 
image compression with optimized volume of interest coding,” IEEE Trans. 
Med. Imaging, vol. 29, no. 10, pp. 1808–1820, 2010. 
[124] I. Chiuchisan, “A new FPGA-based real-time configurable system for medical 
image processing,” 2013 E-Health Bioeng. Conf. EHB 2013, pp. 0–3, 2013. 
[125] V. Akkala, P. Rajalakshmi, P. Kumar, and U. B. Desai, “FPGA based 
ultrasound backend system with image enhancement technique,” ISSNIP 
Biosignals Biorobotics Conf. BRC, 2014. 
[126] Q. Min and R. J. T. Sadleir, “An Edge-based Prediction Approach for 
Medical Image Compression,” IEEE EMBS Int. Conf. Biomed. Eng. Sci., no. 
December, pp. 717–722, 2012. 
[127] K. G. Thanushkodi and S. Bhavani, “Comparison of fractal coding methods 
for medical image compression,” IET Image Process., vol. 7, no. 7, pp. 686–
693, 2013. 
[128] S. Kim, H. Sohn, J. H. Chang, T. Song, and Y. Yoo, “A PC-based fully-
programmable medical ultrasound imaging system using a graphics processing 















[129] S. Saha, K. H. Uddin, M. S. Islam, M. Jahiruzzaman, and A. B. M. A. 
Hossain, “Implementation of simplified normalized cut graph partitioning 
algorithm on FPGA for image segmentation,” SKIMA 2014 - 8th Int. Conf. 
Software, Knowledge, Inf. Manag. Appl., no. 3, 2014. 
[130] Y. Li, W. Jia, B. Luan, Z. H. Mao, H. Zhang, and M. Sun, “A FPGA 
implementation of JPEG baseline encoder for wearable devices,” 2015 41st 
Annu. Northeast Biomed. Eng. Conf. NEBEC 2015, pp. 3–4, 2015. 
[131] K. Benkrid, A. Akoglu, C. Ling, Y. Song, Y. Liu, and X. Tian, “High 
performance biological pairwise sequence alignment: FPGA versus GPU 
versus cell BE versus GPP,” Int. J. Reconfigurable Comput., vol. 2012, 2012. 
[132] N. Zhou, H. Li, D. Wang, S. Pan, and Z. Zhou, “Image compression and 
encryption scheme based on 2D compressive sensing and fractional Mellin 
transform,” Opt. Commun., vol. 343, pp. 10–21, 2015. 
[133] A. Ahmad, B. Krill, A. Amira, and H. Rabah, “3D Haar wavelet transform 
with dynamic partial reconfiguration for 3D medical image compression,” in 
2009 IEEE Biomedical Circuits and Systems Conference, 2009, vol. 1, pp. 
137–140. 
[134] N. Kehtarnavaz and S. Mahotra, “FPGA implementation made easy for 
applied digital signal processing courses,” in 2011 IEEE International 
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 
2892–2895. 
[135] K. Benkrid, A. Akoglu, C. Ling, Y. Song, Y. Liu, and X. Tian, “High 
performance biological pairwise sequence alignment: FPGA versus GPU 
versus cell BE versus GPP,” Int. J. Reconfigurable Comput., vol. 2012, 2012. 
[136] D. Chowdhury, S. K. Samaddar, and A. Sinha, “FPGA based implementation 
of FDCT using Distributed Arithmetic,” 2011 2nd Int. Conf. Comput. 
Commun. Technol. ICCCT-2011, pp. 400–404, 2011. 
[137] A. P. D. Binotto, D. Doering, T. Stetzelberger, P. McVittie, S. 
Zimmermann, and C. E. Pereira, “A CPU, GPU, FPGA System for X-ray 
Image Processing using High-speed Scientific Cameras.” 
[138] N. Huda, A. Ahmad, and A. Amira, “Rapid Prototyping of Three-Dimensional 
Transform for Medical Image Compression,” in The 11th International 
Conference on Information Science, Signal Processing and their 
Applications: Main Tracks, 2012, pp. 842–847. 
[139] N. Huda, A. Ahmad, and A. Amira, “Rapid Prototyping of Three-Dimensional 
Transform,” 11th Int. Conf. Inf. Sci. Signal Process. their Appl. Main Tracks, 
pp. 842–847, 2012. 
[140] B. Krill, A. Amira, A. Ahmad, and H. Rabah, “A new FPGA-based dynamic 
partial reconfiguration design flow and environment for image processing 
applications,” 2010 2nd Eur. Work. Vis. Inf. Process. EUVIP2010, pp. 226–
231, 2010. 
[141] K. H. Talukder and K. Harada, “Haar Wavelet Based Approach for Image 
Compression and Quality Assessment of Compressed Image,” Int. J. Appl. 
Math., vol. 36, no. 1, Feb. 2007. 
[142] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. 
Stockhammer, and T. Wedi, “Video coding with H.264/AVC: Tools, 
performance, and complexity,” IEEE Circuits Syst. Mag., vol. 4, no. 1, pp. 7–
28, 2004. 
[143] T. Wiegand, “Overview of the H.264/AVC video coding standard,” … Syst. 
Video …, vol. 13, no. 7, pp. 560–576, 2003. 
[144] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary 
arithmetic coding in the H.264/AVC video compression standard,” IEEE 
Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 620–636, 2003. 
[145] R. A. Kandalkar and P. M. R. Ingle, “CABAC Entropy Decoding Algorithm 
Implementation on FPGA For H.264,” Int. J. Emerg. Trends Electr. 
Electron., vol. 5, pp. 70–75, 2013. 
[146] Z. Juhasz and G. Kozmann, “A GPU-based simultaneous real-time EEG 
processing and visualization system for brain imaging applications,” 2015 
38th Int. Conv. Inf. Commun. Technol. Electron. Microelectron. MIPRO 2015 
- Proc., no. May, pp. 299–304, 2015. 
[147] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic 
Coding for data compression,” Commun. ACM, vol. 30, no. 6, pp. 520–540, 1987. 
[148] U. W. Lok and P. C. Li, “Transform-Based Channel-Data Compression to 
Improve the Performance of a Real-Time GPU-Based Software Beamformer,” 














[149] A. Ahmad, A. Abbes, M. Guarisco, H. Rabah, and Y. Berviller, “Efficient 
Implementation Of A 3-D Medical Imaging Compression System Using 
CAVLC,” in Proceeding of 2010 IEEE 17th International Conference on 
Image Processing, 2010, pp. 3773–3776. 
[150] D. Keymeulen, N. Aranki, B. Hopson, A. Kiely, M. Klimesh, and K. Benkrid, 
“GPU lossless hyperspectral data compression system for space applications,” 
in IEEE Aerospace Conference Proceedings, 2012. 
[151] P. Govindan, T. Gonnot, S. Gilliland, and J. Saniie, “3D ultrasonic signal 
compression algorithms for high signal fidelity,” Midwest Symposium on 
Circuits and Systems, vol. 2, no. 2, pp. 1263–1266, 2013. 
[152] M. D. A. Freitas, M. R. Jimenez, H. Benincaza, and J. P. Von Der Weid, “A 
New Lossy Compression Algorithm for Ultrasound Signals,” 2008. 
[153] N. Zhang, J. L. Wang, and Y. S. Chen, “Image parallel processing based on 
GPU,” in Proceedings - 2nd IEEE International Conference on Advanced 
Computer Control, ICACC 2010, 2010, vol. 3, pp. 367–370. 
[154] D. C. de Andrade and L. G. Trabasso, “An OpenCL 
framework for high performance extraction of image features,” J. Parallel 
Distrib. Comput., vol. 109, pp. 75–88, 2017. 
[155] F. B. Muslim, L. Ma, M. Roozmeh, and L. Lavagno, “Efficient FPGA 
Implementation of OpenCL High-Performance Computing Applications via 
High-Level Synthesis,” IEEE Access, vol. 5, 2017. 