
80 research outputs found

Efficient architectures of heterogeneous FPGA-GPU for 3-D medical image compression

Author: Muharam Azlan
Publication date: 01/04/2019

Advances in three-dimensional (3-D) imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and ultrasound (US) have generated a massive amount of volumetric data. A survey of existing work reveals a large gap in research on exploiting reconfigurable computing for 3-D medical image compression. This research proposes an FPGA-based co-processing solution to accelerate such medical imaging systems. The HWT block is implemented on the sbRIO-9632 FPGA board, a prototyping board built around the Spartan 3 (XC3S2000) chip. Analysis and performance evaluation were conducted on the 3-D images. Furthermore, a novel architecture is presented for the context-based adaptive binary arithmetic coder (CABAC), the advanced entropy coding tool employed by the main and higher profiles of H.264/AVC. This research focuses on a GPU implementation of CABAC and a comparative study of 3-D medical image compression systems with and without the discrete wavelet transform (DWT). Implementation results on MRI and CT images show the GPU significantly outperforming a single-threaded CPU implementation. Overall, for the CT and MRI modalities, compression with DWT outperforms compression without DWT in terms of compression ratio, peak signal-to-noise ratio (PSNR), and latency. For heterogeneous computing, MRI images of various sizes and formats, such as JPEG and DICOM, were processed. Evaluation results are reported for each memory iteration; transfers from GPU to CPU consume more bandwidth, or throughput. For the 786,486-byte JPEG image, the bandwidth consumed in the two directions tends to balance. Bandwidth depends on the transfer size: larger transfers incur more latency and throughput. Next, an OpenCL implementation for concurrent tasks on a dedicated FPGA is presented.
The implementation findings reveal that OpenCL in batch processing mode with AOC techniques offers substantial results, with the amount of logic, area, registers, and memory increasing in proportion to the number of batches. This is because the kernel block is replicated according to the batch number; the memory banks therefore increase periodically with the kernel blocks. A comparative study found that the tree-balance and loop-unrolling architecture performs better in terms of local memory, latency, and throughput.
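
The abstract above compares compression systems by compression ratio and PSNR. As a minimal sketch of how these two metrics are commonly computed (not the thesis's own code; the function names and the 8-bit peak value of 255 are assumptions made for illustration):

```python
import math


def compression_ratio(original_bytes, compressed_bytes):
    """Ratio of original size to compressed size (higher means smaller output)."""
    return original_bytes / compressed_bytes


def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the assumed peak pixel value (255 for 8-bit data).
    """
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs: no distortion
    return 10 * math.log10(max_value ** 2 / mse)
```

For volumetric data, the voxels of all slices can be flattened into one sequence before calling `psnr`.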

UTHM Institutional Repository

PC-grade parallel processing and hardware acceleration for large-scale data analysis

Author: Yang Su
Publication date: 01/01/2009

Arguably, modern graphics processing units (GPUs) are the first commodity desktop parallel processors. Although GPU programming originated in interactive rendering for graphical applications such as computer games, researchers in the field of general-purpose computation on GPUs (GPGPU) are showing that the power, ubiquity, and low cost of GPUs make them an ideal alternative platform for high-performance computing. This has resulted in extensive exploration of using the GPU to accelerate general-purpose computations in many engineering and mathematical domains outside of graphics. However, because of the development complexity caused by the graphics-oriented concepts and development tools for GPU programming, GPGPU has so far been discussed mainly in the academic domain and has not yet fulfilled its promise in the real world. This thesis aims at exploiting GPGPU in the practical engineering domain and presents a novel contribution to GPGPU-driven linear time-invariant (LTI) systems that are employed by the signal processing techniques in stylus-based or optical-based surface metrology and data processing. The core contributions achieved in this project can be summarized as follows. Firstly, a thorough survey of the state of the art of GPGPU applications and their development approaches has been carried out. In addition, the category of parallel architecture pattern that GPGPU belongs to has been specified, which forms the foundation of the GPGPU programming framework design in the thesis. Following this specification, a GPGPU programming framework is derived as a general guideline for the various GPGPU programming models that are applied to a large diversity of algorithms in scientific computing and engineering applications.
Considering the evolution of GPU hardware architecture, the proposed framework covers the transition from graphics-originated concepts for GPGPU programming on legacy GPUs to the abstraction of the stream processing pattern represented by the compute unified device architecture (CUDA), in which the GPU is considered not only a graphics device but also a streaming coprocessor of the CPU. Secondly, the proposed GPGPU programming framework is applied to practical engineering applications, namely surface metrological data processing and image processing, to generate programming models that carry out parallel computing for the corresponding algorithms. The acceleration performance of these models is evaluated in terms of the speed-up factor and the data accuracy, which enables the generation of quantifiable benchmarks for evaluating consumer-grade parallel processors. The results show that the GPGPU applications outperform the CPU solutions by up to 20 times without significant loss of data accuracy or any noticeable increase in source code complexity, which further validates the effectiveness of the proposed general GPGPU programming framework. Thirdly, this thesis devises methods for carrying out result visualization directly on the GPU by storing processed data in local GPU memory and making use of the GPU's rendering device features to achieve real-time interaction. The algorithms employed in this thesis include various filtering techniques, the discrete wavelet transform, and the fast Fourier transform, which cover the common operations implemented in most LTI systems in the spatial and frequency domains. Considering the hardware designs of the employed GPUs, especially the structure of the rendering pipelines, and the characteristics of the algorithms, the series of proposed GPGPU programming models has proven its feasibility, practicality, and robustness in real engineering applications.
The developed GPGPU programming framework and programming models are anticipated to be adaptable to future consumer-level computing devices and other computationally demanding applications. In addition, it is envisaged that the devised principles and methods in the framework design are likely to have significant benefits outside the sphere of surface metrology.
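
The filtering operations mentioned in this abstract are LTI systems, and an LTI filter can be written as a convolution in which every output sample is computed independently of the others. That independence is what makes such filters a natural fit for data-parallel hardware. The following is a minimal CPU-side sketch of the idea, not the thesis's framework; the function name `fir_filter` is hypothetical:

```python
def fir_filter(signal, kernel):
    """Full linear convolution of signal with kernel (length n + m - 1).

    Each output index i reads only the inputs, never another output, so on a
    GPU every i could be assigned to its own thread.
    """
    n, m = len(signal), len(kernel)
    out = [0.0] * (n + m - 1)
    for i in range(n + m - 1):
        acc = 0.0
        for k in range(m):
            j = i - k
            if 0 <= j < n:  # skip taps that fall outside the signal
                acc += kernel[k] * signal[j]
        out[i] = acc
    return out
```

Delaying the input by one sample delays the output by one sample, which is the time-invariance property: `fir_filter([0, 1, 2, 3], [1, 1])` is `fir_filter([1, 2, 3], [1, 1])` with a leading zero.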

OpenGrey Repository

CENTRAL PROCESSING UNIT-GRAPHICS PROCESSING UNIT COMPUTING SCHEME FOR MULTI-OBJECT TRACKING IN SURVEILLANCE

Author: R Jagadeesh Kannan
Rai Ankush
Publication venue: 'Innovare Academic Sciences Pvt Ltd'
Publication date: 01/04/2017

This research work presents a novel central processing unit-graphics processing unit (CPU-GPU) computing scheme for multiple object tracking during a surveillance operation. The scheme allows nonlinear computational jobs for the tracking function to complete in minimal processing time. The work is divided into two essential objectives: the first is to dynamically divide the processing operations into parallel units, and the second is to reduce the communication between the CPU and GPU processing units.

Innovare Academic Sciences: E-Journals

GPU acceleration of predictive partitioned vector quantization for ultraspectral sounder data compression

Author: Huang Bormin
Wei Shih-chieh
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

For large-volume ultraspectral sounder data, compression is desirable to save storage space and transmission time. To retrieve the geophysical parameters without losing precision, the ultraspectral sounder data compression has to be lossless. Recently there has been a boom in the use of graphics processing units (GPUs) to speed up scientific computations. By identifying the time-dominant portions of the code that can be executed in parallel, significant speedup can be achieved by using GPUs. Predictive partitioned vector quantization (PPVQ) has been proven to be an effective lossless compression scheme for ultraspectral sounder data. It consists of linear prediction, bit depth partitioning, vector quantization, and entropy coding. The two most time-consuming stages, linear prediction and vector quantization, are chosen for GPU-based implementation. By exploiting the data-parallel characteristics of these two stages, a spatial division design shows a speedup of 72x in our four-GPU-based implementation of the PPVQ compression scheme.
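
To illustrate why the vector quantization stage is data-parallel, here is a minimal sketch of a nearest-codeword search in which each input vector is processed independently. It is not the authors' PPVQ implementation: the function names and the tiny codebook in the usage note are hypothetical, and keeping the integer quantization error alongside the index is an assumption about how a lossless scheme can recover the input exactly.

```python
def nearest_codeword(vector, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    best_index, best_dist = 0, None
    for index, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if best_dist is None or dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def vq_encode(vectors, codebook):
    """Encode each vector as (codeword index, quantization error).

    Vectors do not depend on each other, so each one could be handled by a
    separate GPU thread or block.
    """
    encoded = []
    for vec in vectors:
        index = nearest_codeword(vec, codebook)
        error = [v - c for v, c in zip(vec, codebook[index])]
        encoded.append((index, error))
    return encoded


def vq_decode(encoded, codebook):
    """Exact reconstruction: codeword plus stored quantization error."""
    return [[c + e for c, e in zip(codebook[index], error)]
            for index, error in encoded]
```

With the hypothetical codebook `[[0, 0], [10, 10]]`, the vectors `[[1, -1], [9, 12]]` encode to indices 0 and 1 and decode back to the original values.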

Crossref

Tamkang University Institutional Repository

Towards Fast and High-quality Biomedical Image Reconstruction

Author: Tran Minh Quan
Publication venue: Graduate School of UNIST
Publication date: 01/02/2019

Department of Computer Science and Engineering

Reconstruction is an important module in the image analysis pipeline, with the purpose of isolating the majority of the meaningful information that is hidden inside the acquired data. The term "reconstruction" can be understood and subdivided into several specific tasks in different modalities. For example, in biomedical imaging, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), the term stands for the transformation from the, possibly fully sampled or under-sampled, spectral domains (sinogram for CT and k-space for MRI) to the visible image domains. In connectomics, the term usually refers to segmentation (reconstructing the semantic contact between neuronal connections) or denoising (reconstructing the clean image). In this dissertation research, I will describe a set of my contributed algorithms, from conventional to state-of-the-art deep learning methods, with a transition at the data-driven dictionary learning approaches, that tackle the reconstruction problems in various image analysis tasks.

ScholarWorks@UNIST

Performance-analysis-based Acceleration of Image Quality Assessment

Author: Phan Thien D.
Publication venue: 'Oklahoma State University Library'
Publication date: 01/12/2014

Algorithms for image/video quality assessment (QA) aim to predict the qualities of images in a manner that agrees with subjective quality ratings. Over the last several decades, the major impetus in QA research has focused on improving predictive performance; very few studies have focused on analyzing and improving the runtime performance of QA algorithms. Modern algorithms for image/video quality assessment commonly employ two stages: (1) a local frequency-based decomposition, and (2) block-based statistical comparisons between the frequency coefficients of the reference and distorted images. These two stages constitute the bulk of the computation and runtime required for QA. This research thesis presents a performance analysis of, and techniques for accelerating, these stages. We also specifically analyze and accelerate one representative QA algorithm, Most Apparent Distortion (MAD), which was developed by Eric Larson and Damon Chandler in 2010 [1]. We identify the bottlenecks in the above-mentioned stages, and we present methods of acceleration using a generalized integral image, inline expansion, a GPGPU implementation, and other code modifications. We show how a combination of these approaches can yield a speedup of 47x. The content of the report is divided into five chapters. In Chapter 1, we present a general overview of QA algorithms, current work on improving the computational performance and execution time of QA algorithms, and an introduction to our work. In Chapter 2, we describe the MAD algorithm, the first performance analysis, and the systems used to test the performance. In Chapter 3, we present the generalized integral image and inline expansion techniques. In this chapter, we also provide the results of each technique in terms of speeding up running time. Chapter 4 provides GPGPU and some other code optimization techniques with the timing results. Finally, conclusions are presented in Chapter 5 to summarize the report.
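
Block-based statistics of the kind described here repeatedly sum pixel values over rectangular windows, and an integral image turns each such sum into four table lookups. The sketch below shows the standard integral image (summed-area table), not the generalized variant developed in the thesis; the function names are hypothetical.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros.

    ii[y][x] holds the sum of img over rows < y and columns < x.
    """
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def block_sum(ii, top, left, height, width):
    """Sum of the block starting at (top, left), using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
```

Once the table is built, the cost of a block sum no longer depends on the block size, so a block mean is `block_sum(...) / (height * width)`.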

SHAREOK repository

GPU Integration into a Software Defined Radio Framework

Author: Millage Joel Gregory
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2010

Software Defined Radio (SDR) was brought about by moving processing done on specific hardware components to reconfigurable software. Hardware components like General Purpose Processors (GPPs), Digital Signal Processors (DSPs), and Field Programmable Gate Arrays (FPGAs) are used to make the software and hardware processing of the radio more portable and as efficient as possible. Graphics Processing Units (GPUs), designed years ago for video rendering, are now finding new uses in research. The parallel architecture provided by the GPU gives developers the ability to speed up computationally intense programs. An open source tool for SDR, Open Source Software Communications Architecture (SCA) Implementation: Embedded (OSSIE), is a free waveform development environment for any developer who wants to experiment with SDR. In this work, OSSIE is integrated with a GPU computing framework to show how performance improvement can be gained from GPU parallelization. GPU research performed with SDR ranges from improving SDR simulations to implementing specific wireless protocols. In this thesis, we aim to show performance improvement within an SCA-architected SDR implementation. The software components within OSSIE gained significant performance increases with few software changes due to the natural parallelism of the GPU, using Compute Unified Device Architecture (CUDA), Nvidia's GPU programming API. Using sample data sizes for the I and Q channel inputs, performance improvements were seen with as few as 512 samples when using the GPU-optimized version of OSSIE. As the sample size increased, the CUDA performance improved as well. Porting OSSIE components onto the CUDA architecture showed that improved performance can be achieved in SDR-related software through the use of GPU technology.

Digital Repository @ Iowa State University (ISU)