60 research outputs found

    Efficient architectures of heterogeneous fpga-gpu for 3-d medical image compression

    Get PDF
    The advent of development in three-dimensional (3-D) imaging modalities have generated a massive amount of volumetric data in 3-D images such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and ultrasound (US). Existing survey reveals the presence of a huge gap for further research in exploiting reconfigurable computing for 3-D medical image compression. This research proposes an FPGA based co-processing solution to accelerate the mentioned medical imaging system. The HWT block implemented on the sbRIO-9632 FPGA board is Spartan 3 (XC3S2000) chip prototyping board. Analysis and performance evaluation of the 3-D images were been conducted. Furthermore, a novel architecture of context-based adaptive binary arithmetic coder (CABAC) is the advanced entropy coding tool employed by main and higher profiles of H.264/AVC. This research focuses on GPU implementation of CABAC and comparative study of discrete wavelet transform (DWT) and without DWT for 3-D medical image compression systems. Implementation results on MRI and CT images, showing GPU significantly outperforming single-threaded CPU implementation. Overall, CT and MRI modalities with DWT outperform in term of compression ratio, peak signal to noise ratio (PSNR) and latency compared with images without DWT process. For heterogeneous computing, MRI images with various sizes and format, such as JPEG and DICOM was implemented. Evaluation results are shown for each memory iteration, transfer sizes from GPU to CPU consuming more bandwidth or throughput. For size 786, 486 bytes JPEG format, both directions consumed bandwidth tend to balance. Bandwidth is relative to the transfer size, the larger sizing will take more latency and throughput. Next, OpenCL implementation for concurrent task via dedicated FPGA. Finding from implementation reveals, OpenCL on batch procession mode with AOC techniques offers substantial results where the amount of logic, area, register and memory increased proportionally to the number of batch. It is because of the kernel will copy the kernel block refer to batch number. Therefore memory bank increased periodically related to kernel block. It was found through comparative study that the tree balance and unroll loop architecture provides better achievement, in term of local memory, latency and throughput

    High throughput image compression and decompression on GPUs

    Get PDF
    Diese Arbeit befasst sich mit der Entwicklung eines GPU-freundlichen, intra-only, Wavelet-basierten Videokompressionsverfahrens mit hohem Durchsatz, das für visuell verlustfreie Anwendungen optimiert ist. Ausgehend von der Beobachtung, dass der JPEG 2000 Entropie-Kodierer ein Flaschenhals ist, werden verschiedene algorithmische Änderungen vorgeschlagen und bewertet. Zunächst wird der JPEG 2000 Selective Arithmetic Coding Mode auf der GPU realisiert, wobei sich die Erhöhung des Durchsatzes hierdurch als begrenzt zeigt. Stattdessen werden zwei nicht standard-kompatible Änderungen vorgeschlagen, die (1) jede Bitebebene in nur einem einzelnen Pass verarbeiten (Single-Pass-Modus) und (2) einen echten Rohcodierungsmodus einführen, der sample-weise parallelisierbar ist und keine aufwendige Kontextmodellierung erfordert. Als nächstes wird ein alternativer Entropiekodierer aus der Literatur, der Bitplane Coder with Parallel Coefficient Processing (BPC-PaCo), evaluiert. Er gibt Signaladaptivität zu Gunsten von höherer Parallelität auf und daher wird hier untersucht und gezeigt, dass ein aus verschiedensten Testsequenzen gemitteltes statisches Wahrscheinlichkeitsmodell eine kompetitive Kompressionseffizienz erreicht. Es wird zudem eine Kombination von BPC-PaCo mit dem Single-Pass-Modus vorgeschlagen, der den Speedup gegenüber dem JPEG 2000 Entropiekodierer von 2,15x (BPC-PaCo mit zwei Pässen) auf 2,6x (BPC-PaCo mit Single-Pass-Modus) erhöht auf Kosten eines um 0,3 dB auf 1,0 dB erhöhten Spitzen-Signal-Rausch-Verhältnis (PSNR). Weiter wird ein paralleler Algorithmus zur Post-Compression Ratenkontrolle vorgestellt sowie eine parallele Codestream-Erstellung auf der GPU. Es wird weiterhin ein theoretisches Laufzeitmodell formuliert, das es durch Benchmarking von einer GPU ermöglicht die Laufzeit einer Routine auf einer anderen GPU vorherzusagen. Schließlich wird der erste JPEG XS GPU Decoder vorgestellt und evaluiert. JPEG XS wurde als Low Complexity Codec konzipiert und forderte erstmals explizit GPU-Freundlichkeit bereits im Call for Proposals. Ab Bitraten über 1 bpp ist der Decoder etwa 2x schneller im Vergleich zu JPEG 2000 und 1,5x schneller als der schnellste hier vorgestellte Entropiekodierer (BPC-PaCo mit Single-Pass-Modus). Mit einer GeForce GTX 1080 wird ein Decoder Durchsatz von rund 200 fps für eine UHD-4:4:4-Sequenz erreicht.This work investigates possibilities to create a high throughput, GPU-friendly, intra-only, Wavelet-based video compression algorithm optimized for visually lossless applications. Addressing the key observation that JPEG 2000’s entropy coder is a bottleneck and might be overly complex for a high bit rate scenario, various algorithmic alterations are proposed. First, JPEG 2000’s Selective Arithmetic Coding mode is realized on the GPU, but the gains in terms of an increased throughput are shown to be limited. Instead, two independent alterations not compliant to the standard are proposed, that (1) give up the concept of intra-bit plane truncation points and (2) introduce a true raw-coding mode that is fully parallelizable and does not require any context modeling. Next, an alternative block coder from the literature, the Bitplane Coder with Parallel Coefficient Processing (BPC-PaCo), is evaluated. Since it trades signal adaptiveness for increased parallelism, it is shown here how a stationary probability model averaged from a set of test sequences yields competitive compression efficiency. A combination of BPC-PaCo with the single-pass mode is proposed and shown to increase the speedup with respect to the original JPEG 2000 entropy coder from 2.15x (BPC-PaCo with two passes) to 2.6x (proposed BPC-PaCo with single-pass mode) at the marginal cost of increasing the PSNR penalty by 0.3 dB to at most 1 dB. Furthermore, a parallel algorithm is presented that determines the optimal code block bit stream truncation points (given an available bit rate budget) and builds the entire code stream on the GPU, reducing the amount of data that has to be transferred back into host memory to a minimum. A theoretical runtime model is formulated that allows, based on benchmarking results on one GPU, to predict the runtime of a kernel on another GPU. Lastly, the first ever JPEG XS GPU-decoder realization is presented. JPEG XS was designed to be a low complexity codec and for the first time explicitly demanded GPU-friendliness already in the call for proposals. Starting at bit rates above 1 bpp, the decoder is around 2x faster compared to the original JPEG 2000 and 1.5x faster compared to JPEG 2000 with the fastest evaluated entropy coder (BPC-PaCo with single-pass mode). With a GeForce GTX 1080, a decoding throughput of around 200 fps is achieved for a UHD 4:4:4 sequence

    Remote Sensing Data Compression

    Get PDF
    A huge amount of data is acquired nowadays by different remote sensing systems installed on satellites, aircrafts, and UAV. The acquired data then have to be transferred to image processing centres, stored and/or delivered to customers. In restricted scenarios, data compression is strongly desired or necessary. A wide diversity of coding methods can be used, depending on the requirements and their priority. In addition, the types and properties of images differ a lot, thus, practical implementation aspects have to be taken into account. The Special Issue paper collection taken as basis of this book touches on all of the aforementioned items to some degree, giving the reader an opportunity to learn about recent developments and research directions in the field of image compression. In particular, lossless and near-lossless compression of multi- and hyperspectral images still remains current, since such images constitute data arrays that are of extremely large size with rich information that can be retrieved from them for various applications. Another important aspect is the impact of lossless compression on image classification and segmentation, where a reasonable compromise between the characteristics of compression and the final tasks of data processing has to be achieved. The problems of data transition from UAV-based acquisition platforms, as well as the use of FPGA and neural networks, have become very important. Finally, attempts to apply compressive sensing approaches in remote sensing image processing with positive outcomes are observed. We hope that readers will find our book useful and interestin

    Image and Video Coding Techniques for Ultra-low Latency

    Get PDF
    The next generation of wireless networks fosters the adoption of latency-critical applications such as XR, connected industry, or autonomous driving. This survey gathers implementation aspects of different image and video coding schemes and discusses their tradeoffs. Standardized video coding technologies such as HEVC or VVC provide a high compression ratio, but their enormous complexity sets the scene for alternative approaches like still image, mezzanine, or texture compression in scenarios with tight resource or latency constraints. Regardless of the coding scheme, we found inter-device memory transfers and the lack of sub-frame coding as limitations of current full-system and software-programmable implementations.publishedVersionPeer reviewe

    Pedestrian Detection Algorithms using Shearlets

    Get PDF
    In this thesis, we investigate the applicability of the shearlet transform for the task of pedestrian detection. Due to the usage of in several emerging technologies, such as automated or autonomous vehicles, pedestrian detection has evolved into a key topic of research in the last decade. In this time period, a wealth of different algorithms has been developed. According to the current results on the Caltech Pedestrian Detection Benchmark the algorithms can be divided into two categories. First, application of hand-crafted image features and of a classifier trained on these features. Second, methods using Convolutional Neural Networks in which features are learned during the training phase. It is studied how both of these types of procedures can be further improved by the incorporation of shearlets, a framework for image analysis which has a comprehensive theoretical basis

    Simulation methodologies for mobile GPUs

    Get PDF
    GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU-GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. Making the situation even more dire, existing GPU simulation efforts are concentrated around desktop GPUs, making infrastructure for modelling mobile GPUs virtually non-existent, despite their surging importance in the GPU market. Still, mobile GPU designers are faced with the challenge of evaluating design alternatives involving hundreds of architectural configuration options and micro-architectural improvements under tight time-to-market constraints, to which currently employed design flows involving detailed, but slow simulations are not well suited. In this thesis we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali Bifrost GPU powered device, achieving 100\% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework through a number of case studies exploring modern, mobile GPU applications, and optimize them using functional simulation statistics, unavailable with other approaches or hardware. Furthermore, we develop a trace-based performance model, allowing architects to rapidly model GPU configurations in early design space exploration

    Evolutionary design of deep neural networks

    Get PDF
    Mención Internacional en el título de doctorFor three decades, neuroevolution has applied evolutionary computation to the optimization of the topology of artificial neural networks, with most works focusing on very simple architectures. However, times have changed, and nowadays convolutional neural networks are the industry and academia standard for solving a variety of problems, many of which remained unsolved before the discovery of this kind of networks. Convolutional neural networks involve complex topologies, and the manual design of these topologies for solving a problem at hand is expensive and inefficient. In this thesis, our aim is to use neuroevolution in order to evolve the architecture of convolutional neural networks. To do so, we have decided to try two different techniques: genetic algorithms and grammatical evolution. We have implemented a niching scheme for preserving the genetic diversity, in order to ease the construction of ensembles of neural networks. These techniques have been validated against the MNIST database for handwritten digit recognition, achieving a test error rate of 0.28%, and the OPPORTUNITY data set for human activity recognition, attaining an F1 score of 0.9275. Both results have proven very competitive when compared with the state of the art. Also, in all cases, ensembles have proven to perform better than individual models. Later, the topologies learned for MNIST were tested on EMNIST, a database recently introduced in 2017, which includes more samples and a set of letters for character recognition. Results have shown that the topologies optimized for MNIST perform well on EMNIST, proving that architectures can be reused across domains with similar characteristics. In summary, neuroevolution is an effective approach for automatically designing topologies for convolutional neural networks. However, it still remains as an unexplored field due to hardware limitations. Current advances, however, should constitute the fuel that empowers the emergence of this field, and further research should start as of today.This Ph.D. dissertation has been partially supported by the Spanish Ministry of Education, Culture and Sports under FPU fellowship with identifier FPU13/03917. This research stay has been partially co-funded by the Spanish Ministry of Education, Culture and Sports under FPU short stay grant with identifier EST15/00260.Programa Oficial de Doctorado en Ciencia y Tecnología InformáticaPresidente: María Araceli Sanchís de Miguel.- Secretario: Francisco Javier Segovia Pérez.- Vocal: Simon Luca
    • …
    corecore