2,343 research outputs found

    Performance evaluation of H.264/AVC decoding and visualization using the GPU

    Get PDF
    The coding efficiency of the H.264/AVC standard makes the decoding process computationally demanding. This has limited the availability of cost-effective, high-performance solutions. Modern computers are typically equipped with powerful yet cost-effective Graphics Processing Units (GPUs) to accelerate graphics operations. These GPUs can be addressed by means of a 3-D graphics API such as Microsoft Direct3D or OpenGL, using programmable shaders as generic processing units for vector data. The new CUDA (Compute Unified Device Architecture) platform of NVIDIA provides a straightforward way to address the GPU directly, without the need for a 3-D graphics API in the middle. In CUDA, a compiler generates executable code from C code with specific modifiers that determine the execution model. This paper first presents an own-developed H.264/AVC renderer, which is capable of executing motion compensation (MC), reconstruction, and Color Space Conversion (CSC) entirely on the GPU. To steer the GPU, Direct3D combined with programmable pixel and vertex shaders is used. Next, we also present a GPU-enabled decoder utilizing the new CUDA architecture from NVIDIA. This decoder performs MC, reconstruction, and CSC on the GPU as well. Our results compare both GPU-enabled decoders, as well as a CPU-only decoder in terms of speed, complexity, and CPU requirements. Our measurements show that a significant speedup is possible, relative to a CPU-only solution. As an example, real-time playback of high-definition video (1080p) was achieved with our Direct3D and CUDA-based H.264/AVC renderers

    A survey of new technology for cockpit application to 1990's transport aircraft simulators

    Get PDF
    Two problems were investigated: inter-equipment data transfer, both on board the aircraft and between air and ground; and crew equipment communication via the cockpit displays and controls. Inter-equipment data transfer is discussed in terms of data bus and data link requirements. Crew equipment communication is discussed regarding the availability of CRT display systems for use in research simulators to represent flat panel displays of the future, and of software controllable touch panels

    Exploring Processor and Memory Architectures for Multimedia

    Get PDF
    Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. When taking a projection of where we are heading, we see these issues becoming ever more challenging by increased mobility as well as advancements in multimedia content, such as introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution going from QVGA (320x240) to Full HD (1920x1080) a 27x increase in less than half a decade. On top of this, there is also codec evolution (MPEG-2 to H.264 AVC) that adds to the computational load increase. To meet these performance challenges there has been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves negative net result in-terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, there is a need for taking an overall approach on how to best utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal for this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. Secondly, propose an automated methodology of optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming for solving a multidimensional optimization problem. Results from this work indicate that using today’s most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis lower total power consumption can be achieved, whilst performance for multimedia applications is improved, by employing enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. This automatic method shows high accuracy, up to 90%, for predicting multimedia memory accesses for a given architecture

    Using image morphing for memory-efficient impostor rendering on GPU

    Get PDF
    Real-time rendering of large animated crowds consisting thousands of virtual humans is important for several applications including simulations, games and interactive walkthroughs; but cannot be performed using complex polygonal models at interactive frame rates. For that reason, several methods using large numbers of pre-computed image-based representations, which are called as impostors, have been proposed. These methods take the advantage of existing programmable graphics hardware to compensate the computational expense while maintaining the visual fidelity. Making the number of different virtual humans, which can be rendered in real-time, not restricted anymore by the required computational power but by the texture memory consumed for the variety and discretization of their animations. In this work, we proposed an alternative method that reduces the memory consumption by generating compelling intermediate textures using image-morphing techniques. In order to demonstrate the preserved perceptual quality of animations, where half of the key-frames were rendered using the proposed methodology, we have implemented the system using the graphical processing unit and obtained promising results at interactive frame rates

    Efficient reconfigurable architectures for 3D medical image compression

    Get PDF
    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Recently, the more widespread use of three-dimensional (3-D) imaging modalities, such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and ultrasound (US) have generated a massive amount of volumetric data. These have provided an impetus to the development of other applications, in particular telemedicine and teleradiology. In these fields, medical image compression is important since both efficient storage and transmission of data through high-bandwidth digital communication lines are of crucial importance. Despite their advantages, most 3-D medical imaging algorithms are computationally intensive with matrix transformation as the most fundamental operation involved in the transform-based methods. Therefore, there is a real need for high-performance systems, whilst keeping architectures exible to allow for quick upgradeability with real-time applications. Moreover, in order to obtain efficient solutions for large medical volumes data, an efficient implementation of these operations is of significant importance. Reconfigurable hardware, in the form of field programmable gate arrays (FPGAs) has been proposed as viable system building block in the construction of high-performance systems at an economical price. Consequently, FPGAs seem an ideal candidate to harness and exploit their inherent advantages such as massive parallelism capabilities, multimillion gate counts, and special low-power packages. The key achievements of the work presented in this thesis are summarised as follows. Two architectures for 3-D Haar wavelet transform (HWT) have been proposed based on transpose-based computation and partial reconfiguration suitable for 3-D medical imaging applications. These applications require continuous hardware servicing, and as a result dynamic partial reconfiguration (DPR) has been introduced. Comparative study for both non-partial and partial reconfiguration implementation has shown that DPR offers many advantages and leads to a compelling solution for implementing computationally intensive applications such as 3-D medical image compression. Using DPR, several large systems are mapped to small hardware resources, and the area, power consumption as well as maximum frequency are optimised and improved. Moreover, an FPGA-based architecture of the finite Radon transform (FRAT)with three design strategies has been proposed: direct implementation of pseudo-code with a sequential or pipelined description, and block random access memory (BRAM)- based method. An analysis with various medical imaging modalities has been carried out. Results obtained for image de-noising implementation using FRAT exhibits promising results in reducing Gaussian white noise in medical images. In terms of hardware implementation, promising trade-offs on maximum frequency, throughput and area are also achieved. Furthermore, a novel hardware implementation of 3-D medical image compression system with context-based adaptive variable length coding (CAVLC) has been proposed. An evaluation of the 3-D integer transform (IT) and the discrete wavelet transform (DWT) with lifting scheme (LS) for transform blocks reveal that 3-D IT demonstrates better computational complexity than the 3-D DWT, whilst the 3-D DWT with LS exhibits a lossless compression that is significantly useful for medical image compression. Additionally, an architecture of CAVLC that is capable of compressing high-definition (HD) images in real-time without any buffer between the quantiser and the entropy coder is proposed. Through a judicious parallelisation, promising results have been obtained with limited resources. In summary, this research is tackling the issues of massive 3-D medical volumes data that requires compression as well as hardware implementation to accelerate the slowest operations in the system. Results obtained also reveal a significant achievement in terms of the architecture efficiency and applications performance.Ministry of Higher Education Malaysia (MOHE), Universiti Tun Hussein Onn Malaysia (UTHM) and the British Counci

    Motion estimation for H.264/AVC on multiple GPUs using NVIDIA CUDA

    Get PDF
    To achieve the high coding efficiency the H.264/AVC standard offers, the encoding process quickly becomes computationally demanding. One of the most intensive encoding phases is motion estimation. Even modern CPUs struggle to process high-definition video sequences in real-time. While personal computers are typically equipped with powerful Graphics Processing Units (GPUs) to accelerate graphics operations, these GPUs lie dormant when encoding a video sequence. Furthermore, recent developments show more and more computer configurations come with multiple GPUs. However, no existing GPU-enabled motion estimation architectures target multiple GPUs. In addition, these architectures provide no early-out behavior nor can they enforce a specific processing order. We developed a motion search architecture, capable of executing motion estimation and partitioning for an H.264/AVC sequence entirely on the GPU using the NVIDIA CUDA (Compute Unified Device Architecture) platform. This paper describes our architecture and presents a novel job scheduling system we designed, making it possible to control the GPU in a flexible way. This job scheduling system can enforce real-time demands of the video encoder by prioritizing calculations and providing an early-out mode. Furthermore, the job scheduling system allows the use of multiple GPUs in one computer system and efficient load balancing of the motion search over these GPUs. This paper focuses on the execution speed of the novel job scheduling system on both single and multi-GPU systems. Initial results show that real-time full motion search of 720p high-definition content is possible with a 32 by 32 search window running on a system with four GPUs

    A survey of real-time crowd rendering

    Get PDF
    In this survey we review, classify and compare existing approaches for real-time crowd rendering. We first overview character animation techniques, as they are highly tied to crowd rendering performance, and then we analyze the state of the art in crowd rendering. We discuss different representations for level-of-detail (LoD) rendering of animated characters, including polygon-based, point-based, and image-based techniques, and review different criteria for runtime LoD selection. Besides LoD approaches, we review classic acceleration schemes, such as frustum culling and occlusion culling, and describe how they can be adapted to handle crowds of animated characters. We also discuss specific acceleration techniques for crowd rendering, such as primitive pseudo-instancing, palette skinning, and dynamic key-pose caching, which benefit from current graphics hardware. We also address other factors affecting performance and realism of crowds such as lighting, shadowing, clothing and variability. Finally we provide an exhaustive comparison of the most relevant approaches in the field.Peer ReviewedPostprint (author's final draft

    Three architectures for volume rendering

    Get PDF
    Volume rendering is a key technique in scientific visualization that lends itself to significant exploitable parallelism. The high computational demands of real-time volume rendering and continued technological advances in the area of VLSI give impetus to the development of special-purpose volume rendering architectures. This paper presents and characterizes three recently developed volume rendering engines which are based on the ray-casting method. A taxonomy of the algorithmic variants of ray-casting and details of each ray-casting architecture are discussed. The paper then compares the machine features and provides an outlook on future developments in the area of volume rendering hardware

    GPU High-Performance Framework for PIC-like Simulation Methods Using the Vulkan® Explicit API

    Get PDF
    Within computational continuum mechanics there exists a large category of simulation methods which operate by tracking Lagrangian particles over an Eulerian background grid. These Lagrangian/Eulerian hybrid methods, descendants of the Particle-In-Cell method (PIC), have proven highly effective at simulating a broad range of materials and mechanics including fluids, solids, granular materials, and plasma. These methods remain an area of active research after several decades, and their applications can be found across scientific, engineering, and entertainment disciplines. This thesis presents a GPU driven PIC-like simulation framework created using the Vulkan® API. Vulkan is a cross-platform and open-standard explicit API for graphics and GPU compute programming. Compared to its predecessors, Vulkan offers lower overhead, support for host parallelism, and finer grain control over both device resources and scheduling. This thesis harnesses those advantages to create a programmable GPU compute pipeline backed by a Vulkan adaptation of the SPgrid data-structure and multi-buffered particle arrays. The CPU host system works asynchronously with the GPU to maximize utilization of both the host and device. The framework is demonstrated to be capable of supporting Particle-in-Cell like simulation methods, making it viable for GPU acceleration of many Lagrangian particle on Eulerian grid hybrid methods. This novel framework is the first of its kind to be created using Vulkan® and to take advantage of GPU sparse memory features for grid sparsity
    corecore