203 research outputs found

    Ultra high definition video decoding with motion JPEG XR using the GPU

    Many applications require real-time decoding of high-resolution video pictures, for example, quick editing of video sequences in video editing applications. Parallelism can be exploited to increase decoding speed, yet block-based image and video coding standards are difficult to decode in parallel because of the high number of dependencies between blocks. This paper investigates the parallel decoding capabilities of the new JPEG XR image coding standard for use on the massively parallel architecture of the GPU. The potential for parallelism in the hierarchical frequency coding scheme used in the standard is addressed, and a parallel decoding scheme is described that is suitable for real-time decoding of Ultra High Definition (4320p) Motion JPEG XR video sequences. Our results show a decoding speed of up to 46 frames per second for Ultra High Definition (4320p) sequences with high-dynamic-range (32-bit/4:2:0) luma and chroma components.
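
For a rough sense of scale, the sketch below works out the sample rate and per-frame time budget implied by the reported 46 frames per second. It assumes that "4320p" means a 7680 x 4320 frame; the abstract gives only the line count, so the width is an assumption, and the function names are illustrative.

```python
# Assumption: "4320p" is taken to be 7680 x 4320; the abstract states only the line count.
WIDTH = 7680
HEIGHT = 4320
FPS = 46  # peak decoding speed reported in the abstract


def pixel_rate(width: int, height: int, fps: float) -> float:
    """Return the number of luma samples decoded per second."""
    return width * height * fps


def frame_budget_ms(fps: float) -> float:
    """Return the time available to decode one frame, in milliseconds."""
    return 1000.0 / fps


rate = pixel_rate(WIDTH, HEIGHT, FPS)
budget = frame_budget_ms(FPS)
print(f"{rate / 1e9:.2f} Gsamples/s, {budget:.1f} ms per frame")
```

Under that assumption the decoder has to sustain roughly 1.5 billion luma samples per second, with a little under 22 ms available per frame.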

    Analysis and Performance Optimization of a GPGPU Implementation of Image Quality Assessment (IQA) Algorithm VSNR

    Image processing has changed the way we store, view and share images. One important component of sharing images over networks is image compression. Lossy image compression techniques compromise the quality of images to reduce their size. To ensure that the distortion introduced by compression is not readily detectable by humans, the perceived quality of an image needs to be maintained above a certain threshold. Determining this threshold is best done using human subjects, but that is impractical in real-world scenarios. As a solution, image quality assessment (IQA) algorithms are used to automatically compute a fidelity score for an image. However, IQA algorithms have been observed to run slowly because of the complex statistical computations involved. General Purpose Graphics Processing Unit (GPGPU) programming is one of the solutions proposed to optimize the performance of these algorithms. This thesis presents an optimized Compute Unified Device Architecture (CUDA) implementation of the full-reference IQA algorithm Visual Signal to Noise Ratio (VSNR), which uses an M-level 2D Discrete Wavelet Transform (DWT) with 9/7 biorthogonal filters among other statistical computations. The presented implementation is tested on four different image quality databases containing images with multiple distortions and sizes ranging from 512 x 512 to 1600 x 1280. The CUDA implementation of VSNR shows a speedup of over 32x for 1600 x 1280 images, and the speedup is observed to scale with image size. The results show that the implementation is fast enough to use VSNR on high-definition videos with a frame rate of 60 fps. This work presents the optimizations obtained from using the GPU's constant memory and reusing allocated memory on the GPU. It also shows the performance improvement gained from profiler-driven GPGPU development in CUDA. 
The presented implementation can be deployed in production in combination with existing applications. Dissertation/Thesis; Masters Thesis, Computer Science 201

    Algorithms for compression of high dynamic range images and video

    Recent advances in sensor and display technologies have brought about High Dynamic Range (HDR) imaging capability. Modern multiple-exposure HDR sensors can achieve a dynamic range of 100-120 dB, and LED and OLED display devices have contrast ratios of 10^5:1 to 10^6:1. Despite these advances, image/video compression algorithms and the associated hardware are still based on Standard Dynamic Range (SDR) technology, i.e. they operate within an effective dynamic range of up to 70 dB for 8-bit gamma-corrected images. Furthermore, the existing infrastructure for content distribution is also designed for SDR, which creates interoperability problems with true HDR capture and display equipment. Current solutions to this problem include tone mapping the HDR content to fit SDR. However, this approach leads to image quality problems when strong dynamic range compression is applied. Even though some HDR-only solutions have been proposed in the literature, they are not interoperable with current SDR infrastructure and are thus typically used in closed systems. Given these observations, a research gap was identified: the need for efficient algorithms for the compression of still images and video that are capable of storing the full dynamic range and colour gamut of HDR images while remaining backward compatible with existing SDR infrastructure. To improve the usability of SDR content, it is vital that any such algorithms accommodate different tone mapping operators, including those that are spatially non-uniform. In the course of the research presented in this thesis, a novel two-layer CODEC architecture is introduced for both HDR image and video coding. Furthermore, a universal and computationally efficient approximation of the tone mapping operator is developed and presented. 
It is shown that the use of perceptually uniform colourspaces for the internal representation of pixel data improves the compression efficiency of the algorithms. Furthermore, the proposed novel approaches to the compression of metadata for the tone mapping operator are shown to improve compression performance for low-bitrate video content. Multiple compression algorithms are designed, implemented and compared, and quality-complexity trade-offs are identified. Finally, practical aspects of implementing the developed algorithms are explored by automating the design space exploration flow and integrating the high-level systems design framework with domain-specific tools for synthesis and simulation of multiprocessor systems. Directions for further work are also presented.
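
The decibel figures and contrast ratios quoted in this abstract can be related with a short conversion. The sketch below assumes the common 20·log10 convention for expressing a contrast ratio in decibels; the abstract does not state which convention it uses, and the function names are illustrative.

```python
import math


def contrast_ratio_to_db(ratio: float) -> float:
    """Convert a contrast ratio (e.g. 1e5 for 10^5:1) to decibels.

    Assumption: dynamic range in dB is 20 * log10(ratio).
    """
    return 20.0 * math.log10(ratio)


def db_to_contrast_ratio(db: float) -> float:
    """Inverse of contrast_ratio_to_db under the same assumed convention."""
    return 10.0 ** (db / 20.0)


for ratio in (1e5, 1e6):
    print(f"{ratio:.0e}:1 is {contrast_ratio_to_db(ratio):.0f} dB")
print(f"70 dB is about {db_to_contrast_ratio(70):.0f}:1")
```

Under this convention, display contrast ratios of 10^5:1 to 10^6:1 correspond to 100 to 120 dB, the same range quoted for the sensors, while 70 dB corresponds to a ratio of roughly 3200:1.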

    Bitplane image coding with parallel coefficient processing

    Image coding systems have traditionally been tailored for multiple instruction, multiple data (MIMD) computing. In general, they partition the (transformed) image into codeblocks that can be coded in the cores of MIMD-based processors. Each core executes a sequential flow of instructions to process the coefficients in the codeblock, independently and asynchronously from the other cores. Bitplane coding is a common strategy to code such data. Most of its mechanisms require sequential processing of the coefficients. Recent years have seen the rise of processing accelerators with enhanced computational performance and power efficiency whose architecture is mainly based on the single instruction, multiple data (SIMD) principle. SIMD computing refers to the execution of the same instruction on multiple data in a lockstep, synchronous way. Unfortunately, current bitplane coding strategies cannot fully profit from such processors because their coding tasks are inherently sequential. This paper presents bitplane image coding with parallel coefficient (BPC-PaCo) processing, a coding method that can process many coefficients within a codeblock in parallel and synchronously. To this end, the scanning order, the context formation, the probability model, and the arithmetic coder of the coding engine have been reformulated. The experimental results suggest that the penalty in coding performance of BPC-PaCo with respect to traditional strategies is almost negligible.
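
To make the term concrete, the sketch below shows only the bitplane decomposition that such coders operate on: coefficient magnitudes are split into planes of single bits, most significant plane first. It is a minimal illustration with hypothetical function names, and it does not reproduce the scanning order, context formation, probability model or arithmetic coder of BPC-PaCo.

```python
def bitplanes(magnitudes, num_planes):
    """Split non-negative integer magnitudes into bitplanes, most significant first."""
    planes = []
    for p in range(num_planes - 1, -1, -1):
        # The same shift-and-mask is applied to every coefficient in the plane.
        planes.append([(m >> p) & 1 for m in magnitudes])
    return planes


def reconstruct(planes):
    """Rebuild the magnitudes by accumulating the planes from most to least significant."""
    values = [0] * len(planes[0])
    for plane in planes:
        values = [(v << 1) | b for v, b in zip(values, plane)]
    return values


coefficients = [5, 3, 0, 6]  # hypothetical magnitudes from one codeblock
planes = bitplanes(coefficients, num_planes=3)
print(planes)
print(reconstruct(planes))
```

Extracting one plane applies an identical operation to every coefficient, which is the kind of step that maps naturally onto SIMD lanes. The sequential dependencies described in the abstract arise in the later coding stages, which this sketch leaves out.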

    DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training

    A standard hardware bottleneck when training deep neural networks is GPU memory. The bulk of memory is occupied by caching intermediate tensors for gradient computation in the backward pass. We propose a novel method to reduce this footprint: Dropping Intermediate Tensors (DropIT). DropIT drops min-k elements of the intermediate tensors and approximates gradients from the sparsified tensors in the backward pass. Theoretically, DropIT reduces noise on estimated gradients and therefore has a higher rate of convergence than vanilla-SGD. Experiments show that we can drop up to 90% of the intermediate tensor elements in fully-connected and convolutional layers while achieving higher testing accuracy for Visual Transformers and Convolutional Neural Networks on various tasks (e.g. classification, object detection). Our code and models are available at https://github.com/chenjoya/dropit. Comment: 16 pages. DropIT can save memory & improve accuracy, providing a new perspective of dropping in activation compressed training than quantizatio
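
The dropping step can be pictured with a small sketch. It assumes that "min-k" means the k elements with the smallest absolute value, which is an interpretation of the abstract rather than a statement of the authors' method; the function name `drop_min_k` is hypothetical, and the gradient approximation in the backward pass is not shown. The authors' actual implementation is in the repository linked in the abstract.

```python
def drop_min_k(values, drop_fraction):
    """Zero out the smallest-magnitude entries of a flat list of activations.

    Assumption: "min-k" is read as the k entries with the smallest absolute value.
    """
    n_drop = int(len(values) * drop_fraction)
    if n_drop <= 0:
        return list(values)
    # Indices ordered by absolute value; the first n_drop are the ones dropped.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else v for i, v in enumerate(values)]


activations = [0.1, -2.0, 0.05, 3.0]  # hypothetical cached intermediate values
print(drop_min_k(activations, 0.5))
```

In a real training setup only the surviving entries and their positions would need to be stored, which is where the memory saving would come from. This sketch keeps a dense list for readability.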

    Distributed Implementation of eXtended Reality Technologies over 5G Networks

    International Mention in the doctoral degree. The revolution of Extended Reality (XR) has already started and is rapidly expanding as technology advances. Announcements such as Meta’s Metaverse have boosted the general interest in XR technologies, producing novel use cases. With the advent of the fifth generation of cellular networks (5G), XR technologies are expected to improve significantly by offloading heavy computational processes from the XR Head Mounted Display (HMD) to an edge server. XR offloading can rapidly boost XR technologies by considerably reducing the burden on the XR hardware, while improving the overall user experience by enabling smoother graphics and more realistic interactions. Overall, the combination of XR and 5G has the potential to revolutionize the way we interact with technology and experience the world around us. However, XR offloading is a complex task that requires state-of-the-art tools and solutions, as well as an advanced wireless network that can meet the demanding throughput, latency, and reliability requirements of XR. The definition of these requirements strongly depends on the use case and particular XR offloading implementations. Therefore, it is crucial to perform a thorough Key Performance Indicators (KPIs) analysis to ensure a successful design of any XR offloading solution. Additionally, distributed XR implementations can be intricate systems with multiple processes running on different devices or virtual instances. All these agents must be well-handled and synchronized to achieve XR real-time requirements and ensure the expected user experience, guaranteeing a low processing overhead. XR offloading requires a carefully designed architecture which complies with the required KPIs while efficiently synchronizing and handling multiple heterogeneous devices. Offloading XR has become an essential use case for 5G and beyond 5G technologies. 
However, testing distributed XR implementations requires access to advanced 5G deployments that are often unavailable to most XR application developers. Conversely, the development of 5G technologies requires constant feedback from potential applications and use cases. Unfortunately, most 5G providers, engineers, or researchers lack access to cutting-edge XR hardware or applications, which can hinder the fast implementation and improvement of 5G’s most advanced features. Both technology fields require ongoing input and continuous development from each other to fully realize their potential. As a result, XR and 5G researchers and developers must have access to the necessary tools and knowledge to ensure the rapid and satisfactory development of both technology fields. In this thesis, we focus on these challenges, providing knowledge, tools and solutions towards the implementation of advanced offloading technologies, opening the door to more immersive, comfortable and accessible XR technologies. Our contributions to the field of XR offloading include a detailed study and description of the necessary network throughput and latency KPIs for XR offloading, an architecture for low latency XR offloading and our full end to end XR offloading implementation ready for a commercial XR HMD. Besides, we also present a set of tools which can facilitate the joint development of 5G networks and XR offloading technologies: our 5G RAN real-time emulator and a multi-scenario XR IP traffic dataset. Firstly, in this thesis, we thoroughly examine and explain the KPIs that are required to achieve the expected Quality of Experience (QoE) and enhanced immersiveness in XR offloading solutions. Our analysis focuses on individual XR algorithms, rather than potential use cases. Additionally, we provide an initial description of feasible 5G deployments that could fulfill some of the proposed KPIs for different offloading scenarios. 
We also present our low latency multi-modal XR offloading architecture, which has already been tested on a commercial XR device and advanced 5G deployments, such as millimeter-wave (mmW) technologies. Besides, we describe our full end-to-end complex XR offloading system which relies on our offloading architecture to provide low latency communication between a commercial XR device and a server running a Machine Learning (ML) algorithm. To the best of our knowledge, this is one of the first successful XR offloading implementations for complex ML algorithms in a commercial device. With the goal of providing XR developers and researchers access to complex 5G deployments and accelerating the development of future XR technologies, we present FikoRE, our 5G RAN real-time emulator. FikoRE has been specifically designed not only to model the network with sufficient accuracy but also to support the emulation of a massive number of users and actual IP throughput. As FikoRE can handle actual IP traffic above 1 Gbps, it can directly be used to test distributed XR solutions. As we describe in the thesis, its emulation capabilities make FikoRE a potential candidate to become a reference testbed for distributed XR developers and researchers. Finally, we used our XR offloading tools to generate an XR IP traffic dataset which can accelerate the development of 5G technologies by providing a straightforward manner for testing novel 5G solutions using realistic XR data. This dataset is generated for two relevant XR offloading scenarios: split rendering, in which the rendering step is moved to an edge server, and heavy ML algorithm offloading. Besides, we derive the corresponding IP traffic models from the captured data, which can be used to generate realistic XR IP traffic. 
We also present the validation experiments performed on the derived models and their results. This work has received funding from the European Union (EU) Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie ETN TeamUp5G, grant agreement No. 813391. Doctoral Programme in Multimedia and Communications of Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. President: Narciso García Santos. Secretary: Fernando Díaz de María. Member: Aryan Kaushi

    Strategy of microscopic parallelism for Bitplane Image Coding

    Recent years have seen the rise of a new type of processor that relies strongly on the Single Instruction, Multiple Data (SIMD) architectural principle. The main idea behind SIMD computing is to apply a flow of instructions to multiple pieces of data in parallel and synchronously. This permits the execution of thousands of operations in parallel, achieving higher computational performance than with traditional Multiple Instruction, Multiple Data (MIMD) architectures. The level of parallelism required in SIMD computing can only be achieved in image coding systems via microscopic parallel strategies that code multiple coefficients in parallel. Until now, the only way to achieve microscopic parallelism in bitplane coding engines was by executing multiple coding passes in parallel. Such a strategy is not well suited to SIMD computing because each thread executes different instructions. This paper introduces the first bitplane coding engine devised for the fine-grained parallelism required in SIMD computing. Its main insight is to allow parallel coefficient processing within a coding pass. Experimental tests show coding performance similar to that of JPEG2000.