421 research outputs found
High throughput image compression and decompression on GPUs
Diese Arbeit befasst sich mit der Entwicklung eines GPU-freundlichen, intra-only, Wavelet-basierten Videokompressionsverfahrens mit hohem Durchsatz, das für visuell verlustfreie Anwendungen optimiert ist. Ausgehend von der Beobachtung, dass der JPEG 2000 Entropie-Kodierer ein Flaschenhals ist, werden verschiedene algorithmische Änderungen vorgeschlagen und bewertet. Zunächst wird der JPEG 2000 Selective Arithmetic Coding Mode auf der GPU realisiert, wobei sich die Erhöhung des Durchsatzes hierdurch als begrenzt zeigt. Stattdessen werden zwei nicht standard-kompatible Änderungen vorgeschlagen, die (1) jede Bitebebene in nur einem einzelnen Pass verarbeiten (Single-Pass-Modus) und (2) einen echten Rohcodierungsmodus einführen, der sample-weise parallelisierbar ist und keine aufwendige Kontextmodellierung erfordert. Als nächstes wird ein alternativer Entropiekodierer aus der Literatur, der Bitplane Coder with Parallel Coefficient Processing (BPC-PaCo), evaluiert. Er gibt Signaladaptivität zu Gunsten von höherer Parallelität auf und daher wird hier untersucht und gezeigt, dass ein aus verschiedensten Testsequenzen gemitteltes statisches Wahrscheinlichkeitsmodell eine kompetitive Kompressionseffizienz erreicht. Es wird zudem eine Kombination von BPC-PaCo mit dem Single-Pass-Modus vorgeschlagen, der den Speedup gegenüber dem JPEG 2000 Entropiekodierer von 2,15x (BPC-PaCo mit zwei Pässen) auf 2,6x (BPC-PaCo mit Single-Pass-Modus) erhöht auf Kosten eines um 0,3 dB auf 1,0 dB erhöhten Spitzen-Signal-Rausch-Verhältnis (PSNR). Weiter wird ein paralleler Algorithmus zur Post-Compression Ratenkontrolle vorgestellt sowie eine parallele Codestream-Erstellung auf der GPU. Es wird weiterhin ein theoretisches Laufzeitmodell formuliert, das es durch Benchmarking von einer GPU ermöglicht die Laufzeit einer Routine auf einer anderen GPU vorherzusagen. Schließlich wird der erste JPEG XS GPU Decoder vorgestellt und evaluiert. JPEG XS wurde als Low Complexity Codec konzipiert und forderte erstmals explizit GPU-Freundlichkeit bereits im Call for Proposals. Ab Bitraten über 1 bpp ist der Decoder etwa 2x schneller im Vergleich zu JPEG 2000 und 1,5x schneller als der schnellste hier vorgestellte Entropiekodierer (BPC-PaCo mit Single-Pass-Modus). Mit einer GeForce GTX 1080 wird ein Decoder Durchsatz von rund 200 fps für eine UHD-4:4:4-Sequenz erreicht.This work investigates possibilities to create a high throughput, GPU-friendly, intra-only, Wavelet-based video compression algorithm optimized for visually lossless applications. Addressing the key observation that JPEG 2000’s entropy coder is a bottleneck and might be overly complex for a high bit rate scenario, various algorithmic alterations are proposed. First, JPEG 2000’s Selective Arithmetic Coding mode is realized on the GPU, but the gains in terms of an increased throughput are shown to be limited. Instead, two independent alterations not compliant to the standard are proposed, that (1) give up the concept of intra-bit plane truncation points and (2) introduce a true raw-coding mode that is fully parallelizable and does not require any context modeling. Next, an alternative block coder from the literature, the Bitplane Coder with Parallel Coefficient Processing (BPC-PaCo), is evaluated. Since it trades signal adaptiveness for increased parallelism, it is shown here how a stationary probability model averaged from a set of test sequences yields competitive compression efficiency. A combination of BPC-PaCo with the single-pass mode is proposed and shown to increase the speedup with respect to the original JPEG 2000 entropy coder from 2.15x (BPC-PaCo with two passes) to 2.6x (proposed BPC-PaCo with single-pass mode) at the marginal cost of increasing the PSNR penalty by 0.3 dB to at most 1 dB. Furthermore, a parallel algorithm is presented that determines the optimal code block bit stream truncation points (given an available bit rate budget) and builds the entire code stream on the GPU, reducing the amount of data that has to be transferred back into host memory to a minimum. A theoretical runtime model is formulated that allows, based on benchmarking results on one GPU, to predict the runtime of a kernel on another GPU. Lastly, the first ever JPEG XS GPU-decoder realization is presented. JPEG XS was designed to be a low complexity codec and for the first time explicitly demanded GPU-friendliness already in the call for proposals. Starting at bit rates above 1 bpp, the decoder is around 2x faster compared to the original JPEG 2000 and 1.5x faster compared to JPEG 2000 with the fastest evaluated entropy coder (BPC-PaCo with single-pass mode). With a GeForce GTX 1080, a decoding throughput of around 200 fps is achieved for a UHD 4:4:4 sequence
Recommended from our members
Automotive embedded systems software reprogramming
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel UniversityThe exponential growth of computer power is no longer limited to stand alone computing systems but applies to all areas of commercial embedded computing systems. The ongoing rapid growth in intelligent embedded systems is visible in the commercial automotive area, where a modern car today implements up to 80 different electronic control units (ECUs) and their total memory size has been increased to several hundreds of megabyte.
This growth in the commercial mass production world has led to new challenges, even within the automotive industry but also in other business areas where cost pressure is high. The need to drive cost down means that every cent spent on recurring engineering costs needs to be justified. A conflict between functional requirements (functionality, system reliability, production and manufacturing aspects etc.), testing and maintainability aspects is given.
Software reprogramming, as a key issue within the automotive industry, solve that given conflict partly in the past. Software Reprogramming for in-field service and maintenance in the after sales markets provides a strong method to fix previously not identified software errors. But the increasing software sizes and therefore the increasing software reprogramming times will reduce the benefits. Especially if ECU’s software size growth faster than vehicle’s onboard infrastructure can be adjusted.
The thesis result enables cost prediction of embedded systems’ software reprogramming by generating an effective and reliable model for reprogramming time for different existing and new technologies. This model and additional research results contribute to a timeline for short term, mid term and long term solutions which will solve the currently given problems as well as future challenges, especially for the automotive industry but also for all other business areas where cost pressure is high and software reprogramming is a key issue during products life cycle
The Design and modeling of input and output modules for an ATM network switch
The purpose of this thesis is to design, model, and simulate both an input and an output module for an ATM network switch. These devices are used to interface an ATM switch with the physical protocol that is transporting data along the actual transmission medium. The I/O modules have been designed specifically to interface with the Synchronous Optical Network (SONET) protocol. This thesis studies the ATM protocol and examines the issues involved with designing an ATM I/O module chipset. A model of the design was then implemented in both C++ and \TTDL. These models were simulated in order to verify functionality and document performance. The intent of this work is to provide the background and models necessary to aid in the further study and development of entire ATM switch architectures. The input and output modules .ire onlv two functional pieces of a complete ATM switch. The software models that have been implemented by this thesis can be integrated with the other necessary functional blocks to form a complete model of a working ATM switch. These functional blocks can then be rearranged and altered to assist in the study of how different switch architectures can effect overall network performance and efficiency. The input and output modules have been designed to be as flexible as possible in order to easily adapt to future modifications
Networking DEC and IBM computers
Local Area Networking of DEC and IBM computers within the structure of the ISO-OSI Seven Layer Reference Model at a raw signaling speed of 1 Mops or greater are discussed. After an introduction to the ISO-OSI Reference Model nd the IEEE-802 Draft Standard for Local Area Networks (LANs), there follows a detailed discussion and comparison of the products available from a variety of manufactures to perform this networking task. A summary of these products is presented in a table
Novel algorithms for fair bandwidth sharing on counter rotating rings
Rings are often preferred technology for networks as ring networks can virtually create fully connected mesh networks efficiently and they are also easy to manage. However, providing fair service to all the stations on the ring is not always easy to achieve.
In order to capitalize on the advantages of ring networks, new buffer insertion techniques, such as Spatial Reuse Protocol (SRP), were introduced in early 2000s. As a result, a new standard known as IEEE 802.17 Resilient Packet Ring was defined in 2004 by the IEEE Resilient Packet Ring (RPR) Working Group. Since then two addenda have been introduced; namely, IEEE 802.17a and IEEE 802.17b in 2006 and 2010, respectively. During this standardization process, weighted fairness and queue management schemes were proposed to be used in the standard. As shown in this dissertation, these schemes can be applied to solve the fairness issues noted widely in the research community as radical changes are not practical to introduce within the context of a standard.
In this dissertation, the weighted fairness aspects of IEEE 802.17 RPR (in the aggressive mode of operation) are studied; various properties are demonstrated and observed via network simulations, and additional improvements are suggested. These aspects have not been well studied until now, and can be used to alleviate some of the issues observed in the fairness algorithm under some scenarios. Also, this dissertation focuses on the RPR Medium Access Control (MAC) Client implementation of the IEEE 802.17 RPR MAC in the aggressive mode of operation and introduces a new active queue management scheme for ring networks that achieves higher overall utilization of the ring bandwidth with simpler and less expensive implementation than the generic implementation provided in the standard. The two schemes introduced in this dissertation provide performance comparable to the per destination queuing implementation, which yields the best achievable performance at the expense of the cost of implementation. In addition, till now the requirements for sizing secondary transit queue of IEEE 802.17 RPR stations (in the aggressive mode of operation) have not been properly investigated. The analysis and suggested improvements presented in this dissertation are then supported by performance evaluation results and theoretical calculations. Last, but not least, the impact of using different capacity links on the same ring has not been investigated before from the ring utilization and fairness points of view. This dissertation also investigates utilizing different capacity links in RPR and proposes a mechanism to support the same
FDMA Point-to-Multi-Point Fibre Access System for Latency Sensitive Applications
We present a demo for a multiple uplink access system with real-time services. Several terminals transmit and are detected simultaneously through FDMA. The system can allow latency-sensitive and best-effort applications to share the network
- …