
    High throughput image compression and decompression on GPUs

    This work investigates possibilities to create a high-throughput, GPU-friendly, intra-only, wavelet-based video compression algorithm optimized for visually lossless applications. Addressing the key observation that JPEG 2000's entropy coder is a bottleneck and may be overly complex for high-bit-rate scenarios, various algorithmic alterations are proposed and evaluated. First, JPEG 2000's Selective Arithmetic Coding mode is realized on the GPU, but the resulting throughput gains are shown to be limited. Instead, two independent alterations that are not compliant with the standard are proposed: (1) process each bit plane in a single pass (single-pass mode), giving up intra-bit-plane truncation points, and (2) introduce a true raw-coding mode that is parallelizable per sample and does not require any context modeling. Next, an alternative block coder from the literature, the Bitplane Coder with Parallel Coefficient Processing (BPC-PaCo), is evaluated. Since it trades signal adaptiveness for increased parallelism, it is shown here that a stationary probability model averaged over a set of test sequences yields competitive compression efficiency.
A combination of BPC-PaCo with the single-pass mode is proposed and shown to increase the speedup over the original JPEG 2000 entropy coder from 2.15x (BPC-PaCo with two passes) to 2.6x (BPC-PaCo with single-pass mode), at the marginal cost of increasing the PSNR penalty by 0.3 dB to at most 1.0 dB. Furthermore, a parallel algorithm is presented that determines the optimal code block bit stream truncation points (given an available bit rate budget) and builds the entire code stream on the GPU, reducing the amount of data that has to be transferred back to host memory to a minimum. A theoretical runtime model is formulated that, based on benchmarking results from one GPU, predicts the runtime of a kernel on another GPU. Lastly, the first JPEG XS GPU decoder realization is presented and evaluated. JPEG XS was designed to be a low-complexity codec and was the first standard to explicitly demand GPU-friendliness already in its call for proposals. At bit rates above 1 bpp, the decoder is around 2x faster than the original JPEG 2000 and 1.5x faster than JPEG 2000 with the fastest evaluated entropy coder (BPC-PaCo with single-pass mode). On a GeForce GTX 1080, a decoding throughput of around 200 fps is achieved for a UHD 4:4:4 sequence.
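    To make the raw-coding idea concrete, here is a minimal, hypothetical sketch (not the thesis implementation; the function name and data layout are illustrative assumptions) of a context-free bit-plane raw coder: every coefficient emits its sign and magnitude bits independently, which is what makes a one-thread-per-sample GPU mapping possible.

```python
import numpy as np

def raw_code_block(coeffs, num_planes):
    """Hypothetical sketch of a context-free raw-coding mode: every
    coefficient emits its sign and magnitude bits independently, so a
    GPU can map one thread per sample with no serial dependencies and
    no context modeling."""
    signs = (coeffs < 0).astype(np.uint8)             # one sign bit per sample
    mags = np.abs(coeffs).astype(np.uint32)
    planes = [((mags >> p) & 1).astype(np.uint8)      # MSB plane first
              for p in range(num_planes - 1, -1, -1)]
    return signs, planes

# Example: a tiny 4-sample "code block" with 4 magnitude bit planes
signs, planes = raw_code_block(np.array([3, -5, 0, 7]), num_planes=4)
print(signs)    # [0 1 0 0]
print(planes)   # bit planes from MSB to LSB, each an array of 4 bits
```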

    Underwater radio frequency image sensor using progressive image compression and region of interest

    The increasing demand for underwater robotic intervention systems around the world, across several application domains, calls for more versatile and inexpensive systems. By using a wireless communication system, supervised semi-autonomous robots gain freedom of movement; however, the limited and varying bandwidth of underwater radio frequency (RF) channels is a major obstacle for the operator to receive camera feedback and supervise the intervention. This paper proposes the use of progressive (embedded) image compression and a region of interest (ROI) for the design of an underwater image sensor to be installed in an autonomous underwater vehicle, especially when the available bandwidth is constrained, allowing a more agile data exchange between the vehicle and a human operator supervising the underwater intervention. The operator can dynamically decide the size, quality, frame rate, or resolution of the received images so that the available bandwidth is used to its fullest potential with the minimum required latency. The paper first describes the system, which uses a camera, an embedded Linux system, and an RF emitter installed in an OpenROV housing cylinder. The RF receiver is connected to a computer on the user side, which controls the camera monitoring parameters, including the compression inputs such as the ROI, image size, and frame rate. The paper then focuses on the compression subsystem and does not attempt to improve the physical communication medium for better underwater RF links. Instead, it proposes a unified system of well-integrated compression and transmission modules that provides the scientific community with a higher-level protocol for image compression and transmission in sub-sea robotic interventions.
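    As an illustration of the ROI-priority idea on an embedded codestream, the following is a hypothetical sketch (not the paper's protocol; the layer representation and function name are assumptions): quality layers covering the ROI are transmitted first and the stream is truncated at the operator's bandwidth budget, which an embedded bitstream tolerates by construction.

```python
def schedule_layers(roi_layers, bg_layers, byte_budget):
    """Hypothetical scheduler for a progressive (embedded) codestream:
    quality layers covering the operator's ROI are sent before
    background layers, and transmission stops at the bandwidth budget.
    Each layer is (name, payload_bytes); a real system would interleave
    codestream packets instead of whole named layers."""
    sent, used = [], 0
    for name, size in roi_layers + bg_layers:   # ROI gets strict priority
        if used + size > byte_budget:
            break                               # embedded stream: safe to truncate
        sent.append(name)
        used += size
    return sent, used

# Example: 3 ROI layers and 2 background layers under a 10 kB budget
roi = [("roi_L0", 2000), ("roi_L1", 3000), ("roi_L2", 4000)]
bg = [("bg_L0", 5000), ("bg_L1", 8000)]
print(schedule_layers(roi, bg, 10_000))   # -> (['roi_L0', 'roi_L1', 'roi_L2'], 9000)
```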

    Parallel Implementation of the H.265 Video Coding Standard

    The objective of this study was to research the scalability of the parallel features in the new H.265 video compression standard, also known as High Efficiency Video Coding (HEVC). Compared to its predecessor, the H.264 standard, H.265 typically achieves around 50% bitrate reduction for the same subjective video quality; videos with higher resolution (Full HD and beyond) in particular achieve better compression ratios. The standard also provides better utilization of parallel computing resources. H.265 introduces two novel parallelization features: Tiles and Wavefront Parallel Processing (WPP). With Tiles, each video frame is divided into areas that can be decoded without referencing other areas in the same frame. With WPP, the dependencies between code blocks in a frame are encoded so that the decoding process can progress through the frame as a front, using multiple threads. In this study, the reference implementation of the H.265 decoder was augmented to support both of these parallelization features, and the performance of the parallel implementations was measured using three different setups. The measurements showed that adding more CPU cores reduced the total decode time of the video frames up to a certain point. When using the Tiles feature, the encoding geometry, i.e. how each frame was divided into individually decodable areas, had a noticeable effect on the decode times at certain thread counts. When using WPP, synchronization overhead sometimes had a negative effect on the decode times at larger thread counts (4-12).
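    The WPP dependency pattern can be made concrete with a small sketch (illustrative, not taken from the reference decoder): a coding tree unit (CTU) can be processed once its left and top-right neighbours are done, so each CTU row trails the row above by two columns, which bounds the usable parallelism and explains why extra threads stop helping.

```python
def wavefront_schedule(rows, cols):
    """Sketch of the WPP dependency pattern: CTU (r, c) needs its left
    neighbour (r, c-1) and top-right neighbour (r-1, c+1), so CTU (r, c)
    becomes ready at parallel step c + 2*r. Returns, per step, the CTUs
    that may be decoded concurrently."""
    max_step = (cols - 1) + 2 * (rows - 1)
    return [[(r, s - 2 * r) for r in range(rows) if 0 <= s - 2 * r < cols]
            for s in range(max_step + 1)]

# A 4x6 CTU frame: at most min(rows, (cols + 1) // 2) = 3 CTUs run in parallel
for step, ctus in enumerate(wavefront_schedule(4, 6)):
    print(step, ctus)
```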

    3rd Many-core Applications Research Community (MARC) Symposium (KIT Scientific Reports; 7598)

    This manuscript includes recent scientific work regarding the Intel Single-Chip Cloud Computer and describes novel approaches to programming and run-time organization.

    Implementation of BMA based motion estimation hardware accelerator in HDL

    Motion estimation in MPEG (Moving Picture Experts Group) video is a temporal prediction technique. Its basic principle is that, in most cases, consecutive video frames are similar except for changes induced by objects moving within the frames. Motion estimation performs a comprehensive 2-dimensional spatial search for each luminance macroblock (16x16 pixel block). MPEG does not define how this search should be performed; this is a detail the system designer can implement in one of many possible ways. It is well known that a full, exhaustive search over a wide 2-dimensional area yields the best matching results in most cases, but this performance comes at an extreme computational cost to the encoder. Lower-cost encoders may limit the pixel search range or use other techniques, usually at some cost to video quality, which gives rise to a trade-off; such image-processing algorithms are generally computationally expensive. FPGAs are capable of running graphics algorithms at speeds comparable to dedicated graphics chips, while remaining configurable through hardware description languages such as Verilog and VHDL. The work presented here focuses entirely on a hardware accelerator for motion estimation based on a block matching algorithm. The SAD-based full-search motion estimation, coded in Verilog HDL, relies on a 32x32 pixel search area to find the best match for a single 16x16 macroblock. Keywords: Motion Estimation, MPEG, macroblock, FPGA, SAD, Verilog, VHDL
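    A software reference model of the described SAD-based full search (an illustration only, not the HDL design) could look as follows: exhaustively evaluate every 16x16 candidate position inside a 32x32 search window and keep the motion vector with the minimum sum of absolute differences.

```python
import numpy as np

def full_search_sad(ref, cur_block, top, left):
    """Reference model of full-search block matching: slide the 16x16
    current block over a 32x32 search window in the reference frame
    (anchored at (top, left)) and keep the candidate with the minimum
    sum of absolute differences (SAD). A hardware accelerator would
    evaluate many candidates in parallel instead of looping."""
    best_sad, best_mv = None, None
    for dy in range(32 - 16 + 1):            # 17 vertical candidate offsets
        for dx in range(32 - 16 + 1):        # 17 horizontal candidate offsets
            cand = ref[top + dy : top + dy + 16, left + dx : left + dx + 16]
            sad = int(np.abs(cand.astype(np.int32) - cur_block.astype(np.int32)).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# Example: a block copied from a known offset is recovered with zero SAD
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)
cur = ref[10:26, 12:28]                          # "moved" block
print(full_search_sad(ref, cur, top=8, left=8))  # -> ((2, 4), 0)
```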

    Reliable Design of Three-Dimensional Integrated Circuits


    Extensions to Base-Delta-Immediate Compression

    Advisor: Rodolfo Jardim de Azevedo. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Cache memories have long been used to reduce problems deriving from the memory-processor performance discrepancy: many levels of on-chip cache reduce the average memory latency at the cost of extra die area and power. To decrease the outlay of these extra components, cache compression techniques are used to store compressed data and allow a boost in cache capacity. This project introduces extensions to Base-Delta-Immediate (BDI) compression: several modifications of the original technique that minimize the number of padding bits in a compression by relaxing the allowed delta sizes for each base and by increasing the number of bases. The extensions were tested using ZSim and evaluated against state-of-the-art methods, and the performance results were compared to determine the validity of the proposed techniques. The average compression factor improved from 1.37x (original BDI) to 1.58x, at an energy increase as low as 27%.
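    For context, here is a minimal sketch of one Base+Delta compressibility check (illustrative only; BDI proper tries several base/delta width combinations plus an implicit zero base, and the extensions described above vary the delta width per base and add more bases):

```python
def bdi_compressible(line_words, delta_bytes):
    """Sketch of one Base+Delta check from BDI: take the first word as
    the base and test whether every word in the cache line fits in a
    signed delta of delta_bytes bytes from it. If so, the line can be
    stored as one base plus small per-word deltas."""
    base = int(line_words[0])
    lo = -(1 << (8 * delta_bytes - 1))
    hi = (1 << (8 * delta_bytes - 1)) - 1
    return all(lo <= int(w) - base <= hi for w in line_words)

# A 32-byte line as four 8-byte words: nearby pointers compress well
line = [0x7FFF000000001000, 0x7FFF000000001010,
        0x7FFF000000001020, 0x7FFF000000001030]
print(bdi_compressible(line, delta_bytes=1))   # True: one 8-byte base + 1-byte deltas
```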