Search CORE

37,560 research outputs found

Sample-Parallel Execution of EBCOT in Fast Mode

Author: Bruns Volker
Martínez del Amor Miguel Ángel
Publication venue: IEEE Computer Society
Publication date: 01/01/2016
Field of study

JPEG 2000’s most computationally expensive building block is the Embedded Block Coder with Optimized Truncation (EBCOT). This paper evaluates how encoders targeting a parallel architecture such as a GPU can increase their throughput in use cases where very high data rates are used. The compression efficiency in the less significant bit-planes is then often poor and it is beneficial to enable the Selective Arithmetic Coding Bypass style (fast mode) in order to trade a small loss in compression efficiency for a reduction of the computational complexity. More importantly, this style exposes a more finely grained parallelism that can be exploited to execute the raw coding passes, including bit-stuffing, in a sample-parallel fashion. For a latency- or memory critical application that encodes one frame at a time, EBCOT’s tier-1 is sped up between 1.1x and 2.4x compared to an optimized GPU-based implementation. When a low GPU occupancy has already been addressed by encoding multiple frames in parallel, the throughput can still be improved by 5% for high-entropy images and 27% for low-entropy images. Best results are obtained when enabling the fast mode after the fourth significant bit-plane. For most of the test images the compression rate is within 1% of the original

idUS. Depósito de Investigación Universidad de Sevilla

A very high speed lossless compression/decompression chip set

Author: Liu Kathy
Liu Norley
Merrell Randy
Venbrux Jack
Vincent Peter
Publication venue
Publication date
Field of study

A chip is described that will perform lossless compression and decompression using the Rice Algorithm. The chip set is designed to compress and decompress source data in real time for many applications. The encoder is designed to code at 20 M samples/second at MIL specifications. That corresponds to 280 Mbits/second at maximum quantization or approximately 500 Mbits/second under nominal conditions. The decoder is designed to decode at 10 M samples/second at industrial specifications. A wide range of quantization levels is allowed (4...14 bits) and both nearest neighbor prediction and external prediction are supported. When the pre and post processors are bypassed, the chip set performs high speed entropy coding and decoding. This frees the chip set from being tied to one modeling technique or specific application. Both the encoder and decoder are being fabricated in a 1.0 micron CMOS process that has been tested to survive 1 megarad of total radiation dosage. The CMOS chips are small, only 5 mm on a side, and both are estimated to consume less than 1/4 of a Watt of power while operating at maximum frequency

NASA Technical Reports Server

Decoding billions of integers per second through vectorization

Author: Aksyonoff A
Büttcher S
Jones DM
Witten IH
Publication venue: 'Wiley'
Publication date: 01/01/2015
Field of study

In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.Comment: For software, see https://github.com/lemire/FastPFor, For data, see http://boytsov.info/datasets/clueweb09gap

arXiv.org e-Print Archive

R-libre

Crossref

Recommended from our members

Parallel H.263 Encoder in Normal Coding Mode

Author: Cosmas J
Paker Y
Pearmain A
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998
Field of study

A parallel H.263 video encoder, which utilises spatial para1 elism, has been modelled using a multi-threaded program. Spatial parallelism is a technique where an image is subdivided into equal parts (as far as physically possible) and each part is proces!;ed by a separate processor by computing motion and texture mding with all processors cach acting on a different part of thc ]mag. This method leads to a performance increase, which is roughly in proportion to the number of parallel processors used

Brunel University Research Archive

Implementation of JPEG compression and motion estimation on FPGA hardware

Author: Gopalakrishnan Ramakrishna
Publication venue: Digital Scholarship@UNLV
Publication date: 01/01/2008
Field of study

A hardware implementation of JPEG allows for real-time compression in data intensivve applications, such as high speed scanning, medical imaging and satellite image transmission. Implementation options include dedicated DSP or media processors, FPGA boards, and ASICs. Factors that affect the choice of platform selection involve cost, speed, memory, size, power consumption, and case of reconfiguration. The proposed hardware solution is based on a Very high speed integrated circuit Hardware Description Language (VHDL) implememtation of the codec with prefered realization using an FPGA board due to speed, cost and flexibility factors; The VHDL language is commonly used to model hardware impletations from a top down perspective. The VHDL code may be simulated to correct mistakes and subsequently synthesized into hardware using a synthesis tool, such as the xilinx ise suite. The same VHDL code may be synthesized into a number of sifferent hardware architetcures based on constraints given. For example speed was the major constraint when synthesizing the pipeline of jpeg encoding and decoding, while chip area and power consumption were primary constraints when synthesizing the on-die memory because of large area. Thus, there is a trade off between area and speed in logic synthesis

University of Nevada, Las Vegas Repository