5,187 research outputs found

Sample-Parallel Execution of EBCOT in Fast Mode

Author: Bruns Volker
Martínez del Amor Miguel Ángel
Publication venue: IEEE Computer Society
Publication date: 01/01/2016

JPEG 2000’s most computationally expensive building block is the Embedded Block Coder with Optimized Truncation (EBCOT). This paper evaluates how encoders targeting a parallel architecture such as a GPU can increase their throughput in use cases where very high data rates are used. The compression efficiency in the less significant bit-planes is then often poor, and it is beneficial to enable the Selective Arithmetic Coding Bypass style (fast mode) in order to trade a small loss in compression efficiency for a reduction of the computational complexity. More importantly, this style exposes a more finely grained parallelism that can be exploited to execute the raw coding passes, including bit-stuffing, in a sample-parallel fashion. For a latency- or memory-critical application that encodes one frame at a time, EBCOT’s tier-1 is sped up between 1.1x and 2.4x compared to an optimized GPU-based implementation. When a low GPU occupancy has already been addressed by encoding multiple frames in parallel, the throughput can still be improved by 5% for high-entropy images and 27% for low-entropy images. Best results are obtained when enabling the fast mode after the fourth significant bit-plane. For most of the test images the compression rate is within 1% of the original.
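To make the bit-stuffing step mentioned in this abstract concrete, here is a minimal sequential sketch in Python. It is not the sample-parallel method of the paper. It only illustrates the rule that, in the raw (bypass) passes, the byte following a 0xFF byte carries a stuffed zero in its most significant bit and therefore holds only seven data bits. The function name `pack_raw_bits` and the zero padding of the last byte are assumptions made for illustration; the standard's termination rules are more involved.

```python
def pack_raw_bits(bits):
    """Pack raw bits MSB-first, stuffing a 0 bit after every 0xFF byte.

    Hypothetical helper for illustration only; padding the final byte
    with zeros is an assumption, not the standard's termination rule.
    """
    out = bytearray()
    acc = 0      # bits collected for the current byte
    count = 0    # number of data bits currently in acc
    cap = 8      # data-bit capacity of the current byte (8, or 7 after 0xFF)
    for bit in bits:
        acc = (acc << 1) | (bit & 1)
        count += 1
        if count == cap:
            out.append(acc)
            # After 0xFF the next byte keeps its MSB at 0 (the stuffed bit).
            cap = 7 if acc == 0xFF else 8
            acc, count = 0, 0
    if count:
        # Left-align the remaining bits within the byte's data capacity.
        out.append(acc << (cap - count))
    return bytes(out)
```

The sequential dependency is visible in `cap`: where a bit lands depends on whether an earlier byte was 0xFF, which is the dependency a sample-parallel formulation has to resolve.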

idUS. Depósito de Investigación Universidad de Sevilla

Reducing branch delay to zero in pipelined processors

Author: González Colás Antonio María
Llaberia Griñó José M.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1993

A mechanism to reduce the cost of branches in pipelined processors is described and evaluated. It is based on the use of multiple prefetch, early computation of the target address, delayed branch, and parallel execution of branches. The implementation of this mechanism using a branch target instruction memory is described. An analytical model of the performance of this implementation makes it possible to measure the efficiency of the mechanism with a very low computational cost. The model is used to determine the size of cache lines that maximizes the processor performance, to compare the performance of the mechanism with that of other schemes, and to analyze the performance of the mechanism with two alternative cache organizations. Peer reviewed. Postprint (published version).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Transparent code authentication at the processor level

Author: A.O. Durahim
B. Sunar
E. Savaş
Ö. Kocabaş
T.B. Pedersen
Publication venue: 'Institution of Engineering and Technology (IET)'
Publication date: 01/01/2009

The authors present a lightweight authentication mechanism that verifies the authenticity of code and thereby addresses the virus and malicious code problems at the hardware level, eliminating the need for trusted extensions in the operating system. The technique proposed tightly integrates the authentication mechanism into the processor core. The authentication latency is hidden behind the memory access latency, thereby allowing seamless on-the-fly authentication of instructions. In addition, the proposed authentication method supports seamless encryption of code (and static data). Consequently, while providing the software users with assurance for authenticity of programs executing on their hardware, the proposed technique also protects the software manufacturers’ intellectual property through encryption. The performance analysis shows that, under mild assumptions, the presented technique introduces negligible overhead for even moderate cache sizes.
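The idea of authenticating code blocks as they are fetched can be illustrated in software with a small Python sketch. This is not the mechanism of the paper: HMAC-SHA256 from the standard library is swapped in as the tag function, and the names `tag_block` and `verify_block`, the 8-byte address encoding, and the block size are all assumptions chosen for illustration. Binding the address into the tag is one common way to keep a valid block from being replayed at a different location.

```python
import hashlib
import hmac


def tag_block(key: bytes, address: int, block: bytes) -> bytes:
    """Compute a tag over a code block bound to its memory address.

    Hypothetical helper: HMAC-SHA256 stands in for whatever lightweight
    MAC a hardware design would actually use.
    """
    message = address.to_bytes(8, "big") + block
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_block(key: bytes, address: int, block: bytes, tag: bytes) -> bool:
    """Return True only if the block and address match the stored tag."""
    expected = tag_block(key, address, block)
    # Constant-time comparison avoids leaking where the tags differ.
    return hmac.compare_digest(expected, tag)


# Usage: tag a block at install time, verify it on every fetch.
key = b"device-secret-key"             # assumed per-device secret
block = bytes(range(32))               # assumed 32-byte cache line
stored_tag = tag_block(key, 0x1000, block)
fetch_ok = verify_block(key, 0x1000, block, stored_tag)
```

In a hardware setting the verification would overlap with the memory access rather than run after it, which is the latency-hiding point the abstract makes; the sketch only shows the check itself.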

Crossref

Sabanci University Research Database

Distributed coding using punctured quasi-arithmetic codes for memory and memoryless sources

Author: Artigas Roca Javier
Guillemot Christine
Malinowski Simon
Torres Urgell Lluís
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

This correspondence considers the use of punctured quasi-arithmetic (QA) codes for the Slepian–Wolf problem. These entropy codes are defined by finite state machines for memoryless and first-order memory sources. Puncturing an entropy coded bit-stream leads to an ambiguity at the decoder side. The decoder makes use of a correlated version of the original message in order to remove this ambiguity. A complete distributed source coding (DSC) scheme based on QA encoding with side information at the decoder is presented, together with iterative structures based on QA codes. The proposed schemes are adapted to memoryless and first-order memory sources. Simulation results reveal that the proposed schemes are efficient in terms of decoding performance for short sequences compared to well-known DSC solutions using channel codes. Peer reviewed. Postprint (published version).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing

Author: Kanamori Issaku
Matsufuru Hideo
Publication date: 05/12/2017

We investigate the implementation of lattice Quantum Chromodynamics (QCD) code on the Intel Xeon Phi Knights Landing (KNL). The most time-consuming part of the numerical simulations of lattice QCD is solving a linear equation for a large sparse matrix that represents the strong interaction among quarks. To establish widely applicable prescriptions, we examine rather general methods for the SIMD architecture of KNL, such as using intrinsics and manual prefetching, applied to the matrix multiplication and iterative solver algorithms. Based on the performance measured on the Oakforest-PACS system, we discuss the performance tuning on KNL as well as the code design for facilitating such tuning on SIMD architecture and massively parallel machines. Comment: 8 pages, 12 figures. Talk given at LHAM'17 "5th International Workshop on Legacy HPC Application Migration" in CANDAR'17 "The Fifth International Symposium on Computing and Networking" and to appear in the proceedings.
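As a rough illustration of the kind of computation this abstract refers to, the Python sketch below solves a linear system with a sparse matrix using the conjugate gradient method. The abstract does not name its solver, so conjugate gradient is an assumed stand-in, valid only for symmetric positive-definite matrices. The row-list storage format and the names `sparse_matvec` and `conjugate_gradient` are likewise hypothetical. Nothing here reflects the SIMD or prefetching techniques the paper studies.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def sparse_matvec(rows, v):
    """Multiply a sparse matrix by v; rows[i] lists (column, value) pairs."""
    return [sum(val * v[col] for col, val in row) for row in rows]


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A given as a callback."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0 initially
    p = list(r)            # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        ap = apply_a(p)    # the matrix-vector product dominates the cost
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


# Usage on a tiny 2x2 system: [[4, 1], [1, 3]] x = [1, 2].
rows = [[(0, 4.0), (1, 1.0)], [(0, 1.0), (1, 3.0)]]
solution = conjugate_gradient(lambda v: sparse_matvec(rows, v), [1.0, 2.0])
```

Each iteration performs one sparse matrix-vector product, which is why that kernel tends to be the main tuning target in solvers of this kind.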

arXiv.org e-Print Archive

Crossref