130 research outputs found

A highly scalable parallel implementation of H.264

Author: Azevedo Arnaldo
Hoogerbrugge Jan
Juurlink Ben
Meenderinck Cor
Ramírez Bellido Alejandro
Terechko Andrei
Valero Cortés Mateo
Álvarez Mesa Mauricio
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

Developing parallel applications that can harness and efficiently use future many-core architectures is the key challenge for scalable computing systems. We contribute to this challenge by presenting a parallel implementation of H.264 that scales to a large number of cores. The algorithm exploits the fact that independent macroblocks (MBs) can be processed in parallel, but whereas a previous approach exploits only intra-frame MB-level parallelism, our algorithm exploits intra-frame as well as inter-frame MB-level parallelism. It is based on the observation that inter-frame dependencies have a limited spatial range. The algorithm has been implemented on a many-core architecture consisting of NXP TriMedia TM3270 embedded processors. This required the development of a subscription mechanism, in which MBs are subscribed to the kick-off lists associated with their reference MBs. Extensive simulation results show that the implementation scales very well, achieving a speedup of more than 54 on a 64-core processor, whereas the previous approach achieves a speedup of only 23. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase, since there can be many frames in flight, and that the frame latency might increase. Scheduling policies to address these drawbacks are also presented. The results show that these policies combat memory and latency issues with a negligible effect on performance scalability. Results analyzing the impact of memory latency, L1 cache size, and synchronization and thread management overhead are also presented. Finally, we present performance requirements for entropy (CABAC) decoding. This work was performed while the fourth author was with NXP Semiconductors. Peer reviewed. Postprint (author's final draft).

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

On the design of multimedia architectures : proceedings of a one-day workshop, Eindhoven, December 18, 2003

Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2003

Pure OAI Repository

An evaluation of different DLP alternatives for the embedded media domain

Author: Corbal San Adrián Jesús
Espasa Sans Roger
Salamí San Juan Esther
Valero Cortés Mateo
Publication date: 01/01/1999

The importance of media processing has produced a revolution in the design of embedded processors. In order to meet the high computational and technological demands of near-future media applications, new embedded processors are including features that were previously restricted to the general-purpose and supercomputing domains. In this paper we have evaluated the performance of various DLP (Data Level Parallelism) oriented embedded architectures and analyzed quantitative data in order to determine the strengths and disadvantages of each approach. Additionally, we have analyzed the differences between the explicitly parallel versions of code (often based on the standard algorithms) and the highly tuned, non-vectorizable versions usually found in real multimedia programs. We will show that sub-word SIMD architectures (like MMX) are a very cost-effective solution, and that, while long vector architectures provide few improvements at a very high cost, a smart combination of vector and SIMD-like architectures is the alternative that delivers the best performance at a reasonable cost. We will also show that the memory latency tolerance typical of vector architectures is partially offset by the worse spatial locality found when executing vector code. Postprint (author's final draft).

UPCommons. Portal del coneixement obert de la UPC

Coarse-grained reconfigurable array architectures

Author: A Lambrechts
B Bougard
B Mei
B Sutter De
G Venkataramani
H Park
J Lee
JMP Cardoso
JW Waerdt van de
K Berkel van
K Bondalapati
K Sankaralingam
KE Coons
LH Lee
M Ahn
M Gebhart
M Schlansker
M Taylor
M Woh
MD Galanis
MH Lee
S Friedman
SA Mahlke
T Oh
Y Kim
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code.

Crossref

Ghent University Academic Bibliography

Design and Test Space Exploration of Transport-Triggered Architectures

Author: Kerkhoff H.G.
Tangelder R.J.W.T.
Zivkovic V.A.
Publication venue: IEEE
Publication date: 01/01/2000

This paper describes a new approach to the high-level design and test of transport-triggered architectures (TTA), a special type of application-specific instruction processor (ASIP). The proposed method introduces test as an additional constraint, alongside throughput and circuit area. The method, which calculates the testability of the system, helps the designer to assess the obtained architectures with respect to test, area and throughput in the early phase of the design and to select the most suitable one. In order to create the templated TTA, the "MOVE" framework has been used. The approach is validated with respect to the "Crypt" Unix application.

CiteSeerX

University of Twente Research Information

Variable-based multi-module data caches for clustered VLIW processors

Author: Abella Ferrer Jaume
Gibert Codina Enric
González Colás Antonio María
Sánchez Navarro Jesús
Vera Rivera Francisco Javier
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2005

Memory structures consume an important fraction of the total processor energy. One solution to reduce the energy consumed by cache memories consists of reducing their supply voltage and/or increasing their threshold voltage, at the expense of access time. We propose to divide the L1 data cache into two cache modules for a clustered VLIW processor consisting of two clusters. The division is done on a variable basis, so that the address of a datum determines its location. Each cache module is assigned to a cluster and can be set up as a fast power-hungry module or as a slow power-aware module. We also present compiler techniques to distribute variables between the two cache modules and generate code accordingly. We have explored several cache configurations using the Mediabench suite and have observed that the best distributed cache organization outperforms traditional cache organizations by 19%-31% in energy-delay and by 11%-29% in energy-delay². In addition, we explore a reconfigurable distributed cache, where the cache can be reconfigured on a context switch. This reconfigurable scheme further outperforms the best previous distributed organization by 3%-4%. Peer reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC

Constraint analysis for DSP code generation

Author: Mesman B.
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2001

Repository TU/e

Pure OAI Repository

uilis.unsyiah.ac.id

VLSI architecture design approaches for real-time video processing

Author: Ahmad A.
Cosmas J.
Loo J.
Publication venue: 'World Scientific and Engineering Academy and Society (WSEAS)'
Publication date: 01/01/2008

This paper discusses programmable and dedicated approaches for real-time video processing applications. Various VLSI architectures, including design examples of both approaches, are reviewed. Finally, several practical designs for real-time video processing are discussed in terms of their VLSI architectures, to provide guidelines for VLSI designers undertaking further real-time video processing design work.

Middlesex University Research Repository