Search CORE

91 research outputs found

Register Saturation in Superscalar and VLIW Codes

Author: Touati Sid
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/04/2001
Field of study

International audienceThe registers constraints can be taken into account during the scheduling phase of an acyclic data dependence graph (DAG): any schedule must minimize the register requirement. In this work, we mathematically study and extend the approach which consists of computing the exact upper-bound of the register need for all the valid schedules, independently of the functional unit constraints. A previous work (URSA) was presented in [5,4]. Its aim was to add some serial arcs to the original DAG such that the worst register need does not exceed the number of available registers. We write an appropriate mathematical formalism for this problem and extend the DAG model to take into account delayed read from and write into registers with multiple registers types. This formulation permits us to provide in this paper better heuristics and strategies (nearly optimal), and we prove that the URSA technique is not sufficient to compute the maximal register requirement, even if its solution is optimal

INRIA a CCSD electronic archive server

An evaluation of different DLP alternatives for the embedded media domain

Author: Corbal San Adrián Jesús
Espasa Sans Roger
Salamí San Juan Esther
Valero Cortés Mateo
Publication venue
Publication date: 01/01/1999
Field of study

The importance of media processing has produced a revolution in the design of embedded processors. In order to face the high computational and technological demands of near future media applications, new embedded processors are including features that were commonly restricted to the general purpose and the supercomputing domains. In this paper we have evaluated the performance of various DLP (Data Level Parallelism) oriented embedded architectures and analyzed quantitative data in order to determine the highlights and disadvantages of each approach. Additionally we have analyzed the differences between the explicit parallel versions of code (often based on the standard algorithms) and the high-tuned, non-vectorizable versions usually found in real multimedia programs. We will show that sub-word SIMD architectures (like MMX) are a very costeffective solution, and that, while long vector architectures provide few improvements at a very high cost, a smart combination between vector and SIMD-like architectures is the alternative that leverages best performance at a reasonable cost. We will also show that the memory latency tolerance, typical of vector architectures, is partially compensated by the worse spatial locality found when executing vector code.Postprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

Huffman-based Code Compression Techniques for Embedded Systems

Author: Bonny Talal
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2009
Field of study

KITopen

A configurable vector processor for accelerating speech coding algorithms

Author: Konstantia Koutsomyti (7201031)
Publication venue
Publication date: 01/01/2007
Field of study

The growing demand for voice-over-packer (VoIP) services and multimedia-rich applications has made increasingly important the efficient, real-time implementation of low-bit rates speech coders on embedded VLSI platforms. Such speech coders are designed to substantially reduce the bandwidth requirements thus enabling dense multichannel gateways in small form factor. This however comes at a high computational cost which mandates the use of very high performance embedded processors. This thesis investigates the potential acceleration of two major ITU-T speech coding algorithms, namely G.729A and G.723.1, through their efficient implementation on a configurable extensible vector embedded CPU architecture. New scalar and vector ISAs were introduced which resulted in up to 80% reduction in the dynamic instruction count of both workloads. These instructions were subsequently encapsulated into a parametric, hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research and implementation of the vector datapath of this vector coprocessor which is tightly-coupled to a Sparc-V8 compliant CPU, the optimization and simulation methodologies employed and the use of Electronic System Level (ESL) techniques to rapidly design SIMD datapaths

Loughborough University Institutional Repository

Exploring Processor and Memory Architectures for Multimedia

Author: Iranpour Ali
Publication venue
Publication date: 01/01/2012
Field of study

Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. When taking a projection of where we are heading, we see these issues becoming ever more challenging by increased mobility as well as advancements in multimedia content, such as introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution going from QVGA (320x240) to Full HD (1920x1080) a 27x increase in less than half a decade. On top of this, there is also codec evolution (MPEG-2 to H.264 AVC) that adds to the computational load increase. To meet these performance challenges there has been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves negative net result in-terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, there is a need for taking an overall approach on how to best utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal for this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. Secondly, propose an automated methodology of optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming for solving a multidimensional optimization problem. Results from this work indicate that using today’s most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis lower total power consumption can be achieved, whilst performance for multimedia applications is improved, by employing enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. This automatic method shows high accuracy, up to 90%, for predicting multimedia memory accesses for a given architecture

CiteSeerX

Lund University Publications

ディジタル　シンゴウ　ショリ　システム　ノ　セイノウ　ヒョウカ　ト　セッケイ　コウリツカ　ニ　カンスル　ケンキュウ

Author: Kumura Takahiro
クムラタカヒロ
久村孝寛
Publication venue
Publication date
Field of study

Osaka University Knowledge Archive

Low-power architecture for CIL-code hardware processor

Author: Chapyzhenka A.
Ragozin D.
Umnov A.
Publication venue: Інститут програмних систем НАН України
Publication date: 01/01/2005
Field of study

In the article the authors present the architecture of a hardware CIL processor, which is capable to execute CIL instructions as native code. The CIL hardware engine is implemented on the top of the low-power DSP architecture, and the CIL processor has two execution cores: DSP and CIL. Such solution allows to execute both CIL and DSP instruction sets as native instructions sets and gain performance in common multimedia tasks. Therefore, the DSP-based CIL processor may be targeted for multimedia digital home and even embedded applications. The research was sponsored by RFP 2 Microsoft Corp. grant

Наукова електронна бібліотека періодичних видань НАН України (Vernadsky National Library of Ukraine)

SIRA: Schedule Independent Register Allocation for Software Pipelining

Author: Eisenbeis Christine
Touati Sid
Publication venue: HAL CCSD
Publication date: 27/06/2001
Field of study

International audienceThe register allocation in loops is generally carried out after or during the software pipelining process. This is because doing the register allocation at first step without assuming a schedule lacks the information of interferences between values live ranges. The register allocator introduces extra false dependencies which reduces dramatically the original ILP (Instruction Level Parallelism). In this paper, we give a new formulation to carry out the register allocation before the scheduling process, directly on the data dependence graph by inserting some anti dependencies arcs (reuse edges). This graph extension is first constrained by minimizing the critical cycle and hence minimizing the ILP loss due to the register pressure. The second constraint is to ensure that there is always a cyclic register allocation with the set of available registers, and this for any software pipelining of the new graph. We give the exact formulation of this problem with linear integer programming

INRIA a CCSD electronic archive server

A programmable DSP for low-power, low-complexity baseband processing

Author: Næss Hallvard
Publication venue: Institutt for elektronikk og telekommunikasjon
Publication date: 01/01/1976
Field of study

Software defined radio (SDR) is an emerging trend of radio technology. The idea is basically to move software as close to the antenna of a radio system as possible, to improve flexibility, adaptability and time-to-market. This thesis covers the description of a DSP architecture especially optimized for modulation / demodulation algorithms of low-complexity, low-power radio standards. The DSP allows software processing of these algorithms, making SDR possible. To make the DSP competitive to traditional ASIC modems, tough constraints are given for area and power consumption. Estimates done to indicate the power consumption, area and computational power of the DSP, shows that a software implementation of the studied physical layer should be possible within the given constraints

Croatian Digital Thesis Repository

University of Zagreb Repository

Digital Repository of Music Academy

Doctor of Philosophy

Author: Ramani Karthik
Publication venue: University of Utah
Publication date: 01/12/2012
Field of study

dissertationThe embedded system space is characterized by a rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low-energy, high-performance, and short design times. The keys to achieving these conflicting constraints are specialization and maximally extracting available application parallelism. General purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICS) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, which range from application algorithms to architecture and ultimately, circuit design. This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a a Compiler to generate execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. The above process repeats automatically until the user-specified constraints are achieved. This removes or alleviates the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that represents a significant improvement in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs. The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This represents a significant increase in programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition and wireless telephony. While CoGenE is well suited to applications that exhibit a streaming behavior, multithreaded applications like ray tracing present a different but important challenge. To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches

The University of Utah: J. Willard Marriott Digital Library