Cycle Accurate Energy and Throughput Estimation for Data Cache

Author: McDonald-Maier KD
Qadri M
Publication venue: 'TECSI'
Publication date: 01/01/2009

Resource optimization in energy-constrained, real-time adaptive embedded systems depends heavily on accurate energy and throughput estimates of processor peripherals. Such applications require lightweight, accurate mathematical models to profile energy and timing requirements on the go. This paper presents enhanced mathematical models for data cache energy and throughput estimation. The energy and throughput models were found to be within 95% accuracy of a processor's per-instruction energy model and a full-system simulator's timing model, respectively. Furthermore, the possible application of these models in various scenarios is discussed.

University of Essex Research Repository

A performance comparison of several superscalar processsor [sic] models with a VLIW processor

Author: Bagherzadeh Nader
Lenell John
Publication venue: eScholarship, University of California
Publication date: 01/01/1992

Superscalar and VLIW processors can both execute multiple instructions each cycle. Each employs a different instruction scheduling method to achieve multiple instruction execution. Superscalar processors schedule instructions dynamically, and VLIW processors execute statically scheduled instructions. This paper quantitatively compares various superscalar processor architectures with a Very Long Instruction Word architecture developed at the University of California, Irvine. An architectural overview and performance analysis of the superscalar processor models and VIPER, a VLIW processor designed to take advantage of the parallelizing capabilities of Percolation Scheduling, are presented. The motivation for this comparison is to study the capability of a dynamically scheduled processor to obtain the same performance achieved by a statically scheduled processor, and examine the hardware resources required by each

eScholarship - University of California

Coarse-grained reconfigurable array architectures

Author: A Lambrechts
B Bougard
B Mei
B De Sutter
G Venkataramani
H Park
J Lee
JMP Cardoso
JW van de Waerdt
K van Berkel
K Bondalapati
K Sankaralingam
KE Coons
LH Lee
M Ahn
M Gebhart
M Schlansker
M Taylor
M Woh
MD Galanis
MH Lee
S Friedman
SA Mahlke
T Oh
Y Kim
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code.

Crossref

Ghent University Academic Bibliography

A framework for FPGA functional units in high performance computing

Author: Koltes A.
O'Donnell J.T.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/04/2010

FPGAs make it practical to speed up a program by defining hardware functional units that perform calculations faster than can be achieved in software. Specialised digital circuits avoid the overhead of executing sequences of instructions, and they make available the massive parallelism of the components. The FPGA operates as a coprocessor controlled by a conventional computer. An application that combines software with hardware in this way needs an interface between a communications port to the processor and the signals connected to the functional units. We present a framework that supports the design of such systems. The framework consists of a generic controller circuit defined in VHDL that can be configured by the user according to the needs of the functional units and the I/O channel. The controller contains a register file and a pipelined programmable register transfer machine, and it supports the design of both stateless and stateful functional units. Two examples are described: the implementation of a set of basic stateless arithmetic functional units, and the implementation of a stateful algorithm that exploits circuit parallelism

Enlighten

Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

Author: Gujarathi Hemal S
McDonald-Maier Klaus D
Qadri Muhammad Yasir
Publication venue: 'Academy Publisher'
Publication date: 01/01/2009

Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This decline in feature size and boost in performance consequently demand a lower supply voltage and effective management of power dissipation in chips with millions of transistors. This has triggered a substantial amount of research into power reduction techniques for almost every aspect of the chip, particularly the processor cores it contains. This paper presents an overview of techniques for achieving power efficiency mainly at the processor core level, but also visits related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, cache and pipelining, can be optimized to reduce the power consumption of the processor, and this paper discusses various ways of doing so. Emerging power-efficient processor architectures are also overviewed, and research activities are discussed that should help the reader identify how these factors contribute to power consumption. Some of these concepts are already established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

University of Essex Research Repository

CiteSeerX

Crossref

Methods and tools for improving the efficiency of problem solving based on reconfigurable FPGA computing systems

Publication venue: Kyiv
Publication date: 01/01/2016

Theoretical foundations were developed for building a multilayered FPGA-based matrix structure managed by a restricted dataflow model, and models of this structure were created and investigated. A new concept was developed for building problem-oriented processors whose implementation is based on multiple FPGAs, together with a new methodology for creating the multilayered FPGA-based matrix structure. Depending on the task to be executed, the structure can contain several hundred thousand reconfigurable logic blocks, interconnected by a communication network, that form either a specialized pipelined processor or a superscalar processor with multiple specialized computation units managed by the restricted dataflow model. Unlike the limited set of RISC operations executed by the functional units of a traditional superscalar processor core, these specialized computation units can be programmed to perform arbitrarily complex mathematical operations. A centralized management platform based on a standard PC is used to program and reconfigure the FPGA-based matrix structure. The hardware that implements the restricted dataflow architecture in modern superscalar microprocessors was investigated, and a configuration library of computational modules was developed for a processor with a pipelined architecture and for a processor core microarchitecture with a superscalar architecture. The methodology was tested by developing a set of multichannel, multiband digital FIR filters, each tuned to its own narrow band.
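
The FIR filter bank mentioned at the end of this abstract can be illustrated with a minimal software sketch. This is not the thesis's FPGA implementation: the function names `fir_filter` and `filter_bank` and the coefficient values are hypothetical, chosen only to show the direct-form computation y[n] = sum over k of h[k] * x[n - k] applied once per channel.

```python
from typing import List, Sequence


def fir_filter(samples: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Direct-form FIR filter: y[n] = sum_k taps[k] * samples[n - k].

    Samples before the start of the input are treated as zero.
    """
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, coefficient in enumerate(taps):
            if n - k >= 0:
                acc += coefficient * samples[n - k]
        output.append(acc)
    return output


def filter_bank(samples: Sequence[float],
                tap_sets: Sequence[Sequence[float]]) -> List[List[float]]:
    """Run the same input through several FIR filters, one per channel."""
    return [fir_filter(samples, taps) for taps in tap_sets]


# Illustrative coefficients only (assumption, not taken from the thesis).
impulse = [1.0, 0.0, 0.0, 0.0]
channels = filter_bank(impulse, [[0.5, 0.25], [0.5, 0.5]])
print(channels)
```

Feeding a unit impulse returns each channel's coefficients, which is a quick way to check a filter. In hardware, the inner multiply-accumulate loop is the part that can be unrolled into parallel circuits.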

Electronic Archive of Kyiv Polytechnic Institute

A pipelined configurable gate array for embedded processors

Author: Andrea Lodi
Fabio Campi
Mario Toma
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2003

In recent years, the challenge of building high-performance, low-power, retargetable embedded systems has been addressed with different technological and architectural solutions. In this paper we present a new configurable unit explicitly designed to implement additional reconfigurable pipelined datapaths, suitable for the design of reconfigurable processors. A VLIW reconfigurable processor has been implemented on silicon in a standard 0.18 µm CMOS technology to prove the effectiveness of the proposed unit. Testing on a benchmark of signal processing algorithms showed speedups from 4.3x to 13.5x and energy consumption reductions of up to 92%.

CiteSeerX

Crossref

PolyPublie

Exploring Processor and Memory Architectures for Multimedia

Author: Iranpour Ali
Publication date: 01/01/2012

Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of that society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. Projecting where we are heading, we see these issues becoming ever more challenging through increased mobility as well as advancements in multimedia content, such as the introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution, from QVGA (320x240) to Full HD (1920x1080), a 27x increase in less than half a decade, but also from codec evolution (MPEG-2 to H.264 AVC), which adds to the computational load. To meet these performance challenges there have been advances in processing and memory architecture (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever-increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time, requirements for mobility are increasing, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves a negative net result in terms of battery capacity versus performance advances. To make optimal use of these architectural advances and to meet the power limitations in mobile systems, an overall approach is needed on how best to utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system.
The first goal of this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements of a mobile system. The second is to propose an automated methodology for optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming to solve a multidimensional optimization problem. Results from this work indicate that today's most advanced mobile processor technology, together with a multilevel heterogeneous on-chip memory subsystem, can meet the performance requirements for handling multimedia. The automated optimal memory mapping method presented in this thesis achieves lower total power consumption while improving performance for multimedia applications through enhanced memory management, namely reduced external accesses and better reuse of memory objects. The automatic method shows high accuracy, up to 90%, in predicting multimedia memory accesses for a given architecture.
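
The scaling figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is only a worked example: the variable names are illustrative, and every number is taken from the abstract itself (resolutions, clock frequencies in MHz, battery capacities in mAh).

```python
def pixel_count(width: int, height: int) -> int:
    """Number of pixels in a frame of the given resolution."""
    return width * height


# Figures as quoted in the abstract.
qvga_pixels = pixel_count(320, 240)        # 76,800 pixels
full_hd_pixels = pixel_count(1920, 1080)   # 2,073,600 pixels

resolution_growth = full_hd_pixels / qvga_pixels  # pixels per frame
clock_growth = 2000 / 200                         # 200 MHz to 2 GHz
battery_growth = 2000 / 500                       # 500 mAh to 2000 mAh

print(f"resolution: {resolution_growth:.0f}x")  # resolution: 27x
print(f"clock:      {clock_growth:.0f}x")       # clock:      10x
print(f"battery:    {battery_growth:.0f}x")     # battery:    4x
```

Pixels per frame grow 27x while battery capacity grows 4x, which is the gap the abstract describes as a negative net result.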

CiteSeerX

Lund University Publications