13 research outputs found

    Floating-Point Support for Reconfigurable Computing Architectures

    Thesis (Ph.D.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2014. Advisor: Kiyoung Choi. With the huge increase in demand for various kinds of compute-intensive applications in electronic systems, researchers have focused on coarse-grained reconfigurable architectures (CGRAs) because of their advantages: high performance and flexibility. In addition, supporting floating-point operations on a coarse-grained reconfigurable architecture has become essential as demand grows for floating-point-intensive applications such as multimedia processing, 3D graphics, augmented reality, and object recognition. This thesis presents FloRA, a coarse-grained reconfigurable architecture with floating-point support. The two-dimensional array of integer processing elements in FloRA is configured at run time to perform floating-point operations as well as integer operations. More specifically, each floating-point operation is performed by two integer processing elements, one for the mantissa and the other for the exponent. In a chip fabricated in a 130nm process, the total area overhead of the additional hardware for floating-point operations is about 7.4% compared to the previous architecture, which does not support floating-point operations. The fabricated chip runs at a 125MHz clock frequency with a 1.2V power supply. Experiments show an 11.6x speedup on average compared to an ARM9 with a vector floating-point unit, for integer-only benchmark programs as well as programs containing floating-point operations. Compared with other similar approaches, including XPP and Butter, the proposed architecture shows much higher performance for integer applications while maintaining about half the performance of Butter for floating-point applications. This thesis also proposes novel techniques to enhance the utilization of integer units for high-throughput floating-point operations on a CGRA. The approach to implementing floating-point operations on a CGRA presented in this thesis enables floating-point functionality with less area overhead than the traditional approach of employing separate floating-point units (FPUs). However, the total latency of a floating-point operation is larger than in the traditional approach, and the data dependency between the split integer operations restricts further improvement in the utilization of the integer functional units within an operation. To overcome this inefficiency, two techniques are proposed in this thesis. One is overlapping two distinct floating-point operations, which increases the utilization of the integer functional units in the architecture: integer functional units left free during one floating-point operation can be used for another floating-point operation. The other is forwarding between two data-dependent floating-point operations, which decreases the effective latency of the floating-point operations; the basic idea is to remove unnecessary calculations, such as formatting, that are normally done between the two data-dependent operations. To implement overlapping or forwarding, the FSMs and control paths in each PE are modified and temporary/communication registers are added. Lightweight sub-modules such as increment units and registers for intermediate values are added to resolve resource conflicts. Experiments are done with several arithmetic functions that are widely used in floating-point applications. The base architecture and the new architecture implementing the proposed techniques are compared in terms of throughput and area overhead.
The experimental results show that the proposed techniques increase throughput by 33.9% on average with a 20.9% area overhead.

Contents:
Chapter 1 Introduction
Chapter 2 Target Architecture: 2.1 Overall Architecture; 2.2 Reconfigurable Computing Module
Chapter 3 Design of Floating-Point Operations: 3.1 Floating-Point Numbers (3.1.1 Representation of Floating-Point Numbers; 3.1.2 Floating-Point Operations); 3.2 FPU-PE Cluster (3.2.1 Construction of FPU-PE Cluster; 3.2.2 Construction of Array of FPU-PE Clusters; 3.2.3 Comparing Different FPU-PE Clusters); 3.3 Implementation of Multi-Cycle Operations; 3.4 Implementation of Floating-Point Operations; 3.5 Implementation of Floating-Point Operations Using Shared Modules
Chapter 4 Chip Implementation: 4.1 Specification of Chip Implementation; 4.2 Experimental Setup; 4.3 Experimental Results (4.3.1 Performance Comparison; 4.3.2 Power Consumption Comparison)
Chapter 5 Comparison with Other Architectures: 5.1 Preparation for the Comparison; 5.2 Comparison with PACT XPP; 5.3 Comparison with Butter Architecture; 5.4 Implication of the Proposed Architecture
Chapter 6 Enhancement Techniques: 6.1 Introduction; 6.2 Conventional Approach (6.2.1 Base Architecture; 6.2.2 Utilization of Floating-Point Operations); 6.3 Proposed Enhancement Techniques (6.3.1 Overlapping Technique; 6.3.2 Forwarding Technique); 6.4 Experiments (6.4.1 Performance Comparison; 6.4.2 Hardware Cost of the Proposed Techniques; 6.4.3 Utilization Enhancement by the Proposed Techniques); 6.5 Comparison with Other Architecture
Chapter 7 Conclusion
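To make the mantissa/exponent split concrete, the following C sketch decomposes a single-precision multiplication into the separate integer exponent and mantissa computations of the kind FloRA maps onto a pair of integer PEs. It illustrates the general idea only (normal numbers, truncating rounding); it is not FloRA's actual datapath or pipeline.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: split an IEEE-754 single-precision multiply into the
 * integer exponent work and integer mantissa work that could be assigned to
 * two integer PEs. Normal numbers only; truncation instead of rounding;
 * no subnormals, infinities, or NaNs. */
static float fp_mul_via_int(float a, float b) {
    union { float f; uint32_t u; } ua = { a }, ub = { b }, ur;

    uint32_t sign = (ua.u ^ ub.u) & 0x80000000u;

    /* "Exponent PE": add biased exponents, subtract one bias. */
    int32_t ea = (ua.u >> 23) & 0xFF;
    int32_t eb = (ub.u >> 23) & 0xFF;
    int32_t er = ea + eb - 127;

    /* "Mantissa PE": multiply the 24-bit significands (hidden bit restored). */
    uint64_t ma = (ua.u & 0x007FFFFFu) | 0x00800000u;
    uint64_t mb = (ub.u & 0x007FFFFFu) | 0x00800000u;
    uint64_t mr = ma * mb;                         /* 48-bit product */

    /* Normalize: the product of two values in [1,2) lies in [1,4). */
    if (mr & (1ull << 47)) { mr >>= 1; er += 1; }
    uint32_t frac = (uint32_t)((mr >> 23) & 0x007FFFFFu);  /* truncate */

    ur.u = sign | ((uint32_t)er << 23) | frac;
    return ur.f;
}

int main(void) {
    printf("%f (reference %f)\n", fp_mul_via_int(3.5f, -2.25f), 3.5f * -2.25f);
    return 0;
}
```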

    Design and development from single core reconfigurable accelerators to a heterogeneous accelerator-rich platform

    The performance of a platform is evaluated based on its ability to process multiple applications of different kinds. In this context, the platform under evaluation can have a homogeneous, heterogeneous, or hybrid architecture. The selection of an architecture type is generally based on the set of target applications and performance parameters, where the applications may be serial or parallel in nature. The evaluation is normally based on different performance metrics, e.g., resource/area utilization, execution time, and power and energy consumption. This process can also include high-level performance metrics, e.g., Operations Per Second (OPS), OPS/Watt, OPS/Hz, Watt/Area, etc. An example of architecture selection is a wireless communication system, where the processing of computationally intensive signal-processing algorithms has strict execution-time constraints; in this case, a platform with special-purpose accelerators is more suitable than a typical homogeneous platform. A couple of decades ago, it was expensive to place many special-purpose accelerators on a chip, as the cost per unit area was relatively higher than today. The utilization wall is also becoming a limiting factor in homogeneous multicore scaling: not all the cores on a platform can be operated at their maximum frequency because of the risk of thermal meltdown. In this case, some of the processing cores have to be turned off or operated at very low frequencies, leaving a large part of the chip underutilized. A possible solution lies in the use of heterogeneous multicore platforms, where many application-specific cores operate at lower frequencies, thereby reducing power dissipation density and improving other performance parameters. However, to achieve maximum flexibility in processing, a general-purpose flavor can also be introduced by adding a few Reduced Instruction-Set Computing (RISC) cores. A powerful class of heterogeneous multicore platforms is the accelerator-rich platform, where many application-specific accelerators are loosely connected with each other for workload distribution or to execute tasks independently. This research work spans from the design and development of three different types of template-based Coarse-Grain Reconfigurable Arrays (CGRAs), i.e., CREMA, AVATAR, and SCREMA, to a Heterogeneous Accelerator-Rich Platform (HARP). The accelerators generated from the three CGRAs could perform Fast Fourier Transforms (FFTs) of different lengths and types as well as real and complex Matrix-Vector Multiplication (MVM) algorithms. CREMA and AVATAR were fixed CGRAs with eight and sixteen Processing Element (PE) columns, respectively. SCREMA could scale among four, eight, sixteen, and thirty-two PE columns. Many case studies were conducted to evaluate the performance of the reconfigurable accelerators generated from these CGRA templates. All of these CGRAs work in a processor/coprocessor model tightly integrated with a Direct Memory Access (DMA) device. Apart from these platforms, a reconfigurable Application-Specific Instruction-set Processor (rASIP) was also designed, tested for FFT execution under IEEE 802.11n timing constraints, and evaluated against a processor/coprocessor model. It was designed by integrating an AVATAR-generated radix-(2, 4) FFT accelerator into the datapath of a RISC processor, whose instruction set was extended to perform additional operations related to AVATAR.
As mentioned earlier, the underutilized part of the chip, nowadays called Dark Silicon, poses many challenges for designers. Apart from software optimizations, clock gating, dynamic voltage/frequency scaling, and other high-level techniques, one way of dealing with this problem is to use many application-specific cores. In an effort to maximize the number of reconfigurable processing resources on a platform, the accelerator-rich architecture HARP was designed and evaluated in terms of different performance metrics. HARP is constructed on a 3x3-node Network-on-Chip (NoC), where a CGRA of application-specific size is integrated with every node except the central one, which is attached to a RISC processor. The RISC processor establishes synchronization between the nodes for data transfer and also performs supervisory control. With the NoC as the communication backbone between the cores, all the cores can address each other and execute simultaneously and independently of one another. The performance of the accelerators generated from the CREMA, AVATAR, and SCREMA templates was evaluated individually and also when they were attached to HARP's NoC nodes. The individual CGRAs show promising results in their own capacity, and when integrated together in the HARP framework, interesting comparisons were established in terms of overall execution time, resource utilization, operating frequency, and power and energy consumption. In evaluating HARP, estimates and measurements were also made for some advanced performance metrics, e.g., MOPS/mW and MOPS/MHz. The overall research work promotes the idea of the heterogeneous accelerator-rich platform as a solution to the current problems and future needs of industry and academia.
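As an aside, the high-level metrics quoted above (MOPS, MOPS/mW, MOPS/MHz) are simple ratios of raw measurements; the C sketch below shows how they are derived. The numbers are made up for illustration and are not results from HARP or its CGRAs.

```c
#include <stdio.h>

/* Hypothetical figures, used only to show how operation counts, execution
 * time, power, and clock frequency combine into MOPS, MOPS/mW and MOPS/MHz. */
typedef struct {
    const char *name;
    double ops;        /* operations executed by the accelerator */
    double time_s;     /* execution time in seconds              */
    double power_mw;   /* average power in milliwatts            */
    double clock_mhz;  /* operating frequency in MHz             */
} node_result;

int main(void) {
    node_result nodes[] = {
        { "CGRA node 0 (FFT)", 2.0e6, 1.0e-3, 50.0, 200.0 },
        { "CGRA node 1 (MVM)", 8.0e5, 0.5e-3, 35.0, 150.0 },
    };
    for (size_t i = 0; i < sizeof nodes / sizeof nodes[0]; ++i) {
        double mops = nodes[i].ops / nodes[i].time_s / 1.0e6;
        printf("%s: %.1f MOPS, %.2f MOPS/mW, %.2f MOPS/MHz\n",
               nodes[i].name, mops,
               mops / nodes[i].power_mw,
               mops / nodes[i].clock_mhz);
    }
    return 0;
}
```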

    Simplified vector-thread architectures for flexible and efficient data-parallel accelerators

    Thesis (Ph.D.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from the student-submitted PDF version of the thesis. Includes bibliographical references (p. 165-170). This thesis explores a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture. The thesis begins by categorizing regular and irregular data-level parallelism (DLP), before presenting several architectural design patterns for data-parallel accelerators including the multiple-instruction multiple-data (MIMD) pattern, the vector single-instruction multiple-data (vector-SIMD) pattern, the single-instruction multiple-thread (SIMT) pattern, and the vector-thread (VT) pattern. Our recently proposed VT pattern includes many control threads that each manage their own array of microthreads. The control thread uses vector memory instructions to efficiently move data and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability for each microthread to direct its own control flow. In this thesis, I introduce various techniques for building simplified instances of the VT pattern. I propose unifying the VT control-thread and microthread scalar instruction sets to simplify the microarchitecture and programming methodology. I propose a new single-lane VT microarchitecture based on minimal changes to the vector-SIMD pattern. Single-lane cores are simpler to implement than multi-lane cores and can achieve similar energy efficiency. This new microarchitecture uses control processor embedding to mitigate the area overhead of single-lane cores, and uses vector fragments to more efficiently handle both regular and irregular DLP as compared to previous VT architectures. I also propose an explicitly data-parallel VT programming methodology that is based on a slightly modified scalar compiler. This methodology is easier to use than assembly programming, yet simpler to implement than an automatically vectorizing compiler. To evaluate these ideas, we have begun implementing the Maven data-parallel accelerator. This thesis compares a simplified Maven VT core to MIMD, vector-SIMD, and SIMT cores. We have implemented these cores with an ASIC methodology, and I use the resulting gate-level models to evaluate the area, performance, and energy of several compiled microbenchmarks. This work is the first detailed quantitative comparison of the VT pattern to other patterns. My results suggest that future data-parallel accelerators based on simplified VT architectures should be able to combine the energy efficiency of vector-SIMD accelerators with the flexibility of MIMD accelerators. By Christopher Francis Batten. Ph.D.
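The vector-thread execution model described above can be emulated in plain C to show the division of labor: the control thread issues vector memory operations and vector-fetches a scalar kernel, while each microthread may branch independently. The sketch below is only an illustration of the model under assumed names (VLEN, kernel, vt_map); it is not Maven code or its actual programming interface.

```c
#include <stddef.h>

enum { VLEN = 8 };  /* microthreads per control thread (assumed) */

/* Scalar kernel executed by every microthread; note the per-element branch
 * (irregular DLP) that a pure vector-SIMD machine would have to mask out. */
static int kernel(int x) {
    if (x < 0)                 /* microthread-local control flow */
        return -x * 2;
    return x + 1;
}

void vt_map(const int *in, int *out, size_t n) {
    for (size_t base = 0; base < n; base += VLEN) {   /* control-thread loop */
        int vreg[VLEN];
        size_t m = (n - base < VLEN) ? n - base : VLEN;

        for (size_t u = 0; u < m; ++u)                /* "vector load"       */
            vreg[u] = in[base + u];

        for (size_t u = 0; u < m; ++u)                /* "vector fetch": each
                                                         microthread runs the
                                                         scalar kernel       */
            vreg[u] = kernel(vreg[u]);

        for (size_t u = 0; u < m; ++u)                /* "vector store"      */
            out[base + u] = vreg[u];
    }
}
```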

    Efficient Predicated Execution Techniques for Reconfigurable Architectures

    Thesis (Ph.D.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2013. Advisor: Kiyoung Choi. Coarse-Grained Reconfigurable Architecture (CGRA) is one of the viable solutions for accelerating data-intensive applications in embedded systems. It typically consists of an array of processing elements (PEs) and a centralized controller, which together provide high performance, flexibility, and low power. Parallel array processing reduces the execution time of applications, the reconfigurability of the PEs allows their functionality to change, and the simplified control structure with static scheduling of instruction fetching and data communication minimizes power consumption. However, as applications become more complex and their data-intensive parts contain control flow, CGRAs face a challenge to their effectiveness. Since all PEs are controlled by a centralized unit, it is impossible to execute programs having control divergence among the PEs. To overcome this problem, we can adopt the technique called predicated execution, the only solution known so far, but conventional predication techniques have a negative impact on both performance and power consumption due to longer instruction words and unnecessary instruction-fetching/decoding/nullifying steps. This thesis therefore examines the performance and power issues of predicated execution when a CGRA executes both data- and control-intensive applications, which have not been well addressed yet, and then proposes high-performance and low-power predication mechanisms. Experiments conducted through gate-level simulation show that the proposed mechanism improves the energy-delay product by 11.9%, 14.7%, and 23.8% compared to three conventional techniques. In addition, this thesis also examines mapping issues that arise when applications are mapped onto CGRAs using the proposed predication: a power-saving mode introduced into the PEs prevents multiple conditionals from being parallelized if conventional mapping algorithms are used. Thus, this thesis proposes a framework that resolves this problem by mapping conditionals to different PEs. Experiments show that mapping results from the proposed approach achieve 2.21 times higher performance than those of the naïve approach.

Contents:
Chapter 1 Introduction
Chapter 2 Background and Related Work: 2.1 Coarse-Grained Reconfigurable Architecture (2.1.1 Introduction; 2.1.2 Target Domain; 2.1.3 Comparison with Other Architectures; 2.1.4 Application Mapping; 2.1.5 Target CGRA); 2.2 Predicated Execution Technique (2.2.1 Introduction; 2.2.2 Classification; 2.2.3 Different Roles in ILP and DLP Processors; 2.2.4 Predication Support on CGRAs)
Chapter 3 Conventional Predicated Execution Techniques: 3.1 Partial Predication (Partial); 3.2 Condition-Based Full Predication (CondFull)
Chapter 4 State-Based Full Predication: 4.1 Previous Approach (PseudoBranch); 4.2 Counter-Based Approach (StateFull); 4.3 Dual-Issue-Single-Execution (DISE); 4.4 Hybrid Predication (4.4.1 Motivation; 4.4.2 StateFull+Partial; 4.4.3 StateFull+Partial+DISE)
Chapter 5 Evaluation: 5.1 Implementation (5.1.1 Conventional Techniques; 5.1.2 Proposed Techniques); 5.2 Experimental Setup; 5.3 Experimental Results (5.3.1 Effect of Predication Mechanism on Power Consumption of a PE; 5.3.2 Quantitative Definitions of short-if and long-if; 5.3.3 Compilation Strategy in StateFull+Partial; 5.3.4 Conventional Techniques (Partial, CondFull, and PseudoBranch) vs. Proposed StateFull Technique; 5.3.5 Proposed Hybrid Predication Techniques; 5.3.6 Putting Together; 5.3.7 Speedup of Applications)
Chapter 6 Mapping Framework: 6.1 Motivation; 6.2 Proposed Approach (6.2.1 Overall Flow; 6.2.2 From IR to CDFG; 6.2.3 Separation; 6.2.4 CDFG Mapping); 6.3 Implementation; 6.4 Experiments (6.4.1 Experimental Setup; 6.4.2 Verification of Mapping Framework; 6.4.3 Quality of Mapping Results)
Chapter 7 Conclusion: 7.1 Summary; 7.2 Applicable Scope and Future Work
Appendix
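The core transformation behind predicated execution is if-conversion: both sides of a branch become predicated operations, so all PEs can follow one statically scheduled instruction stream. A minimal C sketch of partial predication (compute both values, select on a predicate) is shown below; it illustrates the general technique, not the StateFull, DISE, or hybrid mechanisms proposed in the thesis.

```c
/* Original loop body with control divergence between elements. */
void saturate_branch(const int *in, int *out, int n, int limit) {
    for (int i = 0; i < n; ++i) {
        if (in[i] > limit)
            out[i] = limit;        /* then-path */
        else
            out[i] = in[i];        /* else-path */
    }
}

/* Partially predicated form: both paths are computed and a predicate selects
 * the result (a select/cmov on a PE), so no branch remains in the kernel. */
void saturate_predicated(const int *in, int *out, int n, int limit) {
    for (int i = 0; i < n; ++i) {
        int p = in[i] > limit;     /* predicate           */
        int t = limit;             /* then-value          */
        int e = in[i];             /* else-value          */
        out[i] = p ? t : e;        /* select keyed on p   */
    }
}
```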

    High level compilation for gate reconfigurable architectures

    Thesis (Ph.D.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001. Includes bibliographical references (p. 205-215). A continuing exponential increase in the number of programmable elements is turning management of gate-reconfigurable architectures as "glue logic" into an intractable problem; it is past time to raise this abstraction level. The physical hardware in gate-reconfigurable architectures is all low level - individual wires, bit-level functions, and single bit registers - hence one should look to the fetch-decode-execute machinery of traditional computers for higher level abstractions. Ordinary computers have machine-level architectural mechanisms that interpret instructions - instructions that are generated by a high-level compiler. Efficiently moving up to the next abstraction level requires leveraging these mechanisms without introducing the overhead of machine-level interpretation. In this dissertation, I solve this fundamental problem by specializing architectural mechanisms with respect to input programs. This solution is the key to efficient compilation of high-level programs to gate reconfigurable architectures. My approach to specialization includes several novel techniques. I develop, with others, extensive bitwidth analyses that apply to registers, pointers, and arrays. I use pointer analysis and memory disambiguation to target devices with blocks of embedded memory. My approach to memory parallelization generates a spatial hierarchy that enables easier-to-synthesize logic state machines with smaller circuits and no long wires. My space-time scheduling approach integrates the techniques of high-level synthesis with the static routing concepts developed for single-chip multiprocessors. Using DeepC, a prototype compiler demonstrating my thesis, I compile a new benchmark suite to Xilinx Virtex FPGAs. Resulting performance is comparable to a custom MIPS processor, with smaller area (40 percent on average), higher evaluation speeds (2.4x), and lower energy (18x) and energy-delay (45x). Specialization of advanced mechanisms results in additional speedup, scaling with hardware area, at the expense of power. For comparison, I also target IBM's standard cell SA-27E process and the RAW microprocessor. Results include sensitivity analysis to the different mechanisms specialized and a grand comparison between alternate targets. By Jonathan William Babb. Ph.D.
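As a small illustration of what a bitwidth analysis can prove, consider the C fragment below: the loop bound and the mask let a compiler infer that the counter, the masked byte, and the accumulator need far fewer bits than their declared 32-bit types, so the synthesized registers and adders can be narrowed. The inferred widths in the comments are an assumed example, not output from DeepC.

```c
#include <stdint.h>

/* buf must point to at least 256 bytes. */
uint32_t checksum(const uint8_t *buf) {
    uint32_t sum = 0;
    for (uint32_t i = 0; i < 256; ++i) {   /* i in [0,255]: a 9-bit counter suffices    */
        uint8_t b = buf[i] & 0x0F;         /* masked: only 4 significant bits            */
        sum += b;                          /* bounded by 256*15 = 3840: 12 bits suffice  */
    }
    return sum;                            /* declared 32-bit, provably fits in 12 bits  */
}
```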

    Architecture and Analysis for Next Generation Mobile Signal Processing.

    Mobile devices have proliferated at a spectacular rate, with more than 3.3 billion active cell phones in the world. With sales totaling hundreds of billions every year, the mobile phone has arguably become the dominant computing platform, replacing the personal computer. Soon, improvements to today's smart phones, such as high-bandwidth internet access, high-definition video processing, and human-centric interfaces that integrate voice recognition and video-conferencing, will be commonplace. Cost-effective and power-efficient support for these applications will be required. Looking forward to the next generation of mobile computing, computation requirements will increase by one to three orders of magnitude due to higher data rates, increased algorithm complexity, and greater computation diversity, but the power requirements will be just as stringent to ensure reasonable battery lifetimes. The design of the next generation of mobile platforms must address three critical challenges: efficiency, programmability, and adaptivity. The computational efficiency of existing solutions is inadequate, and straightforward scaling by increasing the number of cores or the amount of data-level parallelism will not suffice. Programmability provides the opportunity for a single platform to support multiple applications and even multiple standards within each application domain. Programmability also provides: faster time to market, as hardware and software development can proceed in parallel; the ability to fix bugs and add features after manufacturing; and higher chip volumes, as a single platform can support a family of mobile devices. Lastly, hardware adaptivity is necessary to maintain efficiency as the computational characteristics of the applications change. Current solutions are tailored specifically for wireless signal processing algorithms, but lose their efficiency when other application domains like high-definition video are processed. This thesis addresses these challenges by presenting an analysis of next-generation mobile signal processing applications and proposing an advanced signal processing architecture to deal with the stringent requirements. An application-centric design approach is taken to design our architecture. First, a next-generation wireless protocol and high-definition video are analyzed and their algorithmic characterizations discussed. From these characterizations, key architectural implications are presented, which form the basis for the advanced signal processor architecture, AnySP. Ph.D. Electrical Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/86344/1/mwoh_1.pd

    THE VLIW-SUPERCISC COMPILER: EXPLOITING PARALLELISM FROM C-BASED APPLICATIONS

    A common approach to decreasing embedded application execution time is creating a homogeneous parallel processor architecture. The parallelism of any such architecture is limited to the number of instructions that can be scheduled in the same cycle. This number of instructions scheduled in a cycle, or instruction-level parallelism (ILP), is limited by the ability to extract parallelism from the application. Other techniques attempt to improve performance with hardware acceleration: segments of highly computationally intensive code are extracted and custom hardware is created to replace their software execution. This technique requires many resources and still does not address the segments of code outside of the computationally intensive kernel. To solve this problem, hardware acceleration for computationally intensive segments of code is proposed, in addition to accelerating the entire application with very long instruction word (VLIW) techniques. (1) A compilation flow that targets a 4-wide VLIW processor architecture is presented. This system was used to investigate the available speed-up of VLIW architectures. The architecture was then modified to combine the VLIW processor with the capability to execute application-specific customized instructions. To create the custom instruction hardware, a control and data flow graph (CDFG) framework was created, providing a basis for compiler transformations and hardware generation. In order to remove control flow from segments of code selected for hardware generation, (2) the technique of hardware predication was developed. Hardware predication allows if-then and if-then-else control-flow constructs to be transformed into strict data flow through the use of multiplexors. From the transformed CDFGs, (3) a VHDL generation pass was created that translates the compiler data structures into synthesizable VHDL. The resulting architecture contains the VLIW processor and tightly coupled application-specific hardware. This architecture was analyzed for performance changes compared to the initial VLIW architecture and a traditional processor. Lastly, (4) the architecture was analyzed for power and energy savings. A post static-timing pass was added to the compilation flow to insert hardware that delays early switching of operations. Measuring only the execution of the hardware functions and comparing the performance to the equivalent code executed in software, a performance multiplier of up to 322 times is seen when synthesized onto an Altera Stratix II ES2S180F1508C4 FPGA; the average performance increase was 63 times. For the entire application, the speedup reached nearly 30X and was on average 12X better than a single-processor implementation. The power and energy required by the VLIW processor core and the hardware functions for the computational kernels after 160nm OKI standard-cell ASIC synthesis show a maximum power savings of 417 times relative to execution on the processor, with an average savings of 133 times in power consumption. With the reduced execution time and the savings in power, the energy savings see a multiplicative effect: the energy improvement for the hardware functions is several orders of magnitude, with savings ranging from over 1,000X to approximately 60,000X.
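The gap between the 63x average hardware-function speedup and the roughly 12x whole-application speedup is what Amdahl's law predicts when only part of the runtime is accelerated. The sketch below works through the relation with a hypothetical kernel fraction; the fraction is illustrative and not a figure reported in the thesis.

```c
#include <stdio.h>

/* Amdahl's law: overall = 1 / ((1 - f) + f / s), where f is the fraction of
 * runtime spent in the accelerated kernel and s is the kernel speedup. */
int main(void) {
    double kernel_speedup  = 63.0;  /* average hardware-function speedup (from the abstract) */
    double kernel_fraction = 0.95;  /* hypothetical share of runtime spent in the kernel     */

    double overall = 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
    printf("overall speedup = %.1fx\n", overall);   /* about 15.4x for these inputs */
    return 0;
}
```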

    Improving Compute & Data Efficiency of Flexible Architectures


    Reconfigurable Computing Based on Commercial FPGAs. Solutions for the Design and Implementation of Partially Reconfigurable Systems.

    This doctoral thesis is framed within the research field of reconfigurable computing. This field has experienced overwhelming growth in recent years as a result of the evolution of reconfigurable devices, of which Field Programmable Gate Arrays (FPGAs) are the main exponent from the commercial point of view. Traditionally, electronics companies have chosen FPGAs for the initial prototypes of high-performance products; the final system is then integrated into Application Specific Integrated Circuits (ASICs), which are produced in large volumes, amortizing their high design and production cost and exploiting their low per-unit cost. On the other hand, DSPs (Digital Signal Processors) and microprocessors have been preferred over FPGAs, because of their low cost, in the field of devices with lower computational requirements. In recent years this landscape has been changing. The market now looks for more "reconfigurable" solutions, since they reduce time-to-market, extend time-in-market, and cover wide-ranging computational requirements. The observed change is due to programmable devices having evolved from simple programmable structures to complex reconfigurable platforms. State-of-the-art FPGAs have reached a very high degree of integration and now contain, within their programmable fabric, microprocessors and dedicated digital signal processing logic. Another very important factor behind the change is that FPGAs allow the design of devices whose hardware can be adapted, or updated, once the product has been delivered and installed, thus obtaining a flexibility in hardware comparable to that of software, where post-sale updating of systems is a widely exploited practice for reducing costs and reaching the market quickly. On the other hand, and above all in academia, there are reconfigurable devices of different granularities that achieve higher performance than commercial fine-grained FPGAs (comparable to that of ASICs), but they are restricted to one application or group of applications. Although proprietary reconfigurable devices offer many advantages, this option has been discarded in this thesis because, from the industrial point of view, they require, apart from the design of the reconfigurable ASIC, the development of a complete design environment; all of this entails a high cost in resources and moves the proposals away from industry. This thesis has focused on providing solutions for commercial devices, fine-grained FPGAs, in order to take advantage of existing tools and to keep the proposed solutions as close as possible to industry. Reconfigurable devices provide several reconfiguration methods, the most attractive being partial, dynamic reconfiguration, since it allows the device to be adapted without interrupting its operation and enables self-adapting devices. This type of reconfiguration is the object of study of this doctoral thesis.
Partial reconfiguration allows a set of hardware tasks (modules placed on the reconfigurable fabric) to run in parallel on the FPGA and allows one block to be replaced by another, depending on the needs of the system, without disturbing the operation of the remaining blocks. In theory, this basic idea brings software-like flexibility to hardware which, combined with its inherent parallelism, makes the reconfigurable system a powerful tool that can give rise to adaptable or even self-adaptive systems, reconfigurable supercomputers, and bio-inspired hardware, among others. However, even though some FPGA vendors support partial reconfiguration, the use of this technique is still restricted to academia and to very basic systems. The research work described in this doctoral thesis has aimed at studying various aspects of partially reconfigurable systems, identifying the main deficiencies of existing solutions, and proposing new, original solutions. The study of the state of the art showed that existing solutions are not very flexible and that the scalability of the systems that can be designed with them is limited. The original proposals of this thesis therefore aim to enable the design and implementation of partially reconfigurable systems with high scalability and flexibility. The central thesis of the research work is the idea that, to obtain greater system flexibility, the design of the reconfigurable system must be decoupled from the design of the cores that will be consumed by that system. The thesis has contributed better solutions at the level of architectures, design flows, and tools, which have enabled the design and implementation of several partially reconfigurable systems with different degrees of flexibility and scalability. Flexibility and scalability are terms that, in reconfigurable systems, can be associated with several aspects. Within this thesis, flexibility is mainly associated with the diversity of cores or hardware tasks that can be consumed or integrated into an already defined system, while scalability refers to the number of cores that can coexist in the system and be reconfigured independently. To design flexible and scalable systems, these characteristics must be covered at different levels. In more detail, within this thesis, from the architectural point of view, flexibility is covered by the possibility of freely placing cores in a predefined scalable architecture. From the system point of view, flexibility is reflected in the possibility not only of modifying or reconfiguring a core of the hardware system, but also of modifying its internal communications. From the device point of view, flexibility is guaranteed by the transparency of the reconfiguration process. Finally, flexibility in the design process is covered by the definition of tools and design flows that, on the one hand, decouple the design of the reconfigurable system from the design of the cores for that system and, on the other hand, allow designers without detailed knowledge of partial reconfiguration to design cores.
The thesis presents four reconfigurable devices, integrated into different environments and with different degrees of flexibility, corresponding to the degree to which the original contributions of the thesis are exploited. The main contributions of the thesis, related to each of the aspects mentioned in the previous paragraph and treated in different parts of the thesis, are summarized below, highlighting as far as possible the differences with respect to the state of the art. A design methodology for Virtual Architectures (an abstraction of the physical FPGA architecture that includes the distribution of the programmable resources into slots and the way the slots are interconnected) has been defined; the methodology, originally proposed in this thesis, allows the design of reconfigurable systems with high flexibility and scalability compared to the state of the art. A solution for adapting the internal communications of reconfigurable systems, called DRNoC (Dynamic Reconfigurable NoC), has been proposed; this original solution covers several aspects and includes the definition of an interconnection architecture for reconfigurable systems based on Networks on Chip (NoC), the definition of reconfiguration methods and of the internal addressing of the system and, more specifically for network-based communications, the definition of a packet format and of the router architecture. The main difference of the proposed solution with respect to the state of the art is that DRNoC does not restrict communication to NoCs only: it allows any type of communication scheme to be defined (NoC, point to point, point to multipoint, bus, or a combination of these) and, in addition, allows several communication schemes to coexist in the same system and operate independently, thereby providing greater flexibility than existing solutions. A solution for manipulating configuration files for Virtex II/Pro FPGAs has been proposed, which is the most complete compared with the state of the art, together with a set of tools for generating and extracting cores for reconfigurable systems based on Virtex II FPGAs, the first existing solution for these FPGAs. A template-based core design flow has been proposed that allows hardware cores to be designed without being an expert in partial reconfiguration and without knowing the details of the final system in which the core will be implemented. Further contributions are the design, implementation, and test of a partially reconfigurable system based on commercial fine-grained FPGAs for sensor networks, the first approach in the state of the art to the use of partially reconfigurable systems in sensor networks; the integration of a reconfigurable system into a client-server environment that includes an original reconfiguration control system; a solution for debugging reconfigurable systems; and a system for the emulation and rapid prototyping of on-chip communications, originally based on the idea of reusing hardware cores by means of the partial reconfiguration technique.
As a global conclusion of the research work, it is worth noting that this thesis has given rise to the creation and consolidation of a research line in the digital electronics group of the Centro de Electrónica Industrial, which is currently among the most active and important ones. Moreover, the research work and the dissemination of its original contributions have allowed the research center to become part of the state of the art in partially reconfigurable systems. The thesis is enclosed in the research area of reconfigurable computing which, in the last years, has experienced remarkable growth as a result of the impressive evolution of reconfigurable devices. In this area, Field Programmable Gate Arrays (FPGAs) are the most outstanding representative from the commercial point of view. Traditionally FPGAs have been used for prototyping, in the stages prior to the final Application Specific Integrated Circuit (ASIC) design. However, interest in the integration of FPGAs in final products has been growing in recent years. FPGAs are preferred for small production volumes, where the high cost of ASIC masks is unaffordable, and also in products where time-to-market is a priority and waiting for a complete ASIC design cycle is not desirable. State-of-the-art FPGAs are highly integrated electronic circuits, composed of tens of millions of system gates, with competitive speed, performance and configurability. These devices have evolved from simple gate arrays to complex platforms that include embedded memory, multipliers and even microprocessors and digital signal processing elements. Additionally, the fine-grain nature of the reconfigurable arrays makes FPGAs suitable for a broad set of application domains. On the other side, and mostly in the academic community, there are custom reconfigurable devices with different granularity levels that make it possible to achieve higher performance, compared to commercial FPGAs, but only for a certain application domain. Although there are very good solutions in the academic state of the art, their main drawback from the industry point of view is that they require specific design environments and that the effort and resources needed to design such solutions are very high. This thesis work is focused on providing solutions that target commercial fine-grain reconfigurable devices, FPGAs, in order to take advantage of existing tools and to keep the proposed solutions closer to industry. Today FPGAs provide different reconfiguration options. Among them, the most challenging one is partial reconfiguration. This feature is of special interest, as it permits system updates on the fly once the device is deployed, without the need to stop it and without theoretical loss of performance. Partial reconfiguration is also an attractive feature because it permits allocating different tasks/cores running in parallel in the device and changing them on the fly as needed, without disturbing other tasks/cores. This basic idea brings software-like flexibility to hardware which, in combination with its inherent parallelism, opens the door to a broad range of possibilities and applications, like runtime adaptive super-computing, adaptive embedded software accelerators, and bio-inspired, self-reconfigurable and self-arrangeable systems.
However, even though some commercial FPGAs provide partial reconfiguration features, the use of this feature is still in its early stages and it is not well supported by FPGA vendors, making its exploitation in real electronic systems very difficult. Therefore, several academic groups are working to provide alternative solutions for the design and implementation of partially reconfigurable systems based on commercial FPGAs, intending to stimulate their integration and use in industry. This research work intends to study different aspects of partially reconfigurable systems on chip and to contribute flexibility improvements. The main idea followed throughout the thesis is that the design of reconfigurable systems is considered a process independent from the design of the cores that will be consumed by the system. This approach involves the design of flexible and scalable partial runtime reconfigurable systems, where most of the thesis contributions are focused. More specifically, this thesis contributes improved architecture solutions, design tools and design flows for partially reconfigurable systems on commercial FPGAs, providing systems with higher flexibility and scalability. Flexibility and scalability in a reconfigurable system are terms that can be related to several aspects. In this thesis, flexibility is mainly related to the diversity of tasks or cores a system can consume, while scalability is connected to the number of cores that can run in parallel and be independently reconfigured. Flexibility and scalability have to be covered by the system at different levels, and the work presented in this thesis contributes at all of these levels. In more detail, from an architectural point of view, flexibility is reflected by the possibility of freely loading tasks or cores into a defined, scalable architecture. From the system point of view, flexibility is related to the possibility not only of modifying the system functionality by loading different tasks, but also of adapting the on-chip communications. From the device point of view, flexibility is reflected by the transparency of the reconfiguration process and, from the design point of view, it is oriented to the definition of design tools and flows that permit, as far as possible, non-specialized designers to design cores for a partially reconfigurable system without knowing the system details. All the proposed original solutions, in each individual aspect, are compared with the state of the art, and complete system solutions are designed and integrated into different application domains in order to validate the thesis proposals. To make the thesis easier to follow and to facilitate comparison with selected related work, the thesis structure is not traditional: instead of a single state-of-the-art chapter and a single results chapter, each chapter focuses on a specific aspect of partially reconfigurable system design and includes its own state-of-the-art and results sections. The first chapter, Chapter 1, introduces the main concepts used in the thesis. Chapter 2 focuses on reconfigurable system architectures, contributing architecture solutions and a design method. Chapter 3 proposes a solution that enhances the features of the architectures defined in Chapter 2 and provides more flexibility to the entire system by extending reconfiguration to the on-chip communication.
Chapter 4 is related to the design flows and tools, where contributions are made in both aspects and the proposed solutions are compared with the state of the art. Complete systems, with different levels of independence, are presented in Chapter 5 in order to validate the thesis contributions. Conclusions, a summary of contributions and future work are included in Chapter 6. A more detailed description of each chapter's content is presented below. Chapter 1 provides an introduction to reconfigurable systems based on FPGAs, first defining the place of FPGAs in the electronics industry and then introducing the main concepts used throughout the thesis. Although the focus is on commercial reconfigurable devices, some custom reconfigurable systems are also described in order to give a complete view of the options in reconfigurable devices. The chapter discusses the thesis's main topic, partial runtime reconfigurable systems, highlighting their main advantages and disadvantages and introducing some of the approaches followed in this thesis. The main term introduced in this chapter, associated with reconfigurable system architectures, is "Virtual Architecture". The term defines the architecture of the partially reconfigurable system and how the different regions it is composed of are interconnected. A brief summary of the thesis's main goals is included at the end of the chapter. The main topic of Chapter 2 is reconfigurable system architecture design. The chapter includes a specific state-of-the-art section that reviews some existing architecture solutions. After that, a general method for virtual architecture design, an original thesis contribution, is presented in detail. Afterwards, the method is applied to the design of general one-dimensional (1D) and two-dimensional (2D) architectures for Xilinx Virtex II/Pro FPGAs and, as an example, following the specific steps of the method, two 1D, bus-based architectures are designed. The architecture buses are compared with two state-of-the-art solutions in terms of area and performance in the results section of the same chapter. Chapter 3 focuses on on-chip communication issues in reconfigurable systems, where the need for adaptability is the main topic. Again, a state-of-the-art description of some 2D reconfigurable systems is presented at the beginning of the chapter and, afterwards, an original solution, called Dynamic Reconfigurable NoC (DRNoC), is proposed. This solution covers different aspects. First, an architecture oriented to support adaptability in the on-chip communications is originally proposed. The architecture is mapped to a Virtex II FPGA by modifying a virtual architecture from the general ones presented in Chapter 2. Second, two types of reconfiguration that span different levels of the OSI communication model are originally proposed. Third, a set of Network-on-Chip models focused on communication adaptability is designed and/or adapted and presented in the chapter, along with an original NoC packet format and router architecture. These models are mapped to the DRNoC architecture, and implementation cost parameters are defined and used to evaluate different implementation options. Regarding the architecture reconfigurability, it is important to remark that intermediate tests of possible partial reconfigurations and their results are included throughout the chapter.
At the end of the chapter, the proposed architecture is compared with the state of the art using a set of structural parameters taken from a reference work and complemented with others defined in the chapter. Chapter 4 focuses on the design tools and flows for partially reconfigurable systems. Again, an overview of the state of the art is included at the beginning of the chapter. Afterwards, an original software solution for manipulating Virtex II configuration files (bitstreams) is presented. The first part of the solution is a study of the Virtex II/Pro FPGA bitstream format, used to define a set of equations for accessing a specific bitstream resource (at register or block level). Based on these equations, a set of tools for bitstream manipulation that target resource-restricted devices is presented. Also, a design flow based on system and virtual architecture templates is proposed, which permits straightforward core design by designers who are not partial-reconfiguration experts and who do not know the system details. In Chapter 5, four reconfigurable systems with different flexibility levels, corresponding to the degree to which the thesis proposals are exploited, are presented. The selected application domains attempt to demonstrate different advantages of partial runtime reconfigurable systems and are therefore mainly proofs of concept. The first domain belongs to wireless sensor networks, where t
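To give a flavor of what a NoC packet format and router decision look like, the C sketch below defines a minimal header and a dimension-ordered (XY) routing function for a mesh. The field widths and routing policy are assumptions for illustration; they are not the actual DRNoC frame format or router architecture.

```c
/* Illustrative flit/packet header for a mesh NoC; widths are assumed. */
typedef struct {
    unsigned dst_x  : 4;   /* destination column               */
    unsigned dst_y  : 4;   /* destination row                  */
    unsigned src_x  : 4;   /* source column                    */
    unsigned src_y  : 4;   /* source row                       */
    unsigned length : 8;   /* number of payload flits to follow */
} noc_header;

typedef enum { PORT_LOCAL, PORT_NORTH, PORT_SOUTH, PORT_EAST, PORT_WEST } noc_port;

/* Dimension-ordered (XY) routing: correct the X coordinate first, then Y
 * (rows are assumed to grow toward the south). */
noc_port route(noc_header h, unsigned here_x, unsigned here_y) {
    if (h.dst_x > here_x) return PORT_EAST;
    if (h.dst_x < here_x) return PORT_WEST;
    if (h.dst_y > here_y) return PORT_SOUTH;
    if (h.dst_y < here_y) return PORT_NORTH;
    return PORT_LOCAL;     /* packet has reached its destination node */
}
```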

    Design Techniques for Energy-Quality Scalable Digital Systems

    Energy efficiency is one of the key design goals in modern computing. Increasingly complex tasks are being executed in mobile devices and Internet of Things end-nodes, which are expected to operate for long time intervals, on the order of months or years, with the limited energy budgets provided by small form-factor batteries. Fortunately, many of such tasks are error resilient, meaning that they can tolerate some relaxation in the accuracy, precision or reliability of internal operations, without a significant impact on the overall output quality. The error resilience of an application may derive from a number of factors. The processing of analog sensor inputs measuring quantities from the physical world may not always require maximum precision, as the amount of information that can be extracted is limited by the presence of external noise. Outputs destined for human consumption may also contain small or occasional errors, thanks to the limited capabilities of our vision and hearing systems. Finally, some computational patterns commonly found in domains such as statistics, machine learning and operational research naturally tend to reduce or eliminate errors. Energy-Quality (EQ) scalable digital systems systematically trade off the quality of computations with energy efficiency, by relaxing the precision, the accuracy, or the reliability of internal software and hardware components in exchange for energy reductions. This design paradigm is believed to offer one of the most promising solutions to the pressing need for low-energy computing. Despite these high expectations, the current state of the art in EQ scalable design suffers from important shortcomings. First, the great majority of techniques proposed in the literature focus only on processing hardware and software components. Nonetheless, for many real devices, processing contributes only a small portion of the total energy consumption, which is dominated by other components (e.g. I/O, memory or data transfers). Second, in order to fulfill its promises and become diffused in commercial devices, EQ scalable design needs to achieve industrial-level maturity. This involves moving from purely academic research based on high-level models and theoretical assumptions to engineered flows compatible with existing industry standards. Third, the time-varying nature of error tolerance, both among different applications and within a single task, should become more central in the proposed design methods. This involves designing "dynamic" systems in which the precision or reliability of operations (and consequently their energy consumption) can be dynamically tuned at runtime, rather than "static" solutions, in which the output quality is fixed at design time. This thesis introduces several new EQ scalable design techniques for digital systems that take the previous observations into account. Besides processing, the proposed methods apply the principles of EQ scalable design also to interconnects and peripherals, which are often relevant contributors to the total energy in sensor nodes and mobile systems respectively. Regardless of the target component, the presented techniques pay special attention to the accurate evaluation of benefits and overheads deriving from EQ scalability, using industrial-level models, and to the integration with existing standard tools and protocols. Moreover, all the works presented in this thesis allow the dynamic reconfiguration of output quality and energy consumption.
More specifically, the contribution of this thesis is divided into three parts. In a first body of work, the design of EQ scalable modules for processing hardware data paths is considered. Three design flows are presented, targeting different technologies and exploiting different ways to achieve EQ scalability, i.e. timing-induced errors and precision reduction. These works are inspired by previous approaches from the literature, namely Reduced-Precision Redundancy and Dynamic Accuracy Scaling, which are re-thought to make them compatible with standard Electronic Design Automation (EDA) tools and flows, providing solutions to overcome their main limitations. The second part of the thesis investigates the application of EQ scalable design to serial interconnects, which are the de facto standard for data exchanges between processing hardware and sensors. In this context, two novel bus encodings are proposed, called Approximate Differential Encoding and Serial-T0, that exploit the statistical characteristics of data produced by sensors to reduce the energy consumption on the bus at the cost of controlled data approximations. The two techniques achieve different results for data of different origins, but share the common features of allowing runtime reconfiguration of the allowed error and being compatible with standard serial bus protocols. Finally, the last part of the manuscript is devoted to the application of EQ scalable design principles to displays, which are often among the most energy-hungry components in mobile systems. The two proposals in this context leverage the emissive nature of Organic Light-Emitting Diode (OLED) displays to save energy by altering the displayed image, thus inducing an output quality reduction that depends on the amount of such alteration. The first technique implements an image-adaptive form of brightness scaling, whose outputs are optimized in terms of balance between power consumption and similarity with the input. The second approach achieves concurrent power reduction and image enhancement, by means of an adaptive polynomial transformation. Both solutions focus on minimizing the overheads associated with a real-time implementation of the transformations in software or hardware, so that these do not offset the savings in the display. For each of these three topics, results show that the aforementioned goal of building EQ scalable systems compatible with existing best practices and mature for being integrated in commercial devices can be effectively achieved. Moreover, they also show that very simple and similar principles can be applied to design EQ scalable versions of different system components (processing, peripherals and I/O), and to equip these components with knobs for the runtime reconfiguration of the energy versus quality tradeoff.
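The idea of trading bounded approximation for bus energy can be illustrated with a generic delta encoder: consecutive sensor samples are sent as clamped, narrower differences, so fewer bus bits are driven per sample at the cost of a bounded error on fast-changing inputs. This sketch only illustrates that general idea; it is not the Approximate Differential Encoding or Serial-T0 schemes proposed in the thesis.

```c
#include <stdint.h>

/* Clamped delta encoding: 8-bit samples become 4-bit signed differences.
 * The clamp introduces a bounded error on large jumps; the encoder tracks
 * the decoder's reconstructed value so the two sides never drift apart. */
#define DELTA_MAX  7
#define DELTA_MIN -8

int8_t encode_sample(uint8_t sample, uint8_t *last) {
    int d = (int)sample - (int)*last;
    if (d > DELTA_MAX) d = DELTA_MAX;      /* clamp: bounded approximation */
    if (d < DELTA_MIN) d = DELTA_MIN;
    *last = (uint8_t)(*last + d);          /* decoder's view of the signal */
    return (int8_t)d;                      /* 4-bit payload sent on the bus */
}

uint8_t decode_sample(int8_t delta, uint8_t *last) {
    *last = (uint8_t)(*last + delta);
    return *last;
}
```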