165 research outputs found
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops and execute them more efficiently. This chapter discusses the basic principles of CGRAs and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support, and for the manual fine-tuning of source code.
DESIGNING COST-EFFECTIVE COARSE-GRAINED RECONFIGURABLE ARCHITECTURE
Application-specific optimization of embedded systems becomes inevitable to satisfy the
market demand for designers to meet tighter constraints on cost, performance and power.
On the other hand, the flexibility of a system is also important to accommodate the short
time-to-market requirements for embedded systems. To reconcile these conflicting
demands, coarse-grained reconfigurable architecture (CGRA) has emerged as a suitable
solution. A typical CGRA requires many processing elements (PEs) and a configuration
cache for reconfiguration of its PE array. However, such a structure consumes significant
area and power. Therefore, designing a cost-effective CGRA has been a serious concern
for reliability of CGRA-based embedded systems.
As an effort to provide such cost-effective design, the first half of this work
focuses on reducing power in the configuration cache. For power saving in the configuration
cache, a low power reconfiguration technique is presented based on reusable context
pipelining achieved by merging the concept of context reuse into context pipelining.
In addition, we propose dynamic context compression, which keeps only the required
bits of the context words enabled and disables the redundant bits. Finally, we provide
dynamic context management, which reduces power consumption
in the configuration cache by controlling the read/write operations of the redundant
context words.
In the second part of this dissertation, we focus on designing a cost-effective PE array
to reduce area and power. For area and power saving in a PE array, we devise a cost-effective
array fabric that uses a novel rearrangement of processing elements and their
interconnection designs to reduce area and power consumption. In addition, hierarchical
reconfigurable computing arrays are proposed, consisting of two reconfigurable computing
blocks with two types of communication structure. The two computing
blocks share critical resources, and this sharing structure provides an efficient
communication interface between them while reducing overall area.
Based on the proposed design approaches, a CGRA combining the multiple design
schemes is shown to verify the synergy effect of the integrated approach. Experimental
results show that the integrated approach reduces area by 23.07% and power by up to
72% when compared with the conventional CGRA.
Virtual Runtime Application Partitions for Resource Management in Massively Parallel Architectures
This thesis presents a novel design paradigm, called Virtual Runtime Application Partitions (VRAP), to judiciously utilize on-chip resources. As the dark silicon era approaches, in which power considerations will allow only a fraction of the chip to be powered on, judicious resource management will become a key consideration in future designs. Most work on resource management treats only the physical components (i.e. computation, communication, and memory blocks) as resources and manipulates the component-to-application mapping to optimize various parameters (e.g. energy efficiency). To further enhance the optimization potential, we propose to manipulate abstract resources (i.e. the voltage/frequency operating point, the fault-tolerance strength, the degree of parallelism, and the configuration architecture) in addition to the physical resources. The proposed framework (i.e. VRAP) encapsulates methods, algorithms, and hardware blocks to provide each application with the abstract resources tailored to its needs. To test the efficacy of this concept, we have developed three distinct self-adaptive environments: (i) Private Operating Environment (POE), (ii) Private Reliability Environment (PRE), and (iii) Private Configuration Environment (PCE), which collectively ensure that each application meets its deadlines using minimal platform resources. In this work, several novel architectural enhancements, algorithms, and policies are presented to realize the virtual runtime application partitions efficiently. Considering future design trends, we have chosen Coarse Grained Reconfigurable Architectures (CGRAs) and Networks-on-Chip (NoCs) to test the feasibility of our approach. Specifically, we have chosen the Dynamically Reconfigurable Resource Array (DRRA) and McNoC as the representative CGRA and NoC platforms. The proposed techniques are compared and evaluated using a variety of quantitative experiments.
Synthesis and simulation results demonstrate that VRAP significantly enhances energy and power efficiency compared to the state of the art.
A coarse-grained dynamically reconfigurable MAC processor for power-sensitive multi-standard devices
DRMP, a Dynamically Reconfigurable MAC Processor, is an innovative, dynamically reconfigurable System-on-Chip architecture. The architecture exploits substantial overlaps in the functionality of different wireless MAC layers. Its flexibility is specialized for addressing the requirements of the MAC layer of wireless standards. It is targeted at consumer, multi-standard, handheld devices, and its design is meant to address the balance of flexibility and power-efficiency that this target market demands. The DRMP reconfigures packet-by-packet on the fly, allowing execution of concurrent protocol modes on a single hardware co-processor. An interrupt-driven programming model is also presented and shown to implement the protocol state-machine of the three protocols on a CPU. These features will allow the DRMP to replace three MAC processors in a hand-held device. The most innovative component of the DRMP architecture is its Interface and Reconfiguration Controller. It uses a combination of asynchronous controllers to dynamically reconfigure the functional units in the architecture and delegate MAC tasks to them. The architecture has been modeled in Simulink at cycle-approximate abstraction. Results of simulations involving transmission and reception of packets are presented, showing that the platform concurrently handles three protocol streams, reconfigures dynamically, yet meets and exceeds the protocol timing constraints, all at a moderate frequency. Its heterogeneous and coarse-grained functional units, the limited connectivity requirements between these units, and the proportionally large time during which these resources are idle promise very modest power consumption, suitable for mobile devices, while offering the flexibility to implement different MAC protocols.
Floating-Point Support for Reconfigurable Computing Architectures
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2014. 최기영 (Kiyoung Choi).
With a huge increase in demand for various kinds of compute-intensive applications in electronic systems, researchers have focused on coarse-grained reconfigurable architectures because of their advantages: high performance and flexibility. In addition, supporting floating-point operations on coarse-grained reconfigurable architectures has become essential as demand grows for applications that include floating-point operations, such as multimedia processing, 3D graphics, augmented reality, and object recognition.
This thesis presents FloRA, a coarse-grained reconfigurable architecture with floating-point support. The two-dimensional array of integer processing elements in FloRA is configured at run-time to perform floating-point operations as well as integer operations. More specifically, each floating-point operation is performed by two integer processing elements, one for the mantissa and the other for the exponent. For the chip fabricated in a 130nm process, the total area overhead due to additional hardware for floating-point operations is about 7.4% compared to the previous architecture, which does not support floating-point operations. The fabricated chip runs at a 125MHz clock frequency with a 1.2V power supply. Experiments show an 11.6x speedup on average compared to ARM9 with a vector-floating-point unit, for integer-only benchmark programs as well as programs containing floating-point operations. Compared with other similar approaches including XPP and Butter, the proposed architecture shows much higher performance for integer applications, while maintaining about half the performance of Butter for floating-point applications.
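The mantissa/exponent split described above can be illustrated with a minimal Python sketch. It is not FloRA's actual datapath: the 24-bit mantissa width, the helper names `unpack`, `pack`, and `fp_mul`, and the use of truncation instead of IEEE rounding are all assumptions made for illustration, and infinities and NaNs are not handled. It only shows that a floating-point multiply reduces to one integer multiply (the role of a mantissa PE) and one integer add (the role of an exponent PE).

```python
import math

MANT_BITS = 24  # assumed mantissa width, chosen for illustration only


def unpack(x):
    """Split a finite float into (sign, integer mantissa, integer exponent)."""
    m, e = math.frexp(abs(x))            # abs(x) == m * 2**e with 0.5 <= m < 1
    mant = int(m * (1 << MANT_BITS))     # truncate the mantissa to MANT_BITS bits
    sign = -1 if math.copysign(1.0, x) < 0 else 1
    return sign, mant, e - MANT_BITS     # value == sign * mant * 2**(e - MANT_BITS)


def pack(sign, mant, exp):
    """Rebuild a float from the integer parts."""
    return sign * math.ldexp(mant, exp)


def fp_mul(a, b):
    """Floating-point multiply using only integer operations on the parts."""
    sa, ma, ea = unpack(a)
    sb, mb, eb = unpack(b)
    mant = ma * mb     # integer multiply: the work of a hypothetical mantissa PE
    exp = ea + eb      # integer add: the work of a hypothetical exponent PE
    return pack(sa * sb, mant, exp)


print(fp_mul(1.5, 2.5))
print(fp_mul(-3.0, 0.5))
```

Inputs that are exactly representable in 24 bits multiply exactly in this sketch; other inputs are truncated, so results only approximate the IEEE 754 answer.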
This thesis also proposes novel techniques to enhance utilization of integer units for high-throughput floating-point operations on CGRA.
The approach to implementing floating-point operations on CGRA presented in this thesis enables floating-point functionality with less area overhead than the traditional approach of employing separate floating-point units (FPUs). However, the total latency of a floating-point operation is larger than that of the traditional approach, and the data dependency between the split integer operations restricts further improvement in the utilization of integer functional units within an operation. To overcome this inefficiency, two techniques are proposed in this thesis. One is overlapping two distinct floating-point operations, which increases the utilization of integer functional units in the architecture: integer functional units left free by one floating-point operation can be used by another. The other is forwarding between two data-dependent floating-point operations, which decreases the effective latency of the floating-point operations. The basic idea is to remove unnecessary calculations, such as formatting, that are normally done between the two data-dependent floating-point operations. To implement overlapping and forwarding, the FSMs and control paths in each PE are modified and temporary/communication registers are added. Lightweight sub-modules such as increment units and registers for intermediate values are added to resolve resource conflicts.
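The forwarding idea can be sketched in Python under loudly stated assumptions: the 24-bit mantissa, the helper names, and the way "formatting steps" are counted (one per unpack or pack) are hypothetical and chosen for illustration; they are not taken from the thesis. The sketch compares a chain of dependent multiplies that reformats every intermediate result with one that keeps the intermediate (mantissa, exponent) pair unpacked and formats only once at the end.

```python
import math

MANT_BITS = 24  # assumed mantissa width, for illustration only


def unpack(x):
    """Split a finite float into (sign, integer mantissa, integer exponent)."""
    m, e = math.frexp(abs(x))
    sign = -1 if math.copysign(1.0, x) < 0 else 1
    return sign, int(m * (1 << MANT_BITS)), e - MANT_BITS


def normalize(mant, exp):
    """Shift the mantissa back down to at most MANT_BITS bits (truncating)."""
    shift = max(mant.bit_length() - MANT_BITS, 0)
    return mant >> shift, exp + shift


def pack(sign, mant, exp):
    """Rebuild a float from the integer parts."""
    return sign * math.ldexp(mant, exp)


def chain_mul_plain(values):
    """Every multiply packs its result; the next multiply unpacks it again."""
    acc, fmt_steps = values[0], 0
    for v in values[1:]:
        sa, ma, ea = unpack(acc)
        sb, mb, eb = unpack(v)
        mant, exp = normalize(ma * mb, ea + eb)
        acc = pack(sa * sb, mant, exp)
        fmt_steps += 3                    # two unpacks and one pack per multiply
    return acc, fmt_steps


def chain_mul_forwarded(values):
    """The intermediate result stays unpacked between dependent multiplies."""
    s, m, e = unpack(values[0])
    fmt_steps = 1
    for v in values[1:]:
        sv, mv, ev = unpack(v)
        fmt_steps += 1                    # only the new operand is unpacked
        m, e = normalize(m * mv, e + ev)
        s *= sv
    return pack(s, m, e), fmt_steps + 1   # a single pack at the end


print(chain_mul_plain([1.5, 2.5, 4.0]))
print(chain_mul_forwarded([1.5, 2.5, 4.0]))
```

Both variants return the same product for this input; the forwarded chain simply performs fewer formatting steps, which is the kind of saving the abstract attributes to forwarding.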
Experiments are conducted with several arithmetic functions that are widely used in floating-point applications. The base architecture and the new architecture implementing the proposed techniques are compared in terms of throughput and area overhead. The experimental results show that the proposed techniques increase throughput by 33.9% on average at an area overhead of 20.9%.
Abstract
Contents
List of Figures
List of Tables
Chapter 1 INTRODUCTION
Chapter 2 TARGET ARCHITECTURE
2.1 Overall Architecture
2.2 Reconfigurable Computing Module
Chapter 3 DESIGN OF FLOATING-POINT OPERATIONS
3.1 Floating-point Numbers
3.1.1 Representation of floating-point numbers
3.1.2 Floating-point operations
3.2 FPU-PE Cluster
3.2.1 Construction of FPU-PE Cluster
3.2.2 Construction of Array of FPU-PE Clusters
3.2.3 Comparing Different FPU-PE Clusters
3.3 Implementation of Multi-Cycle Operations
3.4 Implementation of Floating-Point Operations
3.5 Implementation of Floating-Point Operations Using Shared Modules
Chapter 4 Chip Implementation
4.1 Specification of Chip Implementation
4.2 Experimental Setup
4.3 Experimental Results
4.3.1 Performance Comparison
4.3.2 Power Consumption Comparison
Chapter 5 Comparison with Other Architectures
5.1 Preparation for the comparison
5.2 Comparison with PACT XPP
5.3 Comparison with Butter Architecture
5.4 Implication of the proposed architecture
Chapter 6 Enhancement Techniques
6.1 Introduction
6.2 Conventional Approach
6.2.1 Base Architecture
6.2.2 Utilization of Floating-Point Operations
6.3 Proposed Enhancement Techniques
6.3.1 Overlapping Technique
6.3.2 Forwarding Technique
6.4 Experiments
6.4.1 Performance Comparison
6.4.2 Hardware Cost of the Proposed Techniques
6.4.3 Utilization Enhancement by the Proposed Techniques
6.5 Comparison with Other Architecture
Chapter 7 Conclusion
Bibliography
Abstract in Korean
Acknowledgments
RISPP: A Run-time Adaptive Reconfigurable Embedded Processor
This Ph.D. thesis describes a new approach for adaptive processors that uses a reconfigurable fabric (embedded FPGA) to implement application-specific accelerators. A novel modular Special Instruction composition is presented along with a run-time system that exploits the provided adaptivity. The approach was simulated and prototyped using an FPGA. Comparisons with state-of-the-art application-specific and reconfigurable processors demonstrate significant improvements in performance and efficiency.
MULTI-OBJECTIVE DESIGN AUTOMATION FOR RECONFIGURABLE MULTI-PROCESSOR SYSTEMS
Ph.D. (Doctor of Philosophy)
- …