6 research outputs found

    Efficient performance scaling of future CGRAs for mobile applications

    Full text link

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Recon๏ฌgurable Array (CGRA) architectures accelerate the same inner loops that bene๏ฌt from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more ef๏ฌciently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on ๏ฌ‚exibility, performance, and power-ef๏ฌciency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual ๏ฌne-tuning of source code

    ์žฌ๊ตฌ์„ฑํ˜• ์—ฐ์‚ฐ ๊ตฌ์กฐ๋ฅผ ์œ„ํ•œ ๋ถ€๋™์†Œ์ˆ˜์  ์ง€์›

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2014. 2. ์ตœ๊ธฐ์˜.With a huge increase in demand for various kinds of compute-intensive applications in electronic systems, researchers have focused on coarse-grained reconfigurable architectures because of their advantages: high performance and flexibility. Besides, supporting floating-point operations on coarse-grained reconfigurable architecture becomes essential as the increase of demands on various floating-point inclusive applications such as multimedia processing, 3D graphics, augmented reality, or object recognition. This thesis presents FloRA, a coarse-grained reconfigurable architecture with floating-point support. Two-dimensional array of integer processing elements in FloRA is configured at run-time to perform floating-point operations as well as integer operations. More specifically, each floating-point operation is performed by two integer processing elements, one for mantissa and the other for exponent. Fabricated using 130nm process, the total area overhead due to additional hardware for floating-point operations is about 7.4% compared to the previous architecture which does not support floating-point operations. The fabricated chip runs at 125MHz clock frequency and 1.2V power supply. Experiments show 11.6x speedup on average compared to ARM9 with a vector-floating-point unit for integer-only benchmark programs as well as programs containing floating-point operations. Compared with other similar approaches including XPP and Butter, the proposed architecture shows much higher performance for integer applications, while maintaining about half the performance of Butter for floating-point applications. This thesis also proposes novel techniques to enhance utilization of integer units for high-throughput floating-point operations on CGRA. The approach to implementing floating-point operations on CGRA presented in this thesis enables floating-point functionality with less area overhead compared to the traditional approach of employing separate floating-point units (FPUs). However the total latency of a floating-point operation is larger than that of the traditional approach and the data dependency between split integer operations restricts further enhancement in terms of utilization of integer functional units in an operation. In order to overcome such inefficiency, two techniques are proposed in this thesis. One is overlapping two distinct floating-point operations, which increases the efficiency in terms of utilizations of integer functional units in the architecture. Free integer functional units in a floating-point operation can be used for another floating-point operation with this technique. The other is forwarding between two data-dependent floating-point operations, which decreases effective latency of the floating-point operations. The basic idea is to remove unnecessary calculations such as formatting which is normally done in between the two data-dependent floating-point operations. To implement the overlapping or forwarding, FSMs and control paths in each PE are modified and temporal/communication registers are added. Light-weight sub-module such as increment units and registers for intermediate values are added for releasing resource conflict. Experiment is done with several arithmetic functions that are widely used in floating-point applications. The base architecture and the new architecture implementing the proposed technique are compared in terms of throughput and area overhead. The experimental result shows that the proposed technique increases the throughput by 33.9% on average with 20.9% of area overhead.Abstract i Contents v List of Figures ix List of Tables xv Chapter 1 INTRODUCTION 1 Chapter 2 TARGET ARCHITECTURE 7 2.1 Overall Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Reconfigurable Computing Module . . . . . . . . . . . . . . . . . 8 Chapter 3 DEGISN OF FLOATING-POINT OPERATIONS 15 3.1 Floating-point Numbers . . . . . . . . . . . . . . . . . . . . . . . 15 3.1.1 Representation of floating-point numbers . . . . . . . . . . 15 3.1.2 Floating-point operations . . . . . . . . . . . . . . . . . . . 19 3.2 FPU-PE Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2.1 Construction of FPU-PE Cluster . . . . . . . . . . . . . . . 20 3.2.2 Construction of Array of FPU-PE Clusters . . . . . . . . . 21 3.2.3 Comparing Different FPU-PE Clusters . . . . . . . . . . . 23 3.3 Implementation of Multi-Cycle Operations . . . . . . . . . . . . 26 3.4 Implementation of Floating-Point Operations . . . . . . . . . . . 30 3.5 Implementation of Floating-Point Operations Using Shared Modules . . . 32 Chapter 4 Chip Implementation 35 4.1 Specification of Chip Implementation . . . . . . . . . . . . . . . . 35 4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3 Experimantal Results . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.3.1 Performance Comparison . . . . . . . . . . . . . . . . . . . 39 4.3.2 Power Consumption Comparison . . . . . . . . . . . . . . 42 Chapter 5 Comparison with Other Architectures 45 5.1 Preparation for the comparison . . . . . . . . . . . . . . . . . . . 45 5.2 Comparison with PACT XPP . . . . . . . . . . . . . . . . . . . . . 47 5.3 Comparison with Butter Architecture . . . . . . . . . . . . . . . . 50 5.4 Implication of the proposed architecture . . . . . . . . . . . . . . 57 Chapter 6 Enhancement Techniques 63 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2 Conventional Approach . . . . . . . . . . . . . . . . . . . . . . . 64 6.2.1 Base Architecture . . . . . . . . . . . . . . . . . . . . . . . 64 6.2.2 Utilization of Floating-Point Operations . . . . . . . . . . 65 6.3 Proposed Enhancement Techniques . . . . . . . . . . . . . . . . . 66 6.3.1 Overlapping Technique . . . . . . . . . . . . . . . . . . . . 66 6.3.2 Forwarding Technique . . . . . . . . . . . . . . . . . . . . . 71 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 6.4.1 Performance Comparison . . . . . . . . . . . . . . . . . . . 76 6.4.2 Hardware Cost of the Proposed Techniques . . . . . . . . . 77 6.4.3 Utilization Enhancement by the Proposed Techniques . . . 80 6.5 Comparison with Other Architecture . . . . . . . . . . . . . . . . 87 Chapter 7 Conclusion 93 Bibliography 95 ๊ตญ๋ฌธ์ดˆ๋ก 103 ๊ฐ์‚ฌ์˜ ๊ธ€ 105Docto

    ์žฌ๊ตฌ์„ฑํ˜• ๊ตฌ์กฐ์—์„œ์˜ ํšจ์œจ์ ์ธ ์กฐ๊ฑด์‹คํ–‰ ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2013. 8. ์ตœ๊ธฐ์˜.์žฌ๊ตฌ์„ฑํ˜• ๊ตฌ์กฐ๋Š” ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ์€ ํ”„๋กœ๊ทธ๋žจ์„ ๋‚ด์žฅํ˜• ์‹œ์Šคํ…œ์—์„œ ๊ฐ€์†์‹œํ‚ค๋Š” ๋ฐ ์ ํ•ฉํ•œ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ์ด๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ๋งŽ์€ ์—ฐ์‚ฐ์œ ๋‹›๋“ค๊ณผ ํ•˜๋‚˜์˜ ์ปจํŠธ๋กค๋Ÿฌ๋กœ ๊ตฌ์„ฑ๋˜์–ด ๊ณ ์„ฑ๋Šฅ, ์œ ์—ฐ์„ฑ, ์ €์ „๋ ฅ์„ ๋™์‹œ์— ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ด์ค€๋‹ค. ๋งŽ์€ ์—ฐ์‚ฐ์œ ๋‹›์„ ๋ฐ”ํƒ•์œผ๋กœ ํ•œ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๋Š” ์‘์šฉํ”„๋กœ๊ทธ๋žจ์˜ ์‹คํ–‰์†๋„๋ฅผ ๋น ๋ฅด๊ฒŒ ํ•˜๋ฉฐ, ์žฌ๊ตฌ์„ฑ ๊ธฐ๋Šฅ์€ ๋‹ค์–‘ํ•œ ์‘์šฉํ”„๋กœ๊ทธ๋žจ์—์˜ ํ™œ์šฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ด์ค€๋‹ค. ๋˜ํ•œ, ๋ช…๋ น์–ด์™€ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์Šค์ผ€์ฅด์„ ๋ฏธ๋ฆฌ ์ •ํ•ด๋†“์Œ์œผ๋กœ์จ ์ œ์–ด๊ตฌ์กฐ๋ฅผ ๋‹จ์ˆœํ™”์‹œํ‚ฌ ์ˆ˜ ์žˆ์œผ๋ฉฐ ์ด๋Š” ์—ฐ์‚ฐ๋Ÿ‰ ๋Œ€๋น„ ์ „๋ ฅ์†Œ๋ชจ๋ฅผ ์ตœ์†Œํ•œ์œผ ๋กœ ์ค„์—ฌ์ค€๋‹ค. ํ•˜์ง€๋งŒ ์‘์šฉํ”„๋กœ๊ทธ๋žจ์ด ๋ณต์žกํ•ด์ง์— ๋”ฐ๋ผ ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ์€ ๋ถ€๋ถ„๋“ค์— ๋ถ„๊ธฐ๋ฌธ์ด ์ƒ๊ธฐ๊ฒŒ ๋˜์—ˆ์œผ๋ฉฐ ์ด๋Š” ์žฌ๊ตฌ์„ฑํ˜• ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•จ์— ์žˆ์–ด ํฐ ์œ„ํ˜‘์ด ๋˜๊ณ  ์žˆ๋‹ค. ๋ถ„๊ธฐ๋ฌธ์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋Š” ์ปจํŠธ๋กค๋Ÿฌ๊ฐ€ ํ•˜๋‚˜์ด๊ธฐ ๋•Œ๋ฌธ์— ์ปจํŠธ๋กค๋Ÿฌ์— ๋ณ‘๋ชฉํ˜„์ƒ์ด ๋ฐœ์ƒํ•˜๊ฑฐ๋‚˜ ๋™์‹œ์— ์„œ๋กœ ๋‹ค๋ฅธ ์ œ์–ด๋ฅผ ์š”๊ตฌํ•˜๊ฒŒ ๋˜๋ฉด ํ•ด๋‹น ํ”„๋กœ๊ทธ๋žจ์€ ๊ฐ€์†์ด ๋ถˆ๊ฐ€๋Šฅํ•ด์ง„๋‹ค. ์กฐ๊ฑด์‹คํ–‰์ด๋ผ๋Š” ๊ธฐ์ˆ ์„ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ ์ด๋ฅผ ๋ถ€๋ถ„์ ์œผ๋กœ ํ•ด์†Œํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ๊ธฐ์กด์— ๊ฐœ๋ฐœ๋˜์–ด ์žˆ๋Š” ์กฐ๊ฑด์‹คํ–‰ ๊ธฐ์ˆ ๋“ค์€ ์žฌ๊ตฌ์„ฑํ˜• ๊ตฌ์กฐ์— ์„ฑ๋Šฅ ๋ฐ ์ „๋ ฅ์†Œ๋ชจ ๋ฉด์—์„œ ๋ถ€์ •์ ์ธ ์˜ํ–ฅ์„ ๋ผ์นœ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ์ง€๋งŒ ๋ถ„๊ธฐ๋ฌธ์„ ๊ฐ€์ง„ ์‘์šฉํ”„๋กœ๊ทธ๋žจ์—์„œ ์กฐ๊ฑด์‹คํ–‰์ด ์„ฑ๋Šฅ๊ณผ ์ „๋ ฅ ๋ฉด์—์„œ ์–ด๋– ํ•œ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ๋ฐํžˆ๋ฉฐ ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๊ณ ์„ฑ๋Šฅ๊ณผ ์ €์ „๋ ฅ์„ ๊ฐ€์ง„ ์กฐ๊ฑด์‹คํ–‰ ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ์— ๋”ฐ๋ฅด๋ฉด ์ œ์•ˆํ•œ ๋ฐฉ์‹์€ ๊ธฐ์กด์˜ ์„ธ๊ฐ€์ง€ ๋ฐฉ์‹๋ณด๋‹ค ์„ฑ๋Šฅ๊ณผ ์ „๋ ฅ์†Œ๋ชจ๋ฅผ ๊ณฑ์œผ๋กœ ํ‘œํ˜„ํ•œ ์ˆ˜์น˜์— ์žˆ์–ด์„œ 11.9%, 14.7%, 23.8% ๋งŒํผ์˜ ์ด๋“์„ ๋ณด์˜€๋‹ค. ๋˜ํ•œ, ์ œ์•ˆํ•œ ์กฐ๊ฑด์‹คํ–‰ ๋ฐฉ๋ฒ•์— ์ ํ•ฉํ•œ ์ปดํŒŒ์ผ ์ฒด๊ณ„๋„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ œ์•ˆํ•œ ์กฐ๊ฑด์‹คํ–‰์€ ์ ˆ์ „๋ชจ๋“œ๋ฅผ ์‚ฌ์šฉํ•จ์— ๋”ฐ๋ผ ์ „๋ ฅ์„ ์•„๋‚„ ์ˆ˜ ์žˆ์ง€๋งŒ ๊ธฐ์กด์˜ ์ปดํŒŒ์ผ๋ฐฉ์‹์œผ๋กœ๋Š” ์—ฌ๋Ÿฌ ์กฐ๊ฑด๋ฌธ์„ ๋ณ‘๋ ฌ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๋„๋ก ์ปดํŒŒ์ผํ•  ์ˆ˜ ์—†๋Š” ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธด๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ๋ฐํžˆ๊ณ  ์กฐ๊ฑด๋ฌธ๋“ค์„ ์„œ๋กœ ๋‹ค๋ฅธ ์—ฐ์‚ฐ์œ ๋‹›์— ํ• ๋‹นํ•จ์œผ๋กœ์จ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉ์‹์„ ์ œ์•ˆํ•˜๊ณ  ์žˆ๋‹ค. ์ œ์•ˆํ•œ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ ๋‹จ์ˆœํ•˜๊ณ  ์ง๊ด€์ ์ธ ๋ฐฉ๋ฒ•์— ๋น„ํ•˜์—ฌ ํ‰๊ท ์ ์œผ๋กœ 2.21๋ฐฐ์˜ ๋†’์€ ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ๋‹ค.Coarse-Grained Reconfigurable Architecture (CGRA) is one of viable solutions in embedded systems to accelerate data-intensive applications. It typically consists of an array of processing elements (PEs) and a centralized controller, which can provide high performance, flexibility, and low power. Parallel array processing reduces execution time of applications, reconfigurability of PEs allows changing its functionality, and simplified control structure with static scheduling for instruction fetching and data communication minimizes power consumption. However, as applications become complex so that data-intensive parts are having control flows in them, CGRAs face a challenge for its effectiveness. Since the entire PEs are controlled by a centralized unit, it is impossible to execute programs having control divergence among PEs. To overcome the problem, we can adopt the technique called predicated execution, which is the unique solution known so far, but conventional predication techniques have a negative impact on both performance and power consumption due to longer instruction words and unnecessary instruction-fetching/decoding/nullifying steps. Thus, this thesis reveals performance and power issues in predicated execution when a CGRA executes both data- and control-intensive applications, which have not been well-addressed yet. Then it proposes high-performance and low-power predication mechanisms. Experiments conducted through gate-level simulation show that the proposed mechanism improves energy-delay product by 11.9%, 14.7%, and 23.8% compared to three conventional techniques. In addition, this thesis also reveals mapping issues when mapping applications on CGRAs using the proposed predication. A power-saving mode introduced into PEs prohibits multiple conditionals from being parallelized if conventional mapping algorithms are used. Thus, this thesis proposes the framework to release this problem by mapping conditionals to different PEs. Experiments show that mapping results from the proposed approach lead to 2.21 times higher performance than those of the naรฏve approach.Abstract i Chapter 1 Introduction 1 Chapter 2 Background and Related Work 5 2.1 Coarse-Grained Reconfigurable Architecture . . . . . . . . . . . . 5 2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Target Domain . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Comparison with Other Architectures . . . . . . . . . . . 6 2.1.4 Application Mapping . . . . . . . . . . . . . . . . . . . . . 8 2.1.5 Target CGRA . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Predicated Execution Technique . . . . . . . . . . . . . . . . . . 11 2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.3 Different Roles in ILP and DLP processors . . . . . . . . 13 2.2.4 Predication Support on CGRAs . . . . . . . . . . . . . . . 14 Chapter 3 Conventional Predicated Execution Techniques 15 3.1 Partial Predication (Partial) . . . . . . . . . . . . . . . . . . . . 16 3.2 Condition-Based Full Predication (CondFull) . . . . . . . . . . 18 Chapter 4 State-Based Full Predication 23 4.1 Previous Approach (PseudoBranch) . . . . . . . . . . . . . . . 24 4.2 Counter-Based Approach (StateFull) . . . . . . . . . . . . . . 25 4.3 Dual-Issue-Single-Execution (DISE) . . . . . . . . . . . . . . . . 28 4.4 Hybrid Predication . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4.2 StateFull+Partial . . . . . . . . . . . . . . . . . . . . 34 4.4.3 StateFull+Partial+DISE . . . . . . . . . . . . . . . . 35 Chapter 5 Evaluation 39 5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.1.1 Conventional Techniques . . . . . . . . . . . . . . . . . . . 39 5.1.2 Proposed Techniques . . . . . . . . . . . . . . . . . . . . . 40 5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.3.1 Effect of Predication Mechanism on Power Consumption of a PE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.3.2 Quantitative Definitions of short-if and long-if . . . . . . 48 5.3.3 Compilation Strategy in StateFull+Partial . . . . . . 48 5.3.4 Conventional Techniques (Partial, CondFull, and PseudoBranch) vs. Proposed StateFull Technique . . . . . 49 5.3.5 Proposed Hybrid Predication Techniques . . . . . . . . . 53 5.3.6 Putting Together . . . . . . . . . . . . . . . . . . . . . . . 54 5.3.7 Speedup of Applications . . . . . . . . . . . . . . . . . . . 57 Chapter 6 Mapping Framework 61 6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2.1 Overall Flow . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2.2 From IR to CDFG . . . . . . . . . . . . . . . . . . . . . . 64 6.2.3 Separation . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.2.4 CDFG Mapping . . . . . . . . . . . . . . . . . . . . . . . 68 6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 69 6.4.2 Verification of Mapping Framework . . . . . . . . . . . . . 70 6.4.3 Quality of Mapping Results . . . . . . . . . . . . . . . . . 70 Chapter 7 Conclusion 73 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 7.2 Applicable Scope and Future Work . . . . . . . . . . . . . . . . . 75 Appendix 77 ๊ตญ๋ฌธ์ดˆ๋ก 93 ๊ฐ์‚ฌ์˜ ๊ธ€ 95Docto

    Libra: Achieving Efficient Instruction- and Data- Parallel Execution for Mobile Applications.

    Full text link
    Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing richer user experiences and compelling capabilities: higher definition multimedia, 3D graphics, augmented reality, and voice interfaces. To meet these goals, the core computing capabilities of the smart phone must be scaled. But, the energy budgets are increasing at a much lower rate, thus fundamental improvements in computing efficiency must be garnered. To meet this challenge, computer architects employ hardware accelerators in the form of SIMD and VLIW. Single-instruction multiple-data (SIMD) accelerators provide high degrees of scalability for applications rich in data-level parallelism (DLP). Very long instruction word (VLIW) accelerators provide moderate scalability for applications with high degrees of instruction-level parallelism (ILP). Unfortunately, applications are not so nicely partitioned into two groups: many applications have some DLP, but also contain significant fractions of code with low trip count loops, complex control/data dependences, or non-uniform execution behavior for which no DLP exists. Therefore, a more adaptive accelerator is required to be able to deploy resources as needed: exploit DLP on SIMD when itโ€™s available, but fall back to ILP on the same hardware when necessary. In this thesis, we first focus on various compiler solutions that solve inefficiency problem in both VLIW and SIMD accelerators. For SIMD accelerators, a new vectorization pass, called SIMD Defragmenter, is introduced to uncover hidden DLP using subgraph identification in SIMD accelerators. CGRA express effectively accelerates sequential code regions using a bypass network in VLIW accelerators, and Resource Recycling leverages stream-graph modulo scheduling technique for scheduling of multiple code regions in multi-core accelerators. Second, we propose the new scalable multicore accelerator referred to as Libra for mobile systems, which can support execution of code regions having both DLP and ILP, as well as hybrid combinations of the two. We believe that as industry requires higher performance, the proposed flexible accelerator and compiler support will put more resources to work in order to meet the performance and power efficiency requirements.PHDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/99840/1/yjunpark_1.pd
    corecore