21 research outputs found

    Compiler and Architecture Design for Coarse-Grained Programmable Accelerators

    abstract: The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As technology scales and transistors shrink, the power consumption and energy usage per transistor decrease. On the other hand, transistor density increases significantly with technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, energy efficiency must be addressed at all design levels, from the circuit level to the application and algorithm levels. At the architectural level, one promising approach is to populate the system with hardware accelerators, each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable; because each performs one specific function, their utilization can be low. Software-programmable accelerators are an alternative approach that achieves both high energy efficiency and programmability. Due to their intrinsic characteristics, they can exploit both instruction-level parallelism and data-level parallelism. A Coarse-Grained Reconfigurable Architecture (CGRA) is a software-programmable accelerator that consists of a number of word-level functional units. Motivated by the promising characteristics of software-programmable accelerators, the potential of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework covers three aspects: CGRA architectural design, integration in a computing system, and the CGRA compiler. First, the design and implementation of a CGRA and its instruction set are presented. This design is then modeled in a cycle-accurate system simulator. The simulation platform makes it possible to investigate several problems that arise when a CGRA is deployed as an accelerator in a computing system. Next, the problem of mapping a compute-intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed that make effective use of scarce CGRA resources to minimize the running time of input applications. Finally, these mapping algorithms are integrated into a compiler framework to construct a compiler for CGRA.
    Dissertation/Thesis; Doctoral Dissertation Computer Science 201

    Evaluator-Executor Transformation for Efficient Conditional Statements on CGRA

    Computer Engineering
    Control divergence poses many problems in parallelizing loops. While predicated execution is commonly used to convert control dependence into data dependence, it often incurs high overhead because it allocates resources equally for both branches of a conditional statement regardless of their execution frequencies. For loops with unbalanced conditionals, we propose a software transformation that divides a loop into two or three smaller loops, so that the condition is evaluated only in the first loop while the less frequent branch is executed in the second loop, much more efficiently than in the original loop. To reduce the overhead of the extra data transfer caused by the loop fission, we also present a hardware extension for a class of coarse-grained reconfigurable architectures (CGRAs). Our experiments using MiBench and computer vision benchmarks on a CGRA demonstrate that our techniques can improve the performance of loops over predicated execution by up to 65%, or 38.0% on average, when the hardware extension is enabled. Without any hardware modification, our software-only version can improve performance by up to 64%, or 33.2% on average, while simultaneously reducing the energy consumption of the entire CGRA, including configuration and data memory, by 22.0% on average.
    ope
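As a rough illustration of the loop fission described in this abstract, the sketch below shows one possible two-loop arrangement: an evaluator loop that tests the condition and records which iterations need the infrequent branch, and an executor loop that runs that branch only for the recorded iterations. The function names, the condition (`x < 0`), and both branch bodies are hypothetical and are not taken from the paper. It is plain Python rather than CGRA code, and it does not model the hardware extension.

```python
def original_loop(xs):
    """Single loop: every iteration carries both branches of the conditional."""
    out = []
    for x in xs:
        if x < 0:                  # infrequent branch (assumed to be rare)
            out.append(-x * 3 + 1)
        else:                      # frequent branch
            out.append(x + 1)
    return out


def evaluator_executor(xs):
    """Loop fission: the condition is evaluated only in the first loop."""
    out = [0] * len(xs)
    rare = []                      # indices that need the infrequent branch
    # Evaluator loop: test the condition and handle the frequent branch.
    for i, x in enumerate(xs):
        if x < 0:
            rare.append(i)         # defer the infrequent work
        else:
            out[i] = x + 1
    # Executor loop: run the infrequent branch for the recorded indices only.
    for i in rare:
        out[i] = -xs[i] * 3 + 1
    return out


data = [4, -2, 7, 0, 9, -5, 3]
same = original_loop(data) == evaluator_executor(data)
```

The `rare` list stands in for the extra data transfer between the split loops that the abstract says the hardware extension is meant to reduce.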

    Efficient Predicated Execution Techniques on Reconfigurable Architectures

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2013. Kiyoung Choi.
    Reconfigurable architectures are one suitable way to accelerate compute-intensive programs in embedded systems. They typically consist of many processing units and a single controller, which lets them achieve high performance, flexibility, and low power at the same time. Parallel processing across the many units speeds up applications, and reconfigurability allows the hardware to be used for a variety of applications. In addition, fixing the schedule of instructions and data in advance simplifies the control structure, which minimizes power consumption relative to the amount of computation. However, as applications have grown more complex, branches have appeared in their compute-intensive parts, which poses a serious threat to the use of reconfigurable architectures. Because only one controller can handle branches, a program cannot be accelerated if the controller becomes a bottleneck or if different control flows are required at the same time. Predicated execution can partially resolve this, but existing predication techniques negatively affect the performance and power consumption of reconfigurable architectures. This thesis therefore shows how predicated execution affects performance and power in applications that are compute-intensive but contain branches, and on that basis proposes predication methods with high performance and low power. According to the experimental results, the proposed method improves the product of performance and power consumption by 11.9%, 14.7%, and 23.8% over three existing methods. A compilation scheme suited to the proposed predication is also proposed. The proposed predication saves power by using a power-saving mode, but with existing compilation methods multiple conditionals cannot be compiled to execute in parallel. This thesis identifies that problem and proposes solving it by assigning conditionals to different processing units. With the proposed approach, performance was on average 2.21 times higher than with a simple, intuitive method.
    Coarse-Grained Reconfigurable Architecture (CGRA) is one of the viable solutions for accelerating data-intensive applications in embedded systems. It typically consists of an array of processing elements (PEs) and a centralized controller, and can provide high performance, flexibility, and low power. Parallel array processing reduces the execution time of applications, reconfigurability of PEs allows their functionality to change, and a simplified control structure with static scheduling for instruction fetching and data communication minimizes power consumption. However, as applications become more complex and their data-intensive parts come to contain control flow, the effectiveness of CGRAs is challenged. Since all PEs are controlled by a centralized unit, it is impossible to execute programs that have control divergence among PEs. To overcome this problem, we can adopt predicated execution, the only solution known so far, but conventional predication techniques have a negative impact on both performance and power consumption due to longer instruction words and unnecessary instruction-fetching/decoding/nullifying steps. This thesis therefore reveals performance and power issues in predicated execution when a CGRA executes applications that are both data- and control-intensive, issues that have not been well addressed so far. It then proposes high-performance and low-power predication mechanisms. Experiments conducted through gate-level simulation show that the proposed mechanism improves energy-delay product by 11.9%, 14.7%, and 23.8% compared to three conventional techniques. In addition, this thesis reveals mapping issues that arise when applications are mapped onto CGRAs using the proposed predication. A power-saving mode introduced into PEs prevents multiple conditionals from being parallelized if conventional mapping algorithms are used. This thesis therefore proposes a framework that resolves this problem by mapping conditionals to different PEs. Experiments show that mapping results from the proposed approach lead to 2.21 times higher performance than those of the naïve approach.
    Abstract
    Chapter 1 Introduction
    Chapter 2 Background and Related Work
        2.1 Coarse-Grained Reconfigurable Architecture: 2.1.1 Introduction; 2.1.2 Target Domain; 2.1.3 Comparison with Other Architectures; 2.1.4 Application Mapping; 2.1.5 Target CGRA
        2.2 Predicated Execution Technique: 2.2.1 Introduction; 2.2.2 Classification; 2.2.3 Different Roles in ILP and DLP processors; 2.2.4 Predication Support on CGRAs
    Chapter 3 Conventional Predicated Execution Techniques
        3.1 Partial Predication (Partial)
        3.2 Condition-Based Full Predication (CondFull)
    Chapter 4 State-Based Full Predication
        4.1 Previous Approach (PseudoBranch)
        4.2 Counter-Based Approach (StateFull)
        4.3 Dual-Issue-Single-Execution (DISE)
        4.4 Hybrid Predication: 4.4.1 Motivation; 4.4.2 StateFull+Partial; 4.4.3 StateFull+Partial+DISE
    Chapter 5 Evaluation
        5.1 Implementation: 5.1.1 Conventional Techniques; 5.1.2 Proposed Techniques
        5.2 Experimental Setup
        5.3 Experimental Results: 5.3.1 Effect of Predication Mechanism on Power Consumption of a PE; 5.3.2 Quantitative Definitions of short-if and long-if; 5.3.3 Compilation Strategy in StateFull+Partial; 5.3.4 Conventional Techniques (Partial, CondFull, and PseudoBranch) vs. Proposed StateFull Technique; 5.3.5 Proposed Hybrid Predication Techniques; 5.3.6 Putting Together; 5.3.7 Speedup of Applications
    Chapter 6 Mapping Framework
        6.1 Motivation
        6.2 Proposed Approach: 6.2.1 Overall Flow; 6.2.2 From IR to CDFG; 6.2.3 Separation; 6.2.4 CDFG Mapping
        6.3 Implementation
        6.4 Experiments: 6.4.1 Experimental Setup; 6.4.2 Verification of Mapping Framework; 6.4.3 Quality of Mapping Results
    Chapter 7 Conclusion
        7.1 Summary
        7.2 Applicable Scope and Future Work
    Appendix
    Abstract in Korean
    Acknowledgments
    Docto
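To make the idea of predicated execution in this entry more concrete, the sketch below shows the general textbook form of partial predication: both branches of a conditional are computed unconditionally, and a select operation driven by the predicate picks the result, so control dependence becomes data dependence. This illustrates the general concept only, not the thesis's specific mechanisms (CondFull, StateFull, DISE). The function names and the example computation are hypothetical.

```python
def branchy_abs_diff(a, b):
    """Reference version with a real control-flow branch."""
    if a > b:
        return a - b
    return b - a


def select(pred, if_true, if_false):
    """Model of a hardware select (multiplexer); pred must be 0 or 1."""
    return pred * if_true + (1 - pred) * if_false


def predicated_abs_diff(a, b):
    """If-converted version: no branch, both sides are always computed."""
    pred = int(a > b)          # predicate computed as data
    taken = a - b              # "then" side, always executed
    not_taken = b - a          # "else" side, always executed
    return select(pred, taken, not_taken)


pairs = [(5, 3), (3, 5), (4, 4), (-2, 7)]
agree = all(branchy_abs_diff(a, b) == predicated_abs_diff(a, b) for a, b in pairs)
```

Because both sides always execute, the work of the unused side is wasted, which is the kind of overhead the abstract attributes to conventional predication.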

    Libra: Achieving Efficient Instruction- and Data-Parallel Execution for Mobile Applications.

    Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing richer user experiences and compelling capabilities: higher definition multimedia, 3D graphics, augmented reality, and voice interfaces. To meet these goals, the core computing capabilities of the smart phone must be scaled. But energy budgets are increasing at a much lower rate, so fundamental improvements in computing efficiency must be achieved. To meet this challenge, computer architects employ hardware accelerators in the form of SIMD and VLIW. Single-instruction multiple-data (SIMD) accelerators provide high degrees of scalability for applications rich in data-level parallelism (DLP). Very long instruction word (VLIW) accelerators provide moderate scalability for applications with high degrees of instruction-level parallelism (ILP). Unfortunately, applications are not so nicely partitioned into two groups: many applications have some DLP, but also contain significant fractions of code with low trip count loops, complex control/data dependences, or non-uniform execution behavior for which no DLP exists. Therefore, a more adaptive accelerator is required that can deploy resources as needed: exploit DLP on SIMD when it's available, but fall back to ILP on the same hardware when necessary. In this thesis, we first focus on various compiler solutions that address inefficiencies in both VLIW and SIMD accelerators. For SIMD accelerators, a new vectorization pass, called SIMD Defragmenter, is introduced to uncover hidden DLP using subgraph identification. CGRA express accelerates sequential code regions using a bypass network in VLIW accelerators, and Resource Recycling leverages a stream-graph modulo scheduling technique to schedule multiple code regions in multi-core accelerators. Second, we propose a new scalable multicore accelerator for mobile systems, referred to as Libra, which can support execution of code regions having both DLP and ILP, as well as hybrid combinations of the two. We believe that as industry requires higher performance, the proposed flexible accelerator and compiler support will put more resources to work in order to meet the performance and power efficiency requirements.
    PHD; Electrical Engineering; University of Michigan, Horace H. Rackham School of Graduate Studies; http://deepblue.lib.umich.edu/bitstream/2027.42/99840/1/yjunpark_1.pd

    Optimization of the Memory Subsystem of a Coarse Grained Reconfigurable Hardware Accelerator

    Fast and energy-efficient processing of data has always been a key requirement in processor design. The latest developments in technology emphasize these requirements even further. The widespread usage of mobile devices increases the demand for energy-efficient solutions. Many new applications, such as advanced driver assistance systems, focus more and more on machine learning algorithms and have to process large data sets in hard real time. Up to the 1990s, the increase in processor performance was mainly achieved by new and better manufacturing technologies for processors. That way, processors could operate at higher clock frequencies while the processor microarchitecture stayed mainly the same. At the beginning of the 21st century this development stopped. New manufacturing technologies made it possible to integrate more processor cores onto one chip, but almost no further improvements were achieved in terms of clock frequencies. This required new approaches in both processor microarchitecture and software design. Instead of improving the performance of a single processor, the problem at hand has to be divided into several subtasks that can be executed in parallel on different processing elements, which speeds up the application. One common approach is to use multi-core processors or GPUs (Graphics Processing Units) in which each processing element calculates one subtask of the problem. This approach requires new programming techniques, and legacy software has to be reformulated. Another approach is the usage of hardware accelerators which are coupled to a general purpose processor. For each problem a dedicated circuit is designed which can solve the problem fast and efficiently. The actual computation is then executed on the accelerator and not on the general purpose processor. The disadvantage of this approach is that a new circuit has to be designed for each problem. This results in an increased design effort, and typically the circuit cannot be adapted once it is deployed. This work covers reconfigurable hardware accelerators. They can be reconfigured during runtime so that the same hardware is used to accelerate different problems. During runtime, time-consuming code fragments can be identified and the processor itself starts a process that creates a configuration for the hardware accelerator. This configuration can then be loaded, and the code is executed on the accelerator faster and more efficiently. A coarse grained reconfigurable architecture was chosen because creating a configuration for it is much less complex than creating a configuration for a fine grained reconfigurable architecture like an FPGA (Field Programmable Gate Array). Additionally, the smaller overhead for the reconfigurability results in higher clock frequencies. One advantage of this approach is that programmers don't need any knowledge about the underlying hardware, because the acceleration is done automatically during runtime. It is also possible to accelerate legacy code without user interaction (even when no source code is available anymore). One challenge that is relevant for all approaches is the efficient and fast data exchange between processing elements and main memory. Therefore, this work concentrates on the optimization of the memory interface between the coarse grained reconfigurable hardware accelerator and the main memory. To achieve this, a simulator for a Java processor coupled with a coarse grained reconfigurable hardware accelerator was developed during this work. Several strategies were developed to improve the performance of the memory interface. The solutions range from different hardware designs to software solutions that try to optimize the usage of the memory interface during the creation of the configuration of the accelerator. The simulator was used to search the design space for the best implementation. With this optimization of the memory interface a performance improvement of 22.6% was achieved. Apart from that, a first prototype of this kind of accelerator was designed and implemented on an FPGA to show the correct functionality of the whole approach and the simulator.

    State of the art baseband DSP platforms for Software Defined Radio: A survey

    Software Defined Radio (SDR) is an innovative approach that is becoming an increasingly promising technology for future mobile handsets. Several proposals in the field of embedded systems have been introduced by different universities and industries to support SDR applications. This article presents an overview of current platforms and analyzes the related architectural choices, the current issues in SDR, as well as potential future trends.
    Peer reviewe

    Integrated Programmable-Array accelerator to design heterogeneous ultra-low power manycore architectures

    There is an ever-increasing demand for energy efficiency (EE) in rapidly evolving Internet-of-Things end nodes. This pushes researchers and engineers to develop solutions that provide both Application-Specific Integrated Circuit-like EE and Field-Programmable Gate Array-like flexibility. One such solution is the Coarse Grain Reconfigurable Array (CGRA). Over the past decades, CGRAs have evolved and are competing to become mainstream hardware accelerators, especially for accelerating Digital Signal Processing (DSP) applications. Due to the over-specialization of computing architectures, the focus is shifting towards fitting an extensive data representation range into fewer bits, e.g., a 32-bit space can represent a more extensive data range with a floating-point (FP) representation than with an integer representation. Computation using FP representation requires numerous encodings and leads to complex circuits for the FP operators, decreasing the EE of the entire system. This thesis presents the design of an EE ultra-low-power CGRA with native support for FP computation by leveraging an emerging paradigm of approximate computing called transprecision computing. We also present the contributions in the compilation toolchain and system-level integration of the CGRA in a System-on-Chip, to envision the proposed CGRA as an EE hardware accelerator. Finally, an extensive set of experiments using real-world algorithms employed in near-sensor processing applications is performed, and results are compared with state-of-the-art (SoA) architectures. It is empirically shown that the proposed CGRA provides better results than SoA architectures in terms of power, performance, and area.
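The abstract's remark that 32 bits cover a wider range in floating point than as an integer can be checked with a small worked example. The sketch below assumes the standard IEEE 754 binary32 format and a signed two's-complement 32-bit integer, neither of which the abstract names explicitly. The helper `to_float32` is a hypothetical name introduced for illustration.

```python
import struct

# Largest signed 32-bit integer (two's complement).
INT32_MAX = 2**31 - 1

# Largest finite binary32 value, decoded from its bit pattern 0x7F7FFFFF.
FLOAT32_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]


def to_float32(value):
    """Round a Python float to the nearest binary32 value and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# The wider range costs precision: binary32 has a 24-bit significand,
# so 2**24 + 1 is not representable and rounds to 2**24.
rounded = to_float32(float(2**24 + 1))

wider_range = FLOAT32_MAX > INT32_MAX
lost_precision = rounded == float(2**24)
```

Here `FLOAT32_MAX` is about 3.4e38 against roughly 2.1e9 for `INT32_MAX`, while integers above 2**24 are no longer all exactly representable in binary32.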

    Polymorphic Pipeline Array: A Flexible Multicore Accelerator for Mobile Multimedia Applications.

    Mobile computing in the form of smart phones, netbooks, and PDAs has become an integral part of our everyday lives. Moving ahead to the next generation of mobile devices, we believe that multimedia will become a more critical and product-differentiating feature. High definition audio and video as well as 3D graphics provide richer interfaces and compelling capabilities. However, these algorithms also bring different computational challenges than wireless signal processing. Multimedia algorithms are more complex, featuring more control flow and variable computational requirements, and their execution time is not dominated by innermost vector loops. Further, data access is more complex, as media applications typically operate on multi-dimensional vectors of data rather than single-dimensional vectors with simple strides. Thus, the design of current mobile platforms requires re-examination to account for these new application domains. In this dissertation, we focus on the design of a programmable, low-power accelerator for multimedia algorithms referred to as a Polymorphic Pipeline Array (PPA). The PPA design is inspired by coarse-grain reconfigurable architectures (CGRAs) that consist of an array of function units interconnected by a mesh style interconnect. The PPA improves upon CGRAs by attacking two major limitations: scalability and acceleration limited to innermost loops. The large number of resources is fully utilized by exploiting both fine-grain instruction-level and coarse-grain pipeline parallelism, and the acceleration is extended beyond innermost loops to encompass the whole region of applications. Various compiler and architectural optimizations are presented for CGRAs that form the basic building blocks of the PPA. Two compiler techniques are presented that systematically construct the schedule with intelligent heuristics. Modulo graph embedding leverages a graph embedding technique for scheduling in CGRAs, and edge-centric modulo scheduling provides a communication-oriented way to address the scheduling problem. For architectural improvement, a novel control path design is presented that leverages the token network of dataflow machines to reduce the instruction-memory power. The PPA is designed with flexibility and programmability as first-order requirements to enable the hardware to be dynamically customizable to the application. The PPA exploits pipeline parallelism found in streaming applications to create a coarse-grain hardware pipeline to execute streaming media applications.
    Ph.D.; Computer Science & Engineering; University of Michigan, Horace H. Rackham School of Graduate Studies; http://deepblue.lib.umich.edu/bitstream/2027.42/64732/1/parkhc_1.pd
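The modulo scheduling mentioned in this abstract conventionally starts from a lower bound on the initiation interval (II), the number of cycles between the starts of consecutive loop iterations. The sketch below computes that standard textbook bound. It is not the dissertation's modulo graph embedding or edge-centric algorithm. It assumes every function unit can execute every operation, and all function names and example numbers are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division without floating point."""
    return -(-a // b)


def res_mii(num_ops, num_units):
    """Resource bound: operations per iteration spread over the units."""
    return ceil_div(num_ops, num_units)


def rec_mii(recurrences):
    """Recurrence bound over dependence cycles.

    Each recurrence is (total latency around the cycle, iteration distance).
    """
    return max((ceil_div(lat, dist) for lat, dist in recurrences), default=1)


def min_initiation_interval(num_ops, num_units, recurrences):
    """Lower bound on II: the larger of the resource and recurrence bounds."""
    return max(res_mii(num_ops, num_units), rec_mii(recurrences))


# Hypothetical loop body: 20 operations on a 4x4 array (16 units), with one
# loop-carried dependence cycle of latency 3 spanning 1 iteration.
example_mii = min_initiation_interval(20, 16, [(3, 1)])
```

A scheduler typically tries to find a mapping at this bound and increases II when placement or routing fails. For the example above the recurrence bound of 3 dominates the resource bound of 2.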