64 research outputs found

    Evaluator-Executor Transformation for Efficient Conditional Statements on CGRA

    Get PDF
    Computer EngineeringControl divergence poses many problems in parallelizing loops. While predicated execution is commonly used to convert control dependence into data dependence, it often incurs high overhead because it allocates resources equally for both branches of a conditional statement regardless of their execution frequencies. For those loops with unbalanced conditionals, we propose a software transformation that divides a loop into two or three smaller loops so that the condition is evaluated only in the first loop while the less frequent branch is executed in the second loop in a way that is much more efficient than in the original loop. To reduce the overhead of extra data transfer caused by the loop fission, we also present a hardware extension for a class of coarse-grained reconfigurable architectures (CGRAs). Our experiments using MiBench and computer vision benchmarks on a CGRA demonstrate that our techniques can improve the performance of loops over predicated execution by up to 65%, or 38.0% on average when the hardware extension is enabled. Without any hardware modification, our software-only version can improve performance by up to 64%, or 33.2% on average, while simultaneously reducing the energy consumption of the entire CGRA including configuration and data memory by 22.0% on average.ope

    Polymorphic Pipeline Array: A Flexible Multicore Accelerator for Mobile Multimedia Applications.

    Full text link
    Mobile computing in the form of smart phones, netbooks, and PDAs has become an integral part of our everyday lives. Moving ahead to the next generation of mobile devices, we believe that multimedia will become a more critical and product-differentiating feature. High definition audio and video as well as 3D graphics provide richer interfaces and compelling capabilities. However, these algorithms also bring different computational challenges than wireless signal processing. Multimedia algorithms are more complex featuring more control flow and variable computational requirements where execution time is not dominated by innermost vector loops. Further, data access is more complex where media applications typically operate on multi-dimensional vectors of data rather than single-dimensional vectors with simple strides. Thus, the design of current mobile platforms requires re-examination to account for these new application domains. In this dissertation, we focus on the design of a programmable, low-power accelerator for multimedia algorithms referred to as a Polymorphic Pipeline Array (PPA). The PPA design is inspired by coarse-grain reconfigurable architectures (CGRAs) that consist of an array of function units interconnected by a mesh style interconnect. The PPA improves upon CGRAs by attacking two major limitations: scalability and acceleration limited to innermost loops. The large number of resources are fully utilized by exploiting both Lne-grain instruction-level and coarse-grain pipeline parallelism, and the acceleration is extended beyond innermost loops to encompass the whole region of applications. Various compiler and architectural optimizations are presented for CGRAs that form the basic building blocks of PPA. Two compiler techniques are presented that systematically construct the schedule with intelligent heuristics. Modulo graph embedding leverages graph embedding technique for scheduling in CGRAs and edgecentric modulo scheduling provides a communication-oriented way to address the scheduling problem. For architectural improvement, a novel control path design is presented that leverages the token network of dataflow machines to reduce the instructionmemory power. The PPA is designed with flexibility and programmability as first-order requirements to enable the hardware to be dynamically customizable to the application. A PPA exploit pipeline parallelism found in streaming applications to create a coarsegrain hardware pipeline to execute streaming media applications.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/64732/1/parkhc_1.pd

    Automatic Design of Efficient Application-centric Architectures.

    Full text link
    As the market for embedded devices continues to grow, the demand for high performance, low cost, and low power computation grows as well. Many embedded applications perform computationally intensive tasks such as processing streaming video or audio, wireless communication, or speech recognition and must be implemented within tight power budgets. Typically, general purpose processors are not able to meet these performance and power requirements. Custom hardware in the form of loop accelerators are often used to execute the compute-intensive portions of these applications because they can achieve significantly higher levels of performance and power efficiency. Automated hardware synthesis from high level specifications is a key technology used in designing these accelerators, because the resulting hardware is correct by construction, easing verification and greatly decreasing time-to-market in the quickly evolving embedded domain. In this dissertation, a compiler-directed approach is used to design a loop accelerator from a C specification and a throughput requirement. The compiler analyzes the loop and generates a virtual architecture containing sufficient resources to sustain the required throughput. Next, a software pipelining scheduler maps the operations in the loop to the virtual architecture. Finally, the accelerator datapath is derived from the resulting schedule. In this dissertation, synthesis of different types of loop accelerators is investigated. First, the system for synthesizing single loop accelerators is detailed. In particular, a scheduler is presented that is aware of the effects of its decisions on the resulting hardware, and attempts to minimize hardware cost. Second, synthesis of multifunction loop accelerators, or accelerators capable of executing multiple loops, is presented. Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce overall cost. Finally, synthesis of post-programmable accelerators is presented, allowing changes to be made to the software after an accelerator has been created. The tradeoffs between the flexibility, cost, and energy efficiency of these different types of accelerators are investigated. Automatically synthesized loop accelerators are capable of achieving order-of-magnitude gains in performance, area efficiency, and power efficiency over processors, and programmable accelerators allow software changes while maintaining highly efficient levels of computation.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pd

    Libra: Achieving Efficient Instruction- and Data- Parallel Execution for Mobile Applications.

    Full text link
    Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing richer user experiences and compelling capabilities: higher definition multimedia, 3D graphics, augmented reality, and voice interfaces. To meet these goals, the core computing capabilities of the smart phone must be scaled. But, the energy budgets are increasing at a much lower rate, thus fundamental improvements in computing efficiency must be garnered. To meet this challenge, computer architects employ hardware accelerators in the form of SIMD and VLIW. Single-instruction multiple-data (SIMD) accelerators provide high degrees of scalability for applications rich in data-level parallelism (DLP). Very long instruction word (VLIW) accelerators provide moderate scalability for applications with high degrees of instruction-level parallelism (ILP). Unfortunately, applications are not so nicely partitioned into two groups: many applications have some DLP, but also contain significant fractions of code with low trip count loops, complex control/data dependences, or non-uniform execution behavior for which no DLP exists. Therefore, a more adaptive accelerator is required to be able to deploy resources as needed: exploit DLP on SIMD when it’s available, but fall back to ILP on the same hardware when necessary. In this thesis, we first focus on various compiler solutions that solve inefficiency problem in both VLIW and SIMD accelerators. For SIMD accelerators, a new vectorization pass, called SIMD Defragmenter, is introduced to uncover hidden DLP using subgraph identification in SIMD accelerators. CGRA express effectively accelerates sequential code regions using a bypass network in VLIW accelerators, and Resource Recycling leverages stream-graph modulo scheduling technique for scheduling of multiple code regions in multi-core accelerators. Second, we propose the new scalable multicore accelerator referred to as Libra for mobile systems, which can support execution of code regions having both DLP and ILP, as well as hybrid combinations of the two. We believe that as industry requires higher performance, the proposed flexible accelerator and compiler support will put more resources to work in order to meet the performance and power efficiency requirements.PHDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/99840/1/yjunpark_1.pd

    재구성형 구조에서의 효율적인 조건실행 기법

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2013. 8. 최기영.재구성형 구조는 연산량이 많은 프로그램을 내장형 시스템에서 가속시키는 데 적합한 방법 중 하나이다. 이는 일반적으로 많은 연산유닛들과 하나의 컨트롤러로 구성되어 고성능, 유연성, 저전력을 동시에 달성할 수 있도록 해준다. 많은 연산유닛을 바탕으로 한 병렬처리는 응용프로그램의 실행속도를 빠르게 하며, 재구성 기능은 다양한 응용프로그램에의 활용을 가능하게 해준다. 또한, 명령어와 데이터에 대한 스케쥴을 미리 정해놓음으로써 제어구조를 단순화시킬 수 있으며 이는 연산량 대비 전력소모를 최소한으 로 줄여준다. 하지만 응용프로그램이 복잡해짐에 따라 연산량이 많은 부분들에 분기문이 생기게 되었으며 이는 재구성형 구조를 사용함에 있어 큰 위협이 되고 있다. 분기문을 다룰 수 있는 컨트롤러가 하나이기 때문에 컨트롤러에 병목현상이 발생하거나 동시에 서로 다른 제어를 요구하게 되면 해당 프로그램은 가속이 불가능해진다. 조건실행이라는 기술을 사용할 경우 이를 부분적으로 해소할 수 있지만 기존에 개발되어 있는 조건실행 기술들은 재구성형 구조에 성능 및 전력소모 면에서 부정적인 영향을 끼친다. 따라서 본 논문에서는 연산량이 많지만 분기문을 가진 응용프로그램에서 조건실행이 성능과 전력 면에서 어떠한 영향을 미치는지 밝히며 이를 바탕으로 고성능과 저전력을 가진 조건실행 방법을 제안한다. 실험 결과에 따르면 제안한 방식은 기존의 세가지 방식보다 성능과 전력소모를 곱으로 표현한 수치에 있어서 11.9%, 14.7%, 23.8% 만큼의 이득을 보였다. 또한, 제안한 조건실행 방법에 적합한 컴파일 체계도 제안하였다. 제안한 조건실행은 절전모드를 사용함에 따라 전력을 아낄 수 있지만 기존의 컴파일방식으로는 여러 조건문을 병렬적으로 수행하도록 컴파일할 수 없는 문제가 생긴다. 따라서 본 논문에서는 이런 문제를 밝히고 조건문들을 서로 다른 연산유닛에 할당함으로써 문제를 해결하는 방식을 제안하고 있다. 제안한 방식을 사용할 경우 단순하고 직관적인 방법에 비하여 평균적으로 2.21배의 높은 성능을 얻을 수 있었다.Coarse-Grained Reconfigurable Architecture (CGRA) is one of viable solutions in embedded systems to accelerate data-intensive applications. It typically consists of an array of processing elements (PEs) and a centralized controller, which can provide high performance, flexibility, and low power. Parallel array processing reduces execution time of applications, reconfigurability of PEs allows changing its functionality, and simplified control structure with static scheduling for instruction fetching and data communication minimizes power consumption. However, as applications become complex so that data-intensive parts are having control flows in them, CGRAs face a challenge for its effectiveness. Since the entire PEs are controlled by a centralized unit, it is impossible to execute programs having control divergence among PEs. To overcome the problem, we can adopt the technique called predicated execution, which is the unique solution known so far, but conventional predication techniques have a negative impact on both performance and power consumption due to longer instruction words and unnecessary instruction-fetching/decoding/nullifying steps. Thus, this thesis reveals performance and power issues in predicated execution when a CGRA executes both data- and control-intensive applications, which have not been well-addressed yet. Then it proposes high-performance and low-power predication mechanisms. Experiments conducted through gate-level simulation show that the proposed mechanism improves energy-delay product by 11.9%, 14.7%, and 23.8% compared to three conventional techniques. In addition, this thesis also reveals mapping issues when mapping applications on CGRAs using the proposed predication. A power-saving mode introduced into PEs prohibits multiple conditionals from being parallelized if conventional mapping algorithms are used. Thus, this thesis proposes the framework to release this problem by mapping conditionals to different PEs. Experiments show that mapping results from the proposed approach lead to 2.21 times higher performance than those of the naïve approach.Abstract i Chapter 1 Introduction 1 Chapter 2 Background and Related Work 5 2.1 Coarse-Grained Reconfigurable Architecture . . . . . . . . . . . . 5 2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Target Domain . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Comparison with Other Architectures . . . . . . . . . . . 6 2.1.4 Application Mapping . . . . . . . . . . . . . . . . . . . . . 8 2.1.5 Target CGRA . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Predicated Execution Technique . . . . . . . . . . . . . . . . . . 11 2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.3 Different Roles in ILP and DLP processors . . . . . . . . 13 2.2.4 Predication Support on CGRAs . . . . . . . . . . . . . . . 14 Chapter 3 Conventional Predicated Execution Techniques 15 3.1 Partial Predication (Partial) . . . . . . . . . . . . . . . . . . . . 16 3.2 Condition-Based Full Predication (CondFull) . . . . . . . . . . 18 Chapter 4 State-Based Full Predication 23 4.1 Previous Approach (PseudoBranch) . . . . . . . . . . . . . . . 24 4.2 Counter-Based Approach (StateFull) . . . . . . . . . . . . . . 25 4.3 Dual-Issue-Single-Execution (DISE) . . . . . . . . . . . . . . . . 28 4.4 Hybrid Predication . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4.2 StateFull+Partial . . . . . . . . . . . . . . . . . . . . 34 4.4.3 StateFull+Partial+DISE . . . . . . . . . . . . . . . . 35 Chapter 5 Evaluation 39 5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.1.1 Conventional Techniques . . . . . . . . . . . . . . . . . . . 39 5.1.2 Proposed Techniques . . . . . . . . . . . . . . . . . . . . . 40 5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.3.1 Effect of Predication Mechanism on Power Consumption of a PE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.3.2 Quantitative Definitions of short-if and long-if . . . . . . 48 5.3.3 Compilation Strategy in StateFull+Partial . . . . . . 48 5.3.4 Conventional Techniques (Partial, CondFull, and PseudoBranch) vs. Proposed StateFull Technique . . . . . 49 5.3.5 Proposed Hybrid Predication Techniques . . . . . . . . . 53 5.3.6 Putting Together . . . . . . . . . . . . . . . . . . . . . . . 54 5.3.7 Speedup of Applications . . . . . . . . . . . . . . . . . . . 57 Chapter 6 Mapping Framework 61 6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2.1 Overall Flow . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2.2 From IR to CDFG . . . . . . . . . . . . . . . . . . . . . . 64 6.2.3 Separation . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.2.4 CDFG Mapping . . . . . . . . . . . . . . . . . . . . . . . 68 6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 69 6.4.2 Verification of Mapping Framework . . . . . . . . . . . . . 70 6.4.3 Quality of Mapping Results . . . . . . . . . . . . . . . . . 70 Chapter 7 Conclusion 73 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 7.2 Applicable Scope and Future Work . . . . . . . . . . . . . . . . . 75 Appendix 77 국문초록 93 감사의 글 95Docto

    An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

    Get PDF
    Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture, up to eight cores, when the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration

    Optimizing energy-efficiency for multi-core packet processing systems in a compiler framework

    Get PDF
    Network applications become increasingly computation-intensive and the amount of traffic soars unprecedentedly nowadays. Multi-core and multi-threaded techniques are thus widely employed in packet processing system to meet the changing requirement. However, the processing power cannot be fully utilized without a suitable programming environment. The compilation procedure is decisive for the quality of the code. It can largely determine the overall system performance in terms of packet throughput, individual packet latency, core utilization and energy efficiency. The thesis investigated compilation issues in networking domain first, particularly on energy consumption. And as a cornerstone for any compiler optimizations, a code analysis module for collecting program dependency is presented and incorporated into a compiler framework. With that dependency information, a strategy based on graph bi-partitioning and mapping is proposed to search for an optimal configuration in a parallel-pipeline fashion. The energy-aware extension is specifically effective in enhancing the energy-efficiency of the whole system. Finally, a generic evaluation framework for simulating the performance and energy consumption of a packet processing system is given. It accepts flexible architectural configuration and is capable of performingarbitrary code mapping. The simulation time is extremely short compared to full-fledged simulators. A set of our optimization results is gathered using the framework

    An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor

    Get PDF
    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations with 10 implemented on an SoC-FPGA and the remaining on a cycle-accurate simulator. Across 12 OpenCL benchmarks results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and on one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature

    High-level synthesis of fine-grained weakly consistent C concurrency

    Get PDF
    High-level synthesis (HLS) is the process of automatically compiling high-level programs into a netlist (collection of gates). Given an input program, HLS tools exploit its inherent parallelism and pipelining opportunities to generate efficient customised hardware. C-based programs are the most popular input for HLS tools, but these tools historically only synthesise sequential C programs. As the appeal for software concurrency rises, HLS tools are beginning to synthesise concurrent C programs, such as C/C++ pthreads and OpenCL. Although supporting software concurrency leads to better hardware parallelism, shared memory synchronisation is typically serialised to ensure correct memory behaviour, via locks. Locks are safety resources that ensure exclusive access of shared memory, eliminating data races and providing synchronisation guarantees for programmers.  As an alternative to lock-based synchronisation, the C memory model also defines the possibility of lock-free synchronisation via fine-grained atomic operations (`atomics'). However, most HLS tools either do not support atomics at all or implement atomics using locks. Instead, we treat the synthesis of atomics as a scheduling problem. We show that we can augment the intra-thread memory constraints during memory scheduling of concurrent programs to support atomics. On average, hardware generated by our method is 7.5x faster than the state-of-the-art, for our set of experiments. Our method of synthesising atomics enables several unique possibilities. Chiefly, we are capable of supporting weakly consistent (`weak') atomics, which necessitate fewer ordering constraints compared to sequentially consistent (SC) atomics. However, implementing weak atomics is complex and error-prone and hence we formally verify our methods via automated model checking to ensure our generated hardware is correct. Furthermore, since the C memory model defines memory behaviour globally, we can globally analyse the entire program to generate its memory constraints. Additionally, we can also support loop pipelining by extending our methods to generate inter-iteration memory constraints. On average, weak atomics, global analysis and loop pipelining improve performance by 1.6x, 3.4x and 1.4x respectively, for our set of experiments. Finally, we present a case study of a real-world example via an HLS-based Google PageRank algorithm, whose performance improves by 4.4x via lock-free streaming and work-stealing.Open Acces

    Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC\u2710 - May 17-19, 2010 Karlsruhe, Germany. (KIT Scientific Reports ; 7551)

    Get PDF
    ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state of the art research around SoC related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow\u27s challenges in the multibillion transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability
    corecore