637 research outputs found

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Get PDF
    The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines

    Full text link
    Parallel architectures are essential in order to take advantage of the parallelism inherent in streaming applications. One particular branch of these employ hardware SIMD pipelines. In this paper, we analyse several scheduling techniques, namely ad hoc overlapped execution, modulo scheduling and modulo scheduling with unrolling, all of which aim to efficiently utilize the special architecture design. Our investigation focuses on improving throughput while analysing other metrics that are important for streaming applications, such as register pressure, buffer sizes and code size. Through experiments conducted on several media benchmarks, we present and discuss trade-offs involved when selecting any one of these scheduling techniques.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241

    Computation Enhancement using Reconfigurable Computing

    Get PDF
    In light of the industry’s constant need for better computer performance, this project aims to choose and evaluate an approach for facing this issue. The targeted category of computers is single board computers (e.g. Raspberry Pi). The approach utilized for enhancing performance is the use of reconfigurable computing as to execute computationally expensive calculations on a runtime custom-tailored hardware. The objective of this project is the test of the potential this approach has for increasing computers performance through comparing a software implementation of an algorithm with an FPGA assisted implementation of the same algorithm. The platforms chosen for this project are the Rapsberry Pi and the Parallella P1602 board with its Zynq SoC for the software implementation and the FPGA assisted implementation in that order. The chosen algorithm is Fourier Fast Transform due to its part in many DSP applications and its suitability for the project objective. While the software solution worked successfully resulting in an asymptotic cost of O(N log N); the reconfigurable computing solution couldn’t be completed due to time constraints and lack of experience of the student. Future work should complete the experiment and add a multicore implementation of the same algorithm to add yet another class to the comparison

    Automatic synthesis of reconfigurable instruction set accelerators

    Get PDF

    Compiler and Architecture Design for Coarse-Grained Programmable Accelerators

    Get PDF
    abstract: The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density increases significantly by technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, increasing energy-efficiency must be addressed at all design levels from circuit level to application and algorithm levels. At architectural level, one promising approach is to populate the system with hardware accelerators each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy-efficiency and programmability. Due to intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism. Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consists of a number of word-level functional units. Motivated by promising characteristics of software programmable accelerators, the potentials of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and CGRA compiler. First, the design and implementation of a CGRA and its instruction set is presented. This design is then modeled in a cycle accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize CGRA scarce resources very well to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRADissertation/ThesisDoctoral Dissertation Computer Science 201

    Re-targetable tools and methodologies for the efficient deployment of high-level source code on coarse-grained dynamically reconfigurable architectures

    Get PDF
    Reconfigurable computing traditionally consists of a data path machine (such as an FPGA) acting as a co-processor to a conventional microprocessor. This involves partitioning the application such that the data path intensive parts are implemented on the reconfigurable fabric, and the control flow intensive parts are implemented on the microprocessor. Often the two parts have to be written in different languages. New highly parallel data path architectures allow parallelism approaching that of FPGAs, but are able to be reconfigured very rapidly. As a result, it is possible to use these architectures to perform control flow in a manner similar to a microprocessor, and thus a complete program can be described from an unmodified high-level language (in particular C). This overcomes the historical instruction-level parallelism (ILP) wall.To make full use of the available parallelism , existing microprocessor tool flows are insufficient. Data path machines are typically programmed via HDL tools from the ASIC design world. This expresses algorithm s at a low er level than the application algorithm s are typically developed in. The work in this thesis builds upon earlier work to allow applications to be described from high-level languages, by employing low-level optimisations in the compiler back-end and working from the assembly, to maximise parallel efficiency. This consists of scheduling, where known techniques are used to pack instructions into basic blocks that map well to the reconfigurable core (optimising spatial efficiency); then automatic pipelining is applied to dramatically improve the achievable throughput (optimising temporal efficiency). Together these can be thought of as “instruction-level parallelism done right”. Speed-ups of more than an order of magnitude were achieved, yielding throughputs of 180-380M Pixels/s on typical image signal processing tasks, matching the performance of hard-wired ASICs.Furthermore, conventional software-based simulation technologies for data path machines are too slow for use in application verification. This thesis demonstrates how a high-speed software emulator can be created for self-controlled dynamically reconfigurable data path machines, using a static serialisation of the data paths in each configuration context. This yields run-time performance several orders of magnitude higher than existing techniques, making it suitable for use in feedback-directed optimisation

    Are coarse-grained overlays ready for general purpose application acceleration on FPGAs?

    Get PDF
    Combining processors with hardware accelerators has become a norm with systems-on-chip (SoCs) ever present in modern compute devices. Heterogeneous programmable system on chip platforms sometimes referred to as hybrid FPGAs, tightly couple general purpose processors with high performance reconfigurable fabrics, providing a more flexible alternative. We can now think of a software application with hardware accelerated portions that are reconfigured at runtime. While such ideas have been explored in the past, modern hybrid FPGAs are the first commercial platforms to enable this move to a more software oriented view, where reconfiguration enables hardware resources to be shared by multiple tasks in a bigger application. However, while the rapidly increasing logic density and more capable hard resources found in modern hybrid FPGA devices should make them widely deployable, they remain constrained within specialist application domains. This is due to both design productivity issues and a lack of suitable hardware abstraction to eliminate the need for working with platform-specific details, as server and desktop virtualization has done in a more general sense. To allow mainstream adoption of FPGA based accelerators in general purpose computing, there is a need to virtualize FPGAs and make them more accessible to application developers who are accustomed to software API abstractions and fast development cycles. In this paper, we discuss the role of overlay architectures in enabling general purpose FPGA application acceleration

    Reconfigurable architectures for beyond 3G wireless communication systems

    Get PDF
    corecore