
    Generic Connectivity-Based CGRA Mapping via Integer Linear Programming

    Coarse-grained reconfigurable architectures (CGRAs) are programmable logic devices with large coarse-grained ALU-like logic blocks and multi-bit datapath-style routing. CGRAs often have relatively restricted data routing networks, so they attract CAD mapping tools that use exact methods, such as Integer Linear Programming (ILP). However, tools that target general architectures must use large constraint systems to fully describe an architecture's flexibility, resulting in lengthy run-times. In this paper, we propose to derive connectivity information from an otherwise generic device model and use it to create simpler ILPs, which we combine in an iterative schedule that retains most of the exactness of a fully-generic ILP approach. This new approach has a geometric-mean speed-up of 5.88x on benchmarks that do not hit a time-limit of 7.5 hours on the fully-generic ILP, and 37.6x otherwise. This was measured using the set of benchmarks originally used to evaluate the fully-generic approach, plus several more benchmarks representing computation tasks, over three different CGRA architectures. All run-times of the new approach are less than 20 minutes, with a 90th-percentile time of 410 seconds. The proposed mapping techniques are integrated into, and evaluated using, the open-source CGRA-ME architecture modelling and exploration framework.
    Comment: 8 pages of content; 8 figures; 3 tables; to appear in FCCM 2019; uses the CGRA-ME framework at http://cgra-me.ece.utoronto.ca
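
    As a rough illustration of the kind of formulation involved (a minimal sketch, not CGRA-ME's actual constraint system; the dataflow graph, functional units, and connectivity set below are invented), the core of a connectivity-aware mapping ILP can be written in a few lines of Python with the PuLP library: each operation gets a binary placement variable per functional unit, and dataflow edges are forbidden from landing on unconnected unit pairs.

        # Minimal sketch, assuming a pre-derived connectivity set; requires
        # `pip install pulp`. Not CGRA-ME's actual formulation.
        from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, LpStatus

        dfg_ops   = ["a", "b", "c"]               # hypothetical dataflow-graph ops
        dfg_edges = [("a", "b"), ("b", "c")]      # data dependencies a->b->c
        fus       = ["fu0", "fu1", "fu2", "fu3"]  # hypothetical functional units
        connected = {("fu0", "fu1"), ("fu1", "fu2"), ("fu2", "fu3"), ("fu0", "fu2")}

        prob = LpProblem("cgra_mapping", LpMinimize)
        x = {(o, f): LpVariable(f"x_{o}_{f}", cat=LpBinary)
             for o in dfg_ops for f in fus}
        prob += lpSum(x.values())                 # constant objective: feasibility

        for o in dfg_ops:                         # each op placed exactly once
            prob += lpSum(x[o, f] for f in fus) == 1
        for f in fus:                             # each FU hosts at most one op
            prob += lpSum(x[o, f] for o in dfg_ops) <= 1
        for src, dst in dfg_edges:                # edges only along real links
            for fs in fus:
                for fd in fus:
                    if fs != fd and (fs, fd) not in connected:
                        prob += x[src, fs] + x[dst, fd] <= 1

        prob.solve()
        print(LpStatus[prob.status],
              [(o, f) for (o, f), v in x.items() if v.value() == 1])

    The paper's key idea is visible in the connected set: connectivity is extracted from the device model once, up front, so the ILP itself stays small instead of encoding the full routing network.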

    Coarse-grained reconfigurable array architectures

    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support, and for the manual fine-tuning of source code.

    Rewriting History: Repurposing Domain-Specific CGRAs

    Coarse-grained reconfigurable arrays (CGRAs) are domain-specific devices promising both the flexibility of FPGAs and the performance of ASICs. However, with restricted domains comes a danger: designing chips that cannot accelerate enough current and future software to justify the hardware cost. We introduce FlexC, the first flexible CGRA compiler, which allows CGRAs to be adapted to operations they do not natively support. FlexC uses dataflow rewriting, replacing unsupported regions of code with equivalent operations that are supported by the CGRA. We use equality saturation, a technique enabling efficient exploration of a large space of rewrite rules, to effectively search the program space for supported programs. We applied FlexC to over 2,000 loop kernels, compiling to four different research CGRAs and 300 generated CGRAs, and demonstrate a 2.2× increase in the number of loop kernels accelerated, leading to a 3× speedup compared to an Arm A5 CPU on kernels that would otherwise be unsupported by the accelerator.
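
    To make the dataflow-rewriting idea concrete, here is a toy greedy rewriter in Python (an invented example, not FlexC itself, which searches the whole rewrite space at once via equality saturation rather than applying rules greedily): unsupported operations are replaced bottom-up with equivalents drawn from a hypothetical CGRA's op set.

        # Toy dataflow rewriter (invented op set, not FlexC's implementation):
        # rewrite ops a hypothetical CGRA lacks into supported equivalents.
        from dataclasses import dataclass

        @dataclass
        class Node:
            op: str
            args: tuple = ()

        SUPPORTED = {"add", "sub", "shl", "const", "var"}  # assumed CGRA op set

        def rewrite(n: Node) -> Node:
            """Bottom-up replacement of unsupported ops."""
            args = tuple(rewrite(a) for a in n.args)
            if n.op == "mul":                       # rule: x * 2^k  ==>  x << k
                lhs, rhs = args
                if rhs.op == "const" and rhs.args[0] > 0 \
                        and rhs.args[0] & (rhs.args[0] - 1) == 0:
                    k = rhs.args[0].bit_length() - 1
                    return Node("shl", (lhs, Node("const", (k,))))
            if n.op == "neg":                       # rule: -x  ==>  0 - x
                return Node("sub", (Node("const", (0,)), args[0]))
            return Node(n.op, args)

        expr = Node("mul", (Node("var", ("x",)), Node("const", (8,))))
        out = rewrite(expr)
        assert out.op in SUPPORTED                  # mul became shl
        print(out)                                  # shl(var x, const 3)

    The limitation of this greedy approach, and the reason FlexC uses equality saturation, is phase ordering: applying one rule can destroy the pattern another rule needed, whereas an e-graph keeps all equivalent forms alive simultaneously.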

    High-level Modelling and Exploration of Coarse-grained Re-configurable Architectures


    Mocarabe: High-Performance Time-Multiplexed Overlays for FPGAs

    Coarse-grained reconfigurable array (CGRA) overlays can improve dataflow kernel throughput by an order of magnitude over Vivado HLS on a Xilinx Alveo U280. This is possible with a combination of a carefully floorplanned high-frequency design (645-768 MHz Torus, 788-856 MHz Mesh, 583-746 MHz BFT) and a scalable, communication-aware compiler. Our CGRA architecture supports configurable Processing Element (PE) functionality, backed by a configurable number of communication channels to match application demands. Compared to recent FPGA overlays like the 4×4 ADRES and HyCUBE implementations in CGRA-ME, our design operates at a clock frequency up to 3.4× faster, while scaling to an orders-of-magnitude larger array size of 19×69 on the Xilinx Alveo U280. We propose a novel topology-agnostic ILP placer that formulates CGRA placement as an ILP problem. It optimizes placement regardless of topology, and even for non-linear objective functions, by using pre-computed placement costs as inputs to the ILP formulation. Using the ILP placer reduces placement quadratic wirelength by up to 37% compared to the commonly used simulated annealing approach, but increases runtime from less than a minute to hours. Our communication-aware compiler targets HLS objectives such as initiation interval (II) and minimizes communication cost using an integer linear programming (ILP) formulation. Unlike SDC schedulers in FPGA HLS tools, we treat data movement as a first-class citizen by encoding the space and time resources of the communication network in the ILP formulation. Given the same constraints on operational resources as Vivado HLS, we retain our target II and achieve up to 9.2× higher frequency. We compare Torus and Mesh topologies, and show that Mesh has less latency per area than Torus on the same benchmarks.
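
    The trick of feeding pre-computed placement costs into an ILP, so that even a non-linear objective such as quadratic wirelength becomes linear, can be sketched in a few lines (a hypothetical formulation, not Mocarabe's: the grid, netlist, and cost function below are invented). Because the squared distance between any two sites is a constant computed up front, it enters the ILP only as a linear coefficient on an edge-placement indicator variable.

        # Hypothetical ILP placement sketch with pre-computed (non-linear) costs
        # turned into linear coefficients; requires `pip install pulp`.
        from itertools import product
        from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

        ops   = ["a", "b", "c"]
        edges = [("a", "b"), ("b", "c")]
        sites = [(i, j) for i in range(2) for j in range(2)]   # 2x2 PE grid

        def cost(s, t):   # pre-computed squared wirelength between two sites
            return (s[0] - t[0]) ** 2 + (s[1] - t[1]) ** 2

        prob = LpProblem("placement", LpMinimize)
        x = {(o, s): LpVariable(f"x_{o}_{s[0]}{s[1]}", cat=LpBinary)
             for o in ops for s in sites}
        # y[(a,b),s,t] = 1 iff edge (a,b) lands on the site pair (s, t).
        y = {(e, s, t): LpVariable(f"y_{e[0]}{e[1]}_{s[0]}{s[1]}_{t[0]}{t[1]}",
                                   cat=LpBinary)
             for e in edges for s, t in product(sites, sites)}

        # Objective is linear: the non-linearity lives in the constant cost().
        prob += lpSum(cost(s, t) * y[e, s, t]
                      for e in edges for s, t in product(sites, sites))

        for o in ops:
            prob += lpSum(x[o, s] for s in sites) == 1   # each op placed once
        for s in sites:
            prob += lpSum(x[o, s] for o in ops) <= 1     # one op per site
        for a, b in edges:
            for s, t in product(sites, sites):           # y >= AND of the x's
                prob += y[(a, b), s, t] >= x[a, s] + x[b, t] - 1

        prob.solve()
        print({o: s for (o, s), v in x.items() if v.value() == 1})

    Note how the number of y indicators grows with the square of the site count per netlist edge, which hints at the runtime trade-off the abstract reports relative to simulated annealing.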

    Implementation of Data-Driven Applications on Two-Level Reconfigurable Hardware

    Coarse-grained reconfigurable computing architectures have become an important research topic because of their high potential to accelerate a wide range of applications. These architectures apply the concurrent nature of hardware to accelerate computations, and can fill the gap between FPGAs and processors. They typically contrast with Application Specific Integrated Circuits (ASICs) with respect to performance and flexibility. Programming reconfigurable computing architectures is a long-standing challenge and remains extremely inconvenient: programmers must be aware of hardware features, and they are assumed to have a good knowledge of hardware description languages such as VHDL and Verilog rather than of the sequential programming paradigm. Implementing an algorithm on an FPGA is intrinsically more difficult than programming a processor or a GPU: processor-based implementations "only" require a program to control their pre-synthesized data path, while an FPGA requires that a designer create a new data path and a new controller for each application. Moreover, conceiving an architecture that best exploits the millions of logic cells and the thousands of dedicated arithmetic resources available in an FPGA is a time-consuming challenge that only talented experts in circuit design can handle. This project is founded on the generic data-driven compute fabric proposed by Prof. J.P. David and implemented by M. Allard, a previous master's student. The architecture is composed of three main components: the Shared Arithmetic Logic Unit (SALU), the Token State Machine (TSM) and the FIFO Bank (FB). It is somewhat similar to Coarse-Grained Reconfigurable Architectures (CGRAs), but it is data-driven: register banks are replaced by FBs and the controllers are TSMs. An operation starts as soon as its operands are available in the FIFOs that contain them. Data travel from FB to FB through the SALU, as programmed in the configuration memory of the TSMs, and final results return to FIFOs.
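
    The data-driven firing rule described here (an operation starts as soon as its operands are available in FIFOs) can be illustrated with a small Python simulation. This is an invented toy, not the actual SALU/TSM/FB hardware: the program list stands in for TSM configuration memory, and a dict of deques stands in for the FIFO Bank.

        # Toy simulation (assumed behavior, heavily simplified) of the
        # data-driven firing rule: an operation fires as soon as all of its
        # operand FIFOs are non-empty, pushing its result into an output FIFO.
        from collections import deque

        fifos = {name: deque() for name in ("a", "b", "sum", "out")}

        # (opcode, input FIFOs, output FIFO) -- stand-in for TSM configuration.
        program = [
            ("add",  ("a", "b"), "sum"),
            ("shl1", ("sum",),   "out"),
        ]

        ALU = {"add": lambda x, y: x + y, "shl1": lambda x: x << 1}  # stand-in SALU

        def step() -> bool:
            """Fire every ready operation once; return True if anything fired."""
            fired = False
            for opcode, ins, out in program:
                if all(fifos[i] for i in ins):            # all operands available?
                    args = [fifos[i].popleft() for i in ins]
                    fifos[out].append(ALU[opcode](*args))
                    fired = True
            return fired

        fifos["a"].extend([1, 2, 3])
        fifos["b"].extend([10, 20, 30])
        while step():
            pass
        print(list(fifos["out"]))   # [22, 44, 66]

    Each call to step() fires every ready operation once, so results stream through the fabric without any program counter, mirroring the data-driven control the abstract describes.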

    Acceleration for the many, not the few

    Although specialized hardware promises orders-of-magnitude performance gains, its uptake has been limited by how challenging it is to program. Hardware accelerators present challenges programmers are not used to, exposing details of the hardware that are often hidden and requiring new programming styles to use them effectively. Existing programming models often involve learning complex and hardware-specific APIs, using Domain Specific Languages (DSLs), or programming in customized assembly languages. These programming models present a significant barrier to uptake: a steep, unforgiving, and untransferable learning curve. However, programming hardware accelerators using traditional programming models presents its own challenge: mapping code not written with hardware accelerators in mind onto accelerators with restricted behaviour. This thesis frames these challenges in terms of the acceleration equation, and presents solutions in three different contexts: for regular expression accelerators, for API-programmable accelerators (with Fourier Transforms as a key case study), and for heterogeneous coarse-grained reconfigurable arrays (CGRAs). This thesis shows that automatically morphing software written in traditional manners to fit hardware accelerators is possible with no programmer effort, and that huge potential speedups are available.

    Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

    In the modern-day era of technology, a paradigm shift has been witnessed in the areas involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). Specifically, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications such as computer vision, image and video processing, robotics, etc. In the context of developed digital technologies and the availability of authentic data and data-handling infrastructure, DNNs have become a credible choice for solving more complex real-life problems. In certain situations, the performance and accuracy of a DNN can even surpass human intelligence. However, DNNs are computationally cumbersome in terms of the resources and time needed to handle their computations, and general-purpose architectures like CPUs struggle with such computationally intensive algorithms. Therefore, considerable interest and effort have been invested by the research community in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) for the effective implementation of computationally intensive algorithms. This paper surveys the research carried out on the development and deployment of DNNs using these specialized hardware architectures and embedded AI accelerators. The review gives a detailed description of the specialized hardware-based accelerators used in the training and/or inference of DNNs, along with a comparative study of the various accelerators based on factors like power, area, and throughput. Finally, future research and development directions, such as future trends in DNN implementation on specialized hardware accelerators, are discussed. This review article is intended to serve as a guide for hardware architectures for accelerating and improving the effectiveness of deep learning research.

    Algorithm-Architecture Co-Design for Digital Front-Ends in Mobile Receivers

    The methodology behind this work has been to use the concept of algorithm-hardware co-design to achieve efficient solutions for the digital front-end in mobile receivers. It has been shown that, by looking at algorithms and hardware architectures together, more efficient solutions can be found; i.e., efficient with respect to some design measure. In this thesis the main focus has been placed on two such measures: first, reduced-complexity algorithms that lower energy consumption at limited performance degradation; second, handling the increasing number of wireless standards that should preferably run on the same hardware platform. To perform this task it is crucial to understand both sides of the table, i.e., both the algorithms and concepts of wireless communication and the implications they have on the hardware architecture. The high complexity is easier to handle by separating those disciplines through layered abstraction. However, this representation is imperfect, since many interconnected "details" belonging to different layers are lost in the attempt to handle the complexity. This results in poor implementations, and the design of mobile terminals is no exception. Wireless communication standards are often designed from mathematical algorithms with theoretical boundaries, with little consideration of actual implementation constraints such as energy consumption, silicon area, etc. This thesis does not try to remove the layered abstraction model, given its undeniable advantages, but rather uses those cross-layer "details" that went missing during the abstraction. This is done in three ways. In the first part, the cross-layer optimization is carried out from the algorithm perspective. Important circuit design parameters, such as quantization, are taken into consideration when designing algorithms for OFDM symbol timing, CFO, and SNR estimation with a single bit, namely the Sign-Bit. Proof-of-concept circuits were fabricated and showed high potential for low-end receivers. In the second part, the cross-layer optimization is accomplished from the opposite side, i.e., the hardware-architectural side. An SDR architecture is known for its flexibility and scalability over many applications. In this work a filtering application is mapped into software instructions on the SDR architecture in order to make filtering-specific modules redundant and thus save silicon area. In the third and last part, the optimization is done from an intermediate point within the algorithm-architecture spectrum. Here, a heterogeneous architecture combining highly efficient and highly flexible modules is used to accomplish initial synchronization in at least two concurrent OFDM standards. A demonstrator was built, capable of performing synchronization in any two concurrent standards, including LTE, WiFi, and DVB-H.
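
    As an illustration of the Sign-Bit idea (a sketch under assumed parameters, not the fabricated circuit: the FFT size, cyclic-prefix length, noise level, and offset below are invented), OFDM symbol timing can be estimated from 1-bit quantized samples by correlating the cyclic prefix with the symbol tail at a lag of one FFT length. Keeping only the signs preserves the correlation peak while removing multipliers.

        # Illustrative sketch of sign-bit OFDM symbol timing. The cyclic
        # prefix (CP) repeats the last cp_len samples of each symbol, so
        # sign(r[n]) and sign(r[n + nfft]) agree over the CP window.
        import numpy as np

        rng = np.random.default_rng(0)
        nfft, cp_len, offset = 64, 16, 37        # hypothetical OFDM parameters

        # One noisy OFDM symbol with a cyclic prefix at an unknown offset.
        body = rng.standard_normal(nfft) + 1j * rng.standard_normal(nfft)
        symbol = np.concatenate([body[-cp_len:], body])        # CP + body
        n_total = offset + len(symbol) + 40
        rx = (rng.standard_normal(n_total)
              + 1j * rng.standard_normal(n_total)) * 0.3       # noise floor
        rx[offset:offset + len(symbol)] += symbol

        # 1-bit quantization: keep only the signs of I and Q.
        sgn = np.sign(rx.real) + 1j * np.sign(rx.imag)

        # Sign-only autocorrelation at lag nfft, summed over a CP-sized window.
        lags = len(rx) - nfft - cp_len
        metric = np.array([
            np.abs(np.sum(sgn[d:d + cp_len]
                          * np.conj(sgn[d + nfft:d + nfft + cp_len])))
            for d in range(lags)
        ])
        # Should report the true offset with high probability at this SNR.
        print("estimated symbol start:", int(np.argmax(metric)),
              "(true:", offset, ")")

    With 1-bit samples, each product in the correlation reduces to a simple sign comparison (XNOR-style logic) rather than a multiplication, which is what makes such estimators attractive for low-end receivers.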