331 research outputs found

    Coarse-grained reconfigurable array architectures

    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code.

    Compiler and Architecture Design for Coarse-Grained Programmable Accelerators

    The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and transistors shrink, the power consumption and energy usage per transistor decrease. On the other hand, transistor density increases significantly with technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, energy efficiency must be addressed at all design levels, from the circuit level to the application and algorithm levels. At the architectural level, one promising approach is to populate the system with hardware accelerators, each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable, so their utilization can be low because each performs one specific function. Using software-programmable accelerators is an alternative approach that achieves both high energy efficiency and programmability. Due to their intrinsic characteristics, software-programmable accelerators can exploit both instruction-level parallelism and data-level parallelism. A Coarse-Grained Reconfigurable Architecture (CGRA) is a software-programmable accelerator that consists of a number of word-level functional units. Motivated by the promising characteristics of software-programmable accelerators, the potential of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework covers three aspects: CGRA architectural design, integration in a computing system, and the CGRA compiler. First, the design and implementation of a CGRA and its instruction set are presented. This design is then modeled in a cycle-accurate system simulator. The simulation platform makes it possible to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute-intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed that make effective use of scarce CGRA resources to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRAs. Dissertation/Thesis: Doctoral Dissertation, Computer Science 201

    Flip: Data-Centric Edge CGRA Accelerator

    Coarse-Grained Reconfigurable Arrays (CGRA) are promising edge accelerators due to the outstanding balance in flexibility, performance, and energy efficiency. Classic CGRAs statically map compute operations onto the processing elements (PE) and route the data dependencies among the operations through the Network-on-Chip. However, CGRAs are designed for fine-grained static instruction-level parallelism and struggle to accelerate applications with dynamic and irregular data-level parallelism, such as graph processing. To address this limitation, we present Flip, a novel accelerator that enhances traditional CGRA architectures to boost the performance of graph applications. Flip retains the classic CGRA execution model while introducing a special data-centric mode for efficient graph processing. Specifically, it exploits the natural data parallelism of graph algorithms by mapping graph vertices onto processing elements (PEs) rather than the operations, and supporting dynamic routing of temporary data according to the runtime evolution of the graph frontier. Experimental results demonstrate that Flip achieves up to 36× speedup with merely 19% more area compared to classic CGRAs. Compared to state-of-the-art large-scale graph processors, Flip has similar energy efficiency and 2.2× better area efficiency at a much-reduced power/area budget.
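    The data-centric idea in this abstract (placing graph vertices on PEs and routing updates as the frontier evolves) can be illustrated with a small software sketch. This is not Flip's actual design: the `owner_pe` placement rule (vertex modulo PE count), the per-PE inbox structure, and the choice of breadth-first search as the workload are all assumptions made purely for illustration.

```python
from collections import defaultdict

def owner_pe(vertex, num_pes):
    # Hypothetical placement rule: vertex v is stored on PE (v mod num_pes).
    return vertex % num_pes

def bfs_data_centric(adjacency, source, num_pes):
    """Level-synchronous BFS where work is grouped by the PE owning each vertex.

    Each step, a PE processes the frontier vertices in its inbox and sends
    newly discovered neighbours to the inbox of whichever PE owns them.
    """
    distance = {source: 0}
    inbox = defaultdict(list)
    inbox[owner_pe(source, num_pes)].append(source)
    level = 0
    while inbox:
        next_inbox = defaultdict(list)
        for pe in sorted(inbox):              # iterate PEs in a fixed order
            for vertex in inbox[pe]:
                for neighbour in adjacency.get(vertex, []):
                    if neighbour not in distance:
                        distance[neighbour] = level + 1
                        # "Routing": destination depends on runtime data.
                        dest = owner_pe(neighbour, num_pes)
                        next_inbox[dest].append(neighbour)
        inbox = next_inbox
        level += 1
    return distance

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_data_centric(graph, source=0, num_pes=2))
```

    The point of the sketch is that the destination of each message is decided by the data (the neighbour's owner) at run time, in contrast to the static operation-to-PE mapping the abstract attributes to classic CGRAs.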

    Reconfigurable Instruction Cell Architecture Reconfiguration and Interconnects


    Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies

    Broadband Wireless Access technologies have significant market potential, especially the WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high performance WiMAX solutions is forcing designers to seek help from multi-core processors that offer competitive advantages in terms of all performance metrics, such as speed, power and area. Through the provision of a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching that of an ASIC, coarse-grained dynamically reconfigurable processors are proving to be strong candidates for processing cores used in future high performance multi-core processor systems. This thesis investigates multi-core architectures with a newly emerging dynamically reconfigurable processor – RICA, targeting WiMAX physical layer applications. A novel master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC based simulator, called MRPSIM, is devised to model this multi-core architecture. This simulator provides fast simulation speed and timing accuracy, offers flexible architectural options to configure the multi-core architecture, and enables the analysis and investigation of multi-core architectures. Meanwhile a profiling-driven mapping methodology is developed to partition the WiMAX application into multiple tasks as well as schedule and map these tasks onto the multi-core architecture, aiming to reduce the overall system execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly integrated with the existing RICA tool flow. Based on the proposed master-slave multi-core architecture, a series of diverse homogeneous and heterogeneous multi-core solutions are designed for different fixed WiMAX physical layer profiles. 
Implemented in ANSI C and executed on the MRPSIM simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at relatively low area costs. Meanwhile a design space exploration methodology is developed to search the design space for multi-core systems to find suitable solutions under certain system constraints. Finally, laying a foundation for future multithreading exploration on the proposed multi-core architecture, this thesis investigates the porting of a real-time operating system – Micro C/OS-II to a single RICA processor. A multitasking version of WiMAX is implemented on a single RICA processor with the operating system support

    Application-Level Performance Improvement for Stream Program on CGRA-based systems

    Department of Computer Engineering. Coarse-Grained Reconfigurable Architectures (CGRAs), often used as coprocessors for DSP and multimedia kernels, can deliver highly energy-efficient execution for compute-intensive kernels. At the same time, stream applications, which consist of many actors and the channels connecting them, provide natural representations for DSP applications and are therefore a good match for CGRAs. We present our results of mapping DSP applications written in the StreamIt language to CGRAs, along with our mapping flow. One important challenge in mapping is how to manage the multitude of kernels in the application within the limited local memory of a CGRA, for which we present a novel integer linear programming-based solution. Our evaluation results demonstrate that our software and hardware optimizations can help generate highly efficient mappings of stream applications to CGRAs, enabling far more energy-efficient execution (7x worse to 50x better) compared to state-of-the-art GP-GPUs. Further, we eliminate communication overhead and reduce computation overhead using a combination of synchronous/asynchronous processors and DMA. This optimization also improves performance by 17.1% on average compared to the baseline system.

    Implementation of Data-Driven Applications on Two-Level Reconfigurable Hardware

    ABSTRACT Coarse-grained reconfigurable computing architectures have become an important research topic because of their high potential to accelerate a wide range of applications. These architectures exploit the parallel nature of hardware to accelerate computations. Coarse-grained reconfigurable architectures can fill the gap between FPGAs (fine-grained reconfigurable architectures) and processors. Compared with Application Specific Integrated Circuits (ASICs), they typically offer lower performance but greater flexibility. Programming reconfigurable computing architectures is a long-standing challenge that remains extremely inconvenient. Programmers must be aware of the features of the hardware they target and are assumed to have a good knowledge of hardware description languages such as VHDL and Verilog, rather than of the sequential programming paradigm. Implementing an algorithm on an FPGA is intrinsically more difficult than programming a processor or a GPU. Processor-based implementations “only” require a program to control their pre-synthesized data path, while an FPGA requires the designer to create a new data path and a new controller for each application. Moreover, conceiving an architecture that best exploits the millions of logic cells and the thousands of dedicated arithmetic resources available in an FPGA is a time-consuming challenge that only experts in circuit design can handle. This project is founded on the generic data-driven compute fabric proposed by Prof. J.P. David and implemented by M. Allard, a former master's student. This architecture is composed of three main components: the Shared Arithmetic Logic Unit (SALU), the Token State Machine (TSM) and the FIFO Bank (FB). The architecture is somewhat similar to Coarse-Grained Reconfigurable Architectures (CGRAs), but it is data-driven. In this architecture, register banks are replaced by FBs and the controllers are TSMs. An operation starts as soon as its operands are available in the FIFOs that hold them. Data travel from FB to FB through the SALU, as programmed in the configuration memory of the TSMs. Final results are returned in FIFOs.
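    The data-driven firing rule described above (an operation starts as soon as its operands are present in the FIFOs) can be sketched in a few lines. This is a minimal software illustration only, not the thesis's hardware: the instruction tuple format `(op, src_a, src_b, dst)`, the FIFO names, and the two supported operations are hypothetical, and the sketch does not model SALU sharing or the TSM configuration memory.

```python
from collections import deque

def run_dataflow(fifos, instructions):
    """Fire any instruction whose two source FIFOs both hold a token.

    fifos: dict mapping a FIFO name to a deque of values.
    instructions: list of (op, src_a, src_b, dst) tuples (hypothetical format).
    Repeats until no instruction can fire.
    """
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    fired = True
    while fired:
        fired = False
        for op, src_a, src_b, dst in instructions:
            # Firing rule: both operands must be available.
            if fifos[src_a] and fifos[src_b]:
                a = fifos[src_a].popleft()
                b = fifos[src_b].popleft()
                fifos[dst].append(ops[op](a, b))
                fired = True
    return fifos

fifos = {
    "a": deque([1, 2]), "b": deque([3, 4]), "c": deque([10, 10]),
    "s": deque(), "out": deque(),
}
# Listed "out of order" on purpose: execution order follows data availability.
program = [("mul", "s", "c", "out"), ("add", "a", "b", "s")]
print(list(run_dataflow(fifos, program)["out"]))
```

    Although the multiply is listed first, it cannot fire until the add has deposited a token in FIFO `s`, which is the behaviour that distinguishes a data-driven fabric from one sequenced by a program counter.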

    Description and Specialization of Coarse-grained Reconfigurable Architectures

    The functionality of electronic embedded systems, such as mobile phones and digital cameras, becomes more complex with each product generation. This increasing complexity poses great challenges in the design phase of these devices, as designers have to meet high-performance and low-energy requirements on a low production budget. In recent years, coarse-grained, dynamically reconfigurable computer systems have increasingly gained importance as a way to cope with these challenges, because they provide an optimal trade-off between post-production flexibility and performance. Like general-purpose processors, coarse-grained reconfigurable systems can be quickly reprogrammed at run time to perform new tasks, while keeping their performance and energy consumption close to ASIC standards. The design of coarse-grained reconfigurable processors is the main theme of this work. In the first part of this dissertation, I present a new architecture description language designed for the description of coarse-grained reconfigurable systems. This language allows an efficient specification of processor arrays and the description of scalable interconnection networks. The second part of this dissertation investigates the specialization of coarse-grained reconfigurable processors towards an application domain by using custom instruction sets. This work presents methods, techniques, and tools to recognize and extract clusters of operations from a set of applications. These clusters serve as patterns for the design of an optimal custom instruction set. Experiments and results are presented that analyze and assess the impact of custom instructions on coarse-grained processor arrays.

    Memory hierarchy and data communication in heterogeneous reconfigurable SoCs

    The miniaturization race in the hardware industry, aimed at continuously increasing transistor density on a die, no longer brings corresponding improvements in application performance. One of the most promising alternatives is to exploit the heterogeneous nature of common applications in hardware. Supported by reconfigurable computation, which has already proved its efficiency in accelerating data-intensive applications, this concept promises a breakthrough in contemporary technology development. Memory organization in such heterogeneous reconfigurable architectures becomes very critical. Two primary aspects introduce a sophisticated trade-off. On the one hand, a memory subsystem should provide a well-organized distributed data structure and guarantee the required data bandwidth. On the other hand, it should hide the heterogeneous hardware structure from the end user, in order to support feasible high-level programmability of the system. This thesis explores heterogeneous reconfigurable hardware architectures and presents possible solutions to cope with the problem of memory organization and data structure. Using the MORPHEUS heterogeneous platform as an example, the discussion follows the complete design cycle, from decision making and justification through to hardware realization. Particular emphasis is placed on methods to support high system performance, meet application requirements, and provide a user-friendly programmer interface. As a result, the research introduces a complete heterogeneous platform enhanced with a hierarchical memory organization, which copes with its task by separating computation from communication, providing reconfigurable engines with computation and configuration data, and unifying heterogeneous computational devices using local storage buffers. 
It is distinguished from the related solutions by distributed data-flow organization, specifically engineered mechanisms to operate with data on local domains, particular communication infrastructure based on Network-on-Chip, and thorough methods to prevent computation and communication stalls. In addition, a novel advanced technique to accelerate memory access was developed and implemented