19 research outputs found

    Generic Connectivity-Based CGRA Mapping via Integer Linear Programming

    Full text link
    Coarse-grained reconfigurable architectures (CGRAs) are programmable logic devices with large coarse-grained ALU-like logic blocks, and multi-bit datapath-style routing. CGRAs often have relatively restricted data routing networks, so they attract CAD mapping tools that use exact methods, such as Integer Linear Programming (ILP). However, tools that target general architectures must use large constraint systems to fully describe an architecture's flexibility, resulting in lengthy run-times. In this paper, we propose to derive connectivity information from an otherwise generic device model, and use this to create simpler ILPs, which we combine in an iterative schedule and retain most of the exactness of a fully-generic ILP approach. This new approach has a speed-up geometric mean of 5.88x when considering benchmarks that do not hit a time-limit of 7.5 hours on the fully-generic ILP, and 37.6x otherwise. This was measured using the set of benchmarks used to originally evaluate the fully-generic approach and several more benchmarks representing computation tasks, over three different CGRA architectures. All run-times of the new approach are less than 20 minutes, with 90th percentile time of 410 seconds. The proposed mapping techniques are integrated into, and evaluated using the open-source CGRA-ME architecture modelling and exploration framework.Comment: 8 pages of content; 8 figures; 3 tables; to appear in FCCM 2019; Uses the CGRA-ME framework at http://cgra-me.ece.utoronto.ca

    Implementation of Data-Driven Applications on Two-Level Reconfigurable Hardware

    Get PDF
    RÉSUMÉ Les architectures reconfigurables à large grain sont devenues un sujet important de recherche en raison de leur haut potentiel pour accélérer une large gamme d’applications. Ces architectures utilisent la nature parallèle de l’architecture matérielle pour accélérer les calculs. Les architectures reconfigurables à large grain sont en mesure de combler les lacunes existantes entre le FPGA (architecture reconfigurable à grain fin) et le processeur. Elles contrastent généralement avec les Application Specific Integrated Circuits (ASIC) en ce qui concerne la performance (moins bonnes) et la flexibilité (meilleures). La programmation d’architectures reconfigurables est un défi qui date depuis longtemps et pose plusieurs problèmes. Les programmeurs doivent être avisés des caractéristiques du matériel sur lequel ils travaillent et connaître des langages de description matériels tels que VHDL et Verilog au lieu de langages de programmation séquentielle. L’implémentation d’un algorithme sur FPGA s’avère plus difficile que de le faire sur des CPU ou des GPU. Les implémentations à base de processeurs ont déjà leur chemin de données pré synthétisé et ont besoin uniquement d’un programme pour le contrôler. Par contre, dans un FPGA, le développeur doit créer autant le chemin de données que le contrôleur. Cependant, concevoir une nouvelle architecture pour exploiter efficacement les millions de cellules logiques et les milliers de ressources arithmétiques dédiées qui sont disponibles dans une FPGA est une tâche difficile qui requiert beaucoup de temps. Seulement les spécialistes dans le design de circuits peuvent le faire. Ce projet est fondé sur un tissu de calcul générique contrôlé par les données qui a été proposé par le professeur J.P David et a déjà été implémenté par un étudiant à la maîtrise M. Allard. Cette architecture est principalement formée de trois composants: l’unité arithmétique et logique partagée (Shared Arithmetic Logic Unit –SALU-), la machine à état pour le jeton des données (Token State Machine –TSM-) et la banque de FIFO (FIFO Bank –FB-). Cette architecture est semblable aux architectures reconfigurables à large grain (Coarse-Grained Reconfigurable Architecture-CGRAs-), mais contrôlée par les données.----------ABSTRACT Coarse-grained reconfigurable computing architectures have become an important research topic because of their high potential to accelerate a wide range of applications. These architectures apply the concurrent nature of hardware architecture to accelerate computations. Substantially, coarse-grained reconfigurable computing architectures can fill up existing gaps between FPGAs and processor. They typically contrast with Application Specific Integrated Circuits (ASICs) in connection with performance and flexibility. Programming reconfigurable computing architectures is a long-standing challenge, and it is yet extremely inconvenient. Programmers must be aware of hardware features and also it is assumed that they have a good knowledge of hardware description languages such as VHDL and Verilog, instead of the sequential programming paradigm. Implementing an algorithm on FPGA is intrinsically more difficult than programming a processor or a GPU. Processor-based implementations “only” require a program to control their pre-synthesized data path, while an FPGA requires that a designer creates a new data path and a new controller for each application. Nevertheless, conceiving an architecture that best exploits the millions of logic cells and the thousands of dedicated arithmetic resources available in an FPGA is a time-consuming challenge that only talented experts in circuit design can handle. This project is founded on the generic data-driven compute fabric proposed by Prof. J.P. David and implemented by M. Allard, a previous master student. This architecture is composed of three main individual components: the Shared Arithmetic Logic Unit (SALU), the Token State Machine (TSM) and the FIFO Bank (FB). The architecture is somewhat similar to Coarse-Grained Reconfigurable Architectures (CGRAs), but it is data-driven. Indeed, in that architecture, register banks are replaced by FBs and the controllers are TSMs. The operations start as soon as the operands are available in the FIFOs that contain the operands. Data travel from FBs to FBs through the SALU, as programmed in the configuration memory of the TSMs. Final results return in FIFOs

    Polymorphic Pipeline Array: A Flexible Multicore Accelerator for Mobile Multimedia Applications.

    Full text link
    Mobile computing in the form of smart phones, netbooks, and PDAs has become an integral part of our everyday lives. Moving ahead to the next generation of mobile devices, we believe that multimedia will become a more critical and product-differentiating feature. High definition audio and video as well as 3D graphics provide richer interfaces and compelling capabilities. However, these algorithms also bring different computational challenges than wireless signal processing. Multimedia algorithms are more complex featuring more control flow and variable computational requirements where execution time is not dominated by innermost vector loops. Further, data access is more complex where media applications typically operate on multi-dimensional vectors of data rather than single-dimensional vectors with simple strides. Thus, the design of current mobile platforms requires re-examination to account for these new application domains. In this dissertation, we focus on the design of a programmable, low-power accelerator for multimedia algorithms referred to as a Polymorphic Pipeline Array (PPA). The PPA design is inspired by coarse-grain reconfigurable architectures (CGRAs) that consist of an array of function units interconnected by a mesh style interconnect. The PPA improves upon CGRAs by attacking two major limitations: scalability and acceleration limited to innermost loops. The large number of resources are fully utilized by exploiting both Lne-grain instruction-level and coarse-grain pipeline parallelism, and the acceleration is extended beyond innermost loops to encompass the whole region of applications. Various compiler and architectural optimizations are presented for CGRAs that form the basic building blocks of PPA. Two compiler techniques are presented that systematically construct the schedule with intelligent heuristics. Modulo graph embedding leverages graph embedding technique for scheduling in CGRAs and edgecentric modulo scheduling provides a communication-oriented way to address the scheduling problem. For architectural improvement, a novel control path design is presented that leverages the token network of dataflow machines to reduce the instructionmemory power. The PPA is designed with flexibility and programmability as first-order requirements to enable the hardware to be dynamically customizable to the application. A PPA exploit pipeline parallelism found in streaming applications to create a coarsegrain hardware pipeline to execute streaming media applications.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/64732/1/parkhc_1.pd

    Optimization of the Memory Subsystem of a Coarse Grained Reconfigurable Hardware Accelerator

    Get PDF
    Fast and energy efficient processing of data has always been a key requirement in processor design. The latest developments in technology emphasize these requirements even further. The widespread usage of mobile devices increases the demand of energy efficient solutions. Many new applications like advanced driver assistance systems focus more and more on machine learning algorithms and have to process large data sets in hard real time. Up to the 1990s the increase in processor performance was mainly achieved by new and better manufacturing technologies for processors. That way, processors could operate at higher clock frequencies, while the processor microarchitecture was mainly the same. At the beginning of the 21st century this development stopped. New manufacturing technologies made it possible to integrate more processor cores onto one chip, but almost no improvements were achieved anymore in terms of clock frequencies. This required new approaches in both processor microarchitecture and software design. Instead of improving the performance of a single processor, the current problem has to be divided into several subtasks that can be executed in parallel on different processing elements which speeds up the application. One common approach is to use multi-core processors or GPUs (Graphic Processing Units) in which each processing element calculates one subtask of the problem. This approach requires new programming techniques and legacy software has to be reformulated. Another approach is the usage of hardware accelerators which are coupled to a general purpose processor. For each problem a dedicated circuit is designed which can solve the problem fast and efficiently. The actual computation is then executed on the accelerator and not on the general purpose processor. The disadvantage of this approach is that a new circuit has to be designed for each problem. This results in an increased design effort and typically the circuit can not be adapted once it is deployed. This work covers reconfigurable hardware accelerators. They can be reconfigured during runtime so that the same hardware is used to accelerate different problems. During runtime, time consuming code fragments can be identified and the processor itself starts a process that creates a configuration for the hardware accelerator. This configuration can now be loaded and the code will then be executed on the accelerator faster and more efficient. A coarse grained reconfigurable architecture was chosen because creating a configuration for it is much less complex than creating a configuration for a fine grained reconfigurable architecture like an FPGA (Field Programmable Gate Array). Additionally, the smaller overhead for the reconfigurability results in higher clock frequencies. One advantage of this approach is that programmers don't need any knowledge about the underlying hardware, because the acceleration is done automatically during runtime. It is also possible to accelerate legacy code without user interaction (even when no source code is available anymore). One challenge that is relevant for all approaches, is the efficient and fast data exchange between processing elements and main memory. Therefore, this work concentrates on the optimization of the memory interface between the coarse grained reconfigurable hardware accelerator and the main memory. To achieve this, a simulator for a Java processor coupled with a coarse grained reconfigurable hardware accelerator was developed during this work. Several strategies were developed to improve the performance of the memory interface. The solutions range from different hardware designs to software solutions that try to optimize the usage of the memory interface during the creation of the configuration of the accelerator. The simulator was used to search the design space for the best implementation. With this optimization of the memory interface a performance improvement of 22.6% was achieved. Apart from that, a first prototype of this kind of accelerator was designed and implemented on an FPGA to show the correct functionality of the whole approach and the simulator

    Metoda projektovanja namenskih programabilnih hardverskih akceleratora

    Get PDF
    Namenski računarski sistemi se najčesće projektuju tako da mogu da podrže izvršavanje većeg broja željenih aplikacija. Za postizanje što veće efikasnosti, preporučuje se korišćenje specijalizovanih procesora Application Specific Instruction Set Processors–ASIPs, na kojima se izvršavanje programskih instrukcija obavlja u za to projektovanim i nezavisnimhardverskim blokovima (akceleratorima). Glavni razlog za postojanje nezavisnih akceleratora jeste postizanjemaksimalnog ubrzanja izvršavanja instrukcija. Me ¯ dutim, ovakav pristup podrazumeva da je za svaki od blokova potrebno projektovati integrisano (ASIC) kolo, čime se bitno povećava ukupna površina procesora. Metod za smanjenje ukupne površine jeste primena DatapathMerging tehnike na dijagrame toka podataka ulaznih aplikacija. Kao rezultat, dobija se jedan programabilni hardverski akcelerator, sa mogućnosću izvršavanja svih željenih instrukcija. Međutim, ovo ima negativne posledice na efikasnost sistema. često se zanemaruje činjenica da, usled veoma ograničene fleksibilnosti ASIC hardverskih akceleratora, specijalizovani procesori imaju i drugih nedostataka. Naime, u slučaju izmena, ili prosto nadogradnje, specifikacije procesora u završnimfazama projektovanja, neizbežna su velika kašnjenja i dodatni troškovi promene dizajna. U ovoj tezi je pokazano da zahtevi za fleksibilnošću i efikasnošću ne moraju biti međusobno isključivi. Demonstrirano je je da je moguce uneti ograničeni nivo fleksibilnosti hardvera tokom dizajn procesa, tako da dobijeni hardverski akcelerator može da izvršava ne samo aplikacije definisane na samom početku projektovanja, već i druge aplikacije, pod uslovom da one pripadaju istom domenu. Drugim rečima, u tezi je prezentovana metoda projektovanja fleksibilnih namenskih hardverskih akceleratora. Eksperimentalnom evaluacijom pokazano je da su tako dobijeni akceleratori u većini slučajeva samo do 2 x veće površine ili 2 x većeg kašnjenja od akceleratora dobijenih primenom DatapathMerging metode, koja pritom ne pruža ni malo dodatne fleksibilnosti.Typically, embedded systems are designed to support a limited set of target applications. To efficiently execute those applications, they may employ Application Specific Instruction Set Processors (ASIPs) enriched with carefully designed Instructions Set Extension (ISEs) implemented in dedicated hardware blocks. The primary goal when designing ISEs is efficiency, i.e. the highest possible speedup, which implies synthesizing all critical computational kernels of the application dataflow graphs as an Application Specific Integrated Circuit (ASICs). Yet, this can lead to high on-chip area dedicated solely to ISEs. One existing approach to decrease this area by paying a reasonable price of decreased efficiency is to perform datapath merging on input dataflow graphs (DFGs) prior to generating the ASIC. It is often neglected that even higher costs can be accidentally incurred due to the lack of flexibility of such ISEs. Namely, if late design changes or specification upgrades happen, significant time-to-market delays and nonrecurrent costs for redesigning the ISEs and the corresponding ASIPs become inevitable. This thesis shows that flexibility and efficiency are not mutually exclusive. It demonstrates that it is possible to introduce a limited amount of hardware flexibility during the design process, such that the resulting datapath is in fact reconfigurable and thus can execute not only the applications known at design time, but also other applications belonging to the same application-domain. In other words, it proposes a methodology for designing domain-specific reconfigurable arrays out of a limited set of input applications. The experimental results show that resulting arrays are usually around 2£ larger and 2£ slower than ISEs synthesized using datapath merging, which have practically null flexibility beyond the design set of DFGs

    Description and Specialization of Coarse-grained Reconfigurable Architectures

    Get PDF
    The functionality of electronic embedded systems, such as mobile phones and digital cameras, becomes more complex at each product generation. This increasing complexity implies great challenges at the design phase of these devices, as designers have to deal with high performance and low energy requirements at a low production budget. In the last years, coarse-grained, dynamically reconfigurable computer systems have increasingly gain in importance as an alternative to cope with these challenges because they provide an optimal trade-off between flexibility-after-production and performance. Like generic purpose processors, coarse-grained reconfigurable systems can be quickly reprogrammed to perform new tasks, but they keep their performance and energy consumption near to ASIC standards. The design of coarse-grained reconfigurable processors is the main theme in this work. In the first part of this dissertation, I present a new architecture description language that was designed for the description of coarse-grained, reconfigurable systems. This language allows an efficient specification of processor arrays and the description of scalable interconnection networks. The second part of this dissertation investigates the specialization of coarse-grained reconfigurable processors towards an application domain by using custom instruction sets. This work presents methods, techniques, and tools to recognize and extract clusters of operations from a set of application. These clusters serve as patterns for the design of an optimal custom instruction set. Experiments and results are presented, which analyze and assess the impact of custom instructions on coarse-grained processor arrays.Die Funktionalität eingebetteter Systeme wie Mobiltelefone und digitale Foto-Kameras wird zunehmend umfangreicher und bürdet dem Entwurf dieser Geräte hohe Herausforderungen auf, wie z.B. hohe Ausführungsgeschwindigkeit, niedrige Herstellungskosten und geringeren Energieverbrauch. Um diese Herausforderungen zu bewältigen, gewinnen grobgranulare dynamische rekonfigurierbare Rechnersysteme schnell an Bedeutung, denn sie bieten einen optimalen trade-off zwischen Flexibilität nach der Herstellung und Performanz. Wie allgemeine Prozessoren, können grobgranulare rekonfigurierbare Systeme während der Ausführungszeit schnell umprogrammiert werden, um neue Funktionalitäten auszuführen, behalten aber immer noch eine ASIC-ähnliche Performanz und Verlustleistungsverbrauch. Der Entwurf grobgranularer rekonfigurierbarer Bausteine ist das Thema dieser Dissertation. Im ersten Teil dieser Dissertation wird eine Sprache vorgestellt, die für die Beschreibung grobgranularer rekonfigurierbarer Systeme entwickelt wurde. Diese Sprache ermöglicht eine effiziente Spezifikation von Prozessoren-Arrays und die Beschreibung skalierbarer Netzwerkverbindungen. Der zweite Teil untersucht die Anpassung grobgranularer rekonfigurierbarer Bausteine an Anwendungssätze mittels spezialisierter Befehle. Methoden werden vorgestellt zur Erkennung und Extraktion von Operationsmustern aus einem Anwendungssatz. Diese Operationsmuster dienen dann zum Entwurf eines optimalen spezialisierten Befehlsatzes. Als Ergebnisse werden die Wirkungen von spezialisierten Befehlsätzen in grobgranularen Arrays analysiert und bewertet

    Automatic Design of Efficient Application-centric Architectures.

    Full text link
    As the market for embedded devices continues to grow, the demand for high performance, low cost, and low power computation grows as well. Many embedded applications perform computationally intensive tasks such as processing streaming video or audio, wireless communication, or speech recognition and must be implemented within tight power budgets. Typically, general purpose processors are not able to meet these performance and power requirements. Custom hardware in the form of loop accelerators are often used to execute the compute-intensive portions of these applications because they can achieve significantly higher levels of performance and power efficiency. Automated hardware synthesis from high level specifications is a key technology used in designing these accelerators, because the resulting hardware is correct by construction, easing verification and greatly decreasing time-to-market in the quickly evolving embedded domain. In this dissertation, a compiler-directed approach is used to design a loop accelerator from a C specification and a throughput requirement. The compiler analyzes the loop and generates a virtual architecture containing sufficient resources to sustain the required throughput. Next, a software pipelining scheduler maps the operations in the loop to the virtual architecture. Finally, the accelerator datapath is derived from the resulting schedule. In this dissertation, synthesis of different types of loop accelerators is investigated. First, the system for synthesizing single loop accelerators is detailed. In particular, a scheduler is presented that is aware of the effects of its decisions on the resulting hardware, and attempts to minimize hardware cost. Second, synthesis of multifunction loop accelerators, or accelerators capable of executing multiple loops, is presented. Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce overall cost. Finally, synthesis of post-programmable accelerators is presented, allowing changes to be made to the software after an accelerator has been created. The tradeoffs between the flexibility, cost, and energy efficiency of these different types of accelerators are investigated. Automatically synthesized loop accelerators are capable of achieving order-of-magnitude gains in performance, area efficiency, and power efficiency over processors, and programmable accelerators allow software changes while maintaining highly efficient levels of computation.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pd

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    Get PDF
    The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    ACiS: smart switches with application-level acceleration

    Full text link
    Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster/denser CPUs, and then, just on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single function-e.g., a network to transport data—to have additional functionality, e.g., also to operate on that data. More generally, integration enables compute-everywhere: not just in CPU and accelerator, but also in network and, more specifically, the communication switches. In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline through introducing composable plugins to successively add ACiS capabilities. In the first work, we propose an inline accelerator to communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems. In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line-rate. Functions requiring loops and states are addressed in this method. The proposed in-switch accelerator is based on a RISC-V compatible Coarse Grained Reconfigurable Arrays (CGRAs). To facilitate usability, we have developed a framework to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator. In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe that there is an opportunity of fusing communication (collectives) with computation. Since the computation can vary for different applications, ACiS support should be programmable in this method. In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information and monitor application behavior and then use this information for control, decision-making, and coordination of nodes. We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, when considering a Graph Convolutional Network (GCN) application as a case study, a speedup of on average 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities
    corecore