331 research outputs found
Coarse-grained reconfigurable array architectures
Coarse-Grained ReconïŹgurable Array (CGRA) architectures accelerate the same inner loops that beneïŹt from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efïŹciently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on ïŹexibility, performance, and power-efïŹciency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual ïŹne-tuning of source code
Compiler and Architecture Design for Coarse-Grained Programmable Accelerators
abstract: The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density increases significantly by technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, increasing energy-efficiency must be addressed at all design levels from circuit level to application and algorithm levels.
At architectural level, one promising approach is to populate the system with hardware accelerators each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy-efficiency and programmability. Due to intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism.
Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consists of a number of word-level functional units. Motivated by promising characteristics of software programmable accelerators, the potentials of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and CGRA compiler. First, the design and implementation of a CGRA and its instruction set is presented. This design is then modeled in a cycle accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize CGRA scarce resources very well to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRADissertation/ThesisDoctoral Dissertation Computer Science 201
Flip: Data-Centric Edge CGRA Accelerator
Coarse-Grained Reconfigurable Arrays (CGRA) are promising edge accelerators
due to the outstanding balance in flexibility, performance, and energy
efficiency. Classic CGRAs statically map compute operations onto the processing
elements (PE) and route the data dependencies among the operations through the
Network-on-Chip. However, CGRAs are designed for fine-grained static
instruction-level parallelism and struggle to accelerate applications with
dynamic and irregular data-level parallelism, such as graph processing. To
address this limitation, we present Flip, a novel accelerator that enhances
traditional CGRA architectures to boost the performance of graph applications.
Flip retains the classic CGRA execution model while introducing a special
data-centric mode for efficient graph processing. Specifically, it exploits the
natural data parallelism of graph algorithms by mapping graph vertices onto
processing elements (PEs) rather than the operations, and supporting dynamic
routing of temporary data according to the runtime evolution of the graph
frontier. Experimental results demonstrate that Flip achieves up to 36
speedup with merely 19% more area compared to classic CGRAs. Compared to
state-of-the-art large-scale graph processors, Flip has similar energy
efficiency and 2.2 better area efficiency at a much-reduced power/area
budget
Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies
Broadband Wireless Access technologies have significant market potential, especially the
WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high
performance WiMAX solutions is forcing designers to seek help from multi-core processors
that offer competitive advantages in terms of all performance metrics, such as speed, power
and area. Through the provision of a degree of flexibility similar to that of a DSP and
performance and power consumption advantages approaching that of an ASIC,
coarse-grained dynamically reconfigurable processors are proving to be strong candidates
for processing cores used in future high performance multi-core processor systems.
This thesis investigates multi-core architectures with a newly emerging dynamically
reconfigurable processor â RICA, targeting WiMAX physical layer applications. A novel
master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC
based simulator, called MRPSIM, is devised to model this multi-core architecture. This
simulator provides fast simulation speed and timing accuracy, offers flexible architectural
options to configure the multi-core architecture, and enables the analysis and investigation
of multi-core architectures. Meanwhile a profiling-driven mapping methodology is
developed to partition the WiMAX application into multiple tasks as well as schedule and
map these tasks onto the multi-core architecture, aiming to reduce the overall system
execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly
integrated with the existing RICA tool flow.
Based on the proposed master-slave multi-core architecture, a series of diverse
homogeneous and heterogeneous multi-core solutions are designed for different fixed
WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM
simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at
relatively low area costs. Meanwhile a design space exploration methodology is developed
to search the design space for multi-core systems to find suitable solutions under certain
system constraints. Finally, laying a foundation for future multithreading exploration on the
proposed multi-core architecture, this thesis investigates the porting of a real-time operating
system â Micro C/OS-II to a single RICA processor. A multitasking version of WiMAX is
implemented on a single RICA processor with the operating system support
Application-Level Performance Improvement for Stream Program on CGRA-based systems
Department of Computer EngineeringCoarse-Grained Reconfigurable Architectures (CGRAs), often used as coprocessors for DSP and multimedia kernels, can deliver highly energy-effcient execution for compute-intensive kernels. Simultaneously, stream applications, which consist of many actors and channels connecting them, can provide natural representations for DSP applications, and therefore be a good match for CGRAs. We present our results of mapping DSP applications written in StreamIt language to CGRAs, along with our mapping flow. One important challenge in mapping is how to manage the multitude of kernels in the application for the limited local memory of a CGRA, for which we present a novel integer linear programming-based solution. Our evaluation results demonstrate that our software and hardware optimizations can help generate highly effcient mapping of stream applications to CGRAs, enabling far more energy-effcient executions (7x worse to 50x better) compared to using state-of-theart GP-GPUs. Further, we eliminate communication overhead and reduce computation overhead using combination of sychronous/asynchronous processors and DMA. This optimization also improve performance by 17.1% on average comparing to baseline system.ope
Implementation of Data-Driven Applications on Two-Level Reconfigurable Hardware
RĂSUMĂ Les architectures reconfigurables Ă large grain sont devenues un sujet important de recherche en raison de leur haut potentiel pour accĂ©lĂ©rer une large gamme dâapplications. Ces architectures utilisent la nature parallĂšle de lâarchitecture matĂ©rielle pour accĂ©lĂ©rer les calculs. Les architectures reconfigurables Ă large grain sont en mesure de combler les lacunes existantes entre le FPGA (architecture reconfigurable Ă grain fin) et le processeur. Elles contrastent gĂ©nĂ©ralement avec les Application Specific Integrated Circuits (ASIC) en ce qui concerne la performance (moins bonnes) et la flexibilitĂ© (meilleures). La programmation dâarchitectures reconfigurables est un dĂ©fi qui date depuis longtemps et pose plusieurs problĂšmes. Les programmeurs doivent ĂȘtre avisĂ©s des caractĂ©ristiques du matĂ©riel sur lequel ils travaillent et connaĂźtre des langages de description matĂ©riels tels que VHDL et Verilog au lieu de langages de programmation sĂ©quentielle. LâimplĂ©mentation dâun algorithme sur FPGA sâavĂšre plus difficile que de le faire sur des CPU ou des GPU. Les implĂ©mentations Ă base de processeurs ont dĂ©jĂ leur chemin de donnĂ©es prĂ© synthĂ©tisĂ© et ont besoin uniquement dâun programme pour le contrĂŽler. Par contre, dans un FPGA, le dĂ©veloppeur doit crĂ©er autant le chemin de donnĂ©es que le contrĂŽleur. Cependant, concevoir une nouvelle architecture pour exploiter efficacement les millions de cellules logiques et les milliers de ressources arithmĂ©tiques dĂ©diĂ©es qui sont disponibles dans une FPGA est une tĂąche difficile qui requiert beaucoup de temps. Seulement les spĂ©cialistes dans le design de circuits peuvent le faire. Ce projet est fondĂ© sur un tissu de calcul gĂ©nĂ©rique contrĂŽlĂ© par les donnĂ©es qui a Ă©tĂ© proposĂ© par le professeur J.P David et a dĂ©jĂ Ă©tĂ© implĂ©mentĂ© par un Ă©tudiant Ă la maĂźtrise M. Allard. Cette architecture est principalement formĂ©e de trois composants: lâunitĂ© arithmĂ©tique et logique partagĂ©e (Shared Arithmetic Logic Unit âSALU-), la machine Ă Ă©tat pour le jeton des donnĂ©es (Token State Machine âTSM-) et la banque de FIFO (FIFO Bank âFB-). Cette architecture est semblable aux architectures reconfigurables Ă large grain (Coarse-Grained Reconfigurable Architecture-CGRAs-), mais contrĂŽlĂ©e par les donnĂ©es.----------ABSTRACT Coarse-grained reconfigurable computing architectures have become an important research topic because of their high potential to accelerate a wide range of applications. These architectures apply the concurrent nature of hardware architecture to accelerate computations. Substantially, coarse-grained reconfigurable computing architectures can fill up existing gaps between FPGAs and processor. They typically contrast with Application Specific Integrated Circuits (ASICs) in connection with performance and flexibility.
Programming reconfigurable computing architectures is a long-standing challenge, and it is yet extremely inconvenient. Programmers must be aware of hardware features and also it is assumed that they have a good knowledge of hardware description languages such as VHDL and Verilog, instead of the sequential programming paradigm. Implementing an algorithm on FPGA is intrinsically more difficult than programming a processor or a GPU. Processor-based implementations âonlyâ require a program to control their pre-synthesized data path, while an FPGA requires that a designer creates a new data path and a new controller for each application. Nevertheless, conceiving an architecture that best exploits the millions of logic cells and the thousands of dedicated arithmetic resources available in an FPGA is a time-consuming challenge that only talented experts in circuit design can handle. This project is founded on the generic data-driven compute fabric proposed by Prof. J.P. David and implemented by M. Allard, a previous master student. This architecture is composed of three main individual components: the Shared Arithmetic Logic Unit (SALU), the Token State Machine (TSM) and the FIFO Bank (FB). The architecture is somewhat similar to Coarse-Grained Reconfigurable Architectures (CGRAs), but it is data-driven. Indeed, in that architecture, register banks are replaced by FBs and the controllers are TSMs. The operations start as soon as the operands are available in the FIFOs that contain the operands. Data travel from FBs to FBs through the SALU, as programmed in the configuration memory of the TSMs. Final results return in FIFOs
Description and Specialization of Coarse-grained Reconfigurable Architectures
The functionality of electronic embedded systems, such as mobile phones and digital cameras, becomes more complex at each product generation. This increasing complexity implies great challenges at the design phase of these devices, as designers have to deal with high performance and low energy requirements at a low production budget. In the last years, coarse-grained, dynamically reconfigurable computer systems have increasingly gain in importance as an alternative to cope with these challenges because they provide an optimal trade-off between flexibility-after-production and performance. Like generic purpose processors, coarse-grained reconfigurable systems can be quickly reprogrammed to perform new tasks, but they keep their performance and energy consumption near to ASIC standards. The design of coarse-grained reconfigurable processors is the main theme in this work.
In the first part of this dissertation, I present a new architecture description language that was designed for the description of coarse-grained, reconfigurable systems. This language allows an efficient specification of processor arrays and the description of scalable interconnection networks. The second part of this dissertation investigates the specialization of coarse-grained reconfigurable processors towards an application domain by using custom instruction sets. This work presents methods, techniques, and tools to recognize and extract clusters of operations from a set of application. These clusters serve as patterns for the design of an optimal custom instruction set. Experiments and results are presented, which analyze and assess the impact of custom instructions on coarse-grained processor arrays.Die FunktionalitĂ€t eingebetteter Systeme wie Mobiltelefone und digitale Foto-Kameras wird zunehmend umfangreicher und bĂŒrdet dem Entwurf dieser GerĂ€te hohe Herausforderungen auf, wie z.B. hohe AusfĂŒhrungsgeschwindigkeit, niedrige Herstellungskosten und geringeren Energieverbrauch. Um diese Herausforderungen zu bewĂ€ltigen, gewinnen grobgranulare dynamische rekonfigurierbare Rechnersysteme schnell an Bedeutung, denn sie bieten einen optimalen trade-off zwischen FlexibilitĂ€t nach der Herstellung und Performanz. Wie allgemeine Prozessoren, können grobgranulare rekonfigurierbare Systeme wĂ€hrend der AusfĂŒhrungszeit schnell umprogrammiert werden, um neue FunktionalitĂ€ten auszufĂŒhren, behalten aber immer noch eine ASIC-Ă€hnliche Performanz und Verlustleistungsverbrauch. Der Entwurf grobgranularer rekonfigurierbarer Bausteine ist das Thema dieser Dissertation.
Im ersten Teil dieser Dissertation wird eine Sprache vorgestellt, die fĂŒr die Beschreibung grobgranularer rekonfigurierbarer Systeme entwickelt wurde. Diese Sprache ermöglicht eine effiziente Spezifikation von Prozessoren-Arrays und die Beschreibung skalierbarer Netzwerkverbindungen. Der zweite Teil untersucht die Anpassung grobgranularer rekonfigurierbarer Bausteine an AnwendungssĂ€tze mittels spezialisierter Befehle. Methoden werden vorgestellt zur Erkennung und Extraktion von Operationsmustern aus einem Anwendungssatz. Diese Operationsmuster dienen dann zum Entwurf eines optimalen spezialisierten Befehlsatzes. Als Ergebnisse werden die Wirkungen von spezialisierten BefehlsĂ€tzen in grobgranularen Arrays analysiert und bewertet
Memory hierarchy and data communication in heterogeneous reconfigurable SoCs
The miniaturization race in the hardware industry aiming at continuous increasing
of transistor density on a die does not bring respective application performance
improvements any more. One of the most promising alternatives is to
exploit a heterogeneous nature of common applications in hardware. Supported by
reconfigurable computation, which has already proved its efficiency in accelerating
data intensive applications, this concept promises a breakthrough in contemporary
technology development.
Memory organization in such heterogeneous reconfigurable architectures becomes
very critical. Two primary aspects introduce a sophisticated trade-off. On
the one hand, a memory subsystem should provide well organized distributed data
structure and guarantee the required data bandwidth. On the other hand, it should
hide the heterogeneous hardware structure from the end-user, in order to support
feasible high-level programmability of the system.
This thesis work explores the heterogeneous reconfigurable hardware architectures
and presents possible solutions to cope the problem of memory organization
and data structure. By the example of the MORPHEUS heterogeneous platform,
the discussion follows the complete design cycle, starting from decision making
and justification, until hardware realization. Particular emphasis is made on the
methods to support high system performance, meet application requirements, and
provide a user-friendly programmer interface.
As a result, the research introduces a complete heterogeneous platform enhanced
with a hierarchical memory organization, which copes with its task by
means of separating computation from communication, providing reconfigurable
engines with computation and configuration data, and unification of heterogeneous
computational devices using local storage buffers. It is distinguished from the
related solutions by distributed data-flow organization, specifically engineered
mechanisms to operate with data on local domains, particular communication infrastructure
based on Network-on-Chip, and thorough methods to prevent computation
and communication stalls. In addition, a novel advanced technique to accelerate
memory access was developed and implemented
- âŠ