85 research outputs found

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Efficient performance scaling of future CGRAs for mobile applications

    Full text link

    High-Level Design Space and Flexibility Exploration for Adaptive, Energy-Efficient WCDMA Channel Estimation Architectures

    Get PDF
    Due to the fast changing wireless communication standards coupled with strict performance constraints, the demand for flexible yet high-performance architectures is increasing. To tackle the flexibility requirement, software-defined radio (SDR) is emerging as an obvious solution, where the underlying hardware implementation is tuned via software layers to the varied standards depending on power-performance and quality requirements leading to adaptable, cognitive radio. In this paper, we conduct a case study for representatives of two complexity classes of WCDMA channel estimation algorithms and explore the effect of flexibility on energy efficiency using different implementation options. Furthermore, we propose new design guidelines for both highly specialized architectures and highly flexible architectures using high-level synthesis, to enable the required performance and flexibility to support multiple applications. Our experiments with various design points show that the resulting architectures meet the performance constraints of WCDMA and a wide range of options are offered for tuning such architectures depending on power/performance/area constraints of SDR

    High-level Modelling and Exploration of Coarse-grained Re-configurable Architectures

    Full text link

    Power and Energy Aware Heterogeneous Computing Platform

    Get PDF
    During the last decade, wireless technologies have experienced significant development, most notably in the form of mobile cellular radio evolution from GSM to UMTS/HSPA and thereon to Long-Term Evolution (LTE) for increasing the capacity and speed of wireless data networks. Considering the real-time constraints of the new wireless standards and their demands for parallel processing, reconfigurable architectures and in particular, multicore platforms are part of the most successful platforms due to providing high computational parallelism and throughput. In addition to that, by moving toward Internet-of-Things (IoT), the number of wireless sensors and IP-based high throughput network routers is growing at a rapid pace. Despite all the progression in IoT, due to power and energy consumption, a single chip platform for providing multiple communication standards and a large processing bandwidth is still missing.The strong demand for performing different sets of operations by the embedded systems and increasing the computational performance has led to the use of heterogeneous multicore architectures with the help of accelerators for computationally-intensive data-parallel tasks acting as coprocessors. Currently, highly heterogeneous systems are the most power-area efficient solution for performing complex signal processing systems. Additionally, the importance of IoT has increased significantly the need for heterogeneous and reconfigurable platforms.On the other hand, subsequent to the breakdown of the Dennardian scaling and due to the enormous heat dissipation, the performance of a single chip was obstructed by the utilization wall since all cores cannot be clocked at their maximum operating frequency. Therefore, a thermal melt-down might be happened as a result of high instantaneous power dissipation. In this context, a large fraction of the chip, which is switched-off (Dark) or operated at a very low frequency (Dim) is called Dark Silicon. The Dark Silicon issue is a constraint for the performance of computers, especially when the up-coming IoT scenario will demand a very high performance level with high energy efficiency. Among the suggested solution to combat the problem of Dark-Silicon, the use of application-specific accelerators and in particular Coarse-Grained Reconfigurable Arrays (CGRAs) are the main motivation of this thesis work.This thesis deals with design and implementation of Software Defined Radio (SDR) as well as High Efficiency Video Coding (HEVC) application-specific accelerators for computationally intensive kernels and data-parallel tasks. One of the most important data transmission schemes in SDR due to its ability of providing high data rates is Orthogonal Frequency Division Multiplexing (OFDM). This research work focuses on the evaluation of Heterogeneous Accelerator-Rich Platform (HARP) by implementing OFDM receiver blocks as designs for proof-of-concept. The HARP template allows the designer to instantiate a heterogeneous reconfigurable platform with a very large amount of custom-tailored computational resources while delivering a high performance in terms of many high-level metrics. The availability of this platform lays an excellent foundation to investigate techniques and methods to replace the Dark or Dim part of chip with high-performance silicon dissipating very low power and energy. Furthermore, this research work is also addressing the power and energy issues of the embedded computing systems by tailoring the HARP for self-aware and energy-aware computing models. In this context, the instantaneous power dissipation and therefore the heat dissipation of HARP are mitigated on FPGA/ASIC by using Dynamic Voltage and Frequency Scaling (DVFS) to minimize the dark/dim part of the chip. Upgraded HARP for self-aware and energy-aware computing can be utilized as an energy-efficient general-purpose transceiver platform that is cognitive to many radio standards and can provide high throughput while consuming as little energy as possible. The evaluation of HARP has shown promising results, which makes it a suitable platform for avoiding Dark Silicon in embedded computing platforms and also for diverse needs of IoT communications.In this thesis, the author designed the blocks of OFDM receiver by crafting templatebased CGRA devices and then attached them to HARP’s Network-on-Chip (NoC) nodes. The performance of application-specific accelerators generated from templatebased CGRAs, the performance of the entire platform subsequent to integrating the CGRA nodes on HARP and the NoC traffic are recorded in terms of several highlevel performance metrics. In evaluating HARP on FPGA prototype, it delivers a performance of 0.012 GOPS/mW. Because of the scalability and regularity in HARP, the author considered its value as architectural constant. In addition to showing the gain and the benefits of maximizing the number of reconfigurable processing resources on a platform in comparison to the scaled performance of several state-of-the-art platforms, HARP’s architectural constant ensures application-independent figure of merit. HARP is further evaluated by implementing various sizes of Discrete Cosine transform (DCT) and Discrete Sine Transform (DST) dedicated for HEVC standard, which showed its ability to sustain Full HD 1080p format at 30 fps on FPGA. The author also integrated self-aware computing model in HARP to mitigate the power dissipation of an OFDM receiver. In the case of FPGA implementation, the total power dissipation of the platform showed 16.8% reduction due to employing the Feedback Control System (FCS) technique with Dynamic Frequency Scaling (DFS). Furthermore, by moving to ASIC technology and scaling both frequency and voltage simultaneously, significant dynamic power reduction (up to 82.98%) was achieved, which proved the DFS/DVFS techniques as one step forward to mitigate the Dark Silicon issue

    Rewriting History: Repurposing Domain-Specific CGRAs

    Full text link
    Coarse-grained reconfigurable arrays (CGRAs) are domain-specific devices promising both the flexibility of FPGAs and the performance of ASICs. However, with restricted domains comes a danger: designing chips that cannot accelerate enough current and future software to justify the hardware cost. We introduce FlexC, the first flexible CGRA compiler, which allows CGRAs to be adapted to operations they do not natively support. FlexC uses dataflow rewriting, replacing unsupported regions of code with equivalent operations that are supported by the CGRA. We use equality saturation, a technique enabling efficient exploration of a large space of rewrite rules, to effectively search through the program-space for supported programs. We applied FlexC to over 2,000 loop kernels, compiling to four different research CGRAs and 300 generated CGRAs and demonstrate a 2.2×\times increase in the number of loop kernels accelerated leading to 3×\times speedup compared to an Arm A5 CPU on kernels that would otherwise be unsupported by the accelerator

    Compiler and Architecture Design for Coarse-Grained Programmable Accelerators

    Get PDF
    abstract: The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density increases significantly by technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, increasing energy-efficiency must be addressed at all design levels from circuit level to application and algorithm levels. At architectural level, one promising approach is to populate the system with hardware accelerators each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy-efficiency and programmability. Due to intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism. Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consists of a number of word-level functional units. Motivated by promising characteristics of software programmable accelerators, the potentials of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and CGRA compiler. First, the design and implementation of a CGRA and its instruction set is presented. This design is then modeled in a cycle accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize CGRA scarce resources very well to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRADissertation/ThesisDoctoral Dissertation Computer Science 201

    Description and Specialization of Coarse-grained Reconfigurable Architectures

    Get PDF
    The functionality of electronic embedded systems, such as mobile phones and digital cameras, becomes more complex at each product generation. This increasing complexity implies great challenges at the design phase of these devices, as designers have to deal with high performance and low energy requirements at a low production budget. In the last years, coarse-grained, dynamically reconfigurable computer systems have increasingly gain in importance as an alternative to cope with these challenges because they provide an optimal trade-off between flexibility-after-production and performance. Like generic purpose processors, coarse-grained reconfigurable systems can be quickly reprogrammed to perform new tasks, but they keep their performance and energy consumption near to ASIC standards. The design of coarse-grained reconfigurable processors is the main theme in this work. In the first part of this dissertation, I present a new architecture description language that was designed for the description of coarse-grained, reconfigurable systems. This language allows an efficient specification of processor arrays and the description of scalable interconnection networks. The second part of this dissertation investigates the specialization of coarse-grained reconfigurable processors towards an application domain by using custom instruction sets. This work presents methods, techniques, and tools to recognize and extract clusters of operations from a set of application. These clusters serve as patterns for the design of an optimal custom instruction set. Experiments and results are presented, which analyze and assess the impact of custom instructions on coarse-grained processor arrays.Die Funktionalität eingebetteter Systeme wie Mobiltelefone und digitale Foto-Kameras wird zunehmend umfangreicher und bürdet dem Entwurf dieser Geräte hohe Herausforderungen auf, wie z.B. hohe Ausführungsgeschwindigkeit, niedrige Herstellungskosten und geringeren Energieverbrauch. Um diese Herausforderungen zu bewältigen, gewinnen grobgranulare dynamische rekonfigurierbare Rechnersysteme schnell an Bedeutung, denn sie bieten einen optimalen trade-off zwischen Flexibilität nach der Herstellung und Performanz. Wie allgemeine Prozessoren, können grobgranulare rekonfigurierbare Systeme während der Ausführungszeit schnell umprogrammiert werden, um neue Funktionalitäten auszuführen, behalten aber immer noch eine ASIC-ähnliche Performanz und Verlustleistungsverbrauch. Der Entwurf grobgranularer rekonfigurierbarer Bausteine ist das Thema dieser Dissertation. Im ersten Teil dieser Dissertation wird eine Sprache vorgestellt, die für die Beschreibung grobgranularer rekonfigurierbarer Systeme entwickelt wurde. Diese Sprache ermöglicht eine effiziente Spezifikation von Prozessoren-Arrays und die Beschreibung skalierbarer Netzwerkverbindungen. Der zweite Teil untersucht die Anpassung grobgranularer rekonfigurierbarer Bausteine an Anwendungssätze mittels spezialisierter Befehle. Methoden werden vorgestellt zur Erkennung und Extraktion von Operationsmustern aus einem Anwendungssatz. Diese Operationsmuster dienen dann zum Entwurf eines optimalen spezialisierten Befehlsatzes. Als Ergebnisse werden die Wirkungen von spezialisierten Befehlsätzen in grobgranularen Arrays analysiert und bewertet

    Performance evaluation of MPEG-4 Video encoder on Adres

    Get PDF
    Curs 2006-2007Actualment un típic embedded system (ex. telèfon mòbil) requereix alta qualitat per portar a terme tasques com codificar/descodificar a temps real; han de consumir poc energia per funcionar hores o dies utilitzant bateries lleugeres; han de ser el suficientment flexibles per integrar múltiples aplicacions i estàndards en un sol aparell; han de ser dissenyats i verificats en un període de temps curt tot i l’augment de la complexitat. Els dissenyadors lluiten contra aquestes adversitats, que demanen noves innovacions en arquitectures i metodologies de disseny. Coarse-grained reconfigurable architectures (CGRAs) estan emergent com a candidats potencials per superar totes aquestes dificultats. Diferents tipus d’arquitectures han estat presentades en els últims anys. L’alta granularitat redueix molt el retard, l’àrea, el consum i el temps de configuració comparant amb les FPGAs. D’altra banda, en comparació amb els tradicionals processadors coarse-grained programables, els alts recursos computacionals els permet d’assolir un alt nivell de paral•lelisme i eficiència. No obstant, els CGRAs existents no estant sent aplicats principalment per les grans dificultats en la programació per arquitectures complexes. ADRES és una nova CGRA dissenyada per I’Interuniversity Micro-Electronics Center (IMEC). Combina un processador very-long instruction word (VLIW) i un coarse-grained array per tenir dues opcions diferents en un mateix dispositiu físic. Entre els seus avantatges destaquen l’alta qualitat, poca redundància en les comunicacions i la facilitat de programació. Finalment ADRES és un patró enlloc d’una arquitectura concreta. Amb l’ajuda del compilador DRESC (Dynamically Reconfigurable Embedded System Compile), és possible trobar millors arquitectures o arquitectures específiques segons l’aplicació. Aquest treball presenta la implementació d’un codificador MPEG-4 per l’ADRES. Mostra l’evolució del codi per obtenir una bona implementació per una arquitectura donada. També es presenten les característiques principals d’ADRES i el seu compilador (DRESC). Els objectius són de reduir al màxim el nombre de cicles (temps) per implementar el codificador de MPEG-4 i veure les diferents dificultats de treballar en l’entorn ADRES. Els resultats mostren que els cícles es redueixen en un 67% comparant el codi inicial i final en el mode VLIW i un 84% comparant el codi inicial en VLIW i el final en mode CGA.Nowadays, a typical embedded system requires high performance to perform tasks such as video encoding/decoding at run-time. It should consume little energy to work hours or even days using a lightweight battery. It should be flexible enough to integrate multiple applications and standards in one single device. It has to be designed and verified in short time-to-market despite substantially increased complexity. The designers are struggling to meet these huge challenges, which call for innovations of both architectures and design methodology. Coarse-grained reconfigurable architectures (CGRAs) are emerging as potential candidates to meet the above challenges. Many of them were proposed in recent years. This coarse granularity greatly reduces delay, area, power and configuration time compared with FPGAs. On the other hand, compared with traditional "coarse-grained" programmable processors, their massive computational resources enable them to achieve high parallelism and efficiency. However, existing CGRAs have yet been widely adopted mainly because of programming difficulty for such complex architecture. ADRES is a novel CGRA designed by Interuniversity Micro-Electronics Center (IMEC). It tightly couples a very-long instruction word (VLIW) processor and a coarse-grained array by providing two functional views on the same physical resources. It brings advantages such as high performance, low communication overhead and easiness of programming. Finally, ADRES is a template instead of a concrete architecture. With the retargetable compilation support from DRESC (Dynamically Reconfigurable Embedded System Compile), architectural exploration becomes possible to discover better architectures or design domain-specific architectures. In this thesis, a performance of an MPEG-4 encoder in ADRES is presented. The thesis shows the code evolution to obtain a good implementation for a given architecture. The main features of ADRES and its compiler (DRESC) are presented. The objectives are to reduce as much as possible the amount of cycles (time) spent to encode video in MPEG-4 and test different issues working with ADRES environment. The cycles decrease a 67% comparing initial and final code in VLIW and 84% between initial VLIW and CGA mode.Director/a: Moisès Serra i SerraSupervisor: Eric Delfoss

    Virtual Runtime Application Partitions for Resource Management in Massively Parallel Architectures

    Get PDF
    This thesis presents a novel design paradigm, called Virtual Runtime Application Partitions (VRAP), to judiciously utilize the on-chip resources. As the dark silicon era approaches, where the power considerations will allow only a fraction chip to be powered on, judicious resource management will become a key consideration in future designs. Most of the works on resource management treat only the physical components (i.e. computation, communication, and memory blocks) as resources and manipulate the component to application mapping to optimize various parameters (e.g. energy efficiency). To further enhance the optimization potential, in addition to the physical resources we propose to manipulate abstract resources (i.e. voltage/frequency operating point, the fault-tolerance strength, the degree of parallelism, and the configuration architecture). The proposed framework (i.e. VRAP) encapsulates methods, algorithms, and hardware blocks to provide each application with the abstract resources tailored to its needs. To test the efficacy of this concept, we have developed three distinct self adaptive environments: (i) Private Operating Environment (POE), (ii) Private Reliability Environment (PRE), and (iii) Private Configuration Environment (PCE) that collectively ensure that each application meets its deadlines using minimal platform resources. In this work several novel architectural enhancements, algorithms and policies are presented to realize the virtual runtime application partitions efficiently. Considering the future design trends, we have chosen Coarse Grained Reconfigurable Architectures (CGRAs) and Network on Chips (NoCs) to test the feasibility of our approach. Specifically, we have chosen Dynamically Reconfigurable Resource Array (DRRA) and McNoC as the representative CGRA and NoC platforms. The proposed techniques are compared and evaluated using a variety of quantitative experiments. Synthesis and simulation results demonstrate VRAP significantly enhances the energy and power efficiency compared to state of the art.Siirretty Doriast
    corecore