61 research outputs found

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained ReconïŹgurable Array (CGRA) architectures accelerate the same inner loops that beneïŹt from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efïŹciently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on ïŹ‚exibility, performance, and power-efïŹciency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual ïŹne-tuning of source code

    Design and resource management of reconfigurable multiprocessors for data-parallel applications

    Get PDF
    FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application Specific Programmable Circuiits (ASPCs), which are developer programmable instead of user programmable. The major disadvantages of ASPCs are minimal programmability, and significant time and energy overheads caused by required hardware reconfiguration when the problem size outnumbers the available reconfigurable resources; these problems are expected to become more serious with increases in the FPGA chip size. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems. This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing. It also proposes a relevant resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems reduce significantly runtime device reconfiguration by employing userprogrammable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages to our data-parallel applications when compared to ASPCs; they are simpler and more scalable, and have less verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices primarily from power engineering. We expect even further performance gains on MPoPCs in the near future by employing ever improving FPGAs. The innovative nature of this work has the potential to guide research in this arising field of high-performance, low-cost reconfigurable computing. The largest advantage of reconfigurable logic lies in its large degree of hardware customization and reconfiguration which allows reusing the resources to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performanceenergy objectives are proposed. A system-level energy model for HERA, which is based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton\u27s method is proposed and employed to verify the methodology

    MULTI-OBJECTIVE DESIGN AUTOMATION FOR RECONFIGURABLE MULTI-PROCESSOR SYSTEMS

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Energy-efficient hardware design based on high-level synthesis

    Get PDF
    This dissertation describes research activities broadly concerning the area of High-level synthesis (HLS), but more specifically, regarding the HLS-based design of energy-efficient hardware (HW) accelerators. HW accelerators, mostly implemented on FPGAs, are integral to the heterogeneous architectures employed in modern high performance computing (HPC) systems due to their ability to speed up the execution while dramatically reducing the energy consumption of computationally challenging portions of complex applications. Hence, the first activity was regarding an HLS-based approach to directly execute an OpenCL code on an FPGA instead of its traditional GPU-based counterpart. Modern FPGAs offer considerable computational capabilities while consuming significantly smaller power as compared to high-end GPUs. Several different implementations of the K-Nearest Neighbor algorithm were considered on both FPGA- and GPU-based platforms and their performance was compared. FPGAs were generally more energy-efficient than the GPUs in all the test cases. Eventually, we were also able to get a faster (in terms of execution time) FPGA implementation by using an FPGA-specific OpenCL coding style and utilizing suitable HLS directives. The second activity was targeted towards the development of a methodology complementing HLS to automatically derive power optimization directives (also known as "power intent") from a system-level design description and use it to drive the design steps after HLS, by producing a directive file written using the common power format (CPF) to achieve power shut-off (PSO) in case of an ASIC design. The proposed LP-HLS methodology reduces the design effort by enabling designers to infer low power information from the system-level description of a design rather than at the RTL. This methodology required a SystemC description of a generic power management module to describe the design context of a HW module also modeled in SystemC, along with the development of a tool to automatically produce the CPF file to accomplish PSO. Several test cases were considered to validate the proposed methodology and the results demonstrated its ability to correctly extract the low power information and apply it to achieve power optimization in the backend flow

    Applications of reprogrammability in algorithm acceleration

    Get PDF
    This doctoral thesis consists of an introductory part and eight appended publications, which deal with hardware-based reprogrammability in algorithm acceleration with a specific emphasis on the possibilities offered by modern large-scale Field Programmable Gate Arrays (FPGAs) in computationally demanding applications. The historical evolution of both the theoretical and technological paths culminating in the introduction of reprogrammable logic devices is first outlined. This is followed by defining the commonly used terms in the thesis. The reprogrammable logic market is surveyed, and the architectural structures and the technological reasonings behind them are described in detail. As reprogrammable logic lies between Application Specific Integrated Circuits (ASICs) and general-purpose microprocessors in the implementation spectrum of electronics systems, special attention has been paid to differentiate these three implementation approaches. This has been done to emphasize, that reprogrammable logic offers much more than just a low-volume replacement for ASICs. Design systems for reprogrammable logic are investigated, as the learning curve associated with them is the main hurdle for software-oriented designers for using reprogrammable logic devices. The theoretically important topic of partial reprogrammability is described in detail, but it is concluded, that the practical problems in designing viable development platforms for partially reprogrammable systems will hinder its wide-spread adoption. The main technical, design-oriented, and economic applicability factors of reprogrammable logic are laid out. The main advantages of reprogrammable logic are their suitability for fine-grained bit-level parallelizable computing with a short time-to-market and low upfront costs. It is also concluded, that the main opportunities for reprogrammable logic lie in the potential of high-level design systems, and the ever-growing ASIC design gap. On the other hand, most power-conscious mass-market portable products do not seem to offer major new market potential for reprogrammable logic. The appended publications are examined and compared to contemporaneous research at other research institutions. The conclusion is that for relatively wide classes of well-defined computation problems, reprogrammable logic offers a more efficient solution than a software-centered approach, with a much shorter production cycle than is the case with ASICs.reviewe

    Techniques d'exploration architecturale de design à usage spécifique pour l'accélération de boucles

    Get PDF
    RÉSUMÉ De nos jours, les industriels privilĂ©gient les architectures flexibles afin de rĂ©duire le temps et les coĂ»ts de conception d’un systĂšme. Les processeurs Ă  usage spĂ©cifique (ASIP) fournissent beaucoup de flexibilitĂ©, tout en atteignant des performances Ă©levĂ©es. Une tendance qui a de plus en plus de succĂšs dans le processus de conception d’un systĂšme sur puce consiste Ă  spĂ©cifier le comportement du systĂšme en langage Ă©voluĂ© tel que le C, SystemC, etc. La spĂ©cification est ensuite utilisĂ©e durant le partitionement pour dĂ©terminer les composantes logicielles et matĂ©rielles du systĂšme. Avec la maturitĂ© des gĂ©nĂ©rateurs automatiques de ASIP, les concepteurs peuvent rajouter dans leurs boĂźtes Ă  outils un nouveau type d’architecture, Ă  savoir les ASIP, en sachant que ces derniers sont conçus Ă  partir d’une spĂ©cification dĂ©crite en langage Ă©voluĂ©. D’un autre cĂŽtĂ©, dans le monde matĂ©riel, et cela depuis trĂšs longtemps, les chercheurs ont vu l’avantage de baser le processus de conception sur un langage Ă©voluĂ©. Cette recherche a abouti Ă  l’avĂ©nement de gĂ©nĂ©rateurs automatiques de matĂ©riel sur le marchĂ© qui sont des outils d’aide Ă  la conception comme CapatultC, Forte’s Cynthetizer, etc. Ainsi, avec tous ces outils basĂ©s sur le langage C, les concepteurs ont un choix de types de design Ă©largi mais, d’un autre cĂŽtĂ©, les options de designs possibles explosent, ce qui peut allonger au lieu de rĂ©duire le temps de conception. C’est dans ce cadre que notre thĂšse doctorale s’inscrit, puisqu’elle prĂ©sente des mĂ©thodologies d’exploration architecturale de design Ă  usage spĂ©cifique pour l’accĂ©lĂ©ration de boucles afin de rĂ©duire le temps de conception, entre autres. Cette thĂšse a dĂ©butĂ© par l’exploration de designs de ASIP. Les boucles de traitement sont de bonnes candidates Ă  l’accĂ©lĂ©ration, si elles comportent de bonnes possibilitĂ©s de parallĂ©lisme et si ces derniĂšres sont bien exploitĂ©es. Le matĂ©riel est trĂšs efficace Ă  profiter des possibilitĂ©s de parallĂ©lisme au niveau instruction, donc, une mĂ©thode de conception a Ă©tĂ© proposĂ©e. Cette derniĂšre extrait le parallĂ©lisme d’une boucle afin d’exĂ©cuter plus d’opĂ©rations concurrentes dans des instructions spĂ©cialisĂ©es. Notre mĂ©thode se base aussi sur l’optimisation des donnĂ©es dans l’architecture du processeur.---------- ABSTRACT Time to market is a very important concern in industry. That is why the industry always looks for new CAD tools that contribute to reducing design time. Application-specific instruction-set processors (ASIPs) provide flexibility and they allow reaching good performance if they are well designed. One trend that gains more and more success is C-based design that uses a high level language such as C, SystemC, etc. The C-based specification is used during the partitionning phase to determine the software and hardware components of the system. Since automatic processor generators are mature now, designers have a new type of tool they can rely on during architecture design. In the hardware world, high level synthesis was and is still a hot research topic. The advances in ESL lead to commercial high-level synthesis tools such as CapatultC, Forte’s Cynthetizer, etc. The designers have more tools in their box but they have more solutions to explore, thus their use can have a reverse effect since the design time can increase instead of being reduced. Our doctoral research tackles this issue by proposing new methodologies for design space exploration of application specific architecture for loop acceleration in order to reduce the design time while reaching some targeted performances. Our thesis starts with the exploration of ASIP design. We propose a method that targets loop acceleration with highly coupled specialized-instructions executing loop operations. Loops are good candidates for acceleration when the parallelism they offer is well exploited (if they have any parallelization opportunities). Hardware components such as specialized-instructions can leverage parallelization opportunities at low level. Thus, we propose to extract loop parallelization opportunities and to execute more concurrent operations in specialized-instructions. The main contribution of this method is a new approach to specialized-instruction (SI) design based on loop acceleration where loop optimization and transformation are done in SIs directly, instead of optimizing the software code. Another contribution is the design of tightly-coupled specialized-instructions associated with loops based on a 5-pattern representation

    Optimisation des mémoires dans le flot de conception des systÚmes multiprocesseurs sur puces pour des applications de type multimédia

    Get PDF
    RÉSUMÉ Les systĂšmes multiprocesseurs sur puce (MPSoC) constituent l'un des principaux moteurs de la rĂ©volution industrielle des semi-conducteurs. Les MPSoCs jouissent d’une popularitĂ© grandissante dans le domaine des systĂšmes embarquĂ©s. Leur grande capacitĂ© de parallĂ©lisation Ă  un trĂšs haut niveau d'intĂ©gration, en font de bons candidats pour les systĂšmes et les applications telles que les applications multimĂ©dia. La consommation d’énergie, la capacitĂ© de calcul et l’espace de conception sont les Ă©lĂ©ments dont dĂ©pendent les performances de ce type d’applications. La mĂ©moire est le facteur clĂ© permettant d’amĂ©liorer de façon substantielle leurs performances. Avec l’arrivĂ©e des applications multimĂ©dias embarquĂ©es dans l’industrie, le problĂšme des gains de performances est vital. La masse de donnĂ©es traitĂ©es par ces applications requiert une grande capacitĂ© de calcul et de mĂ©moire. DerniĂšrement, de nouveaux modĂšles de programmation ont fait leur apparition. Ces modĂšles offrent une programmation de plus haut niveau pour rĂ©pondre aux besoins croissants des MPSoCs, d’oĂč la nĂ©cessitĂ© de nouvelles approches d'optimisation et de placement pour les systĂšmes embarquĂ©s et leurs modĂšles de programmation. La conception niveau systĂšme des architectures MPSoCs pour les applications de type multimĂ©dia constitue un vĂ©ritable dĂ©fi technique. L’objectif gĂ©nĂ©ral de cette thĂšse est de relever ce dĂ©fi en trouvant des solutions. Plus spĂ©cifiquement, cette thĂšse se propose d’introduire le concept d’optimisation mĂ©moire dans le flot de conception niveau systĂšme et d’observer leur impact sur diffĂ©rents modĂšles de programmation utilisĂ©s lors de la conception de MPSoCs. Il s’agit, autrement dit, de rĂ©aliser l’unification du domaine de la compilation avec celui de la conception niveau systĂšme pour une meilleure conception globale. La contribution de cette thĂšse est de proposer de nouvelles approches pour les techniques d'optimisation mĂ©moire pour la conception MPSoCs avec diffĂ©rents modĂšles de programmation. Nos travaux de recherche concernent l'intĂ©gration des techniques d’optimisation mĂ©moire dans le flot de conception de MPSoCs pour diffĂ©rents types de modĂšle de programmation. Ces travaux ont Ă©tĂ© exĂ©cutĂ©s en collaboration avec STMicroelectronics.----------ABSTRACT Multiprocessor systems-on-chip (MPSoC) are defined as one of the main drivers of the industrial semiconductors revolution. MPSoCs are gaining popularity in the field of embedded systems. Pursuant to their great ability to parallelize at a very high integration level, they are good candidates for systems and applications such as multimedia. Memory is becoming a key player for significant improvements in these applications (i.e. power, performance and area). With the emergence of more embedded multimedia applications in the industry, this issue becomes increasingly vital. The large amount of data manipulated by these applications requires high-capacity calculation and memory. Lately, new programming models have been introduced. These programming models offer a higher programming level to answer the increasing needs of MPSoCs. This leads to the need of new optimization and mapping approaches suitable for embedded systems and their programming models. The overall objective of this research is to find solutions to the challenges of system level design of applications such as multimedia. This entails the development of new approaches and new optimization techniques. The specific objective of this research is to introduce the concept of memory optimization in the system level conception flow and study its impact on different programming models used for MPSoCs’ design. In other words, it is the unification of the compilation and system level design domains. The contribution of this research is to propose new approaches for memory optimization techniques for MPSoCs’ design in different programming models. This thesis relates to the integration of memory optimization to varying programming model types in the MPSoCs conception flow. Our research was done in collaboration with STMicroelectronics
    • 

    corecore