13 research outputs found

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Static resource models for code generation of embedded processors

    Get PDF
    xii+129hlm.;24c

    Balancing Design Options with Sherpa

    Get PDF
    Application specific processors offer the potential of rapidly designed logic specifically constructed to meet the performance and area demands of the task at hand. Recently, there have been several major projects that attempt to automate the process of transforming a predetermined processor configuration into a low level description for fabrication. These projects either leave the specification of the processor to the designer, which can be a significant engineering burden, or handle it in a fully automated fashion, which completely removes the designer from the loop. In this paper we introduce a technique for guiding the design and optimization of application specific processors. The goal of the Sherpa design framework is to automate certain design tasks and provide early feedback to help the designer navigate their way through the architecture design space. Our approach is to decompose the overall problem of choosing an optimal architecture into a set of sub-problems that are, to the first order, independent. For each subproblem, we create a model that relates performance to area. From this, we build a constraint system that can be solved using integer-linear programming techniques, and arrive at an ideal parameter selection for all architectural components. Our approach only takes a few minutes to explore the design space allowing the designer or compiler to see the potential benefits of optimizations rapidly. We show that the expected performance using our model correlates strongly to detailed pipeline simulations, and present results showing design tradeoffs for several different benchmarks

    Low power architectures for streaming applications

    Get PDF

    Escalonamento de instruções em arquiteturas VLIW particionadas explorando Bypassing de operandos

    Get PDF
    Orientador : Guido Costa Souza de Araujo, Paulo Cesar CentoducatteDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: A incansável busca por máquinas mais velozes, aliada aos enormes avanços tecnológicos na concepção de circuitos integrados, retiraram as arquiteturas Very Long Instruction W ord (VLIW) de um estado amórfico para a realidade. Embora tenham surgido como CIs recentemente [1], as máquinas VLIW foram idealizadas há algumas décadas atrás [13, 16, 22, 23]. Os processadores que definem este modelo de processamento não mais obedecem regras clássicas de execução: instruções de um dos possíveis fluxos de controle de um comando de desvio condicional são executadas mesmo antes do término da avaliação da condição, a qual determinará se a transferência de controle deverá ocorrer ou não; executam simultaneamente inúmeras instruções, de diferentes tipos, oriundas do mesmo programa; computam programas que foram compilados de uma forma revolucionária: todo o programa é analisado em busca de operações paralelizáveis, como se fosse um único (macro) bloco. Numa tentativa de contribuição a esta linha de pesquisa, este trabalho visa a detecção e exploração do paralelismo 'escondido' em programas originalmente sequenciais. Esta busca gera resultados que são analisados e quantificados com o intuito de se encontrar uma arquitetura-alvo adequada para uma aplicação específica. Esta metodologia encontra-se inserida no contexto de uma área denominada Embedded Systems, a qual se preocupa em otimizar ao máximo a execução de uma classe restrita de aplicações ou até mesmo uma única aplicação-chave de um sistema dedicado. O modelo de arquitetura considerado neste trabalho é denominado VLIW particionado (do inglês partitioned VLIW). Este modelo difere da máquina VLIW ideal pelo fato de não possuir um único banco de registradores centralizado, mas sim vários bancos de registradores que se comunicam através de barramentos especiais. Com este modelo de arquitetura em mãos, o trabalho desenvolvido nesta dissertação trata da investigação de problemas relacionados com o mapeamento de uma aplicação específica a uma máquina VLIW dedicada. Em um macro-cenário, este trabalho tenta responder a seguinte questão: "Qual é a máquina VLIW adequada para uma dada aplicação ?,'. Ou ainda, "Quantos bancos de registradores e quantas unidades funcionais o processador para esta aplicação deveria ter?"Abstract: The untiring search for faster machines, alIied to the great technological advances in the field of integrated circuits conception, brought out the Very Long Instruction Word architectures from an amorphous status to reality. Although they have appeared recently as real chips [1], the VLIW machines were idealized some decades ago [13, 16, 22, 23]. The microprocessors that define this processing model no longer obey classical rules of execution: instructions coming from one of the possible control flows resulted of a branch instruction are executed even before the finish of the evaluation condition. This evaluation condition will determine if the control transfer should occur or noto Also, these architectures execute simultaneously many instructions, of different kinds, issued from the same programo Moreover, these processors compute programs that were compiled through a revolutionary way: alI the program is analized to search for paralelizable operations. As an attempt to contribute to this research field, this work aim the development of a methodology to detect and exploit the paralelism "hided" in sequential-written programs. The results generated by this search are analized and quantified in order to find a targetarchitecture for a specific application. This work is inserted in the context of an area calIed Embedded Systems. This research field worry about the maximum optimization of an application class or even only one key-application of a embedded system. The architecture model considered in this work is denoted as "Partitioned VLIW Architecture". This model is slightly different of the ideal VLIW architecture model. In the ideal model, there must be only one centralized register file, in order to guarantee the maximum Instruction Levei ParalIelism (ILP). AlI the functional units share the same register file. On the other hand, the architecture model being considered here presents many distributed register files, which have an special bus to communicate data among them. With this architecture model in mind, the work developed in this thesis investigates some of the problems related to mapping one specific application to an embedded VLIW architecture. Roughly speaking, this work tries to answer the following question: "What is the ideal VLIW architecture for a given application'1" or "How many register files and how many functional units the processor for that application should have '1"MestradoMestre em Ciência da Computaçã

    System-level power optimization:techniques and tools

    Get PDF
    This tutorial surveys design methods for energy-efficient system-level design. We consider electronic sytems consisting of a hardware platform and software layers. We consider the three major constituents of hardware that consume energy, namely computation, communication, and storage units, and we review methods of reducing their energy consumption. We also study models for analyzing the energy cost of software, and methods for energy-efficient software design and compilation. This survery is organized around three main phases of a system design: conceptualization and modeling design and implementation, and runtime management. For each phase, we review recent techniques for energy-efficient design of both hardware and software

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    Energy-aware synthesis for networks on chip architectures

    Full text link
    The Network on Chip (NoC) paradigm was introduced as a scalable communication infrastructure for future System-on-Chip applications. Designing application specific customized communication architectures is critical for obtaining low power, high performance solutions. Two significant design automation problems are the creation of an optimized configuration, given application requirement the implementation of this on-chip network. Automating the design of on-chip networks requires models for estimating area and energy, algorithms to effectively explore the design space and network component libraries and tools to generate the hardware description. Chip architects are faced with managing a wide range of customization options for individual components, routers and topology. As energy is of paramount importance, the effectiveness of any custom NoC generation approach lies in the availability of good energy models to effectively explore the design space. This thesis describes a complete NoC synthesis flow, called NoCGEN, for creating energy-efficient custom NoC architectures. Three major automation problems are addressed: custom topology generation, energy modeling and generation. An iterative algorithm is proposed to generate application specific point-to-point and packet-switched networks. The algorithm explores the design space for efficient topologies using characterized models and a system-level floorplanner for evaluating placement and wire-energy. Prior to our contribution, building an energy model required careful analysis of transistor or gate implementations. To alleviate the burden, an automated linear regression-based methodology is proposed to rapidly extract energy models for many router designs. The resulting models are cycle accurate with low-complexity and found to be within 10% of gate-level energy simulations, and execute several orders of magnitude faster than gate-level simulations. A hardware description of the custom topology is generated using a parameterizable library and custom HDL generator. Fully reusable and scalable network components (switches, crossbars, arbiters, routing algorithms) are described using a template approach and are used to compose arbitrary topologies. A methodology for building and composing routers and topologies using a template engine is described. The entire flow is implemented as several demonstrable extensible tools with powerful visualization functionality. Several experiments are performed to demonstrate the design space exploration capabilities and compare it against a competing min-cut topology generation algorithm
    corecore