2,237 research outputs found

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Get PDF
    The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    O pior caso estático de otimização do tempo de execução utilizando dpso para arquitetura ASIP

    Get PDF
    Introduction: The application of specific instructions significantly improves energy, performance, and code size of configurable processors. The design of these instructions is performed by the conversion of patterns related to application-specific operations into effective complex instructions. This research was presented at the icitkm Conference, University of Delhi, India in 2017. Methods: Static analysis was a prominent research method during late the 1980’s. However, end-to-end measurements consist of a standard approach in industrial settings. Both static analysis tools perform at a high-level in order to determine the program structure, which works on source code, or is executable in a disassembled binary. It is possible to work at a low-level if the real hardware timing information for the executable task has the desired features. Results: We experimented, tested and evaluated using a H.264 encoder application that uses nine cis, covering most of the computation intensive kernels. Multimedia applications are frequently subject to hard real time constraints in the field of computer vision. The H.264 encoder consists of complicated control flow with more number of decisions and nested loops. The parameters evaluated were different numbers of A partitions (300 slices on a Xilinx Virtex 7each), reconfiguration bandwidths, as well as relations of cpu frequency and fabric frequency fCPU/ffabric. ffabric remains constant at 100MHz, and we selected a multiplicity of its values for fCPU that resemble realistic units. Note that while we anticipate the wcet in seconds (wcetcycles/ f CPU) to be lower (better) with higher fCPU, the wcet cycles increase (at a constant ffabric) because hardware cis perform less computations on the reconfigurable fabric within one cpu cycle.    Introducción: la aplicación de instrucciones específicas mejora significativamente la energía, el rendimiento y el tamaño del código de los procesadores configurables. El diseño de estas instrucciones se realiza mediante conversión de patrones relacionados con operaciones específicas de la aplicación con instrucciones complejas y efectivas. Esta investigación se presentó en la Conferencia icitkm, Universidad de Delhi, India en 2017. Métodos: el análisis estático fue un método de investigación prominente durante la década de 1980; sin embargo, las mediciones de extremo a extremo son un enfoque convencional en los entornos industriales. Ambas herramientas de análisis estático se desempeñan a un alto nivel para determinar la estructura del programa que funciona en el código fuente, o que se ejecuta en un binario desmontado. Es posible trabajar a bajo nivel si la información de tiempo de hardware real para la tarea ejecutable presenta las características deseadas.  Introdução: a aplicação de instruções específicas melhora significativamente a energia, o desempenho e o tamanho do código dos processadores configuráveis. O desenho dessas instruções é realizado mediante a conversão de padrões relacionados com operações específicas da aplicação com instruções complexas e efetivas. Esta pesquisa foi apresentada na Conferência icitkm, Universidade de Délhi, Índia em 2017.Métodos: a análise estática foi um método de pesquisa proeminente durante a década de 1980; contudo, as medições de extremo a extremo são uma abordagem convencional nos contextos industriais. Ambas as ferramentas de análise estática se desempenham a um alto nível para determinar a estrutura do programa que funciona no código fonte ou que se executa num binário desmontado. É possível trabalhar a baixo nível se a informação de tempo de hardware real para a tarefa executável apresentar as características desejadas.Resultados: experimentamos, testamos e avaliamos com uma aplicação de codificação H.264 que utiliza nove elementos de configuração e cobre a maioria dos núcleos de cálculo intensivo. As aplicações multimídias estão com frequência sujeitas a duras restrições em tempo real no campo da visão por computador. O codificador H.264 consiste num complicado fluxo de controle com mais número de decisões e circuitos aninhados. Os parâmetros avaliados foram de diferentes números de particiones A (300 cortes num Xilinx Virtex 7 cada um) e largos de banda de reconfiguração, bem como de relações de frequência de cpu e frequência de fabric fcpu/ffabric. ffabric permanece constante a 100MHz. Selecionamos vários de seus valores para fcpu que são semelhantes a unidades realistas. É importante considerar que, ainda quando antecipamos o wcet em segundos (ciclos wcet/ fcpu), para que fossem inferiores (melhores) com fcpu mais alta, os ciclos wcet aumentam (num tecido constante f) porque os ci de hardware realizam menos cálculos no tecido reconfigurável dentro de uma cpu de ciclo.Conclusões: o método é similar à hibridação de árvores e métodos baseados en rotas, os quais são menos precisos, e ao método I pet global, que é mais preciso. A otimização é avaliada com o algoritmo de otimização por enxame de partículas discretas (dpso) para wcet. Para várias aplicações do mundo real que envolvem processadores integrados, a técnica proposta desenvolve conjuntos de instruções melhoradas em comparação com os conjuntos de instruções nativas.Originalidade: para a estimativa de wcet, deve-se considerar a análise de fluxo, a análise de baixo nível e as fases de cálculo do programa. A fase de análise de fluxo ou alto nível de análise ajuda a extrair o comportamento dinâmico do programa que proporciona informação sobre as funções invocadas, sobre o número de iterações de circuito, as dependências entre sentenças if, etc. Isso se deve a que a análise desconhece a rota de execução correspondente ao tempo de execução mais longo.Limitações: essa rota é executada dentro de uma iteração do núcleo que depende da natureza de mb, seja i-mb, seja p-mb, determinada pelo núcleo de estimativa de movimento, quer dizer que sua entrada depende das rotas i-mb e p-mb, que também contêm elementos de configuração separados que conduzem à instabilidade da rota do pior dos casos; em outras palavras, adicionar mais partições à rota atual do pior dos casos pode fazer com que a outra rota se converta no pior dos casos. A tubulação se detém pela demora de reconfiguração e continua ao ingressar no núcleo assim que finaliza o processo de reconfiguraçã

    Automatic design of domain-specific instructions for low-power processors

    Get PDF
    This paper explores hardware specialization of low­ power processors to improve performance and energy efficiency. Our main contribution is an automated framework that analyzes instruction sequences of applications within a domain at the loop body level and identifies exactly and partially-matching sequences across applications that can become custom instructions. Our framework transforms sequences to a new code abstraction, a Merging Diagram, that improves similarity identification, clusters alike groups of potential custom instructions to effectively reduce the search space, and selects merged custom instructions to efficiently exploit the available customizable area. For a set of 11 media applications, our fast framework generates instructions that significantly improve the energy-delay product and speed­ up, achieving more than double the savings as compared to a technique analyzing sequences within basic blocks. This paper shows that partially-matched custom instructions, which do not significantly increase design time, are crucial to achieving higher energy efficiency at limited hardware areas

    Rapid evaluation of custom instruction selection approaches with FPGA estimation

    Get PDF

    Achieving a better balance between productivity and performance on FPGAs through Heterogeneous Extensible Multiprocessor Systems

    Get PDF
    Field Programmable Gate Arrays (FPGAs) were first introduced circa 1980, and they held the promise of delivering performance levels associated with customized circuits, but with productivity levels more closely associated with software development. Achieving both performance and productivity objectives has been a long standing challenge problem for the reconfigurable computing community and remains unsolved today. On one hand, Vendor supplied design flows have tended towards achieving the high levels of performance through gate level customization, but at the cost of very low productivity. On the other hand, FPGA densities are following Moore\u27s law and and can now support complete multiprocessor system architectures. Thus FPGAs can be turned into an architecture with programmable processors which brings productivity but sacrifices the peak performance advantages of custom circuits. In this thesis we explore how the two use cases can be combined to achieve the best from both. The flexibility of the FPGAs to host a heterogeneous multiprocessor system with different types of programmable processors and custom accelerators allows the software developers to design a platform that matches the unique performance needs of their application. However, currently no automated approaches are publicly available to create such heterogeneous architectures as well as the software support for these platforms. Creating base architectures, configuring multiple tool chains, and repetitive engineering design efforts can and should be automated. This thesis introduces Heterogeneous Extensible Multiprocessor System (HEMPS) template approach which allows an FPGA to be programmed with productivity levels close to those associated with parallel processing, and with performance levels close to those associated with customized circuits. The work in this thesis introduces an ArchGen script to automate the generation of HEMPS systems as well as a library of portable and self tuning polymorphic functions. These tools will abstract away the HW/SW co-design details and provide a transparent programming language to capture different levels of parallelisms, without sacrificing productivity or portability

    Accuracy-Guaranteed Fixed-Point Optimization in Hardware Synthesis and Processor Customization

    Get PDF
    RÉSUMÉ De nos jours, le calcul avec des nombres fractionnaires est essentiel dans une vaste gamme d’applications de traitement de signal et d’image. Pour le calcul numérique, un nombre fractionnaire peut être représenté à l’aide de l’arithmétique en virgule fixe ou en virgule flottante. L’arithmétique en virgule fixe est largement considérée préférable à celle en virgule flottante pour les architectures matérielles dédiées en raison de sa plus faible complexité d’implémentation. Dans la mise en œuvre du matériel, la largeur de mot attribuée à différents signaux a un impact significatif sur des métriques telles que les ressources (transistors), la vitesse et la consommation d'énergie. L'optimisation de longueur de mot (WLO) en virgule fixe est un domaine de recherche bien connu qui vise à optimiser les chemins de données par l'ajustement des longueurs de mots attribuées aux signaux. Un nombre en virgule fixe est composé d’une partie entière et d’une partie fractionnaire. Il y a une limite inférieure au nombre de bits alloués à la partie entière, de façon à prévenir les débordements pour chaque signal. Cette limite dépend de la gamme de valeurs que peut prendre le signal. Le nombre de bits de la partie fractionnaire, quant à lui, détermine la taille de l'erreur de précision finie qui est introduite dans les calculs. Il existe un compromis entre la précision et l'efficacité du matériel dans la sélection du nombre de bits de la partie fractionnaire. Le processus d'attribution du nombre de bits de la partie fractionnaire comporte deux procédures importantes: la modélisation de l'erreur de quantification et la sélection de la taille de la partie fractionnaire. Les travaux existants sur la WLO ont porté sur des circuits spécialisés comme plate-forme cible. Dans cette thèse, nous introduisons de nouvelles méthodologies, techniques et algorithmes pour améliorer l’implémentation de calculs en virgule fixe dans des circuits et processeurs spécialisés. La thèse propose une approche améliorée de modélisation d’erreur, basée sur l'arithmétique affine, qui aborde certains problèmes des méthodes existantes et améliore leur précision. La thèse introduit également une technique d'accélération et deux algorithmes semi-analytiques pour la sélection de la largeur de la partie fractionnaire pour la conception de circuits spécialisés. Alors que le premier algorithme suit une stratégie de recherche progressive, le second utilise une méthode de recherche en forme d'arbre pour l'optimisation de la largeur fractionnaire. Les algorithmes offrent deux options de compromis entre la complexité de calcul et le coût résultant. Le premier algorithme a une complexité polynomiale et obtient des résultats comparables avec des approches heuristiques existantes. Le second algorithme a une complexité exponentielle, mais il donne des résultats quasi-optimaux par rapport à une recherche exhaustive. Cette thèse propose également une méthode pour combiner l'optimisation de la longueur des mots dans un contexte de conception de processeurs configurables. La largeur et la profondeur des blocs de registres et l'architecture des unités fonctionnelles sont les principaux objectifs ciblés par cette optimisation. Un nouvel algorithme d'optimisation a été développé pour trouver la meilleure combinaison de longueurs de mots et d'autres paramètres configurables dans la méthode proposée. Les exigences de précision, définies comme l'erreur pire cas, doivent être respectées par toute solution. Pour faciliter l'évaluation et la mise en œuvre des solutions retenues, un nouvel environnement de conception de processeur a également été développé. Cet environnement, qui est appelé PolyCuSP, supporte une large gamme de paramètres, y compris ceux qui sont nécessaires pour évaluer les solutions proposées par l'algorithme d'optimisation. L’environnement PolyCuSP soutient l’exploration rapide de l'espace de solution et la capacité de modéliser différents jeux d'instructions pour permettre des comparaisons efficaces.----------ABSTRACT Fixed-point arithmetic is broadly preferred to floating-point in hardware development due to the reduced hardware complexity of fixed-point circuits. In hardware implementation, the bitwidth allocated to the data elements has significant impact on efficiency metrics for the circuits including area usage, speed and power consumption. Fixed-point word-length optimization (WLO) is a well-known research area. It aims to optimize fixed-point computational circuits through the adjustment of the allocated bitwidths of their internal and output signals. A fixed-point number is composed of an integer part and a fractional part. There is a minimum number of bits for the integer part that guarantees overflow and underflow avoidance in each signal. This value depends on the range of values that the signal may take. The fractional word-length determines the amount of finite-precision error that is introduced in the computations. There is a trade-off between accuracy and hardware cost in fractional word-length selection. The process of allocating the fractional word-length requires two important procedures: finite-precision error modeling and fractional word-length selection. Existing works on WLO have focused on hardwired circuits as the target implementation platform. In this thesis, we introduce new methodologies, techniques and algorithms to improve the hardware realization of fixed-point computations in hardwired circuits and customizable processors. The thesis proposes an enhanced error modeling approach based on affine arithmetic that addresses some shortcomings of the existing methods and improves their accuracy. The thesis also introduces an acceleration technique and two semi-analytical fractional bitwidth selection algorithms for WLO in hardwired circuit design. While the first algorithm follows a progressive search strategy, the second one uses a tree-shaped search method for fractional width optimization. The algorithms offer two different time-complexity/cost efficiency trade-off options. The first algorithm has polynomial complexity and achieves comparable results with existing heuristic approaches. The second algorithm has exponential complexity but achieves near-optimal results compared to an exhaustive search. The thesis further proposes a method to combine word-length optimization with application-specific processor customization. The supported datatype word-length, the size of register-files and the architecture of the functional units are the main target objectives to be optimized. A new optimization algorithm is developed to find the best combination of word-length and other customizable parameters in the proposed method. Accuracy requirements, defined as the worst-case error bound, are the key consideration that must be met by any solution. To facilitate evaluation and implementation of the selected solutions, a new processor design environment was developed. This environment, which is called PolyCuSP, supports necessary customization flexibility to realize and evaluate the solutions given by the optimization algorithm. PolyCuSP supports rapid design space exploration and capability to model different instruction-set architectures to enable effective compari

    Instruction-set customization for multi-tasking embedded systems

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH
    corecore