751 research outputs found

    Power Modeling and Optimization for GPGPUs

    Modern graphics processing units (GPUs) support tens of thousands of parallel threads and deliver remarkably high computing throughput. General-purpose computing on GPUs (GPGPU) is becoming an attractive platform for applications that demand high computational performance, such as scientific computing, financial applications, and medical data processing. However, GPGPUs face a severe power challenge due to the increasing number of cores placed on a single chip at ever-smaller feature sizes. To explore power optimization techniques for GPGPUs, I first build a power model that estimates both the dynamic and leakage power of the major microarchitectural structures in GPGPUs. I then target the power-hungry structures (e.g., the register file) to explore energy-efficient GPGPU designs. To hide long-latency operations, GPGPUs employ fine-grained multi-threading among numerous active threads, which leads to sizeable register files with massive power consumption. The conventional method to reduce dynamic power consumption is supply voltage scaling, and the inter-band tunneling FET (TFET) is a promising candidate compared to CMOS for low-voltage operation with respect to both leakage and performance. However, always executing at low voltage results in significant performance degradation. In this study, I propose a hybrid CMOS-TFET register file: TFET-based registers are allocated to threads whose execution progress can be delayed to some degree, avoiding memory contention with other threads and reducing both dynamic and leakage power, while CMOS-based registers are used for threads requiring normal execution speed. My experimental results show that the proposed technique achieves a 30% energy reduction (including both dynamic and leakage) in the register file with negligible performance degradation compared to a baseline equipped with a naive power optimization technique.
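    A minimal sketch of the kind of allocation policy the abstract describes, assuming warps are classified by how much their progress can be delayed without hurting performance; the warp fields, MSHR-occupancy threshold, and fallback rules below are hypothetical illustrations, not details from the study:

```python
# Illustrative sketch (not the study's implementation): per-warp register
# allocation for a hybrid CMOS-TFET register file. Memory-bound warps that
# would only contend for memory resources get slower, lower-power TFET
# registers; latency-critical warps keep fast CMOS registers.

CMOS, TFET = "cmos", "tfet"

def classify_warp(outstanding_misses, mshr_occupancy, threshold=0.75):
    """A warp facing near-full MSHRs can tolerate TFET-speed register
    access: delaying it spreads memory traffic instead of adding stalls."""
    return TFET if mshr_occupancy >= threshold and outstanding_misses > 0 else CMOS

def allocate_registers(warps, cmos_free, tfet_free):
    """Map each warp's registers to the CMOS or TFET partition,
    falling back to the other partition when the preferred one is full."""
    mapping = {}
    for w in warps:
        kind = classify_warp(w["misses"], w["mshr_occ"])
        if kind == TFET and tfet_free >= w["regs"]:
            tfet_free -= w["regs"]
        elif cmos_free >= w["regs"]:
            kind = CMOS
            cmos_free -= w["regs"]
        else:
            kind = TFET
            tfet_free -= w["regs"]
        mapping[w["id"]] = kind
    return mapping

warps = [{"id": 0, "regs": 32, "misses": 4, "mshr_occ": 0.9},
         {"id": 1, "regs": 32, "misses": 0, "mshr_occ": 0.1}]
print(allocate_registers(warps, cmos_free=64, tfet_free=64))
# {0: 'tfet', 1: 'cmos'} -- memory-bound warp delayed, compute warp kept fast
```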

    Efficient register renaming and recovery for high-performance processors

    Modern superscalar processors implement register renaming using either random-access memory (RAM) or content-addressable memory (CAM) tables. The design of these structures should address both access time and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are more appropriate for avoiding recovery penalties. The associative ports in CAMs, however, prevent them from scaling with the number of physical registers and pipeline width, negatively impacting performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM-CAM register renaming scheme that combines the best of both approaches. In the steady state, a RAM provides fast and energy-efficient access to register mappings; on misspeculation, a low-complexity CAM enables immediate recovery. Experimental results show that in a four-way state-of-the-art superscalar processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid RAM-CAM scheme does not exceed the area required by conventional renaming mechanisms. This work was supported in part by the Spanish MINECO under Grant TIN2012-38341-C04-01. Petit Martí, S. V.; Ubal Tena, R.; Sahuquillo Borrás, J.; López Rodríguez, P. J. (2014). Efficient register renaming and recovery for high-performance processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(7):1506-1514. https://doi.org/10.1109/TVLSI.2013.2270001
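    A toy model of the hybrid idea as the abstract describes it (the undo-log organization and sequence tags are my assumptions, not the authors' hardware design): a RAM table serves steady-state lookups, while a CAM-like structure holds enough information to restore mappings immediately on misspeculation instead of walking a reorder buffer.

```python
# Sketch of hybrid RAM/CAM renaming. The "cam" list plays the role of the
# associative structure: in hardware, all entries younger than the squashed
# branch are found in one associative step; here we iterate newest-first.

class HybridRenamer:
    def __init__(self, n_logical, n_physical):
        self.ram = list(range(n_logical))            # logical -> physical (fast path)
        self.free = list(range(n_logical, n_physical))
        self.cam = []                                # (seq, logical, old_physical)

    def rename(self, seq, logical):
        """Allocate a new physical register for a destination write."""
        new_p = self.free.pop(0)
        self.cam.append((seq, logical, self.ram[logical]))  # remember old mapping
        self.ram[logical] = new_p
        return new_p

    def recover(self, bad_seq):
        """Squash everything younger than bad_seq by undoing its mappings."""
        while self.cam and self.cam[-1][0] > bad_seq:
            seq, logical, old_p = self.cam.pop()
            self.free.insert(0, self.ram[logical])   # reclaim squashed register
            self.ram[logical] = old_p

r = HybridRenamer(n_logical=4, n_physical=8)
r.rename(seq=1, logical=0)
r.rename(seq=2, logical=0)
r.recover(bad_seq=1)   # squash seq 2; logical 0 maps back to seq 1's register
print(r.ram[0])        # 4
```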

    Chapter One – An Overview of Architecture-Level Power- and Energy-Efficient Design Techniques

    Power dissipation and energy consumption have become the primary design constraints for almost all computer systems in the last 15 years. Both computer architects and circuit designers strive to reduce power and energy (without degrading performance) at all design levels, as power is currently the main obstacle to continued scaling according to Moore's law. The aim of this survey is to provide a comprehensive overview of state-of-the-art power- and energy-efficient techniques. We classify techniques by the component they apply to, which is the most natural view from a designer's standpoint. We further divide the techniques by the component of power/energy they optimize (static or dynamic), thereby covering the complete low-power design flow at the architectural level. We conclude that only a holistic approach that combines optimizations at all design levels can lead to significant savings.
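    As a concrete anchor for the static/dynamic split the survey organizes around, the standard first-order CMOS power model (a textbook approximation, not taken from the chapter) shows why supply-voltage scaling attacks dynamic power quadratically while leakage scales roughly linearly with supply voltage:

```latex
% First-order CMOS power model: switching activity \alpha, switched
% capacitance C, supply voltage V_{dd}, clock frequency f, leakage
% current I_{leak}.
P_{\mathrm{total}} = \underbrace{\alpha\, C\, V_{dd}^{2}\, f}_{\text{dynamic}}
                   + \underbrace{V_{dd}\, I_{\mathrm{leak}}}_{\text{static}}
```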

    A novel architecture for large windows processors

    Several processor architectures with large instruction windows have been proposed. They improve performance by keeping hundreds of instructions in flight to increase the level of instruction-level parallelism (ILP). Such architectures replace the re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of processor resources. Check-pointing, however, leads to imprecise state recovery on mispredicted branches and exceptions, and to frequent re-execution of correct-path instructions during state recovery. It also requires large register files, complicating renaming, allocation, and release of physical registers. This technical report proposes a new processor architecture that uses neither a traditional ROB nor check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. Its novel register management architecture allows the implementation of large register files with simpler and more scalable register renaming and commit, and it is also key to the precise recovery mechanism.

    A distributed processor state management architecture for large-window processors

    Processor architectures with large instruction windows have been proposed to expose more instruction-level parallelism (ILP) and increase performance. Some of the proposed architectures replace the re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of processor resources. Check-pointing, however, leads to imprecise processor state recovery on mispredicted branches and exceptions, and to re-execution of correct-path instructions after state recovery. It also requires large register files, complicating renaming, allocation, and release of physical registers. This paper proposes a new processor architecture called the Multi-State Processor (MSP). The MSP does not use check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. The MSP uses a novel register management architecture that allows the implementation of large register files with simpler and more scalable register allocation, renaming, and release; it is also key to the precise processor state recovery mechanism. The MSP is shown to improve IPC by 14%, on average, for integer SPEC CPU2000 benchmarks compared to a check-pointing based mechanism ([2]) when a fast and simple branch predictor is used. With a very aggressive branch predictor, the IPC improvement is 1% on average, and 3% if some of the programs are optimized for the MSP. The MSP also reduces the average number of executed instructions by 16.5% (12% for the aggressive branch predictor), mostly due to precise state recovery. This improves the MSP's energy efficiency even though it uses a larger register file.
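    A back-of-the-envelope illustration of why precise recovery reduces executed instructions relative to check-pointing (the positions and checkpoint spacing are hypothetical, and this is not the MSP mechanism itself): with checkpoints every K instructions, every misprediction re-executes the correct-path instructions between the last checkpoint and the branch.

```python
# Toy comparison: instructions re-executed after branch mispredictions under
# checkpoint-based recovery vs. precise recovery. A branch resolved at
# position p rolls back to the checkpoint at K * (p // K), re-running the
# correct-path instructions in between; precise recovery re-runs none.

def reexecuted(mispredict_positions, checkpoint_interval):
    K = checkpoint_interval
    return sum(p - (p // K) * K for p in mispredict_positions)

branches = [7, 21, 34, 58]        # hypothetical branch-resolve positions
print(reexecuted(branches, 16))   # checkpointing every 16: 24 instructions re-run
print(reexecuted(branches, 1))    # precise recovery: 0
```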

    Adaptable register file organization for vector processors

    Today there are two main trends in vector processor design. On the one hand, we have vector processors designed for long vectors, such as the SX-Aurora TSUBASA, which implements vector lengths of 256 elements (16384 bits). On the other hand, we have vector processors designed for short vectors, such as the Fujitsu A64FX, which implements ARM SVE with vector lengths of 8 elements (512 bits). Short-vector designs, however, are the most widely adopted in modern chips. This is because, to achieve high performance with very high efficiency, applications executed on long-vector designs must feature abundant DLP, which limits the range of suitable applications; short-vector designs, by contrast, are compatible with a larger range of applications. Initially, long-vector implementations targeted the HPC market, while short-vector implementations were conceived to improve performance in multimedia tasks, and those short-vector extensions have since evolved to better fit the needs of modern applications. We believe this compatibility with a large range of applications featuring high, medium, and low DLP is one of the main reasons behind the trend of building parallel machines with short vectors. Short-vector designs are area-efficient and remain compatible with applications having long vectors; however, they are not as efficient as longer-vector designs when executing high-DLP code. In this thesis, we propose a novel vector architecture that combines the area and resource efficiency characteristic of short-vector processors with the ability to handle large-DLP applications, as long-vector architectures do. In this context, we present AVA, an Adaptable Vector Architecture designed for short vectors (MVL = 16 elements) and capable of reconfiguring the MVL when executing applications with abundant DLP, achieving performance comparable to designs for long vectors. The design is based on three complementary concepts. First, a two-stage renaming unit based on a new type of register termed the Virtual Vector Register (VVR), an intermediate mapping between the conventional logical registers and the physical and memory registers: in the first stage, logical registers are renamed to VVRs, while in the second stage, VVRs are renamed to physical registers. Second, a two-level VRF that supports 64 VVRs whose MVL can be configured from 16 to 128 elements. The first level corresponds to the VVRs mapped to the physical registers held in the 8 KB Physical Vector Register File (P-VRF), while the second level represents the VVRs mapped to memory registers held in the Memory Vector Register File (M-VRF). While the baseline configuration (MVL = 16 elements) holds all the VVRs in the P-VRF, larger MVL configurations hold a subset of the VVRs in the P-VRF and map the remaining ones to the M-VRF. Third, a novel two-stage vector issue unit: the first stage performs the second level of mapping between VVRs and physical registers, while the second stage manages issue to execution. This thesis also presents a set of tools for designing and evaluating vector architectures: first, a parameterizable vector architecture model implemented on the gem5 simulator to evaluate novel ideas on vector architectures; second, a vector architecture model implemented on the McPAT framework to evaluate power and area metrics;
and finally, the RiVEC benchmark suite, a collection of ten vectorized applications from different domains, focused on benchmarking vector microarchitectures.
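    A minimal sketch of AVA's two-stage mapping as I read the abstract; the P-VRF capacity, the spill policy, and performing both stages in a single call are simplifying assumptions (the abstract places the second stage in the issue unit):

```python
# Sketch of two-stage renaming with Virtual Vector Registers (VVRs):
# stage 1 maps logical registers to VVRs; stage 2 maps each VVR either to
# a physical register in the P-VRF or, when the configured MVL exceeds
# P-VRF capacity, to a memory slot in the M-VRF.

N_VVR, N_PHYS = 64, 40          # 64 VVRs per the abstract; N_PHYS is assumed

class AVARenamer:
    def __init__(self):
        self.log2vvr = {}                      # stage 1: logical -> VVR
        self.vvr2loc = {}                      # stage 2: VVR -> ("P", i) / ("M", i)
        self.free_vvr = list(range(N_VVR))
        self.free_phys = list(range(N_PHYS))
        self.next_mem = 0

    def rename(self, logical):
        vvr = self.free_vvr.pop(0)             # stage 1 (front-end rename)
        self.log2vvr[logical] = vvr
        if self.free_phys:                     # stage 2 (in the issue unit per
            self.vvr2loc[vvr] = ("P", self.free_phys.pop(0))   # the abstract;
        else:                                  # inlined here for brevity)
            self.vvr2loc[vvr] = ("M", self.next_mem)
            self.next_mem += 1
        return vvr, self.vvr2loc[vvr]

r = AVARenamer()
print(r.rename(3))   # e.g. (0, ('P', 0)) -- VVR 0 backed by the P-VRF
```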

    Resource Banking: An Energy-efficient, Run-time Adaptive Processor Design Technique

    From the earliest and simplest scalar computation engines to modern superscalar out-of-order processors, the evolution of computational machinery during the past century has largely been driven by a single goal: performance. In today's world of cheap, billion-plus-transistor processors, and with an exploding market in mobile computing, a design landscape has emerged where energy efficiency, arguably more than any other single metric, determines the viability of a processor for a given application. The historical emphasis on performance has left modern processors bloated and over-provisioned for everyday tasks, in the hope that some performance improvement will be observed during computationally intensive periods. This work explores an energy-efficient processor design technique that ensures even a highly over-provisioned out-of-order processor keeps only as many of its computational resources active as it requires for efficient computation at any given time. Specifically, it examines the feasibility of a dynamically banked register file and reorder buffer with variable banking policies that enable unused rename registers or reorder-buffer entries to be voltage gated (turned off) during execution to save power. The impact of bank placement, turn-off and turn-on policies, and rail-stabilization latencies is explored for high-performance desktop and server designs as well as low-power mobile processors.
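    A small sketch of a banked register file with voltage gating, using hypothetical bank counts, policies, and rail-stabilization latency (the thesis evaluates such parameters; none of the numbers below are its results):

```python
# Sketch: a register file split into banks, where empty banks are voltage
# gated. A bank turned back on pays a rail-stabilization delay before its
# registers are usable, modeling the turn-on cost the abstract mentions.

BANK_SIZE, N_BANKS, WAKEUP_CYCLES = 16, 8, 3   # hypothetical parameters

class BankedRegFile:
    def __init__(self):
        self.used = [0] * N_BANKS                  # live registers per bank
        self.on = [True] + [False] * (N_BANKS - 1)
        self.wake_at = [0] * N_BANKS               # cycle when a waking bank is ready

    def alloc(self, cycle):
        """Return (bank, ready_cycle). Prefer powered-on banks with space."""
        for b in range(N_BANKS):
            if self.on[b] and self.used[b] < BANK_SIZE and cycle >= self.wake_at[b]:
                self.used[b] += 1
                return b, cycle
        b = self.on.index(False)    # turn-on policy: wake the first gated bank
        self.on[b] = True           # (assumes one remains; real HW would stall)
        self.wake_at[b] = cycle + WAKEUP_CYCLES
        self.used[b] += 1
        return b, self.wake_at[b]   # usable only after the rail stabilizes

    def free(self, bank):
        self.used[bank] -= 1
        if self.used[bank] == 0 and bank != 0:
            self.on[bank] = False   # turn-off policy: gate empty banks

rf = BankedRegFile()
print(rf.alloc(cycle=0))   # (0, 0): bank 0 is already powered on
```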