221 research outputs found

    A survey of emerging architectural techniques for improving cache energy consumption

    Get PDF
    The search goes on for another ground breaking phenomenon to reduce the ever-increasing disparity between the CPU performance and storage. There are encouraging breakthroughs in enhancing CPU performance through fabrication technologies and changes in chip designs but not as much luck has been struck with regards to the computer storage resulting in material negative system performance. A lot of research effort has been put on finding techniques that can improve the energy efficiency of cache architectures. This work is a survey of energy saving techniques which are grouped on whether they save the dynamic energy, leakage energy or both. Needless to mention, the aim of this work is to compile a quick reference guide of energy saving techniques from 2013 to 2016 for engineers, researchers and students

    CROSS-LAYER CUSTOMIZATION FOR LOW POWER AND HIGH PERFORMANCE EMBEDDED MULTI-CORE PROCESSORS

    Get PDF
    Due to physical limitations and design difficulties, computer processor architecture has shifted to multi-core and even many-core based approaches in recent years. Such architectures provide potentials for sustainable performance scaling into future peta-scale/exa-scale computing platforms, at affordable power budget, design complexity, and verification efforts. To date, multi-core processor products have been replacing uni-core processors in almost every market segment, including embedded systems, general-purpose desktops and laptops, and super computers. However, many issues still remain with multi-core processor architectures that need to be addressed before their potentials could be fully realized. People in both academia and industry research community are still seeking proper ways to make efficient and effective use of these processors. The issues involve hardware architecture trade-offs, the system software service, the run-time management, and user application design, which demand more research effort into this field. Due to the architectural specialties with multi-core based computers, a Cross-Layer Customization framework is proposed in this work, which combines application specific information and system platform features, along with necessary operating system service support, to achieve exceptional power and performance efficiency for targeted multi-core platforms. Several topics are covered with specific optimization goals, including snoop cache coherence protocol, inter-core communication for producer-consumer applications, synchronization mechanisms, and off-chip memory bandwidth limitations. Analysis of benchmark program execution with conventional mechanisms is made to reveal the overheads in terms of power and performance. Specific customizations are proposed to eliminate such overheads with support from hardware, system software, compiler, and user applications. Experiments show significant improvement on system performance and power efficiency

    Doctor of Philosophy

    Get PDF
    dissertationWith the explosion of chip transistor counts, the semiconductor industry has struggled with ways to continue scaling computing performance in line with historical trends. In recent years, the de facto solution to utilize excess transistors has been to increase the size of the on-chip data cache, allowing fast access to an increased portion of main memory. These large caches allowed the continued scaling of single thread performance, which had not yet reached the limit of instruction level parallelism (ILP). As we approach the potential limits of parallelism within a single threaded application, new approaches such as chip multiprocessors (CMP) have become popular for scaling performance utilizing thread level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where single threaded performance and multithreaded performance have often been ignored by computer architects. We propose that novel hardware and OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing several common applications found in the datacenter. We demonstrate that architectural improvements have had little to no effect on the performance of the OS over the last 15 years, leaving ample room for improvements. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider the potential of a separate operating system processor (OSP) operating concurrently with general purpose processors (GPP) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors. Second, we consider the potential of segregating existing caching structures to decrease cache interference between the OS and application. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache topology aware, which in turn, improves the performance and scalability of many-threaded applications

    Performance and power optimizations in chip multiprocessors for throughput-aware computation

    Get PDF
    The so-called "power (or power density) wall" has caused core frequency (and single-thread performance) to slow down, giving rise to the era of multi-core/multi-thread processors. For example, the IBM POWER4 processor, released in 2001, incorporated two single-thread cores into the same chip. In 2010, IBM released the POWER7 processor with eight 4-thread cores in the same chip, for a total capacity of 32 execution contexts. The ever increasing number of cores and threads gives rise to new opportunities and challenges for software and hardware architects. At software level, applications can benefit from the abundant number of execution contexts to boost throughput. But this challenges programmers to create highly-parallel applications and operating systems capable of scheduling them correctly. At hardware level, the increasing core and thread count puts pressure on the memory interface, because memory bandwidth grows at a slower pace ---phenomenon known as the "bandwidth (or memory) wall". In addition to memory bandwidth issues, chip power consumption rises due to manufacturers' difficulty to lower operating voltages sufficiently every processor generation. This thesis presents innovations to improve bandwidth and power consumption in chip multiprocessors (CMPs) for throughput-aware computation: a bandwidth-optimized last-level cache (LLC), a bandwidth-optimized vector register file, and a power/performance-aware thread placement heuristic. In contrast to state-of-the-art LLC designs, our organization avoids data replication and, hence, does not require keeping data coherent. Instead, the address space is statically distributed all over the LLC (in a fine-grained interleaving fashion). The absence of data replication increases the cache effective capacity, which results in better hit rates and higher bandwidth compared to a coherent LLC. We use double buffering to hide the extra access latency due to the lack of data replication. The proposed vector register file is composed of thousands of registers and organized as an aggregation of banks. We leverage such organization to attach small special-function "local computation elements" (LCEs) to each bank. This approach ---referred to as the "processor-in-regfile" (PIR) strategy--- overcomes the limited number of register file ports. Because each LCE is a SIMD computation element and all of them can proceed concurrently, the PIR strategy constitutes a highly-parallel super-wide-SIMD device (ideal for throughput-aware computation). Finally, we present a heuristic to reduce chip power consumption by dynamically placing software (application) threads across hardware (physical) threads. The heuristic gathers chip-level power and performance information at runtime to infer characteristics of the applications being executed. For example, if an application's threads share data, the heuristic may decide to place them in fewer cores to favor inter-thread data sharing and communication. In such case, the number of active cores decreases, which is a good opportunity to switch off the unused cores to save power. It is increasingly harder to find bulletproof (micro-)architectural solutions for the bandwidth and power scalability limitations in CMPs. Consequently, we think that architects should attack those problems from different flanks simultaneously, with complementary innovations. This thesis contributes with a battery of solutions to alleviate those problems in the context of throughput-aware computation: 1) proposing a bandwidth-optimized LLC; 2) proposing a bandwidth-optimized register file organization; and 3) proposing a simple technique to improve power-performance efficiency.El excesivo consumo de potencia de los procesadores actuales ha desacelerado el incremento en la frecuencia operativa de los mismos para dar lugar a la era de los procesadores con múltiples núcleos y múltiples hilos de ejecución. Por ejemplo, el procesador POWER7 de IBM, lanzado al mercado en 2010, incorpora ocho núcleos en el mismo chip, con cuatro hilos de ejecución por núcleo. Esto da lugar a nuevas oportunidades y desafíos para los arquitectos de software y hardware. A nivel de software, las aplicaciones pueden beneficiarse del abundante número de núcleos e hilos de ejecución para aumentar el rendimiento. Pero esto obliga a los programadores a crear aplicaciones altamente paralelas y sistemas operativos capaces de planificar correctamente la ejecución de las mismas. A nivel de hardware, el creciente número de núcleos e hilos de ejecución ejerce presión sobre la interfaz de memoria, ya que el ancho de banda de memoria crece a un ritmo más lento. Además de los problemas de ancho de banda de memoria, el consumo de energía del chip se eleva debido a la dificultad de los fabricantes para reducir suficientemente los voltajes de operación entre generaciones de procesadores. Esta tesis presenta innovaciones para mejorar el ancho de banda y consumo de energía en procesadores multinúcleo en el ámbito de la computación orientada a rendimiento ("throughput-aware computation"): una memoria caché de último nivel ("last-level cache" o LLC) optimizada para ancho de banda, un banco de registros vectorial optimizado para ancho de banda, y una heurística para planificar la ejecución de aplicaciones paralelas orientada a mejorar la eficiencia del consumo de potencia y desempeño. En contraste con los diseños de LLC de última generación, nuestra organización evita la duplicación de datos y, por tanto, no requiere de técnicas de coherencia. El espacio de direcciones de memoria se distribuye estáticamente en la LLC con un entrelazado de grano fino. La ausencia de replicación de datos aumenta la capacidad efectiva de la memoria caché, lo que se traduce en mejores tasas de acierto y mayor ancho de banda en comparación con una LLC coherente. Utilizamos la técnica de "doble buffering" para ocultar la latencia adicional necesaria para acceder a datos remotos. El banco de registros vectorial propuesto se compone de miles de registros y se organiza como una agregación de bancos. Incorporamos a cada banco una pequeña unidad de cómputo de propósito especial ("local computation element" o LCE). Este enfoque ---que llamamos "computación en banco de registros"--- permite superar el número limitado de puertos en el banco de registros. Debido a que cada LCE es una unidad de cómputo con soporte SIMD ("single instruction, multiple data") y todas ellas pueden proceder de forma concurrente, la estrategia de "computación en banco de registros" constituye un dispositivo SIMD altamente paralelo. Por último, presentamos una heurística para planificar la ejecución de aplicaciones paralelas orientada a reducir el consumo de energía del chip, colocando dinámicamente los hilos de ejecución a nivel de software entre los hilos de ejecución a nivel de hardware. La heurística obtiene, en tiempo de ejecución, información de consumo de potencia y desempeño del chip para inferir las características de las aplicaciones. Por ejemplo, si los hilos de ejecución a nivel de software comparten datos significativamente, la heurística puede decidir colocarlos en un menor número de núcleos para favorecer el intercambio de datos entre ellos. En tal caso, los núcleos no utilizados se pueden apagar para ahorrar energía. Cada vez es más difícil encontrar soluciones de arquitectura "a prueba de balas" para resolver las limitaciones de escalabilidad de los procesadores actuales. En consecuencia, creemos que los arquitectos deben atacar dichos problemas desde diferentes flancos simultáneamente, con innovaciones complementarias

    Power-aware caches for GPGPUs

    Get PDF
    In this thesis, we propose two optimization techniques to reduce power consumption in L1 caches (data, texture and constant), shared memory and L2 cache. The first optimization technique targets static power. Evaluation of GPGPU applications shows that once a cache block is accessed by a thread, it takes several hundreds of clock cycles until the same block is accessed again. The long inter-access cycle can be used to put cache cells into drowsy mode and reduce static power. While drowsy cells reduce static power, they increase access time as voltage of a cache cell in drowsy mode should be raised before the block can be accessed. To mitigate performance impact of drowsy cells, we propose a novel technique called coarse grained drowsy mode. In coarse grained drowsy mode, we partition each cache into regions of consecutive cache blocks and wake up a region upon cache access. Due to temporal and spatial locality of cache accesses, this method dramatically reduces performance impact caused by drowsy cells. The second optimization technique relies on branch divergence in GPGPUs. The execution model in GPGPUs is Single Instruction Multiple Thread (SIMT) which means processing cores execute the same instruction with different data for GPGPU threads. The SIMT execution model may result in divergence of threads when a control instruction is executed. GPGPUs execute branch instructions in two phases. In the first phase, threads in the taken path are active and the rest are idle. In the second phase, threads in the not-taken path are executed and the rest are idle. Contemporary GPGPUs access all portions of cache blocks, although some threads are idle due to branch divergence. We propose accessing only portions of cache blocks corresponding to active threads. By disabling unnecessary sections of cache blocks, we are able to reduce dynamic power of caches. Our results show that on average, the two optimization techniques together reduce power of caches by up to 98% and 15% for static and dynamic power, respectively

    Proposition for a Sequential Accelerator in Future General-Purpose Manycore Processors and the Problem of Migration-Induced Cache Misses

    Get PDF
    International audienceAs the number of transistors on a chip doubles with every technology generation, the number of on-chip cores also increases rapidly, making possible in a foreseeable future to design processors featuring hundreds of general-purpose cores. However, though a large number of cores speeds up parallel code sections, Amdahl's law requires speeding up sequential sections too. We argue that it will become possible to dedicate a substantial fraction of the chip area and power budget to achieve high sequential performance. Current general-purpose processors contain a handful of cores designed to be continuously active and run in parallel. This leads to power and thermal constraints that limit the core's performance. We propose removing these constraints with a {\it sequential accelerator} ({\bf SACC}). A SACC consists of several cores {\it designed} for ultimate sequential performance. These cores cannot run continuously. A single core is active at any time, the rest of the cores are inactive and power-gated. We migrate the execution periodically to another core to spread heat generation uniformly over the whole SACC area, thus addressing the temperature issue. The SACC will be viable only if it yields significant sequential performance. Migration-induced cache misses may limit performance gains. We propose some solutions to mitigate this problem. We also investigate a migration method using thermal sensors, such that the migration interval depends on the ambient temperature and the migration penalty is negligible under normal thermal conditions

    A survey of system level power management schemes in the dark-silicon era for many-core architectures

    Get PDF
    Power consumption in Complementary Metal Oxide Semiconductor (CMOS) technology has escalated to a point that only a fractional part of many-core chips can be powered-on at a time. Fortunately, this fraction can be increased at the expense of performance through the dark-silicon solution. However, with many-core integration set to be heading towards its thousands, power consumption and temperature increases per time, meaning the number of active nodes must be reduced drastically. Therefore, optimized techniques are demanded for continuous advancement in technology. Existing efforts try to overcome this challenge by activating nodes from different parts of the chip at the expense of communication latency. Other efforts on the other hand employ run-time power management techniques to manage the power performance of the cores trading-off performance for power. We found out that, for a significant amount of power to saved and high temperature to be avoided, focus should be on reducing the power consumption of all the on-chip components. Especially, the memory hierarchy and the interconnect. Power consumption can be minimized by, reducing the size of high leakage power dissipating elements, turning-off idle resources and integrating power saving materials

    Proposition for a sequential accelerator in future general-purpose manycore processors

    Get PDF
    The number of transistors that can be put on a given silicon area doubles on every technology generation. Consequently, the number of on-chip cores increases quickly, making it possible to build general-purpose processors with hundreds of cores in a near future. However, though having a large number of cores is beneficial for speeding up parallel code sections, it is also important to speed up sequential execution. We argue that it will be possible and desirable to dedicate a large fraction of the chip area and power to high sequential performance. Current processor design styles are restrained by the implicit constraint that a processor core should be able to run continuously; therefore power hungry techniques that would allow very high clock frequencies are not used. The "sequential accelerator" we propose removes the constraint of continuous functioning. The sequential accelerator consists of several cores designed for ultimate instantaneous performance. Those cores are large and power hungry, they cannot run continuously (thermal constraint) and cannot be active simultaneously (power constraint) . A single core is active at any time, inactive cores are power-gated. The execution is migrated periodically to a new core so as to spread the heat generation uniformly over the whole accelerator area, which solves the temperature issue. The "sequential accelerator" will be a viable solution only if the performance penalty due to migrations can be tolerated. Migration-induced cache misses may incur a significant performance loss. We propose some solutions to alleviate this problem. We also propose a migration method, using integrated thermal sensors, such that the migration interval is variable and depends on the ambient temperature. The migration penalty can be kept negligible as long as the ambient temperature is maintained below a threshold
    corecore