Control speculation for energy-efficient next-generation superscalar processors
Conventional front-end designs attempt to maximize the number of "in-flight" instructions in the pipeline. However, branch mispredictions cause the processor to fetch useless instructions that are eventually squashed, increasing front-end energy and issue queue utilization and, thus, wasting around 30 percent of the power dissipated by a processor. Furthermore, processor design trends lead to increasing clock frequencies by lengthening the pipeline, which puts more pressure on the branch prediction engine since branches take longer to be resolved. As next-generation high-performance processors become deeply pipelined, the amount of wasted energy due to misspeculated instructions will go up. The aim of this work is to reduce the energy consumption of misspeculated instructions. We propose selective throttling, which triggers different power-aware techniques (fetch throttling, decode throttling, or disabling the selection logic) depending on the branch prediction confidence level. Results show that combining fetch-bandwidth reduction along with select-logic disabling provides the best performance in terms of overall energy reduction and energy-delay product improvement (14 percent and 10 percent, respectively, for a processor with a 22-stage pipeline and 16 percent and 13 percent, respectively, for a processor with a 42-stage pipeline).
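The policy described above can be sketched in a few lines: a confidence estimate for the in-flight branches selects one of the power-aware techniques, with lower confidence triggering more aggressive throttling. The class names, thresholds, and confidence scale below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of selective throttling: the action taken each cycle
# depends on the confidence the predictor assigns to in-flight branches.
# Thresholds and names are illustrative, not taken from the paper.

from enum import Enum

class Action(Enum):
    NONE = "full speed"
    FETCH_THROTTLE = "reduce fetch bandwidth"
    DECODE_THROTTLE = "stall decode"
    DISABLE_SELECT = "disable issue-queue selection logic"

def select_throttle(confidence: float) -> Action:
    """Map a branch-prediction confidence estimate in [0, 1] to a
    power-saving action; lower confidence -> more aggressive throttling."""
    if confidence >= 0.9:
        return Action.NONE            # likely correct path: keep fetching
    if confidence >= 0.6:
        return Action.FETCH_THROTTLE  # mildly suspect: narrow the front end
    if confidence >= 0.3:
        return Action.DECODE_THROTTLE # suspect: hold instructions at decode
    return Action.DISABLE_SELECT      # probably wrong path: stop issuing
```

The paper's best configuration combines the fetch-bandwidth and select-logic levels; in a sketch like this, that corresponds to the two outer branches of the ladder.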
Chapter One – An Overview of Architecture-Level Power- and Energy-Efficient Design Techniques
Power dissipation and energy consumption have become the primary design constraints for almost all computer systems over the last 15 years. Both computer architects and circuit designers aim to reduce power and energy (without performance degradation) at all design levels, as power is currently the main obstacle to further scaling according to Moore's law. The aim of this survey is to provide a comprehensive overview of state-of-the-art power- and energy-efficient techniques. We classify techniques by the component to which they apply, which is the most natural classification from a designer's point of view. We further divide the techniques by the component of power/energy they optimize (static or dynamic), thereby covering the complete low-power design flow at the architectural level. We conclude that only a holistic approach, combining optimizations at all design levels, can lead to significant savings.
Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors
Throughput processors such as GPUs continue to provide higher peak arithmetic capability. Designing a high-throughput memory system to keep the computational units busy is very challenging. Future throughput processors must continue to exploit data locality and utilize the on-chip and off-chip resources in the memory system more effectively to further improve memory system throughput. This dissertation advocates orchestrating the thread scheduler with the cache management algorithms to alleviate GPU cache thrashing and pollution, avoid bandwidth saturation and maximize GPU memory system throughput. Based on this principle, this thesis work proposes three mechanisms to improve cache efficiency and memory throughput. This thesis work enhances the thread throttling mechanism with the Priority-based Cache Allocation mechanism (PCAL). By estimating the cache miss ratio with a variable number of cache-feeding threads and monitoring the usage of key memory system resources, PCAL determines the number of threads to share the cache and the minimum number of threads bypassing the cache that saturate memory system resources. This approach reduces the cache thrashing problem and effectively employs chip resources that would otherwise go unused by a pure thread throttling approach. We observe a 67% improvement over the original as-is benchmarks and an 18% improvement over a better-tuned warp-throttling baseline. This work proposes the AgeLRU and Dynamic-AgeLRU mechanisms to address the inter-thread cache thrashing problem. AgeLRU prioritizes cache blocks at replacement based on the scheduling priority of their fetching warp. Dynamic-AgeLRU selects between the AgeLRU algorithm and the LRU algorithm adaptively to avoid degrading the performance of non-thrashing applications. There are three variants of the AgeLRU algorithm: (1) replacement-only, (2) bypassing, and (3) bypassing with traffic optimization.
Compared to the LRU algorithm, these three variants of the AgeLRU algorithm improve performance by 4%, 8% and 28%, respectively, across a set of cache-sensitive benchmarks. This thesis work develops the Reuse-Prediction-based cache Replacement scheme (RPR) for the GPU L1 data cache to address the intra-thread cache pollution problem. By combining the GPU thread scheduling priority with the fetching Program Counter (PC) to generate a signature as the index of the prediction table, RPR identifies and prioritizes the near-reuse blocks and high-reuse blocks to maximize cache efficiency. Compared to the AgeLRU algorithm, the experimental results show that the RPR algorithm yields a throughput improvement of 5% on average for regular applications, and a speedup of 3.2% on average across a set of cache-sensitive benchmarks. The techniques proposed in this dissertation alleviate the cache thrashing, cache pollution and resource saturation problems effectively. We believe that, when combined, these techniques will synergistically further improve GPU cache efficiency and overall memory system throughput.
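The core of AgeLRU-style replacement, as described above, is that each cached block remembers the scheduling priority of the warp that fetched it, and the victim is chosen from the lowest-priority warp first. A minimal sketch, with all identifiers assumed for illustration rather than taken from the dissertation:

```python
# Illustrative sketch of AgeLRU-style victim selection (names are
# assumptions, not the dissertation's implementation): each cached block
# records the scheduling priority of its fetching warp, and replacement
# evicts from the lowest-priority warp first, breaking ties by LRU age.

from dataclasses import dataclass

@dataclass
class Block:
    tag: int
    warp_priority: int  # higher = older / higher-priority fetching warp
    last_use: int       # cycle of last access (for LRU tie-break)

def agelru_victim(cache_set: list[Block]) -> Block:
    # Evict the block whose fetching warp has the lowest scheduling
    # priority; among equals, the least recently used block.
    return min(cache_set, key=lambda b: (b.warp_priority, b.last_use))
```

Dynamic-AgeLRU would wrap a policy like this in a runtime check that falls back to plain LRU (ordering on `last_use` alone) when the application is not thrashing.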
Dynamic Multigrain Parallelization on the Cell Broadband Engine
This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present policies and mechanisms for adaptive exploitation and scheduling of multiple layers of parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exposed from the runtime system whenever task-level parallelism leaves cores idle. We present a runtime system for scheduling applications with layered parallelism on Cell and investigate its potential with RAxML, a computational biology application which infers large phylogenetic trees using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic parallelization methods that selectively exploit the layers of parallelism in the system, in response to workload characteristics. Our runtime environment outperforms naive parallelization and scheduling based on MPI and Linux by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5-10% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor.
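The scheduling policy above can be reduced to a simple allocation rule: when the set of ready tasks cannot occupy every core, the runtime redirects the idle cores to loop-level parallelism inside the running tasks instead of leaving them unused. A toy sketch of that rule, with all names assumed for illustration:

```python
# Toy sketch of the layered-parallelism policy described above (all names
# are illustrative, not the runtime's actual API): cores not claimed by
# task-level parallelism are handed to malleable loop-level parallelism.

def assign_cores(cores: int, ready_tasks: int) -> dict[str, int]:
    """Return how many cores run whole tasks and how many are redirected
    to loop-level parallelism inside those tasks."""
    task_cores = min(cores, ready_tasks)   # one core per ready task, at most
    return {"task_level": task_cores,
            "loop_level": cores - task_cores}  # leftover cores widen loops
```

In the actual runtime this decision is event-driven and revisited as tasks complete, so the split adapts to workload characteristics over time.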
K-hot pipelining
Computing systems in almost every application domain now support techniques to trade off power and performance. Such techniques are used to enforce power and thermal constraints, manage power and thermal budgets, and respond to temperature and aging. Unfortunately, many current techniques are limited in the dynamic range they provide and scale poorly with technology. Techniques that can supplement or replace them are needed. We propose k-hot pipelining, a novel technique to support multiple power-performance points in a processor. The key idea is to provide power and clock to only k stages of an m-stage pipeline (k < m); the k stages to be powered on change as instructions flow through the pipeline. Since the remaining m − k stages do not consume power, the technique results in power savings at the expense of performance. k-hot pipelining can be software- or hardware-controlled, workload-agnostic or workload-adaptive, and can be used to provide power-performance points not supported by existing techniques. For one implementation of k-hot pipelining, we show that up to 49.9% power reduction is possible over the baseline design. Power reduction is up to 47% over the lowest power point supported by DVFS.
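Since only k of the m stages draw power in any cycle, a first-order model of the savings follows directly from the abstract: average power scales with k/m relative to the all-stages-hot baseline. A hedged sketch of that arithmetic (this is a back-of-the-envelope model, not the paper's evaluation methodology):

```python
# First-order model of k-hot pipelining power (illustrative only): in any
# cycle, k of the m pipeline stages are powered and clocked, so average
# pipeline power scales with k/m relative to the fully-hot baseline.

def khot_power(m: int, k: int, stage_power: float = 1.0) -> float:
    """Average pipeline power when only k of m stages are hot,
    assuming (simplistically) equal power per stage."""
    assert 0 < k < m, "k-hot pipelining requires k < m"
    return k * stage_power  # the remaining m - k stages are gated off

def khot_savings(m: int, k: int) -> float:
    """Fraction of baseline (all m stages hot) power saved."""
    return 1.0 - k / m
```

Under this equal-power-per-stage assumption, a 2-hot configuration of a 4-stage pipeline would save 50%; the paper's reported 49.9% maximum reduction reflects a real design where stage power is not uniform and some logic cannot be gated.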
Compiler-Directed Energy Savings in Superscalar Processors
Institute for Computing Systems Architecture

Superscalar processors contain large, complex structures to hold data and instructions as they wait to be executed. However, many of these structures consume large amounts of energy, making them hotspots requiring sophisticated cooling systems. With the trend towards larger, more complex processors, this will become more of a problem, with important implications for future technology.

This thesis uses compiler-based optimisation schemes to target the issue queue and register file, two of the most energy-consuming structures in the processor. The algorithms and hardware techniques developed in this work dynamically adapt the processor's resources to changing program phases, turning off parts of each structure when they are unused to save dynamic and static energy.

To optimise the issue queue, the compiler analysis tracks data dependences through each program procedure. It identifies the critical path through each program region and informs the hardware of the minimum number of queue entries required to avoid slowing it down. This reduces the occupancy of the queue and increases the opportunities to save energy. With just a 1.3% performance loss, 26% dynamic and 32% static energy savings are achieved.

Registers can be idle for many cycles after they are last read, before they are released and put back on the free-list to be reused by another instruction. Alternatively, they can be turned off for energy savings. Early register releasing can be used to perform this operation sooner than usual, but hardware schemes must wait for the instruction redefining the relevant logical register to enter the pipeline. This thesis presents an exploration of compiler-directed early register releasing. Based on simple data-flow and liveness analysis, the compiler can exactly identify the last use of each register and pass the information to the hardware. The best scheme achieves 15% dynamic and 19% static energy savings.

Finally, the issue-queue limiting and early register releasing schemes are combined for energy savings in both processor structures. Four different configurations are evaluated, bringing 25% to 31% dynamic and 19% to 34% static issue-queue energy savings, and reductions of 18% to 25% dynamic and 20% to 21% static energy in the register file.
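The compiler analysis underpinning early register releasing is a standard backward liveness pass: scanning instructions in reverse, the first time a register is seen as a read is its last use in program order. A minimal sketch of that pass, with identifiers assumed for illustration rather than drawn from the thesis:

```python
# Minimal sketch of the liveness analysis behind compiler-directed early
# register releasing (identifiers are assumptions): a backward pass marks,
# for each instruction, the registers whose last read it performs, so the
# hardware can return them to the free list immediately instead of waiting
# for a redefining instruction to enter the pipeline.

def last_uses(instrs: list[tuple[set[str], set[str]]]) -> list[set[str]]:
    """instrs: (reads, writes) per instruction, in program order.
    Returns, per instruction, the registers it reads for the last time."""
    seen: set[str] = set()
    out: list[set[str]] = [set() for _ in instrs]
    for i in range(len(instrs) - 1, -1, -1):  # backward liveness pass
        reads, _writes = instrs[i]
        out[i] = reads - seen  # first sighting scanning backward = last use
        seen |= reads
    return out
```

In the scheme the thesis describes, these last-use marks would be encoded into the binary so the pipeline can release the corresponding physical registers as soon as the marked instructions commit.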