985 research outputs found
Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era
Around 2003, newly activated power constraints caused single-thread performance growth to slow dramatically. The multi-core era was born with an emphasis on explicitly parallel software. Continuing to grow single-thread performance is still important in the multi-core context, but it must be done in an energy efficient way.
One significant impediment to performance growth in both out-of-order and in-order processors is the long latency of last-level cache misses. Prior work introduced the idea of load latency tolerance---the ability to dynamically remove miss-dependent instructions from critical execution structures, continue execution under the miss, and re-execute miss-dependent instructions after the miss returns. However, previously proposed designs were unable to improve performance in an energy-efficient way---they introduced too many new large, complex structures and re-executed too many instructions.
This dissertation describes a new load latency tolerant design that is both energy-efficient, and applicable to both in-order and out-of-order cores. Key novel features include formulation of slice re-execution as an alternative use of multi-threading support, efficient schemes for register and memory state management, and new pruning mechanisms for drastically reducing load latency tolerance\u27s dynamic execution overheads.
Area analysis shows that energy-efficient load latency tolerance increases the footprint of an out-of-order core by a few percent, while cycle-level simulation shows that it significantly improves the performance of memory-bound programs. Energy-efficient load latency tolerance is more energy-efficient than---and synergistic with---existing performance technique like dynamic voltage and frequency scaling (DVFS)
Memory disambiguation hardware: a review
One of the main challenges of modern processor designs is the implementation of scalable and efficient mechanisms to detect memory access order violations as a result of out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry.
This fact has generated a large body of work on optimizing address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions.Facultad de Informátic
A low-power cache system for high-performance processors
制度:新 ; 報告番号:甲3439号 ; 学位の種類:博士(工学) ; 授与年月日:12-Sep-11 ; 早大学位記番号:新576
Microarchitectural techniques to reduce energy consumption in the memory hierarchy
This thesis states that dynamic profiling of the memory reference stream can improve energy
and performance in the memory hierarchy. The research presented in this theses provides
multiple instances of using lightweight hardware structures to profile the memory
reference stream. The objective of this research is to develop microarchitectural techniques
to reduce energy consumption at different levels of the memory hierarchy. Several simple
and implementable techniques were developed as a part of this research. One of the
techniques identifies and eliminates redundant refresh operations in DRAM and reduces
DRAM refresh power. Another, reduces leakage energy in L2 and higher level caches for
multiprocessor systems. The emphasis of this research has been to develop several techniques
of obtaining energy savings in caches using a simple hardware structure called the
counting Bloom filter (CBF). CBFs have been used to predict L2 cache misses and obtain
energy savings by not accessing the L2 cache on a predicted miss. A simple extension of
this technique allows CBFs to do way-estimation of set associative caches to reduce energy
in cache lookups. Another technique using CBFs track addresses in a Virtual Cache and
reduce false synonym lookups. Finally this thesis presents a technique to reduce dynamic
power consumption in level one caches using significance compression. The significant
energy and performance improvements demonstrated by the techniques presented in this
thesis suggest that this work will be of great value for designing memory hierarchies of
future computing platforms.Ph.D.Committee Chair: Lee, Hsien-Hsin S.; Committee Member: Cahtterjee,Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhaka
Recommended from our members
Scalable hardware memory disambiguation
This dissertation deals with one of the long-standing problems in Computer Architecture
– the problem of memory disambiguation. Microprocessors typically reorder
memory instructions during execution to improve concurrency. Such microprocessors
use hardware memory structures for memory disambiguation, known as LoadStore
Queues (LSQs), to ensure that memory instruction dependences are satisfied
even when the memory instructions execute out-of-order. A typical LSQ implementation
(circa 2006) holds all in-flight memory instructions in a physically centralized
LSQ and performs a fully associative search on all buffered instructions to ensure
that memory dependences are satisfied. These LSQ implementations do not scale
because they use large, fully associative structures, which are known to be slow and
power hungry. The increasing trend towards distributed microarchitectures further
exacerbates these problems. As on-chip wire delays increase and high-performance
processors become necessarily distributed, centralized structures such as the LSQ
can limit scalability.
This dissertation describes techniques to create scalable LSQs in both centralized
and distributed microarchitectures. The problems and solutions described
in this thesis are motivated and validated by real system designs. The dissertation
starts with a description of the partitioned primary memory system of the TRIPS
processor, of which the LSQ is an important component, and then through a series
of optimizations describes how the power, area, and centralization problems
of the LSQ can be solved with minor performance losses (if at all) even for large
number of in flight memory instructions. The four solutions described in this dissertation
— partitioning, filtering, late binding and efficient overflow management —
enable power-, area-efficient, distributed and scalable LSQs, which in turn enable
aggressive large-window processors capable of simultaneously executing thousands
of instructions.
To mitigate the power problem, we replaced the power-hungry, fully associative
search with a power-efficient hash table lookup using a simple address-based
Bloom filter. Bloom filters are probabilistic data structures used for testing set
membership and can be used to quickly check if an instruction with the same data
address is likely to be found in the LSQ without performing the associative search.
Bloom filters typically eliminate more than 80% of the associative searches and they
are highly effective because in most programs, it is uncommon for loads and stores
to have the same data address and be in execution simultaneously.
To rectify the area problem, we observe the fact that only a small fraction
of all memory instructions are dependent, that only such dependent instructions
need to be buffered in the LSQ, and that these instructions need to be in the LSQ
only for certain parts of the pipelined execution. We propose two mechanisms to
exploit these observations. The first mechanism, area filtering, is a hardware mechanism
that couples Bloom filters and dependence predictors to dynamically identify
and buffer only those instructions which are likely to be dependent. The second
mechanism, late binding, reduces the occupancy and hence size of the LSQ. Both of
these optimizations allows the number of LSQ slots to be reduced by up to one-half
compared to a traditional organization without any performance degradation.
Finally, we describe a new decentralized LSQ design for handling LSQ structural
hazards in distributed microarchitectures. Decentralization of LSQs, and to
a large extent distributed microarchitectures with memory speculation, has proved
to be impractical because of the high performance penalties associated with the
mechanisms for dealing with hazards. To solve this problem, we applied classic
flow-control techniques from interconnection networks for handling resource con-
flicts. The first method, memory-side buffering, buffers the overflowing instructions
in a separate buffer near the LSQs. The second scheme, execution-side NACKing,
sends the overflowing instruction back to the issue window from which it is later
re-issued. The third scheme, network buffering, uses the buffers in the interconnection
network between the execution units and memory to hold instructions when the
LSQ is full, and uses virtual channel flow control to avoid deadlocks. The network
buffering scheme is the most robust of all the overflow schemes and shows less than
1% performance degradation due to overflows for a subset of SPEC CPU 2000 and
EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS
processor.
The techniques proposed in this dissertation are independent, architectureneutral
and their cumulative benefits result in LSQs that can be partitioned at a
fine granularity and have low design complexity. Each of these partitions selectively
buffers only memory instructions with true dependences and can be closely coupled
with the execution units thus minimizing power, area, and latency. Such LSQ
designs with near-ideal characteristics are well suited for microarchitectures with
thousands of instructions in-flight and may enable even more aggressive microarchitectures
in the future.Computer Science
Hybrid Designs for Caches and Cores.
Processor power constraints have come to the forefront over the last decade, heralded by the stagnation of clock frequency scaling. High-performance core and cache designs often utilize power-hungry techniques to increase parallelism. Conversely, the most energy-efficient designs opt for a serial execution to avoid unnecessary overheads. While both of these extremes constitute one-size-fits-all approaches, a judicious mix of parallel and serial execution has the potential to achieve the best of both high-performing and energy-efficient designs. This dissertation examines such hybrid designs for cores and caches. Firstly, we introduce a novel, hybrid out-of-order/in-order core microarchitecture. Instructions that are steered towards in-order execution skip register allocation, reordering and dynamic scheduling. At the same time, these instructions can interleave on an instruction-by-instruction basis with instructions that continue to benefit from these conventional out-of-order mechanisms. Secondly, this dissertation revisits a hybrid technique introduced for L1 caches, way-prediction, in the context of last-level caches that are larger, have higher associativity, and experience less locality.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113484/1/sleimanf_1.pd
Improving processor efficiency by exploiting common-case behaviors of memory instructions
Processor efficiency can be described with the help of a number of desirable
effects or metrics, for example, performance, power, area, design
complexity and access latency.
These metrics serve as valuable tools used in designing new processors
and they also act as effective standards for comparing current processors.
Various factors impact the efficiency of modern out-of-order processors
and one important factor is the manner in which instructions are processed
through the processor pipeline.
In this dissertation research, we study the impact of load and store
instructions
(collectively known as memory instructions) on processor efficiency,
and show how to improve efficiency by exploiting common-case or
predictable patterns in the behavior of memory instructions.
The memory behavior patterns that we focus on in our research are
the predictability of memory dependences, the predictability in
data forwarding patterns,
predictability in instruction criticality and conservativeness
in resource allocation and
deallocation policies.
We first design a scalable and high-performance memory dependence
predictor and then apply
accurate memory dependence prediction to improve the efficiency of
the fetch engine of a simultaneous multi-threaded processor.
We then use predictable data forwarding patterns to eliminate power-hungry
hardware in the processor with no loss in performance. We then move to
studying instruction criticality to improve
processor efficiency. We study the behavior of critical load instructions
and propose applications that can be optimized using predictable,
load-criticality
information. Finally, we explore conventional techniques for
allocation and deallocation
of critical structures that process memory instructions and propose new
techniques to optimize the same. Our new designs have the potential to reduce
the power and the area required by processors significantly without losing
performance, which lead to efficient designs of processors.Ph.D.Committee Chair: Loh, Gabriel H.; Committee Member: Clark, Nathan; Committee Member: Jaleel, Aamer; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin S.; Committee Member: Prvulovic, Milo
- …