Re-visiting the performance impact of microarchitectural floorplanning
The placement of microarchitectural blocks on a die can significantly impact operating temperature. A floorplan that is optimized for low temperature can negatively impact performance by introducing wire delays between critical pipeline stages. In this paper, we identify subsets of wire delays that can and cannot be tolerated. These subsets are different from those identified by prior work. This paper also makes the case that floorplanning algorithms must consider the impact of floorplans on bypassing complexity and instruction replay mechanisms.
Late allocation and early release of physical registers
The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and release are done conservatively: the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once the register is no longer in use. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers holding a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where register file pressure is usually high.
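To make the allocation and release timing concrete, below is a minimal Python sketch of the late-allocation, early-release idea: a virtual tag stands in for the register at rename, the physical register is claimed only at writeback, and it is freed once its last known reader has consumed the value. The class, its method names, and the explicit reader count are illustrative assumptions for exposition, not VP-LAER's actual hardware structures, which must track and verify last use speculatively.

    class LateAllocRenamer:
        def __init__(self, num_phys):
            self.free = list(range(num_phys))  # pool of free physical registers
            self.map = {}                      # arch reg -> virtual tag or phys reg
            self.readers_left = {}             # phys reg -> outstanding readers

        def rename(self, arch_reg):
            # Conventional schemes would allocate a physical register here;
            # with late allocation only a virtual tag is handed out, so no
            # physical register is occupied before the value exists.
            tag = ("virt", arch_reg, object())
            self.map[arch_reg] = tag
            return tag

        def writeback(self, arch_reg, num_future_readers):
            # Allocate the physical register only when the value is produced,
            # at the end of the execution stage.
            preg = self.free.pop()
            self.map[arch_reg] = preg
            self.readers_left[preg] = num_future_readers
            return preg

        def read(self, preg):
            # Release as soon as the last known reader has consumed the value,
            # instead of waiting for a redefining instruction to commit.
            self.readers_left[preg] -= 1
            if self.readers_left[preg] == 0:
                del self.readers_left[preg]
                self.free.append(preg)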
Single-level dynamic register caching architecture for high-performance superscalar processors
The amount of instruction-level parallelism (ILP) that can be exploited depends greatly on the size of the instruction window and the number of in-flight instructions the processor can support. However, this requires a register file with a large set of physical registers for renaming and multiple ports to provide register accesses to several instructions at once. The number of registers and ports a register file must contain will increase as the next generation of wide-issue processors takes advantage of more ILP, which will also increase its access time, area, and power dissipation. This paper proposes a method called Dynamic Register Caching, which uses a small, fast register cache along with a slow full register file in a single-level configuration, and splits the porting requirement between the two, with each one capable of supplying values to FUs. This reduces the miss penalty found in previous multi-level schemes to just the access time of the full register file. The proposed method uses In-Cache bits and Register-port Select logic to keep track of operands in the register cache and the availability of free ports on both register files, and a simple instruction steering mechanism to determine which register file will supply values to instructions. This technique of dynamically steering instructions requires slightly more logic to implement, but incurs no additional delay and ensures that load balance is a non-issue. Our study, based on SimpleScalar microarchitecture simulation, shows that the proposed scheme provides on average a 15–22% improvement in IPC, a 47–56% reduction in total area, and a 23–49% reduction in power compared to a monolithic register file.
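As a rough illustration of the steering decision, the sketch below checks each operand's In-Cache bit and the free ports on both files before choosing a supplier. The function name, port accounting, and stall policy are assumptions for exposition, not the paper's exact logic.

    def steer(operands, cache_ports_free, file_ports_free):
        """Decide, per operand, whether the register cache or the full
        register file supplies the value. Returns a list of
        ('cache'|'file', reg) pairs, or None if ports are exhausted."""
        plan = []
        for reg, in_cache in operands:   # in_cache is the operand's In-Cache bit
            if in_cache and cache_ports_free > 0:
                plan.append(("cache", reg))
                cache_ports_free -= 1
            elif file_ports_free > 0:
                # A miss costs only the full register file's access time,
                # since both files can feed the FUs directly.
                plan.append(("file", reg))
                file_ports_free -= 1
            else:
                return None  # stall: re-issue next cycle
        return plan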
A Lightweight, Compiler-Assisted Register File Cache for GPGPU
Modern GPUs require an enormous register file (RF) to store the context of thousands of active threads. It consumes considerable energy and contains multiple large banks to provide enough throughput. Thus, an RF caching mechanism can significantly improve the performance and energy consumption of GPUs by avoiding reads from the large banks, which consume significant energy and may cause port conflicts.
This paper introduces an energy-efficient RF caching mechanism called Malekeh that repurposes an existing component in the GPU's RF to operate as a cache in addition to its original functionality. In this way, Malekeh minimizes the overhead of adding an RF cache to GPUs. In addition, Malekeh leverages an issue scheduling policy that utilizes the reuse distance of the values in the RF cache and is controlled by a dynamic algorithm. The goal is to adapt the issue policy to the runtime program characteristics to maximize the GPU's performance and the hit ratio of the RF cache. The reuse distance is approximated by the compiler using profiling and is used at run time by the proposed caching scheme. We show that Malekeh reduces the number of reads to the RF banks by 46.4% and the dynamic energy of the RF by 28.3%. Moreover, it improves performance by 6.1% while adding only 2KB of extra storage per core to the baseline RF of 256KB, which represents a negligible overhead of 0.78%.
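The following Python sketch gives a schematic view of a reuse-distance-aware issue policy in the spirit of the one described above: warps whose next source operands are already cached, or will be reused within a threshold distance, are preferred at issue. The warp interface, scoring rule, and threshold are simplified assumptions, not Malekeh's actual dynamic algorithm.

    def pick_warp(ready_warps, rf_cache, threshold):
        """Prefer the ready warp whose source registers are most likely to
        hit in the RF cache, judged by compiler-estimated reuse distance."""
        best, best_score = None, None
        for warp in ready_warps:
            # warp.next_sources() is assumed to yield (register, reuse
            # distance) pairs for the warp's next instruction.
            hits = sum(1 for reg, dist in warp.next_sources()
                       if reg in rf_cache or dist <= threshold)
            if best_score is None or hits > best_score:
                best, best_score = warp, hits
        return best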
Understanding the impact of 3D stacked layouts on ILP
3D die-stacked chips can alleviate the penalties imposed by long wires within microprocessor circuits. Many recent studies have attempted to partition each microprocessor structure across three dimensions to reduce their access times. In this paper, we implement each microprocessor structure on a single 2D die and leverage 3D to reduce the lengths of wires that communicate data between microprocessor structures within a single core. We begin with a criticality analysis of inter-structure wire delays and show that for most traditional simple superscalar cores, 2D floorplans are already very efficient at minimizing critical wire delays. For an aggressive wire-constrained clustered superscalar architecture, an exploration of the design space reveals that 3D can yield higher benefit. However, this benefit may be negated by the higher power density and temperature entailed by 3D integration. Overall, we report a negative result and argue against leveraging 3D for higher ILP.
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization
A high-bandwidth, low-latency load-store unit is a critical component of a dynamically scheduled processor. Unfortunately, it is also one of the most complex and non-scalable components. Recently, several researchers have proposed techniques that simplify the core load-store unit and improve its scalability in exchange for the in-order pre-retirement re-execution of some subset of the loads in the program. We call such techniques load/store optimizations. One recent optimization attacks load queue (LQ) scalability by replacing the expensive associative search that is used to enforce intra- and inter-thread ordering with load re-execution. A second attacks store queue (SQ) scalability by speculatively filtering some load accesses and some store entries from it. The speculatively accessed, speculatively populated SQ can be made smaller and faster, but load re-execution is required to verify the speculation. A third uses a hardware table to identify redundant loads and skip their execution altogether. Redundant load elimination is highly accurate but not 100%, so re-execution is needed to flag false eliminations.
Unfortunately, the inherent benefits of load/store optimizations are mitigated by re-execution itself. Re-execution contends for cache bandwidth with store retirement, and serializes load re-execution with subsequent store retirement. If a particular technique requires a sufficient number of load re-executions, the cost of these re-executions will outweigh the benefits of the technique entirely and may even produce drastic slowdowns. This is the case for the SQ technique.
Store Vulnerability Window (SVW) is a new mechanism that significantly reduces the re-execution requirements of a given load/store optimization, by an average of 85% across the three load/store optimizations we study. This reduction relieves cache port contention and removes many of the dynamic serialization events that contribute the bulk of re-execution's cost, allowing these techniques to perform up to their full potential. For the scalable SQ optimization, this means the chance to perform at all: without SVW, this technique posts significant slowdowns. SVW is a simple scheme based on monotonic store sequence numbering and a novel application of Bloom filtering. The cost of an effective SVW implementation is a 1KB buffer and a 2B field per LQ entry.
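A simplified model of the filtering mechanism helps illustrate the idea: stores receive monotonically increasing sequence numbers (SSNs), and a small hashed table records the SSN of the youngest retired store to each hashed address. A load can skip re-execution if no store younger than its execution-time SSN may have written its address. The table size and hash below are assumptions for exposition; the paper's Bloom-filter organization differs in detail.

    TABLE_SIZE = 512  # hashed entries; the paper budgets roughly 1KB of state

    class SVW:
        def __init__(self):
            self.ssn = 0                   # monotonic store sequence number
            self.table = [0] * TABLE_SIZE  # hashed addr -> SSN of last store

        def retire_store(self, addr):
            self.ssn += 1
            self.table[hash(addr) % TABLE_SIZE] = self.ssn

        def must_reexecute(self, load_addr, ssn_at_execute):
            # The load recorded the current SSN when it executed. If no
            # store younger than that has written (a hash alias of) its
            # address, the re-execution is filtered. Aliasing can only
            # cause spurious re-executions, never missed ones, so
            # correctness is preserved.
            return self.table[hash(load_addr) % TABLE_SIZE] > ssn_at_execute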
Advanced microarchitecture and circuit design techniques for on-chip memories in CMOS technology
In modern on-chip memories, an increasing demand for higher performance, lower power, reduced area, and improved robustness creates a rising need for advanced microarchitecture and circuit design techniques. Particularly in large-signal multi-ported register files, these advanced design techniques include: (i) multi-banked arrays, (ii) multi-frequency arrays, (iii) multi-bit-width gating, (iv) multi-latency cycle times, (v) multi-threshold devices, and (vi) multi-strength keepers. Register files are key components of modern microprocessors, but the increasing number of read/write ports and entries can produce a bottleneck. This thesis discusses various new techniques to address the challenges facing register file designers and to satisfy microprocessor requirements.
The scalability of register files is a concern in modern microprocessors. As microprocessors become wider to exploit instruction-level parallelism, the number of read/write ports increases. In turn, this results in quadratic growth in register file area, decreasing frequency, and increasing power consumption. Multi-banked and multi-frequency register files reduce area and power consumption by relieving read/write port congestion. Multi-bit-width register files reduce active power during read/write operations by gating the clock/wordline. Pipelined register files improve frequency by reducing logic depth, but require multiple cycles for read/write operations. Multi-latency register files have variable access cycle times that depend on the physical locality of the data. This improves overall microprocessor performance and recovers lost instructions per cycle.
As instruction window size continues to expand in modern microprocessors, the resulting demand for additional register file entries requires increased use of wide-OR dynamic circuits. However, these circuits, primarily found in local/global bitlines, are susceptible to leakage noise. In a multi-threshold process, a self-reverse bias technique exploits the use of leaky low-VTH devices, reducing bitline leakage and improving robustness. This circuit topology improves bitline delay from reduced keeper contention. Downsized keepers improve bitline delay in low leakage conditions; stronger keepers improve bitline robustness in high leakage conditions. Utilizing this concept, register files with multi-strength keepers enable robust operation across a wide range of process, voltage, and temperature.
These various design techniques show excellent promise in improving the performance, power, area, and robustness of multi-ported register files in modern microprocessors.
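As a back-of-the-envelope illustration of the multi-banked technique listed above, the sketch below spreads same-cycle register reads across banks and charges extra cycles when the reads directed at one bank exceed its port count. The bank-selection hash and the parameters are assumptions for exposition, not the thesis's circuit design.

    def schedule_reads(read_regs, num_banks=4, ports_per_bank=2):
        """Group same-cycle register reads by bank and return how many
        cycles the slowest bank needs, versus one cycle for a fully
        multi-ported monolithic file."""
        if not read_regs:
            return 0
        per_bank = {}
        for reg in read_regs:
            per_bank.setdefault(reg % num_banks, []).append(reg)  # simple hash
        # Each bank serves at most ports_per_bank reads per cycle, so the
        # most heavily hit bank determines when all operands are available.
        return max((len(regs) + ports_per_bank - 1) // ports_per_bank
                   for regs in per_bank.values())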
LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working set within each interval. The key idea of LTRF is to prefetch the estimated register working set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
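A schematic Python sketch of the interval-based prefetching described above follows: at an interval boundary, the warp's compiler-estimated register working set is pulled from the main register file into the register file cache, while other warps execute to hide the long-latency reads. The data structures and scheduler interface are illustrative assumptions, not LTRF's hardware design.

    def enter_interval(warp, interval, rf_cache, main_rf, scheduler):
        # The compiler attached this interval's register working set to the
        # program; here it drives a software-controlled prefetch.
        scheduler.mark_waiting(warp)               # deschedule during prefetch
        for reg in interval.register_working_set:
            if reg not in rf_cache:
                rf_cache[reg] = main_rf.read(reg)  # long-latency main-RF read
        # While these reads are in flight, the scheduler issues other warps,
        # overlapping the prefetch latency with useful execution.
        scheduler.mark_ready(warp)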