Re-visiting the performance impact of microarchitectural floorplanning
The placement of microarchitectural blocks on a die can significantly impact operating temperature. A floorplan that is optimized for low temperature can negatively impact performance by introducing wire delays between critical pipeline stages. In this paper, we identify subsets of wire delays that can and cannot be tolerated. These subsets are different from those identified by prior work. This paper also makes the case that floorplanning algorithms must consider the impact of floorplans on bypassing complexity and instruction replay mechanisms.
Late allocation and early release of physical registers
The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and release are done conservatively: the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once the register is no longer in use. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers holding a value to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in the register file size for a given performance level, especially for floating-point codes, where register file pressure is usually high.
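To make the allocation and release timing concrete, below is a minimal Python sketch of the late-allocation, early-release idea: a virtual tag stands in for the register at rename, the physical register is claimed only at writeback, and it is freed once its last known reader has consumed the value. The class, its method names, and the explicit reader count are illustrative assumptions for exposition, not VP-LAER's actual hardware structures, which must track and verify last use speculatively.

    class LateAllocRenamer:
        def __init__(self, num_phys):
            self.free = list(range(num_phys))  # pool of free physical registers
            self.map = {}                      # arch reg -> virtual tag or phys reg
            self.readers_left = {}             # phys reg -> outstanding readers

        def rename(self, arch_reg):
            # Conventional schemes would allocate a physical register here;
            # with late allocation only a virtual tag is handed out, so no
            # physical register is occupied before the value exists.
            tag = ("virt", arch_reg, object())
            self.map[arch_reg] = tag
            return tag

        def writeback(self, arch_reg, num_future_readers):
            # Allocate the physical register only when the value is produced,
            # at the end of the execution stage.
            preg = self.free.pop()
            self.map[arch_reg] = preg
            self.readers_left[preg] = num_future_readers
            return preg

        def read(self, preg):
            # Release as soon as the last known reader has consumed the value,
            # instead of waiting for a redefining instruction to commit.
            self.readers_left[preg] -= 1
            if self.readers_left[preg] == 0:
                del self.readers_left[preg]
                self.free.append(preg)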
Single-level dynamic register caching architecture for high-performance superscalar processors
The amount of instruction-level parallelism (ILP) that can be exploited depends greatly on the size of the instruction window and the number of in-flight instructions the processor can support. However, this requires a register file with a large set of physical registers for renaming and multiple ports to provide register accesses to several instructions at once. The number of registers and ports a register file must contain will increase as the next generation of wide-issue processors takes advantage of more ILP, which will also increase its access time, area, and power dissipation. This paper proposes a method called Dynamic Register Caching, which uses a small, fast register cache along with a slow full register file in a single-level configuration, and splits the porting requirement between the two, with each one capable of supplying values to FUs. This reduces the miss penalty found in previous multi-level schemes to just the access time of the full register file. The proposed method uses In-Cache bits and Register-port Select logic to keep track of operands in the register cache and the availability of free ports on both register files, and a simple instruction steering mechanism to determine which register file will supply values to instructions. This technique of dynamically steering instructions requires slightly more logic to implement, but incurs no additional delay and ensures that load balance is a non-issue. Our study, based on SimpleScalar microarchitecture simulation, shows that the proposed scheme provides on average a 15–22% improvement in IPC, a 47–56% reduction in total area, and a 23–49% reduction in power compared to a monolithic register file.
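As a rough illustration of the steering decision, the sketch below checks each operand's In-Cache bit and the free ports on both files before choosing a supplier. The function name, port accounting, and stall policy are assumptions for exposition, not the paper's exact logic.

    def steer(operands, cache_ports_free, file_ports_free):
        """Decide, per operand, whether the register cache or the full
        register file supplies the value. Returns a list of
        ('cache'|'file', reg) pairs, or None if ports are exhausted."""
        plan = []
        for reg, in_cache in operands:   # in_cache is the operand's In-Cache bit
            if in_cache and cache_ports_free > 0:
                plan.append(("cache", reg))
                cache_ports_free -= 1
            elif file_ports_free > 0:
                # A miss costs only the full register file's access time,
                # since both files can feed the FUs directly.
                plan.append(("file", reg))
                file_ports_free -= 1
            else:
                return None  # stall: re-issue next cycle
        return plan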
A Lightweight, Compiler-Assisted Register File Cache for GPGPU
Modern GPUs require an enormous register file (RF) to store the context of thousands of active threads. It consumes considerable energy and contains multiple large banks to provide enough throughput. Thus, an RF caching mechanism can significantly improve the performance and energy consumption of GPUs by avoiding reads from the large banks, which consume significant energy and may cause port conflicts.
This paper introduces an energy-efficient RF caching mechanism called Malekeh that repurposes an existing component in the GPU's RF to operate as a cache in addition to its original functionality. In this way, Malekeh minimizes the overhead of adding an RF cache to GPUs. In addition, Malekeh leverages an issue scheduling policy that utilizes the reuse distance of the values in the RF cache and is controlled by a dynamic algorithm. The goal is to adapt the issue policy to the runtime program characteristics to maximize the GPU's performance and the hit ratio of the RF cache. The reuse distance is approximated by the compiler using profiling and is used at run time by the proposed caching scheme. We show that Malekeh reduces the number of reads to the RF banks by 46.4% and the dynamic energy of the RF by 28.3%. Moreover, it improves performance by 6.1% while adding only 2KB of extra storage per core to the baseline RF of 256KB, which represents a negligible overhead of 0.78%.
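The following Python sketch gives a schematic view of a reuse-distance-aware issue policy in the spirit of the one described above: warps whose next source operands are already cached, or will be reused within a threshold distance, are preferred at issue. The warp interface, scoring rule, and threshold are simplified assumptions, not Malekeh's actual dynamic algorithm.

    def pick_warp(ready_warps, rf_cache, threshold):
        """Prefer the ready warp whose source registers are most likely to
        hit in the RF cache, judged by compiler-estimated reuse distance."""
        best, best_score = None, None
        for warp in ready_warps:
            # warp.next_sources() is assumed to yield (register, reuse
            # distance) pairs for the warp's next instruction.
            hits = sum(1 for reg, dist in warp.next_sources()
                       if reg in rf_cache or dist <= threshold)
            if best_score is None or hits > best_score:
                best, best_score = warp, hits
        return best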
Understanding the impact of 3D stacked layouts on ILP
3D die-stacked chips can alleviate the penalties imposed by long wires within microprocessor circuits. Many recent studies have attempted to partition each microprocessor structure across three dimensions to reduce their access times. In this paper, we implement each microprocessor structure on a single 2D die and leverage 3D to reduce the lengths of wires that communicate data between microprocessor structures within a single core. We begin with a criticality analysis of inter-structure wire delays and show that for most traditional simple superscalar cores, 2D floorplans are already very efficient at minimizing critical wire delays. For an aggressive wire-constrained clustered superscalar architecture, an exploration of the design space reveals that 3D can yield higher benefit. However, this benefit may be negated by the higher power density and temperature entailed by 3D integration. Overall, we report a negative result and argue against leveraging 3D for higher ILP.
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization
A high-bandwidth, low-latency load-store unit is a critical component of a dynamically scheduled processor. Unfortunately, it is also one of the most complex and non-scalable components. Recently, several researchers have proposed techniques that simplify the core load-store unit and improve its scalability in exchange for the in-order pre-retirement re-execution of some subset of the loads in the program. We call such techniques load/store optimizations. One recent optimization attacks load queue (LQ) scalability by replacing the expensive associative search that is used to enforce intra- and inter-thread ordering with load re-execution. A second attacks store queue (SQ) scalability by speculatively filtering some load accesses and some store entries from it. The speculatively accessed, speculatively populated SQ can be made smaller and faster, but load re-execution is required to verify the speculation. A third uses a hardware table to identify redundant loads and skip their execution altogether. Redundant load elimination is highly accurate but not 100%, so re-execution is needed to flag false eliminations.
Unfortunately, the inherent benefits of load/store optimizations are mitigated by re-execution itself. Re-execution contends for cache bandwidth with store retirement, and serializes load re-execution with subsequent store retirement. If a particular technique requires a sufficient number of load re-executions, the cost of these re-executions will outweigh the benefits of the technique entirely and may even produce drastic slowdowns. This is the case for the SQ technique.
Store Vulnerability Window (SVW) is a new mechanism that significantly reduces the re-execution requirements of a given load/store optimization, by an average of 85% across the three load/store optimizations we study. This reduction relieves cache port contention and removes many of the dynamic serialization events that contribute the bulk of re-execution's cost, allowing these techniques to perform up to their full potential. For the scalable SQ optimization, this means the chance to perform at all: without SVW, this technique posts significant slowdowns. SVW is a simple scheme based on monotonic store sequence numbering and a novel application of Bloom filtering. The cost of an effective SVW implementation is a 1KB buffer and a 2B field per LQ entry.
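A simplified model of the filtering mechanism helps illustrate the idea: stores receive monotonically increasing sequence numbers (SSNs), and a small hashed table records the SSN of the youngest retired store to each hashed address. A load can skip re-execution if no store younger than its execution-time SSN may have written its address. The table size and hash below are assumptions for exposition; the paper's Bloom-filter organization differs in detail.

    TABLE_SIZE = 512  # hashed entries; the paper budgets roughly 1KB of state

    class SVW:
        def __init__(self):
            self.ssn = 0                   # monotonic store sequence number
            self.table = [0] * TABLE_SIZE  # hashed addr -> SSN of last store

        def retire_store(self, addr):
            self.ssn += 1
            self.table[hash(addr) % TABLE_SIZE] = self.ssn

        def must_reexecute(self, load_addr, ssn_at_execute):
            # The load recorded the current SSN when it executed. If no
            # store younger than that has written (a hash alias of) its
            # address, the re-execution is filtered. Aliasing can only
            # cause spurious re-executions, never missed ones, so
            # correctness is preserved.
            return self.table[hash(load_addr) % TABLE_SIZE] > ssn_at_execute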
Advanced microarchitecture and circuit design techniques for on-chip memories in CMOS technology
In modern on-chip memories, an increasing demand for higher performance, lower power, reduced area, and improved robustness creates a rising need for advanced microarchitecture and circuit design techniques. Particularly in large-signal multi-ported register files, these advanced design techniques include: (i) multi-banked arrays, (ii) multi-frequency arrays, (iii) multi-bit-width gating, (iv) multi-latency cycle times, (v) multi-threshold devices, and (vi) multi-strength keepers. Register files are key components of modern microprocessors, but the increasing number of read/write ports and entries can produce a bottleneck. This thesis discusses various new techniques to address the challenges facing register file designers and to satisfy microprocessor requirements.
The scalability of register files is a concern in modern microprocessors. As microprocessors become wider to exploit instruction-level parallelism, the number of read/write ports increases. In turn, this results in quadratic growth in register file area, decreasing frequency, and increasing power consumption. Multi-banked and multi-frequency register files reduce area and power consumption by relieving read/write port congestion. Multi-bit-width register files reduce active power during read/write operations by gating the clock/wordline. Pipelined register files improve frequency by reducing logic depth, but require multiple cycles for read/write operations. Multi-latency register files have variable access cycle times that depend on the physical locality of the data. This improves overall microprocessor performance and recovers lost instructions per cycle.
As instruction window size continues to expand in modern microprocessors, the resulting demand for additional register file entries requires increased use of wide-OR dynamic circuits. However, these circuits, primarily found in local/global bitlines, are susceptible to leakage noise. In a multi-threshold process, a self-reverse bias technique exploits the use of leaky low-VTH devices, reducing bitline leakage and improving robustness. This circuit topology improves bitline delay from reduced keeper contention. Downsized keepers improve bitline delay in low leakage conditions; stronger keepers improve bitline robustness in high leakage conditions. Utilizing this concept, register files with multi-strength keepers enable robust operation across a wide range of process, voltage, and temperature.
These various design techniques show excellent promise in improving the performance, power, area, and robustness of multi-ported register files in modern microprocessors.
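As a back-of-the-envelope illustration of the multi-banked technique listed above, the sketch below spreads same-cycle register reads across banks and charges extra cycles when the reads directed at one bank exceed its port count. The bank-selection hash and the parameters are assumptions for exposition, not the thesis's circuit design.

    def schedule_reads(read_regs, num_banks=4, ports_per_bank=2):
        """Group same-cycle register reads by bank and return how many
        cycles the slowest bank needs, versus one cycle for a fully
        multi-ported monolithic file."""
        if not read_regs:
            return 0
        per_bank = {}
        for reg in read_regs:
            per_bank.setdefault(reg % num_banks, []).append(reg)  # simple hash
        # Each bank serves at most ports_per_bank reads per cycle, so the
        # most heavily hit bank determines when all operands are available.
        return max((len(regs) + ports_per_bank - 1) // ports_per_bank
                   for regs in per_bank.values())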
LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working set within each interval. The key idea of LTRF is to prefetch the estimated register working set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
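A schematic Python sketch of the interval-based prefetching described above follows: at an interval boundary, the warp's compiler-estimated register working set is pulled from the main register file into the register file cache, while other warps execute to hide the long-latency reads. The data structures and scheduler interface are illustrative assumptions, not LTRF's hardware design.

    def enter_interval(warp, interval, rf_cache, main_rf, scheduler):
        # The compiler attached this interval's register working set to the
        # program; here it drives a software-controlled prefetch.
        scheduler.mark_waiting(warp)               # deschedule during prefetch
        for reg in interval.register_working_set:
            if reg not in rf_cache:
                rf_cache[reg] = main_rf.read(reg)  # long-latency main-RF read
        # While these reads are in flight, the scheduler issues other warps,
        # overlapping the prefetch latency with useful execution.
        scheduler.mark_ready(warp)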