
    Improving DRAM Performance by Parallelizing Refreshes with Accesses

    Modern DRAM cells are periodically refreshed to prevent data loss due to leakage. Commodity DDR DRAM refreshes cells at the rank level. This degrades performance significantly because it prevents an entire rank from serving memory requests while it is being refreshed. DRAM designed for mobile platforms, LPDDR DRAM, supports an enhanced mode, called per-bank refresh, that refreshes cells at the bank level. This enables a bank to be accessed while another bank in the same rank is being refreshed, alleviating part of the negative performance impact of refreshes. However, per-bank refresh has two shortcomings. First, the per-bank refresh scheduling scheme does not exploit the full potential of overlapping refreshes with accesses across banks because it restricts banks to be refreshed in a sequential round-robin order. Second, accesses to a bank that is being refreshed have to wait. To mitigate the negative performance impact of DRAM refresh, we propose two complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal is to address the drawbacks of per-bank refresh by building more efficient techniques to parallelize refreshes and accesses within DRAM. First, instead of issuing per-bank refreshes in a round-robin order, DARP issues per-bank refreshes to idle banks in an out-of-order manner. Furthermore, DARP schedules refreshes during intervals when a batch of writes is draining to DRAM. Second, SARP exploits the existence of mostly-independent subarrays within a bank. With minor modifications to DRAM organization, it allows a bank to serve memory accesses to an idle subarray while another subarray is being refreshed. Extensive evaluations show that our mechanisms improve system performance and energy efficiency compared to state-of-the-art refresh policies, and the benefit increases as DRAM density increases. Comment: the original paper published at the International Symposium on High-Performance Computer Architecture (HPCA) contains an error; the arXiv version has an erratum that describes the error and the fix for it.
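
    The out-of-order per-bank refresh decision at the heart of DARP can be illustrated with a small scheduler sketch. The Python snippet below is only an illustration under assumed interfaces (the class name, the bank_has_pending_reads callback, and the per-window refresh bookkeeping are not from the paper): it refreshes an idle bank out of order when one exists, and otherwise refreshes a busy bank only while a write batch is draining.

```python
# Illustrative sketch only (names and bookkeeping are assumptions, not the
# paper's implementation). DARP-style idea: issue the next per-bank refresh
# to an idle bank, out of order, and fall back to refreshing during a write
# drain when no bank is idle.

class DarpLikeScheduler:
    def __init__(self, num_banks, refreshes_per_window):
        # refreshes each bank still owes in the current refresh window
        self.pending = {b: refreshes_per_window for b in range(num_banks)}

    def pick_refresh_bank(self, bank_has_pending_reads, write_drain_active):
        """Return a bank to refresh now, or None to postpone the refresh."""
        candidates = [b for b, n in self.pending.items() if n > 0]
        if not candidates:
            return None                     # this window's refreshes are done
        idle = [b for b in candidates if not bank_has_pending_reads(b)]
        if idle:
            bank = max(idle, key=self.pending.get)      # most-behind idle bank
        elif write_drain_active:
            # Reads are not being served during a write drain anyway, so
            # refreshing a busy bank now hides the refresh behind the writes.
            bank = max(candidates, key=self.pending.get)
        else:
            return None                     # all banks have reads waiting
        self.pending[bank] -= 1
        return bank

# Example use with a toy predicate: banks 0 and 2 have queued reads.
sched = DarpLikeScheduler(num_banks=4, refreshes_per_window=2)
print(sched.pick_refresh_bank(lambda b: b in (0, 2), write_drain_active=False))
```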

    Producing Reliable Full-System Simulation Results: A Case Study of CMP with Very Large Caches

    The greater detail and improved realism of full-system architecture simulation make it a valuable computer architecture design tool. However, its unique characteristics introduce new sources of simulation variability, which can make the results of such simulations less reliable. Meanwhile, the demand for more cache levels and larger caches has grown in order to improve system power and performance. This paper presents techniques to produce reliable results in full-system simulation of CMP computer systems with large caches. Specifically, we propose the detailed-emulation replay warmup technique to deal with cold or incompletely warmed-up large caches. We also propose the region-of-interest synchronization technique to prevent simulating a non-representative phase when running multi-program workloads. Furthermore, we quantify the variation reduction one can achieve when using processor affinity and checkpointing. Finally, we show that by applying all four of these simulation techniques, simulation variability is limited to less than 1% and the simulation results are therefore more reliable. (Oregon Microarchitecture Lab, Intel Corporation)
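
    To make the region-of-interest idea concrete, here is a hedged sketch (not the paper's mechanism; the class and method names are invented for illustration) of ROI synchronization for a multi-program mix: statistics are collected only after every program in the workload has reached its own region of interest, so an early, non-representative phase of any single program is never measured.

```python
# Hedged illustration (not the paper's mechanism): region-of-interest (ROI)
# synchronization for a multi-program workload. Statistics are recorded only
# once every program has reached its own ROI marker.

class RoiSync:
    def __init__(self, program_ids):
        self.in_roi = {pid: False for pid in program_ids}
        self.measuring = False

    def mark_roi_entry(self, pid):
        """Call when the simulated program `pid` reaches its ROI marker."""
        self.in_roi[pid] = True
        if all(self.in_roi.values()):
            self.measuring = True           # every program is in its ROI

    def should_record_stats(self):
        return self.measuring

sync = RoiSync(["prog_a", "prog_b"])
sync.mark_roi_entry("prog_a")
print(sync.should_record_stats())           # False: prog_b not in its ROI yet
sync.mark_roi_entry("prog_b")
print(sync.should_record_stats())           # True: start measuring
```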

    Exploiting new design tradeoffs in chip multiprocessor caches

    The microprocessor industry has converged on the chip multiprocessor (CMP) as the architecture of choice for utilizing the numerous on-chip transistors. Multiple CMP cores substantially increase the pressure on the limited on-chip cache capacity while requiring fast data access. The lowest-level on-chip CMP cache not only needs to utilize its capacity effectively but also has to mitigate the increased latencies due to slow wire-delay scaling. Conventional shared and private caches can provide either capacity or fast access, but not both. To mitigate wire delays in large lower-level caches, this thesis proposes a novel technique called distance associativity, which employs non-uniform access latency for widely-spaced cache subarrays. Distance associativity enables flexible placement of a core's frequently-accessed data in the closest subarrays for fast access. To provide both capacity and fast access in CMP caches, this thesis makes the key observation that CMPs fundamentally reverse the latency-capacity tradeoff that exists in conventional symmetric multiprocessors (SMPs) and distributed shared-memory multiprocessors (DSMs). While CMPs rely on limited on-chip cache capacity but fast on-chip communication, SMPs and DSMs have virtually unlimited cache capacity but slow off-chip communication. To exploit this tradeoff reversal, this thesis proposes three novel mechanisms: (i) controlled replication, (ii) in-situ communication, and (iii) capacity stealing. This work also observes that commercial multithreaded programs exhibit substantial variations in capacity demands and communication behaviors. Optimizations using static replication thresholds, such as controlled replication and in-situ communication, cannot adapt to workload variations. To this end, this thesis proposes the use of dynamic replication thresholds in controlled replication and in-situ communication. Experimental results show that for a 4-core CMP with an 8 MB cache, the proposed CMP-NuRAPID cache outperforms conventional shared caches by 20% and 33% on multithreaded and multiprogrammed workloads, respectively.
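
    The distance-associativity idea can be sketched as a placement policy over distance groups of subarrays. The snippet below is a minimal sketch under assumptions (the group structure, access counters, and promotion rule are illustrative, not the thesis design): frequently-accessed blocks migrate toward the group of subarrays closest to the requesting core, while colder blocks are pushed farther away.

```python
# Minimal sketch of distance-associative placement; all policies here
# (counters, promotion by swap, coldest-block eviction) are assumptions
# for illustration, not the thesis design.

class DistanceAssocCache:
    def __init__(self, group_capacities):
        # group 0 = closest/fastest subarrays; last group = farthest/slowest
        self.groups = [dict() for _ in group_capacities]   # block -> access count
        self.capacities = list(group_capacities)

    def access(self, block):
        for g, frames in enumerate(self.groups):
            if block in frames:
                frames[block] += 1
                self._maybe_promote(block, g)
                return True                              # hit in group g
        self._insert(block, len(self.groups) - 1)        # miss: fill farthest group
        return False

    def _maybe_promote(self, block, g):
        if g == 0:
            return                                       # already closest
        closer = self.groups[g - 1]
        if len(closer) < self.capacities[g - 1]:
            closer[block] = self.groups[g].pop(block)    # free frame closer in
        elif min(closer.values()) < self.groups[g][block]:
            victim = min(closer, key=closer.get)         # coldest closer block
            closer[block] = self.groups[g].pop(block)    # swap hot and cold
            self.groups[g][victim] = closer.pop(victim)

    def _insert(self, block, g):
        frames = self.groups[g]
        if len(frames) >= self.capacities[g]:
            frames.pop(min(frames, key=frames.get))      # evict coldest block
        frames[block] = 1

cache = DistanceAssocCache(group_capacities=[2, 4])
for _ in range(3):
    cache.access("hot")        # repeated use promotes "hot" toward group 0
```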

    Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache

    Wire delays continue to grow as the dominant component of latency for large caches. A recent work proposed an adaptive, non-uniform cache architecture (NUCA) to manage large, on-chip caches. By exploiting the variation in access time across widely-spaced subarrays, NUCA allows fast access to close subarrays while retaining slow access to far subarrays. While the idea of NUCA is attractive, NUCA does not employ design choices commonly used in large caches, such as sequential tag-data access for low power. Moreover, NUCA couples data placement with tag placement, forgoing the flexibility of data placement and replacement that is possible in a non-uniform-access cache. Consequently, NUCA can place only a few blocks within a given cache set in the fastest subarrays, and must employ a high-bandwidth switched network to swap blocks within the cache for high performance. In this paper, we propose the "Non-uniform access with Replacement And Placement usIng Distance associativity" cache, or NuRAPID, which leverages sequential tag-data access to decouple data placement from tag placement. Distance associativity, the placement of data at a certain distance (and latency), is separated from set associativity, the placement of tags within a set. This decoupling enables NuRAPID to flexibly place the vast majority of frequently-accessed data in the fastest subarrays, with fewer swaps than NUCA. Distance associativity fundamentally changes the trade-offs made by NUCA's best-performing design, resulting in higher performance and substantially lower cache energy. A one-ported, non-banked NuRAPID cache improves performance by 3% on average and up to 15% compared to a multi-banked NUCA with an infinite-bandwidth switched network, while reducing L2 cache energy by 77%.
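
    A minimal sketch of the decoupling NuRAPID leverages, under assumed structures (the TagEntry fields, the frame allocator, and the victim choice are illustrative, not the paper's design): tags keep ordinary set associativity, but each tag entry carries a pointer to a data frame in some distance group, so sequential tag-data access probes the tag array first and then reads only the one frame the matching tag names, independent of which set/way the tag occupies.

```python
# Hedged sketch of decoupled tag and data placement; structures and the
# simple allocator below are illustrative assumptions (data-frame eviction
# is omitted for brevity).

from dataclasses import dataclass

@dataclass
class TagEntry:
    tag: int
    group: int        # which distance group holds the data
    frame: int        # frame index within that group

class DecoupledCache:
    def __init__(self, num_sets, ways, frames_per_group):
        self.sets = [[None] * ways for _ in range(num_sets)]
        self.frames_per_group = list(frames_per_group)
        self.next_free = [0] * len(frames_per_group)     # naive frame allocator

    def lookup(self, addr):
        set_idx, tag = addr % len(self.sets), addr // len(self.sets)
        # Sequential tag-data access: probe the tag array first, then read
        # only the one data frame the matching tag points to.
        for entry in self.sets[set_idx]:
            if entry is not None and entry.tag == tag:
                return entry.group, entry.frame          # data location to read
        return None                                      # miss

    def fill(self, addr, preferred_group=0):
        set_idx, tag = addr % len(self.sets), addr // len(self.sets)
        group = preferred_group
        if self.next_free[group] >= self.frames_per_group[group]:
            group = len(self.frames_per_group) - 1       # closest full: place far
        frame = self.next_free[group]
        self.next_free[group] += 1
        ways = self.sets[set_idx]
        way = next((i for i, e in enumerate(ways) if e is None), 0)  # simple victim
        ways[way] = TagEntry(tag, group, frame)
        return group, frame
```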

    Wire Delay is not a Problem for SMT (in the near future)

    Previous papers have shown that the slow scaling of wire delays compared to logic delays will prevent superscalar performance from scaling with technology. In this paper we show that the optimal pipeline for a superscalar becomes shallower with technology when wire delays are considered, tightening previous results that deeper pipelines perform only as well as shallower pipelines. The key reason for the lack of performance scaling is that a superscalar does not have sufficient parallelism to hide the relatively increased wire delays. However, Simultaneous Multithreading (SMT) provides the much-needed parallelism. We show that an SMT running a multiprogrammed workload with just 4-way issue not only retains the optimal pipeline depth over technology generations, enabling at least a 43% increase in clock speed every generation, but also achieves the remainder of the expected speedup of two per generation through IPC. As wire delays become more dominant in future technologies, the number of programs needs to be scaled modestly to maintain the scaling trends, at least until the near-future 50nm technology. While this result ignores bandwidth constraints, using SMT to tolerate latency due to wire delays is not that simple, because SMT causes bandwidth problems. Most of the stages of a modern out-of-order-issue pipeline employ RAM and CAM structures. Wire delays in conventional, latency-optimized RAM/CAM structures prevent them from being pipelined in a scaled manner. We show that this limitation prevents scaling of SMT throughput. We use bitline scaling to allow RAM/CAM bandwidth to scale with technology. Bitline scaling enables SMT throughput to scale at the rate of two per technology generation in the near future.
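
    The claim that the optimal pipeline becomes shallower when wire delays are considered can be illustrated with a toy model (all parameters below are assumptions for illustration, not the paper's data): if the clock period is the per-stage logic delay plus latch overhead plus an unscalable wire-delay term, and IPC degrades mildly with depth, then the depth that maximizes IPC times frequency shrinks as the wire term grows.

```python
# Toy model only; every number is an assumption, not the paper's data.
# period = logic_delay/depth + latch_overhead + wire_delay; IPC is given a
# simple penalty that grows with depth. As the wire-delay term grows, the
# depth that maximizes IPC * frequency becomes shallower.

def best_depth(logic_delay, latch_overhead, wire_delay, max_depth=40):
    best_d, best_perf = 1, 0.0
    for depth in range(1, max_depth + 1):
        period = logic_delay / depth + latch_overhead + wire_delay
        ipc = 1.0 / (1.0 + 0.03 * depth)       # assumed hazard/flush penalty
        perf = ipc / period                    # instructions per unit time
        if perf > best_perf:
            best_d, best_perf = depth, perf
    return best_d

for wire in (0.0, 0.2, 0.5, 1.0):              # growing relative wire delay
    print(f"wire delay {wire}: best depth {best_depth(16.0, 0.5, wire)}")
```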

    Coordinated Refresh: Energy Efficient Techniques for DRAM Refresh Scheduling

    As the size and speed of DRAM devices increase, the performance and energy overheads due to refresh become more significant. To reduce the refresh penalty, we propose techniques referred to collectively as “Coordinated Refresh”, in which the scheduling of low-power modes and refresh commands is coordinated so that most of the required refreshes are issued while the DRAM device is in the deepest low-power Self Refresh (SR) mode. Our approach saves DRAM background power because the peripheral circuitry and clocks are turned off in the SR mode. Our proposed solutions improve DRAM energy efficiency by 10% compared to the baseline, averaged across all the SPEC CPU 2006 benchmarks.
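
    A hedged sketch of the coordination idea follows (the postponement limit and method names are assumptions, not the paper's policy): refreshes are postponed, within the slack the DRAM standard allows, while the rank is active, and the accumulated backlog is issued once the device enters the deepest low-power Self Refresh mode, where the peripheral circuitry and clocks are already off.

```python
# Hedged sketch of coordinating refresh with low-power modes; the limit and
# the return strings are illustrative assumptions.

MAX_POSTPONED = 8          # assumed limit on postponable refresh commands

class CoordinatedRefresh:
    def __init__(self):
        self.postponed = 0
        self.in_self_refresh = False

    def on_refresh_deadline(self):
        """Called each refresh interval when a refresh would normally issue."""
        if self.in_self_refresh:
            return "refresh-in-SR"            # cheap: rank already powered down
        if self.postponed < MAX_POSTPONED:
            self.postponed += 1               # defer, hoping for an idle period
            return "postponed"
        return "forced-refresh"               # out of slack: refresh while active

    def on_enter_self_refresh(self):
        self.in_self_refresh = True
        issued, self.postponed = self.postponed, 0
        return issued                         # drain the refresh backlog in SR

    def on_exit_self_refresh(self):
        self.in_self_refresh = False
```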

    Optimizing replication, communication, and capacity allocation in cmps

    Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as SMPs do, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing, in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) cache to CMPs, and call our cache CMP-NuRAPID. Our results show that for a 4-core CMP with an 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
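
    The controlled-replication decision described in point (1) can be sketched as a simple reuse-threshold policy. The snippet below is illustrative only (the threshold value and counters are assumptions, not the paper's exact policy): a block that is read-shared is served from the existing on-chip copy until it shows enough reuse at this core to justify a local replica, trading a little extra latency for preserved on-chip capacity.

```python
# Hedged sketch of a controlled-replication decision; the threshold and
# bookkeeping are assumptions for illustration.

REPLICATION_THRESHOLD = 2          # assumed reuses before a local copy is made

class ControlledReplication:
    def __init__(self):
        self.local_copies = set()          # blocks replicated near this core
        self.remote_reuse = {}             # block -> reads served remotely

    def on_read(self, block, on_chip_elsewhere):
        if block in self.local_copies:
            return "local-hit"
        if on_chip_elsewhere:
            n = self.remote_reuse.get(block, 0) + 1
            self.remote_reuse[block] = n
            if n >= REPLICATION_THRESHOLD:
                self.local_copies.add(block)     # now worth a private copy
                return "remote-hit-then-replicate"
            return "remote-hit-no-copy"          # saves on-chip capacity
        return "off-chip-miss"
```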