
    Reducing DRAM access latency by exploiting DRAM leakage characteristics and common access patterns

    DRAM-based memory is a critical factor that creates a bottleneck in system performance, since processor speed has far outpaced DRAM latency. In this thesis, we develop a low-cost mechanism, called ChargeCache, which enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips. Our mechanism is based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster. To exploit this observation, we propose to track the addresses of recently-accessed rows in a table in the memory controller. If a later DRAM request hits in that table, the memory controller uses lower timing parameters, leading to reduced DRAM latency. Row addresses are removed from the table after a specified duration to ensure rows that have leaked too much charge are not accessed with lower latency. We evaluate ChargeCache on a wide variety of workloads and show that it provides significant performance and energy benefits for both single-core and multi-core systems.
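    The mechanism lends itself to a compact behavioral model. Below is a minimal sketch, in Python, of the table-based idea the abstract describes: the memory controller remembers recently-accessed DRAM row addresses and serves repeat accesses to those rows with reduced timing parameters, dropping entries after a fixed window. The class name, table capacity, expiry window, and latency values are illustrative assumptions, not the thesis's actual parameters.

from collections import OrderedDict

class ChargeCacheSketch:
    """Behavioral model of a small table of recently-accessed DRAM rows."""

    def __init__(self, capacity=128, caching_duration_ns=1_000_000,
                 default_latency_ns=45, reduced_latency_ns=30):
        self.capacity = capacity
        self.caching_duration_ns = caching_duration_ns  # assumed expiry window
        self.default_latency_ns = default_latency_ns    # assumed standard timing
        self.reduced_latency_ns = reduced_latency_ns    # assumed lowered timing
        self.table = OrderedDict()  # row address -> time of last access (ns)

    def access(self, row_addr, now_ns):
        """Return the latency to use for this access, then update the table."""
        # Drop rows that have been idle long enough to leak too much charge.
        for addr in [a for a, t in self.table.items()
                     if now_ns - t > self.caching_duration_ns]:
            del self.table[addr]

        hit = row_addr in self.table

        # Record (or refresh) this row as recently accessed; evict the oldest
        # entry if the table is full.
        self.table[row_addr] = now_ns
        self.table.move_to_end(row_addr)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)

        return self.reduced_latency_ns if hit else self.default_latency_ns

    For example, two accesses to the same row issued within the expiry window would yield the default latency for the first access and the reduced latency for the second, mirroring the row access locality the mechanism exploits.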

    Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

    Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., the sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load will go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves the performance of a state-of-the-art baseline. We open-source Hermes. To appear in the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022.
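    As a rough illustration of the prediction step, the sketch below models a perceptron-style off-chip load predictor: each program feature indexes a small weight table, the selected weights are summed, and the sum is compared against a threshold. The feature set, table sizes, threshold, and weight limit here are assumptions for illustration, not Hermes's actual configuration.

class PerceptronOffChipPredictor:
    """Sketch of a perceptron-style predictor for off-chip (cache-missing) loads."""

    def __init__(self, table_size=1024, threshold=4, weight_limit=31):
        self.table_size = table_size
        self.threshold = threshold        # assumed activation threshold
        self.weight_limit = weight_limit  # assumed saturation limit for weights
        # One small weight table per program feature (feature choice is assumed).
        self.tables = {"pc": [0] * table_size, "pc_history": [0] * table_size}

    def _indices(self, features):
        # Hash each feature value into its corresponding weight table.
        return {name: hash(value) % self.table_size
                for name, value in features.items()}

    def predict(self, features):
        """Return True if the load is predicted to miss the on-chip caches."""
        total = sum(self.tables[name][i]
                    for name, i in self._indices(features).items())
        return total >= self.threshold

    def train(self, features, went_off_chip):
        """Update the weights once the load's actual outcome is known."""
        step = 1 if went_off_chip else -1
        for name, i in self._indices(features).items():
            w = self.tables[name][i] + step
            # Saturate weights so they stay within a small hardware-friendly range.
            self.tables[name][i] = max(-self.weight_limit,
                                       min(self.weight_limit, w))

    In a Hermes-like design, predict() would be consulted once the load's address is available, and a positive prediction would trigger a speculative request to the memory controller in parallel with the on-chip cache lookup; train() would run once the load's actual hit/miss outcome is known.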

    ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality

    22nd IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2016, Barcelona, Spain.
    DRAM latency continues to be a critical bottleneck for system performance. In this work, we develop a low-cost mechanism, called ChargeCache, that enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips. Our mechanism is based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster. To exploit this observation, we propose to track the addresses of recently-accessed rows in a table in the memory controller. If a later DRAM request hits in that table, the memory controller uses lower timing parameters, leading to reduced DRAM latency. Row addresses are removed from the table after a specified duration to ensure rows that have leaked too much charge are not accessed with lower latency. We evaluate ChargeCache on a wide variety of workloads and show that it provides significant performance and energy benefits for both single-core and multi-core systems.

    Design of a distributed memory unit for clustered microarchitectures

    Power constraints led to the end of exponential growth in single-processor performance, which characterized the semiconductor industry for many years. Single-chip multiprocessors have allowed performance growth to continue so far. Yet, Amdahl's law asserts that the overall performance of future single-chip multiprocessors will depend crucially on single-processor performance. In a multiprocessor, a small gain in single-processor performance can justify the use of significant resources. Partitioning the layout of critical components can improve the energy efficiency and ultimately the performance of a single processor. In a clustered microarchitecture, parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the smaller size and complexity of the cluster components. Because the clusters together process a single instruction stream, communication between clusters is necessary and introduces an additional cost. This thesis proposes the design of a distributed memory unit and first-level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied, the distribution of the memory unit and the cache has received comparatively little attention. The first proposal consists of a set of cache bank predictors; eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit: the load and store queues are split into smaller queues for distributed disambiguation, and the mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues. A bank predictor is used to map instructions that consume memory data near the data's origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues, mechanisms that avoid load/store queue overflows caused by the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add to the memory unit the functionality to select instructions for execution and re-execution. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation, a deadlock-safe issue policy for the Memory Issue Queues; deadlocks can result from certain queue allocations because entries are allocated out of order instead of in order, as in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries: architectures with weak memory ordering, such as Alpha, PowerPC, or ARMv7, can take advantage of this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy-efficient load queues without the need for energy-hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and improved performance.
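    As one illustration of the bank prediction component, the sketch below models a simple last-bank predictor: a PC-indexed table that predicts the cache bank a memory instruction will access by remembering the bank it used the previous time. The table size, hashing, and bank mapping are assumptions for illustration; the thesis itself compares eight predictor designs of varying cost and accuracy.

class LastBankPredictor:
    """PC-indexed table that predicts the cache bank a memory instruction will use."""

    def __init__(self, num_banks=4, table_size=512, line_size=64):
        self.num_banks = num_banks
        self.line_size = line_size
        self.table_size = table_size
        self.table = [0] * table_size  # last observed bank per PC hash bucket

    def predict(self, pc):
        """Predict the bank before the effective address is known."""
        return self.table[hash(pc) % self.table_size]

    def update(self, pc, address):
        """After address calculation, record the bank that was actually accessed."""
        actual_bank = (address // self.line_size) % self.num_banks
        self.table[hash(pc) % self.table_size] = actual_bank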