403 research outputs found

    LOT-ECC: LOcalized and tiered reliability mechanisms for commodity memory systems

    Get PDF
    pre-printMemory system reliability is a serious and growing concern in modern servers. Existing chipkill-level mem- ory protection mechanisms suffer from several draw- backs. They activate a large number of chips on ev- ery memory access - this increases energy consump- tion, and reduces performance due to the reduction in rank-level parallelism. Additionally, they increase ac- cess granularity, resulting in wasted bandwidth in the absence of sufficient access locality. They also restrict systems to use narrow-I/O x4 devices, which are known to be less energy-efficient than the wider x8 DRAM de- vices. In this paper, we present LOT-ECC, a local- ized and multi-tiered protection scheme that attempts to solve these problems. We separate error detection and error correction functionality, and employ simple checksum and parity codes effectively to provide strong fault-tolerance, while simultaneously simplifying imple- mentation. Data and codes are localized to the same DRAM row to improve access efficiency. We use sys- tem firmware to store correction codes in DRAM data memory and modify the memory controller to handle data mapping. We thus build an effective fault-tolerance mechanism that provides strong reliability guarantees, activates as few chips as possible (reducing power con- sumption by up to 44.8% and reducing latency by up to 46.9%), and reduces circuit complexity, all while work- ing with commodity DRAMs and operating systems. Fi- nally, we propose the novel concept of a heterogeneous DIMM that enables the extension of LOT-ECC to x16 and wider DRAM parts

    GPGPU microbenchmarking for irregular application optimization

    Get PDF
    Irregular applications, such as unstructured mesh operations, do not easily map onto the typical GPU programming paradigms endorsed by GPU manufacturers, which mostly focus on maximizing concurrency for latency hiding. In this work, we show how alternative techniques focused on latency amortization can be used to control overall latency while requiring less concurrency. We used a custom-built microbenchmarking framework to test several GPU kernels and show how the GPU behaves under relevant workloads. We demonstrate that coalescing is not required for efficacious performance; an uncoalesced access pattern can achieve high bandwidth - even over 80% of the theoretical global memory bandwidth in certain circumstances. We also make other further observations on specific relevant behaviors of GPUs. We hope that this study opens the door for further investigation into techniques that can exploit latency amortization when latency hiding does not achieve sufficient performance

    LOT-ECC: LOcalized and tiered reliability mechanisms for commodity memory systems

    Get PDF
    pre-printMemory system reliability is a serious and growing concern in modern servers. Existing chipkill-level mem- ory protection mechanisms suffer from several draw- backs. They activate a large number of chips on ev- ery memory access - this increases energy consump- tion, and reduces performance due to the reduction in rank-level parallelism. Additionally, they increase ac- cess granularity, resulting in wasted bandwidth in the absence of sufficient access locality. They also restrict systems to use narrow-I/O x4 devices, which are known to be less energy-efficient than the wider x8 DRAM de- vices. In this paper, we present LOT-ECC, a local- ized and multi-tiered protection scheme that attempts to solve these problems. We separate error detection and error correction functionality, and employ simple checksum and parity codes effectively to provide strong fault-tolerance, while simultaneously simplifying imple- mentation. Data and codes are localized to the same DRAM row to improve access efficiency. We use sys- tem firmware to store correction codes in DRAM data memory and modify the memory controller to handle data mapping. We thus build an effective fault-tolerance mechanism that provides strong reliability guarantees, activates as few chips as possible (reducing power con- sumption by up to 44.8% and reducing latency by up to 46.9%), and reduces circuit complexity, all while work- ing with commodity DRAMs and operating systems. Fi- nally, we propose the novel concept of a heterogeneous DIMM that enables the extension of LOT-ECC to x16 and wider DRAM parts

    Reducing dram access latency by exploiting dram leakage characteristics and common access patterns

    Get PDF
    DRAM tabanlı bellek, bilgisayar sisteminde darboğaz oluşturarak sistemin başarımı sınırlayan en önemli bileşendir. Bunun sebebi işlemcilerin hız bakımından DRAM'lerin çok önünde olmasıdır. Bu tezde, ChargeCache ismini verdiğimiz, DRAM'lerin erişim gecikmesini azaltan bir yöntem geliştirdik. Bu yöntem, piyasadaki DRAM yongalarının mimarisinde bir değişiklik gerektirmediği gibi, bellek denetimcisinde de düşük donanım maliyeti olan ek birimlere ihtiyaç duymaktadır. ChargeCache, yeni erişilmiş DRAM satırlarının kısa bir süre sonra tekrar erişileceği gözlemine dayanmaktadır. Yeni erişilmiş satırlardaki DRAM hücreleri yüksek miktarda yük içerdiğinden, bunlara hızlı bir şekilde erişilebilir. Bu gözlemden faydalanmak için yeni erişilen satırların adreslerini bellek denetimcisi içerisinde bir tabloda tutmayı öneriyoruz. Sonraki erişim isteklerinin bu tablodaki satırlara erişmek istemesi durumunda, bellek denetimcisi yük miktarı yüksek hücrelerin erişilmek üzere olduğunu bileceğinden, DRAM erişim değiştirgelerini ayarlayarak erişimin düşük gecikmeyle tamamlanmasını sağlayabilir. Belirli bir süre sonra tablodaki satır adresleri silinerek, zaman içerisinde çok fazla yük kaybedip hızlı erişilebilme özelliğini yitirmiş satırların bu tablodan çıkarılması sağlanır. Önerdiğimiz yöntemi hem tek çekirdekli hem de çok çekirdekli mimarilerde benzetim ortamında deneyerek, yöntemin başarım ve enerji kullanımı açısından sistem üzerinde sağladığı iyileştirmeleri inceledik.DRAM-based memory is a critical factor that creates a bottleneck on the system performance since the processor speed largely outperforms the DRAM latency. In this thesis, we develop a low-cost mechanism, called ChargeCache, which enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips. Our mechanism is based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster. To exploit this observation, we propose to track the addresses of recently-accessed rows in a table in the memory controller. If a later DRAM request hits in that table, the memory controller uses lower timing parameters, leading to reduced DRAM latency. Row addresses are removed from the table after a specified duration to ensure rows that have leaked too much charge are not accessed with lower latency. We evaluate ChargeCache on a wide variety of workloads and show that it provides significant performance and energy benefits for both single-core and multi-core systems

    DSPatch: Dual Spatial Pattern Prefetcher

    Full text link
    High main memory latency continues to limit performance of modern high-performance out-of-order cores. While DRAM latency has remained nearly the same over many generations, DRAM bandwidth has grown significantly due to higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5) and 3D-stacked memory packaging (HBM). Current state-of-the-art prefetchers do not do well in extracting higher performance when higher DRAM bandwidth is available. Prefetchers need the ability to dynamically adapt to available bandwidth, boosting prefetch count and prefetch coverage when headroom exists and throttling down to achieve high accuracy when the bandwidth utilization is close to peak. To this end, we present the Dual Spatial Pattern Prefetcher (DSPatch) that can be used as a standalone prefetcher or as a lightweight adjunct spatial prefetcher to the state-of-the-art delta-based Signature Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of modulated spatial bit-patterns. The key idea is to: (1) represent program accesses on a physical page as a bit-pattern anchored to the first "trigger" access, (2) learn two spatial access bit-patterns: one biased towards coverage and another biased towards accuracy, and (3) select one bit-pattern at run-time based on the DRAM bandwidth utilization to generate prefetches. Across a diverse set of workloads, using only 3.6KB of storage, DSPatch improves performance over an aggressive baseline with a PC-based stride prefetcher at the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in memory-intensive workloads and up to 26%). Moreover, the performance of DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to 10% when DRAM bandwidth is doubled.Comment: This work is to appear in MICRO 201
    corecore