2 research outputs found

    Die-Stacked DRAM Caches for Servers: Hit Ratio, Latency, or Bandwidth? Have It All with Footprint Cache

    Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories — block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at a larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality, at the cost of moving large amounts of data on and off the chip. This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache — i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB of Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing the dynamic energy of the stacked DRAM by 24%.
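    To make the mechanism concrete, the following is a minimal sketch (in Python, not the authors' implementation) of the bookkeeping loop the abstract describes: on a page miss, only the predicted footprint is fetched; blocks touched during residency are recorded in a bitvector; and that observed footprint trains the predictor when the page is evicted. The class and method names, the trivial fallback predictor, and the page geometry (2KB pages of 64B blocks, i.e., 32 blocks per page) are illustrative assumptions.

        BLOCKS_PER_PAGE = 32  # illustrative geometry: 2KB page / 64B blocks

        class AlwaysFullPredictor:
            # Trivial stand-in: predict the whole page, which degenerates to a
            # conventional page-based cache. A code-correlated predictor is
            # sketched after the second abstract below.
            def predict(self, key):
                return (1 << BLOCKS_PER_PAGE) - 1
            def train(self, key, footprint):
                pass

        class FootprintCache:
            def __init__(self, predictor):
                self.predictor = predictor
                self.resident = {}  # page tag -> [fetched_mask, touched_mask]

            def access(self, page_tag, block_idx, pred_key):
                if page_tag not in self.resident:
                    # Page miss: allocate the page but fetch only the predicted
                    # footprint, always including the block that caused the miss.
                    fetched = self.predictor.predict(pred_key) | (1 << block_idx)
                    self.resident[page_tag] = [fetched, 0]
                entry = self.resident[page_tag]
                if not entry[0] & (1 << block_idx):
                    entry[0] |= 1 << block_idx  # underprediction: fetch on demand
                entry[1] |= 1 << block_idx      # record the touch

            def evict(self, page_tag, pred_key):
                # The touched-block mask accumulated over the page's residency
                # is its footprint; use it to train the predictor.
                entry = self.resident.pop(page_tag)
                self.predictor.train(pred_key, entry[1])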

    Multi-Gigabyte On-Chip DRAM Caches for Servers

    While DRAM latency has long been recognized as a major bottleneck in servers, DRAM bandwidth is emerging as an important bottleneck as server processors shift to many-core architectures to allow for sustainable throughput improvements. The rapid expansion of the digital universe, increasingly stored in memory, also drives the need for higher DRAM density. Emerging die-stacked DRAM technology dramatically improves the three major DRAM properties: latency, bandwidth, and density. Recent advancements in die-stacking technology have made it possible to integrate a sizeable amount of DRAM directly on top of the processor. While feasible on-chip DRAM capacities are insufficient to satisfy the memory needs of modern servers, architecting on-chip DRAM as a high-capacity, low-latency, high-bandwidth cache has the potential to significantly reduce both off-chip memory traffic and average memory access latency. We make the observation that high-capacity on-chip DRAM caches expose the abundant spatial locality present in server applications and a modest amount of temporal data reuse. As a consequence, DRAM caches that manage and fetch data at a coarser granularity, e.g., in 2KB pages, exhibit overall superior properties compared to caches that do fine-grain management using 64B blocks: substantially higher hit rates, smaller tag storage, higher energy efficiency, and higher set-associativity. Unfortunately, naive employment of page-based caches results in excessive data overfetch and capacity waste, as some of the fetched and allocated blocks are never accessed prior to their eviction. We demonstrate that if the cache is organized in pages, then page footprints (i.e., the sets of blocks that are touched while a page is in the cache) are highly predictable using well-established code-correlation techniques. Accurately predicting access patterns within a page can eliminate most of the bandwidth overhead and capacity waste that page-based caches suffer from.
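    The abstract attributes footprint predictability to code correlation. As a rough illustration only, here is a minimal sketch of such a predictor, assuming (in the spirit of spatial-pattern predictors from the literature) that footprints are keyed by the PC of the instruction that triggers the page miss combined with the block offset of the first access; the table size, hash, and names are hypothetical, not the paper's actual design.

        TABLE_ENTRIES = 4096  # assumed predictor table size

        class CodeCorrelatedPredictor:
            def __init__(self):
                self.table = {}  # table index -> last observed footprint bitvector

            @staticmethod
            def make_key(trigger_pc, first_block_idx):
                # Correlate the footprint with the code path (PC of the missing
                # instruction) and the spatial starting point within the page.
                return (trigger_pc ^ (first_block_idx << 16)) % TABLE_ENTRIES

            def predict(self, key):
                # Unseen key: predict no extra blocks; the cache still fetches
                # the demanded block itself, so coverage degrades gracefully.
                return self.table.get(key, 0)

            def train(self, key, observed_footprint):
                # Remember the footprint observed at eviction; a later page miss
                # from the same code context replays this block pattern.
                self.table[key] = observed_footprint

    Plugged into the FootprintCache sketch above in place of the trivial predictor (with pred_key computed by make_key at miss time), this captures the behavior both abstracts describe: most of a page's useful blocks are fetched up front, without fetching the entire page.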