Die-Stacked DRAM: Memory, Cache, or MemCache?
Die-stacked DRAM is a promising solution for satisfying the ever-increasing
memory bandwidth requirements of multi-core processors. Manufacturing
technology has enabled stacking several gigabytes of DRAM modules on the active
die, thereby providing orders of magnitude higher bandwidth as compared to the
conventional DIMM-based DDR memories. Nevertheless, die-stacked DRAM, due to
its limited capacity, cannot accommodate entire datasets of modern big-data
applications. Therefore, prior proposals use it either as a sizable memory-side
cache or as a part of the software-visible main memory. Cache designs can adapt
themselves to the dynamic variations of applications but suffer from the tag
storage/latency/bandwidth overhead. On the other hand, memory designs eliminate
the need for tags, and hence, provide efficient access to data, but are unable
to capture the dynamic behaviors of applications due to their static nature.
In this work, we make a case for using the die-stacked DRAM partly as main
memory and partly as a cache. We observe that in modern big-data applications
there are many hot pages with a large number of accesses. Based on this
observation, we propose to use a portion of the die-stacked DRAM as main memory
to host hot pages, enabling serving a significant number of the accesses from
the high-bandwidth DRAM without the overhead of tag-checking, and manage the
rest of the DRAM as a cache, for capturing the dynamic behavior of
applications. In this proposal, a software procedure pre-processes the
application and determines hot pages, then asks the OS to map them to the
memory portion of the die-stacked DRAM. The cache portion of the die-stacked
DRAM is managed by hardware, caching data allocated in the off-chip memory.
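
The software pre-processing step described above (find the hot pages, then ask the OS to place them in the memory portion) might look roughly like the sketch below. It is an illustration only, not the authors' procedure: the 4 KiB page size, the use of a recorded address trace as profiling input, and the function names `select_hot_pages` and `split_accesses` are all assumptions made for this example, and the actual OS mapping call is left out.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes (not given in the abstract)


def select_hot_pages(addresses, memory_portion_bytes, page_size=PAGE_SIZE):
    """Return the most-accessed page numbers that fit in the memory portion.

    `addresses` is a hypothetical profiling trace of byte addresses.
    """
    # Count accesses per page.
    counts = Counter(addr // page_size for addr in addresses)
    # Number of pages the memory portion of the stacked DRAM can hold.
    capacity = memory_portion_bytes // page_size
    # most_common() orders pages by access count, highest first.
    return [page for page, _ in counts.most_common(capacity)]


def split_accesses(addresses, hot_pages, page_size=PAGE_SIZE):
    """Count accesses served by the memory portion versus everything else.

    Accesses to hot pages need no tag check; the remainder would go
    through the hardware-managed cache portion or off-chip memory.
    """
    hot = set(hot_pages)
    to_memory = sum(1 for addr in addresses if addr // page_size in hot)
    return to_memory, len(addresses) - to_memory


if __name__ == "__main__":
    trace = [0, 8, 4096, 0, 8192, 16, 4100]
    hot = select_hot_pages(trace, memory_portion_bytes=2 * PAGE_SIZE)
    print(hot, split_accesses(trace, hot))
```

In this toy trace, the two most-accessed pages would be pinned in the memory portion, and only accesses to the remaining page would depend on the cache portion.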
An Evaluation of a Commercial CC-NUMA Architecture—the CONVEX Exemplar SPP1200
Studies done with academic CC-NUMA machines and simulators indicate a good potential for application performance. Our goal, therefore, is to investigate whether the CONVEX Exemplar, a commercial distributed shared memory machine, lives up to the expected potential of CC-NUMA machines. If not, we would like to understand what architectural or implementation decisions make it less efficient. On evaluating the delivered performance on the Exemplar, we find that, while a moderate-scale Exemplar machine works well for several applications, it does not work well for some important classes. Further, performance was affected by four fundamental characteristics of the machine, all of which are due to basic implementation and design choices made on the Exemplar. These are: the effect of processor clustering together with limited node-to-network bandwidth, the effect of tertiary caches, the limited user control over data placement, the sequential memory consistency model together with a cache-based cache coherence protocol, and lastly, longer remote latencies.