The future of memory systems is Multi-Level Memory (MLM). In an MLM system, the main memory is composed of two or more types of memory instead of a conventional DDR-DRAM-only main memory. By combining different memory technologies, an MLM system can potentially offer more usable bandwidth and more capacity at a cost similar to that of a conventional memory system. However, substantial software and hardware design challenges must be overcome to realize this potential.
INTRODUCTION
Main memory systems seek to provide large capacity, low latency, and high bandwidth at a low cost. With a DDR-only memory, there are relatively few tradeoffs to be made. Latency is largely fixed by the DDR technology. Bandwidth can be improved by adding more channels, and capacity by adding more DIMMs per channel.
The goal of Multi-Level Memory is to improve capacity, latency, bandwidth and/or cost by combining multiple memory technologies. Ideally, high-bandwidth memories such as the Micron Hybrid Memory Cube (HMC) or JEDEC High Bandwidth Memory (HBM) could be combined with low-cost capacity memory such as Flash NVRAM. If the majority of accesses are serviced by the high-speed memory and the majority of the capacity is provided by the low-cost memory, overall performance would be good and overall price would be low.
* Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
MEMSYS '15,
However, there are serious challenges to making this potential realizable. Ensuring that frequently accessed data is placed in the fastest memory requires action from the programmer, OS, runtime, or hardware. The choice of which objects should be placed in slow or fast memory is often nonintuitive, even for expert programmers who are very familiar with a code. Because main memory only sees post-cache accesses, data structures that are accessed very frequently often do not account for many main memory requests because they are quickly loaded into cache. As a result, programmers will need to identify objects that are used "a lot, but not too much" and prioritize them against other similar objects. This is a difficult and complex task, which will require new analysis tools to help the programmer, and our simulation framework is a step in this direction.
Similarly, design of the hardware will require new analysis. Partitioning capacity between high-speed (and high-cost) memory, traditionally DDR, and low-speed (low-cost) NVRAM will require a better understanding of application data access patterns. Poor partitioning will lead to poor performance or wasted money.
A Post-DDR World
Computer main memory for supercomputers, desktops, and most embedded systems is dominated by Dynamic Random Access Memory (DRAM). DRAM provides relatively low cost and power compared to SRAM and much lower latency and higher bandwidth than hard disks or tape. For the last 20 years, Double Data Rate (DDR) DRAM has been the most common form of DRAM. Technological and economic factors are now emerging that will break this dominance.
The DDR standards (DDR and its successors, DDR2, DDR3, and DDR4) are defined by JEDEC, an industry group comprising all of the major memory and processor vendors. From 1989 to today, JEDEC standards [2] have defined the increasingly complex electrical interfaces that allow a processor to connect to DDR memory. The DDR-based memory standards share a number of commonalities [1].
• To save cost, they have a relatively small number of IO pins per DRAM chip.
• To increase bandwidth, multiple chips are ganged together into a package (generally a DIMM), forming a wide parallel bus.
• The individual memory chips are "dumb": they contain minimal logic and are controlled in a master-slave fashion by the processor.
• The interface allows little room for interpretation or differentiation, encouraging DDR memory to be a simple commodity.
These commonalities have benefited the computer industry by creating a common standard for memory that has seen steady performance increases, but they have hindered architectural innovation. Because DRAM is a largely undifferentiated commodity, DRAM manufacturers have slim margins, which has historically left little room for architectural innovation. Even with support from major manufacturers such as Intel, competitors to DDR, such as Rambus or FB-DIMMs, have failed due to the economies of scale and low cost of commodity DDR.
Things are changing. There is an emerging consensus that there will be no DDR-5 standard [6]. The increasing bandwidth and capacity requirements of future systems and the technical challenges of scaling the wide parallel DDR interconnect are too great. Main memory architectures will become more diverse. Advances in packaging, growth in non-volatile memories, and possibilities for merging processing and memory will lead to major changes in how memory is designed and integrated into systems. Additionally, memory vendors see new architectures as a way to differentiate themselves and offer higher-margin products. Rather than "racing to the bottom" to provide a low-cost "dumb" commodity, memory and system vendors are seeking to create advanced products which can claim a larger portion of computer revenue. These new architectural combinations open up new possibilities for system design.
Future Multi-Level Memory Systems
In this paper we examine two- and three-level memories. These MLM systems may be composed of stacked 3D memory, conventional DDR DRAM, and non-volatile memory. For this paper we look at currently available Flash NVRAM, though several emerging NV technologies may exhibit better performance, which would change this analysis.
The placement and movement of data between these levels can be performed by the programmer, compiler, OS, runtime, or hardware. However, this work focuses on programmer-directed static placement.
MAIN MEMORY ACCESS PROFILES
To understand the design space tradeoffs of MLM, we must first understand the applications and how they access main memory. To do this, we have run a selection of DOE miniapps. In these simulations, 16 cores are clustered into groups of four. Each core has a 32 KB L1 data cache (the instruction cache and instruction fetch are not modeled) and each cluster shares a 512 KB L2 cache. L2 cache misses are monitored and collected to produce a histogram of main memory accesses.
Preliminary application analysis indicates that applications' interactions with main memory vary greatly. Figure 2 shows histograms of how often pages in main memory are accessed by the processor. By sorting the histogram bins by frequency of access, distinctive patterns emerge. On the left, Lulesh, a hydrodynamics application, shows a very unequal distribution of memory accesses. Roughly 5% of pages in main memory account for about half of the memory accesses, and 35% of pages account for almost 75% of memory requests. In contrast, Minife, a finite element application, has a more equal distribution (the top 35% of pages only account for 45% of accesses). Though a small number of pages still account for a disproportionate number of requests, the bulk of the memory footprint still receives a significant number of accesses.
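As an illustration of how a sorted page histogram of this kind can be summarized, the following Python sketch bins post-cache addresses into pages and reports the share of accesses captured by the hottest fraction of pages. The 4 KB page size and the names `page_histogram` and `top_share` are assumptions made for this example; they are not details of the simulation setup.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical, not from the paper)


def page_histogram(addresses, page_size=PAGE_SIZE):
    """Count main memory accesses per page from a list of byte addresses."""
    return Counter(addr // page_size for addr in addresses)


def top_share(page_counts, fraction):
    """Return the share of all accesses that go to the hottest `fraction` of pages."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    counts = sorted(page_counts, reverse=True)  # hottest pages first
    total = sum(counts)
    if total == 0:
        return 0.0
    k = round(fraction * len(counts))  # number of pages in the hot set
    return sum(counts[:k]) / total


if __name__ == "__main__":
    hist = page_histogram([0, 100, 4096, 8191, 8192])
    print(dict(hist))
    print(top_share([50, 30, 5, 5, 4, 3, 2, 1], 0.25))
```

Evaluating `top_share` at several fractions (for example 0.05 and 0.35) gives the kind of summary quoted above for Lulesh and Minife.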
As a rough estimate of the level of inequality in memory accesses, we have computed the Gini coefficient for several applications. Originally used in economics to study income or wealth inequality, the Gini coefficient is a measure of statistical dispersion. A perfectly equal distribution (i.e. everyone has the same income, or all memory pages are accessed the same number of times) results in a Gini coefficient of 0. A perfectly unequal distribution (i.e. only one memory page is ever accessed) would have a Gini coefficient of 1.0. In a selection of DOE miniapps we have observed Gini coefficients as high as 0.92 (for reads in rsbench) and as low as 0.12 (for reads in minife). This leads us to believe that there is no easy one-size-fits-all solution for MLM data placement or for partitioning capacity between levels of memory.
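For concreteness, a minimal Python sketch of the Gini computation over per-page access counts follows. It uses the standard sorted-rank formula; the function name `gini` and the sample inputs are illustrative assumptions. With a finite number of pages n, the perfectly unequal case evaluates to 1 - 1/n, which approaches 1.0 as the footprint grows.

```python
def gini(page_counts):
    """Gini coefficient of per-page access counts (0 = equal, near 1 = unequal).

    Uses the rank formula G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    where x_i are the counts sorted in ascending order and i starts at 1.
    """
    counts = sorted(page_counts)
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one page with a non-zero access count")
    if any(c < 0 for c in counts):
        raise ValueError("access counts must be non-negative")
    weighted = sum(rank * c for rank, c in enumerate(counts, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    print(gini([5, 5, 5, 5]))    # every page accessed equally often
    print(gini([0, 0, 0, 10]))   # a single page receives every access
```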
Because the Gini coefficient was originally used as an estimate of income inequality in a society, this analysis allows us to proclaim that applications vary dramatically, much like national economies do. This indicates that these codes could be easily modified to identify which portions could fit in faster memory and which may be safely relegated to slower memory. In contrast, codes like Lulesh (middle) show multiple regions that may be more difficult to track down. Other codes, like Rsbench (right), a molecular dynamics code, have considerable variation in the number of accesses per page. A priori data placement may be very difficult for these codes, which may require more adaptive application or runtime methods to move data during execution.
PERFORMANCE AND ENERGY ESTIMATION
Economic Impacts
The primary motivation for a multi-level memory is economic. Replacing DDR main memory with a combination of memory technologies may allow a memory system that provides high bandwidth and high capacity at low cost. Application analysis (see Section 2) indicates that, for many applications, a relatively small percentage of main memory accounts for most of the cache misses. If this portion were stored in fast, expensive (e.g. 3D stacked) memory and the bulk of data kept in slower, cheaper (e.g. DDR or Flash [7]) devices, it may be possible to realize the "best of both worlds." The largest impediment to successful deployment of MLM technology is software and algorithms. Software will need to be modified to place commonly used data in the fast "near" memory and less frequently used data in the slow "far" memory. Alternatively, operating system or runtime algorithms will have to be implemented which transparently move data from one memory to the other by predicting future application requirements.
Analysis
Consider a computer with a multi-level memory hierarchy based on the technologies from Table 1 and optimized for an application like Lulesh (Figure 2, left). The 5% of memory pages that dominate memory accesses could be placed in a small HMC-like memory sized to fit. An additional 30% could be placed in conventional, low-cost DDR, and the "long tail" of infrequently touched memory pages could be placed in non-volatile Flash. Using the cost and bandwidth estimates from Table 1, we can estimate the rough cost of the memory system and an upper bound on available "average" bandwidth.
Such a memory system could cost about half as much as a conventional DDR-only memory system (Table 2). Since almost half of memory accesses would go to the HMC, the overall "average" bandwidth would be increased (Table 3). The majority of remaining accesses would go to DDR, so their performance would be no worse than in a conventional DDR-only system. Accesses to the portion of memory placed in Flash would be slower, but if the latency could be masked (perhaps with intelligent prefetching or application modifications), this may not have a large performance impact. The end result is a memory system that offers significantly increased effective bandwidth and costs about half as much as a conventional (DDR) system.
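The arithmetic behind an estimate of this kind can be sketched in a few lines of Python. Every number below is a placeholder chosen for illustration and is not taken from Table 1: the per-bit costs are hypothetical relative units with DDR normalized to 1.0, the Flash bandwidth is invented, and the HMC and DDR bandwidths simply reuse the 128 GB/s and 16 GB/s figures from the simulated configuration. We also assume that the "average" bandwidth is the access-weighted mean of per-level peak bandwidths, which is an optimistic upper bound.

```python
from dataclasses import dataclass


@dataclass
class Level:
    name: str
    capacity_fraction: float   # share of total capacity held in this level
    access_fraction: float     # share of main memory accesses served here
    cost_per_bit: float        # hypothetical relative cost, DDR = 1.0
    bandwidth_gbs: float       # placeholder peak bandwidth in GB/s


def relative_cost(levels):
    """Cost of the mix relative to an all-DDR system of the same capacity."""
    return sum(l.capacity_fraction * l.cost_per_bit for l in levels)


def average_bandwidth(levels):
    """Access-weighted mean of peak bandwidths (assumed upper-bound model)."""
    return sum(l.access_fraction * l.bandwidth_gbs for l in levels)


# Lulesh-like split: 5% / 30% / 65% of capacity; access shares are approximate.
LEVELS = [
    Level("HMC-like", 0.05, 0.50, 3.0, 128.0),
    Level("DDR",      0.30, 0.25, 1.0, 16.0),
    Level("Flash",    0.65, 0.25, 0.1, 2.0),
]

if __name__ == "__main__":
    print(f"relative cost: {relative_cost(LEVELS):.3f}")
    print(f"average bandwidth bound: {average_bandwidth(LEVELS):.1f} GB/s")
```

Swapping in the actual Table 1 values, or a different capacity split, only requires editing the `LEVELS` list.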
Design Space Exploration
MLM offers design choices for both the programmer and hardware designer. For the hardware designer, the primary design tradeoff is the mix of different memory capacities. From a cost per bit perspective, NVRAM offers the largest capacities. However, the low bandwidth and high latency of current NVRAM would make an all-NVRAM machine intolerably slow for most applications. Similarly, stacked memories such as HMC offer superior bandwidth, but do so at a substantial price premium. Because total machine budgets are fixed, using these faster memories would require sacrifices in processing, network, or other capabilities. Also, machine procurements may have fixed memory capacity requirements [4].
Using the simple analytical framework above, it is possible to quickly evaluate the potential of a variety of configurations. To give a quick overview of the design space, we analytically evaluated 240 memory configurations. Because of the complex tradeoffs among the different memory technologies, even taking a Pareto-optimal surface with respect to bandwidth, latency, and cost still yields 78 viable configurations for Lulesh (Figure 4) and 69 for minife. Within these Pareto surfaces, additional application characteristics can be seen. Examination of the 10 configurations with a cost closest to conventional DRAM shows that Lulesh configurations on average "prefer" 33% more stacked memory and 20% more NVRAM than minife.
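A Pareto filter over bandwidth, latency, and cost can be sketched as follows. This is a generic dominance check rather than the exact procedure used to produce the counts above; the `Config` type and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    bandwidth: float  # higher is better
    latency: float    # lower is better
    cost: float       # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is no worse than `b` on every metric and better on one."""
    no_worse = (a.bandwidth >= b.bandwidth
                and a.latency <= b.latency
                and a.cost <= b.cost)
    better = (a.bandwidth > b.bandwidth
              or a.latency < b.latency
              or a.cost < b.cost)
    return no_worse and better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]


# Hypothetical configurations in arbitrary units.
SAMPLE = [
    Config("A", 100.0, 50.0, 2.0),
    Config("B", 80.0, 60.0, 2.5),   # dominated by A
    Config("C", 60.0, 40.0, 1.0),
    Config("E", 50.0, 45.0, 1.2),   # dominated by C
]

if __name__ == "__main__":
    print([c.name for c in pareto_front(SAMPLE)])
```

The quadratic scan is adequate for a few hundred configurations such as the 240 evaluated here.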
SIMULATION
Four scientific computing mini-applications (miniFE v2, lulesh v2, miniaero, and rsbench) were studied here, some in more depth than others. Collection of address histograms and timing information was performed using the Structural Simulation Toolkit (SST) v5.0, DRAMSim-2.2.2, HybridSim v2, and NVDIMMSim v2. HybridSim provides a DRAM-based cache interface to the non-volatile memory simulator NVDIMMSim. The Hybrid Memory Cube-like memory was simulated using the VaultSim package in SST. The processors were built using the Intel PIN-based Ariel processor model in SST, which captures all memory operations. Sixteen of these cores were packaged into 4 quads, with each quad sharing a 512 KB 16-way associative L2 cache. Each core operates at 2 GHz and has a 32 KB 8-way associative L1 cache. Since Ariel does not simulate instruction fetches, these caches effectively act as data caches. Within each quad, the four L1 caches are connected to their shared L2 cache through a bus. Both cache levels have a 64 B line size. The L2 has 128 MSHR entries.
The addresses of the memory operations leaving the L2 caches were tracked using a cache listener module attached to the L2s. This module generates histograms for both the physical addresses after translation and the virtual addresses. The four L2s are connected to the directory controllers (DC) through the Merlin network model in SST. There is one DC for each channel in the system. This 16-core processor configuration with the Merlin router was unchanged between experiments.
Four different main memory configurations were explored: two channels of DDR (DRAMSim), two channels of DDR with one channel of HMC (VaultSim), and one channel of HMC with two channels of HybridSim memory. The memories are serviced by dedicated Memory Controller (MC) components connected to both the memories and the DC. The HMC cubes contain 8 vaults with a peak data throughput of 128 GB/s. The DDR channels were configured to operate at 16 GB/s. The HybridSim memory was configured to operate at the same clock as DDR. The size of the HMC was selected based on the knees in the sorted memory histogram profiles for miniFE. The sizes of the NV and DDR memories were chosen to keep the total system cost the same as or less than that of a DDR-only solution. We analyzed the main memory access profiles of all four mini-applications for different problem sizes using this simulation setup, but we performed the more time-consuming simulations for hybrid system comparisons for miniFE only.
In the 8 GB two-channel DDR configuration, the pages were striped at 1 KB boundaries. We found that, for the miniFE problems studied here, the first 32 MB of physical addresses were accessed more frequently than the remaining 150 MB or 300 MB. Therefore the HMC with DDR configuration assigns the first 32 MB of the physical address space to HMC, and the remaining 8 GB to the two DDR channels. The HMC with HybridSim configuration performs a similar partitioning between the 32 MB HMC and 8 GB of HybridSim memory, where each 4 GB HybridSim channel is fronted by a 32 MB DRAM cache.
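The static partitioning described above amounts to a simple address-routing function, sketched here in Python. The sketch assumes that the 1 KB striping also applies to the DDR channels in the HMC with DDR configuration and that striping restarts at the first address past the HMC region; neither detail is stated in the text, and the function name `route` is hypothetical.

```python
MIB = 2 ** 20
HMC_BYTES = 32 * MIB      # first 32 MB of physical addresses go to HMC
STRIPE_BYTES = 1024       # 1 KB striping across DDR channels (assumed here)
DDR_CHANNELS = 2


def route(paddr):
    """Map a physical byte address to a (memory, channel) pair."""
    if paddr < 0:
        raise ValueError("physical address must be non-negative")
    if paddr < HMC_BYTES:
        return ("HMC", 0)
    # Assumption: striping is computed on the offset past the HMC region.
    offset = paddr - HMC_BYTES
    return ("DDR", (offset // STRIPE_BYTES) % DDR_CHANNELS)


if __name__ == "__main__":
    for addr in (0, HMC_BYTES - 1, HMC_BYTES, HMC_BYTES + 1024):
        print(hex(addr), route(addr))
```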
The physical addresses of the hot pages may change between runs of an application for reasons such as threading. Physical addresses are bound to virtual addresses on a first-come, first-served basis. We found that even the virtual addresses of hot pages change considerably between runs due to malloc calls. We have implemented our own version of heap management, customized for the miniapps studied, to produce nearly consistent virtual addresses for the same data across runs. Ariel provides the ability to explicitly map virtual addresses at page granularity to different levels of memory. We will use this feature in the future to assign hot virtual address pages of miniapps such as lulesh to high-bandwidth memories. Similarly, combining a different custom malloc implementation, which generates backtraces at malloc call sites, with the ability to track hot pages through simulation profiling allows us to build tools that automatically identify the heap allocations for hot data structures in the application source. The programmer can then replace these allocations with fast heap allocation calls.
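One way such a tool could join the two data sources is sketched below in Python: allocation records from an instrumented malloc are matched against a per-page access histogram, and call sites are ranked by the accesses that fall on their pages. The record layout `(site, start, size)`, the 4 KB page size, and the name `rank_allocation_sites` are all assumptions for illustration and do not describe the actual tool.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


def rank_allocation_sites(allocations, page_counts, page_size=PAGE_SIZE):
    """Rank malloc call sites by the accesses landing on pages they cover.

    allocations: iterable of (site, start_address, size_in_bytes) records.
    page_counts: dict mapping virtual page number to access count.
    A page shared by two allocations is credited to both of them.
    """
    totals = defaultdict(int)
    for site, start, size in allocations:
        if size <= 0:
            continue  # ignore empty allocations
        first_page = start // page_size
        last_page = (start + size - 1) // page_size
        for page in range(first_page, last_page + 1):
            totals[site] += page_counts.get(page, 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    allocs = [("solver.c:10", 8192, 4096), ("mesh.c:42", 0, 8192)]
    counts = {0: 100, 1: 50, 2: 7}
    print(rank_allocation_sites(allocs, counts))
```

The sites at the top of the ranking are candidates for placement in the fast memory.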
The simulations for miniFE indicate that a combination of HMC stacked memory and DDR, with appropriate partitioning of the data, can greatly improve performance (55% improvement in execution time). Though the HMC costs more per bit, the overall performance for a fixed cost is still improved (14% improvement). In these experiments, the three-level HMC+DDR+NVRAM system has decreased performance due to the large latencies of the Flash memory. However, upcoming NV technologies may have lower latencies, and thus improved performance.
Figure: MiniFE simulations. Performance and Perf/Cost for four configurations: DDR only; 18% HMC, 82% DDR; 18% HMC, 36% DDR (cache), 64% NV; 18% HMC, 18% DDR (cache), 64% NV.
RELATED WORK
Several vendors have announced future products that will use MLM or a roadmap which may contain MLM systems:
• Marvell [3] has proposed a three-level system combining a high-speed DRAM (e.g. HBM, HMC, or WideIO) for performance with conventional DDR-DRAM and non-volatile Flash memory for capacity.
• Intel's Knights Landing processor will use a modified HMC for performance plus conventional DDR DRAM for capacity.
• AMD is actively exploring 2- and 3-tiered memory systems.
FUTURE WORK
Future experiments will focus on gaining a better understanding of the cost, power, latency, and performance tradeoffs in a wider range of applications.
This understanding will be used to develop tools and techniques for MLM design and programming. Tools to identify frequently used data and guide placement will be necessary for programmers to utilize MLM systems. Similarly, tools for quick analysis of hardware configurations will be developed.
A major focus will be on techniques beyond static programmer-directed placement. Programmer-directed prefetching and double/triple buffering will be explored. Another option is for the OS or runtime to direct the placement of data. Hardware management, similar to a cache, should be examined along with ways to decrease the amount of metadata (tags) that is required for cache management.
CONCLUSIONS
For a variety of reasons, future memory systems will be multi-level. The early simulations and analysis of applications indicate that there is strong potential to create memory systems which are faster and/or less costly than conventional DDR-only systems.
However, navigating this vast new design space will require better tools for software and hardware engineers. A better understanding of application requirements and hardware capabilities, as shown here, will be needed to perform detailed codesign of future machines and the codes which run on them.
