Managing multi-level memories will require different policies from those used for cache hierarchies, as memory technologies differ in latency, bandwidth, and volatility. To this end we analyze application data allocations and main memory accesses to determine whether an application-driven approach to managing a multi-level memory system comprising stacked and conventional DRAM is viable. Our early analysis shows that the approach is viable, but some applications may require dynamic allocations (i.e., migration) while others are amenable to static allocation.
INTRODUCTION
With multi-level memory (e.g., multiple types of main memory in the same system) on the horizon, new approaches to managing and allocating memory are needed. The prevailing "always-allocate" policy for managing hardware caches relies on the low cost of allocating data in caches, as well as the fact that caches closer to the processor provide lower latency and higher bandwidth. Therefore, assuming temporal locality, migrating data closer to the processor is usually beneficial. In contrast, because memories are generally managed at larger granularities (e.g., 4KB pages) than caches (e.g., 64B cache lines), the cost of migrating data between memories is significantly higher. Thus, to avoid high-cost, low-reward migrations, allocation must be selective.
Additionally, unlike caches, successive memory levels may not strictly decrease in latency and increase in bandwidth. In this work, for example, we consider the management of a two-level memory system with a "near" high-bandwidth, stacked memory (e.g., HBM or HMC) and a "far" conventional DDR DRAM. In this system the near memory latency is equal to or greater than the far memory latency, while bandwidth is substantially higher to the near memory.
Accordingly, effective management of the near memory will involve identifying the data that accounts for a disproportionately large fraction of main memory accesses and selectively allocating that data in near memory. There are two primary strategies for identifying such data: (1) real-time monitoring and prediction by the system (i.e., hardware and/or operating system) and (2) application analysis. Because monitoring and prediction can be complex, we consider the second approach and seek to determine the viability of having the application provide hints to the system about where to allocate a given piece of data.
In the following sections we analyze the memory access behavior of three high-performance computing workloads to determine whether an application-driven allocation strategy of marking individual allocations as suitable for near memory placement is viable, and whether static allocation is sufficient or dynamic allocation (migration) may be needed. For the applications studied, we find that: (1) a small subset of malloc call sites consumes a majority of the main memory bandwidth, implying that application-directed allocation may be possible; and (2) the same allocations tend to be accessed similarly when we compare memory accesses throughout the application.
RELATED WORK
While cache management by both the system and applications has been studied in detail, few existing works look at the problem of managing multiple types of main memory in a system [6]. Of the latter, some have looked at system approaches for managing a DRAM/NVRAM multi-level memory to reduce energy consumption [7, 4]. Others have considered stacked and conventional DRAM memory systems. The work in [8] considers replacement policies for using stacked DRAM as an OS-managed cache for conventional DDR, but finds that the best policy is workload- and memory-size-dependent.
ALLOCATION ANALYSIS
We now analyze the main memory behavior of malloc calls across three high-performance computing applications: HPCG, PENNANT, and MiniFE.
Methodology
We use the Structural Simulation Toolkit (SST) [5] to simulate and analyze the threaded version of each application. In particular, we combine SST's Ariel processor model with the memSieve memory analysis model. memSieve predicts whether a memory request will hit in the cache hierarchy without simulating a full, detailed hierarchy. As such, memSieve executes considerably faster than a cycle-accurate cache simulation, yet provides reasonably accurate predictions. We use Ariel's PIN tool to identify malloc call sites and link the mallocs to later accesses in memSieve. Here we define a malloc as any dynamic allocation, whether explicitly called by the application or implicit within a library (e.g., in C++ standard container modifiers). We model a 16-core processor with a 16MB shared last-level cache.
The input parameters for the three applications are shown in Table 1; each input yields an 8GB data footprint. Because the applications at this scale cannot be simulated in full in a reasonable time frame, we simulate samples of 1-2 iterations throughout each application. Note that while the access counting described above is done only within each sample, mallocs are tracked across the entire execution so that accesses to regions allocated before the sample starts are correctly attributed.
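The attribution step described above (linking each main memory access back to the malloc call site that owns the address) can be sketched as an interval lookup over live allocations. This is only an illustrative sketch and not the Ariel or memSieve implementation; the class name `AllocationTracker`, its methods, and the toy addresses are all hypothetical.

```python
import bisect
from collections import Counter


class AllocationTracker:
    """Hypothetical helper: maps live heap ranges to their allocating call site."""

    def __init__(self):
        self._starts = []                  # sorted start addresses of live allocations
        self._live = {}                    # start address -> (size in bytes, call site)
        self.bytes_by_site = Counter()     # bytes ever allocated at each call site
        self.accesses_by_site = Counter()  # main memory accesses credited to each site

    def malloc(self, addr, size, site):
        if addr in self._live:             # address reused without an observed free
            self.free(addr)
        bisect.insort(self._starts, addr)
        self._live[addr] = (size, site)
        self.bytes_by_site[site] += size

    def free(self, addr):
        if addr in self._live:
            del self._live[addr]
            self._starts.pop(bisect.bisect_left(self._starts, addr))

    def access(self, addr):
        """Credit one access to the call site owning addr; return it, or None."""
        i = bisect.bisect_right(self._starts, addr) - 1
        if i < 0:
            return None
        start = self._starts[i]
        size, site = self._live[start]
        if addr < start + size:
            self.accesses_by_site[site] += 1
            return site
        return None


# Toy trace (made-up addresses): site "A" is allocated before the sample
# begins, yet accesses observed later are still attributed to it.
tracker = AllocationTracker()
tracker.malloc(0x1000, 256, site="A")
tracker.malloc(0x2000, 64, site="B")
tracker.free(0x2000)
for address in (0x1000, 0x10FF, 0x2000, 0x1100):
    tracker.access(address)
```

Keeping the allocation map up to date for the whole run, while only counting accesses inside a sample, is what lets a sample correctly credit regions that were allocated long before it started.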
HPCG
In this section we analyze the memory behavior of HPCG's conjugate gradient (CG) loop. We define a call site as a unique call stack ending with an allocation call, and group allocations by call site. Figure 1 shows the main memory access density of each call site along the left Y-axis, measured as the access count per byte allocated at that site. Call sites with no accesses are excluded; these account for just 1MB of the total 7.5GB. Sites are sorted by density (highest to lowest) along the X-axis. The right Y-axis shows the cumulative number of accesses accounted for by each malloc call site. To aid analysis, we cluster sites by their densities. For HPCG, we define a dense call site as one with a density of at least 0.5. Using this classification, a small fraction of call sites, 13% (18), are dense, but these sites comprise 60% of the total accesses. Therefore, the majority of accesses can be captured by putting just a few call sites in near memory. Additionally, the set of dense sites is nearly identical across iteration samples (not shown). Therefore, we expect that static allocation without migration will be sufficient.
Figure 2 shows the cumulative sum of the allocation sizes (left Y-axis) for the sorted call sites (X-axis). Again, the right Y-axis shows cumulative accesses. Ideally, the densest sites would account for a small fraction of total memory so that those sites could always reside in near memory regardless of the memory's size. Fortunately, we see that the 17 densest sites account for 400MB (5%) of the allocated memory, while the 18th contributes 2GB. Because the set of dense allocations is small in terms of size but large in terms of accesses, and is stable across loop iterations, we argue that HPCG is amenable to static, application-driven management of near memory.
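The density metric and dense/non-dense classification used here can be sketched as follows. The function name `rank_call_sites` and the per-site numbers are hypothetical and chosen only for illustration; they are not HPCG measurements. Only the 0.5 threshold is taken from the text.

```python
def rank_call_sites(bytes_by_site, accesses_by_site, dense_threshold):
    """Sort call sites by access density and attach cumulative totals."""
    rows = []
    for site, nbytes in bytes_by_site.items():
        accesses = accesses_by_site.get(site, 0)
        if accesses == 0 or nbytes == 0:
            continue  # call sites with no accesses are excluded
        rows.append({
            "site": site,
            "bytes": nbytes,
            "accesses": accesses,
            "density": accesses / nbytes,  # accesses per byte allocated
        })
    rows.sort(key=lambda r: r["density"], reverse=True)

    total_accesses = sum(r["accesses"] for r in rows)
    cum_accesses = 0
    cum_bytes = 0
    for r in rows:
        cum_accesses += r["accesses"]
        cum_bytes += r["bytes"]
        r["cum_access_fraction"] = cum_accesses / total_accesses
        r["cum_bytes"] = cum_bytes
        r["dense"] = r["density"] >= dense_threshold
    return rows


# Made-up per-site totals, for illustration only.
toy_bytes = {"a": 100, "b": 1000, "c": 400, "d": 50}
toy_accesses = {"a": 200, "b": 100, "c": 300, "d": 0}
ranked = rank_call_sites(toy_bytes, toy_accesses, dense_threshold=0.5)
dense_names = [r["site"] for r in ranked if r["dense"]]
```

In this toy input, the two dense sites hold 500 of 1,500 bytes but account for five sixths of the accesses, which is the shape of curve that favors static placement in near memory.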
PENNANT
We now consider PENNANT. Because PENNANT frequently deallocates memory, it incurs 8 billion allocations totaling over 30TB, although its data footprint is just 8GB. In Figure 3, we show access density along the left Y-axis and the cumulative fraction of accesses along the right Y-axis for each malloc call site, sorted from most to least dense (X-axis). For PENNANT, we classify a dense site as having a density of at least 1.7. In contrast to HPCG, a moderate fraction of PENNANT's call sites are dense, at 49% (68). Similar to HPCG, however, the dense sites account for nearly all of the memory accesses. Fortunately, Figure 4 shows that the dense call sites (X-axis) account for less than 1% of the allocated memory (left Y-axis). Further, we repeated this analysis with a smaller data footprint of 1GB (not shown). We found that the set of dense call sites is largely stable, both throughout execution and as problem size changes. As such, we argue that, like HPCG, PENNANT is amenable to static, application-driven management of near memory.
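One way to quantify how stable the set of dense call sites is across samples or problem sizes is a set-overlap score. The sketch below uses Jaccard similarity; this metric is our assumption for illustration and is not stated to be the comparison used in the analysis. The function names and density values are hypothetical, and only the 1.7 threshold comes from the text.

```python
def dense_sites(density_by_site, threshold):
    """Return the set of call sites whose density meets the threshold."""
    return {site for site, d in density_by_site.items() if d >= threshold}


def jaccard(a, b):
    """Set overlap in [0, 1]; 1.0 means the two dense sets are identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up densities for two samples of the same run.
sample_early = {"a": 2.0, "b": 1.8, "c": 0.2}
sample_late = {"a": 2.5, "b": 1.9, "c": 1.75}

early = dense_sites(sample_early, threshold=1.7)
late = dense_sites(sample_late, threshold=1.7)
stability = jaccard(early, late)
```

A score near 1.0 across samples and footprints would support static placement, while a low score would suggest that the dense set shifts and that migration may be needed.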
MiniFE
Finally, we look at MiniFE. MiniFE's call site density is shown in Figure 5 along the left Y-axis, with sorted call sites along the X-axis and the cumulative fraction of accesses along the right Y-axis. Unlike the previous applications, which exhibited hundreds of distinct call sites, MiniFE has just 22. Dense sites are those with densities of at least 1.0. Just five sites are dense, but these account for only 22% of accesses. Thus, while there are few allocations to manage, the dense sites alone capture only a small share of the accesses. Turning to the size graph (Figure 6), one sees a very different trend from that seen earlier. In the previous applications, the "accesses" line lay to the left of the "size" line, indicating that most memory accesses were to a small fraction of allocated memory. In contrast, MiniFE's size and access curves are nearly identical, indicating that to increase accesses to near memory, one must also increase the size of the near memory. Unfortunately, this implies that to maximize performance as the data footprint grows, MiniFE would likely require dynamic migration of allocations between near and far memory. Still, the small number of call sites makes managing dynamic migration easier.
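The contrast between the two curve shapes can be illustrated by asking what fraction of accesses a fixed near-memory capacity captures under static placement. The sketch below assumes a simple greedy policy (place whole call sites in density order while they fit); the policy, the function name `captured_fraction`, and both profiles are hypothetical and are not measurements from the applications.

```python
def captured_fraction(sites, near_capacity_bytes):
    """Fraction of accesses served by near memory under greedy static placement.

    sites: list of (bytes, accesses) pairs, one per call site, with bytes > 0.
    """
    ordered = sorted(sites, key=lambda s: s[1] / s[0], reverse=True)
    total_accesses = sum(accesses for _, accesses in sites)
    used = 0
    captured = 0
    for nbytes, accesses in ordered:
        if used + nbytes <= near_capacity_bytes:  # place the site only if it fits whole
            used += nbytes
            captured += accesses
    return captured / total_accesses if total_accesses else 0.0


# Made-up profiles, each with 1000 bytes and 1000 accesses in total.
skewed = [(10, 600), (40, 300), (950, 100)]                 # accesses concentrated in few bytes
flat = [(250, 260), (250, 250), (250, 245), (250, 245)]     # accesses track size

skewed_small = captured_fraction(skewed, 50)
flat_small = captured_fraction(flat, 50)
flat_half = captured_fraction(flat, 500)
```

With the skewed profile, 5% of the capacity captures 90% of the accesses. With the flat profile, the captured share grows only about as fast as the capacity, which is the behavior that motivates dynamic migration when the footprint outgrows near memory.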
CONCLUSION
Based on the applications studied, application-driven allocation is a viable approach to managing multi-level memory. A small to moderate number of call sites are dense, implying that the identification of dense call sites would not require unreasonable effort. Further, in most cases, a small set of allocation call sites accounts for a large fraction of memory accesses and a small fraction of the total memory. As such, static allocation is sufficient for most of the applications studied. In the case where dynamic migration between near and far memory may be needed (MiniFE), the number of call sites is small enough that managing the migration in the application is likely to have low complexity. While the set of studied applications is small, we argue that application-driven allocation should be considered in addition to or in place of hardware-driven approaches. Applications often know in advance which allocations are likely to have certain properties (e.g., high density) and can ensure that these allocations always fall into near memory, providing performance predictability and stability. Our future work will include evaluating the performance impact of allocating data in near or far memory, as well as further evaluating the stability of the set of dense sites, both throughout execution and as the application footprint and memory architecture change.
