Because of their read and write asymmetries in both power consumption and latency, as well as their limited write endurance, which often requires wear-leveling techniques, NVMs require a specialized controller. The fact that future on-die memory controllers are expected to handle different memory technologies pushes future hardware towards on-DIMM controllers. In this paper, we propose an architectural model for NVM-based DIMMs with internal controllers, explore their design space, evaluate different optimizations and arrive at several architectural suggestions. Finally, we make our model publicly available and integrate it with a widely used architectural simulator.
INTRODUCTION
Scaling the cell size of DRAM is becoming more challenging [20] and emerging NVMs are being widely studied as potential replacements for DRAM. Emerging NVMs, such as Phase-Change Memory (PCM), have latencies that are comparable to DRAM (less than an order of magnitude slower) and promise high densities [8, 17, 19, 25, 30]. Many of the design parameters of NVM can impact the performance of systems, particularly at scale. Examples are the read latency from NVM cells, the write latency and its associated optimizations such as Write-Cancellation, the maximum number of concurrent writes (typically limited by the power budget), the maximum number of outstanding requests, the number of banks, the number of channels, internal caching and scheduling. In this paper, we propose and describe in detail an architectural simulation model for NVM-based DIMMs. We use our model to evaluate the impact of different optimizations and parameters on the performance of memory systems built from NVM-based DIMMs. We integrate our model with the Structural Simulation Toolkit (SST) simulator [26], and make it publicly available. This paper targets memory-intensive HPC applications. Our analysis shows that most of the applications are very sensitive to the read and write latencies associated with NVM modules. We also observe that using large row buffers does not help for most of the studied applications. The maximum number of allowed concurrent write operations, which is limited by the power budget, can significantly affect the performance of several applications.
We also examine state-of-the-art write latency mitigation techniques, such as Write-Cancellation [23], including their sensitivity to write latency and their impact on performance. Additionally, we study the potential performance impact of on-DIMM caching and compare it with a DRAM-only system. Finally, we study the cost-performance trade-offs for alternative design options, such as a hybrid NVM/DRAM system with different insertion policies.
The rest of the paper is organized as follows. First, in Section 2, we discuss the key characteristics of emerging NVM technologies.
In Section 3, we describe an open-source architectural simulation model for NVM-based DIMMs, its key parameters and the rationale behind each of them. In Section 4, we describe our evaluation methodology, including the default parameters. Section 5 includes a detailed analysis of the performance sensitivity for different NVM parameters. Section 6 discusses the conclusions from Section 5 and proposes several architectural optimizations. Section 7 discusses the related work. Finally, in Section 8, we conclude our paper, with suggestions for future work.
BACKGROUND
In this section, we discuss emerging non-volatile memory technologies, their key characteristics and design options.
Emerging Non-Volatile Memory (NVM) Technologies
Emerging In this paper, we adopt the latter design approach, which we expect to be the most dominant for future memory systems. Our expectation is based on designs appearing in industrial patents and current prototypes [4, 6]. Figure 1 shows an example design of near-memory controllers. The internal controller typically includes optimizations for writes, wear-leveling, internal caching and buffering, power management and scheduling. One promising technique that has been shown to improve performance is Write-Cancellation [23]. Write-Cancellation cancels pending write operations in order to service read operations, which are usually on the critical path, so that reads do not wait behind long writes. For wear-leveling, the Start-Gap technique is expected to be used [24]. PCM write operations draw high power, which necessitates a power management scheme that limits the number of concurrent NVM device writes to avoid exceeding the power budget of the DIMM.
THE MESSIER NVM MODEL
In this section, we describe our NVM software model, which we use to conduct our experiments and base our analysis upon.
Our NVM model assumes a DIMM that consists of one or more ranks, where each rank consists of several banks. Each bank has a fast row buffer that caches the most recently read row. Note that, since PCM writes must persist, writes bypass the row buffer and go directly to the NVM cells. The DIMM has an internal memory controller that tracks outstanding requests, schedules requests and implements several optimizations, such as Write-Cancellation and caching. If the request is a write operation, once scheduled, it will be written to the write buffer, which is guaranteed to be persistent. Persistency can be ensured either by using a small capacitor or by using fast NVM memory such as To better describe the parameters, we will go through a read request scenario. Once a read request is received, it is placed in the transaction queue, where it waits until being scheduled. The scheduler dispatches the request only if the corresponding bank is free and the rank's internal circuitry/bus/channel is free. On dispatch, it checks whether the request is a row buffer hit or miss, and accordingly issues a command to the corresponding NVM bank.
Sending the command occupies the rank bus for tCMD cycles. Once the command is received by the bank, if row activation is required (as in the case of a row buffer miss), it will take tRCD cycles to load the data into the row buffer. After the row activation, or immediately in the case of a row buffer hit, the scheduler sends another command to read the data from the bank, which can only happen when the rank bus is free. Once this command is received by the bank, it takes tCL cycles to read a column and tBURST cycles to transfer the data over the bus. The request is then buffered in the ready_trans buffer before the data is sent back to the processor and the on-die memory controller is notified of the request completion.
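The read path described above can be summarized as a small timing calculation. The following is a minimal sketch, not the model's actual implementation: it ignores queueing and bus contention, and the cycle counts in `Timing` are illustrative placeholders rather than the defaults of Table 1. The names `Timing` and `read_service_cycles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    # Illustrative cycle counts only (assumptions, not the paper's defaults).
    tCMD: int = 1      # cycles the rank bus is busy sending a command
    tRCD: int = 300    # cycles to load a row into the row buffer
    tCL: int = 70      # cycles to read a column from the row buffer
    tBURST: int = 7    # cycles to transfer the data over the bus


def read_service_cycles(t: Timing, row_buffer_hit: bool) -> int:
    """Cycles from dispatch to the end of the data burst, without queueing."""
    cycles = 0
    if not row_buffer_hit:
        # Row buffer miss: activation command plus row load.
        cycles += t.tCMD + t.tRCD
    # Read command, column access, then the data burst.
    cycles += t.tCMD + t.tCL + t.tBURST
    return cycles


if __name__ == "__main__":
    t = Timing()
    print("hit :", read_service_cycles(t, row_buffer_hit=True))
    print("miss:", read_service_cycles(t, row_buffer_hit=False))
```

Under these assumed values, a miss costs an extra tCMD + tRCD cycles over a hit, which is why the read latency experiments vary tRCD.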
In the case of a write operation, once the request is scheduled, the data is written to the persistent write buffer and the on-die memory controller is immediately notified of the request completion.
Note that a write is scheduled only when the write buffer is not full. To avoid throttling the system as the write buffer approaches capacity, a flushing mechanism is deployed. Typically, a threshold value determines when to start flushing the write buffer, which is done by prioritizing the eviction of write entries over servicing new requests. The maximum number of concurrent writes is limited by the max_writes parameter, which can be set based on the power budget and thermal limitations. When a write is evicted, it occupies the bank for tCL_W cycles, while the rank bus is occupied for the time needed to send the data and the write command, tBURST and tCMD, respectively.
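The interaction between the write buffer, the flush threshold and max_writes can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulator's code: the class `WriteBuffer`, its method names and the FIFO eviction order are hypothetical, and bank and bus occupancy (tCL_W, tBURST, tCMD) are not modeled.

```python
from collections import deque


class WriteBuffer:
    """Toy persistent write buffer with a flush threshold (assumed FIFO order)."""

    def __init__(self, capacity: int, flush_threshold: int, max_writes: int):
        self.capacity = capacity
        self.flush_threshold = flush_threshold
        self.max_writes = max_writes      # cap on concurrent NVM writes
        self.entries = deque()
        self.in_flight = 0

    def push(self, addr: int) -> bool:
        """Accept a write only if the buffer is not full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(addr)
        return True

    def should_flush(self) -> bool:
        """True once occupancy reaches the flush threshold."""
        return len(self.entries) >= self.flush_threshold

    def issue_writes(self) -> list:
        """Evict entries to the NVM banks without exceeding max_writes."""
        issued = []
        while self.entries and self.in_flight < self.max_writes:
            issued.append(self.entries.popleft())
            self.in_flight += 1
        return issued

    def complete_write(self) -> None:
        """Called when an NVM write finishes, freeing one write slot."""
        if self.in_flight > 0:
            self.in_flight -= 1
```

A scheduler built on this sketch would call `should_flush()` each cycle and, when it returns true, prefer `issue_writes()` over dispatching new requests.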
METHODOLOGY
We use the Structural Simulation Toolkit (SST) [26] with the Messier NVM-DIMM model [9] to conduct our experiments and analysis.
SST was configured to model 8 cores with private L1 and L2 caches. The L3 cache is shared across all cores and partitioned into eight banks. Our default simulation parameters are shown in Table 1.
The simulation source code for these experiments is available on the SST repository 1 . The simulation infrastructure allows us to model the entire cache hierarchy, the coherency protocol, and memory latency. SST's memHierarchy, Merlin, and Ariel components were used for the caches, on-chip network, and processors respectively.
Since our focus is HPC applications, we use five mini-apps from the U.S. Department of Energy (U.S. DoE): miniFE [14], an unstructured implicit finite element code; Lulesh [15, 16], a hydrodynamics code; Pennant [11], an unstructured mesh physics mini-app; SimpleMoC [12], a mini-app to study the Method of Characteristics (MoC) for 3D neutron transport calculations; and XSBench [29], a mini-app that represents a key computational kernel for the Monte Carlo
1 https://github.com/sstsimulator/sst-elements/tree/ ADVANCED_MESSIER
Additional parameters are listed in Table 2.
These applications were selected because they are memory-intensive and exhibit a diverse set of main memory access patterns.
DESIGN SPACE EXPLORATION
In this section, we investigate the impact of several state-of-the-art optimizations and how varying several NVM parameters can affect the performance.
The Impact of Write Latency and Write Cancellation
Write latency is considered to be one of the key challenges for using emerging NVMs as main memory. We therefore begin our design exploration by studying performance sensitivity to the write latency of NVM devices. Figure 3 shows the impact of write latency on performance.
We vary the write latency, tCL_W, from 100 to 1000 cycles in 100-cycle increments. As expected, for most of the applications, the performance decreases as the write latency increases. The exception here is XSBench, which does not have many writes.
One way to combat the impact that write latency can have on application performance is to use Write-Cancellation, as described in Section 2.2. From Figure 3, we can observe that at low write latencies, Write-Cancellation can degrade performance. This is because it can increase the average number of cycles that a bank is allocated for a write operation without actually decreasing the read latency as intended. The implementation of Write-Cancellation for this study uses adaptive thresholds [23].
This adaptive technique uses the elapsed time since the beginning of the write as well as the current number of entries in the write buffer to determine whether or not to cancel the write. The rationale behind this implementation is to achieve a good balance between not aggregating too many writes while still using write cancellation effectively.
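To make the two inputs of the adaptive decision concrete, the following sketch shows one possible cancellation rule. It is an assumption-laden illustration, not the rule from [23] or from the simulator: the function name, the linear threshold and the `base_threshold` default are all hypothetical. It only captures the stated intent that a write is less likely to be cancelled the further it has progressed and the fuller the write buffer is.

```python
def should_cancel_write(elapsed_cycles: int,
                        write_latency: int,
                        buffer_entries: int,
                        buffer_capacity: int,
                        base_threshold: float = 0.75) -> bool:
    """Decide whether an in-progress write should yield to a pending read.

    Hypothetical rule: cancel only if the fraction of the write already
    completed is below a threshold that shrinks as the write buffer fills.
    """
    progress = elapsed_cycles / write_latency        # 0.0 = just started
    occupancy = buffer_entries / buffer_capacity     # 0.0 = empty buffer
    threshold = base_threshold * (1.0 - occupancy)   # fuller buffer, fewer cancels
    return progress < threshold
```

With this rule, a nearly finished write is never cancelled, and a full write buffer disables cancellation entirely, which avoids accumulating too many pending writes.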
Power Constraints for Concurrent Writes
NVM write operations involve applying a high current to change a cell state. However, due to cooling constraints and thermal limits, a maximum power budget is given for each DIMM. Accordingly, to abide by that power budget, each DIMM should limit the maximum number of concurrent writes to the NVM banks. Limiting this number to only a few concurrent writes will increase the chances of filling the write buffer, placing back pressure on the memory system. In contrast, allowing a large number of concurrent writes may cause the system to exceed its given power budget. Figure 4 shows how the number of concurrent writes affects the overall execution time of selected applications. From the figure, we can observe that some applications, such as Pennant and Lulesh, are very sensitive to this parameter. This is consistent with our findings from Section 5.1, where we observed similar sensitivity to the write latency. On the other hand, we can observe that some applications, such as XSBench, have negligible sensitivity to the number of concurrent writes due to the read/write patterns inherent in the application.
NVM Read Latency
While NVM read latency is much better than that for writes, it is still slower than that of DRAM. To study the effect this has on application performance, we vary the NVM read latency, i.e., tRCD, and observe the change in the execution time, as shown in Figure 5.
We can observe that some applications are highly sensitive to read latency, while others are not. Specifically, we can observe that Lulesh and Pennant are minimally affected by increasing the read latency, which can be explained by our observations in Section 5.1: Lulesh and Pennant performance is heavily dominated by the write latency.
Row Buffers Locality
As the read latency can have significant impact on the performance of some applications, we now explore a way to mitigate it. A common way to mitigate high read latency is through row buffers, which cache the row of the most recently accessed cache line in a bank. To study the effectiveness of this technique, we vary the row buffer size from 64B to 8KiB, as shown in Figure 6 .
We can observe that applications like XSBench and MiniFE benefit from increasing the row buffer size. However, some applications, e.g., SimpleMoC and Pennant, are less sensitive to row buffer size.
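The dependence of row buffer effectiveness on access locality can be illustrated with a small hit-rate calculation. This is a sketch under simplifying assumptions, not the simulator's address mapping: rows are interleaved across banks by row index, each bank keeps exactly one open row, and writes are ignored. The function name `row_buffer_hit_rate` is hypothetical.

```python
def row_buffer_hit_rate(addresses, row_size: int, num_banks: int = 1) -> float:
    """Fraction of accesses that hit the currently open row of their bank."""
    if not addresses:
        return 0.0
    open_rows = {}                        # bank index -> currently open row
    hits = 0
    for addr in addresses:
        row = addr // row_size            # row containing this address
        bank = row % num_banks            # assumed row-interleaved mapping
        if open_rows.get(bank) == row:
            hits += 1
        else:
            open_rows[bank] = row         # miss: activate the new row
    return hits / len(addresses)


if __name__ == "__main__":
    sequential = [i * 64 for i in range(64)]      # 64 B stride, 4 KiB span
    scattered = [i * 8192 for i in range(64)]     # one access per 8 KiB
    for size in (64, 1024, 8192):
        print(size,
              row_buffer_hit_rate(sequential, size),
              row_buffer_hit_rate(scattered, size))
```

In this toy trace the sequential stream gains hits as the row grows, while the scattered stream never hits regardless of row size, mirroring the split between sensitive and insensitive applications.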
The Impact of Internal Caching on Performance
One way to improve NVM performance is through caching blocks internally. To maintain the persistence feature of NVMs, we use read-only caches, where any write request immediately invalidates the cached copy once received. This internal cache is checked in parallel with adding the request to the transaction queue. If the block is found, i.e., a cache hit, the block is returned from the cache and the pending request is squashed. As the NVM does not incur significant idle power, the total power of an NVM DIMM with an added SRAM or DRAM cache can still be comparable to that of DRAM-only systems. To study the performance gains of caching, we model an internal cache inside each DIMM with an access latency of 15 cycles.
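The read-only, invalidate-on-write behavior can be sketched as below. This is an illustration under assumptions rather than the model's implementation: the class name `ReadOnlyCache`, the LRU replacement policy, the 64-byte block size and the choice to fill the cache on a read miss are all hypothetical, since the text above does not specify them.

```python
from collections import OrderedDict


class ReadOnlyCache:
    """Toy on-DIMM read-only block cache (assumed LRU, fill on read miss)."""

    def __init__(self, num_blocks: int, block_size: int = 64):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()       # block number -> True, in LRU order

    def read(self, addr: int) -> bool:
        """Return True on a hit; on a miss, install the block."""
        block = addr // self.block_size
        if block in self.blocks:
            self.blocks.move_to_end(block)        # mark as most recently used
            return True
        self.blocks[block] = True
        if len(self.blocks) > self.num_blocks:
            self.blocks.popitem(last=False)       # evict least recently used
        return False

    def write(self, addr: int) -> None:
        """Writes never update the cache; they only invalidate the copy."""
        self.blocks.pop(addr // self.block_size, None)
```

Because writes only invalidate, the cache never holds data newer than the persistent write buffer or the NVM cells, which is what preserves persistence.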
Figure 7: The impact of internal caching on performance.
Figure 7 shows the results for using caches and compares them with no-cache NVM and DRAM-only systems. We can observe that most applications benefit from using a cache as small as 4MiB. However, even with large caches, the NVM performance is far worse than that of a DRAM-only system.
Paged Multi-Level Memory
Another mechanism to improve NVM performance is a multi-level memory (MLM). In this organization (see Figure 8), main memory is composed of both NVM and DRAM. Memory is accessed through a controller that can implement a number of policies to determine which data is placed in the fast, stacked DRAM and which in the slower NVM-based DIMMs. An SRAM table within the controller contains the mapping of which pages are in which memory, along with additional meta-information (e.g., page access frequency) used to implement its paging policy. For this study, we use an optimized NVM-based DIMM that implements the Write-Cancellation technique.
There are several possible policies for MLM management [13] which govern which pages are removed from fast memory and which are added. For this work, we tested 2 the addMFRPU (More performance on XSBench and Lulesh, however its performance was no more than 1-3% better than the simple addT policy. More significant was the threshold level. The threshold level defines a minimum number of accesses to a page before the page is considered for addition to the fast memory. We tested two thresholds (2 and 16) and found that different applications benefit from different thresholds. Figure 9 summarizes the results of different policies for an MLM system with roughly 1/4 of the memory as fast DRAM and the remainder NVM, using the addMFRPU policy. We varied both the threshold and the presence of a 16MB persistent cache in the NVM. The results are very application dependent. XSBench did better with a high threshold, while Lulesh, MiniFE, and Pennant did worse with a high threshold and no cache, but preferred a high threshold if there was a cache. Generally, an NVRAM cache did not help the performance of an MLM system, as would be expected since the page-level caching of the MLM system would interfere with the block-level cache of the NVM cache. However, the best XSBench performance was achieved with both page-level MLM caching and the NVM cache. In general, performance was less than that of DRAM, though SimpleMoC performance was as good or better.
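The role of the threshold in page admission can be sketched as follows. This is a hypothetical illustration, not the addT or addMFRPU policy from [13]: the class `ThresholdPager`, the 4 KiB page size and the choice to evict the resident page with the fewest recorded accesses are assumptions made only to keep the example self-contained.

```python
from collections import Counter


class ThresholdPager:
    """Toy threshold-based admission of pages into fast memory."""

    def __init__(self, fast_pages: int, threshold: int, page_size: int = 4096):
        self.fast_pages = fast_pages      # capacity of fast memory, in pages
        self.threshold = threshold        # accesses needed before admission
        self.page_size = page_size
        self.counts = Counter()           # per-page access counts
        self.fast = set()                 # pages currently in fast memory

    def access(self, addr: int) -> str:
        """Return which memory serviced the access: 'fast' or 'slow'."""
        page = addr // self.page_size
        self.counts[page] += 1
        if page in self.fast:
            return "fast"
        if self.fast_pages > 0 and self.counts[page] >= self.threshold:
            if len(self.fast) >= self.fast_pages:
                # Assumed replacement: evict the least-accessed resident page.
                victim = min(self.fast, key=lambda p: self.counts[p])
                self.fast.remove(victim)
            self.fast.add(page)           # later accesses are served fast
        return "slow"
```

A low threshold admits pages quickly but risks moving pages that are touched only briefly, while a high threshold delays the benefit for genuinely hot pages, which is consistent with different applications preferring different thresholds.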
We also examined the impact of the amount of fast stacked DRAM on application performance (Figure 10). In all cases, total main memory was 1GB. Lulesh, MiniFE, and SimpleMoC were largely insensitive to the size of the "fast" memory. XSBench and Pennant were very sensitive, with XSBench more than doubling its performance.
Overall, an MLM organization shows promise in improving the performance of an NVM system; however, in most cases the raw performance is still inferior to conventional DDR DRAM. can incur significant overhead. While our results raise a warning for using emerging NVMs as the sole building block of the main memory, they help to make a case for architectures with multi-level memory, where NVMs can be used as an extension to memory capacity [7]. Additionally, we found that internal caching within NVM-based DIMMs is of limited use, which makes the case for software-managed caching of hot pages. For energy efficiency, we found that some applications do not benefit as much from large row buffer sizes, which motivates row buffers that can be dynamically enabled or disabled, or whose size is adjustable.
Cost & Performance
As shown above, even with caching, NVM main memories generally have lower performance than conventional DRAM memories. However, the value proposition of NVM is not raw performance but its potential cost and power savings. Current and emerging NVM technologies have storage densities much higher than conventional DRAM cells, which will lead to significant cost savings. An NVM main memory may not offer higher performance, but with its much lower cost it may still be a valuable architectural alternative.
To test this, we propose a simple cost model (Table 3) 
RELATED WORK
NVMain, a detailed NVM simulator, has been proposed to provide an architectural model to simulate emerging NVM devices [22]. Our proposed NVM model is more specific in that it aims to capture the form factor of NVM-based DIMMs with a high-end internal controller. We expect this model to resemble a large portion of future NVM products. Additionally, our model is integrated in a widely used architectural simulator, SST. A hardware prototype to emulate PCM-based storage arrays has been proposed in [6]. In contrast, our work targets PCM-based DIMMs.
Many previous studies have explored using Multi-level Memory that consists of DRAM and NVRAM [7, 28] and demand paging insertion policy for DRAM, occasionally using dynamic policies [21] .
Some others explored hybrid policies to minimize energy [27] . Our work and study of the Multi-level Memory design option focuses more on cost and performance, aiming for a better understanding of the trade-offs when considering design options.
CONCLUSIONS
In this paper, we propose a new architectural simulation model for NVM-based DIMMs and explore implementation options. This model is highly parameterized and provides fast execution performance to permit scaling for long-duration simulations or complex application modeling. We used our model to explore performance sensitivity to key NVM parameters such as read latency, write latency, number of concurrent writes, row buffer size, internal caching, the effectiveness of the Write-Cancellation technique, and integration with paged multi-level memory systems.
Our study showed that the studied HPC applications vary in their performance sensitivity to read and write latency. For instance, we found that MiniFE and SimpleMoC are very sensitive to read latency but less sensitive to write latency than Pennant and Lulesh. We also studied how limiting the number of concurrent writes can affect the performance. Our analysis also shows the potential gains from increasing the row buffer sizes and augmenting DIMMs with internal caching. We publish our infrastructure integrated in a widely used simulator (SST), to enable community researchers to investigate designs such as hybrid memory systems, performance optimizations and write latency mitigation optimizations.
These experiments show that NVM-DIMM based systems generally have lower performance than conventional DDR4 systems.
However, when analyzed with memory system cost in mind, main memory with NVM becomes more attractive. There are a number of configurations which provide a better performance / cost tradeoff than conventional DDR-based main memory. These results can be used to guide future NVM-DIMM implementations and can be used by system architects to select a more efficient memory system.
For future work, we plan to incorporate models for Multi-Level Cell (MLC) technologies and model the impact of read latency asymmetry for different levels in cells. We also plan to investigate the impact of wear-leveling techniques, such as Start-Gap, on the lifetime of the system. Additional MLM policies can be crafted for NV memory, particularly policies that account for the difference between read and write latency. The impact of power and energy on total system cost will also be explored.
