ABSTRACT Emerging non-volatile memory (NVM) requires refactoring of the hardware and software stacks used on current computer systems. Researchers typically rely on simulators to test their innovations. Unfortunately, running a simulation requires orders of magnitude more time than performing a native run, and most simulation platforms are difficult to modify or debug. In this paper, we use emulation to reduce the substantial simulation overhead and propose an extensible lightweight emulation framework called LEEF. Unlike previous NVM emulation implementations, which rely on specific hardware and use simple performance models, LEEF is built on a detailed performance model implemented through performance monitoring events that can be found on most commodity processors. LEEF also exposes a real-system memory trace generation interface for trace-based memory simulators. Using the traces, simulation results can be analyzed and integrated into future LEEF emulations. The results of experiments show that LEEF is more accurate than prior emulation approaches. We also present two case studies of recent microarchitectural innovations simulated on LEEF. To the best of our knowledge, this is the first work that combines simulation with memory emulation.
I. INTRODUCTION
Emerging memory technologies differ from DRAM in latency, bandwidth and power consumption, which positions them as promising substitute candidates for main memory. Systems researchers are working on building customized file systems, storage stacks, and memory management strategies to make better use of these new memory technologies [1].
However, the process of designing new memory subsystems currently relies heavily on simulators, and most studies on emerging memory technologies to date have been performed using them. Simulators are built from the bottom up. Researchers use device-level simulators such as CACTI [2] and McPAT [2] to model the dynamic power requirements, access times, area, and power leakage of memory technologies and then use the measured parameters to build simulators that model DIMM memory, such as DRAMsim [3]. Finally, the DIMM memory simulators are integrated into full-system simulators such as gem5 [4] and MARSS [5] that provide the ability to simultaneously optimize both software and hardware capabilities. Although full-system simulators support whole-system redesign, they suffer from slow simulation speeds. Simulating one core for one second requires approximately 8 hours on simulators working at 200 KIPS [6], a speed that is common to full-system simulators such as gem5, Flexus and MARSS. Moreover, full-system simulators often include a detailed model for hardware, making them difficult to understand or extend. Hence, researchers are looking for new platforms on which to perform experiments. Recently researchers have been using emulation methods to develop system software [7], [8].
The implication of NVMs is that they will require refactoring the virtual memory and file system of operating systems [8] - [11] . Architectural innovations are also required to bring NVM into the current hardware-software stack. Unfortunately, optimization at both the software and hardware levels requires using a full-system simulator, which inhibits NVM innovation.
Recently, researchers have begun to investigate emulation methods for NVM research. However, the existing emulation approaches either rely on specific processor models or omit important features of NVM. PMEP is designed for customizing file systems on NVM [8]; however, it requires exposure of the bandwidth throttling register, which operates only on specific BIOS and motherboard combinations. In this work, we propose a novel NVM emulation platform called LEEF. LEEF differs from existing emulation platforms in two respects. First, it is built through a deep survey of state-of-the-art memory performance models on real systems. Because LEEF implements the most accurate of these models, it can heuristically select the most accurate one during NVM emulation. Second, it provides an interface for integrating the simulation results with other emulations, forming a promising substitute for full-system simulations for software-hardware co-optimization.
Our contributions in this paper are threefold.
• First, by implementing popular memory performance models, we discovered that no single performance model is the most accurate for all applications. Therefore, we propose an online algorithm to select the best performance model in LEEF.
• Second, LEEF can emulate a wide range of NVM latency and bandwidth characteristics using the detailed memory performance model we propose. The implementation of LEEF is based on hardware performance counters and is therefore lightweight, with trivial overhead. In addition, the persistence feature of NVM is emulated without the need to modify application source code.
• Third, LEEF supports fast software-hardware co-optimization research by exposing a real-system memory interface to simulators, which is orders of magnitude faster than current full-system simulation. To the best of our knowledge, this is the first work that combines simulation with memory emulation.
The remainder of this paper is organized as follows. Section II presents background on emerging memory experimentation platforms and the organization of current memory systems. Section III describes the architecture and implementation of LEEF. Section IV discusses the performance model used in LEEF emulation. Section V evaluates the accuracy of the LEEF model's performance and presents the results of experiments using LEEF. Section VI presents related works, and Section VII concludes our work.
II. BACKGROUND
A. EMERGING MEMORY EXPERIMENTAL PLATFORMS
Researchers use both simulation and emulation when experimenting with NVM. We first introduce simulation platforms, which can be categorized as follows:
1. Full-system simulators simulate an entire computer system, which can run an OS on the simulated computer. System researchers can modify the OS or any hardware component. However, full-system simulators are extremely detailed. Consider gem5 as an example. This simulator executes at 200 KIPS when using the detailed core and memory model, which is orders of magnitude slower than native execution speeds. In addition, full-system simulators emulate all peripheral devices to support the OS. To support research into NVMs, trace-based memory simulators are integrated into full-system simulators [4] , [5] , making them even more costly in time and difficult to extend.
2. User-level simulators are driven by benchmark applications. For example, the system-call emulation mode of gem5 and all the dynamic binary translation (DBT) simulators are application-driven. The OS is not the benchmarking target of user-level simulators. For example, Zsim [6] only requires a user to configure the benchmark application and the architectural model used during simulation. Although some user-level simulators run at speeds from 1 to 10 MIPS, they typically run at several hundred KIPS and are still considerably slower than native execution. Moreover, because fast user-level simulators are built using DBT, they are difficult to debug or extend.
3. Trace-based simulators are driven by memory traces. A memory trace is a memory request stream in which each request is represented by a tuple composed of an address, a timestamp and a read or write request label. For example, Nvmain [12] and DRAMsim [3] are trace-based simulators. Trace-based simulators parse each memory trace record and insert it into a request queue. Then they dispatch the requests through the architectural model to calculate the data of interest. Trace-based simulators are organized in a layered structure, which makes them easier to understand and modify. However, they are limited by the traces they use; to acquire the memory traces, trace-based simulators are integrated with full-system or user-level simulators.
Next, we discuss emulation approaches for building an NVM experimental platform. There are three key components to emulating NVM: latency, bandwidth and persistence. To our knowledge, only two prior works have attempted to build an NVM emulation platform. The first is PMEP [8], which uses hardware debug hooks to build the emulation system. Only the read latency is emulated, based on a simple proportional equation; write latency is not emulated. PMEP uses bandwidth throttling in the memory controller. To emulate persistence, PMEP uses software to inject delays into the running process.
The second is NVMpro, built by Sengupta et al. [13]. NVMpro uses a per-channel configuration register to throttle DRAM bandwidth, which requires both a specific processor and a tuned BIOS. However, NVMpro does not emulate the persistence feature of NVM. FPGA-based emulation can also be used for NVM emulation, in which the access latency is injected at the memory controller level [14]. This method is intuitive but relies on a separate experiment board.
B. MEMORY ORGANIZATION
This section introduces current memory organizations.
Conceptually, a memory system consists of multiple channels, each of which is divided into multiple ranks, and each rank is further divided into multiple banks. Each bank contains a fixed number of rows and a row buffer. For example, Figure 1 shows a machine with 2 channels. Each channel includes 4 ranks, and each rank consists of 8 banks, each of which is divided into n+1 rows and a row buffer.
VOLUME 5, 2017
FIGURE 1. Memory organization in a conceptual two-channel memory subsystem.
When a request is forwarded from the processor to the integrated memory controller (iMC), the iMC first checks the timing constraint of the target channel/rank/bank corresponding to the requested address. If the constraint is satisfied, it sends the respective memory command to the target channel/rank/bank. Abstractly, four basic steps are involved when accessing the current memory system. The first step is precharge (PRE), which drives any data in the row buffer back to its original row and disconnects the row buffer from the row; this requires tRP time. The second step is row access (RA), which connects the row buffer to the requested row and requires tRCD time. The third step is column addressing (CA), which locates the desired data in the row buffer and requires tCAS time. Finally, the required data is sent back to the iMC, which requires tBurst time.
However, for emerging memory, because the read operation for each cell is no longer destructive, the precharge step is not required; therefore, accessing an NVM cell can be considered as only three steps: RA, CA and data transfer.
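The step counts above can be sketched as a sum of per-step timings. This is a minimal illustration, not the LEEF implementation: the `Timing` class, the function names and all numeric values are hypothetical, and real NVM timing parameters differ from DRAM ones. The sketch only shows which terms enter each sum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """Per-step timing in nanoseconds; the values used below are hypothetical."""
    t_rp: float     # precharge (PRE)
    t_rcd: float    # row access (RA)
    t_cas: float    # column addressing (CA)
    t_burst: float  # data transfer back to the iMC


def dram_access_ns(t: Timing) -> float:
    """Sum all four steps: PRE, RA, CA and data transfer."""
    return t.t_rp + t.t_rcd + t.t_cas + t.t_burst


def nvm_access_ns(t: Timing) -> float:
    """Reads are non-destructive, so PRE is skipped and t_rp is ignored."""
    return t.t_rcd + t.t_cas + t.t_burst


# Hypothetical parameter sets, chosen only to make the arithmetic easy to follow.
dram = Timing(t_rp=15.0, t_rcd=15.0, t_cas=15.0, t_burst=5.0)
nvm = Timing(t_rp=0.0, t_rcd=60.0, t_cas=15.0, t_burst=5.0)
print(dram_access_ns(dram))  # 50.0
print(nvm_access_ns(nvm))    # 80.0
```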
III. SYSTEM DESIGN
This section describes details of the LEEF design and then discusses the implementation overhead. As a brief overview, to provide flexibility in emulating emerging memory, LEEF supports two modes: local mode and simulation mode. Local mode is intended to monitor the performance gap between native executions using DRAM and target simulations using NVM and then to try to bridge the gap, providing an illusion that the application is running entirely on NVM. Simulation mode is aimed at providing a way to integrate memory microarchitecture modifications. By analyzing and understanding the influence of memory organization changes on application performance, we can quantify and reflect the changes in the performance model. Then, we can use the refreshed performance model to emulate NVM.
Next, we introduce the implementation of the two modes mentioned above. For local mode, the PMU is used to monitor selected events. We divide the application runtime into equal-length epochs. At the end of each epoch, we calculate and fill the performance gap using ptrace. Simulation mode is based on local mode. However, before explaining the simulation mode implementation, we must first present the LEEF architecture.
A. ARCHITECTURAL OVERVIEW
LEEF consists of a trace generator, a performance monitor, and the LEEF runtime. LEEF also includes a trace-based simulator whose input is the memory trace output from the trace generator. LEEF is an online lightweight emulation framework for emerging memory. LEEF monitors an application running on the host's memory subsystem and predicts its performance on the target memory subsystem. To support architectural innovation emulation on the memory subsystem, LEEF provides a real-system memory trace generator that drives the simulation.
1) TRACE GENERATOR
The trace generator is a kernel-level logger that extends Badgertrap [15]. Badgertrap marks the page table entry of a process's physical page frame as reserved; consequently, when the hardware performs a page table walk, it is trapped by the kernel fault handler. To generate the memory trace, we walk the page table and record the corresponding physical address in the OS fault handler, and then use the PMU to monitor the read and write ratio. Finally, we mark each physical address with a read or write label according to the monitored ratio and output the labeled addresses as the memory trace.
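The labeling step can be illustrated with a short user-space sketch. The text does not say how the monitored ratio is turned into per-address labels, so the sketch assumes each address is labeled at random with probability equal to the monitored read fraction. The function name `label_trace`, the record layout and the seed parameter are hypothetical.

```python
import random
from typing import Iterable, List, Tuple

# One trace record: (physical address, timestamp, "R" or "W").
TraceRecord = Tuple[int, int, str]


def label_trace(faults: Iterable[Tuple[int, int]],
                read_fraction: float,
                seed: int = 0) -> List[TraceRecord]:
    """Attach a read/write label to each (address, timestamp) pair.

    Assumption: labels are drawn independently so that, on average,
    the share of "R" labels matches the monitored read fraction.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    rng = random.Random(seed)
    trace = []
    for address, timestamp in faults:
        label = "R" if rng.random() < read_fraction else "W"
        trace.append((address, timestamp, label))
    return trace


# Hypothetical fault-handler output: physical addresses with timestamps.
recorded = [(0x1000, 10), (0x2040, 25), (0x1000, 31), (0x3F80, 47)]
for record in label_trace(recorded, read_fraction=0.75):
    print(record)
```

A fixed seed keeps the generated trace reproducible across runs, which matters when the same trace is fed to several simulator configurations.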
2) SIMULATOR
Here, ''simulator'' denotes a trace-based memory simulator such as DRAMsim [3] , Nvmain [12] , or Ramulator [16] . Using a trace-based simulator, a researcher can reorder the request queue using various strategies or alter the organization of a memory module. A simulator is aimed at evaluating the impact of microarchitectural innovations. Traces generated by the trace generator function as the input for trace-based simulators where the simulation results can be analyzed. Although many simulation results can be considered for emulation, we chose the change ratios of β and RPF because the change of locality and memory level parallelism is the key to better performance. A trace-based simulator requires only slight modifications to be able to collect and calculate these change ratios. Finally, these indicators are input to the LEEF runtime.
3) PERFORMANCE MONITOR
A performance monitor is used to monitor memory requests during application runtime. Here, we use the performance counters on an Intel server. Both in-core and uncore MSRs are used to profile monitoring events, from which we derive the memory access fraction, the read and write fractions, and the row buffer miss/hit/empty rates.
4) LEEF RUNTIME
The LEEF runtime is responsible for emulating the latency, bandwidth and persistence of NVM. Because NVM can either substitute for DRAM or function as storage-class memory, an application running on NVM can be classified as a persistent application or a non-persistent application. To support non-persistent applications, LEEF emulates the delay and bandwidth of NVM, while for persistent applications, in addition to delay and bandwidth, LEEF also emulates the persistence overhead of NVM. LEEF emulation of NVM will be discussed in more detail in Section III-B and Section III-C.
B. NVM LATENCY AND BANDWIDTH
In our pursuit of lightweight, fast and accurate performance modeling, we model only some of the features of NVM. For example, we do not consider the SET-UNSET latency imbalance of phase-change memory, although it could be emulated using an empirical probability of bit-flip frequency.
At present, emerging memory has higher access latency and lower bandwidth than DRAM. Hence, emulating emerging memory on DRAM requires delaying the CPU pipeline to emulate slower memory access or bandwidth limits. To our knowledge, there are two ways to emulate NVM latency. The first method ensures that every memory access has the same latency as NVM. The second method ensures that the average memory latency corresponds to that of NVM. The first method can be implemented using two approaches. The first approach uses the remote DRAM access delay of a NUMA machine to emulate NVM, although this approach can emulate only a fixed latency. The second approach sets the protected bit of a page frame in the page table. When a process accesses the protected page frame, it enters page fault handling, so every memory access is delayed by software. Unfortunately, we find that this approach fires CPU soft traps for memory-intensive applications. Moreover, frequent context switching into the kernel causes kernel data structures to remain locked for extended periods, causing kernel race conditions. The second method is intended to ensure the average memory latency. The core of this method is to stretch the execution time so that the number of memory requests issued during a fixed period stays below a configured threshold. Following this method, we divide the runtime of the target application into epochs of equal length. During each epoch, we profile the target application to acquire performance statistics. At the end of each epoch, we use the collected data to model the latency of NVM. Then, based on the performance gap between NVM and DRAM, we stall the application to fill the gap.
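The epoch-based method can be sketched as a loop that converts per-epoch measurements into stall times. This is only an illustration under stated assumptions: the function `epoch_stalls`, its inputs and the proportional predictor in the example are hypothetical, and a negative gap is assumed to mean that no stall is injected.

```python
from typing import Callable, Iterable, List


def epoch_stalls(dram_memory_times: Iterable[float],
                 predict_nvm_time: Callable[[float], float]) -> List[float]:
    """Return the stall to inject at the end of each epoch.

    dram_memory_times: memory time measured on DRAM in each epoch.
    predict_nvm_time:  performance model mapping the measured DRAM memory
                       time to the predicted memory time on NVM.
    """
    stalls = []
    for measured in dram_memory_times:
        gap = predict_nvm_time(measured) - measured
        # Assumption: never inject a negative delay.
        stalls.append(max(0.0, gap))
    return stalls


# Hypothetical example: NVM memory time assumed to be 3x the DRAM memory time.
measured_ms = [2.0, 0.5, 0.0]
print(epoch_stalls(measured_ms, lambda t: 3.0 * t))  # [4.0, 1.0, 0.0]
```

Passing the model as a function keeps the loop independent of which performance model produces the prediction.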
For bandwidth emulation, prior works propose using thermal control registers on the Intel XEON iMC to throttle the memory bandwidth. Unfortunately, this approach relies on a certain platform and on customizable BIOS settings. In contrast, LEEF uses a pure software approach to throttle bandwidth. During each epoch, we monitor the write bandwidth and then throttle bandwidth by injecting a delay at the end of each epoch if the monitored bandwidth exceeds the NVM bandwidth limit.
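The software throttling described above can be sketched as follows. The text states only that a delay is injected when the monitored write bandwidth exceeds the NVM limit; the sketch assumes the delay is sized so that the effective bandwidth over the stretched epoch equals that limit. The function name and units are hypothetical.

```python
def throttle_delay_s(bytes_written: int,
                     epoch_s: float,
                     nvm_bw_bytes_per_s: float) -> float:
    """Delay (seconds) to add after an epoch so that the write bandwidth,
    averaged over epoch plus delay, does not exceed the NVM limit."""
    if epoch_s <= 0 or nvm_bw_bytes_per_s <= 0:
        raise ValueError("epoch length and bandwidth limit must be positive")
    observed_bw = bytes_written / epoch_s
    if observed_bw <= nvm_bw_bytes_per_s:
        return 0.0  # already within the limit, no delay needed
    # Time the writes would have taken at the NVM limit, minus time already spent.
    return bytes_written / nvm_bw_bytes_per_s - epoch_s


# Hypothetical epoch: 2 MB written in 1 ms (2 GB/s) against a 1 GB/s limit.
delay = throttle_delay_s(2_000_000, epoch_s=0.001, nvm_bw_bytes_per_s=1e9)
print(f"{delay * 1e3:.3f} ms")
```

Because the delay is computed rather than chosen from preset steps, any target bandwidth below the DRAM bandwidth can be expressed.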
PMEP emulates NVMM bandwidth only at discrete values that are between 1 and 1/16 of the default DRAM bandwidth.
C. NVM PERSISTENCE
To maintain memory consistency, write order must be preserved in systems using NVRAM. If a pointer stored in persistent memory points to dirty data in the CPU cache, then a sudden loss of power or a hardware fault will leave a dangling pointer. In most systems that adopt write-back caching, a clflush operation must be issued before assigning a value to an NVM-resident pointer. More importantly, we must issue an mfence operation to ensure the operation order. Therefore, to provide full support for NVM persistence, a dual clflush-mfence operation is required.
Prior emulation works have often overlooked the importance of NVM persistence, typically adding a latency to emulate the NVM persistence overhead or simply ignoring it. However, we believe that persistence is a key feature of NVM that must be exposed to NVM researchers. Consequently, we execute real clflush-mfence operations to emulate NVM persistence. To avoid rewriting application code, we use ptrace to inject machine code into a running application to flush the cache line. Then, we attach to each process and execute mfence to preserve the operation order. Using this approach, we can support research on persistence overhead without having to modify or recompile applications.
Intel has introduced two instructions to provide persistence guarantees in future-generation server processors: PCOMMIT and CLWB [17]. PCOMMIT is a persistent commit instruction that ensures that data stored via a store-to-memory command are stored to a power-failure-protected region on persistent memory. CLWB writes back the specified cache line from any level of cache to memory and is ordered only by store-fencing operations. While these features are not currently available on our machine, we can extend LEEF to support these two instructions in the future.
To convert an NVM-oblivious application into an NVM-aware application, compile-time identification and source code analysis of persistence needs are required. LEEF instead uses a simple persistence model in which, using the performance monitor, every 50 write requests initiate a persistence operation. We assume that persistence is required at the end of each epoch and inject persistence instructions into the application at runtime to provide a more accurate model.
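The simple persistence model can be sketched as a counter over per-epoch write counts. The threshold of 50 writes comes from the text; carrying leftover writes into the next epoch is an assumption of this sketch, and the function name is hypothetical.

```python
from typing import Iterable, List


def persistence_ops(write_counts: Iterable[int],
                    writes_per_persist: int = 50) -> List[int]:
    """Number of persistence operations to inject at the end of each epoch.

    Assumption: writes that do not fill a group of `writes_per_persist`
    are carried over to the next epoch instead of being dropped.
    """
    if writes_per_persist <= 0:
        raise ValueError("writes_per_persist must be positive")
    pending = 0
    ops = []
    for writes in write_counts:
        pending += writes
        count, pending = divmod(pending, writes_per_persist)
        ops.append(count)
    return ops


# Hypothetical monitored write counts for three consecutive epochs.
print(persistence_ops([120, 30, 10]))  # [2, 1, 0]
```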
D. IMPLEMENTATION OVERHEAD
LEEF introduces both offline and online implementation overhead. The offline overhead is due to the trace generator. The trace generator prints memory request traces in the kernel fault handler, and because this involves a page table walk, context switching and page table operations are involved. The online overhead comes from the LEEF runtime: injecting both latency and persistence delays into a running program costs time. Note that the overhead must not overwhelm the performance gap between the target NVM and DRAM. Suppose the injection overhead is T_inject_lat and the performance gap between NVM and DRAM is T_gap. Then, at the end of each epoch, we must ensure that T_gap is larger than T_inject_lat. Moreover, we must subtract the injection overhead from T_gap so that the gap to fill becomes T_gap - T_inject_lat. In addition to latency injection, we also inject a memory persistence operation for every fixed number of writes to NVM. Assuming that we have performed persistence injection N times and that each injection requires T_inject_persist, the total gap to fill is T_gap - T_inject_lat - N * T_inject_persist.
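The overhead accounting above amounts to a single subtraction per epoch. The sketch below restates it in code; clamping the result at zero when the overhead exceeds the gap is an assumption of the sketch, and the function and parameter names are hypothetical.

```python
def gap_to_fill(t_gap: float,
                t_inject_lat: float,
                n_persist: int,
                t_inject_persist: float) -> float:
    """Remaining stall after subtracting the injection overheads:
    T_gap - T_inject_lat - N * T_inject_persist.

    Assumption: if the overheads already exceed the gap, nothing more is
    injected, so the result is clamped at zero.
    """
    overhead = t_inject_lat + n_persist * t_inject_persist
    return max(0.0, t_gap - overhead)


# Hypothetical values in microseconds.
print(gap_to_fill(t_gap=100.0, t_inject_lat=10.0,
                  n_persist=3, t_inject_persist=5.0))  # 75.0
```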
IV. PERFORMANCE MODEL
This section introduces the LEEF performance model. Although there are a few accurate performance models built into simulators [18] - [20] , we limit this discussion of performance models to those built into real systems.
Performance models in prior memory emulation efforts can be categorized as bandwidth-based and latency-based models. The bandwidth-based model programs the iMC register on a specific hardware platform to throttle bandwidth while remaining ignorant of variance due to latency. Unfortunately, bandwidth-based models require a specific hardware platform [8] . Latency-based models require only knowledge of memory operation latency and are more general. Hence, we focus this discussion primarily on latency-based models. The most recent latency-based model is the linear-latency (LL) model.
The linear-latency model is the performance model proposed by HP. It monitors a performance monitoring event called LDM_PENDING and uses that as the memory bound. The memory time consumed by emerging memory is predicted in proportion to the increase of latency [13] .
The linear-latency model is built on three assumptions of memory time consumption.
1) Stall cycles caused by load requests are proportional to memory latency.
2) Changes in bandwidth and latency will not influence the linear relationship between stall cycles and memory latency.
3) Memory time consumption is determined by load requests.
To verify the accuracy of the linear-latency model, we used numactl to route all memory accesses to a remote node, where the linear-latency model is used to predict application execution time. The results showed that although the linear-latency model is good at predicting times for some applications, it results in significant prediction errors for several applications. As depicted in Figure 2, the prediction result of the linear-latency model is unreasonable for libquantum. Note that the results shown are predictions only of the LDM_PENDING cycles rather than the execution time; it is reasonable to expect higher error levels when predicting times using a linear-latency model. A further exploration of predicting execution times using the linear-latency model is discussed in Section V-A.
FIGURE 2. Linear-latency model memory-bound prediction. Note that here we predict the PMU event count rather than the total execution time.
We also consider a widely used latency-based model that multiplies last-level cache misses by the average memory latency, an approach also known as the Green-Governor (GG) model. Our experiment shows that the GG model achieves better accuracy than the LL model for high-bandwidth applications such as libquantum.
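The two latency-based models discussed so far can be written as one-line estimators. This is a sketch of the descriptions in the text, not code from either cited work: the function names, units and example values are hypothetical.

```python
def ll_memory_bound(ldm_pending_cycles: float,
                    host_latency_ns: float,
                    target_latency_ns: float) -> float:
    """Linear-latency (LL) model: scale the monitored LDM_PENDING stall
    cycles in proportion to the increase in memory latency."""
    if host_latency_ns <= 0:
        raise ValueError("host latency must be positive")
    return ldm_pending_cycles * (target_latency_ns / host_latency_ns)


def gg_memory_bound_ns(llc_misses: int, avg_latency_ns: float) -> float:
    """Green-Governor (GG) model: last-level cache misses multiplied by
    the average memory latency."""
    return llc_misses * avg_latency_ns


# Hypothetical readings: 1M stall cycles on an 80 ns host, 240 ns target.
print(ll_memory_bound(1_000_000, 80.0, 240.0))  # 3000000.0
print(gg_memory_bound_ns(10_000, 120.0))        # 1200000.0
```

Note that the LL estimate is expressed in stall cycles while the GG estimate is a time, so the two need a common unit before they can be compared.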
Evidently, stall cycles for memory are not always proportional to memory latency. The bandwidth of the memory subsystem and the issue rate of each application should also be taken into account, namely, the ratio of the issue rate to the maximum memory subsystem bandwidth. Based on this observation, we propose two improved models that consider the relative bandwidth of an application and the memory subsystem. The first is an intuitive refinement of the LL model, which we call the non-linear latency (NLL) model. The NLL model multiplies the LDM_PENDING reading by the relative bandwidth ratio. NLL could be calculated more accurately by using logistic regression or kernel-based regression, but because it is only a crude refinement of a simple model, we do not discuss it further.
Because emerging memory currently operates less quickly than DRAM, its relative bandwidth ratio is higher. Therefore, a better performance model that features high accuracy even with a high relative bandwidth ratio is required. Figure 3 shows a colormap scatter of monitored LDM_PENDING as a function of LLC misses. The colormap corresponds to the row buffer miss count. We can see that with a high row buffer miss rate, LDM_PENDING changes faster as the memory load increases and that different row buffer miss rates correspond to different rates of change in LDM_PENDING. Hence it is reasonable that row buffer misses should be considered when calculating memory bounds.
To reflect the influence of both memory locality and memory-level parallelism, we derive a new model, called the average Green-Governor (AGG) model, which we use for emulating NVM on LEEF. The AGG model is a modified Green-Governor model [21] that considers hit/miss/empty operations on the row buffer. Put simply, the AGG model uses the product of the last-level cache miss count and the average memory latency to calculate the memory bound. The average memory latency is calculated by the following equation:
where N_r and N_w are the uncore read and write counts in the iMC. Previous works have also used O_rtr and O_wtr in calculating the average memory latency [22], where O_rtr is the rank-to-rank switching overhead and O_wtr is the write-to-read switching overhead on the bus. However, rank-to-rank switching and write-to-read switching can be monitored only on AMD processors. Moreover, these overheads are trivial compared with the average read and write latency; consequently, we do not calculate O_rtr and O_wtr here. Instead, we calculate latency as follows:
Equation (2) shows the approach we took to calculate the average read latency. We consider the page hit/empty/miss ratio. Note that since N_pe and N_ph cannot be directly monitored using performance events, they are calculated from the monitored values of N_act, N_pm, N_r and N_w. Here, N_act denotes the number of row activations and N_pm denotes the number of page misses. Similarly, N_ph denotes the number of page hits and N_pe denotes the number of page empties. N_r and N_w denote the numbers of reads and writes. For the average latency of page hit/empty/miss operations, we first consider the following equation as used in Dramon:
where MaxAccBk is the maximum number of activated banks limited by FAW. We do not calculate tRP for page misses because emerging memory features non-destructive read operations; therefore, it does not need to precharge or restore values after reading them.
Writing to emerging memory uses an equation similar to Equation (3) and considers page hit/empty/miss operations. However, we calculate the average latency of page hit/empty/miss operations using the following equation:
All the performance events used are listed in Table 1. Page miss data can be derived by using page activation and page precharge. The load buffer delay and store buffer delay can be used with LDM_PENDING to calculate the memory bound. For uncore events such as read, write, page activation and page precharge, the event select code shown is only for channel 0; the codes for the other channels are not shown. Emerging memory technologies such as phase-change memory feature long iterative write latency. Hence, intuitively, a write operation may overlap with other operations such as other writes or activations due to limited programming current [23]. Considering the high current required for activation, write and activation commands must be spaced apart from each other [12]. We do not attempt to account for any overlap between write and activation commands in this study.
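The overall shape of the AGG calculation described in this section can be sketched as follows. This is not a reconstruction of the paper's numbered equations. It assumes that page empties are activations not caused by a page miss (N_pe = N_act - N_pm), that page hits are accesses needing no activation (N_ph = N_r + N_w - N_act), that reads and writes share the same hit/empty/miss fractions, and that the per-class latencies are supplied as inputs. All class names, function names and numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Counters:
    n_r: int    # uncore read count
    n_w: int    # uncore write count
    n_act: int  # row activations
    n_pm: int   # page misses


@dataclass(frozen=True)
class ClassLatency:
    """Latency in ns for page hit, page empty and page miss accesses."""
    hit: float
    empty: float
    miss: float


def page_fractions(c: Counters) -> Tuple[float, float, float]:
    """Return (hit, empty, miss) fractions derived from monitored counts."""
    total = c.n_r + c.n_w
    if total <= 0:
        raise ValueError("no memory accesses recorded")
    n_pe = c.n_act - c.n_pm  # assumed: activations not caused by a miss
    n_ph = total - c.n_act   # assumed: accesses that needed no activation
    return n_ph / total, n_pe / total, c.n_pm / total


def mean_latency(lat: ClassLatency, fractions: Tuple[float, float, float]) -> float:
    """Average latency weighted by the hit/empty/miss fractions."""
    hit, empty, miss = fractions
    return hit * lat.hit + empty * lat.empty + miss * lat.miss


def agg_memory_bound_ns(llc_misses: int, c: Counters,
                        read_lat: ClassLatency, write_lat: ClassLatency) -> float:
    """LLC misses times the read/write-weighted average memory latency."""
    fractions = page_fractions(c)
    l_r = mean_latency(read_lat, fractions)
    l_w = mean_latency(write_lat, fractions)
    avg = (c.n_r * l_r + c.n_w * l_w) / (c.n_r + c.n_w)
    return llc_misses * avg


# Hypothetical counter readings and per-class latencies.
counters = Counters(n_r=600, n_w=400, n_act=500, n_pm=250)
reads = ClassLatency(hit=20.0, empty=40.0, miss=40.0)
writes = ClassLatency(hit=100.0, empty=120.0, miss=120.0)
print(agg_memory_bound_ns(1000, counters, reads, writes))  # 62000.0
```

In the example the miss latency equals the empty latency, reflecting the point made above that no tRP term is charged for page misses on emerging memory.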
V. EVALUATION
In this section we analyze the key parameters in our performance model and present our emulation results on several benchmarks. The parameters are shown in Table 2 .
We conducted our study on a PowerEdge T430 server with two Intel XEON E5-2603v3 processors clocked at 1.6GHz, with 32GB of total memory. Each E5-2603 v3 processor includes 6 out-of-order processor cores with a three-level cache hierarchy. The L1 and L2 caches are private for each core, while L3 is shared among all the cores in each processor. Our system has 32GB of memory divided evenly for each node (16GB for each node). Each node has 2 channels, and each includes an 8GB DIMM.
We measured cache and memory latency using lat_mem_rd, the pointer-chasing benchmark in lmbench3 [24]. To determine the row buffer miss latency, we first reserved a memory area from the OS and then mapped the reserved area with ioremap_nocache. Next, we accessed this area with an address inside it. Then we flipped the row bit of the address and accessed the area using the new address. Finally, we measured the access time of the two consecutive accesses and compared it with that of two accesses to the same address to calculate the row buffer miss overhead. In this work, we use memory-intensive benchmarks from CPU SPEC06. To measure performance, we use the average execution time per instruction (TPI). We ran LEEF on CentOS 6.4 with the 2.6.32 Linux kernel.
A. ACCURACY
To verify the prediction accuracy of the performance model used in LEEF, we tuned the memory frequency and accessed memory remotely from the NUMA node to throttle memory bandwidth. Latency and bandwidth statistics are shown in Table 3 . We compare the predicted result with actural running result to evaluate the accuracy. The prediction target is a NUMA configuration in which the application runs on cores of socket 0 and application data are located on memory of socket 1, which is implemented using numactl. Then we use the tested performance model in LEEF to compare the acuracy of these model. The accuracy is calculated using Equation (5) .
Here, MB denotes memory bound, which is the total time spent in memory. MB configx means memory bound when the memory subsystem is tuned to configuration x, and T configx denotes the application execution time under configuration x. Figure 4 shows the error of the memory performance model for cross-node memory accesses. We used numactl [25] to bind the benchmaking process to one node and allocated memory using another node. We did not use all the benchmarks from CPU SPEC06 [26] because prediction of memory non-intensive benchmarks involve only a 2% error rate [13] on average. However, among the ten most memory-intensive benchmarks from CPU SPEC06, linear latency (LL) model has a prediction error of 11.9%. As shown in Figure 4 , for cactus, deal, omnetpp, and zeusmp, LL provides the best accuracy. For lbm, leslie3d, libquantum and wrf, the memory access (MA) model is the best. For gems, gobmk, and soplex, our Average Green Governor (AGG) model is the most accurate. It can be seen that LL model, which simply based on a performance event, is the worst among the three model. And the memory access model, which is based on average latency, is better than LL model. However, they lack the knowledge of row-buffer locality and memory level parallism, which is discussed on AGG model.
After exploring the errors caused by different performance models, we discovered that no single performance model is most suitable to predict all the benchmarks; hence, we believe a heuristic method should be used to select the most suitable performance model by characterizing applications. We use memory bandwidth, last-level cache hit miss rate and row buffer hit/miss/empty rate as the main metrics. Our experiment shows that when choosing the best performance model, a heuristic algorithm can achieve an average accuracy of 5.7%. The result is based on performance model selection hence the result is the best of the tested three model.
Note that the heuristic algorithm is a regression model whose inputs are the page hit/miss/empty rates, the read and write fractions and the average instruction rate. The trained model is then used to perform the performance model selection. Because the proposed models are all based on performance events that can be multiplexed on the processor's performance counters at the same time, selecting a model is, in essence, a matter of applying a different equation to the same measurements. Naturally, this classification problem can also be solved using various regression methods. However, many performance models are empirical and rely on specific parameter tunings to achieve better accuracy. In contrast, the proposed heuristic algorithm is an intuitive method that simply selects the best performance model using prior knowledge.
In this paper, we focus on emulating emerging non-volatile memory. For NVM, due to the limited timing constraints, the MA and AGG models are often better than the LL model. We chose the AGG model to monitor NVM in LEEF, which allows us to monitor the row buffer locality, memory-level parallelism and micro-ops latency in the model [22] .
B. PERSISTENT VS NON-PERSISTENT
Applications are categorized as persistent and non-persistent based on the way they use persistent memory. If an application is NVM-aware (meaning it will utilize the non-volatility of NVM and treat NVM as persistent storage) then it is called a persistent application. In contrast, when an application is not aware of NVM and simply uses NVM as working memory, it is called a non-persistent application.
In this section, we show comparisons between the performances of both persistent and non-persistent applications using LEEF. The parameters used by LEEF were derived from published data [23] , [27] , [28] . The execution times of non-persistent applications are shown in Figure 5 . The performance degradation on NVM varies across different benchmarks. On average, PCM degrades performance by 43.3% while STT degrades performance by 8.9%, and RRAM degrades performance by 9.3%. Across all benchmarks, deal and gobmk run almost equally as fast when using different memory types, which provides the insight that when treating NVM as working memory these benchmarks can be scheduled to run on NVM with trivial performance degradations. However, the other benchmarks, especially gems and mcf, are more sensitive to the underlying memory technology; consequently, they should be scheduled to run on DRAM to avoid further performance degradations.
As shown in Figure 6 , on PCM, non-persistent applications will suffer an average slowdown of 1.4x, while persistent applications will suffer a slowdown of 2.6x on average. Among all the benchmarks, mcf suffers the most; non-persistent mcf execution is 2.3x slower and persistent execution is 3.4x slower.
For each potential NVM candidate, we conducted an experiment using both persistent and non-persistent emulation. The results are presented in Figure 7 . We calculated persistence slowdown by dividing the execution time of the persistent application by that of the non-persistent application. On average, the persistence slowdown is 83% for PCM, 113% for STT and 112% for RRAM.
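The persistence slowdown defined above is a simple ratio of execution times. The sketch below computes it and, as an assumption about how the reported percentages are expressed, also converts the ratio into the extra time relative to the non-persistent run. Both function names are hypothetical.

```python
def persistence_slowdown(t_persistent, t_non_persistent):
    """Execution time of the persistent run divided by the non-persistent run."""
    if t_non_persistent <= 0:
        raise ValueError("non-persistent execution time must be positive")
    return t_persistent / t_non_persistent


def as_percent_overhead(ratio):
    """Express a slowdown ratio as extra time in percent (assumed convention)."""
    return (ratio - 1.0) * 100.0
```

For example, applying the ratio to the average PCM slowdowns of 2.6x and 1.4x from Figure 6 gives roughly 1.86. This need not match the per-benchmark average exactly, because an average of ratios differs from a ratio of averages.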
C. ARCHITECTURAL OPTIMIZATION EMULATION IN LEEF
As discussed earlier, LEEF supports integrating architectural innovations into emulation. In this section, we focus on two case studies that optimize DRAM organization and demonstrate how LEEF supports extending emulation with architectural innovations.
1) CASE STUDY (ROWCLONE)
RowClone is an architectural innovation that optimizes memory organization for bulk copy operations [29]. It adds new memory operations to the integrated memory controller (iMC) to accelerate copy operations. In this section, we illustrate how we study RowClone on LEEF.
Since RowClone is designed for faster copy operations, we have to identify the fraction of memory copy traffic in all memory traffic. We use the same notation as RowClone: FTMC, the fraction of traffic due to memory copy.
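FTMC as defined above is the share of memory traffic that belongs to copy operations. The following sketch computes it from a list of trace records. The record layout, a pair of transfer size in bytes and a flag marking copy traffic, is an assumption made for illustration and is not the LEEF trace format.

```python
def ftmc_from_trace(records):
    """Fraction of traffic due to memory copy.

    `records` is an iterable of (size_bytes, is_copy) pairs (assumed layout).
    """
    copy_bytes = 0
    total_bytes = 0
    for size_bytes, is_copy in records:
        total_bytes += size_bytes
        if is_copy:
            copy_bytes += size_bytes
    if total_bytes == 0:
        raise ValueError("trace contains no memory traffic")
    return copy_bytes / total_bytes
```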
We implemented forkbench to analyze the performance gain of RowClone. In this benchmark, we first initialize an array of size S with random values so that we can be sure that all physical memory is allocated. Second, we call fork(). In the child process, we access the array in a non-contiguous manner so that each access touches a different page, and we assign a new value to each accessed cache line.
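The forkbench steps above can be sketched as follows. This is a small illustration of the access pattern on a POSIX system, not the authors' implementation. The function name, the return value and the choice of writing one byte per page are assumptions.

```python
import mmap
import os

PAGE_SIZE = mmap.PAGESIZE  # page size of the host, typically 4096 bytes


def forkbench(size_bytes):
    """Fill an array, fork, and dirty one cache line per page in the child.

    Returns (child exit code, number of strided writes done by the child).
    """
    # Random initialization forces every page to be physically allocated.
    buf = bytearray(os.urandom(size_bytes))
    writes = len(range(0, size_bytes, PAGE_SIZE))

    pid = os.fork()
    if pid == 0:
        # Child: writes are PAGE_SIZE bytes apart, so each one lands on a
        # different page and triggers a copy-on-write page copy.
        for offset in range(0, size_bytes, PAGE_SIZE):
            buf[offset] = (buf[offset] + 1) % 256
        os._exit(0)

    # Parent: wait for the child and report how it terminated.
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return exit_code, writes
```

Each write in the child forces the kernel to copy a whole page, which is the bulk copy traffic that RowClone targets.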
Due to the limitations of simulation, RowClone was only tested on a small array of at most 128MB. However, it is common for an application to occupy gigabytes of memory. LEEF is capable of monitoring large memory footprints with low overhead. The FTMC on LEEF is shown in Figure 8 for an 8GB array. RowClone relies on two access patterns to make copies faster. The first pattern, FPM, is performed within the same subarray, and multiple copy operations in different subarrays can proceed in parallel. The second pattern, PSM, is performed across different subarrays, and data are transferred in a DMA-like manner across different banks in the same chip. The optimization of RowClone is orthogonal to the memory technology since it only requires adding memory operations to the iMC. Hence, we analyze the performance impact of RowClone on NVM. By identifying the copy operations, both FPM and PSM can be easily combined with LEEF. As shown in Figure 9, PSM and FPM bring IPC increases of 15.8% and 8.4%, respectively, when N equals 8GB. In previous experiments, however, due to the limitations of simulation, the reported results cover at most 512MB operations [29]. With LEEF, we can observe that the growth in performance gain slows as the memory footprint approaches 8GB.
2) CASE STUDY (SALP)
In this section, we present a case study on integrating SALP into LEEF emulation.
SALP is based on the insight that independent subarrays in the same bank can improve memory parallelism [30]. SALP is composed of three mechanisms that overlap operations within the same bank: SALP-1, SALP-2 and MASA. SALP-1 overlaps the precharge and activation of different subarrays. SALP-2 overlaps the precharge of the currently activated row with the activation of another subarray. MASA allows multiple subarrays to be activated at the same time, but only one activated subarray drives the bitline. Although we cannot directly make such low-level microarchitectural alterations in emulation, we can derive the parameters from simulation results. We take the parameters from published data [30]; they are given in Table 4. These data are collected by simulators and are incorporated into the LEEF emulation of SALP-1, SALP-2 and MASA. The three key parameters are read latency, memory parallelism and row buffer hit rate improvement. Since read latency is determined by tRCD and tCAS, RD Latency represents the fraction of latency under each SALP mechanism. For memory parallelism, SALP enables more banks to work in parallel; hence, the overlap between row buffer hit/miss/empty accesses increases. Finally, the row buffer hit rate is an intuitive value that is easy to incorporate into the NVM model. The results are shown in Figure 10. On average, mcf is the most sensitive to the SALP optimizations, whereas leslie3d is not sensitive to SALP. Across all benchmarks, SALP-1 increases IPC by 17.6%, SALP-2 by 18.7% and MASA by 20.3%. We can also see that deal and wrf are insensitive to subarray-level optimization: the average performance increase is only 2.8% for deal and 1.9% for wrf.
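One way to picture how simulator-derived parameters might enter an emulation model is sketched below. It is a simplified illustration and not the LEEF model: it applies only a read latency fraction and a row buffer hit rate gain to an average access latency, and it leaves memory parallelism out. All names and the weighting scheme are assumptions, and no value from Table 4 is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowBufferMix:
    """Fractions of accesses that hit, miss, or find an empty row buffer."""
    hit: float
    miss: float
    empty: float


def average_latency(mix, t_hit, t_miss, t_empty):
    """Average access latency weighted by the row buffer outcome mix."""
    return mix.hit * t_hit + mix.miss * t_miss + mix.empty * t_empty


def apply_hit_rate_gain(mix, gain):
    """Move `gain` of the accesses from misses (then empties) to hits."""
    shift = min(gain, mix.miss + mix.empty)
    from_miss = min(shift, mix.miss)
    from_empty = shift - from_miss
    return RowBufferMix(mix.hit + shift, mix.miss - from_miss,
                        mix.empty - from_empty)


def optimized_latency(mix, t_hit, t_miss, t_empty, latency_fraction, hit_gain):
    """Average latency after applying a hit rate gain and a latency fraction."""
    new_mix = apply_hit_rate_gain(mix, hit_gain)
    return latency_fraction * average_latency(new_mix, t_hit, t_miss, t_empty)
```

Keeping the optimization as a pair of scalar parameters means a new mechanism can be evaluated by swapping in a different parameter set without touching the rest of the model.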
VI. RELATED WORKS
NVM emulation relies on accurate modeling. However, prior works on NVM emulation are incomplete and tend to model NVM using only simple performance models [8], [13]. For example, the first NVM emulation platform, PMEP [8], considered only read latency with proportional latency, and for write operations, PMEP simply throttled the bandwidth using a discrete bandwidth value via the thermal throttling registers in the memory controller [31]. Another known NVM emulation method emulated NVM using the LL model [13], [32]; however, as discussed in Section V-A, the LL model is not accurate for some benchmarks. Recent FPGA-based NVM emulation uses hardware to emulate DRAM, parameterizing read and write latency by injecting latency at the memory controller hardware [14]. Unlike previous emulation works, LEEF tries to emulate application performance on NVM with higher accuracy, which requires a thorough analysis of memory bound. Memory bound is an important component of performance modeling. Prior efforts to create performance models can be categorized into empirical models and mechanistic models [33]. Empirical models treat prediction targets as black boxes and emulate them using regression methods, while mechanistic models are designed through an understanding of the underlying mechanisms. Empirical models use regression-based models to predict memory bandwidth and application execution time [34].
Mechanistic models can be categorized as static analysis-based, simulator-based and real-machine PMU-based. The static analysis-based models use application source code annotation and analysis to increase the prediction accuracy, but this approach requires recompiling the source code and increases the complexity of the performance model [34], [35]. For simulator-based models, mechanistic models can sample any event of interest to improve accuracy but cannot be implemented on a real machine [36], [37]. Finally, the PMU-based performance models rely on specific exposed processor features. Many mechanistic performance predictions are available for the AMD platform. For example, LL-MAB is built for AMD processors and uses the MAB register to model memory bound with high accuracy [38]. Dramon is also built for AMD processors: by using a probabilistic model, it predicts memory bandwidth with an average accuracy of 96% [22]. However, on the Intel platform, to our knowledge, memory bound predictions are restricted to using latency scaling to predict memory time [39] or to using the PMU event LDM_PENDING [13] to estimate the memory bound.
Recent performance-prediction efforts have combined an empirical model with a mechanistic model to improve accuracy [40] . LEEF is also a combination of an empirical and mechanistic model because it builds the initial model using three mechanistic models [8] , [13] , [22] and then uses a regression method to select the best model.
VII. CONCLUSION
By combining memory-like performance with storage-like capacity, emerging NVM has prompted a deep rethinking of overall system architecture. Both hardware and software optimizations have been made to better leverage NVM. A long-standing problem for researchers is the lack of a research prototype platform. Software development around NVM, especially system software, still relies on full-system simulators that operate orders of magnitude slower than native runs. To date, most NVM hardware tuning efforts are conducted using simulators. These tunings include reordering memory requests in the read/write queue of the memory controller, using bit flipping to reduce cell wear, changing the memory refresh frequency to balance persistence and performance, and so on.
This paper presented LEEF, a lightweight emulation platform that mimics the behavior of emerging non-volatile memory on real machines with high accuracy and low overhead. LEEF can be directly deployed on most current computer systems. LEEF also supports integrating microarchitectural innovations into the performance model. We demonstrated that none of the current performance models is accurate for every benchmark under all presets. Consequently, we propose an online algorithm to select the best model. Apart from emulating pertinent NVM behaviors, LEEF also provides an interface for collecting real-system memory traces, enabling simulation results to be integrated into emulations. These memory traces can be input to trace-based simulators. We presented two case studies that demonstrate how to integrate the simulation results.
