Abstract: Satisfying the demand for higher memory capacity is a major problem for computing systems. Conventional solutions are reaching their limits; instead, DRAM/NVM hybrid main memory systems, which consist of emerging Non-Volatile Memory (NVM) for large capacity and a DRAM last-level cache for high access speed, were proposed for further improvement. However, in these systems, the two device types share limited memory channels/ranks, and NVM channels/ranks are often less utilized than DRAM ones. This paper proposes OBYST (On-hit BYpass to STeal bandwidth), a technique that improves memory bandwidth by selectively sending read requests that hit on the DRAM cache to NVM instead of busy DRAM. We also propose an inter-device request scheduling policy optimized for OBYST. With negligible area overhead, OBYST improves bandwidth, IPC, and EDP by up to 22%, 21%, and 26%, respectively, over the baseline without bandwidth optimizations.
Introduction
Despite the continuous increase in memory capacity, there has always been a demand for higher capacity due to the emergence of new data-intensive applications. For example, in-memory databases [1], key-value stores [2], and RAMCloud [3] keep their most critical data in main memory instead of on disks to improve performance. Providing sufficient memory capacity is critical to these applications because a system with memory capacity lower than the working set of an application experiences a surge in page faults, leading to radical performance degradation [4]. However, increasing main memory capacity without growing the cost and power consumption of systems is getting harder for the following reasons: 1) DRAM process shrinking is slowing down heavily over time [5] and 2) DRAM leakage accounts for a large portion of the system power consumption and grows with capacity [6].
To overcome these limitations, the idea of utilizing Non-Volatile Memories (NVMs), such as Phase Change Memory (PCM), Resistive RAM (ReRAM), and 3D XPoint™, which offer lower leakage power and higher capacity than DRAM, as main memory has been proposed [4, 7, 8, 9]. The read latencies of these byte-addressable NVMs (cf. block devices such as NAND Flash) are only several times longer than those of DRAM. Therefore, these NVMs are considered more promising DRAM alternatives than NAND Flash. However, directly replacing DRAM with emerging NVMs is impractical due to their limited write endurance (10^7-10^10) and high write energy/latency [4, 8, 10, 11]. Instead, DRAM/NVM hybrid memory systems were proposed, where low-capacity DRAM serves frequently accessed data with lower latency, whereas NVM maintains a large memory space with a longer lifetime due to fewer accesses [4, 7, 8]. These systems have two ways to deploy DRAM: as a software-transparent last-level cache (LLC) or as a software-managed fast memory region. This paper focuses on the former.
To gain higher memory capacity without noticeably increasing the cost of systems, DRAM/NVM hybrid memory systems require the two device types to share the existing, limited interfacing resources. That is, a subset of memory channels/ranks is allocated to NVM, decreasing the number of channels/ranks used by DRAM (other options are described in Section 2). However, the channels/ranks allocated to the DRAM cache are often more heavily utilized compared to those allocated to NVM. This causes an imbalance in the use of the channels/ranks, deteriorating memory system performance when an application needs moderate bandwidth.
To alleviate this inefficiency, we propose a novel bandwidth balancing technique called OBYST (On-hit BYpass to STeal bandwidth). OBYST improves effective memory bandwidth by moving a portion of the load from DRAM to NVM when DRAM utilization is relatively high. When a read request is issued, if 1) it hits the DRAM cache, 2) the corresponding cache line is clean, and 3) DRAM bandwidth is beyond a certain threshold and higher than NVM bandwidth, OBYST accesses NVM instead of DRAM. For 3), OBYST measures DRAM and NVM bandwidth for every time interval (e.g., 1K memory clock cycles) and determines the target device (DRAM or NVM) that will be applied during the following time interval. OBYST also considers the number of channels/ranks allocated to each device and whether DRAM bandwidth (or per-channel bandwidth if DRAM and NVM share the same channel) is beyond a certain threshold (e.g., 30% of peak). Sim et al. [12] proposed an alternative technique called Self-Balancing Dispatch (SBD). For memory systems in which processor-integrated die-stacked DRAM (on-chip DRAM) works as a cache of off-chip DRAM, SBD compares predicted latencies (the number of requests waiting for the same bank multiplied by the typical latency of one request of the corresponding device) to select the target device (on-chip or off-chip DRAM) for hit read requests heading to clean cache lines. Even though SBD works well, OBYST is also compelling due to the following limitations of SBD. In balancing bandwidth, the number of pending requests does not exactly reflect current bandwidth (the former fluctuates more). In minimizing latency, latency prediction is less accurate due to influences from row-buffer states and request scheduling. We also propose an inter-device request scheduling policy optimized for the DRAM/NVM hybrid system adopting OBYST, in which DRAM and NVM share the same channel (i.e., separate ranks for each device).
Even if DRAM requests have higher priority than NVM requests because DRAM works as a cache, the requests sent to NVM by OBYST also deserve high priority for better performance. On a PCM-based DRAM/NVM hybrid main memory system, we show through system-level simulation that OBYST improves memory bandwidth, IPC, and EDP by up to 22%, 21%, and 26% over the baseline (without any bandwidth optimization), respectively, while incurring negligible area and dynamic energy overheads.
Background and motivation

Memory channel and rank
The numbers of memory channels and ranks have a substantial impact on the performance of computer systems. Fig. 1(a) shows a processor-memory interface in servers and personal computers. This interface connects a memory controller and Dual In-line Memory Modules (DIMMs) through on-board or on-package memory channels. A processor often has several memory channels (CH0-3 in Fig. 1(a)), each functioning independently of the others. Therefore, as the number of channels grows, both maximum and effective memory bandwidth are improved owing to more parallelism. A DIMM has one or more ranks. Consequently, multiple ranks (RANK0-3) are connected to a single channel. Just as multiple banks in a rank operate in parallel (BLP: bank-level parallelism), the ranks attached to a channel also operate in parallel (also called BLP in a wide sense). Thus, as the number of ranks per channel grows, effective memory bandwidth is improved due to an increase in BLP.
Sharing of limited interfacing resource
The number of memory channels is limited by the pin count of a processor, and the number of ranks is limited by the signal integrity of the electrical lines forming a channel. To implement DRAM/NVM hybrid systems, the two device types share the existing memory channels or ranks. Although some previous studies propose different interface types (using optical interconnect [13], a separate PCI Express channel, or on-chip DRAM [8]), these proposals are less mature or not cost-neutral. While [9] shows a DIMM architecture including DRAMs, PCMs, and a PCM controller, a preferred pragmatic near-term solution would be to give processors control of both DRAM and NVM through conventional DIMM form factors.
Baseline DRAM/NVM hybrid main memory structures
For the aforementioned reasons, we choose DRAM/NVM hybrid systems that replace a subset of DRAM DIMMs with NVM DIMMs [7, 8] as the baseline architecture. Of our two baseline structures (Fig. 1(b)), the left system allocates a subset of its memory channels to NVM (SC: separate channel for NVM), whereas the right system allocates a subset of the ranks on each channel to NVM (SR: separate rank for NVM). The ratio of the number of DRAM DIMMs to that of NVM DIMMs is optimized based on the trade-off between DRAM bandwidth and NVM capacity. For both the SC and SR examples in Fig. 1(b), this ratio is set to '1'. Although the DRAM-only system in Fig. 1(a) is configured to have the largest DDR4 Registered DIMM in the market (256GB), the capacity of SC/SR in Fig. 1(b) is 1TB excluding the DRAM cache thanks to the higher scalability (8× [11]) of NVM compared to DRAM. 4GB single-rank DRAM DIMMs are used instead of 32GB DRAM DIMMs for SC/SR (Fig. 1(b)) because DRAM cache size is limited by the feasibility of storing a portion of the DRAM cache tag information in the processor. Many DRAM cache studies suggest storing a part of the tag information in the processor and the entire tag array in DRAM to mitigate the performance overhead from DRAM accesses due to tag lookups [14, 15]. We leverage the idea of ATCache [15], which caches some of the tags in the processor. The storage size of the tag array managing a 16GB DRAM cache (Fig. 1(b)) is 512MB with a cache line size of 64B, and we assume a 2MB SRAM tag cache [15].
Performance degradation
Fig. 2 compares the performance of DRAM-only systems and DRAM/NVM hybrid systems in a 16-core chip-multiprocessor system simulation (refer to Section 4 for details). The baseline is DO-4C4R, the DRAM-only system with 4 channels and 4 ranks per channel (Fig. 1(a)). DO-2C2R has only the DRAM channels of SC, DO-4C1R has only the DRAM ranks of SR, and SC and SR are the systems shown in Fig. 1(b).
We focus on applications whose primary working sets fit in memory; hence, all the system configurations have the same page fault counts. Because DRAM, not NVM, processes most memory requests, the performance of SC and SR closely follows that of DO-2C2R (fewer channels) and DO-4C1R (fewer ranks), respectively. SC (SR) generally performs worse than DO-2C2R (DO-4C1R) due to the additional traffic for cache management. However, for MICA, a key-value store application, SC (SR) outperforms DO-2C2R (DO-4C1R) owing to the high utilization of the channels (ranks) allocated to NVM. SC performs worse than SR because the number of channels affects memory bandwidth more directly than the number of ranks does. Finally, although SC and SR perform worse than the DRAM-only system (DO-4C4R) in simulation, adopting a DRAM/NVM hybrid system can still be justified by its large capacity. If the primary working set of an application is much larger than the capacity of the DRAM-only system, the hybrid memory system can outperform the DRAM-only system. [4] shows that 4× main memory capacity reduces page faults by 5× and provides a speedup of 3×. OBYST makes the hybrid system even more attractive.
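The tag-array figure quoted for the baseline (512MB for a 16GB DRAM cache with 64B lines) can be sanity-checked with a short calculation. The sketch below is illustrative only; the 2-byte tag entry is inferred from the quoted sizes and is not stated in the text, and the function name is hypothetical.

```python
MIB = 2**20
GIB = 2**30


def tag_array_bytes(cache_bytes: int, line_bytes: int, entry_bytes: int) -> int:
    """Size of a tag array holding one entry per cache line."""
    return (cache_bytes // line_bytes) * entry_bytes


# Numbers quoted in the text: 16GB DRAM cache, 64B lines, 512MB tag array.
cache_lines = 16 * GIB // 64                  # 2**28 lines
implied_entry = 512 * MIB / cache_lines       # bytes per line (inferred, not stated)
tag_cache_fraction = 2 * MIB / (512 * MIB)    # share of tags the 2MB SRAM can hold

print(cache_lines, implied_entry, tag_cache_fraction)
```

Under these numbers the on-chip tag cache covers 1/256 of the full tag array, which is why the complete array has to live in DRAM.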
OBYST (On-hit BYpass to STeal bandwidth)
We propose OBYST, which improves the memory bandwidth of DRAM/NVM hybrid memory systems by changing the target device of some requests from busier DRAM to underutilized NVM, thereby mitigating the imbalance in channel/rank utilization. To find the requests that NVM can process among the ones DRAM normally processes, we focus on the memory data whose latest copies are stored in both DRAM and NVM. Consequently, we identify that the read requests that 1) hit on the DRAM cache and 2) head to clean (not dirty) cache lines can be processed by NVM. Hereafter, we call such a request a clean-hit-request. OBYST improves memory bandwidth by adaptively switching the target device of clean-hit-requests.
Design of OBYST
OBYST is an epoch-based adaptive scheme. It uses the per-channel bandwidth of DRAM, NVM, or both, monitored during the latest epoch interval (e.g., 1K memory clock cycles), to select the target device of clean-hit-requests issued during the following epoch interval. For SC (separate channel for NVM), the OBYST algorithm can be applied at the granularity of a channel group, which consists of one or more NVM and DRAM cache channels. For SR (separate rank for NVM), OBYST can be applied at the granularity of a channel, which consists of NVM and DRAM cache ranks. There are two necessary conditions for sending clean-hit-requests to NVM. First, the DRAM bandwidth of the channel group (SC) or the channel bandwidth consumed by both DRAM and NVM (SR) should be over a certain threshold called BW-threshold (e.g., 30% of peak). When memory bandwidth utilization is low (under BW-threshold), the existing banks and ranks provide sufficient parallelism for the few requests; thus, sending clean-hit-requests to NVM is not beneficial. Therefore, when the first condition is not met, OBYST always targets the DRAM cache. Proper BW-threshold values depend heavily on architectural parameters, such as the number of banks/ranks and the latency and minimum interval of row/column accesses, as well as on the row-buffer miss rate of a workload. In this study, we empirically set the BW-threshold value for both SC and SR to 30% of peak bandwidth through performance simulation.
Second, the ratio of DRAM bandwidth to NVM bandwidth, called BW-ratio, should be over a certain threshold. If a channel group (SC) has the same number of NVM and DRAM channels or a channel (SR) has the same number of NVM and DRAM ranks, a better inter-channel or inter-rank bandwidth balance is achieved as BW-ratio approaches '1'. Likewise, if the number of NVM channels (or ranks) is three times that of DRAM channels (or ranks), a better bandwidth balance is achieved as BW-ratio approaches '0.33'. To make our threshold a single value, '1', OBYST multiplies BW-ratio by a weight called RA-ratio (resource allocation ratio), the ratio of the number of NVM channels (or ranks) to that of DRAM channels (or ranks) in a channel group (in a channel). For the latter case above, RA-ratio is '3'. The OBYST algorithm is described in Fig. 3(a). When a clean-hit-request arises, if DRAM bandwidth (SC) or channel bandwidth (SR) is over BW-threshold and BW-ratio × RA-ratio is larger than '1', the request will be sent to NVM; otherwise, it will be sent to DRAM.
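The two conditions above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not a reproduction of Fig. 3(a): the function and parameter names are hypothetical, bandwidth values may be in any consistent unit, and the caller is assumed to pass DRAM bandwidth (SC) or DRAM plus NVM channel bandwidth (SR) as `monitored_bw`.

```python
def obyst_target(dram_bw: float, nvm_bw: float, monitored_bw: float,
                 peak_bw: float, n_dram: int, n_nvm: int,
                 bw_threshold: float = 0.30) -> str:
    """Pick the device that serves clean-hit-requests in the next epoch.

    monitored_bw: DRAM bandwidth of the channel group (SC) or the channel
    bandwidth consumed by both DRAM and NVM (SR), measured in the last epoch.
    n_dram / n_nvm: channels (SC) or ranks (SR) allocated to each device.
    """
    # Condition 1: utilization must exceed BW-threshold (30% of peak here).
    if monitored_bw <= bw_threshold * peak_bw:
        return "DRAM"
    # Condition 2: BW-ratio * RA-ratio > 1, where BW-ratio = dram_bw / nvm_bw
    # and RA-ratio = n_nvm / n_dram. The cross-multiplied form below is
    # equivalent and avoids dividing by zero when NVM was idle.
    if dram_bw * n_nvm > nvm_bw * n_dram:
        return "NVM"
    return "DRAM"


# Equal allocation (RA-ratio = 1), busy DRAM and mostly idle NVM.
print(obyst_target(dram_bw=8.0, nvm_bw=2.0, monitored_bw=8.0,
                   peak_bw=19.2, n_dram=1, n_nvm=1))
```

With RA-ratio = 3, the same function steers requests to NVM until DRAM bandwidth falls to roughly one third of NVM bandwidth, which matches the '0.33' balance point described above.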
Inter-device request scheduling policy for OBYST
Conventional request schedulers on memory controllers, such as FR-FCFS and PAR-BS [16], are designed to schedule requests for a single device type (e.g., DRAM) on a channel. However, for SR, where DRAM and NVM share the same channel, a memory controller must also arbitrate between DRAM and NVM requests. We call this inter-device request scheduling and devise a policy optimized for SR adopting OBYST. The policy selects one request between a DRAM request and an NVM request, each independently chosen per device type by an existing scheduler. When there are pending requests, the existing scheduler finds, at every memory clock, the request with the highest priority among the requests meeting timing constraints. If this operation is executed per device type for SR, at most two requests will be selected per cycle (one for DRAM and the other for NVM). Then, OBYST selects between them based on the priority (1 through 4) shown in Table I. For SR without OBYST, DRAM requests have higher priority than NVM requests because DRAM cache hit requests should be processed as soon as possible (in the sense of shortest-job-first). By contrast, for SR with OBYST, the clean-hit-requests sent to NVM have higher priority than DRAM writes because they are also hit requests and, in general, reads are more critical for performance than writes.
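The arbitration step can be sketched as follows. Table I is not reproduced in this text, so the four-level ordering used here (DRAM reads, then OBYST-steered NVM reads, then DRAM writes, then other NVM requests) is an assumption that is merely consistent with the description above; the `Request` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    device: str                  # "DRAM" or "NVM"
    is_read: bool
    obyst_steered: bool = False  # clean-hit-request redirected to NVM by OBYST


def priority(req: Request) -> int:
    """Smaller value = served first. ASSUMED ordering, not taken from Table I."""
    if req.device == "DRAM" and req.is_read:
        return 1
    if req.device == "NVM" and req.obyst_steered:
        return 2
    if req.device == "DRAM":
        return 3                 # DRAM write
    return 4                     # any other NVM request


def pick(dram_candidate: Optional[Request],
         nvm_candidate: Optional[Request]) -> Optional[Request]:
    """Choose between the per-device winners of the existing scheduler."""
    candidates = [r for r in (dram_candidate, nvm_candidate) if r is not None]
    return min(candidates, key=priority) if candidates else None


# A steered NVM read wins over a DRAM write; an ordinary NVM read does not.
print(pick(Request("DRAM", False), Request("NVM", True, obyst_steered=True)))
print(pick(Request("DRAM", False), Request("NVM", True)))
```

Only the relative order of steered NVM reads and DRAM writes is stated in the text; the remaining levels would need to be checked against Table I.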
Implementation of OBYST
We describe the implementation of OBYST only for SR because it is more complicated than that for SC, which can be easily inferred from our description. Fig. 3(b) shows the functional blocks (grey-colored) added to the memory controller integrated into the processor to implement OBYST on SR. The memory controller designed for DRAM/NVM hybrid systems has a 2-level memory engine (2LME), which converts original memory requests to either DRAM or NVM requests depending on DRAM cache status and takes charge of DRAM cache management, such as tag updates, data caching, and evictions [8]. The OBYST target decision logic in the 2LME determines the target device of clean-hit-requests. The DRAM read/write counter and the NVM read/write counter count the number of read/write commands issued to each device during an epoch. At the end of each epoch, the OBYST target decision logic 1) calculates the bandwidth of DRAM, NVM, and their sum from those counter values, 2) determines the target device (DRAM or NVM) that will be applied during the following epoch, and 3) resets the counters. The 2LME sends the clean-hit-requests issued during the following epoch interval to the target device. To apply our inter-device scheduling policy, the inter-device scheduler in the DRAM/NVM controller should also be modified. The area overhead of the modifications listed above is at most 2K logic gates per memory controller.
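The counter-based monitoring for SR can be sketched in software as below. This is only an illustration of the epoch mechanism, not the hardware design: the class and method names are hypothetical, and converting command counts to bus utilization assumes that each read/write command occupies the data bus for 4 memory clock cycles (a burst of 8 on a double-data-rate interface), which the text does not state.

```python
class EpochMonitor:
    """Per-channel command counters and target decision for SR (sketch)."""

    def __init__(self, epoch_cycles: int = 1000, cycles_per_cmd: int = 4,
                 n_dram_ranks: int = 1, n_nvm_ranks: int = 1,
                 bw_threshold: float = 0.30):
        self.epoch_cycles = epoch_cycles
        self.cycles_per_cmd = cycles_per_cmd   # ASSUMED burst occupancy
        self.n_dram = n_dram_ranks
        self.n_nvm = n_nvm_ranks
        self.bw_threshold = bw_threshold
        self.counts = {"DRAM": 0, "NVM": 0}
        self.target = "DRAM"                   # default before the first epoch ends

    def on_command(self, device: str) -> None:
        """Count one read/write command issued to `device`."""
        self.counts[device] += 1

    def end_epoch(self) -> str:
        """Compute utilization, pick the next-epoch target, reset counters."""
        scale = self.cycles_per_cmd / self.epoch_cycles
        dram = self.counts["DRAM"] * scale     # fraction of peak channel bandwidth
        nvm = self.counts["NVM"] * scale
        over_threshold = (dram + nvm) > self.bw_threshold
        # BW-ratio * RA-ratio > 1, written without division.
        dram_heavier = dram * self.n_nvm > nvm * self.n_dram
        self.target = "NVM" if (over_threshold and dram_heavier) else "DRAM"
        self.counts = {"DRAM": 0, "NVM": 0}
        return self.target


monitor = EpochMonitor()
for _ in range(100):
    monitor.on_command("DRAM")
for _ in range(10):
    monitor.on_command("NVM")
print(monitor.end_epoch())
```

In this example the channel is 44% utilized and dominated by DRAM traffic, so clean-hit-requests of the next epoch would go to NVM.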
Experimental methodology
To quantify the effect of OBYST, we modeled a chip-multiprocessor (CMP) system with a PCM-based DRAM/NVM hybrid memory system. Detailed parameters are listed in Table II. The composition of memory channels and ranks as well as their capacity equals those in Fig. 1(b). Because the memory footprints of the evaluated workloads (193MB-1.6GB) are smaller than the DRAM cache size (16GB), we scaled down the DRAM cache size to a quarter of the workload footprint for each simulation by reducing the number of rows per bank. The latency/energy parameters of the processor (14nm technology), DRAM, and PCM are extracted from a modified McPAT [19], a Micron datasheet, and previous studies [10, 17, 18], respectively. We adopted PAR-BS [16] as the memory request scheduling policy and an adaptive open/close policy (which is also adopted in the Intel Xeon™ series) as the DRAM/PCM page management policy. For OBYST, the optimal epoch interval that we found is 1K memory clock cycles (833ns). For SBD (Self-Balancing Dispatch [12]), we use '1' as the ratio of the typical latency of a DRAM access to that of a PCM access. Although [12] used a ratio lower than '1', '1' led to better performance in our environment because intermittent drops in the number of pending DRAM requests at high DRAM bandwidth need to be compensated. We modified McSimA+ [20] for performance simulation.
The SPEC CPU2006 benchmark suite was used for multi-programmed workloads. We identified and used the most representative simulation point (each consisting of 100M instructions) per application using SimPoint. We identified the nine most memory-intensive applications based on memory accesses per kilo-instruction (MAPKI) and composed mix-high from them (two instances each of mcf, milc, leslie3d, soplex, GemsFDTD, libquantum, and lbm plus one instance each of omnetpp and sphinx3). mix-blend is composed of sixteen randomly selected applications (one instance each of perlbench, bzip2, gobmk, dealII, bwaves, zeusmp, sjeng, h264ref, astar, xalancbmk, mcf, milc, GemsFDTD, lbm, omnetpp, and sphinx3). Radix and fft of PARSEC, in-memory hash join [1], and MICA [2] were used for multi-threaded workloads.
Evaluation
We evaluate gains in system-level performance (IPC), energy efficiency (EDP), and memory bandwidth, from OBYST. Fig. 4(a) and (b) show the relative IPC (higher is better) and EDP (lower is better) as well as memory bandwidth (DRAM plus NVM) of multi-programmed and multi-threaded workloads on the simulated system with (a) SC or (b) SR DRAM/NVM hybrid memory system. The baseline (without any bandwidth optimization), SBD [12] , OBYST, and OBYST with our inter-device scheduling policy (IS+OBYST) are compared. Note that the performance of the baseline configuration is shown already in Fig. 2 (SC and SR) and IS+OBYST is only for SR.
Improvements by OBYST
For every simulated workload, memory bandwidth, IPC, and EDP improve in the order of baseline, SBD, and OBYST on both SC and SR. By adopting SBD or OBYST, NVM bandwidth (black bar) is increased by switching clean-hit-requests from DRAM to NVM, while DRAM bandwidth (grey bar) is partially maintained by processing other pending requests instead of the switched clean-hit-requests. Therefore, total bandwidth and IPC as well as EDP are improved. Because SC has lower baseline performance and more room for improvement than SR, the degree of improvement by those schemes on SC is larger than that on SR. For mix-high, a memory-intensive multi-programmed workload, OBYST improves bandwidth, IPC, and EDP by 22% (19%), 21% (20%), and 26% (25%), respectively, over the baseline on SC (SR). For the group of applications with moderate memory intensity (mix-blend, radix, and hashjoin), OBYST improves bandwidth, IPC, and EDP by 9-14% (7-11%), 8-9% (7-8%), and 10-14% (8-12%), respectively. By contrast, for fft and MICA, OBYST improves every metric by only 0-3% because their clean-hit-requests comprise only 0.2% (fft) and 3% (MICA) of all memory requests. For fft, 99% of the read requests that hit on the DRAM cache head to dirty cache lines (i.e., almost none are clean), and for MICA, 89% of read requests miss the DRAM cache (i.e., few are hits).
The proposed inter-device scheduling policy (IS+OBYST) improves IPC by 1% over OBYST in every workload except for mix-blend.
Analysis on limitations of previous work
OBYST outperforms SBD for every workload (by up to 13% in IPC). SBD misses a portion of the good opportunities to send clean-hit-requests to NVM due to the following two limitations. First, Fig. 4(c) shows the number of pending requests to a DRAM channel and its bandwidth (averaged over 1K memory clock cycles) over time, sampled during 50K memory clock cycles in mix-blend on the baseline of SC. The effectiveness of bandwidth balancing in SBD is limited by a lack of correlation between the number of pending requests and bandwidth. When bandwidth utilization is moderate on average, the number of pending requests can fluctuate widely. Moreover, SBD is based on the number of pending requests not per channel but per bank. Second, Fig. 4(d) shows the latency spectrum, including queuing delays, of 1M sampled DRAM reads over the number of pending requests heading to the target bank of each read (mix-blend, baseline of SC). Because actual latency is not exactly proportional to the number of pending requests to the corresponding bank (the same trend is observed with per-channel pending request counts or other schedulers, such as FR-FCFS [16]), the effectiveness of prediction-based latency minimization in SBD is limited. Real latency values heavily depend on current row-buffer states, the row-buffer locality of pending requests, and request scheduling results. Even in terms of latency, our simulation shows that OBYST is more effective than SBD.
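For comparison, the SBD dispatch rule as it is described in this paper (pending requests at the target bank multiplied by a typical per-request latency) can be sketched as follows. This is a simplified reading rather than the implementation of [12]: the function name is hypothetical, the tie-breaking toward DRAM is an assumption, and the default latencies of 1.0 each reflect the latency ratio of '1' used in our environment.

```python
def sbd_target(pending_dram_bank: int, pending_nvm_bank: int,
               typical_lat_dram: float = 1.0,
               typical_lat_nvm: float = 1.0) -> str:
    """Choose the device with the lower predicted latency for a clean-hit read.

    Predicted latency = requests already waiting for the same bank
                        * typical latency of one request on that device.
    Ties go to DRAM (ASSUMED; the text does not specify tie-breaking).
    """
    predicted_dram = pending_dram_bank * typical_lat_dram
    predicted_nvm = pending_nvm_bank * typical_lat_nvm
    return "NVM" if predicted_nvm < predicted_dram else "DRAM"


# The decision follows the instantaneous queue depth of a single bank.
print(sbd_target(pending_dram_bank=5, pending_nvm_bank=2))
print(sbd_target(pending_dram_bank=0, pending_nvm_bank=0))
```

Because the inputs are per-bank queue depths sampled at dispatch time, a momentary dip in pending DRAM requests flips the decision back to DRAM even when the channel remains busy on average, which is the first limitation discussed above.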
Conclusion
In this paper, we have proposed OBYST, which improves the bandwidth of DRAM/NVM hybrid main memory systems by sending the read requests that hit on the DRAM cache to NVM instead of DRAM when the monitored DRAM bandwidth is relatively high. An inter-device request scheduling policy that prioritizes the NVM requests steered by OBYST is also proposed. Our proposals are based on the following observations: 1) a large portion of hit requests head to clean data and can be processed by NVM, 2) real-time bandwidth monitoring is a more direct and reliable metric for inter-device bandwidth balancing (across various channel/rank allocation ratios) than the pending request count used in the state-of-the-art SBD, and 3) although hit requests are sent to NVM, they still deserve high priority. In our evaluation, OBYST improves bandwidth, IPC, and EDP by up to 22%, 21%, and 26%, respectively, over the baseline, with negligible area penalty.
