With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM's. Nevertheless, the latency-sensitive nature of in-memory services makes seamless integration of SCM in servers questionable. In this paper, we ask how best to introduce SCM for such servers to improve overall performance per cost over existing DRAM-only architectures. We first show that even with the best latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deploying a modestly sized high-bandwidth stacked DRAM cache makes SCM-based memory competitive. The high degree of spatial locality that in-memory services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity. We finally perform a case study with PCM, and show that a 2 bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 5% of the best-performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements.
Introduction
For almost 30 years, DRAM has served as the universal standard for memory in mainframes, laptops, and today's datacenters. In the datacenter, however, we are entering a new age in which memory will no longer be composed exclusively of DRAM. Although the interactive needs of online services will continue to dictate that hot data must be kept DRAM-resident, capacity limitations have begun to pressure datacenter operators to investigate emerging technologies that could eventually replace it. Future servers will undoubtedly retain some DRAM for performance optimization, while shifting to denser main memory to hold vast modern datasets.
As dataset sizes continue their exponential growth, datacenter operators and memory architects have been unable to keep pace, obstructed by fundamental limitations on channels per packaged IC as well as intra-channel signal integrity. With the pressure squarely on DRAM manufacturers to deliver DIMMs with ever-increasing capacities, memory has begun to form a significant fraction of server acquisition cost, as high-density components command higher margins and therefore prices. The combination of these two trends has led to a concerted effort to provision memory systems with a reduced cost per bit, markedly reducing expenditure for large-volume deployments.
Emerging storage-class memory (SCM) technologies are a prime candidate to serve as the next generation of main memories, as they boast approximately an order of magnitude greater density than DRAM at a lower cost per bit [16, 38, 48, 56]. These traits come at the price of increased access latencies compared to commodity DRAM, creating new challenges for system designers, as memory latency is a critical factor in datacenter application performance [23]. Given that SCM latencies are typically 4-100× higher than DRAM's [42, 47], naïvely and entirely replacing DRAM with SCM is an unacceptable compromise for datacenter operators.
We submit that DRAM-based memory can be replaced by a memory hierarchy combining high-density SCM with a modestly sized high-bandwidth 3D stacked DRAM cache, with the former component offering cheap capacity, and the latter preserving the low latency and high bandwidth required by online services. Based on the insight that SCM's longer access latencies can be amortized by larger accesses, we show that the cache has to be page-based, with the page size matching the row buffer's size. Such large accesses are also favored by server workloads, which exhibit significant spatial locality [19, 45] . We show that with such a carefully provisioned two-level hierarchy, datacenter operators can preserve the performance of their applications while reducing the memory system's total cost by 40%, and increase performance/cost by 2.6× as compared to today's DRAM-only servers.
The caveat of using DRAM caches is their high design and integration costs [8] , diminishing the returns in cost per bit attained from replacing DRAM-based memory with SCM. At the same time, SCM comes in a variety of flavors, with a typically inverse relationship between performance and density, where higher density yields lower cost per bit. Hence, the design space for two-level hierarchies comprising a DRAM cache and backing SCM memory is very broad, spanning wide performance and cost ranges. To guide system designers through this complex design space, we devise a performance and cost model for different SCM technologies. We then use these models to pinpoint the most cost efficient solutions.
To summarize, our main contributions are the following:
• We analyze emerging SCM devices that are built around traditional DDR DIMMs, and show conclusively that even the fastest among them cannot directly replace DRAM due to their long access latencies.
• We show that the end-to-end application performance of servers equipped exclusively with DRAM can be maintained with SCM-based memory coupled with a modestly sized DRAM cache. Abundant access interleaving among cores requires the cache to be 3D stacked, and the need to amortize SCM's high access latency requires the cache to be page-based. Interestingly, we find that the use of a properly designed page-based cache renders SCM's read/write performance disparity unimportant.
• We devise a performance and cost model to investigate the tradeoffs of a DRAM cache/SCM memory hierarchy and bound the broad design space. Our model facilitates the identification of the most cost-efficient memory hierarchy given the characteristics of an SCM device.
• We deploy our models to conduct a case study on emerging phase-change memory (PCM) devices, and demonstrate that 2-bit cell organizations (MLC) represent the only cost-effective choice, while both 1-bit and 3-bit cells fail to introduce an appealing design point in terms of performance/cost. We show that MLC-based memory hierarchies are 1.4-2.7× more cost-effective than their DRAM-based counterparts.
The rest of the paper is organized as follows: First, we describe emerging SCM technologies in §2 and motivate our insight to amortize SCM latency with coarse-grain transfers. We then analyze server workloads in §3 to show that naïvely replacing DRAM with SCM is impractical without the use of a stacked page-based DRAM cache. As said cache inflates the system's cost, we then outline our performance/cost optimization framework in §4. Based on the methodology presented in §5, we perform a case study with emerging phase-change memory in §6. Finally, we discuss related work in §7 and conclude in §8.
SCM Background
Storage-class memory (SCM) is a term that encapsulates a class of emerging and much-anticipated technologies that are expected to penetrate the computing market in the coming decade [16, 26, 38]. Being slower and denser than DRAM, but faster than Flash while retaining persistence, SCM cannot be strictly classified as either memory or storage, as it has characteristics of both. While the first SCM products were marketed as faster block devices for storage, memory vendors have recently announced SCMs packaged in the dual in-line memory module (DIMM) form factor and using the conventional DDR interface [16], with the ultimate goal of replacing DRAM as main memory. The latter usage of SCM will be far more disruptive for modern servers, as the increased density boasted by SCM devices will translate into a commensurate reduction in the cost of memory.
Although designing SCM DIMMs for compatibility by using the DDR interface will potentially accelerate their adoption, it also introduces performance disparities due to the fundamental differences between the underlying DRAM and SCM technologies. More specifically, the DDR interface specifies that the 64-bit wide channel is driven from a fast SRAM-based row buffer, which stores the most recently used row opened from the data array. Every access to the row buffer is referred to as a burst, where the requested word (64 bits) is selected from the row buffer and driven across the interface. The row buffer operates at the memory channel's clock speed, which must remain the same for SCM and DRAM DIMMs to ensure compatibility with today's integrated memory controllers. Accessing an address that is not currently present in the row buffer (i.e., a row buffer miss) means that the existing row must be closed and the proper row read into the buffer, which is referred to as an activation. Existing data that is dirty must be written back to the data array before a new row is brought in, in a process called write restoration.
Maintaining the same DDR interface and simply swapping the DRAM data array for SCM generates a mismatch in the memory system's characteristics, since every row buffer miss now incurs between 4-100× the DRAM latency [42, 47] to read the SCM data array. Given this elevated disparity between the row buffer and data arrays, the bandwidth of modern SCM devices depends heavily on the fraction of accesses that hit in the row buffer. Figure 1 conceptually demonstrates this behavior, using an example of three write accesses either hitting or missing the same open row buffer. We use writes because clean rows do not incur write restorations in persistent memory. In Figure 1a and Figure 1b , we show the increase in total access latency that results from replacing a DRAM array with SCM. Although the burst latency remains the same (due to the standardized DDR interface), SCM's increased array activation and write restoration latency now dominate the latency of the three accesses. Figure 1c shows how the activation and restoration costs can be amortized when the three writes all hit in the open row, and then are written back together.
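To make Figure 1's comparison concrete, the sketch below computes the aggregate latency of the three writes under each scenario. The activation latencies (14ns for DRAM, 125ns for SCM) are the values used later in this section; the burst and write-restoration figures are placeholder assumptions for illustration only, not device measurements.

```python
# Illustrative latency model for the three-write example of Figure 1.
# Burst and restoration values are placeholder assumptions, not measurements.

DRAM_ACT_NS = 14      # DRAM row activation, as cited in the text
SCM_ACT_NS = 125      # SCM row activation, as cited in the text
BURST_NS = 5          # per-64B burst over the DDR channel (placeholder)
DRAM_RESTORE_NS = 9   # DRAM write restoration (placeholder)
SCM_RESTORE_NS = 150  # SCM write restoration (placeholder)

def three_writes(act_ns, restore_ns, all_hit_same_row):
    """Total latency of three writes that either all miss (each opens its
    own row) or all hit one open row and are restored together."""
    if all_hit_same_row:
        return act_ns + 3 * BURST_NS + restore_ns
    return 3 * (act_ns + BURST_NS + restore_ns)

print("DRAM, misses:", three_writes(DRAM_ACT_NS, DRAM_RESTORE_NS, False))  # Fig. 1a
print("SCM,  misses:", three_writes(SCM_ACT_NS, SCM_RESTORE_NS, False))    # Fig. 1b
print("SCM,  hits  :", three_writes(SCM_ACT_NS, SCM_RESTORE_NS, True))     # Fig. 1c
```

Even with these rough placeholders, the point of Figure 1c is visible: when the writes coalesce in one open row, a single activation and a single restoration are paid for all three accesses.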
Motivated by the performance premium SCM DIMMs place on row buffer hits, we conduct an experiment to compare the average memory access time (AMAT) of representative DRAM- and SCM-based DDR4-2666 DIMMs with 8KB row buffers, varying the size of each memory request. Larger requests serve as a proxy for access patterns that incur more hits in each opened row. Figure 2 shows the results of this experiment. The DRAM's latency quickly becomes bound by the speed of the channel, as the 14ns activation time is amortized with approximately 1KB of data transfer. In contrast, the SCM requires far larger requests to approach the DRAM's AMAT, as its 125ns activation time is approximately an order of magnitude higher than DRAM's 14ns. Unfortunately, the access patterns of real applications are hardly so simple. We therefore conclude that directly replacing DRAM with SCM, using the same DDR interface, places the memory system's performance entirely at the mercy of the applications' access patterns, and whether or not they expose enough row buffer locality. In the next section, we study typical datacenter applications to determine if their memory access patterns express the row locality that SCM DIMMs depend on for high performance.
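The analytical form behind this experiment is simple: on an unloaded DIMM, a request that misses the row buffer costs one row activation plus the time to stream the requested bytes over the channel. The sketch below encodes that model, assuming the nominal DDR4-2666 peak channel rate; the activation latencies are the values cited above.

```python
# A minimal sketch of the Figure 2 experiment: per-request latency on an
# unloaded DIMM = one row activation + time to burst the requested bytes.
# Assumes the nominal DDR4-2666 peak rate on a 64-bit channel.

CHANNEL_GBPS = 2666e6 * 8 / 1e9   # ~21.3 GB/s

def request_latency_ns(request_bytes, activation_ns):
    """Row activation plus channel transfer time for one request."""
    transfer_ns = request_bytes / CHANNEL_GBPS  # bytes / (GB/s) == ns
    return activation_ns + transfer_ns

for size in (64, 256, 1024, 2662, 8192):
    dram = request_latency_ns(size, 14)    # 14ns DRAM activation (from text)
    scm = request_latency_ns(size, 125)    # 125ns SCM activation (from text)
    print(f"{size:5d}B  DRAM {dram:6.1f}ns  SCM {scm:6.1f}ns  ratio {scm/dram:.2f}x")
```

At roughly 2.6KB streamed per opened row, this simple model yields an SCM-to-DRAM ratio of about 1.8×, in line with the unloaded gap discussed in §3.1.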
Designing SCM For Server Workloads
In this section, we investigate the feasibility of replacing DRAM with SCM for servers. We find that direct replacement results in unacceptable performance degradation, but the use of a modestly sized high-bandwidth DRAM cache makes the replacement viable. We also find that the cache's organization is critical, with page-based caches representing a clearly superior choice.
Direct replacement of DRAM with SCM
Because server workloads are largely memory-bound [11, 23], any compromises caused by replacing DRAM with SCM will manifest themselves directly in application performance. To validate this hypothesis, we use a number of representative server workloads from the CloudSuite benchmark suite [11] and evaluate them with standard DRAM-based memory and an SCM-based alternative. We model an SCM with characteristics matching those of latency-optimized existing prototypes, to provide an upper bound on the performance of an SCM-based DIMM. Our application performance metric is user-level instructions per cycle (U-IPC), which has been shown to accurately reflect server throughput [46]. Methodology details can be found in §5. Figure 3 shows the performance of server workloads for SCM-based main memory, normalized to the performance on a DRAM-based configuration. The results show that naïvely replacing DRAM with SCM results in a severe performance degradation of 60% on average.
To identify whether the main cause of such degradation is SCM's reduced bandwidth or its increased latency, we study the row buffer hit ratio. We find that 31% of memory accesses result in a row buffer hit on average, corroborating prior results [45]. Given the 8KB row buffer size used, a 31% hit ratio corresponds to 2.6KB accessed per row buffer activation. From Figure 2, we see that at this request size, an unloaded SCM device provides an AMAT that is 1.79× higher than DRAM's.
Comparing these results to the AMATs we measure during Figure 3 's experiments, the loaded SCM system has an AMAT which is 4.2× higher than the DRAM system. We attribute this increased disparity in the loaded system to more pronounced queuing effects in the case of SCM, because of its higher row activation time ( Figure 1 ). In the SCM-based system, more memory transactions are serialized in the banks' command queues due to the slower data arrays, thus placing a paramount importance on row buffer hits, which are serviced at latencies equal to DRAM.
Prior work that has studied and improved row buffer locality in scale-out workloads has reported that an ideal memory scheduling system can achieve row buffer hit ratios of 77% [45], a 2.5× increase over what we observe. However, the same work also demonstrated abundant spatial locality in the applications themselves, with many pages incurring hits to over 50% of their constituent cache blocks during their time in the LLC. Unfortunately, because the memory channels in a modern server processor are multiplexed between many CPU cores, interleaved access streams destroy a large fraction of whatever row buffer locality is inherent in the application's access patterns when executing in isolation [45]. This constraint means that the spatial locality in these applications cannot be captured at the row buffer level alone. A naïve conclusion from our observations tying SCM performance to row buffer size is that memory architects should simply design SCM DIMMs with larger row buffers, and improve memory scheduling to exploit the locality therein. However, the write programming process of various SCM technologies precludes us from constructing such large rows, due to limitations on the amount of current that can be driven into the data cells during write restoration [13, 20]. Current memory technology constrains SCM row sizes to 512-2048B [24, 42], and even with perfect spatial locality, these small rows do not provide enough opportunity to fully amortize the SCM's longer latencies. Techniques for row buffer optimization [41, 45] can only provide a maximum row buffer hit rate of 50%, which unfortunately lags far behind the hit rates required to guarantee application performance. These fundamental SCM limitations, combined with the limited scheduling scope in the memory controllers, lead us to conclude that SCM cannot serve as a drop-in DRAM replacement.
DRAM caches to the rescue
State-of-the-art server memory hierarchies include 3D stacked DRAM caches [19, 44] , such as Hynix' HBM [3] or Micron's HMC [30] , which can be leveraged to mitigate SCM's long access latency problem. Applied in our context, these 3D stacked caches have the opportunity to solve both problems we have previously discussed, namely low row buffer hit ratio and increased SCM-inflicted AMAT. First, a well designed 3D DRAM cache enables the majority of memory accesses to be serviced at DRAM latency rather than requiring an SCM activation. Second, setting cache block size to be equal to the backing SCM's row buffer size means that the application's spatially localized accesses can be aggregated over the block's lifetime in the DRAM cache; when the block is evicted and written to the backing SCM, a far greater fraction of the row buffer is actually used than if the row was repeatedly opened and closed in the SCM. This access coalescing has the same effect as providing near-ideal access scheduling without the requirements for complex reordering logic, and amortizes the SCM's latency as we showed is necessary in §2.
Due to stacked DRAM's much higher cost compared to planar DRAM, and its increased integration complexity, we must be judicious about its architecture and provisioning. Prior work in this field has shown that DRAM caches of this nature must be constructed from 3D stacked DRAM rather than conventional planar DRAM, as the internal bandwidth of planar DRAM is not sufficient to serve the abundant memory traffic generated by server applications [19, 44]. Given that the primary benefit of SCM in scale-out deployments is a reduction in total memory cost, we now present a short design exploration of the 3D stacked DRAM cache. From this point onwards, we omit the "3D stacked" qualifier and refer to it simply as a DRAM cache.
There are three main parameters that define a (DRAM) cache's effectiveness: associativity, capacity, and block size. Prior work studying DRAM caches for server workloads has shown that associativity requirements are modest, with minuscule performance improvements beyond 4 ways [19] . Regarding capacity, prior work has advocated that the working set of server workloads represents 10-15% of the total dataset size, hence the DRAM cache capacity should account for that fraction of the backing memory [44] . However, as SCM provides approximately an order of magnitude more capacity per module than DRAM, building a DRAM cache with 10% of the aggregate SCM capacity is implausible due to limitations on packaging and cost.
In order to investigate whether realistically sized DRAM caches can capture the working sets of server applications, we use a trace simulator based on Flexus [46] , conducting a classical miss ratio study where we sweep the DRAM cache's capacity and search empirically for the "knee of the curve". We model a fully associative DRAM cache with varied capacity and block sizes, and display the results in Figure 4 .
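For reference, a minimal sketch of such a miss-ratio study is shown below: a fully associative, LRU-managed cache replayed over a memory trace, swept across block sizes and capacities. The synthetic trace at the bottom is purely illustrative; our actual study uses Flexus-generated traces.

```python
# Fully associative LRU cache replayed over an address trace, used to sweep
# DRAM cache block size and capacity and locate the knee of the miss curve.
from collections import OrderedDict

def miss_ratio(addresses, capacity_bytes, block_bytes):
    """Fraction of accesses that miss a fully associative LRU cache."""
    num_blocks = capacity_bytes // block_bytes
    lru = OrderedDict()           # block tag -> None, ordered by recency
    misses = 0
    for addr in addresses:
        tag = addr // block_bytes
        if tag in lru:
            lru.move_to_end(tag)  # hit: refresh recency
        else:
            misses += 1
            lru[tag] = None
            if len(lru) > num_blocks:
                lru.popitem(last=False)  # evict the least recently used block
    return misses / len(addresses)

# Example sweep over a synthetic trace (a real study would replay a full
# application trace instead).
trace = [i * 64 % (1 << 20) for i in range(100_000)]
for block in (256, 1024, 4096):
    for cap_mb in (1, 4, 16):
        print(block, cap_mb, miss_ratio(trace, cap_mb << 20, block))
```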
There are two main phenomena that manifest themselves in these results. First, for all of the block sizes shown, a cache provisioned with approximately 2-4% of the backing SCM's capacity sits at the knee of the curve and therefore represents the sweet spot. Such capacity is reasonable, as existing stacked DRAM products feature capacities up to 8GB, and industry projections expect 64GB by 2020 [5] .
Second, we note significantly reduced cache miss ratios with larger DRAM cache block sizes. For example, Web Search's miss ratio drops from 14.5% to less than 1% as the block size increases from 256B to 4KB. Using larger blocks allows the DRAM cache to amortize the cost of accessing the high-latency backing SCM, as every miss now loads larger chunks of data that will likely be accessed in the future. We take these results as further evidence that our set of server workloads exhibits significant spatial locality, but we need a larger temporal window to capture it than the one offered by a row buffer. The DRAM cache serves that exact purpose, coalescing accesses within large blocks of data that, upon eviction, amortize the cost of an SCM row activation and write restoration, as illustrated in §2.
Following terminology common in the literature, we argue that DRAM caches should be architected as page caches [21, 25, 44] rather than block caches, where the term page refers to a cache block size significantly larger than the typical 64B. Page-based caches are superior due to the much lower miss ratios exhibited when the cache block size exceeds 1KB. With a block-based cache, misses to each small block will once again be serialized by the SCM. Existing DRAM caches that cache small blocks, typically equal to the processor's cache block size, are unsuitable for SCM-based memory hierarchies [15, 27, 34, 39]. Using a page-based DRAM cache that captures spatial locality harmonizes with our findings in §2, which demonstrate the importance of amortizing long SCM activations with accesses to spatially local data.
We further justify this choice with a direct study on SCM's latency amortization opportunity as a function of the DRAM cache block size. Figure 5 displays the density of regions being evicted from the DRAM cache and written to the backing SCM. We define the density of a region as the fraction of 64B sub-blocks that are accessed during its lifetime in the cache, and sweep different region sizes on the x-axis.
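A minimal sketch of this density measurement is shown below. It assumes a simplified trace format (the byte addresses touched while regions are cache-resident, plus the set of regions eventually evicted); the real measurement is taken inside our trace simulator.

```python
# Density of a region = fraction of its 64B sub-blocks touched during its
# lifetime in the DRAM cache. Trace format here is a simplifying assumption.

def region_densities(accesses, region_bytes, evictions):
    """accesses: iterable of byte addresses touched while cached.
    evictions: set of region tags that were eventually evicted to SCM.
    Returns one density value per evicted region."""
    sub_blocks_per_region = region_bytes // 64
    touched = {}  # region tag -> set of touched 64B sub-block indices
    for addr in accesses:
        tag = addr // region_bytes
        touched.setdefault(tag, set()).add((addr % region_bytes) // 64)
    return [len(touched.get(tag, set())) / sub_blocks_per_region
            for tag in evictions]

# Example: 2KB regions; region 0 has 3/32 sub-blocks touched, region 2 has 1/32.
print(region_densities([0, 64, 128, 4096], 2048, evictions={0, 2}))
```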
All of the applications exhibit similar behavior, albeit grouped into two different clusters. As the region size increases, density naturally drops. While most of the applications exhibit densities in excess of 70% between 512B and 2KB region sizes (which correspond to a typical SCM row buffer), Web Serving and Data Analytics have much more sparse traffic patterns, with 15% less density than the others. Comparing those two applications to the miss curves in Figure 4 , we see that these same two applications are the least sensitive to the cache block size. Data Analytics is particularly agnostic to the cache block size, incurring the smallest decrease in miss ratio, due to the fact that it has less innate locality inside each opened row.
By synthesizing the results in Figures 4 and 5 , we explicitly argue that the DRAM cache's block size should match the SCM's row buffer size. Matching these two parameters allows the DRAM cache to coalesce accesses together and therefore amortize the elevated activation and restoration latencies of the backing SCM. Figure 5 essentially shows that the opportunity presented in Figure 2 is attainable, thanks to the combination of the workloads' access patterns with a page-based DRAM cache.
Finally, we present end-to-end application performance results in Figure 6 for a system whose memory hierarchy features a DRAM cache sized at 3% of the backing SCM 3 . Performance is normalized to a 4-way DRAM-based system without a DRAM cache, and we also compare to a DRAM-based system featuring the same DRAM cache as the SCM-based system, labeled "DRAM". Across the board, the SCM-based system performs better with larger cache block sizes, until an inflection point appears at 2KB blocks. This limitation occurs due to overfetching with 4KB blocks, causing bandwidth contention in the SCM, thus setting an upper bound on the DRAM cache block size.
Putting all of our observations together, we present three key design guidelines for memory hierarchies that pair DRAM caches with backing SCM in order to preserve the performance of critical memory accesses. First, the performance/cost optimum for the DRAM cache falls at approximately 3% of the backing SCM's size, and the cache should necessarily feature large blocks (512B-2KB) to capture the spatial locality present in server workloads and amortize the high SCM access latency. We find that a page cache with 2KB pages (i.e., cache blocks) hits the sweet spot between hit ratio and bandwidth waste due to overfetch. Second, the SCM's row buffers should be the maximum size permitted by the underlying memory technology, to maximize the potential for latency amortization, up to a maximum of 2KB to avoid overfetching (Figure 6). Finally, if the SCM row buffer is smaller than 2KB because of technology limitations, the DRAM cache's block size should match its size, as the opportunity for amortization is bounded by the row buffer's size.
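These guidelines can be condensed into a small provisioning helper, sketched below; the 3% capacity ratio and the 2KB overfetch cap are the values derived in this section, not universal constants.

```python
# Provisioning helper encoding the design guidelines of this section:
# cache sized at ~3% of backing SCM, cache block matched to the SCM row
# buffer, and both capped at 2KB to avoid overfetch.

def provision_hierarchy(scm_capacity_gb, scm_row_buffer_bytes,
                        cache_fraction=0.03, overfetch_cap_bytes=2048):
    row_buffer = min(scm_row_buffer_bytes, overfetch_cap_bytes)
    return {
        "dram_cache_gb": scm_capacity_gb * cache_fraction,
        "dram_cache_block_bytes": row_buffer,   # match the SCM row buffer
        "scm_row_buffer_bytes": row_buffer,
    }

print(provision_hierarchy(512, 1024))  # e.g., 512GB SCM with 1KB row buffers
```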
The SCM Cost-Performance Tradeoff
In the previous section, we demonstrated that an SCM-based system is able to attain performance competitive with a DRAM-based one, with the addition of a DRAM cache. However, 3D stacked DRAM technology costs at least an order of magnitude more per bit than SCM, conflicting with the initial motivation of replacing DRAM with SCM as a more cost-efficient memory. Hence, whether the resulting memory hierarchy represents an attractive solution depends on whether the cost reduction from replacing DRAM with SCM offsets the additional cost of a DRAM cache. The challenge is designing an SCM that is cheap enough to achieve that, while preserving its performance at acceptable levels. In this section, we trim the broad design space by identifying the key parameters that define SCM's performance.
In general, the denser the SCM, the lower its cost per bit [29, 33, 40] . We therefore use SCM density as a proxy for SCM cost. Unfortunately, common density optimizations, like storing multiple bits per cell, or vertical stacking of multiple cell layers, result in higher access latency, lower internal bandwidth, and potentially higher read/write disparity [29, 38, 42, 47, 50] . The goal is to deploy the densest possible SCM while respecting the end-to-end performance goals. Therefore, solving the cost-performance optimization puzzle requires SCM designers to understand which parameters affect end-to-end performance the most, and by how much.
We identify read latency, write latency, and row buffer size as the three SCM design parameters that control end-to-end application performance. Read latency (i.e., SCM row activation delay) directly affects each memory access' critical path, while high write latency (i.e., write restoration delay), even though off the critical path, implicitly hurts memory access latency because it may cause head-of-line blocking delays inside an SCM DIMM [4, 32]. Finally, as discussed previously, the row buffer's size defines the extent to which SCM's high access latency can be amortized.
Putting all three parameters together-row buffer size, read latency, and write latency-we devise a 3D SCM design space, illustrated in Figure 7 . All the SCM configurations that satisfy the performance target reside inside the volume shaped like a truncated triangular pyramid. The SCM devices with the lowest read and write latencies lie close to the vertical axis. All designs for a given row buffer size are represented by a horizontal cut through the pyramid, and the resulting plane indicates the space of all read and write latencies that are tolerable with that row buffer size. The pyramid's truncated top is defined by the smallest row buffer size that is sufficient to amortize the SCM's row activations ( §3); on that plane, only the fastest SCM devices are acceptable, which are unlikely to deliver the desired high density. Growing the row buffer size (and implicitly the SCM's bandwidth) widens the design space, as increased amortization opportunity reduces the overall system's sensitivity to high activation latency.
Our model can help device designers reason about the feasibility of employing different SCM technologies to serve as main memory. For example, with multi-level cells, designers may deploy smaller serial sensors to optimize for higher density, by sacrificing read latency [50] . Another example is the write latency/bandwidth tradeoff, where designers may choose a different cell writing algorithm, optimizing either for low latency or high bandwidth based on their performance needs. Fewer high-current write iterations result in faster writes, but place an upper limit on the row buffer size because of fundamental limitations on the current that can be driven through the data array at any given time [54] . A general observation from our model is that design planes with bigger row buffers are appealing, as they widen the design space and offer better opportunities for high SCM latency amortization.
To summarize, we analyzed three SCM parameters, namely the row buffer size and read/write latencies, and devised a unified performance model. In the following sections, we instantiate this model and evolve it from qualitative to quantitative, using samples drawn from a wide range of existing SCM configurations. We then use our model to perform a case study of four representative PCM configurations, and compare memory hierarchies that deploy each of these configurations on the metric we are ultimately interested in, namely performance per cost.
Methodology
In this section, we describe the organization of each system we model throughout this paper, provide the details of our simulation infrastructure, state the performance and cost assumptions, and finally list the parameters we use for our case study with phase-change memory.
System organization. Next-generation server processors will feature many cores to exploit the abundant request-level parallelism present in online services [11, 28]. Recent server chips follow this very trend: AMD's Epyc features 32 cores [12], Qualcomm's Centriq 48 cores [9], and Phytium's Mars 64 cores [52]. To make simulation turnaround time tractable, we model a server with 16 cores and a single memory channel, representing a scaled-down version of the OpenCompute server roadmap, which calls for 96 cores and 8 memory channels [31]. We configure the DRAM cache's size as 3% of the workload's dataset in order to achieve the optimal cache-to-memory-capacity ratio (see §3.2). The DRAM cache is connected to the chip over a SerDes link, while main memory is attached over a conventional DDR4-2666 channel [2], as illustrated in Figure 8.
Workloads and metrics. We evaluate server workloads from CloudSuite [11], including Data Serving, Web Search, Data Analytics, and Web Serving. We measure workload performance in user IPC (U-IPC), defined as the ratio of user instructions committed to the total number of cycles spent in both user and kernel space. Prior work [49] has shown U-IPC to be a metric representative of application throughput. Our datasets are in the 16-32GB range and fully fit in the DIMMs we model. To explore how our DRAM cache design would change with larger datasets that are impractical to simulate, we used analytical models to determine how the required cache-to-memory capacity ratio would change as the datasets scale into the TB range. A key input to our models was a representative query distribution that accurately reflects the skewed popularity phenomenon in datacenter applications. We used the canonical Zipfian distribution and studied coefficients (α) from 0.9 to 0.99, observing that with a 100-fold increase of the dataset size, the fraction of total data that is resident in the DRAM cache (i.e., the popular keys) decreases by 6-32%. Therefore, we expect our scaled-down system's performance to be representative of applications with larger datasets, as growing datasets do not significantly affect the disparity between hot and cold data and lead to even higher data locality, resulting in DRAM caches that are an even smaller fraction of the backing memory's capacity than what we assume.
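The analytical scaling study can be sketched as follows: for a Zipfian popularity distribution with coefficient α, compute the fraction of keys that must stay cache-resident to cover a target fraction of accesses, and compare dataset sizes 100× apart. The 90% coverage target and the key counts below are illustrative assumptions, not the exact parameters of our study.

```python
# Zipfian hot-data sizing sketch: fraction of keys needed to cover a target
# fraction of accesses, for two dataset sizes 100x apart.
import numpy as np

def hot_fraction(num_keys, alpha, coverage=0.90):
    """Smallest fraction of keys whose cumulative Zipf(alpha) popularity
    reaches `coverage` of all accesses."""
    ranks = np.arange(1, num_keys + 1)
    probs = ranks ** -alpha
    probs /= probs.sum()
    cumulative = np.cumsum(probs)
    needed = np.searchsorted(cumulative, coverage) + 1
    return needed / num_keys

for alpha in (0.90, 0.99):
    small = hot_fraction(10**5, alpha)   # illustrative small dataset
    large = hot_fraction(10**7, alpha)   # 100x larger dataset
    print(f"alpha={alpha}: hot fraction {small:.4f} -> {large:.4f}")
```

The observable trend is the one used in the text: as the dataset grows, the popular keys become a smaller fraction of it, so a fixed-ratio DRAM cache becomes relatively easier, not harder, to provision.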
Simulation infrastructure.
We use the Flexus [46] full-system cycle-accurate simulator coupled with DRAMSim2 [36], and employ the SMARTS sampling methodology [49]. To extend DRAMSim2 to support non-uniform SCM access latencies, we adjusted its tRCD and tWR parameters and added SCM-related parameters (tRRDpre and tRRDact), similarly to the models used by prior work [4, 24]. To simplify our explanations, we refer to the read and write latencies of the SCM device as tRCD and tWR, respectively, as they define the major part of the data array's access time.
Without loss of generality, we consider a DRAM cache with tags in SRAM [44]. For the DRAM cache's memory controller, we use a critical-block-first policy and an FR-FCFS open-row schedule with page-based interleaving. We assume that each SCM is packaged in a DIMM form factor. To model different SCM configurations, we replicate expected performance and cost characteristics from recent prototypes [22, 38, 47, 48, 56]. For the SCM's controllers, we model an FR-FCFS open-row policy with page-based interleaving, which is optimized for bulk transfers (§2). We size the write buffer according to the number of banks, with each write entry equal to the page size. Table 1 summarizes our simulation parameters.
Table 1 (excerpt): timing parameters.
DRAM (DDR4-2666): tCAS-tRCD-tRP-tRAS-tRC = 14-14-14-24-38; tWR-tWTR-tRTP-tRRD = 9-6-3-3.
SCM (DDR4-2666, 512-2048B row buffer): tCAS-tRCD-tRP-tRAS-tRC = 14-t_read-14-24-t_read; tWR-tWTR-tRTP = t_write-6-3; tRRDpre-tRRDact = 2-11.
Phase-change memory assumptions
PCM is generally considered the most mature SCM technology, as its performance, density, and endurance characteristics are well studied. Additionally, industry has built reliable single-level and multi-level cell (up to 3 bits/cell) configurations. We assume a typical PCM cell and project its performance characteristics for single-level (SLC), multi-level (MLC), and triple-level cells (TLC), which provide 1, 2, and 3 bits/cell, respectively. For the baseline SLC-PCM configuration, we assume a 60ns read latency and a 150ns write latency. Based on a survey of recent PCM prototypes [42], we assume a maximum row buffer size in SLC-PCM of 1024B. Assuming the same cell material, we project MLC-PCM to operate with a 120ns read latency and a range of possible write latencies, depending on the algorithm used for cell writing. Prior work [54] has described two ways to program an MLC cell. The first approach, which we call MLC_lat, favors faster writes, resulting in write latencies of 550ns and 512B row buffers. The second approach, which we call MLC_BW, favors higher bandwidth, resulting in write latencies of 1000ns and 1024B row buffers. Finally, we project the specifications of TLC-PCM based on a recent industrial prototype [6, 40], and assume read and write latencies of 250ns and 2350ns, respectively. For the row buffer size, we optimistically assume 512B.
Cost model. To evaluate the cost of the memory subsystem, we build a model for both planar and 3D stacked DRAM, as well as for SCM of different densities. We compare different technologies according to their expected cost/bit, normalizing to the same total capacity. Taking planar DRAM's cost/bit as the baseline, we project stacked DRAM's cost/bit to be 7× that of planar DRAM, as cooling and bonding costs increase for stacked dies [8]. Due to the higher manufacturing costs that stem from the immaturity of PCM technology [48, 56], we conservatively assume that the cost/bit of SLC-PCM is equal to that of commodity planar DRAM. We then assume a linear cost reduction for MLC- and TLC-PCM, proportional to the number of stored bits per cell (i.e., 50% and 25% of the cost for 2 and 3 bits/cell, respectively). Table 2 summarizes the performance and cost assumptions for all considered technologies: planar DRAM, stacked DRAM, and the four aforementioned PCM configurations (SLC, MLC_lat, MLC_BW, and TLC).
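A minimal sketch of the cost model follows, with all cost ratios normalized to planar DRAM per the assumptions above; the 256GB capacity is an arbitrary example value.

```python
# Relative memory-subsystem cost: backing memory plus a stacked-DRAM cache
# sized as a fraction of the backing capacity. Ratios follow the assumptions
# stated in this section, normalized to planar DRAM = 1.0 per bit.

COST_PER_BIT = {
    "planar_dram": 1.0,
    "stacked_dram": 7.0,  # projected 7x planar (cooling/bonding overheads)
    "slc_pcm": 1.0,       # conservatively equal to planar DRAM
    "mlc_pcm": 0.5,       # 2 bits/cell -> half the cost per bit
    "tlc_pcm": 0.25,      # 3 bits/cell -> a quarter of the cost per bit
}

def hierarchy_cost(backing_tech, capacity_gb, cache_fraction=0.03):
    """Cost (in planar-DRAM-GB equivalents) of backing memory plus a
    stacked-DRAM cache sized at cache_fraction of the backing capacity."""
    backing = capacity_gb * COST_PER_BIT[backing_tech]
    cache = capacity_gb * cache_fraction * COST_PER_BIT["stacked_dram"]
    return backing + cache

capacity = 256  # GB, illustrative
baseline = capacity * COST_PER_BIT["planar_dram"]  # DRAM-only, no cache
for tech in ("slc_pcm", "mlc_pcm", "tlc_pcm"):
    print(f"{tech}: {hierarchy_cost(tech, capacity) / baseline:.2f}x DRAM-only cost")
```

Under these ratios, an MLC-based hierarchy with a 3% stacked-DRAM cache costs roughly 70% of an equal-capacity planar-DRAM-only system, broadly consistent with the ~28% reduction relative to planar DRAM reported in §6.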
Evaluation
Using our simulation infrastructure, we model a wide range of SCM configurations to quantify the SCM design space we previously described in our SCM performance model ( §4).
We study a variety of combinations of row buffer sizes and read/write latencies. For each configuration, we set the DRAM cache block size equal to the SCM's row buffer size. Then, we use the resulting model to conduct a case study about the feasibility of the four different PCM configurations from the performance and cost perspectives, based on the assumptions summarized in Table 2 . Each different row buffer size configuration is depicted by a diagonal line that separates the configurations that satisfy the performance target from those that do not (i.e., design points that fall inside or outside the pyramid's volume). We set the lower bound for the SCM-based memory hierarchy's performance target as follows: the SCM-based system should be at least within 90% of the best DRAM-based system's performance, for every one of the evaluated workloads. We find that the best DRAM-based memory hierarchy is the one with a page-based DRAM cache of 2KB blocks.
Design space exploration
In Figure 9 , the points below each diagonal line satisfy the performance target. For example, with a row buffer size of 512 bytes, the slowest configurations that match the performance target are the skewed SCM configuration with 125ns read and 500ns write latencies, and the symmetric configuration of 250ns read and write latencies.
As we explained in §4, the row buffer size sets the upper bound for SCM access latency amortization, and is therefore, implicitly, the defining parameter for the upper bound of acceptable SCM latencies. Increasing the row buffer size from 512B to 2KB expands the design space linearly. Hence, the maximum read and write latencies meeting our performance target increase proportionally to the row buffer size. For example, sweeping the row buffer size from 1KB to 2KB, the maximum acceptable read latency increases from 250ns to 500ns, while the maximum allowed write latency grows from 1 to 2µs. This relation between the maximum allowed latency and row buffer size demonstrates the efficiency of amortizing longer SCM latencies over multiple accesses within a large row buffer.
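The quantified frontier can be summarized by a simple fit: the passing configurations reported above (125ns/500ns and 250ns/250ns at 512B, 250ns/1µs at 1KB, 500ns/2µs at 2KB) all lie on a boundary where reads are weighted roughly twice as heavily as writes and the latency budget scales linearly with the row buffer size. The sketch below encodes that fit; it is an empirical summary of our results, not a derived law.

```python
# Empirical fit to the design-space frontier reported in this section:
# a configuration meets the performance target if 2*read + write stays
# within a budget that scales linearly with the row buffer size.

def meets_target(row_buffer_bytes, read_ns, write_ns):
    """True if (read, write) falls inside the feasible region for the
    given row buffer size, per the fitted frontier."""
    budget_ns = 750 * (row_buffer_bytes / 512)  # 750ns budget at 512B rows
    return 2 * read_ns + write_ns <= budget_ns

print(meets_target(512, 125, 500))    # True: skewed 512B configuration
print(meets_target(512, 250, 250))    # True: symmetric 512B configuration
print(meets_target(1024, 250, 1000))  # True: on the 1KB frontier
print(meets_target(512, 250, 500))    # False: too slow for a 512B row buffer
```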
Application performance turns out to be much less sensitive to slow writes, as compared to reads, because writeback traffic is not directly on the critical path of memory access. This leads us to the important conclusion that SCM's inherent read/write performance disparity is a secondary concern for SCM hierarchy design, as a carefully organized DRAM cache enables significant overall system performance tolerance to both read and write latencies.
Growing the row buffer and DRAM cache block size beyond 2KB is not worthwhile, since some applications do not take advantage of the additional data fetched into the DRAM cache. For example, Data Serving fails to satisfy our performance target with blocks larger than 2KB, even for DRAM-based systems, as we saw in Figure 6. For the rest of the workloads, growing the row buffer size to 4KB widens the design space further, up to 1µs read and 4µs write latencies (not shown in Figure 7). However, most of the workloads experience performance degradation with cache blocks and row buffers sized at 4KB, as compared to the corresponding 2KB configuration. For example, for the skewed configuration with 125/2000ns read and write latencies, increasing the row buffer from 2KB to 4KB leads to a mean degradation of 3%, whereas Data Serving's performance degrades by 9%. As a result, designers should consider slower memory with a row buffer larger than 2KB only if their applications exhibit that amount of spatial locality.
To summarize, we quantified the frontier that separates plausible SCM configurations from those that are not able to reach the performance target. We demonstrated that a bigger row buffer and corresponding DRAM cache block size widen the SCM design space, albeit only within the range that corresponds to the spatial locality exhibited by the applications' access patterns. Thus, increasing SCM row buffers beyond 2KB is not beneficial for some of the applications we evaluated, which exhibit lower spatial locality. Finally, we make the interesting observation that a page-based design efficiently mitigates conventional SCM read/write latency disparity, eliminating the need for any additional disparity-aware mechanisms.
Case study with phase-change memory
We now demonstrate the utility of our performance model by using it to reason about the implications of a number of plausible PCM configurations on overall system performance and cost. We evaluate the economic feasibility of the SLC, MLC_lat, MLC_BW, and TLC PCM configurations introduced in §5.1. Figure 10 shows all four configurations as points, according to their assumed read and write latencies. Points with no fill represent configurations with a 512B PCM row buffer, while filled points depict configurations with a 1024B row buffer. Similarly to Figure 9, diagonal lines bound the configurations that match the 90% performance target, according to the corresponding PCM row buffer sizes. For all the configurations, we model an SCM hierarchy with a page-based DRAM cache, sized at 3% of the application dataset, and organized in pages equal to the row buffer size. However, as the TLC-PCM-based hierarchy fails to deliver acceptable performance, we also evaluate TLC-PCM-based configurations with 2- and 4-fold larger caches, as TLC-PCM's low price (25% of DRAM) provides some cost flexibility. Using the cost model described in §5.1, we summarize the performance results for each system along with its overall cost in Table 3. The SLC-PCM configuration we consider achieves performance within 2% of the best DRAM configuration with a DRAM cache. Although this is well within our performance target (>90%), SLC-PCM's cost/bit is too high to offset the expense of adding the DRAM cache.
For MLC-PCM, we consider two alternative configurations: MLC_lat and MLC_BW, which are optimized for low write latency and high internal bandwidth, respectively. The row buffer sizes of these configurations are 512B and 1KB, respectively. According to the model in Figure 10, both configurations deliver performance above the 90% target. Although MLC_lat outperforms MLC_BW by 3% on average (1.96 vs. 1.9), designers may prefer MLC_BW as it offers a few orders of magnitude higher lifetime, as demonstrated by prior work [54]. As the cost/bit of MLC-PCM is half that of planar DRAM, the overall performance/cost improves by ∼40% compared to the DRAM-based system with a similar-capacity DRAM cache (2.64/2.72 vs. 1.63). Compared to planar DRAM, MLC-PCM improves performance/cost by 2.6-2.7×, reducing overall memory cost by 28%.
Finally, we consider a TLC-PCM configuration with three different DRAM caches, sized at 3%, 6%, and 12% of the dataset. Figure 11 demonstrates that TLC-PCM can only satisfy the performance target with the largest of the three, bringing the overall memory hierarchy's cost close to that of the baseline DRAM+cache system. Given its marginal improvement in performance/cost, as well as TLC's inherently worse endurance reported by prior work [40], we conclude that TLC-PCM is unable to serve as a viable main memory technology for server applications. That conclusion is reinforced by the clear superiority of the MLC-based alternatives.
In summary, we used our performance/cost model to conduct a case study that considered a number of PCM configurations featuring different cell types. We showed that the configuration that stores 2 bits per cell (MLC) significantly boosts the memory hierarchy's performance/cost, whereas the configurations that store 1 and 3 bits per cell (SLC and TLC) are not viable building blocks for memory hierarchies targeting in-memory services.
Related Work
Our work draws inspiration from extensive studies in the fields of server architecture and memory systems. In this section, we discuss the relationship between our work and prior proposals, which span three main areas.
DRAM caches for servers. Previous studies have leveraged the wide high-speed interface and highly parallel internal structure of stacked DRAM caches to mitigate the "memory bandwidth wall" found in server applications [44]. Block-based organizations [15, 27, 34, 39] tend to perform better in the presence of temporal locality, while page-based ones [21, 25, 44] favor applications with spatial locality. Scale-out workloads tend to be characterized more by spatial locality [45], motivating the use of page-based caches [44]. However, increasing core counts in servers introduce bandwidth concerns as well, rendering simple page-based designs that overfetch data suboptimal. The Footprint [19] and Unison [18] cache designs mitigate the overfetch problem by leveraging an access-pattern footprint predictor, at the cost of slightly increasing the DRAM cache miss ratio. Our work extends these observations to SCM hierarchies, and shows that DRAM caches in our context should also be page-based, as their large unit of transfer amortizes the high SCM access latency. The increased cost of DRAM cache misses does not justify the design decisions of the Footprint and Unison caches: as latency is more valuable than bandwidth in our setting, sacrificing hit ratio to reduce overfetch does not pay off.
Other researchers have proposed to mitigate long SCM latencies by using conventional planar DRAM DIMMs for hardware-managed caches [35] and OS-based page migration [14] . Applying these designs in the context of scale-out workloads will expose the lack of internal parallelism in planar DRAM devices [44] , leading to excess request queuing and therefore latency inflation. Our work shows that SCM hierarchies require high bandwidth page-based caches, with high frequency SerDes interfaces [17] to offer competitive performance in scale-out servers. We highlight the need for bigger row buffers, i.e., optimizing the SCMs for higher internal bandwidth, as opposed to minimal latency which unnecessarily sacrifices density.
Optimizing SCM devices. Since SCM write bandwidth is heavily constrained by current limitations inside the DIMM, industry prototypes have limited-sized row buffers [42] . In order to reduce peak write power, prior work uses a technique called differential writes, that detects the subset of bits that actually change their values during a write restoration, which are often as few as 10-15% [24, 35] . This technique shrinks the effective write current and enables greater row buffer sizes, which is critical to our techniques in this paper. Fine-grained power management techniques at the DIMM level have a similar goal but operate above the circuit and cell level [13, 20] , and mainly focus on manipulating the limited power budget.
To reduce SCM DIMM latency through the use of SRAM row buffers, Yoon et al. proposed a row buffer locality aware caching policy for heterogeneous memory systems [51] , allocating addresses that cause frequent row buffer misses in DRAM. Lee et al. suggested architecting SCM row buffers as small associative structures to leverage applications' temporal locality, meaning that recently closed rows would not be immediately written to the data arrays, but kept in a "backup" row buffer [24] . However, server applications exhibit poor temporal but abundant spatial locality. Volos et al. [45] proposed proactive data fetch and writeback mechanisms in the on-chip last-level cache, to stream memory traffic in bulk and take advantage of this spatial locality. We also take advantage of spatial locality in server workloads, and design our DRAM cache to capture it with the goal of amortizing bulk transfers to/from our SCM devices.
Tackling the SCM read/write disparity. As most SCM technologies show significant disparities in read and write latencies [29, 47, 50] , prior work has proposed various mechanisms to mitigate the effects of slow SCM writes. At the application level, researchers have suggested replacing conventional software algorithms with the ones that generate less write traffic [7, 43] . At the hardware level, Fedorov et al. [10] augmented a conventional LRU eviction policy to reduce the eviction rate of written data. To mitigate head-of-line blocking of critical reads behind long latency iterative SCM writes, prior work has proposed enhanced request scheduling mechanisms [4, 32, 55] , which, for example, cancel or delay writes and allow reads to bypass them. Qureshi et al. [33] proposed a reconfigurable SCM hierarchy that is able to dynamically change its mode of operation between high performance and high capacity. Finally, Sampson et al. [37] suggested using fewer write iterations to improve SCM access latencies at the cost of some data precision.
We group this diverse list of prior work together because we obviate the need for complex hardware extensions. Our carefully organized SCM hierarchy can tolerate any read/write latency pair that meets a specified row buffer size. Furthermore, our insights show that SCM designers can sacrifice device speed to improve other non-performance characteristics. For example, Mellow Writes [53] shows that slowing down writes can increase the lifetime of ReRAM by orders of magnitude, while Zhang et al. illustrate a similar tradeoff for PCM [54]. When considering whether or not to adopt such a technique, our performance model provides concrete evidence to architects that extended latencies can indeed be tolerated given the opportunity to amortize them with large row buffers.
Conclusion
The arrival of emerging storage-class memory technologies has the potential to revolutionize datacenter economics, allowing online service providers to deploy servers with far greater capacities at decreased costs. However, directly using SCM as an alternative to DRAM raises significant challenges for server architects, as its higher activation latencies are unacceptable for datacenter applications with strict response-time constraints. Although we show that fully replacing DRAM with SCM is not possible due to the increase in memory access latency, we demonstrate that a carefully architected 3D stacked DRAM cache placed in front of the SCM allows the server to match the performance of a state-of-the-art DRAM-based system. The abundant spatial locality present in server applications favors a page-based cache organization, which enables amortization of long SCM access latencies.
As SCMs come in a plethora of densities and performance grades, we provide a unified performance and cost model to prune the broad design space and identify the most cost-efficient memory hierarchy given our DRAM cache architecture. Aided by our model, we present a case study of a number of phase-change memory devices, demonstrating that 2-bit cells represent the only cost-effective solution for scale-out servers.
