Intel Optane DC Persistent Memory is a new kind of byte-addressable memory with higher density and lower cost than DRAM. This enables affordable systems that support up to 6TB of memory. In this paper, we use such a system for massive graph analytics. We discuss how the system should be deployed for these applications and evaluate three existing shared-memory graph frameworks, Galois, GAP, and GraphIt, on large real-world web-crawls. Based on the results of our study, we recommend algorithms and runtime optimizations for getting the best performance on such systems. We also show that for these applications, the Galois framework on Optane DC PMM is on average 2× faster than D-Galois, the state-of-the-art distributed graph framework, on a production cluster with similar compute power. Thus, Optane DC PMM yields benefits in productivity, performance, and cost for massive graph analytics.
INTRODUCTION
Graph analytics systems must process graphs with tens of billions of nodes and trillions of edges. Since the main memory of most single machines is limited to a few hundred GBs, shared-memory graph analytics systems like Ligra [40] , Galois [35] and GraphIt [50] cannot be used to perform in-memory processing of such large graphs. Two approaches have been used in the literature to circumvent this problem: (i) out-of-core processing and (ii) distributed-memory processing.
In out-of-core systems, the graph is stored in secondary storage (SSD/disk), and portions of the graph are read into DRAM under software control for in-memory processing. State-of-the-art systems in this space include X-Stream [38] , GridGraph [52] , Mosaic [31] and BigSparse [27] . Secondary storage devices do not support random accesses efficiently, so data must be fetched and written in blocks. As a consequence, algorithms that perform well on shared-memory machines often perform poorly in an out-of-core setting, and it is necessary to rethink algorithms and implementations when transitioning from in-memory graph processing to out-of-core processing. In addition, the graph may need to be preprocessed to organize the data into a layout that is friendly for out-of-core processing.
Large graphs can also be processed using distributed-memory clusters that have a sufficient number of machines and memory for in-memory processing of the graphs. The graph is partitioned among the machines in a cluster using one of many partitioning policies that have been studied in the literature [22] . Communication is required during the computation to synchronize updates to node values across the entire cluster. State-of-the-art systems in this space include D-Galois [18] and Gemini [51] . Distributed-memory graph analytics systems have the advantage that they can be scaled out by adding new machines to provide additional memory and compute power. Additionally, the overhead of communication can be reduced by choosing good partitioning policies, avoiding small messages, and optimizing metadata. However, communication remains the bottleneck in these systems [18] .
Intel® Optane™ DC Persistent Memory (Intel Optane DC PMM) is a new memory technology that promises to revolutionize this area. Intel Optane DC PMM is byte-addressable memory which has the same form factor as DDR4 DRAM modules but with higher memory density and lower cost. It has longer access times than DRAM, but it is much faster than SSD. It can be configured so that the DRAM in the system acts as a very large cache for it. This allows a single machine to have up to 6TB of memory at relatively low cost, and in principle, it can be used to run memory-hungry applications without the substantial reworking of algorithms and implementations required for out-of-core or distributed-memory processing.
In this paper, we explore the use of Intel Optane DC PMM for analytics of very large graphs such as web-crawls up to 1TB in size. We first describe the system options for setting up a machine with Intel Optane DC PMM and show how these options can be chosen to optimize performance for graph analytics applications. Then, we compare the performance of three shared-memory graph analytics frameworks, Galois [35], GAP [5], and GraphIt [50], on a machine with this setup. Our experiments show that Galois has the flexibility needed to exploit the potential of Intel Optane DC PMM, particularly for very large web-crawl graphs that have a relatively large diameter. We also find that shared-memory Galois algorithms running on a single machine with Intel® Optane™ DC Persistent Memory mostly outperform the same algorithms when executed on a distributed cluster with up to 256 machines. An added bonus is that the Intel Optane DC PMM system supports more efficient shared-memory algorithms, such as those using pointer jumping, which are difficult to implement on distributed-memory machines.
The rest of this paper is organized as follows. Section 2 introduces Intel® Optane™ DC Persistent Memory. Section 3 describes how to set up large-memory systems for efficient graph analytics. Section 4 discusses how to design graph algorithms for use on large-memory systems. Section 5 presents our experimental evaluation. Section 6 surveys related work.
INTEL® OPTANE™ DC PERSISTENT MEMORY
Intel® Optane™ DC Persistent Memory is a new memory technology that delivers a unique combination of affordable large capacity and persistence (non-volatility). As shown in Figure 1, this memory adds one more level to the memory hierarchy, which can improve the performance of applications that require large amounts of memory in areas like in-memory databases and real-time data analytics. This memory comes in the same form factor as a DDR4 memory module and has the same electrical and physical interfaces. However, it uses a different protocol than DDR4, which means that the CPU must have Intel Optane DC PMM support in its memory controller. Similar to the DRAM distribution in NUMA systems, the Intel Optane DC PMM modules are also distributed among sockets. Figure 1 shows an example of a two-socket machine with 6TB of Intel Optane DC PMM split among the sockets. A study has shown that on six interleaved Optane DC PMMs, the maximum read bandwidth is roughly 49 GB/s, the maximum write bandwidth is roughly 14 GB/s, and the read latency of an Optane DC PMM random load is 305ns [26]. Although Intel Optane DC PMM is slower than DDR4, the large capacity offered by these DIMMs enables us to analyze much larger datasets on a single machine than was possible earlier.
Memory Configurations in Intel Optane DC PMM
Memory Mode: In memory mode, the operating system sees Intel Optane DC PMM as main memory, and DRAM acts as a direct-mapped (physically indexed and physically tagged) cache, called near-memory, to deliver DRAM-like performance at substantially lower cost and power with no modifications to the application. Although the memory media is persistent, the memory controller in the CPU makes it look like volatile memory to the software. This enables a common two-socket system to provide up to 6TB of main memory, something which is difficult and expensive to do with DRAM (if it is possible at all). Traditional, well-understood code optimization techniques for caches, such as blocking and other locality-improving transformations, can be used to tune applications to run well in this configuration.
In addition, software needs to be optimized for certain asymmetries in current machines with Intel Optane DC PMM. Intel Optane DC PMM modules on a given socket can only use the DRAM on their local NUMA node as near-memory. Therefore, in addition to the usual NUMA allocation considerations, memory allocation has to take the near-memory hit rate into account as well. The cost of a local near-memory miss is much higher than that of a remote near-memory hit (as we show in our evaluation), so it is more beneficial to allocate memory so that the system can utilize more DRAM as near-memory, even if this means more remote NUMA accesses.
App-direct Mode: In this mode, Intel Optane DC PMM modules are provisioned as persistent memory, and software can access this memory in a byte-addressable way. Developers can use this functionality through operating-system or middleware libraries (which enables applications to run on these systems without any modifications) or through modifications to the applications themselves to get the best value from the persistent memory modules. The Persistent Memory Development Kit (PMDK) [2] is one example of a library that makes programming in app-direct mode efficient. One compelling case for app-direct mode is large in-memory databases, where indices can be stored in persistent memory to avoid rebuilding them at every reboot, which significantly reduces restart time.
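To make this concrete, the following is a minimal sketch (ours, not from the original study) of app-direct usage through PMDK's libpmem: it maps a file on a DAX-mounted persistent-memory filesystem, updates a value in place, and flushes the store to the media. The mount point /mnt/pmem and the file name are hypothetical.

```cpp
// Minimal app-direct sketch using PMDK's libpmem (link with -lpmem).
// Assumes a DAX-mounted persistent-memory filesystem at /mnt/pmem.
#include <libpmem.h>
#include <cstdio>

int main() {
    size_t mapped_len;
    int is_pmem;
    // Create (or reopen) a small file backed by persistent memory and map it.
    void* addr = pmem_map_file("/mnt/pmem/counter", 4096, PMEM_FILE_CREATE,
                               0666, &mapped_len, &is_pmem);
    if (addr == nullptr) { std::perror("pmem_map_file"); return 1; }

    // Update a value in place; it survives process restarts and reboots.
    auto* counter = static_cast<long*>(addr);
    *counter += 1;

    // Flush the store to the persistent media.
    if (is_pmem) pmem_persist(counter, sizeof(*counter));
    else         pmem_msync(counter, sizeof(*counter));

    pmem_unmap(addr, mapped_len);
    return 0;
}
```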
Intel Optane DC PMM modules can be easily configured and managed using an application programming interface or command-line interface provided by the ipmctl [16] OS utility in Linux. When all the Intel Optane DC PMM modules are configured in app-direct mode using ipmctl, DRAM is used as the main volatile memory. In this paper, we focus on memory mode. Detailed specifications for the particular machine with Intel® Optane™ DC Persistent Memory used in our study are given in Section 5.1.
SETTING UP LARGE-MEMORY MACHINES FOR GRAPH ANALYTICS
We describe how large-memory systems like our Intel® Optane™ DC Persistent Memory machine must be set up to get good performance on graph analytics applications. There are three main issues: NUMA-aware allocation, NUMA-aware migration, and page size.
NUMA-aware Allocation
NUMA-aware allocation places memory on the same NUMA node as the cores that will access it, since this may increase bandwidth and reduce the latency of memory accesses. NUMA-aware allocation policies fall into three main categories: (a) NUMA local, which allocates on a particular NUMA node specified at allocation time (if there is not enough memory available on the preferred node, other NUMA nodes are used), (b) NUMA interleaved, which allocates memory by interleaving physical pages across NUMA nodes in a round-robin fashion, and (c) NUMA blocked, which partitions the physical pages to be allocated into blocks (possibly of different sizes) and distributes the blocks among the NUMA nodes on the system. There are several ways to implement these policies. The allocation policy can be set globally by using OS utilities such as numactl on Linux. A more fine-grained allocation policy can be implemented at the application level by using an OS-provided NUMA allocation library (numa.h in Linux), whose numa_alloc functions allow different NUMA policies for different memory allocations inside a single application. These OS-based approaches, however, support only the NUMA local and NUMA interleaved policies. Another way to get fine-grained NUMA-aware allocation is to manually allocate memory using anonymous mmap and have threads on different sockets inside the application touch the pages (referred to as first-touch) so that they are allocated on the desired NUMA nodes. This method, unlike the OS-provided methods, allows applications to implement application-specific NUMA-aware allocation policies other than local or interleaved.
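The sketch below (ours, not taken from any of the frameworks) contrasts the OS-provided routes with application-controlled first-touch allocation: numa_alloc_onnode and numa_alloc_interleaved come from libnuma (numa.h), while the blocked policy is realized with anonymous mmap plus first-touch. The number of sockets and the pinning of the toucher threads are left abstract.

```cpp
// NUMA allocation sketch: libnuma policies vs. first-touch over anonymous mmap.
// Link with -lnuma. Error handling is omitted for brevity.
#include <numa.h>
#include <sys/mman.h>
#include <cstring>
#include <thread>
#include <vector>

void* alloc_local(size_t bytes, int node) {
    // (a) NUMA local: pages come from the given node (spills over when full).
    return numa_alloc_onnode(bytes, node);
}

void* alloc_interleaved(size_t bytes) {
    // (b) NUMA interleaved: pages are striped round-robin across all nodes.
    return numa_alloc_interleaved(bytes);
}

void* alloc_blocked_first_touch(size_t bytes, int num_sockets) {
    // (c) NUMA blocked: mmap only reserves virtual pages; each physical page
    // is allocated on the node of the thread that first touches it.
    char* p = static_cast<char*>(mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    size_t block = bytes / num_sockets;
    std::vector<std::thread> touchers;
    for (int s = 0; s < num_sockets; ++s)
        touchers.emplace_back([=] {
            // In a real runtime this thread would be pinned to socket s.
            std::memset(p + s * block, 0, block);
        });
    for (auto& t : touchers) t.join();
    return p;
}
```

For a global policy, the same effect as (b) can be obtained from the command line, e.g. `numactl --interleave=all ./app`.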
To demonstrate and understand the differences among the local, interleaved, and blocked NUMA allocation policies on our Intel Optane DC PMM setup, we use a simple micro-benchmark that allocates a region of size m and writes each location once using t threads, such that each thread gets a contiguous block to write sequentially. Figure 3(a) shows the execution time of the micro-benchmark on DDR4 DRAM and Intel Optane DC PMM for the NUMA local allocation policy using t = 96 and different m. In this policy, all the memory of socket 0 is used before memory from socket 1 is allocated. We observe that going from 80GB to 160GB increases the execution time by 2× for both DRAM and Intel Optane DC PMM: this is expected since the work increases by 2×. Going from 160GB to 320GB also increases the work by 2×. For DRAM, a 320GB allocation spills over to the other socket (each socket has only 192GB), which increases the effective bandwidth by 2×, so the execution time does not change much. On Intel Optane DC PMM, however, the 320GB is allocated entirely on socket 0 since our machine has 3TB per socket. Since there is no change in bandwidth, one would expect the performance to degrade by 2×, but it actually degrades by 5.6×. This is because the machine only gets to use 192GB of DRAM as near-memory; this cannot fit all 320GB, so the conflict miss rate increases by roughly 1.8×. We conclude that (i) near-memory conflict misses are detrimental to performance on Intel Optane DC PMM and (ii) the NUMA local allocation policy is not suitable for m > 192GB on our setup.
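A minimal reconstruction of this micro-benchmark (ours, not the authors' code) is shown below; with an interleaved policy set via numactl it measures the interleaved case, and with per-thread first-touch as sketched above it measures the blocked case. The region size m is assumed to be divisible by the thread count t.

```cpp
// Write micro-benchmark sketch: t threads, each writing its own contiguous
// block of an m-byte anonymously mapped region exactly once.
#include <sys/mman.h>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <string>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    size_t m   = (argc > 1) ? std::stoull(argv[1]) : (80ULL << 30); // bytes
    unsigned t = (argc > 2) ? std::stoul(argv[2]) : 96;             // threads
    // mmap leaves pages unallocated until first touch, so the NUMA placement
    // is decided by the policy in effect (numactl) or by which thread writes.
    char* buf = static_cast<char*>(mmap(nullptr, m, PROT_READ | PROT_WRITE,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    size_t block = m / t;

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < t; ++i)
        workers.emplace_back([=] {
            std::memset(buf + i * block, 1, block);   // sequential writes
        });
    for (auto& w : workers) w.join();
    double secs = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - start).count();

    std::printf("%zu bytes, %u threads: %.2f s\n", m, t, secs);
    munmap(buf, m);
    return 0;
}
```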
Figure 3(b) shows the execution time of the micro-benchmark on DDR4 DRAM and Intel Optane DC PMM for the NUMA interleaved and blocked allocation policies using m = 320GB and different t. For DRAM, both policies behave similarly for all t because the 320GB allocation spills over to the second socket. However, when t ≤ 24 on Intel Optane DC PMM, only threads on socket 0 are used, so the blocked policy allocates memory only on socket 0 (because it uses first-touch), which degrades performance by 39× compared to 48 threads, where the allocation is divided equally between the sockets. In contrast, the NUMA interleaved policy with 24 threads uses both sockets and improves performance by 9×, even though 50% of accesses are remote when t ≤ 24; however, it performs worse than blocked when t = 48.
The main takeaways from these results are that (1) the cost of a local near-memory miss is much higher (9×) than that of a remote near-memory hit and (2) the NUMA policy that maximizes the amount of near-memory available should be selected using application-level fine-grained control (recall that NUMA blocked cannot be selected using OS utilities).
NUMA-aware Migration
When an OS-level NUMA allocation policy is not specified, or when the application-level NUMA policy is not visible to the OS, the OS (Linux) can dynamically migrate data between NUMA nodes to increase the proportion of local NUMA accesses. NUMA page migration is helpful when multiple applications share a single system, as it tries to move pages closer to the cores assigned to each application. However, for a single application running on the machine, this policy may not always be useful, especially when the application has specified its own NUMA allocation policy using its knowledge of memory access patterns.
OS-directed migration has several overheads: (a) it requires bookkeeping to track accesses to pages in order to select pages for migration, and (b) migration changes the virtual-to-physical address mapping, which makes the Page Table Entries (PTEs) cached in the CPU's Translation Lookaside Buffers (TLBs) stale and therefore causes a TLB shootdown on each core. TLB shootdown involves slow operations such as issuing inter-processor interrupts (IPIs), and it also increases TLB misses. Researchers have also studied the impact of page granularity on migration overhead [39].

Figure 4 shows the effect of NUMA migration for breadth-first search (bfs) (similar trends are observed for other benchmarks) using Galois for different input graphs on both Intel Optane DC PMM and DDR4 DRAM. It also shows the effect of migration for two page sizes: (a) 4KB small pages and (b) 2MB huge pages. The number on each bar gives the % change in execution time when NUMA migration is turned off; a positive number means that turning migration off improves performance. Galois uses the NUMA interleaved allocation policy in this experiment. Figure 4 shows that performance improves in most cases if NUMA migration is turned off: (1) the 4KB small page size shows more performance improvement than 2MB huge pages, (2) the % improvement with small pages is higher on Intel Optane DC PMM than on DRAM, and (3) the % improvement with huge pages increases with the graph size on the Intel Optane DC PMM system.

Figure 5 shows the breakdown of time spent in OS kernel and user code, and Table 1 shows the number of pages migrated for bfs on clueweb12 for both small and huge pages. We observe that the number of migrations is in the millions for small pages and in the hundreds for huge pages. The finer granularity of small pages makes them more prone to migration, and the number of small pages being 512× the number of huge pages also increases the bookkeeping overhead in the OS. This is reflected in the amount of time spent in the OS kernel due to NUMA migrations, which goes away when migrations are turned off. Figure 5 also shows that the time spent in user code is not affected by NUMA migrations, which indicates that migrations add overhead without giving significant benefits. Another way to measure the efficacy of page migration is to measure the % of local DRAM accesses or local near-memory accesses on Intel Optane DC PMM: if migration were beneficial, these should improve. Table 1 shows that they do not change much.
NUMA migration hurts performance more on Intel Optane DC PMM than on DRAM due to (a) the higher cost of bookkeeping, since memory accesses are more expensive, and (b) the higher cost of TLB shootdowns, since the near-memory is a direct-mapped cache. Larger graphs exacerbate this because they use more pages.
The results of these experiments suggest that NUMA migration should be turned off for graph analytics applications.
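On Linux, automatic NUMA balancing is controlled by the kernel.numa_balancing knob; a minimal sketch of turning it off before a run (requires root; equivalent to `sysctl -w kernel.numa_balancing=0`) follows.

```cpp
// Disable Linux automatic NUMA balancing before a run. Requires root.
#include <fstream>
#include <iostream>

int main() {
    std::ofstream knob("/proc/sys/kernel/numa_balancing");
    if (!knob) {
        std::cerr << "cannot open numa_balancing knob (need root?)\n";
        return 1;
    }
    knob << 0;   // 0 = off, 1 = on
    return 0;
}
```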
Page Size Selection
As memory sizes and workload sizes grow, the time spent handling TLB misses can become a performance bottleneck, since large working sets require many virtual-to-physical address translations that may not be cached in the TLB. This bottleneck can be tackled either (a) by increasing the TLB size in hardware or (b) by increasing the page size. The TLB size is determined by the micro-architecture and therefore cannot easily be changed by a user. On the other hand, processors today allow users to choose among a variety of page sizes, since different page sizes work best for different types of workloads. For example, x86 supports traditional 4KB small pages as well as 2MB and 1GB huge pages. In Linux, huge pages are reserved by writing the number of desired pages to /proc/sys/vm/nr_hugepages and passing the MAP_HUGETLB flag to mmap; otherwise, small pages are used by default. We studied the impact of page size on graph analytics applications for a 4KB small page size and a 2MB huge page size. Figure 4 shows the performance of bfs (similar behavior was observed for other benchmarks) using Galois [35] for various large graphs on (a) Intel Optane DC PMM and (b) DDR4 DRAM. NUMA migration is turned off, and Galois uses NUMA interleaved allocation. We observe that using huge pages is always beneficial for graph analytics applications on large input graphs: huge pages reduce the number of pages required by 512×, which reduces the number of TLB misses (3.2× for clueweb12 and 1.9× for wdc12, as seen in Table 2) and the CPU cycles spent on page walks after TLB misses (7.3× for clueweb12 and 8.8× for wdc12, as seen in Table 2). The benefits of huge pages are higher on Intel Optane DC PMM than on DRAM since TLB misses increase the near-memory access latency. Huge pages increase the TLB reach (TLB size × page size), thereby reducing TLB misses.
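As an illustration, below is a hedged sketch of explicitly requesting 2MB huge pages with mmap(MAP_HUGETLB); it assumes enough huge pages have been reserved in /proc/sys/vm/nr_hugepages as described above, and falls back to small pages if the reservation is insufficient.

```cpp
// Allocate a buffer backed by 2MB huge pages via mmap(MAP_HUGETLB).
// Assumes huge pages were reserved in /proc/sys/vm/nr_hugepages.
#include <sys/mman.h>
#include <cstdio>

int main() {
    constexpr size_t kHuge = 2UL << 20;            // 2MB huge page size
    size_t bytes = 1UL << 30;                      // e.g. a 1GB edge array
    bytes = (bytes + kHuge - 1) & ~(kHuge - 1);    // round up to page size

    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        std::perror("mmap(MAP_HUGETLB), falling back to 4KB pages");
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    // ... use p for graph arrays (CSR offsets, edges, node labels) ...
    munmap(p, bytes);
    return 0;
}
```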
These results suggest that a huge page size of 2MB is good for graph analytics on Intel Optane DC PMM.
EFFICIENT ALGORITHMS FOR INTEL® OPTANE™ DC PERSISTENT MEMORY
In graph algorithms, each node has one or more labels that are initialized at the start of the computation and updated repeatedly during the computation until a quiescence condition is reached. Label updates are performed by applying an operator to active nodes in the graph [36] . In some systems such as Galois [35] , an operator may read and update an arbitrary portion of the graph surrounding the active node, and this is called its neighborhood. However, most shared-memory systems such as Ligra [40] and GraphIt [50] support only a limited class of operators called vertex operators whose neighborhoods are only the immediate neighbors of the active node. A push-style operator updates the labels of the neighbors of the active node, while a pull-style operator updates the label of only the active node. Direction-optimizing implementations [4] can switch between push and pull style operators dynamically, but they require a reverse edge for every forward edge in the graph, which doubles the memory footprint of the graph.
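The distinction can be illustrated with the following sketch (ours, over a minimal CSR representation rather than any framework's actual API) of push- and pull-style operators for a label-relaxation step such as bfs.

```cpp
// Push- vs. pull-style vertex operators for a label-relaxation step.
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

// Minimal CSR graph: offsets[v]..offsets[v+1] index into edges.
struct CSR {
    std::vector<uint64_t> offsets;
    std::vector<uint32_t> edges;
};

// Push: the active node writes its out-neighbors' labels, so concurrent
// updates to the same destination need atomics.
void push_op(const CSR& out, std::vector<std::atomic<uint32_t>>& label,
             uint32_t src) {
    uint32_t l = label[src].load() + 1;
    for (uint64_t e = out.offsets[src]; e < out.offsets[src + 1]; ++e) {
        uint32_t dst = out.edges[e];
        uint32_t old = label[dst].load();
        while (l < old && !label[dst].compare_exchange_weak(old, l)) {}
    }
}

// Pull: the active node reads its in-neighbors and writes only its own label,
// so the write needs no atomics, but the in-edge CSR must also be stored.
void pull_op(const CSR& in, std::vector<std::atomic<uint32_t>>& label,
             uint32_t dst) {
    uint32_t best = label[dst].load();
    for (uint64_t e = in.offsets[dst]; e < in.offsets[dst + 1]; ++e)
        best = std::min(best, label[in.edges[e]].load() + 1);
    label[dst].store(best);
}
```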
To find active nodes in the graph, algorithms take one of two approaches. A topology-driven algorithm executes in rounds, and in each round, it applies the operator to all the graph nodes; the Bellman-Ford algorithm for single-source shortest path (sssp) computation is an example. These algorithms are simple to implement, but they may not be work-efficient if there are few active nodes in the graph in many rounds. To address this, data-driven algorithms track active nodes explicitly and apply the operator only to these nodes. At the start of the algorithm, some nodes are active; applying the operator to an active node may activate other nodes, and operator application continues until there are no active nodes in the graph. Dijkstra and delta-stepping sssp algorithms are examples. Active nodes can be tracked using a bit-vector of size N if there are N nodes in the graph: we call this a dense worklist [50] . Other implementations keep an explicit worklist of active nodes [35] : we call this a sparse worklist.
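The following sketch (ours, reusing the CSR type from the previous example) shows a data-driven bfs with an explicit sparse worklist; a topology-driven variant would instead sweep over all N nodes in every round, and a dense-worklist variant would keep a bit per node instead of an explicit list.

```cpp
// Data-driven BFS with a sparse worklist of active nodes (sequential sketch).
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

std::vector<uint32_t> bfs_data_driven(const CSR& g, uint32_t source) {
    const uint32_t INF = std::numeric_limits<uint32_t>::max();
    std::vector<uint32_t> dist(g.offsets.size() - 1, INF);
    std::vector<uint32_t> current{source}, next;   // sparse worklists
    dist[source] = 0;
    while (!current.empty()) {                     // only active nodes visited
        next.clear();
        for (uint32_t src : current)
            for (uint64_t e = g.offsets[src]; e < g.offsets[src + 1]; ++e) {
                uint32_t dst = g.edges[e];
                if (dist[dst] == INF) {            // activate dst exactly once
                    dist[dst] = dist[src] + 1;
                    next.push_back(dst);
                }
            }
        std::swap(current, next);
    }
    return dist;
}
```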
Algorithms for very large graphs
At present, very large graphs are usually analyzed using clusters or out-of-core systems, but these platforms support only vertex programs. Conventional wisdom in the field is that vertex programs are adequate for power-law graphs since they have a small diameter and information does not have to propagate many hops in these graphs. Although it is known that vertex programs do not perform well on high-diameter graphs like road networks, road networks are small enough that out-of-core and distributed-memory processing is not needed [36] .
Using the Intel Optane DC PMM system, we were able to use a single machine to perform analytics on very large graphs, and our results suggest that conventional wisdom in this area needs to be revised. The key issue is highlighted by Table 3: clueweb12, uk14, and wdc12, which are real-world web-crawls, actually have a very high diameter compared to kron and rmat, the synthetic power-law graphs (details of these graphs are given in Section 5.1). Figure 6 shows the execution time of different algorithms for bfs, cc, and sssp on Intel Optane DC PMM using the clueweb12, rmat32, and wdc12 graphs. For bfs, we also show execution times on a different machine, called Entropy (machine description in Section 5.1), with large enough DRAM (1.5TB) to store these graphs. For bfs, a vertex program with direction optimization performs well for rmat32 since it has a low diameter, but for the real-world web-crawls, which have a much higher diameter, it is outperformed by an implementation with a push-style operator and a sparse worklist, since that algorithm has a lower memory footprint, makes fewer memory accesses, and is more efficient in the later rounds when there are few active nodes. For sssp, the delta-stepping algorithm, which maintains a sparse worklist, significantly outperforms the implementation with the dense worklist. For cc, label propagation combined with short-cutting (denoted LabelProp-SC) [43] and pointer jumping (denoted Pointer-jump), both of which use non-vertex operators, significantly outperform the algorithm that uses a vertex operator for the real-world web-crawls.
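For reference, the pointer-jumping step used by such cc algorithms can be sketched as follows (our simplified, sequential version, not the Galois implementation); it is a non-vertex operator because a node follows and rewrites parent links that need not be its graph neighbors.

```cpp
// Pointer-jumping sketch for connected components: each node holds a parent
// link (set up by hooking/label propagation); pointer jumping repeatedly
// shortcuts each node past its parent until every node points at its root.
#include <cstdint>
#include <vector>

void pointer_jump(std::vector<uint32_t>& parent) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (uint32_t v = 0; v < parent.size(); ++v) {
            uint32_t p  = parent[v];
            uint32_t gp = parent[p];
            if (p != gp) {            // shortcut v past its parent
                parent[v] = gp;
                changed = true;
            }
        }
    }
}
```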
These findings do not apply only to Intel Optane DC PMM: Figure 7 shows the same experiments for bfs but conducted on Entropy with enough DRAM memory to fit the graphs. The trends are similar to those on the machine with Intel Optane DC PMM.
To summarize, large real-world web-crawls, which are the largest graphs available today, actually have a high diameter, unlike synthetically generated rmat and kron graphs. Therefore, conclusions drawn from experiments with rmat and kron graphs can be misleading. On distributed-memory and out-of-core platforms, one is forced to use vertex programs anyway, but on machines with Intel Optane DC PMM, it is advantageous to use algorithms with non-vertex operators and explicit worklists of active nodes. Frameworks that support only vertex operators or that do not have worklists are at a disadvantage on this platform when processing large real-world web-crawls.
EXPERIMENTAL EVALUATION
Section 5.1 describes the experimental setup. In Section 5.2, three shared-memory graph analytics systems -Galois [35] , GraphIt [50] , and GAP [5] -are evaluated on the Intel Optane DC PMM machine using a number of graph analytics applications, and it is shown that Galois performs best. The rest of the study uses Galois. Section 5.3 describes experiments with medium-sized graphs which are stored either in Intel Optane DC PMM or in DRAM. These experiments provide end-to-end estimates of the overhead of executing the applications with data in Intel Optane DC PMM rather than in DRAM. Section 5.4 describes experiments with very large graphs that fit only in Intel Optane DC PMM, and performance is compared with distributed-memory execution on a production cluster with up to 128 Skylake nodes using the D-Galois [18] system.
Experimental Setup
Intel Optane DC PMM experiments were conducted on a single two-socket machine with Intel's second-generation Xeon scalable processor ("Cascade Lake") with 48 cores (we use up to 96 threads with hyperthreading) at a clock rate of 2.2 GHz. The machine has 6TB of Intel Optane DC PMM, 384GB of DDR4 DRAM, and a 32KB L1 data cache, as shown in Figure 1. The data TLB is 4-way associative with 64 entries for 4KB (small) pages and 4-way associative with 32 entries for 2MB (huge) pages. Code is compiled with g++ 7.3. We used the same machine for DRAM experiments by configuring all the Intel Optane DC PMM modules in app-direct mode using the ipmctl utility so that DRAM is used as the main volatile memory.
To collect hardware counters and analyze performance, we used Intel® VTune™ Amplifier [15] and Intel® Platform Profiler [14]. We also conducted some experiments on a large-DRAM four-socket machine (referred to as Entropy) with Intel Xeon Platinum 8176 ("Skylake") processors with 112 cores at a clock rate of 2.2 GHz and 1.5TB of DDR4 DRAM.
Distributed-memory experiments were conducted on the Stampede2 [42] cluster at the Texas Advanced Computing Center using up to 128 Intel Xeon Platinum 8160 ("Skylake") nodes, each with 48 cores at a clock rate of 2.1 GHz, 192GB of DDR4 RAM, and a 32KB L1 data cache. Machines in the cluster are connected with a 100Gb/s Intel Omni-Path interconnect. We use LCI [17] for message passing between hosts. Code is compiled with g++ 7.1.

Table 3 specifies the input graphs: clueweb12 [37], uk14 [6, 7], and wdc12 [34] are web-crawls (wdc12 is the largest publicly available one), and we use them throughout our study; kron30 and rmat32 are randomized scale-free graphs generated using the kron [30] and rmat [9] generators (using weights of 0.57, 0.19, 0.19, and 0.05, as suggested by graph500 [1]). kron30 fits into DRAM, so we use it to illustrate differences between workloads that fit into DRAM and those that do not. rmat32 is also a synthetic power-law graph, but it does not fit in DRAM on our machine. All graphs are unweighted, so we generate random weights for all of them for sssp computation.
Our evaluation uses 6 benchmarks: single-source betweenness centrality (bc), breadth-first search (bfs), connected components (cc), k-core decomposition (kcore), pagerank (pr), and single-source shortest path (sssp). The only benchmark that uses edge weights is sssp. The source node for bc, bfs, and sssp is the maximum out-degree node. The tolerance for pr is 10^-6. The k in kcore is 100. All benchmarks are run until convergence except for pr, which is run for up to 100 rounds. We report the mean of 3 runs.
Galois, GAP and GraphIt on Intel Optane DC PMM
To choose a shared-memory graph analytics system for our experiments, we evaluate (1) Galois [35] , which is a library and runtime for graph processing, (2) GAP [5] , which is a benchmark suite of graph applications, and (3) GraphIt [50] , which is a domain-specific language (DSL) and optimizing compiler for graph computations. They exemplify different approaches to shared-memory graph analytics.
Galois is a general-purpose C++ programming system with a sophisticated runtime that permits optimizations to be specified in the program at compile time or at runtime, giving the application programmer a large design space of implementations to explore. However, it requires more programmer effort than GAP and GraphIt. GAP is a benchmark suite of common graph analytics algorithms; the code is written by expert programmers, so the user cannot choose optimizations. GraphIt is a DSL that supports only vertex programs, and it has a sophisticated compiler that uses auto-tuning to generate optimized code; the optimizations are under the control of the programmer. For Galois and GraphIt, we attempt to choose the best-performing set of optimizations for every benchmark and input. We do not modify the internals of any system in the experiments reported here.
The kcore application is not implemented in GAP or GraphIt, so we omit it from the comparisons reported in this section. We omit the two largest graphs (rmat32 and wdc12) for GAP and GraphIt because neither system can handle graphs with more than 2^31 − 1 nodes (they use a signed 32-bit int to store node IDs). GAP and GraphIt do not use NUMA allocation policies within their applications, so we use the OS utility numactl to choose the NUMA interleaved policy. For Galois, we chose the best-performing algorithm using a runtime option, but we did not modify the program to try different worklists or chunk sizes. Galois allows application programmers to choose the NUMA interleaved or blocked allocation policy for each application by modifying a template argument in the program; we choose interleaved for bfs and sssp and blocked for bc, cc, and pr. For GraphIt, we used the optimizations recommended by the authors [50] for scale-free graphs (we explored a few other optimizations as well, but the recommended ones were the best). We disable NUMA migration for all experiments.

Figure 8 shows the execution times of the benchmarks on Intel Optane DC PMM for clueweb12 and uk14 (GraphIt does not have bc). Galois is generally much faster than GraphIt and GAP: on average, Galois is 3.6× and 1.6× faster than GraphIt and GAP, respectively. There are several reasons for these performance differences.
Algorithm and implementation choices are part of the story, as discussed in Section 4.1. For bfs, GAP and GraphIt use direction-optimization, which accesses both in-edges and out-edges (increasing memory accesses), while Galois does not. For sssp, GAP and Galois use delta-stepping, which requires a complicated worklist, while GraphIt does not because it does not support such worklists. For cc, GAP and Galois use a union-find based pointer-jumping algorithm, while GraphIt uses a label propagation algorithm because it supports only vertex programs. For all algorithms, GAP and GraphIt use a dense worklist to store the frontier, while Galois uses a sparse worklist except for pr (large-diameter graphs tend to have sparse frontiers). All three systems use the same algorithm for pr.
Another key difference is the way in which the three systems perform memory allocation. Galois is the only framework that uses a huge page size of 2MB, whereas GAP and GraphIt use a small page size of 4KB. As shown in Section 3, huge pages can significantly reduce the cost of memory accesses. Galois is also the only one to provide NUMA blocked allocation, and we chose that policy because it performed observably better than the interleaved policy (by up to 18%). In addition, GAP and GraphIt allocate memory for both the in-edges and out-edges of the graph, while Galois allocates memory only for whichever direction is needed by the algorithm. This not only increases the memory footprint for GAP and GraphIt but may also lead to conflict misses in near-memory when both in-edges and out-edges are accessed.

Medium-size graphs: using Intel Optane DC PMM vs. DDR4 DRAM

kron30 and clueweb12 (Table 3) fit in the 384GB of DRAM available on the machine (using Galois). We used these graphs to measure the end-to-end overhead of using Intel Optane DC PMM for graphs that are small enough to fit in DRAM. We choose the algorithms in Galois that perform best on 96 threads (including newly implemented ones, so the execution times may be faster than those shown in Figure 8, which are directly from the system). Figure 9 shows the strong scaling results on DRAM and on Intel Optane DC PMM with DRAM as cache. kron30 requires ∼136GB, roughly a third of the DRAM available, so Intel Optane DC PMM delivers performance almost identical to DRAM by caching the graph in DRAM effectively. On the other hand, clueweb12 requires ∼365GB, which is quite close to the DRAM available, so there are significantly more conflict misses (≈26%) in the near-memory of Intel Optane DC PMM. On 96 threads, Intel Optane DC PMM can take up to 65% more execution time than DRAM, but on average it takes only 7.3% more time than DRAM.
Another trend is that when the number of threads is less than 24, Intel Optane DC PMM can be much slower than DRAM because of the way Galois allocates memory. The interleaved and blocked allocation policies in Galois interleave and block among the threads, not among the sockets. If the number of threads is less than 24, all threads run on a single socket and all memory ends up being allocated there, leading to under-utilization of the DRAM in the system and thus to more conflict misses in near-memory. This can be overcome by changing the allocation policy in Galois if it is important to run with a small number of threads.
Very large graphs: using Intel Optane DC PMM vs. a Cluster
For very large graphs that do not fit in DRAM, the conventional choices are to use either a distributed or an out-of-core system.
We focus on distributed execution in this section, using the state-of-the-art D-Galois system [18] on the Stampede2 [42] cluster at TACC. To partition graphs between machines, we follow the recommendations of a previous study [22] and use Outgoing Edge Cut (OEC) for 5 and 20 hosts and Cartesian Vertex Cut (CVC) for 256 hosts. On each machine, D-Galois uses the same computation runtime as Galois. D-Galois supports only vertex programs, which simplifies communication and synchronization; therefore, it cannot support some of the more efficient non-vertex algorithms in the Galois system. We exclude graph loading, partitioning, and construction time from the reported numbers.

For logistical reasons, it is difficult to ensure that both platforms use exactly the same resources (threads and memory), so an apples-to-apples comparison is hard. However, Figure 10 shows the big picture. The bars labeled O_ in each figure show times on the Intel Optane DC PMM system with the following configurations:
• OB: Performance using the best algorithm in Galois for that problem and all 96 threads
• OA: Performance using the best vertex program in Galois for that problem and all 96 threads
• OS: Same as OA but using only 80 threads
The bars labeled D_ in each figure show times on the Stampede2 system with the following configurations:
• DB: Performance using D-Galois vertex programs on 256 machines (12288 threads)
• DM: Performance using D-Galois vertex programs on the minimum number of hosts required to hold the graph in memory (5 hosts for clueweb12 and uk14, and 20 hosts for wdc12; 48 threads per host)
• DS: Same as DM but using a total of 80 threads across all machines

Some of the key points are the following. For bars DS and OS, the algorithm and resources are roughly the same. In most cases, OS is similar to or better than DS. The only notable exception is pr, and this may be because of better spatial locality in D-Galois (we are investigating this). On average, OS is 1.9× faster than DS across all inputs and benchmarks. Bars OB and OA show the advantages of using non-vertex programs on the Intel Optane DC PMM system. Bars DB and OB show that with the more complex algorithms that can be implemented on the Intel Optane DC PMM system, performance on this system matches the performance of vertex programs on a cluster with vastly more cores and memory for bc, bfs, kcore, and sssp.
The main takeaway from these results is that Intel Optane DC PMM enables us to perform analytics on massive graphs using shared-memory frameworks like Galois out of the box while yielding performance comparable to or better than that of a cluster with the same resources. It boosts both programming productivity and performance.
Discussion and Summary
While our study was specific to Intel Optane DC PMM, we believe most of the main lessons summarized below also apply to other large-memory systems.
• The programming model must allow users to write work-efficient algorithms that need not be vertex programs.
• Synthetic power-law graphs like kron and rmat can be misleading because large real-world web-crawls have a nontrivial diameter. They require work-efficient algorithms that reduce the number of memory accesses to get better performance.
• On NUMA systems, the runtime must manage memory allocation instead of delegating it to the operating system. It must exploit huge pages and NUMA blocked allocation. NUMA migration is not useful.

Galois performs well on Intel Optane DC PMM because it incorporates these lessons.
RELATED WORK
Shared-Memory Graph Processing. Shared-memory graph processing frameworks such as Galois [35], Ligra [40, 41], Julienne [19], Polymer [49], and GraphIt [50] provide users with abstractions to program graph computations that efficiently leverage a machine's underlying properties such as NUMA, memory locality, and multiple cores. Shared-memory frameworks load the graph into memory for processing, so they are limited by the amount of main memory available on the system: if a graph cannot fit, out-of-core or distributed processing must be used. However, if the graph fits in memory, shared-memory systems are cheaper than out-of-core or distributed systems, as they suffer neither disk-reading overhead nor communication overhead.
Intel Optane DC PMM increases the amount of memory available to shared-memory graph processing systems, and our evaluation shows that algorithms run with Intel Optane DC PMM are competitive with or better than D-Galois [18], a state-of-the-art distributed graph analytics system. This is consistent with past work showing that shared-memory graph processing on large graphs can be efficient [20], and our findings extend to any system with large amounts of main memory (they are not limited to Intel Optane DC PMM).
Out-of-core Graph Processing. Out-of-core graph processing systems such as GraphChi [29], X-Stream [38], GridGraph [52], and Mosaic [31] perform graph computation by loading appropriate portions of a graph into memory and writing them back out to disk in a disciplined manner to reduce disk access overhead. Therefore, these systems are not limited by main memory like shared-memory systems. The overhead of disk operations, however, greatly impacts the performance of these systems compared to shared-memory systems. With the advent of Intel Optane DC PMM, these systems are less necessary than before, as users can increase their main memory and run graph computations on large graphs without a significant performance decrease, as our evaluation shows.
Distributed Graph Processing. Distributed graph processing systems such as PowerGraph [24], Gemini [51], D-Galois [18], and others [8, 10, 23, 25, 32, 46] are able to process large graphs by distributing the graph among many machines, which increases both the available memory and the computational power. However, since the computation is spread among many machines, communication among them is required, and this can add significant overhead to the runtime of an algorithm. Additionally, getting access to a distributed cluster can be expensive for an average user, which limits the usability of such systems. Using Intel Optane DC PMM to increase the memory available to shared-memory systems addresses both the memory issue and the cost issue that make distributed systems difficult to use.

Persistent Memory. Graph processing is not the only area in which Intel Optane DC PMM can be used. Prior work on non-volatile memory includes file systems designed for persistent memory [11, 13, 21, 47], making access to persistent memory efficient while remaining semantically consistent [12, 28, 33, 45], database systems in persistent memory [3, 44, 48], and countless other areas in which non-volatile memory is expected to improve performance.
CONCLUSIONS
Intel® Optane™ DC Persistent Memory is a new kind of byte-addressable memory with higher density and lower cost than DRAM. This paper described a system with Intel Optane DC PMM and showed how it can be used effectively to perform analytics on very large graphs that do not fit in DRAM on most machines. We also showed that the Galois framework on a machine with Intel Optane DC PMM is on average 2× faster than D-Galois, the state-of-the-art distributed graph framework, on a production cluster with similar compute power. We conclude that Intel Optane DC PMM is well suited for large-scale graph analytics.
