Abstract: Modern shared-memory systems embrace the NUMA architecture, which has proven to be more scalable than the SMP architecture. In many ways, a NUMA system resembles a shared-nothing distributed system: it has physically distinct processing units and memory regions, and memory accesses to remote NUMA domains are more expensive than local accesses. This creates an opportunity to transfer the know-how and design of distributed graph processing to shared-memory graph processing solutions optimized for NUMA systems. To this end, we explore whether a distributed-memory-like middleware that makes graph partitioning and communication between partitions explicit can improve performance on a NUMA system. We design and implement a NUMA-aware graph processing framework that embraces the design philosophies of distributed graph processing systems, in particular explicit partitioning and inter-partition communication, and at the same time exploits optimization opportunities specific to single-node systems. We demonstrate up to 13.9× speedup over a state-of-the-art NUMA-aware framework, Polymer, and up to 3.7× scalability on a four-socket NUMA machine using graphs with tens of billions of edges.
I. INTRODUCTION
Graph processing is at the core of a wide range of big data problems, such as analyzing social networks, bioinformatics, transport network analysis, financial and business analytics, to name a few. Additionally, graph processing has found new applications in machine learning and data mining.
Graph algorithms incur highly irregular, data-dependent memory access patterns, which lead to poor data locality. Further, most graph algorithms have a low compute-to-memory access ratio, i.e., they are memory-bound. Many real-world graphs are massive: some have hundreds of billions of edges and therefore a huge memory footprint. For example, the Facebook graph [1] has more than 100 billion edges, which requires over two terabytes of memory.
To process such huge graphs, frameworks like Google's Pregel [2] and GraphLab [3], running on large shared-nothing clusters, have traditionally been used, as they provide large aggregated memory. Most of these frameworks use the Bulk Synchronous Parallel (BSP) processing model [4]. Here, the graph is partitioned explicitly among the nodes, and since these shared-nothing clusters are not cache-coherent, the communication between nodes is explicit. This is in contrast with graph processing frameworks [5], [6] that target single-node shared-memory systems and treat the shared-memory system as if it were based on the Symmetric Multi-Processor (SMP) architecture. The SMP architecture provides shared memory with uniform access time to any memory location; therefore, there is no need for data partitioning.
Non-Uniform Memory Access (NUMA) machines introduce a dilemma. On the one hand, they provide shared memory, so graph processing frameworks that treat a shared-memory system as an SMP architecture can be used directly. On the other hand, the cost of memory accesses is non-uniform (i.e., a socket has faster access to its local memory than to remote memory associated with other sockets), so explicit data placement is needed to obtain maximum performance, and a graph processing framework developed in the style of frameworks that target distributed systems may offer advantages.
This paper explores whether a distributed-memory like middleware that makes graph partitioning and communication between partitions explicit, can improve the performance on a NUMA system.
To test this hypothesis, this paper describes the design and implementation of our NUMA-aware graph processing framework, which treats the NUMA platform as a distributed system and hence embraces its design principles, in particular explicit partitioning and communication. The paper evaluates the framework against state-of-the-art NUMA-oblivious [7] and NUMA-aware [8] graph processing frameworks. It further describes optimization techniques to reduce communication overhead. Finally, it provides a set of practical guidelines for choosing the appropriate partitioning and communication strategies.
Contributions. The contributions of this paper are:
1) Design Explorations: Given their close resemblance, there exist opportunities to transfer the wisdom of distributed graph processing to shared-memory graph processing solutions optimized for NUMA systems (an area that is surprisingly little explored). To this end, we explore a reference distributed design (Section III). In particular, we evaluate the performance of one fully distributed (referred to as the NUMA 2-Box design, §III-B1) and one shared-memory (referred to as the NUMA 1-Box design, §III-B2) inter-partition communication strategy (where each partition belongs to a NUMA domain), and how they compare against a NUMA-oblivious implementation. We found that, on a NUMA platform, a graph processing solution based on the design philosophies that target shared-nothing distributed systems consistently outperforms the state-of-the-art NUMA-oblivious shared-memory solution (Section V). Additionally, we explore two distributed graph partitioning techniques for NUMA, and introduce a new partitioning technique (Section III-A) that leads to load balance of up to 95% and overall performance improvement of up to 5.3×.
2) A New NUMA-aware Design: Based on our design explorations, we propose a new design (referred to as NUMA 0-Box, §III-B3) that takes into account the distributed shared-memory nature of NUMA and consists of explicit graph partitioning and implicit communication. It improves data locality through NUMA-aware partitioning and at the same time minimizes the overhead of remote accesses by overlapping remote memory operations with computation.
Our evaluation shows that this new design improves time-to-solution over the respective NUMA-oblivious implementations by up to 2.37× for BFS, up to 2.27× for SSSP, and up to 1.89× for PageRank. This design, however, did not improve the performance of PageRank over the NUMA 2-Box design (explained in Section V).
3) Evaluation: We evaluate the aforementioned three designs for the following applications: PageRank, BFS and SSSP, using synthetic graphs (with up to 128 billion undirected edges), on an Intel NUMA platform with four sockets and 1.5TB memory. A summary of our findings follows:
(i) We compare the three graph partitioning strategies and find that our proposed approach offers up to 5.3× speedup and 95% load balanced partitions.
(ii) We demonstrate strong scaling on up to four sockets: maximum speedup (over one socket) achieved by PageRank is 3.7×, BFS is 2.9× and SSSP is 2.8×.
(iii) We show RMAT scaling using up to Scale 32 graph. Our BFS implementation achieves a maximum of 39 giga traversed edges per second (GTEPS).
(iv) Finally, we compare our work with a recent NUMA-aware graph processing framework, Polymer, and demonstrate that our solution consistently outperforms Polymer, by up to 13.9× for BFS. Also, our solution is ∼4.4× more memory efficient (Section V).
II. BACKGROUND
A. Non-Uniform Memory Access Architecture
NUMA is a shared-memory architecture consisting of a set of processors (often called sockets), each with its own local memory. Each socket is connected to the other sockets through an interconnect (the QuickPath Interconnect in Intel systems). Accessing socket-local memory takes distinctly less time than accessing remote memory over the interconnect.
NUMA addresses the scalability issues of the SMP or UMA (Uniform Memory Access) architecture, where all the processors share the same memory bus, so accessing any part of memory from any processor takes uniform time. This works well as long as the number of processors is limited. As the number of processors increases, contention on the shared bus increases, which prevents the system from scaling.
NUMA distributes memory among the processors: each processor has fast access to its local memory, while to access the local memory of another processor it has to traverse the slower interconnect. This reduces contention and provides the opportunity to scale the system. Note that scalability comes at the cost of remote memory accesses. Obtaining maximum performance requires careful data placement that avoids or minimizes remote memory accesses, so that most accesses benefit from the lower latency and higher bandwidth of local memory.
We benchmark our testbed, a four-socket Intel Xeon system (details in Section IV-C), by extending the Stream [9] benchmark to measure local and remote, read and write memory bandwidth. We run the experiments for array sizes from 1 MB up to 1 GB (at which point the memory bandwidth is saturated). From Table I, we observe that local memory bandwidth is up to 40% higher than remote. Interestingly, remote sequential access is 6× to 9× faster than local random access. This observation is important for graph processing, which incurs highly irregular memory access patterns.
B. BSP-style Graph Processing
Bulk Synchronous Parallel Model. In the BSP model (Figure 1), the workload is partitioned among the processing units and the processing consists of a sequence of rounds, or supersteps. Each superstep consists of three phases (executed in order): computation, communication, and synchronization.
This sequence of supersteps continues until all the processing units have finished processing their respective partitions. Finally, the result is aggregated from all the partitions.
BSP-style Graph Processing. Graph computation can be modeled as Gather-Apply-Scatter (GAS) [3], where the processing follows a sequence of gather, apply and scatter phases. In the gather phase, vertices collect information from their neighbors; in the apply phase, they use it to update their local state; and in the scatter phase, they communicate their updated value to their neighbors. For example, in PageRank, a vertex computes its rank by gathering the ranks of its in-neighbors, and scatters its new rank to its out-neighbors. The BSP processing model inherently matches this iterative nature of graph algorithms: the gather, apply and scatter phases resemble the three phases of a BSP superstep.
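To make the gather, apply and scatter phases concrete, the following is a minimal single-machine sketch of one PageRank round. It is only an illustration, not the framework's kernel: the function name, the toy graph and the damping factor of 0.85 are assumptions that do not come from the paper.

```python
def pagerank_round(in_neighbors, out_degree, rank, damping=0.85):
    """One gather-apply-scatter round of PageRank (illustrative sketch)."""
    n = len(rank)
    new_rank = [0.0] * n
    for v in range(n):
        # Gather: collect the rank contributions of the in-neighbors of v.
        gathered = sum(rank[u] / out_degree[u] for u in in_neighbors[v])
        # Apply: update the local state of v (damping value is an assumption).
        new_rank[v] = (1.0 - damping) / n + damping * gathered
    # Scatter: the returned ranks are what out-neighbors read in the next round.
    return new_rank


# Hypothetical 3-vertex directed cycle: 0 -> 1 -> 2 -> 0.
in_neighbors = [[2], [0], [1]]
out_degree = [1, 1, 1]
rank = [1.0 / 3] * 3
for _ in range(5):
    rank = pagerank_round(in_neighbors, out_degree, rank)
```

On this symmetric cycle the uniform rank vector is a fixed point, so the ranks stay at 1/3 each and continue to sum to one.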
For distributed graph processing, the initial step is to partition the graph. Each partition has a set of local vertices and edges. Since an edge is associated with two vertices, which could be on different partitions, each partition maintains a map for its remote vertices (vertex 4 in PID0 and vertex 1 in PID1, in Fig. 1). Each partition also maintains algorithm-specific state buffer(s) (such as the rank array in PageRank) for its local vertices (buffer S0 in PID0 and S1 in PID1, in Fig. 1) as well as for remote vertices (the copy of S1 held for remote vertices in PID0, in Fig. 1).
In the context of graph processing, a BSP superstep is composed of the following three steps:
In the computation phase, processing units work in parallel, execute the graph-algorithm-specific kernel on the set of vertices belonging to their partition, and update their local state buffer (buffers S0 and S1 in Fig. 1). The state of active remote vertices is also updated and aggregated locally in the respective remote-vertex buffer (the local copies of S1 and S0 in Fig. 1).
In the communication phase, each partition exchanges messages for the boundary edges and applies the remote updates it receives to its local state buffers. In Fig. 1, both partitions transfer the state of remote vertices so that the local and remote copies of each state buffer agree, i.e., every copy of S0 matches S0 on PID0 and every copy of S1 matches S1 on PID1. Finally, the synchronization phase ensures that all the partitions are updated with the latest state of the remote vertices before the superstep cycle restarts.
The process terminates when all the partitions have finished processing. After termination, the final result is aggregated from all the partitions through a global reduction.
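The three phases of a superstep can be sketched with a level-synchronous BFS over explicit partitions. This is a sequential simulation written for illustration only; the function name, the per-destination outbox layout and the toy graph are assumptions rather than the framework's data structures.

```python
def bsp_bfs(adjacency, owner, num_parts, source):
    """Level-synchronous BFS expressed as BSP supersteps (illustrative sketch)."""
    INF = float("inf")
    level = {v: INF for v in adjacency}  # state buffers, conceptually per owner
    level[source] = 0
    frontier = [set() for _ in range(num_parts)]
    frontier[owner[source]].add(source)
    superstep = 0
    while any(frontier):
        # Computation: each partition expands its own frontier. Updates for
        # vertices owned by another partition are buffered, not applied.
        outbox = [[set() for _ in range(num_parts)] for _ in range(num_parts)]
        next_frontier = [set() for _ in range(num_parts)]
        for pid in range(num_parts):
            for u in frontier[pid]:
                for v in adjacency[u]:
                    if owner[v] != pid:
                        outbox[pid][owner[v]].add(v)
                    elif level[v] == INF:
                        level[v] = superstep + 1
                        next_frontier[pid].add(v)
        # Communication: deliver buffered updates; the owner applies them.
        for src in range(num_parts):
            for dst in range(num_parts):
                for v in outbox[src][dst]:
                    if level[v] == INF:
                        level[v] = superstep + 1
                        next_frontier[dst].add(v)
        # Synchronization: every partition moves to the next superstep together.
        frontier = next_frontier
        superstep += 1
    return level


# Hypothetical 4-vertex undirected graph split across two partitions.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
owner = {0: 0, 1: 0, 2: 1, 3: 1}
levels = bsp_bfs(adjacency, owner, num_parts=2, source=0)
```

The loop ends once no partition has an active vertex, which mirrors the termination condition described above; the returned dictionary plays the role of the aggregated result.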
III. A BSP-STYLE NUMA-AWARE DESIGN
Intuition. As described in the previous section, the BSP processing model inherently matches graph computation. On a NUMA machine, the expected benefits of the BSP processing model are: (i) explicit partitioning, which allows better control over data placement and simpler experimentation with load balancing techniques, and (ii) a choice of communication designs, since the distributed shared-memory nature of NUMA systems allows exploring different designs for inter-partition communication.
A. Graph Partitioning
Distributed graph processing begins with partitioning the graph and allocating it on the processing units. Explicitly partitioning the graph enables implementing and experimenting with different partitioning strategies for better load balance and overall performance improvement. We have implemented two traditional graph partitioning strategies: Random partitioning, popular in distributed systems like Google's Pregel [2] and GraphLab [3], and Sorted/Degree-aware partitioning, used (and shown to perform better than Random partitioning) in heterogeneous distributed systems like Totem [7]. We also introduce a new partitioning strategy that leads to better load balance and high performance.
Random Partitioning. In this technique, vertices are assigned randomly to the processing units, with even share of edges. Random partitioning is a popular strategy among the distributed graph processing systems, like Pregel and GraphLab, targeting graphs having power-law vertex degree distribution [10] . It increases the probability of each partition having equal variability in terms of vertex degree.
Sorted or Degree-aware Partitioning. In this approach the vertices are first sorted by degree, and then they are assigned to the processing units as a contiguous chunk of vertices with even share of edges. This strategy leads to better locality since the likelihood of having most of the neighbors in the same partition increases. It has been shown to perform better than random partitioning in Totem [7] , a heterogeneous distributed system.
New Strategy: Hybrid Partitioning. We observed that Random partitioning leads to better load balance but suffers from poor data locality. Sorted or Degree-aware partitioning, on the other hand, achieves better data locality but suffers from severe load imbalance. With these observations, we designed and implemented a hybrid partitioning technique that addresses both problems. In the first step we randomly assign the vertices to the processing units, as in Random partitioning. Then we sort the vertex list of each individual subgraph by degree. Random partitioning makes sure that each partition has equal variability in terms of vertex degree (thereby increasing the chance that the generated load is well balanced). Sorting the individual vertex lists improves data locality. We discuss its performance against the traditional strategies later, in Fig. 5.
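The two steps of the hybrid strategy can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the even share of edges is approximated by filling partitions in shuffled order until each reaches its share, and the descending sort order is a choice made here, not something the text specifies.

```python
import random


def hybrid_partition(degree, num_parts, seed=0):
    """Random assignment followed by a per-partition sort by degree (sketch)."""
    rng = random.Random(seed)
    vertices = list(range(len(degree)))
    rng.shuffle(vertices)  # step 1: random placement of vertices
    target = sum(degree) / num_parts  # even share of edges per partition
    parts = [[] for _ in range(num_parts)]
    pid, load = 0, 0
    for v in vertices:
        # Move to the next partition once the current one holds its share.
        if load >= target and pid < num_parts - 1:
            pid, load = pid + 1, 0
        parts[pid].append(v)
        load += degree[v]
    # Step 2: sort each partition's vertex list by degree (descending is assumed).
    for part in parts:
        part.sort(key=lambda v: degree[v], reverse=True)
    return parts


# Hypothetical skewed degree sequence for 8 vertices.
degree = [8, 1, 1, 2, 4, 1, 2, 1]
parts = hybrid_partition(degree, num_parts=2)
```

Sorting only within a partition keeps the random mix of high-degree and low-degree vertices across partitions, which is the property the strategy relies on for load balance.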
B. Design Opportunities for NUMA-aware Graph Processing
Since we partition the graph and place one partition per NUMA node, partitions can be processed in parallel during the computation phase, with all memory accesses served from local memory. In the communication phase, since a NUMA system is essentially a shared-memory system, we have the opportunity to explore optimizations that reduce communication overhead. We explore three alternatives that address the motivation of this paper: to what degree can designing for NUMA as for a distributed-memory system improve performance (by explicitly exposing locality), in spite of the inherent overheads (message exchange), in an application-agnostic way?
1) NUMA 2-Box Design: In this design we fully embrace the design philosophy of a distributed system, thereby treating NUMA as a shared-nothing distributed system in which nodes are independent and connected to each other through the interconnect. As shown in Fig. 2, the design has two message buffers (outbox and inbox). In the computation phase, each partition updates its local state buffer for local vertices, while updates for remote vertices are aggregated and stored locally in the respective outbox. In the communication phase, the partitions transfer each outbox to the corresponding remote inbox (solid blue lines in Fig. 2) and apply the remote updates from the inbox to the local state buffer (red lines from inbox to buffer S).
The advantages of this design are: (i) zero remote memory accesses, as all the accesses are local in both the computation and communication phases and the message buffer (outbox/inbox) is explicitly copied to the remote partition's message buffer, and (ii) message reduction, since updates destined for the same remote vertex are aggregated in the outbox before the transfer.
2) NUMA 1-Box Design: Since NUMA is a shared-memory system, instead of having two explicit message boxes, one at the source and another at the destination, only one buffer needs to be physically allocated, and the pointer to the box can be swapped during communication. In this design we allocate only one message buffer, at the source, and assign it to the outbox, because the outbox of the source partition is the inbox of the destination partition. The computation phase is the same as in the previous design. In the communication phase, however, the design only swaps the pointer from outbox to inbox, and performs sequential/coalesced reads/writes for the remote updates.
The advantages of this design are: (i) explicit message box transfer is not required, and (ii) message reduction, the same as in the previous design. However, it suffers from remote sequential accesses to the message buffer, which are bounded by the slow local random accesses that update the local state buffer.
3) NUMA 0-Box Design: In this design we consider the fact that NUMA is a distributed shared-memory system. We do explicit partitioning as if NUMA is a distributed system, but we access the state buffers as if we are on a shared-memory system. As shown in Fig. 2 [...] Fig. 3). But, it performs poorly for algorithms like PageRank where there is a message via every boundary edge, compared to NUMA 2-Box design, where number of messages equals the number of remote vertices (not edges). No message aggregation increases the number of remote memory accesses severely.
[Fig. 3: per-partition (pid0 to pid3) frequency in millions (log scale) for supersteps ss1 to ss8.]
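The difference in message volume between the 0-Box and 2-Box designs for an all-active algorithm such as PageRank can be illustrated by counting remote updates on a toy graph. The function name and the example edges are hypothetical; the sketch only counts messages and does not model timing or memory bandwidth.

```python
def count_remote_updates(edges, owner):
    """Remote updates in one superstep where every vertex is active (sketch)."""
    boundary = [(u, v) for u, v in edges if owner[u] != owner[v]]
    # 0-Box: each boundary edge causes one direct update to remote memory.
    zero_box = len(boundary)
    # 2-Box: updates are aggregated in the outbox, so each remote vertex is
    # sent at most once per source partition.
    two_box = len({(owner[u], v) for u, v in boundary})
    return zero_box, two_box


# Hypothetical directed edges; vertices 0-2 live on partition 0, 3-4 on partition 1.
edges = [(0, 3), (1, 3), (2, 3), (0, 4), (3, 0)]
owner = [0, 0, 0, 1, 1]
zero_box, two_box = count_remote_updates(edges, owner)
```

Here three boundary edges point at the same remote vertex, so aggregation sends one message where the 0-Box design performs three remote writes.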
IV. EXPERIMENT DESIGN

A. System Implementation
To implement our NUMA-aware design, we extend a state-of-the-art NUMA-oblivious graph processing framework, Totem [7], which presumes SMP-based CPUs. Totem does hybrid graph processing on CPUs and GPUs; since GPUs have discrete memory, it follows a distributed-systems design. Similar to distributed systems, it follows the Bulk Synchronous Parallel processing model and communicates between CPU and GPU through message buffers. We use Totem for two reasons. First, in our previous study [11] we observed that it outperforms state-of-the-art graph processing frameworks, including Intel's GraphMat [12] and Galois [5], by up to an order of magnitude. Second, and most importantly, its processing model, BSP, matches our needs.
Fig. 4 presents the high-level design of our framework. As input, the user provides the graph (workload), the graph kernel (e.g., BFS, PageRank), the partitioning and communication strategy, and optimization options. We allocate all the data structures belonging to a partition on its respective NUMA node using the libnuma library. To launch concurrent computation on the partitions, we leverage the nested parallelism offered by OpenMP. In the first level of parallelism, we create as many threads as there are NUMA nodes. Each of these threads spawns child threads on its NUMA node, equal in number to the cores available on that node. Similarly, during the communication phase, especially in the NUMA 2-Box design, the parent thread on each NUMA node transfers the content of its outbox to the inbox of the remote partition, and then the child threads apply the updates from the inbox to the respective local state buffers. This process continues until the global finish flag is set.
In our experiments we evaluate how our NUMA-aware framework performs against this state-of-the-art NUMA-oblivious framework. Further, we run the NUMA-oblivious framework with the numactl interleave policy, which allocates memory on all the NUMA nodes in a round-robin fashion, instead of Linux's first-touch policy, which allocates data on the memory node of the thread that first touches it.
B. Graph Algorithms
We consider PageRank, Top-Down Breadth-First Search (BFS-TD), Direction-Optimized Breadth-First Search (BFS-DO) [13], and Single-Source Shortest Path (SSSP). We use these algorithms since they have been widely studied in the context of high-performance graph processing systems and have been used in past studies [5]-[8]. BFS and SSSP are also used as benchmarks for the Graph500 competition, to rank supercomputers for data-intensive, irregular applications.
C. Testbed
To explore benefits of our designs we use a four socket Intel Xeon machine (E7-4870 v2, Ivy Bridge), having 60 cores, 1536 GB of Memory and L3 Cache of 120 MB. Table I depicts the key memory characteristics of the machine.
D. Workload
We consider large Recursive MATrix (RMAT) scale-free graphs from scale 28 to 32. All graphs are generated using the RMAT generator [10] with the following parameters: (A,B,C) = (0.57, 0.19, 0.19) and an average vertex degree of 16. All the graphs were made undirected, following the Graph500 standard. We use RMAT graphs to evaluate our design for two reasons. First, they are adopted by the widely accepted Graph500 benchmark. Second, RMAT graphs have characteristics similar to real-world graphs: they have a low diameter and a 'power-law' [10] (highly heterogeneous) vertex degree distribution. We use 64-bit vertex and edge ids to store the graph in memory, as we store the partition id in the highest-order bits. The largest graph we run, RMAT32, has an edge list of size 1TB. For evaluation, we run the experiments 20 times for each workload and report the average. For BFS and SSSP, we use a different randomly generated source vertex for each run. For SSSP, we use weights in the range (0, 1M] so as to have a highly diverse weight distribution. For PageRank, we run each experiment with five PageRank iterations and normalize the execution time to one iteration.
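The reported edge-list sizes can be reproduced with simple arithmetic. The sketch below assumes that a scale-s graph has 2^s vertices, that the edge list holds 16 edges per vertex, and that each edge is stored as two 64-bit ids; the function name and the 8-byte weight are assumptions, chosen because they reproduce the sizes stated in the paper.

```python
def edge_list_bytes(scale, avg_degree=16, id_bytes=8, weight_bytes=0):
    """Approximate edge-list size for an RMAT graph (illustrative sketch)."""
    num_vertices = 2 ** scale
    num_edges = num_vertices * avg_degree
    # Each edge stores a source id and a destination id, plus an optional weight.
    return num_edges * (2 * id_bytes + weight_bytes)


GIB = 2 ** 30
unweighted_gib = {s: edge_list_bytes(s) // GIB for s in range(28, 33)}
weighted_rmat29_gib = edge_list_bytes(29, weight_bytes=8) // GIB
```

Under these assumptions the sizes come out to 64 GB for RMAT28 and 1 TB for RMAT32, and a weighted RMAT29 edge list comes out to 192 GB, consistent with the figures quoted for the scaling experiments.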
E. Experimental Methodology
1) Partitioning:
Explicit partitioning enables implementation of, and experimentation with, different partitioning strategies to achieve better load balancing and data locality. We experiment with the three partitioning techniques described in Section III-A. We observe that careful partitioning helps in achieving better load balance and overall performance gain, regardless of the graph application. We define load imbalance as the ratio of the computation time of the slowest partition to that of the fastest partition.
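The load imbalance metric defined above amounts to a single ratio. In the sketch below the function name and the per-partition times are hypothetical values chosen for illustration, not measurements from the paper.

```python
def load_imbalance(partition_times):
    """Ratio of the slowest partition's compute time to the fastest one's."""
    return max(partition_times) / min(partition_times)


# Hypothetical per-partition computation times (seconds) on four sockets.
times = [1.00, 1.01, 1.02, 1.03]
imbalance = load_imbalance(times)
```

A value of 1.0 means perfectly balanced partitions; a value of 1.03, as in this toy input, means the slowest partition takes 3% longer than the fastest.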
2) Performance Evaluation of Designs:
We evaluate the NUMA-aware designs introduced earlier by comparing them against NUMA-oblivious Totem and against Totem run with numactl, and report the execution time. Consistent with standard practice in the domain, 'execution time' does not include time spent in pre- or post-processing steps such as graph loading, partitioning and result aggregation. We further evaluate our designs for strong scaling with respect to resources. For scalability, we consider the largest graph that fits in the memory of one socket (384 GB). For PageRank and BFS we consider RMAT30, with an edge-list size of 256 GB, and for SSSP we consider weighted RMAT29, with a weighted edge-list size of 192 GB.
3) Comparison with Existing Work -Polymer:
Finally, we compare against Polymer [8], the only NUMA-aware single-node graph processing framework (to the best of our knowledge). It has been shown to perform better than the state-of-the-art single-node graph processing frameworks Galois [5], Ligra [6], and X-Stream [14].
V. EXPERIMENTAL RESULTS
In this section we present and discuss the performance of our NUMA-aware framework, described in the previous sections, through experimental results with the graph algorithms and workloads mentioned above.
Q1. What are the benefits of graph partitioning on a NUMA machine?
Fig. 5 shows the impact of the two traditional partitioning strategies and our hybrid strategy on load balancing, using the RMAT31 graph, for the PageRank (which has a fixed workload in every superstep) and BFS-DO (which has a dynamic workload per superstep) algorithms. For PageRank, the Random strategy leads to a load imbalance of only 1.03× (i.e., the slowest partition is only 3% slower than the fastest partition). The Sorted/Degree-aware strategy suffers a load imbalance of 1.46×, but performs 1.69× better than Random partitioning. This is because the Random strategy increases the probability that each partition has equal variability in terms of vertex degree, which explains the better load balance, but it also increases the probability that the neighbors of a vertex are scattered in memory, leading to poor data locality. With the Sorted strategy, the first partition gets the densest subgraph (containing a few high-degree vertices, e.g., PID-0 in Figure 5) and the last partition gets the sparsest subgraph (containing most of the low-degree vertices). The Sorted strategy leads to better data locality and overall performance for PageRank, but since the dense partition gets processed faster than the sparse partition, it leads to higher load imbalance than the Random strategy. The Hybrid strategy achieves a load imbalance of only 1.05×. This leads to an overall performance improvement of 1.18× and 2× against the Sorted and Random strategies, respectively.
For BFS-DO, where the workload changes drastically in every superstep, we observe a significantly higher load imbalance, of 10.1×, with the Sorted strategy. With the Sorted strategy, the initial three supersteps are executed with the Top-Down kernel, followed by three supersteps with the Bottom-Up kernel, and the remaining supersteps again with the Top-Down kernel. In the Top-Down approach, the frontier builds up quickly for the dense partition (since it has high-degree vertices); hence, in superstep 3, the dense partition (PID-0) takes a significant amount of time to process the huge frontier. The Random strategy achieves a load imbalance of 1.35×, while the Hybrid strategy achieves a load imbalance of only 1.13×. Since BFS-DO is a memory-bound and cache-sensitive algorithm, better load balance and locality lead to better performance. The Random strategy performs 3.4× better than the Sorted strategy, while the Hybrid strategy performs 5.3× and 1.55× better than the Sorted and Random strategies, respectively.
Since our hybrid strategy achieves better overall performance, in all the following experiments, we use hybrid partitions to evaluate the NUMA-aware designs.
Key Insights Gained from Evaluating Partitioning Strategies
i) Better load balance does not mean better performance (as observed in Random vs Sorted strategy).
ii) The hybrid partitioning strategy strikes the right balance between load balance and locality, hence offers improved performance.
iii) Graph partitioning is an NP-Complete problem. There exist sophisticated partitioning strategies, such as the vertex-cut approach [6], that offer improved load balancing across partitions but are costly: with them, graph partitioning takes much longer than with the simpler partitioning techniques we explored in this paper.
Q2. How does the explicit communication design perform compared to a NUMA-agnostic solution?
As observed in Fig. 6, for the RMAT31 graph, NUMA 2-Box is 2.07× and 1.63× faster than the NUMA-oblivious framework Totem for the PageRank and BFS-DO algorithms, respectively.
Since PageRank has a high compute-to-memory-access ratio, most of the time is spent in the computation phase. Further, because of the explicit 2-Box communication, all the remote updates are sent in a batch and all the accesses are local (in both the computation and communication phases). As a result, only 3% of the execution time is spent in the communication phase.
BFS-DO has a low compute-to-memory-access ratio and, as shown previously in Fig. 3, relatively few remote vertices have messages in each superstep. This leads to a higher communication cost of 26.9% of the execution time. Further, Fig. 7 shows that NUMA 2-Box performs better than both Totem and numactl for all the algorithms (up to 2.08×, 1.88×, and 1.91× against Totem, and 20%, 26%, and 33% against numactl, for PageRank, BFS-DO and BFS-TD, respectively), except SSSP. For SSSP, it performs up to 33% better than Totem, but is up to 91% slower than numactl. As mentioned earlier, the SSSP implementation includes an optimization that activates neighbors within the same iteration to reduce the number of supersteps, and this optimization was designed assuming an SMP-based architecture. As a result, the partition holding the source vertex spends much more time in the computation phase than the others during the initial few supersteps. Note that our NUMA-aware design is application-agnostic and we do not modify the applications.
Q3. Can we further optimize the explicit communication step by taking into account the shared-memory architecture?
In short, no. Our experiments with the NUMA 1-Box design, as shown in Fig. 6 and Fig. 7, suggest that it is better to use the NUMA 2-Box design than NUMA 1-Box, although NUMA 1-Box does perform better than Totem and is competitive with numactl. The NUMA 1-Box design takes more time during the communication phase, as the (relatively) fast remote sequential accesses are bounded by slow local random updates to the state buffer.
Q4. Can we create a hybrid design of distributed and shared-memory SMP system? How well does it perform?
The NUMA 0-Box design performs better for algorithms like BFS and SSSP, where in each superstep there are messages from only selected boundary edges, not from all of them. As shown in Fig. 6, for BFS-DO, overlapping computation and communication (by directly updating the remote vertices) in every superstep leads to better performance. Note that the gain, almost equivalent to the communication time in NUMA 2-Box, is achieved because of implicit communication, since the number of remote updates in every superstep is much smaller than the number of remote vertices (∼22× smaller for the RMAT31 graph), as shown previously in Fig. 3.
In both Fig. 6 and Fig. 7 , NUMA 0-Box design performs better than Totem, numactl as well as other NUMA designs 
Key Insights from Communication Designs
i) Although explicit communication can be perceived as extravagant for a cache-coherent shared memory system, its performance benefits on a NUMA system are indisputable.
ii) Performance gain in NUMA 2-Box, compared to NUMA 1-Box, comes from doing only local accesses during computation, and copying data in bulk from source partition to destination partition in a sequential manner. In NUMA 1-Box design, even though the remote updates happen sequentially, it is bounded by slow local random writes to the local state buffers during computation.
iii) The NUMA 2-Box design leads to zero remote memory accesses and reduces the number of messages sent during the communication phase to only the number of remote vertices, regardless of the number of edges associated with them.
v) For BFS and SSSP, the partition with the source vertex ends up spending a lot of time in the initial supersteps in the NUMA 2-Box and 1-Box designs, as the active vertices are confined to the partition with the source vertex. This degrades the performance of the NUMA 2-Box and 1-Box designs for these algorithms. For PageRank, in contrast, NUMA 0-Box ends up doing remote random accesses for remote edges, and no message aggregation like that of the NUMA 2-Box design is possible, which degrades its performance.
Q5. How do the NUMA-aware designs scale?
Here we evaluate scalability of our designs through strong scaling experiment on four sockets of our testbed, using the largest graph that could fit into the memory of one socket. Fig. 8 presents that our NUMA design fills the performance gap left by Totem by scaling to as much as 3.7×, 2.9×, 2.7× and 2.8× compared to 2.0×, 1.7×, 2.1×, and 1.3× achieved by Totem for PageRank, BFS-DO, BFS-TD and SSSP algorithms, respectively.
Q6. How does this work perform compared to a NUMA-optimized solution, Polymer?
We compare our work with Polymer, which is, to the best of our knowledge, the only NUMA-optimized framework. Table II summarizes the performance of the best-performing NUMA design against Polymer. Our design outperforms Polymer by up to 3.5×, 13.9×, and 2.6× for the PageRank, BFS-TD (Polymer does not have BFS-DO), and SSSP algorithms, respectively. Polymer does vertex-cut partitioning, and consumes ∼5.7× more memory than the size of the respective edge-list of the graph, and ∼4.4× more memory than our NUMA designs.
APPENDIX B ARTIFACT APPENDIX
A. Abstract
This artifact contains the NUMA-aware graph processing software, along with the scripts used to produce the results in this paper. This artifact: (i) supports execution in NUMA-oblivious, numactl and NUMA-aware mode by setting a knob in the input, (ii) provides options for experimenting with different partitioning and communication strategies, and (iii) includes the following algorithms: PageRank, BFS (Top-Down and Direction-Optimized implementations) and SSSP.
B. Artifact check-list (meta-information)
• Algorithms: PageRank, Breadth-First Search (Top-Down and Direction-Optimized), and Single-Source Shortest Path.
• Program: C/C++ code
• Compilation: GCC 4.8.4 (or above) with -O3 flag, CUDA 8.
• Data set: undirected RMAT graphs (edge-list size from 64GB (RMAT28) to 1TB (RMAT32)).
2) Hardware dependencies: A NUMA machine (at least 4 sockets is recommended). The artifact has been tested on Intel Xeon multi-socket servers. It does not require a GPU installed on the server. The testbed we used is a 4-socket Intel Xeon CPU E7-4870 v2 machine, with 60 cores and 1.5 TB of DDR3 memory.
3) Software dependencies: The software extends the TOTEM framework, which requires at least CUDA 8 (however, a GPU is not required), the Intel Threading Building Blocks, and GCC version 4.8.3 or later. Compilation additionally requires libnuma. The software was tested on Ubuntu 18.04 and Fedora 22.
4) Data sets:
All the datasets are generated using the RMAT generator. The RMAT generator is available in TOTEM as well (please refer to the documentation). All the graphs were made undirected by creating two directed edges, one in each direction, for each undirected edge. Real-world/new graphs should be converted to the format described in the documentation. A sample dataset, undirected RMAT28 graph, is available in TOTEM format at http://ece.ubc.ca/ ∼ taasawat/ RMAT28 undir vid32eid64.tbin.
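The conversion to an undirected graph described above can be sketched as follows. This is only an illustration of the idea on an in-memory edge list; the function name is hypothetical, the TOTEM binary format is not handled, and duplicate edges and self-loops are deliberately left untouched.

```python
def make_undirected(edges):
    """Emit two directed edges, one per direction, for each input edge.

    edges: iterable of (u, v) pairs describing undirected edges
    Returns a list of directed (src, dst) pairs.
    """
    directed = []
    for u, v in edges:
        directed.append((u, v))
        directed.append((v, u))
    return directed


# Example: a path 0 - 1 - 2 becomes four directed edges.
path = make_undirected([(0, 1), (1, 2)])
```

The resulting edge list is twice as long as the input, which is worth keeping in mind when estimating the memory footprint of a converted graph.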
D. Installation
Follow the instructions in the documentation to clone the software from GitHub, and build it using the dedicated makefile. The datasets should be stored in the 'data/' folder.
E. Experiment workflow
Build the executable from the root directory of the git repo using the available makefile. For convenience, scripts named run X.sh are available, where 'X' is the benchmark (pagerank/bfs/sssp). These scripts include commands to run NUMA-oblivious Totem, Totem with numactl, and the NUMA-aware designs, with different optimization options and datasets. For scalability experiments, the names of the above scripts are suffixed with the ' scalability' keyword.
F. Evaluation and expected result
The program first outputs a summary of the input and of the selected optimization options, followed by a summary of the time consumed in the different phases of the execution, along with the execution rate (traversed edges per second). For PageRank, it outputs the execution time of five PageRank rounds, by default.
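As a reading aid for the reported execution rate, the sketch below shows how a traversed-edges-per-second figure is typically derived from an edge count and a measured time. The function name and its parameters are hypothetical, and the assumption that every edge is traversed once per round applies to PageRank only; the artifact may count traversed edges differently for BFS and SSSP.

```python
def traversed_edges_per_second(traversed_edges, seconds, rounds=1):
    """Execution rate in traversed edges per second.

    traversed_edges: edges traversed in one round (assumed, not measured here)
    seconds:         total measured execution time for all rounds
    rounds:          number of rounds covered by the measurement
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return traversed_edges * rounds / seconds


# Hypothetical example: five PageRank rounds over 1e9 edges in 10 seconds.
rate = traversed_edges_per_second(1_000_000_000, 10.0, rounds=5)
```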
The performance of NUMA-oblivious Totem, numactl and NUMA-aware designs depends on the specific NUMA server used, but the relative performances highlighted in the paper are expected to be the same.
G. Notes
To learn more about our NUMA-aware graph processing framework, please visit our web page: http://netsyslab.ece.ubc.ca/wiki/index.php/Totem.
