There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the results for the Hyperlink graph use distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory.
INTRODUCTION
Today, the largest publicly-available graph, the Hyperlink Web graph, consists of over 3.5 billion vertices and 128 billion edges [61]. This graph presents a significant computational challenge for both distributed and shared memory systems. Indeed, very few algorithms have been applied to this graph, and those that have often take hours to run [26, 45, 54], with the fastest times requiring between 1-6 minutes using a supercomputer [84, 85]. In this paper, we show that a wide range of fundamental graph problems can be solved quickly on this graph, often in minutes, on a single commodity shared-memory machine with a terabyte of RAM. 1 For example, our k-core implementation takes under 3.5 minutes on 72 cores, whereas Slota et al. [85] report a running time of about 6 minutes for approximate k-core on a supercomputer with over 8000 cores. They also report that they can identify the largest connected component on this graph in 63 seconds, whereas we can identify all connected components in 38.3 seconds. Another recent result by Stergiou et al. [86] solves connectivity on the Hyperlink 2012 graph in 341 seconds on a 1000 node cluster with 12000 cores and 128TB of RAM. Compared to this result, our implementation is 8.9x faster on a system with 128x less memory and 166x fewer cores. However, we note that they are able to process a significantly larger private graph that we would not be able to fit into our memory footprint. A more complete comparison between our work and existing work, including disk-based systems [26, 45, 54], is given in Section 6.
SPAA '18, July 16-18, 2018, Vienna, Austria. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-5799-9/18/07. https://doi.org/10.1145/3210377.3210414
Importantly, all of our implementations have strong theoretical bounds on their work and depth. There are several reasons that algorithms with good theoretical guarantees are desirable. For one, they are robust as even adversarially-chosen inputs will not cause them to perform extremely poorly. Furthermore, they can be designed on pen-and-paper by exploiting properties of the problem instead of tailoring solutions to the particular dataset at hand. Theoretical guarantees also make it likely that the algorithm will continue to perform well even if the underlying data changes. Finally, careful implementations of algorithms that are nearly work-efficient can perform much less work in practice than work-inefficient algorithms. This reduction in work often translates to faster running times on the same number of cores [28] . We note that most running times that have been reported in the literature on the Hyperlink Web graph use parallel algorithms that are not theoretically-efficient.
In this paper, we present implementations of parallel algorithms with strong theoretical bounds on their work and depth for connectivity, biconnectivity, strongly connected components, low-diameter decomposition, maximal independent set, maximal matching, graph coloring, single-source shortest paths, betweenness centrality, minimum spanning forest, k-core decomposition, approximate set cover, and triangle counting. We describe the techniques used to achieve good performance on graphs with billions of vertices and hundreds of billions of edges and share experimental results for the Hyperlink 2012 and Hyperlink 2014 Web crawls, the largest and second largest publicly-available graphs, as well as several smaller real-world graphs at various scales. Some of the algorithms we describe are based on previous results from Ligra, Ligra+, and Julienne [28, 76, 80], and other papers on efficient parallel graph algorithms [11, 40, 81]. However, most existing implementations were changed significantly in order to be more memory efficient. Several algorithm implementations for problems like strongly connected components, minimum spanning forest, and biconnectivity are new, and required implementation techniques to scale that we believe are of independent interest. We also had to extend the compressed representation from Ligra+ [80] to ensure that our graph primitives for mapping, filtering, reducing and packing the neighbors of a vertex were theoretically-efficient. We note that using compression techniques is crucial for representing the symmetrized Hyperlink 2012 graph in 1TB of RAM, as storing this graph in an uncompressed format would require over 900GB to store the edges alone, whereas the graph requires 330GB in our compressed format (less than 1.5 bytes per edge). We show the running times of our algorithms on the Hyperlink 2012 graph as well as their work and depth bounds in Table 1. To make it easy to build upon or compare to our work in the future, we describe a benchmark suite containing our problems with clear I/O specifications, which we have made publicly-available. 2 We present an experimental evaluation of all of our implementations, and in almost all cases, the numbers we report are faster than any previous performance numbers for any machines, even much larger supercomputers. We are also able to apply our algorithms to the largest publicly-available graph, in many cases for the first time in the literature, using a reasonably modest machine. Most importantly, our implementations are based on reasonably simple algorithms with strong bounds on their work and depth. We believe that our implementations are likely to scale to larger graphs and lead to efficient algorithms for related problems.
[Table caption, partial] ... is the 72-core time using hyper-threading, and (SU) is the parallel speedup. Theoretical bounds for the algorithms and the variant of the MT-RAM used are shown in the last three columns. We mark times that did not finish in 5 hours with -. *SCC was run on the directed version of the graph. † We say that an algorithm has O (f (n)) cost with high probability (w.h.p.) if it has O (k · f (n)) cost with probability at least $1 - 1/n^k$.
RELATED WORK
Parallel Graph Algorithms. Parallel graph algorithms have received significant attention since the start of parallel computing, and many elegant algorithms with good theoretical bounds have been developed over the decades (e.g., [3, 8, 21, 32, 44, 49, 53, 62-64, 68, 69, 75, 87]). A major goal in parallel graph algorithm design is to find work-efficient algorithms with polylogarithmic depth. While many suspect that work-efficient algorithms may not exist for all parallelizable graph problems, as inefficiency may be inevitable for problems that depend on transitive closure, many problems that are of practical importance do admit work-efficient algorithms [48]. For these problems, which include connectivity, biconnectivity, minimum spanning forest, maximal independent set, maximal matching, and triangle counting, giving theoretically-efficient implementations that are simple and practical is important, as the amount of parallelism available on modern systems is still modest enough that reducing the amount of work done is critical for achieving good performance. Aside from intellectual curiosity, investigating whether theoretically-efficient graph algorithms also perform well in practice is important, as theoretically-efficient algorithms are less vulnerable to adversarial inputs than ad-hoc algorithms that happen to work well in practice.
2 https://github.com/ldhulipala/gbbs
Unfortunately, some problems that are not known to admit work-efficient parallel algorithms due to the transitive-closure bottleneck [48], such as strongly connected components (SCC) and single-source shortest paths (SSSP), are still important in practice. One method for circumventing the bottleneck is to give work-efficient algorithms for these problems that run in depth proportional to the diameter of the graph. As real-world graphs have low diameter, and theoretical models of real-world graphs predict a logarithmic diameter, these algorithms offer theoretical guarantees in practice [12, 72]. Other problems, like k-core, are P-complete [4], which rules out polylogarithmic-depth algorithms for them unless P = NC [38]. However, even k-core admits an algorithm with strong theoretical guarantees that is efficient in practice [28].
Parallel Graph Processing Frameworks. Motivated by the need to process very large graphs, there have been many graph processing frameworks developed in the literature (e.g., [36, 52, 56, 65, 76] among many others). We refer the reader to [59, 89] for surveys of existing frameworks. Several recent graph processing systems evaluate the scalability of their implementations by solving problems on massive graphs [26, 28, 45, 54, 84, 86]. All of these systems report running times on either the Hyperlink 2012 graph or the Hyperlink 2014 graph, two web crawls released by the WebDataCommons that are the largest and second largest publicly-available graphs respectively. We describe these recent systems and give a detailed comparison of how our implementations compare to their codes in Section 6. We review existing parallel graph algorithm benchmarks in the full version of our paper [29].
PRELIMINARIES
Graph Notation. We denote an unweighted graph by G = (V , E), where V is the set of vertices and E is the set of edges in the graph. A weighted graph is denoted by G = (V , E, w ), where w is a function which maps an edge to a real value (its weight). The number of vertices in a graph is n = |V |, and the number of edges is m = |E|. Vertices are assumed to be indexed from 0 to n − 1. For undirected graphs we use N (v) to denote the neighbors of vertex v and deg(v) to denote its degree. For directed graphs, we use in-deg(v) and out-deg(v) to denote the in-degree and out-degree of a vertex v. We use diam(G) to refer to the diameter of the graph, or the longest shortest path distance between any vertex s and any vertex v reachable from s. ∆ is used to denote the maximum degree of the graph. We assume that there are no self-edges or duplicate edges in the graph. We refer to graphs stored as a list of edges as being stored in the edgelist format and the compressed-sparse column and compressed-sparse row formats as CSC and CSR respectively.
Atomic Primitives. We use three common atomic primitives in our algorithms: test-and-set (TS), fetch-and-add (FA), and priority-write (PW). A test-and-set(&x ) checks if x is 0, and if so atomically sets it to 1 and returns true; otherwise it returns false. A fetch-and-add(&x ) atomically returns the current value of x and then increments x. A priority-write(&x, v, p) atomically compares v with the current value of x using the priority function p, and if v has higher priority than the value of x according to p it sets x to v and returns true; otherwise it returns false.
Model. In the analysis of algorithms we use the following work-depth model, which is closely related to the PRAM but better models current machines and programming paradigms that are asynchronous and allow dynamic forking. We can simulate the model on the CRCW PRAM equipped with the same operations with an additional $O(\log^* n)$ factor in the depth due to load-balancing. Furthermore, a PRAM algorithm using P processors and T time can be simulated in our model with PT work and T depth.
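The semantics of the three atomic primitives can be illustrated with a small sketch. This is only an illustration under stated assumptions: the class `AtomicCell`, the helper `concurrent_count`, and the `delta` parameter are hypothetical names that do not appear in our implementations, and a lock stands in for the hardware atomicity that a real implementation would rely on.

```python
import threading


class AtomicCell:
    """One memory location; a lock stands in for hardware atomicity (assumption)."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def test_and_set(self):
        # True only for the caller that flips the cell from 0 to 1.
        with self._lock:
            if self.value == 0:
                self.value = 1
                return True
            return False

    def fetch_and_add(self, delta=1):
        # Returns the value held before the increment.
        with self._lock:
            old = self.value
            self.value += delta
            return old

    def priority_write(self, v, higher_priority):
        # higher_priority(a, b) is True when a has higher priority than b.
        with self._lock:
            if higher_priority(v, self.value):
                self.value = v
                return True
            return False


def concurrent_count(num_threads=8, increments=1000):
    """Many threads bump one counter; no increment is lost."""
    counter = AtomicCell(0)

    def worker():
        for _ in range(increments):
            counter.fetch_and_add()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

A priority-write with min, as used later for labeling, corresponds to passing `lambda a, b: a < b` as the priority function.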
The Multi-Threaded Random-Access Machine (MT-RAM) [10] consists of a set of threads that share an unbounded memory. Each thread is essentially equivalent to a Random Access Machine: it works on a program stored in memory, has a constant number of registers, and has standard RAM instructions (including an end to finish the computation). The MT-RAM extends the RAM with a fork instruction that takes a positive integer k and forks k new child threads. Each child thread receives a unique integer in the range [1, . . . , k] in its first register and otherwise has the identical state as the parent, which has a 0 in that register. They all start by running the next instruction. When a thread performs a fork, it is suspended until all the children terminate (execute an end instruction). A computation starts with a single root thread and finishes when that root thread ends. This model supports what is often referred to as nested parallelism. If the root thread never does a fork, it is a standard sequential program.
A computation can be viewed as a series-parallel DAG in which each instruction is a vertex, sequential instructions are composed in series, and the forked subthreads are composed in parallel. The work of a computation is the number of vertices and the depth is the length of the longest path in the DAG. We augment the model with three atomic instructions that are used by our algorithms: test-and-set (TS), fetch-and-add (FA), and priority-write (PW) and discuss our model with these operations as the TS, FA, and PW variants of the MT-RAM. As is standard with the RAM model, we assume that the memory locations and registers have at most O (log M ) bits, where M is the total size of the memory used. More details about the model can be found in [10].
Parallel Primitives. The following parallel procedures are used throughout the paper. Scan takes as input an array A of length n, an associative binary operator ⊕, and an identity element ⊥ such that ⊥ ⊕ x = x for any x, and returns the array $(\bot, \bot \oplus A[0], \bot \oplus A[0] \oplus A[1], \ldots, \bot \oplus_{i=0}^{n-2} A[i])$ as well as the overall sum, $\bot \oplus_{i=0}^{n-1} A[i]$. Scan can be done in O (n) work and O (log n) depth (assuming ⊕ takes O (1) work) [44]. Reduce takes an array A and a binary associative function f and returns the sum of the elements in A with respect to f . Filter takes an array A and a predicate f and returns a new array containing a ∈ A for which f (a) is true, in the same order as in A. Reduce and filter can both be done in O (n) work and O (log n) depth (assuming f takes O (1) work).
Ligra, Ligra+, and Julienne. We make use of the Ligra, Ligra+, and Julienne frameworks for shared-memory graph processing in this paper and review components from these frameworks here [28, 76, 80]. Ligra provides data structures for representing a graph G = (V , E) and vertexSubsets (subsets of the vertices). We make use of the edgeMap function provided by Ligra, which we use for mapping over edges. edgeMap takes as input a graph G = (V , E), a vertexSubset U , and two boolean functions F and C. edgeMap applies F to (u, v) ∈ E such that u ∈ U and C (v) = true (call this subset of edges $E_a$), and returns a vertexSubset U ′ where u ∈ U ′ if and only if (u, v) ∈ $E_a$ and F (u, v) = true. F can side-effect data structures associated with the vertices. edgeMap runs in $O(\sum_{u \in U} \deg(u))$ work and O (log n) depth assuming F and C take O (1) work. edgeMap applies either a sparse or a dense method based on the number of edges incident to the current frontier. Both methods run in $O(\sum_{u \in U} \deg(u))$ work and O (log n) depth. We note that in our experiments we use an optimized version of the dense method which examines in-edges sequentially and stops once C returns false. This optimization lets us potentially examine significantly fewer edges than the O (log n) depth version, but at the cost of O (in-deg(v)) depth.
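To make the interplay between scan, filter, and edgeMap concrete, the following is a minimal sequential sketch of a sparse edgeMap. It is not Ligra's API: the names `exclusive_scan`, `edge_map_sparse`, and `bfs_parents`, and the dictionary-based adjacency representation, are assumptions made for illustration. The scan over frontier degrees assigns each incident edge its own output slot, which is what would let the inner loops run in parallel, and a final filter removes the empty slots.

```python
def exclusive_scan(a, op, identity):
    """Return (prefix, total), where prefix[i] combines a[0..i-1]."""
    out, acc = [], identity
    for x in a:
        out.append(acc)
        acc = op(acc, x)
    return out, acc


def edge_map_sparse(adj, frontier, F, C):
    """Apply F to edges (u, v) with u in frontier and C(v) true.

    adj maps each vertex to its list of out-neighbors. Returns the next
    frontier: the targets v for which F(u, v) returned True.
    """
    degrees = [len(adj[u]) for u in frontier]
    offsets, total = exclusive_scan(degrees, lambda x, y: x + y, 0)
    slots = [None] * total                 # one output slot per incident edge
    for u, off in zip(frontier, offsets):
        for j, v in enumerate(adj[u]):     # each (u, j) writes a distinct slot
            if C(v) and F(u, v):
                slots[off + j] = v
    return [v for v in slots if v is not None]   # filter out empty slots


def bfs_parents(adj, src):
    """Breadth-first search expressed with edge_map_sparse."""
    parent = {src: src}

    def F(u, v):
        if v not in parent:     # stands in for an atomic claim of v
            parent[v] = u
            return True
        return False

    def C(v):
        return v not in parent

    frontier = [src]
    while frontier:
        frontier = edge_map_sparse(adj, frontier, F, C)
    return parent
```

In a parallel setting the claim inside `F` would need an atomic operation so that each vertex enters the next frontier once; the sequential loop above gets this for free.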
ALGORITHMS
In this section we describe I/O specifications of our benchmark, discuss related work and present the theoretically-efficient algorithm we implemented for each problem. We cite the original papers that our algorithms are based on in Table 1 . We mark implementations based on prior work with a † and discuss the related work, algorithms, and implementations for these problems in the full version of our paper [29] . The full version also contains self-contained descriptions of all of our algorithms. 
Output: S, a mapping from each vertex v to the centrality contribution from all (src, t ) shortest paths that pass through v.
Low-Diameter Decomposition
Output: L, a mapping from each vertex to a cluster ID representing
. . . , $V_k$ such that the shortest path between two vertices in $V_i$ using only vertices in $V_i$ is at most d, and the number
, an undirected graph. Output: L, a mapping from each vertex to a unique label for its connected component.
Biconnectivity
Input: G = (V , E), an undirected graph. Output: L, a mapping from each edge to the label of its biconnected component.
Sequentially, biconnectivity can be solved using the Hopcroft-Tarjan algorithm [42]. The algorithm uses depth-first search (DFS) to identify articulation points and requires O (m + n) work to label all edges with their biconnectivity label. It is possible to parallelize the sequential algorithm using a parallel DFS; however, the fastest parallel DFS algorithm is not work-efficient [2]. Tarjan and Vishkin present the first work-efficient algorithm for biconnectivity [87] (as stated in the paper the algorithm is not work-efficient, but it can be made so by using a work-efficient connectivity algorithm). Another approach relies on the fact that biconnected graphs admit open ear decompositions to solve biconnectivity efficiently [57, 70].
In this paper, we implement the Tarjan-Vishkin algorithm for biconnectivity in O (m) expected work and $O(\max(\mathrm{diam}(G) \log n, \log^3 n))$ depth w.h.p. on the FA-MT-RAM. Our implementation first computes connectivity labels using the algorithm from Section 4, which runs in O (m) expected work and $O(\log^3 n)$ depth w.h.p., and picks an arbitrary source vertex from each component. Next, we compute a spanning forest rooted at these sources using breadth-first search, which runs in O (m) work and O (diam(G) log n) depth. We note that the connectivity algorithm can be modified to compute a spanning forest in the same work and depth as connectivity, which would avoid the breadth-first search. We compute Low, High, and Size for each vertex by running leaffix and rootfix sums on the spanning forests produced by BFS with fetch-and-add, which requires O (n) work and O (diam(G)) depth. Finally, we compute an implicit representation of the biconnectivity labels for each edge, using an idea from [7]. This step computes per-vertex labels by removing all critical edges and computing connectivity on the remaining graph. The resulting vertex labels can be used to assign biconnectivity labels to edges by giving tree edges the connectivity label of the vertex further from the root in the tree, and assigning non-tree edges the label of either endpoint. Summing the cost of each step, the total work of this algorithm is O (m) in expectation and the total depth is $O(\max(\mathrm{diam}(G) \log n, \log^3 n))$ w.h.p.
Minimum Spanning Forest
Input: G = (V , E, w ), a weighted graph. Output: T , a set of edges representing a minimum spanning forest of G.
Borůvka gave the first known sequential and parallel algorithm for computing a minimum spanning forest (MSF) [18] . Significant effort has gone into finding linear-work MSF algorithms both in the sequential and parallel settings [21, 47, 68] . Unfortunately, the linear-work parallel algorithms are highly involved and do not seem to be practical. Significant effort has also gone into designing practical parallel algorithms for MSF; we discuss relevant experimental work in Section 6. Due to the simplicity of Borůvka, many parallel implementations of MSF use variants of it.
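As a point of reference, Borůvka's approach can be sketched sequentially as follows. This is a hedged illustration, not the implementation evaluated in this paper: the function name `boruvka_msf`, the union-find helper, and the tie-breaking by edge index are assumptions made so that the sketch is self-contained. In each round every component selects its lightest incident edge, the selected edges are contracted, and edges that now lie inside a single component are packed out.

```python
def boruvka_msf(n, edges):
    """edges: list of (u, v, w) on vertices 0..n-1. Returns the MSF edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    forest = []
    remaining = list(edges)
    while remaining:
        # Each component picks its lightest incident edge. Ties are broken
        # by edge index so the chosen edges cannot close a cycle.
        best = {}
        for i, (u, v, w) in enumerate(remaining):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in best or (w, i) < best[r][0]:
                    best[r] = ((w, i), (u, v, w))
        # Contract the chosen edges.
        for _, (u, v, w) in best.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                forest.append((u, v, w))
        # Pack out edges whose endpoints are now in the same component.
        remaining = [e for e in remaining if find(e[0]) != find(e[1])]
    return forest
```

Since every component with a crossing edge merges with another in each round, the number of components at least halves per round, which is where the logarithmic number of rounds comes from.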
In this paper, we present an implementation of Borůvka's algorithm that runs in O (m log n) work and $O(\log^2 n)$ depth on the PW-MT-RAM. Our implementation is based on a recent implementation of Borůvka by Zhou [90] that runs on the edgelist format. We made several changes to the algorithm which improve performance and allow us to solve MSF on graphs stored in the CSR/CSC format, as storing an integer-weighted graph in edgelist format would require well over 1TB of memory to represent the edges in the Hyperlink2012 graph alone. Our code uses an implementation of Borůvka that works over an edgelist; to make it efficient we ensure that the size of the lists passed to it is much smaller than m. Our approach is to perform a constant number of filtering steps. Each filtering step solves an approximate k'th smallest problem in order to extract the lightest 3n/2 edges in the graph (or all remaining edges) and runs Borůvka on this subset of edges. We then filter the remaining graph, packing out any edges that are now in the same component. This idea is similar to the theoretically-efficient algorithm of Cole et al. [21], except that instead of randomly sampling edges, we select a linear number of the lowest weight edges. Each filtering step costs O (m) work and O (log m) depth, but as we only perform a constant number of steps, they do not affect the work and depth asymptotically. In practice, most of the edges are removed after 3-4 filtering steps, and so the remaining edges can be copied into an edgelist and solved in a single Borůvka step. We also note that as the edges are initially represented in both directions, we can pack out the edges so that each undirected edge is only inspected once (we noticed that earlier edgelist-based implementations stored undirected edges in both directions).
Strongly Connected Components
Tarjan's algorithm is the textbook sequential algorithm for computing the strongly connected components (SCCs) of a directed graph [25]. As it uses depth-first search, we currently do not know how to efficiently parallelize it [2]. The current theoretical state-of-the-art for parallel SCC algorithms with polylogarithmic depth reduces the problem to computing the transitive closure of the graph. This requires $\tilde{O}(n^3)$ work using combinatorial algorithms [35], which is significantly higher than the O (m + n) work done by sequential algorithms. As the transitive-closure based approach performs a significant amount of work even for moderately sized graphs, subsequent research on parallel SCC algorithms has focused on improving the work while potentially sacrificing depth [12, 24, 34, 72]. Conceptually, these algorithms first pick a random pivot and use a reachability-oracle to identify the SCC containing the pivot. They then remove this SCC, which partitions the remaining graph into several disjoint pieces, and recurse on the pieces.
In this paper, we present the first implementation of the SCC algorithm from Blelloch et al. [12]. Our implementation runs in O (m log n) expected work and O (diam(G) log n) depth w.h.p. on the PW-MT-RAM. One of the challenges in implementing this SCC algorithm is how to compute reachability information from multiple vertices simultaneously and how to combine the information to (1) identify SCCs and (2) refine the subproblems of visited vertices. In our implementation, we explicitly store $R_F$ and $R_B$, the forward and backward reachability sets for the set of centers that are active in the current phase, $C_A$. The sets are represented as hash tables that store tuples of vertices and center IDs, $(u, c_i)$, representing a vertex u in the same subproblem as $c_i$ that is visited by a directed path from $c_i$. We explain how to make the hash table technique practical in Section 5. The reachability sets are computed by running simultaneous breadth-first searches from all active centers. In each round of the BFS, we apply edgeMap to traverse all out-edges (or in-edges) of the current frontier. When we visit an edge (u, v) we try to add u's center IDs to v. If u succeeds in adding any IDs, it test-and-sets a visited flag for v, and returns it in the next frontier if the test-and-set succeeded. Each BFS requires at most O (diam(G)) rounds as each search adds the same labels on each round as it would have had it run in isolation.
After computing $R_F$ and $R_B$, we deterministically assign (with respect to the random permutation of vertices generated at the start of the algorithm) vertices that we visited in this phase a new label, which is either the label of a refined subproblem or a unique label for the SCC they are contained in. We first intersect the two tables and perform, for any tuple $(v, c_i)$ contained in the intersection, a priority-write with min on the memory location corresponding to v's SCC label with $c_i$ as the label. Next, for all pairs $(v, c_i)$ in $R_F \oplus R_B$ we do a priority-write with min on v's subproblem label, which ensures that the highest priority search that visited v sets its new subproblem.
We implemented an optimized search for the first phase, which just runs two regular BFSs over the in-edges and out-edges from a single pivot and stores the reachability information in bit-vectors instead of hash-tables. It is well known that many directed real-world graphs have a single massive strongly connected component, and so with reasonable probability the first vertex in the permutation will find this giant component [20]. We also implemented a 'trimming' optimization that is reported in the literature [60, 83], which eliminates trivial SCCs by removing any vertices that have zero in- or out-degree. We implement a procedure that recursively trims until no zero in- or out-degree vertices remain, or until a maximum number of rounds is reached.
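The trimming step and the single-pivot first phase can be sketched sequentially as follows. This is an illustrative sketch under stated assumptions rather than our implementation: the names `reach`, `trim`, and `first_phase`, the default of 10 trimming rounds, and the use of Python sets in place of bit-vectors are all hypothetical choices. The SCC of the pivot is the intersection of the vertices reachable from it over out-edges and over in-edges.

```python
from collections import deque


def reach(adj, src, alive):
    """Vertices reachable from src along adj, restricted to alive."""
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in alive and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def trim(out_adj, in_adj, alive, max_rounds=10):
    """Peel vertices with zero in- or out-degree; each is a trivial SCC."""
    trimmed = []
    for _ in range(max_rounds):
        dead = [v for v in alive
                if not any(u in alive for u in in_adj[v])
                or not any(w in alive for w in out_adj[v])]
        if not dead:
            break
        trimmed.extend(dead)
        alive -= set(dead)      # mutates the caller's set in place
    return trimmed


def first_phase(out_adj, pivot):
    """Return (trimmed trivial SCCs, SCC containing pivot or None)."""
    in_adj = {v: [] for v in out_adj}
    for u, nbrs in out_adj.items():
        for v in nbrs:
            in_adj[v].append(u)
    alive = set(out_adj)
    trimmed = trim(out_adj, in_adj, alive)
    if pivot not in alive:
        return trimmed, None
    forward = reach(out_adj, pivot, alive)    # BFS over out-edges
    backward = reach(in_adj, pivot, alive)    # BFS over in-edges
    return trimmed, forward & backward
```

The sketch assumes every vertex appears as a key of `out_adj`, and, as in the rest of the paper, that the graph has no self-edges.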
Maximal Independent Set and Maximal Matching
Problem: Maximal Independent Set Input: G = (V , E), an undirected graph. Output: U ⊆ V , a set of vertices such that no two vertices in U are neighbors and all vertices in V \ U have a neighbor in U .
Problem: Maximal Matching Input: G = (V , E), an undirected graph. Output: E ′ ⊆ E, a set of edges such that no two edges in E ′ share an endpoint and all edges in E \ E ′ share an endpoint with some edge in E ′ .
Maximal independent set (MIS) and maximal matching (MM) are easily solved in linear work sequentially using greedy algorithms. Many efficient parallel maximal independent set and matching algorithms have been developed over the years [3, 8, 11, 43, 49, 53] . Blelloch et al. show that when the vertices (or edges) are processed in a random order, the sequential greedy algorithms for MIS and MM can be parallelized efficiently and give practical algorithms [11] . Recently, Fischer and Noever showed an improved depth bound for this algorithm [33] .
In this paper, we implement the rootset-based algorithm for MIS from Blelloch et al. [11] which runs in O (m) expected work and $O(\log^2 n)$ depth w.h.p. on the FA-MT-RAM. To the best of our knowledge this is the first implementation of the rootset-based algorithm; the implementations from [11] are based on processing appropriately-sized prefixes of an order generated by a random permutation P. Our implementation of the rootset-based algorithm works on a priority-DAG defined by directing edges in the graph from the higher-priority endpoint to the lower-priority endpoint. On each round, we add all roots of the DAG into the MIS, compute N (roots), the neighbors of the rootset that are still active, and finally decrement the priorities of N (N (roots)). As the vertices in N (roots) are at arbitrary depths in the priority-DAG, we only decrement the priority along an edge
The algorithm runs in O (m) work as we process each edge once; the depth bound is $O(\log^2 n)$ as the priority-DAG has O (log n) depth w.h.p. [33], and each round takes O (log n) depth. We were surprised that this implementation usually outperforms the prefix-based implementation from [11], while also being simple to implement.
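A sequential sketch of a rootset-style MIS computation is given below. It illustrates the general idea under stated assumptions and is not our implementation: the function name `rootset_mis`, the `seed` parameter, and the specific rule used to decide which edges trigger a decrement are choices made for this sketch. Each vertex keeps a count of its undecided higher-priority neighbors; a vertex whose count reaches zero becomes a root of the priority-DAG.

```python
import random


def rootset_mis(adj, seed=0):
    """adj: dict vertex -> list of neighbors (undirected). Returns an MIS."""
    order = list(adj)
    random.Random(seed).shuffle(order)
    rank = {v: i for i, v in enumerate(order)}   # smaller rank = higher priority
    # count[v] = number of undecided higher-priority neighbors of v
    count = {v: sum(rank[u] < rank[v] for u in adj[v]) for v in adj}
    undecided = set(adj)
    roots = [v for v in adj if count[v] == 0]
    mis = set()
    while roots:
        mis.update(roots)
        undecided.difference_update(roots)
        # Neighbors of the rootset that are still active leave the graph.
        removed = {u for r in roots for u in adj[r] if u in undecided}
        undecided -= removed
        next_roots = []
        for u in removed:
            for w in adj[u]:
                # Only edges from u to lower-priority vertices count.
                if w in undecided and rank[u] < rank[w]:
                    count[w] -= 1     # a fetch-and-add in a parallel setting
                    if count[w] == 0:
                        next_roots.append(w)
        roots = next_roots
    return mis
```

Two roots can never be adjacent in the same round, since the lower-priority one would still have a nonzero count, so adding the whole rootset at once preserves independence.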
Our maximal matching implementation is based on the prefix-based algorithm from [11] that takes O (m) expected work and $O(\log^3 m / \log\log m)$ depth w.h.p. on the PW-MT-RAM (using the improved depth shown in [33]). We had to make several modifications to run the algorithm on the large graphs in our experiments. The original code from [11] uses an edgelist representation, but we cannot directly use this implementation as uncompressing all edges would require a prohibitive amount of memory for large graphs. Instead, as in our MSF implementation, we simulate the prefix-based approach by performing a constant number of filtering steps. Each filter step packs out 3n/2 of the highest priority edges, randomly permutes them, and then runs the edgelist based algorithm on the prefix. After computing the new set of edges that are added to the matching, we filter the remaining graph and remove all edges that are incident to matched vertices. In practice, just 3-4 filtering steps are sufficient to remove essentially all edges in the graph. The last step uncompresses any remaining edges into an edgelist and runs the prefix-based algorithm. The filtering steps can be done within the work and depth bounds of the original algorithm.
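The filtering idea can be illustrated with a simplified sequential sketch. It departs from the description above in ways that should be read as assumptions: the edges are permuted once up front so that list position acts as priority, the greedy algorithm stands in for the edgelist-based parallel algorithm, and the loop runs until no edges remain rather than for a constant number of steps. The name `filtered_maximal_matching` is hypothetical.

```python
import random


def filtered_maximal_matching(n, edges, seed=0):
    """edges: list of (u, v) on vertices 0..n-1. Returns a maximal matching."""
    remaining = list(edges)
    random.Random(seed).shuffle(remaining)   # list position is the priority
    matched = [False] * n
    matching = []
    batch = max(1, 3 * n // 2)
    while remaining:
        # Extract a prefix of the highest-priority edges.
        prefix, remaining = remaining[:batch], remaining[batch:]
        for u, v in prefix:                  # greedy matching on the prefix
            if not matched[u] and not matched[v]:
                matched[u] = matched[v] = True
                matching.append((u, v))
        # Filter: remove edges incident to matched vertices.
        remaining = [(u, v) for u, v in remaining
                     if not matched[u] and not matched[v]]
    return matching
```

Because each prefix can match a constant fraction of the vertices it touches, the filter typically discards most of the remaining edges, which is consistent with the small number of steps observed in practice.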
Graph Coloring
Input: G = (V , E), an undirected graph. Output: C, a mapping from each vertex to a color such that for each edge (u, v) ∈ E, C (u) ≠ C (v), using at most ∆ + 1 colors.
As graph coloring is NP-hard to solve optimally, algorithms like greedy coloring, which guarantees a (∆ + 1)-coloring, are used instead in practice, and often use far fewer than (∆ + 1) colors on real-world graphs [40, 88]. Jones and Plassmann (JP) parallelize the greedy algorithm using linear work, but unfortunately adversarial inputs exist for the heuristics they consider that may force the algorithm to run in O (n) depth. Hasenplaugh et al. introduce several heuristics that produce high-quality colorings in practice and also achieve provably low depth regardless of the input graph. These include LLF (largest-log-degree-first), which processes vertices ordered by the log of their degree, and SLL (smallest-log-degree-last), which processes vertices by removing all lowest log-degree vertices from the graph, coloring the remaining graph, and finally coloring the removed vertices. For LLF, they show that it runs in O (m + n) work and $O(L \log \Delta + \log n)$ depth, where $L = \min\{\sqrt{m}, \Delta\} + \log^2 \Delta \log n / \log\log n$ in expectation.
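The Jones-Plassmann scheme with an LLF-style ordering can be sketched sequentially as follows. This is a hedged illustration, not the implementation of Hasenplaugh et al. or ours: the name `llf_coloring` and the `seed` parameter are hypothetical, and the log-degree is approximated by `int.bit_length()`, which equals ⌊log2 d⌋ + 1 for d ≥ 1 and therefore orders vertices by log-degree. Each vertex waits until all of its higher-priority neighbors are colored and then takes the smallest color not used by them.

```python
import random


def llf_coloring(adj, seed=0):
    """adj: dict vertex -> list of neighbors (undirected). Returns vertex -> color."""
    order = list(adj)
    random.Random(seed).shuffle(order)
    tie = {v: i for i, v in enumerate(order)}      # distinct random tie-breakers
    # Larger log-degree first; ties within a log-degree class broken randomly.
    prio = {v: (len(adj[v]).bit_length(), tie[v]) for v in adj}
    # count[v] = number of uncolored higher-priority neighbors of v
    count = {v: sum(prio[u] > prio[v] for u in adj[v]) for v in adj}
    color = {}
    roots = [v for v in adj if count[v] == 0]
    while roots:
        for v in roots:                            # roots are pairwise non-adjacent
            used = {color[u] for u in adj[v] if u in color}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        next_roots = []
        for v in roots:
            for u in adj[v]:
                if prio[u] < prio[v]:
                    count[u] -= 1                  # a fetch-and-add in parallel
                    if count[u] == 0:
                        next_roots.append(u)
        roots = next_roots
    return color
```

Since a vertex only avoids the colors of its neighbors, the color it receives is at most its degree, so the coloring never exceeds ∆ + 1 colors.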
In this paper, we implement a synchronous version of Jones-Plassmann using the LLF heuristic in Ligra, which runs in O (m + n) work and $O(L \log \Delta + \log n)$ depth on the FA-MT-RAM. The algorithm is implemented similarly to our rootset-based algorithm for MIS. In each round, after coloring the roots we use a fetch-and-add to decrement a count on our neighbors, and add the neighbor as a root on the next round if the count is decremented to 0.
k-core
Input: G = (V , E), an undirected graph. Output: D, a mapping from each vertex to its coreness value.
k-cores were defined independently by Seidman [73], and by Matula and Beck [58] who also gave a linear-time algorithm for computing the coreness value of all vertices, i.e., the maximum k-core a vertex participates in. Anderson and Mayr showed that k-core (and therefore coreness) is in NC for k ≤ 2, but is P-complete for k ≥ 3 [4]. The Matula and Beck algorithm is simple and practical: it first bucket-sorts vertices by their degree, and then repeatedly deletes the minimum-degree vertex. The affected neighbors are moved to a new bucket corresponding to their induced degree. As each vertex and each edge in each direction is processed exactly once, the algorithm runs in O (m + n) work. In [28], the authors give a parallel algorithm based on bucketing that runs in O (m + n) expected work and O (ρ log n) depth w.h.p., where ρ is the peeling-complexity of the graph, defined as the number of rounds to peel the graph to an empty graph where each peeling step removes all minimum degree vertices.
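The bucket-based peeling process can be sketched sequentially as below. This is an illustration under stated assumptions, not the Julienne implementation: the name `coreness`, the dictionary of sets used for buckets, and the returned round count are choices made for this sketch. Each round removes every vertex in the current minimum-degree bucket and moves the affected neighbors to the bucket of their induced degree, never below the current value of k.

```python
def coreness(adj):
    """adj: dict vertex -> list of neighbors. Returns (coreness map, rounds)."""
    deg = {v: len(adj[v]) for v in adj}
    buckets = {}
    for v, d in deg.items():
        buckets.setdefault(d, set()).add(v)
    core, rounds, k = {}, 0, 0
    while len(core) < len(adj):
        if not buckets.get(k):
            k += 1                      # no vertex left with induced degree k
            continue
        peeled = buckets.pop(k)         # all current minimum-degree vertices
        rounds += 1
        for v in peeled:
            core[v] = k
        for v in peeled:
            for u in adj[v]:
                if u not in core and deg[u] > k:
                    buckets[deg[u]].discard(u)
                    deg[u] -= 1         # induced degree never drops below k
                    buckets.setdefault(deg[u], set()).add(u)
    return core, rounds
```

The per-neighbor decrement in the inner loop is the step that a parallel implementation must perform without excessive contention on high-degree vertices.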
Our implementation of k-core in this paper is based on the implementation from Julienne [28]. One of the challenges in implementing the peeling algorithm for k-core is efficiently computing the number of edges removed from each vertex that remains in the graph. A simple approach is to fetch-and-add a counter per vertex and update the bucket of the vertex based on this counter; however, this incurs significant contention on real-world graphs that contain high-degree vertices. To make this step faster in practice, we implemented a work-efficient histogram which computes the number of edges removed from remaining vertices while incurring very little contention. We describe our histogram implementation in Section 5.
Approximate Set Cover †
Input: G = (V, E), an undirected graph representing a set cover instance. Output: S ⊆ V, a set of sets such that ∪_{s ∈ S} N(s) = V with |S| being an O(log n)-approximation to the optimal cover.
Triangle Counting †
Input: G = (V, E), an undirected graph. Output: T_G, the total number of triangles in G.
IMPLEMENTATIONS AND TECHNIQUES
In this section, we introduce several general implementation techniques and optimizations that we use in our algorithms. Due to lack of space, we describe some techniques, such as a more cache-friendly sparse edgeMap that we call edgeMapBlocked, and compression techniques, in the full version of our paper [29].
A Work-efficient Histogram Implementation. Our initial implementation of the peeling-based algorithm for k-core suffered from poor performance due to a large amount of contention incurred by fetch-and-adds on high-degree vertices. This occurs because many social networks and web graphs have large maximum degree, but relatively small degeneracy, or largest non-empty core (labeled k_max in Table 2). For these graphs, we observed that many early rounds, which process vertices with low coreness, perform a large number of fetch-and-adds on memory locations corresponding to high-degree vertices, resulting in high contention [77]. To reduce contention, we designed a work-efficient histogram implementation that can perform this step while only incurring O(log n) contention w.h.p. The Histogram primitive takes a sequence of (K, T) pairs and an associative and commutative operator R : T × T → T, and computes a sequence of (K, T) pairs in which each key k appears only once, and its associated value t is the combination, with respect to R, of all values associated with k in the input.
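The input/output behavior of the Histogram primitive can be pinned down with a short sequential reference. The Python sketch below only illustrates the semantics; the name histogram and the dictionary-based representation are assumptions, and it says nothing about the parallel, low-contention implementation.

```python
def histogram(pairs, combine):
    """Sequential reference semantics for the Histogram primitive.

    pairs   : iterable of (key, value) tuples
    combine : associative and commutative binary operator on values
    Returns a list of (key, reduced_value) with each key appearing once.
    """
    out = {}
    for key, val in pairs:
        out[key] = combine(out[key], val) if key in out else val
    return list(out.items())
```

Because the operator is required to be associative and commutative, any parallel implementation is free to combine the values of a key in any order and grouping and must still agree with this reference.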
A useful example of histogram to consider is computing, for each v ∈ N(F) for a vertexSubset F, the number of edges (u, v) where u ∈ F (i.e., the number of incoming neighbors from the frontier). This operation can be implemented by running histogram on a sequence where each v ∈ N(F) appears once per (u, v) edge as a tuple (v, 1), using the operator +. One theoretically efficient implementation of histogram is to simply semisort the pairs using the work-efficient semisort algorithm from [39]. The semisort places pairs from the sequence into a set of heavy and light buckets, where heavy buckets contain a single key that appears many times in the input sequence, and light buckets contain at most O(log² n) distinct keys, each of which appears at most O(log n) times w.h.p. (heavy and light keys are determined by sampling). We compute the reduced value for heavy buckets using a standard parallel reduction. For each light bucket, we allocate a hash table, and hash the keys in the bucket in parallel to the table, combining multiple values for the same key using R. As each key appears at most O(log n) times w.h.p., we incur at most O(log n) contention w.h.p. The output sequence can be computed by compacting the light tables and heavy arrays.
While the semisort implementation is theoretically efficient, it requires a likely cache miss for each key when inserting into the appropriate hash table. To improve cache performance in this step, we implemented a work-efficient algorithm with O(n^ϵ) depth based on radix sort. Our implementation is based on the parallel radix sort from PBBS [78]. As in the semisort, we first sample keys from the sequence and determine the set of heavy keys. Instead of directly moving the elements into light and heavy buckets, we break up the input sequence into O(n^{1−ϵ}) blocks, each of size O(n^ϵ), and sequentially sort the keys within a block into light and heavy buckets. Within the blocks, we reduce all heavy keys into a single value and compute an array of size O(n^ϵ) which holds the starting offset of each bucket within the block.
Next, we perform a segmented scan [9] over the arrays of the O(n^{1−ϵ}) blocks to compute the sizes of the light buckets, and the reduced values for the heavy buckets, which only contain a single key. Finally, we allocate tables for the light buckets, hash the light keys in parallel over the blocks, and compact the light tables and heavy keys into the output array. Each step runs in O(n) work and O(n^ϵ) depth. Compared to the original semisort implementation, this version incurs fewer cache misses because the light keys per block are already sorted and consecutive keys likely go to the same hash table, which fits in cache. We compared the running times of the histogram-based version of k-core and the fetch-and-add-based version of k-core and saw between a 1.1-3.1x improvement from using the histogram.
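The overall shape of the blocked approach (sample to find heavy keys, process blocks independently, then combine per-block results) can be illustrated with the sequential Python sketch below. It is a simplification under stated assumptions rather than the radix-sort-based implementation: the name blocked_histogram and all of its parameters are hypothetical, sampling is a deterministic stride instead of random, light keys are grouped by hash instead of being sorted, and the segmented scan is replaced by plain loops.

```python
from collections import Counter


def blocked_histogram(pairs, combine, block_size=4, num_light=2,
                      sample_every=2, heavy_threshold=2):
    """Blocked heavy/light histogram sketch; returns {key: reduced_value}."""
    # 1. Sample keys to guess which ones are heavy (assumed stride sampling).
    sample = Counter(k for k, _ in pairs[::sample_every])
    heavy = {k for k, c in sample.items() if c >= heavy_threshold}

    # 2. Each block is processed independently (in parallel in a real
    #    implementation): heavy keys are reduced locally, light pairs are
    #    grouped by their destination light bucket.
    heavy_partials, light_parts = [], []
    for start in range(0, len(pairs), block_size):
        local_heavy = {}
        local_light = [[] for _ in range(num_light)]
        for k, v in pairs[start:start + block_size]:
            if k in heavy:
                local_heavy[k] = combine(local_heavy[k], v) if k in local_heavy else v
            else:
                local_light[hash(k) % num_light].append((k, v))
        heavy_partials.append(local_heavy)
        light_parts.append(local_light)

    # 3. Combine the per-block partial values of each heavy key.
    out = {}
    for local_heavy in heavy_partials:
        for k, v in local_heavy.items():
            out[k] = combine(out[k], v) if k in out else v

    # 4. One small table per light bucket, filled block by block.
    for b in range(num_light):
        table = {}
        for local_light in light_parts:
            for k, v in local_light[b]:
                table[k] = combine(table[k], v) if k in table else v
        out.update(table)  # heavy and light key sets are disjoint
    return out
```

As a usage example, building the pairs (v, 1) for every edge (u, v) leaving a frontier and calling the function with addition yields, for each neighbor v, the number of frontier vertices pointing to it.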
Techniques for overlapping searches. In this section, we describe how we compute and update the reachability labels for vertices that are visited in a phase of our SCC algorithm. Recall that each phase performs a graph traversal from the set of active centers on this round, C_A, and computes for each center c all vertices in the weakly-connected component for the subproblem of c that can be reached by a directed path from it. We store this reachability information as a set of (u, c_i) pairs in a hash table, which represent the fact that u can be reached by a directed path from c_i. A phase performs two graph traversals from the centers to compute R_F and R_B, the out-reachability and in-reachability sets, respectively. Each traversal allocates an initial hash table and runs rounds of edgeMap until no new label information is added to the table.
The main challenges in implementing one round in the traversal are (1) ensuring that the table has sufficient space to store all pairs that will be added this round, and (2) efficiently iterating over all of the pairs associated with a vertex. We implement (1) by performing a parallel reduce to sum, over vertices u ∈ F in the current frontier, the number of neighbors v in the same subproblem multiplied by the number of distinct labels currently assigned to u. This upper-bounds the number of distinct labels that could be added this round, and although we may overestimate the number of actual additions, we will never run out of space in the table. We update the number of elements currently in the table during concurrent insertions by storing a per-processor count which gets incremented whenever the processor performs a successful insertion. The counts are then summed together at the end of a round and used to update the count of the number of elements in the table.
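The space estimate used for (1) amounts to a single weighted sum over the frontier. The Python sketch below spells it out sequentially; the function name and the data layout (adjacency lists, a subproblem array, and a per-vertex label set) are assumptions chosen for illustration, and in a parallel setting the outer sum would be a parallel reduce.

```python
def label_upper_bound(frontier, adj, subproblem, labels):
    """Upper bound on the number of (v, c) pairs added in one round.

    frontier   : vertices u in the current frontier F
    adj        : adj[u] lists the neighbors of u
    subproblem : subproblem[v] is the ID of the subproblem containing v
    labels     : labels[u] is the set of centers currently reaching u
    """
    return sum(
        len(labels[u])
        * sum(1 for v in adj[u] if subproblem[v] == subproblem[u])
        for u in frontier
    )
```

The bound is loose whenever a neighbor already holds some of the labels being propagated, which is why it may overestimate but never underestimates the space required.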
One simple implementation of (2) is to simply allocate O (log n) space for every vertex, as the maximum number of centers that visit any vertex during a phase is at most O (log n) w.h.p. However, this will waste a significant amount of space, as most vertices are visited just a few times. Instead, our implementation stores (u, c) pairs in the table for visited vertices u, and computes hashes based only on the ID of u. As each vertex is only expected to be visited a constant number of times during a phase, the expected probe length is still a constant. Storing the pairs for a vertex in the same probe-sequence is helpful for two reasons. First, we may incur fewer cache misses than if we had hashed the pairs based on both entries, as multiple pairs for a vertex can fit in the same cache line. Second, storing the pairs for a vertex along the same probe sequence makes it extremely easy to find all pairs associated with a vertex u, as we simply perform linear-probing, reporting all pairs that have u as their key until we hit an empty cell. Our experiments show that this technique is practical, and we believe that it may have applications in similar algorithms, such as computing least-element lists or FRT trees in parallel [12, 13] .
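The idea of hashing (u, c) pairs on u alone, so that all pairs of a vertex lie on one probe sequence, can be sketched as follows. This is a sequential, insert-only Python illustration; the class name PairTable and its methods are hypothetical, and concurrent insertion (which would need atomic operations) is deliberately left out.

```python
class PairTable:
    """Open-addressing table of (u, c) pairs hashed only on u."""

    def __init__(self, capacity):
        self.slots = [None] * capacity

    def _home(self, u):
        return hash(u) % len(self.slots)

    def insert(self, u, c):
        """Insert (u, c); return True if the pair was newly added."""
        i = self._home(u)
        for _ in range(len(self.slots)):
            if self.slots[i] is None:
                self.slots[i] = (u, c)
                return True
            if self.slots[i] == (u, c):
                return False          # pair already present
            i = (i + 1) % len(self.slots)
        raise RuntimeError("table is full")

    def labels(self, u):
        """Report every c with (u, c) stored, by linear probing from
        the home slot of u until an empty cell is reached."""
        i = self._home(u)
        found = []
        for _ in range(len(self.slots)):
            if self.slots[i] is None:
                break
            if self.slots[i][0] == u:
                found.append(self.slots[i][1])
            i = (i + 1) % len(self.slots)
        return found
```

Since entries are never deleted, every pair for u sits between the home slot of u and the first empty cell, so the scan in labels cannot miss one.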
EXPERIMENTS
In this section, we describe our experimental results on a set of real-world graphs and also discuss related experimental work. Tables 1 and 3 show the running times for our implementations on our graph inputs. For compressed graphs, we use the compression schemes from Ligra+ [80], which we extended to ensure theoretical efficiency. We describe these modifications and also other statistics about our algorithms (e.g., number of colors used, number of SCCs, etc.) in the full version of the paper [29].
Experimental Setup. We run all of our experiments on a 72-core Dell PowerEdge R930 (with two-way hyper-threading) with 4 × 2.4GHz Intel 18-core E7-8867 v4 Xeon processors (with a 4800MHz bus and 45MB L3 cache) and 1TB of main memory. Our programs use Cilk Plus to express parallelism and are compiled with the g++ compiler (version 5.4.1) with the -O3 flag. By using Cilk's work-stealing scheduler we are able to obtain an expected running time of W/P + O(D) for an algorithm with W work and D depth on P processors [16]. For the parallel experiments, we use the command numactl -i all to balance the memory allocations across the sockets. All of the speedup numbers we report are the ratio of the running time of the implementation on a single thread to the running time of our parallel implementation on 72 cores with hyper-threading.
Table 2: Graph inputs, including vertices and edges. diam is the diameter of the graph. For undirected graphs, ρ and k_max are the number of peeling rounds and the largest non-empty core (degeneracy), respectively. We mark diam values where we are unable to calculate the exact diameter with * and report the effective diameter observed during our experiments, which is a lower bound on the actual diameter.
Graph Data. To show how our algorithms perform on graphs at different scales, we selected a representative set of real-world graphs of varying sizes. Most of the graphs are Web graphs and social networks, which are low-diameter graphs that are frequently used in practice. To test our algorithms on large-diameter graphs, we also ran our implementations on 3-dimensional tori where each vertex is connected to its 2 neighbors in each dimension.
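For readers who want to reproduce a small instance of such an input, the Python sketch below builds the adjacency lists of a 3-dimensional torus with side^3 vertices. The function name torus_3d and the particular vertex numbering are assumptions made for illustration and are not taken from our generator.

```python
def torus_3d(side):
    """Adjacency lists of a 3-dimensional torus with side**3 vertices.

    Each vertex (x, y, z) is joined to its two neighbors in each of the
    three dimensions, with wrap-around. For side >= 3 every vertex has
    exactly 6 distinct neighbors.
    """
    def vid(x, y, z):
        # Assumed numbering: x is the slowest-varying coordinate.
        return (x * side + y) * side + z

    adj = [[] for _ in range(side ** 3)]
    for x in range(side):
        for y in range(side):
            for z in range(side):
                v = vid(x, y, z)
                for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    for s in (1, -1):
                        adj[v].append(vid((x + s * dx) % side,
                                          (y + s * dy) % side,
                                          (z + s * dz) % side))
    return adj
```

With this construction the number of directed adjacency entries is 6 times the number of vertices, so side = 1000 gives 1B vertices and 6B entries.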
We list the graphs used in our experiments, along with their size, approximate diameter, peeling complexity [28], and degeneracy (for undirected graphs) in Table 2. LiveJournal is a directed graph of the social network obtained from a snapshot in 2008 [17]. com-Orkut is an undirected graph of the Orkut social network. Twitter is a directed graph of the Twitter network, where edges represent the follower relationship [51]. ClueWeb is a Web graph from the Lemur project at CMU [17]. Hyperlink2012 and Hyperlink2014 are directed hyperlink graphs obtained from the WebDataCommons dataset where nodes represent web pages [61]. 3D-Torus is a 3-dimensional torus with 1B vertices and 6B edges. We mark symmetric (undirected) versions of the directed graphs with the suffix -Sym. We create weighted graphs for evaluating weighted BFS, Borůvka, and Bellman-Ford by selecting edge weights from [1, log n) uniformly at random. We process LiveJournal, com-Orkut, Twitter, and 3D-Torus in the uncompressed format, and ClueWeb, Hyperlink2014, and Hyperlink2012 in the compressed format.
SSSP Problems. Our BFS, weighted BFS, Bellman-Ford, and betweenness centrality implementations achieve between an 8-67x speedup across all inputs. We ran all of our shortest path experiments on the symmetrized versions of the graphs. Our experiments show that our weighted BFS and Bellman-Ford implementations perform as well as or better than our prior implementations from Julienne [28]. Our running times for BFS and betweenness centrality are the same as the times of the implementations in Ligra [76]. We note that our running times for weighted BFS on the Hyperlink graphs are larger than the times reported in Julienne. This is because the shortest-path experiments in Julienne were run on the directed version of the graph, where the average vertex can reach many fewer vertices than on the symmetrized version.
We set a flag for our weighted BFS experiments on the ClueWeb and Hyperlink graphs that lets the algorithm switch to a dense edgeMap once the frontiers are sufficiently dense, which lets the algorithm run within half of the RAM on our machine. Before this change, our weighted BFS implementation would request a large amount of memory when processing the largest frontiers, which then caused the graph to become partly evicted from the page cache.
In an earlier paper [28], we compared the running time of our weighted BFS implementation to two existing parallel shortest path implementations from the GAP benchmark suite [6] and Galois [55], as well as a fast sequential shortest path algorithm from the DIMACS shortest path challenge, showing that our implementation is between 1.07-1.1x slower than the ∆-stepping implementation from GAP, and 1.6-3.4x faster than the Galois implementation. Our old version of Bellman-Ford was between 1.2-3.9x slower than weighted BFS; we note that after changing it to use the edgeMapBlocked optimization, it is now competitive with weighted BFS and is between 1.2x faster and 1.7x slower on our graphs with the exception of 3D-Torus, where it performs 7.3x slower than weighted BFS, as it performs O(n^{4/3}) work on this graph.
Connectivity Problems. Our low-diameter decomposition (LDD) implementation achieves between 17-58x speedup across all inputs. We fixed β to 0.2 in all of the codes that use LDD. The running time of LDD is comparable to the cost of a BFS that visits most of the vertices. We are not aware of any prior experimental work that reports the running times for an LDD implementation.
Our work-efficient implementation of connectivity achieves 25-57x speedup across all inputs. We note that our implementation does not assume that vertex IDs in the graph are randomly permuted and always generates a random permutation, even on the first round, as adding vertices based on their original IDs can result in poor performance. There are several existing implementations of fast parallel connectivity algorithms [67, 78, 79, 83]; however, only the implementation from [79], which presents the algorithm that we implement in this paper, is theoretically-efficient. The implementation from Shun et al. was compared to both the Multistep [83] and Patwary et al. [67] implementations, and shown to be competitive on a broad set of graphs. We compared our connectivity implementation to the work-efficient connectivity implementation from Shun et al. on our uncompressed graphs and observed that our code is between 1.2-2.1x faster in parallel.
Despite our biconnectivity implementation having O(diam(G)) depth, our implementation achieves between a 20-59x speedup across all inputs, as the diameter of most of our graphs is extremely low. Our biconnectivity implementation is about 3-5 times slower than running connectivity on the graph, which seems reasonable as our current implementation performs two calls to connectivity, and one breadth-first search. There are several existing implementations of biconnectivity. Cong and Bader [22] parallelize the Tarjan-Vishkin algorithm and demonstrate speedup over the Hopcroft-Tarjan (HT) algorithm. Edwards and Vishkin [31] also implement the Tarjan-Vishkin algorithm using the XMT platform, and show that their algorithm achieves good speedups. Slota and Madduri [82] present a BFS-based biconnectivity implementation which requires O(mn) work in the worst case, but behaves like a linear-work algorithm in practice. We ran the Slota and Madduri implementation on 36 hyper-threads allocated from the same socket, the configuration on which we observed the best performance for their code, and found that our implementation is between 1.4-2.1x faster than theirs. We used a DFS-ordered subgraph corresponding to the largest connected component to test their code, which produced the fastest times. Using the original order of the graph affects the running time of their implementation, causing it to run between 2-3x slower, as the amount of work performed by their algorithm depends on the order in which vertices are visited.
Our strongly connected components implementation achieves between a 13-43x speedup across all inputs. Our implementation takes a parameter β, which is the base of the exponential rate at which we grow the number of centers added. We set β between 1.1-2.0 for our experiments and note that using a larger value of β can improve the running time on smaller graphs by up to a factor of 2x. Our SCC implementation is between 1.6x faster and 4.8x slower than running connectivity on the graph. There are several existing SCC implementations that have been evaluated on real-world directed graphs [41, 60, 83]. The Hong et al. algorithm [41] is a modified version of the FWBW-Trim algorithm from McLendon et al. [60], but neither algorithm has any theoretical bounds on work or depth. Unfortunately, Hong et al. [41] do not report running times, so we are unable to compare our performance with theirs. The Multistep algorithm [83] has a worst-case running time of O(n²), but the authors point out that the algorithm behaves like a linear-time algorithm on real-world graphs. We ran our implementation on 16 cores configured similarly to their experiments and found that we are about 1.7x slower on LiveJournal, which easily fits in cache, and 1.2x faster on Twitter (scaled to account for a small difference in graph sizes). While the Multistep algorithm is slightly faster on some graphs, our SCC implementation has the advantage of being theoretically-efficient and performs a predictable amount of work.
Our minimum spanning forest implementation achieves between 17-50x speedup over the implementation running on a single thread across all of our inputs. Obtaining practical parallel algorithms for MSF has been a longstanding goal in the field, and several implementations exist [5, 23, 66, 78, 90]. We compared our implementation with the union-find based MSF implementation from PBBS [78] and the implementation of Borůvka from [90], which is one of the fastest implementations we are aware of. Our MSF implementation is between 2.6-5.9x faster than the MSF implementation from PBBS. Compared to the edgelist-based implementation of Borůvka from [90], our implementation is between 1.2-2.9x faster.
MIS, Maximal Matching, and Graph Coloring. Our MIS and maximal matching implementations achieve between 31-70x and 25-70x speedup across all inputs, respectively. The implementations by Blelloch et al. [11] are the fastest existing implementations of MIS and maximal matching that we are aware of, and are the basis for our maximal matching implementation. They report that their implementations are 3-8x faster than Luby's algorithm on 32 threads, and outperform a sequential greedy MIS implementation on more than 2 processors. We compared our rootset-based MIS implementation to the prefix-based implementation, and found that the rootset-based approach is between 1.1-3.5x faster. Our maximal matching implementation is between 3-4.2x faster than the implementation from [11]. Our implementation of maximal matching can avoid a significant amount of work, as each of the filter steps can extract and permute just the 3n/2 highest priority edges, whereas the edgelist-based version in PBBS must permute all edges. Our coloring implementation achieves between 11-56x speedup across all inputs. We note that our implementation appears to be between 1.2-1.6x slower than the asynchronous implementation of JP in [40], due to synchronizing on many rounds which contain few vertices.
k-core, Approximate Set Cover, and Triangle Counting. Our k-core implementation achieves between 5-46x speedup across all inputs, and 114x speedup on the 3D-Torus graph, as there is only one round of peeling in which all vertices are removed. There are several recent papers that implement parallel algorithms for k-core [27, 28, 46, 71]. Both the ParK algorithm [27] and the Kabir and Madduri algorithm [46] implement the peeling algorithm in O(k_max n + m) work, which is not work-efficient. Our implementation is between 3.8-4.6x faster than ParK on a similar machine configuration. Kabir and Madduri show that their implementation achieves an average speedup of 2.8x over ParK. Our implementation is between 1.3-1.6x faster than theirs on a similar machine configuration.
Our approximate set cover implementation achieves between 5-57x speedup across all inputs. Our implementation is based on the implementation presented in Julienne [28]; the one major modification was to regenerate random priorities for sets that are active on the current round. We compared the running time of our implementation with the parallel implementation from [15], which is available in the PBBS library. We ran both implementations with ϵ = 0.01. Our implementation is between 1.2x slower and 1.5x faster than the PBBS implementation on our graphs, with the exception of 3D-Torus. On 3D-Torus, the implementation from [15] runs 56x slower than our implementation, as it does not regenerate priorities for active sets on each round, causing worst-case behavior. Our performance is also slow on this graph, as nearly all of the vertices stay active (in the highest bucket) during each round, and using ϵ = 0.01 causes a large number of rounds to be performed.
Our triangle counting (TC) implementation achieves between 39-81x speedup across all inputs. Unfortunately, we are unable to report speedup numbers for TC on our larger graphs as the single-threaded times took too long due to the algorithm performing O(m^{3/2}) work. There are a number of experimental papers that consider multicore triangle counting [1, 37, 50, 52, 74, 81]. We implement the algorithm from [81], and adapted it to work on compressed graphs. We note that in our experiments we intersect directed adjacency lists sequentially, as there was sufficient parallelism in the outer parallel loop. There was no significant difference in running times between our implementation and the implementation from [81]. We ran our implementation on 48 threads on the Twitter graph to compare with the times reported by EmptyHeaded [1] and found that our times are about the same.
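The structure of counting triangles by intersecting directed adjacency lists can be sketched in a few lines of Python. This is a sequential illustration only; the function names are hypothetical, and orienting each edge from the lower-ranked to the higher-ranked endpoint under a degree ordering is an assumption about the details, which should be checked against [81].

```python
def intersect_size(a, b):
    """Number of common elements of two sorted lists (sequential merge)."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def count_triangles(adj):
    """Count triangles in a simple undirected graph given as adjacency lists."""
    n = len(adj)
    # Rank vertices by (degree, ID); pos[v] is the rank of v.
    order = sorted(range(n), key=lambda v: (len(adj[v]), v))
    pos = [0] * n
    for i, v in enumerate(order):
        pos[v] = i
    # out[r] holds, in sorted order, the ranks of the higher-ranked
    # neighbors of the vertex with rank r.
    out = [None] * n
    for v in range(n):
        out[pos[v]] = sorted(pos[u] for u in set(adj[v]) if pos[u] > pos[v])
    total = 0
    for a in range(n):          # the outer loop is the parallel one
        for b in out[a]:
            total += intersect_size(out[a], out[b])
    return total
```

Each triangle with ranks a < b < c is counted exactly once, when the directed edge (a, b) is processed and c is found in both out[a] and out[b].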
Performance on 3D-Torus. We ran experiments on a family of 3D-Torus graphs with different sizes to study how our diameter-bounded algorithms scale relative to algorithms with polylogarithmic depth. We were surprised to see that the running times of some of our polylogarithmic-depth algorithms on this graph, like LDD and connectivity, are 17-40x higher than their running times on Twitter and Twitter-Sym, despite 3D-Torus only having 4x and 2.4x more edges than Twitter and Twitter-Sym, respectively. Our slightly worse scaling on this graph can be accounted for by the fact that we stored the graph ordered by dimension, instead of storing it using a local ordering. It would be interesting to see how much improvement we could gain by reordering the vertices.
In Figure 1 we show the normalized throughput of MIS, BFS, BC, and graph coloring for 3-dimensional tori of different sizes, where throughput is measured as the number of edges processed per second. The throughput for each application becomes saturated before our largest-scale graph for all applications except for BFS, which is saturated on a graph with 2 billion vertices. The throughput curves show that the theoretical bounds are useful in predicting how the half-lengths are distributed. The half-lengths are ordered as follows: coloring, MIS, BFS, and BC. This is the same order as sorting these algorithms by their depth with respect to this graph.
Locality. While our algorithms are efficient on the MT-RAM, we do not analyze their cache complexity, and in general they may not be efficient in a model that takes caches into account. Despite this, we observed that our algorithms have good cache performance on the graphs we tested on. In this section we give some explanation for this fact by showing that our primitives make good use of the caches. Our algorithms are also aided by the fact that these graph datasets often come in highly local orders (e.g., see the Natural order in [30]). Table 4 shows metrics for our experiments measured using Open Performance Counter Monitor (PCM). Due to space limitations, we only report numbers for the ClueWeb graph. We observed that using a work-efficient histogram is 3.5x faster than using fetch-and-add in our k-core implementation, which suffers from high contention on this graph. Using a histogram reduces the number of cycles stalled due to memory by more than 7x. We also ran our wBFS implementation with and without the edgeMapBlocked optimization, which reduces the number of cache lines read from and written to when performing a sparse edgeMap. The blocked implementation reads and writes 2.1x fewer bytes than the unoptimized version, which translates to a 1.7x faster runtime. We disabled the dense optimization for this experiment to directly compare the two implementations of a sparse edgeMap.
Processing Massive Web Graphs. In Tables 1 and 3, we show the running times of our implementations on the ClueWeb, Hyperlink2014, and Hyperlink2012 graphs. To put our performance in context, we compare our 72-core running times to running times reported by existing work. Table 5 summarizes the existing results in the literature. Most results process the directed versions of these graphs, which have about half as many edges as the symmetrized version. Unless otherwise mentioned, all results from the literature use the directed versions of these graphs. To make the comparison easier, we show our running times for BFS, SSSP (weighted BFS), BC and SCC on the directed graphs, and running times for Connectivity, k-core and TC on the symmetrized graphs in Table 5.
FlashGraph [26] reports disk-based running times for the Hyperlink2012 graph on a 4-socket, 32-core machine with 512GB of memory and 15 SSDs. On 64 hyper-threads, they solve BFS in 208s, BC in 595s, connected components in 461s, and triangle counting in 7818s. Our BFS and BC implementations are 12x faster and 16x faster, and our triangle counting and connectivity implementations are 5.3x faster and 12x faster than their implementations, respectively. Mosaic [54] reports in-memory running times on the Hyperlink2014 graph; we note that the system is optimized for external memory execution. They solve BFS in 6.5s, connected components in 700s, and SSSP (Bellman-Ford) in 8.6s on a machine with 24 hyper-threads and 4 Xeon-Phis (244 cores with 4 threads each) for a total of 1000 hyper-threads, 768GB of RAM, and 6 NVMes. Our BFS and connectivity implementations are 1.1x and 40x faster respectively, and our SSSP implementation is 1.05x slower. Both FlashGraph and Mosaic compute weakly connected components, which is equivalent to connectivity. BigSparse [45] reports disk-based running times for BFS and BC on the Hyperlink2012 graph on a 32-core machine. They solve BFS in 2500s and BC in 3100s. Our BFS and BC implementations are 149x and 88x faster than their implementations, respectively.
Slota et al. [85] report running times for the Hyperlink2012 graph on 256 nodes on the Blue Waters supercomputer. Each node contains two 16-core processors with one thread each, for a total of 8192 hyper-threads. They report that they can find the largest connected component and the largest SCC in the graph in 63s and 108s, respectively. Our implementations find all connected components 1.6x faster than their largest connected component implementation, and find all strongly connected components 1.6x slower than their largest-SCC implementation. Their largest-SCC implementation computes two BFSs from a randomly chosen vertex (one on the in-edges and the other on the out-edges) and intersects the reachable sets. We perform the same operation as one of the first steps of our SCC algorithm and note that it requires about 30 seconds on our machine. They solve approximate k-cores in 363s, where the approximate k-core of a vertex is the coreness of the vertex rounded up to the nearest power of 2. Our implementation computes the exact coreness of each vertex in 184s, which is 1.9x faster than the approximate implementation while using 113x fewer cores.
Stergiou et al. [86] describe a connectivity algorithm that runs in O (log n) rounds in the BSP model and report running times for the symmetrized Hyperlink2012 graph. They implement their algorithm using a proprietary in-memory/secondary-storage graph processing system used at Yahoo!, and run experiments on a 1000 node cluster. Each node contains two 6-core processors that are 2-way hyper-threaded and 128GB of RAM, for a total of 24000 hyper-threads and 128TB of RAM. Their fastest running time on the Hyperlink2012 graph is 341s on their 1000 node system. Our implementation solves connectivity on this graph in 38.3s-8.8x faster on a system with 128x less memory and 166x fewer cores. They also report running times for solving connectivity on a private Yahoo! webgraph with 272 billion vertices and 5.9 trillion edges, over 26 times the size of our largest graph. While such a graph seems to currently be out of reach of our machine, we are hopeful that techniques from theoretically-efficient parallel algorithms can help solve problems on graphs at this scale and beyond.
CONCLUSION
In this paper, we showed that we can process the largest publicly-available real-world graph on a single shared-memory server with 1TB of memory using theoretically-efficient parallel algorithms. We outperform existing implementations on the largest real-world graphs, and use far fewer resources than the distributed-memory solutions. On a per-core basis, our numbers are significantly better. Our results provide evidence that theoretically-efficient shared-memory graph algorithms can be efficient and scalable in practice.
