The enumeration of all maximal cliques in an undirected graph is a fundamental problem arising in several research areas. We consider maximal clique enumeration on shared-memory, multi-core architectures and introduce an approach consisting entirely of data-parallel operations, in an effort to achieve efficient and portable performance across different architectures. We study the performance of the algorithm via experiments varying over benchmark graphs and architectures. Overall, we observe that our algorithm achieves up to a 33-fold speedup and a 9-fold speedup over state-of-the-art distributed and serial algorithms, respectively, for graphs with higher ratios of maximal cliques to total cliques. Further, we attain additional speedups on a GPU architecture, demonstrating the portable performance of our data-parallel design.
INTRODUCTION
Over the past decade, supercomputers have gone from commodity clusters built out of nodes containing a single CPU core to nodes containing small numbers of CPU cores to nodes containing many cores, whether self-hosted (i.e., Xeon Phi) or from accelerators (typically GPUs). For software developers, this has led to a significant change. One decade ago, software projects could target only distributed-memory parallelism, and, on a single node, could use a single-threaded approach, typically with C or Fortran. Now, software projects must consider shared-memory parallelism in addition to distributed-memory parallelism. Further, because the architectures on leading edge supercomputers vary, software projects have an additional difficulty, namely how to support multiple architectures simultaneously. One approach to this problem is to maintain separate implementations for separate architectures. That is, to have a CUDA implementation for NVIDIA GPUs and a TBB implementation for Intel architectures. That said, this approach has drawbacks. For one, the software development time increases, as modules need to be developed once for each architecture. Another problem caused by this approach is a lack of future-proofing; as supercomputers adopt new processor architectures (e.g., FPGA), the code for each module must be re-written, or possibly even re-thought.
Data-parallel primitives (DPPs) [6] provide an approach for developing a single code base that can run over multiple architectures in a portably performant way. DPPs are customizable building blocks, meaning that algorithm design consists of composing these building blocks to solve a problem. To be a DPP, an algorithm needs to execute in O(log N) time on an array of size N, provided there are N or more cores to work with. Well-known patterns, such as map, reduce, gather, scatter, and scan, meet this property, and are some of the most commonly used DPPs. While programming with DPPs requires re-thinking algorithms, the payoff comes in reduced code size, reduced development time, portable performance, and the ability to port to new architectures with reduced effort.
* e-mail: blessley@cs.uoregon.edu † e-mail: tperciano@lbl.gov ‡ e-mail: mmathai@cs.uoregon.edu § e-mail: hank@cs.uoregon.edu ¶ e-mail: ewbethel@lbl.gov
With this paper, we explore the maximal clique enumeration problem in the context of DPPs. Our motivator to pursue this work was an image processing problem that required maximal cliques and also aimed to support multiple architectures. In fact, very recent algorithms for the analysis of experimental image data take advantage of graphical models with maximal clique analysis and high performance computing techniques as in [38, 39] . That said, we have found that, for certain conditions, our approach is competitive with leading maximal clique implementations. We focus our comparisons on state-of-the-art maximal clique solvers. In our experiments, we find that our DPP-based algorithm is faster than the leading solutions for some graphs, namely those with higher ratios of maximal cliques to total cliques. Overall, the contribution of this paper is an important demonstration that the DPPs approach can work well for graphs, as well as a specific algorithm for maximal clique enumeration.
BACKGROUND AND RELATED WORK

Maximal Clique Enumeration
A graph G consists of a set of vertices V, some pairs of which are joined to form a set of edges E. A subset of vertices C ⊆ V is a clique, or complete subgraph, if each vertex in C is connected to every other vertex in C via an edge. C is a maximal clique if its vertices are not all contained within any other larger clique in G. The size of a clique can range from zero, if there are no edges in G, to the number of vertices in V, if every vertex is connected to every other vertex (i.e., G is a complete graph). The maximum clique is the clique of largest size within G, and is itself maximal, since it cannot be contained within any larger-sized clique. The task of finding all maximal cliques in a graph is known as maximal clique enumeration (MCE). Figure 1 illustrates a graph with 6 vertices and 9 undirected edges. An application of MCE on this graph would search through 15 total cliques, of which only 3 are maximal.
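The definitions above can be checked by brute force on a small graph. The following is a minimal sketch, not part of the proposed algorithm. The edge list is a hypothetical 6-vertex, 9-edge graph chosen only so that it reproduces the counts quoted above (15 cliques, 3 of them maximal); it is not claimed to be the graph of Figure 1. The sketch also assumes that only cliques with at least two vertices are counted.

```python
from itertools import combinations

# Hypothetical graph (an assumption for illustration): 6 vertices, 9 edges.
EDGES = [(0, 1), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 5), (3, 4)]
NUM_VERTICES = 6

def build_adjacency(num_vertices, edges):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adj = {v: set() for v in range(num_vertices)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def is_clique(adj, vertices):
    """True if every pair of the given vertices is joined by an edge."""
    return all(b in adj[a] for a, b in combinations(vertices, 2))

def all_cliques(adj, min_size=2):
    """Enumerate every clique with at least min_size vertices (exponential time)."""
    verts = sorted(adj)
    found = []
    for size in range(min_size, len(verts) + 1):
        for cand in combinations(verts, size):
            if is_clique(adj, cand):
                found.append(cand)
    return found

def maximal_cliques(cliques):
    """Keep only the cliques that are not a proper subset of another clique."""
    clique_sets = [set(c) for c in cliques]
    return [c for c in cliques if not any(set(c) < other for other in clique_sets)]

adj = build_adjacency(NUM_VERTICES, EDGES)
cliques = all_cliques(adj)
maximal = maximal_cliques(cliques)
print(len(cliques), "cliques,", len(maximal), "maximal:", maximal)
```

Exhaustive enumeration of this kind is only feasible for very small graphs, which is why the algorithms surveyed below prune or structure the search.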
The maximum number of maximal cliques possible in G is exponential in size; thus, MCE is considered an NP-Hard problem for general graphs in the worst case [34]. However, for certain sparse graph families that are encountered in practice (e.g., bipartite and planar), G typically contains only a polynomial number of cliques, and numerous algorithms have been introduced to efficiently perform MCE on real-world graphs. A brief survey of prior MCE research, including the algorithms we compare against in our study, is provided later in this section.
[28]. Moreover, several DPP-based algorithms have been introduced for the construction of spatial search structures in the visualization domain (e.g., ray tracing), particularly for real-time use on graphics hardware. These include k-d trees, uniform grids, two-level grids, bounding volume hierarchies (BVH), and octrees [29, 49, 19, 18, 17, 22, 25]. Finally, our experiments make use of the VTK-m framework [36], which is the same framework used in several of these scientific visualization studies. VTK-m is effectively the unification of three predecessor visualization libraries, DAX [35], EAVL [33], and PISTON [29], each of which was constructed on DPPs with an aim to achieve portable performance across multiple many-core architectures.
Related Work
Maximal Clique Enumeration
Several studies have introduced algorithms for MCE. These algorithms can be categorized along two dimensions: traversal order of clique enumeration and whether it is serial or parallel.
Serial depth-first MCE uses a backtracking search technique to recursively expand partial cliques with candidate vertices until maximal cliques are discovered. This process represents a search forest in which the set of vertices along a path from a root to a child constitutes a clique, and a path from a root to a leaf vertex forms a maximal clique. Upon discovering a maximal clique, the algorithm backtracks to the previous partial clique and branches into a recursive expand operation with another candidate vertex. This approach limits the size of the search space by only exploring search paths that will lead to a maximal clique.
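As an illustration of this backtracking scheme, the following minimal sketch follows the classic Bron-Kerbosch recursion without pivoting. It is shown as a representative of the depth-first family, not as the implementation of any of the cited works. The edge list is the same hypothetical 6-vertex graph assumed for illustration elsewhere in these sketches.

```python
# Hypothetical example graph (an assumption, not taken from the paper's figures).
EDGES = [(0, 1), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 5), (3, 4)]

def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def bron_kerbosch(adj, partial, candidates, excluded, out):
    """Recursively expand the partial clique with candidate vertices.

    partial:    vertices of the clique built so far
    candidates: vertices adjacent to all of partial that may still extend it
    excluded:   vertices adjacent to all of partial that were already explored
    """
    if not candidates and not excluded:
        # Nothing can extend the partial clique, so it is maximal.
        out.append(tuple(sorted(partial)))
        return
    for v in sorted(candidates):
        bron_kerbosch(adj, partial | {v}, candidates & adj[v], excluded & adj[v], out)
        # Backtrack: v has been fully explored from this partial clique.
        candidates = candidates - {v}
        excluded = excluded | {v}

adj = build_adjacency(EDGES)
found = []
bron_kerbosch(adj, set(), set(adj), set(), found)
print(sorted(found))
```

The excluded set is what prevents the same maximal clique from being reported along more than one search path.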
The works in [5, 7] introduce two of the earliest serial backtracking-based algorithms for MCE; the implementation of the algorithm in [7] attained more prominence due to its simplicity and effective performance for most practical graphs. The algorithms proposed in [16, 20, 45, 26, 10, 31] build upon [7] and devise similar depth-first, tree-based search algorithms. Tomita et al. [43] optimize the clique expansion (pivoting) strategy of [7] to prune unnecessary subtrees of the search forest, make fewer recursive calls, and demonstrate very fast execution times in practice, as compared to [7, 45, 10, 31]. Eppstein et al. [13, 14] develop a variant of [7] that uses a degeneracy ordering of candidate vertices to order the sequence of recursive calls made at the top-most level of recursion. Then, during the inner levels of recursion, the improved pivoting strategy described in [43] is used to recurse on candidate vertices. The work in [14] also introduces two variants of this algorithm and proposes a memory-efficient version of [43] using adjacency lists. Experimental results indicate that [14] is highly competitive with the memory-optimized [43] on large sparse graphs, and within a small constant factor on other graphs.
Distributed-memory, depth-first MCE research has also been conducted. Du et al. [12] present an approach that assigns each parallel process a disjoint subgraph of vertices and then conducts serial depth-first MCE on a subgraph; the union of outputs from each process represents the complete set of maximal cliques. Schmidt et al. [40] introduce a parallel variant of [7] that improves the process load balancing of [12] via a dynamic work-stealing scheme. In this approach, the search tree is explored in parallel among compute nodes, with unexplored search subtrees dynamically reassigned to underutilized nodes. Lu et al. [30] and Wu et al. [47] both introduce distributed parallel algorithms that first enumerate maximal, duplicate, and non-maximal cliques, then perform a post-processing phase to remove all the duplicate and non-maximal cliques. Dasari et al. [11] expand the work of [14] to a distributed, MapReduce environment, and study the performance impact of various vertex-ordering strategies, using a memory-efficient partial bit adjacency matrix to represent vertex connectivity within a partitioned subgraph. Svendsen et al. [42] present a distributed MCE algorithm that uses an enhanced load balancing scheme based on a carefully chosen ordering of vertices. In experiments with large graphs, this algorithm significantly outperformed the algorithm of [47].
Serial breadth-first MCE iteratively expands all k-cliques into (k + 1) cliques, enumerating maximal cliques in increasing order of size. The number of iterations is typically equal to the size of the largest maximal clique. Kose et al. [21] and Zhang et al. [48] introduce algorithms based on this approach. However, due to the large memory requirements of these algorithms, depth-first-based algorithms have attained more prevalence in recent MCE studies [40, 42] .
Shared-memory breadth-first MCE on a single node has not been actively researched to the best of our knowledge. In this study, we introduce a breadth-first approach that is designed in terms of data-parallel primitives. These primitives enable MCE to be conducted in a massively-parallel fashion on shared-memory architectures, including GPU accelerators, which are designed to perform this data-parallel computation. We compare the performance of our algorithm against that of Tomita et al. [43] , Eppstein et al. [14] and Svendsen et al. [42] . These studies provide suitable benchmark comparisons because they each introduce the leading MCE implementations in their respective categories: Tomita et 
DATA-PARALLEL PRIMITIVES
The new algorithm presented in this study is described in terms of data-parallel primitives, or DPPs. These primitives provide high-level abstractions and permit new algorithms to be platform-portable across many environments. The following primitives are used in our algorithm implementation:
• Map: Applies an operation on all elements of the input array, storing the result in an output array of the same size, at the same index;
• Reduce: Applies a summary binary operation (e.g., summation or maximum) on all elements of an input array, yielding a single output value. ReduceByKey is a variation which performs Reduce on the input array, segmenting it based on a key or unique data value in the input array, yielding an output value for each key;
Most DPPs can be extended with a developer-supplied functor, enabling custom operations. For example, if a developer wants to extract all perfect squares from an input array, they can write a unary predicate functor that checks whether the fractional part of the square root of each value is zero. The Compact DPP then executes the functor over the entire input array in parallel.
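To make the semantics of these primitives concrete, the sketch below gives serial Python stand-ins for Map, Reduce, ReduceByKey, ExclusiveScan, and Compact. The function names are hypothetical and are not the VTK-m API; a real DPP library would execute each call in parallel. The perfect-square predicate uses an integer square root rather than a floating-point fractional part, which is a substitution made here to avoid rounding issues.

```python
import math
from functools import reduce

def dpp_map(op, xs):
    """Map: apply op to every element, writing the result at the same index."""
    return [op(x) for x in xs]

def dpp_reduce(op, xs, init):
    """Reduce: combine all elements into one value with a binary operation."""
    return reduce(op, xs, init)

def dpp_reduce_by_key(keys, values, op):
    """ReduceByKey: reduce each run of equal keys (keys are assumed sorted)."""
    out_keys, out_vals = [], []
    for k, v in zip(keys, values):
        if out_keys and out_keys[-1] == k:
            out_vals[-1] = op(out_vals[-1], v)
        else:
            out_keys.append(k)
            out_vals.append(v)
    return out_keys, out_vals

def dpp_exclusive_scan(xs):
    """ExclusiveScan: out[i] is the sum of all elements before index i."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out

def dpp_compact(xs, pred):
    """Compact: keep only the elements for which the predicate functor holds."""
    return [x for x in xs if pred(x)]

def is_perfect_square(x):
    """Unary predicate functor: true if x is the square of an integer."""
    return x >= 0 and math.isqrt(x) ** 2 == x

values = [1, 2, 4, 7, 9, 16, 20]
squares = dpp_compact(values, is_perfect_square)
total = dpp_reduce(lambda a, b: a + b, values, 0)
doubled = dpp_map(lambda x: 2 * x, values)
keys, counts = dpp_reduce_by_key([0, 0, 0, 1, 1, 3], [1] * 6, lambda a, b: a + b)
offsets = dpp_exclusive_scan(counts)
print(squares, total, doubled, keys, counts, offsets)
```

The last two calls show the ReduceByKey and ExclusiveScan pairing that turns a sorted key array into per-key counts and starting offsets.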
ALGORITHM
This section presents our new DPP-based MCE algorithm, which consists of an initialization procedure followed by the main computational algorithm. The goal of the initialization procedure is to represent the graph data in a compact format that fits within shared memory. The main computational algorithm enumerates all of the maximal cliques within this graph. The implementation of this algorithm is available online [4] , for reference and reproducibility.
Initialization
In this phase, we construct a compact graph data structure that consists of four component vectors, I, C, E, and V, each of which is defined in the construction steps below. This data structure is known as a v-graph [6] and it is constructed using only data-parallel operations. The compressed form of the v-graph in turn enables efficient data-parallel operations for our MCE algorithms. We construct the v-graph as follows. Refer to algorithm 1 for pseudocode of these steps and Figure 2 for an illustration of the input and output.
1. Reorder: Accept either an undirected or directed graph file as input; if the graph is directed, then it will be converted into an undirected form. We re-order each edge so that its smaller vertex Id comes first, i.e., (b, a) becomes (a, b) whenever b > a. This maintains the ascending vertex order that is needed in our algorithms;
2. Sort: Invoke a data-parallel Sort primitive to arrange all edge pairs in ascending order (line 9 of algorithm 1). The input edges in Figure 2 provide an example of this sorted order;
3. Unique: Call the Unique data-parallel primitive to remove all duplicate edges (line 10 of algorithm 1). This step is necessary for directed graphs, which may contain bi-directional edges (a, b) and (b, a);
4. Unzip: Use the Unzip data-parallel primitive to separate the edge pairs (a_i, e_i) into two arrays, A and E, such that all of the first-index vertices, a_i, are in A and all of the second-index vertices, e_i, are in E (line 11 of algorithm 1). The edges from Figure 2 are unzipped into A and E arrays in this way. The array E represents the edge list in our v-graph structure;
5. Reduce: Use the ReduceByKey data-parallel primitive to compute the edge count for each vertex (line 13 of algorithm 1). Using the arrays A and E from step 4, this operation counts the number of adjacent edges from E that are associated with each unique vertex in A. The resulting output arrays represent the lists I and C in our v-graph structure;
6. Scan: Run the ExclusiveScan data-parallel operation on the edge counts array, C, to obtain indices into the edge list, E, for each entry in I (line 15 of algorithm 1). This list of indices represents the list V in our v-graph (see Figure 2). In our running example, vertex 0 has 3 edges and vertex 1 has 3 edges, representing index segments 0-2 and 3-5 in E, respectively. Thus, vertex 0 and vertex 1 will have index values of 0 and 3, respectively.
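The six initialization steps can be sketched serially as follows. This is a minimal illustration in plain Python rather than the data-parallel implementation, and the function name build_vgraph is hypothetical. The input edge list is also an assumption: it is a 6-vertex graph, given with mixed edge directions and one duplicate, that is consistent with the running example in that vertices 0 and 1 each have 3 edges, but it is not claimed to be the exact input of Figure 2.

```python
def build_vgraph(edges):
    """Build the v-graph vectors (I, C, V, E) from a list of vertex pairs."""
    # 1. Reorder: put the smaller vertex Id first in every edge.
    pairs = [(min(a, b), max(a, b)) for a, b in edges]
    # 2. Sort: arrange all edge pairs in ascending order.
    pairs.sort()
    # 3. Unique: drop duplicates, which are adjacent after sorting.
    unique = [p for i, p in enumerate(pairs) if i == 0 or p != pairs[i - 1]]
    # 4. Unzip: split the pairs into first vertices (A) and second vertices (E).
    A = [a for a, _ in unique]
    E = [e for _, e in unique]
    # 5. ReduceByKey: count the edges of each unique vertex in A.
    I, C = [], []
    for a in A:
        if I and I[-1] == a:
            C[-1] += 1
        else:
            I.append(a)
            C.append(1)
    # 6. ExclusiveScan: starting index in E of each vertex's edge segment.
    V, total = [], 0
    for c in C:
        V.append(total)
        total += c
    return I, C, V, E

# Hypothetical input with reversed edges and one duplicate, (0, 1) and (1, 0).
raw_edges = [(1, 0), (0, 3), (4, 0), (1, 2), (3, 1), (1, 4), (2, 3), (5, 2), (3, 4), (0, 1)]
I, C, V, E = build_vgraph(raw_edges)
print("I =", I)
print("C =", C)
print("V =", V)
print("E =", E)
```

With this layout, the neighbors of the vertex stored at I[n] occupy the sorted segment E[V[n] : V[n] + C[n]], which is what makes a binary search over a vertex's adjacent edges possible later on.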
Hashing-Based Algorithm
We now describe our hashing-based algorithm to perform maximal clique enumeration, which comprises the main computational work. This algorithm takes the v-graph from the initialization phase as input. In the following subsections we provide an overview of the algorithm, along with a more detailed, step-by-step account of the primary data-parallel operations.
Algorithm Overview
We perform MCE via a bottom-up scheme that uses multiple iterations, each consisting of a sequence of data-parallel operations. During the first iteration, all 2-cliques (edges) are expanded into zero or more 3-cliques and then tested for maximality. During the second iteration, all of these new 3-cliques are expanded into zero or more 4-cliques and then tested for maximality, and so on until there are no newly-expanded cliques. The number of iterations is equal to the size of the maximum clique, which itself is maximal and cannot be expanded into a larger clique. Figure 3 presents the progression of clique expansion for the example graph in Figure 1. During this process, we assess whether a given k-clique is a subset of one or more larger (k + 1)-cliques. If so, then the k-clique is marked as non-maximal and the new (k + 1)-cliques are stored for the next iteration; otherwise, the k-clique is marked as maximal and discarded from further computation.
In order to determine whether a k-clique is contained within a larger clique, we use a hashing scheme that searches through a hash table of cliques for another k-clique with the same hash value. These matching cliques share common vertices and can be merged into a larger clique if certain criteria are met. Thus, hashing is an important element to our algorithm. Figure 4 illustrates the clique merging process between two different k-cliques.
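The iteration structure of this bottom-up scheme can be sketched serially as below. To stay short, the sketch replaces the hashing-based merge with a simpler test that is an assumption of this illustration: a k-clique is expanded by every vertex adjacent to all of its members, and it is maximal exactly when no such vertex exists. The graph is the same hypothetical 6-vertex example assumed for illustration, not necessarily the graph of Figure 1.

```python
# Hypothetical example graph (an assumption for illustration).
EDGES = [(0, 1), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 5), (3, 4)]

def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def bottom_up_mce(adj):
    """Expand k-cliques into (k+1)-cliques until no new cliques appear."""
    # Iteration input: all 2-cliques (edges), each in ascending vertex order.
    cliques = sorted((a, b) for a in adj for b in adj[a] if a < b)
    maximal = []
    while cliques:
        expanded = set()
        for clique in cliques:
            # Vertices adjacent to every member can extend this clique.
            common = set.intersection(*(adj[v] for v in clique))
            if common:
                for w in common:
                    expanded.add(tuple(sorted(clique + (w,))))
            else:
                # No extension exists: maximal, dropped from later iterations.
                maximal.append(clique)
        # The new (k+1)-cliques become the input of the next iteration.
        cliques = sorted(expanded)
    return maximal

result = bottom_up_mce(build_adjacency(EDGES))
print(result)
```

Maximal cliques are emitted in increasing order of size, and only non-maximal cliques are carried forward, which is the property that ties memory use to the ratio of maximal cliques to total cliques.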
Algorithm Details
Within each iteration, our MCE algorithm consists of three phases: dynamic hash table construction, clique expansion, and clique maximality testing. These phases are conducted in sequence and invoke only data-parallel operations. In the running example, the five 3-cliques are stored in a flat cliques array of length (k = 3) × (numCliques = 5) = 15.
Dynamic Hash Table Construction:
As our algorithm uses hashing as an integral component, we discuss the operations that are used to construct a hash table into which the cliques are hashed and queried.
First, each clique is hashed to an integer (line 21 of algorithm 2). This is done using the FNV-1a hash function [3], h, and taking the result modulo the number of cliques. Only the clique's last (k − 1) vertex indices are hashed, because we just need to search (via a hash table) for matching (k − 1)-cliques to form a new (k + 1)-clique. For example, cliques 0-3-4 and 1-3-4 both hash their last two vertices to a common index, i.e., h(3-4), and can combine to form 0-1-3-4, since leading vertices 0 and 1 are connected (see Figure 4).
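A minimal sketch of this hashing step is given below. It uses the standard 64-bit FNV-1a constants. How the vertex Ids are fed to the hash is not specified above, so the byte encoding used here (each vertex Id as 8 little-endian bytes) and the helper name clique_hash are assumptions of this sketch.

```python
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64
    return h

def clique_hash(clique, num_cliques):
    """Hash only the last (k - 1) vertices, modulo the number of cliques."""
    suffix = clique[1:]
    # Assumed encoding: each vertex Id as 8 little-endian bytes.
    data = b"".join(v.to_bytes(8, "little") for v in suffix)
    return fnv1a_64(data) % num_cliques

h_034 = clique_hash((0, 3, 4), 5)
h_134 = clique_hash((1, 3, 4), 5)
print(h_034, h_134)
```

Because the leading vertex is excluded, any two k-cliques with the same trailing (k − 1) vertices land at the same index, which is the property the expansion phase relies on.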
Next, we allocate an array, hashTable, of the same size as cliques, into which the cliques will be rearranged (permuted) in order of hash value. In the running example, this yields a table with three chains of contiguous cliques, each sharing the same hash value. Since the cliques within a chain are not necessarily in sorted order, the chain must be probed sequentially using a constant number of lookups to find the clique of interest. This probing is also necessary since different cliques may possess the same hash value, resulting in collisions in the chain. For instance, cliques 0-1-3 and 0-1-4 both hash to index h(1-3) = h(1-4) = 1, creating a collision in Chain 1. Thus, the hash function is important, as good function choices help minimize collisions, while poor choices create more collisions and, thus, more sequential lookups.
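One way to realize this rearrangement is to sort the cliques by hash value and record where each chain begins, as in the sketch below. Sorting by key is an assumption of this sketch, as are the names build_hash_table and toy_hash. The toy hash (the sum of the trailing vertices) is used only so that the result is easy to follow by hand; it is not FNV-1a, so the chains differ from those in the running example.

```python
def toy_hash(clique):
    """Hypothetical stand-in hash over the last (k - 1) vertices."""
    return sum(clique[1:])

def build_hash_table(cliques, hash_fn):
    """Permute cliques into hash order and return the chain boundaries."""
    n = len(cliques)
    hashes = [hash_fn(c) % n for c in cliques]
    # Stable sort of clique indices by hash value (a SortByKey on the hashes).
    order = sorted(range(n), key=lambda i: hashes[i])
    table = [cliques[i] for i in order]
    sorted_hashes = [hashes[i] for i in order]
    # chain_starts[p] is the position of the first clique in p's chain.
    chain_starts = []
    for p in range(n):
        if p > 0 and sorted_hashes[p] == sorted_hashes[p - 1]:
            chain_starts.append(chain_starts[-1])
        else:
            chain_starts.append(p)
    return table, sorted_hashes, chain_starts

three_cliques = [(0, 1, 3), (0, 1, 4), (0, 3, 4), (1, 2, 3), (1, 3, 4)]
table, sorted_hashes, chain_starts = build_hash_table(three_cliques, toy_hash)
print(table)
print(sorted_hashes)
print(chain_starts)
```

In this toy run, cliques 0-1-4 and 1-2-3 share a chain even though their trailing vertices differ, which is the kind of collision that forces the suffix comparison during probing.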
Clique Expansion: Next, a two-step routine is performed to identify and retrieve all valid (k + 1)-cliques for each k-clique in hashTable. The first step focuses on determining the sizes of output arrays and the second step focuses on allocating and populating these arrays.
In the first step, a Map primitive computes and returns the number of (k + 1)-cliques into which a k-clique can be expanded (line 29 of algorithm 2). For a given k-clique, i, at cliqueStarts[i], we locate its chain at chainStarts[i] and iterate through the chain, searching for another k-clique, j, with (a) a larger leading vertex Id and (b) the same ending (k − 1) vertices; these two criteria are needed to generate a larger clique and avoid duplicates (see Theorem 7.1 and Theorem 7.3). For each matching clique j in the chain, we perform a binary search over the adjacent edges of i in the v-graph edge list E to determine whether the leading vertices of i and j are connected. If so, then, by Theorem 7.1, cliques i and j can be expanded into a larger (k + 1)-clique consisting of the two leading vertices and the shared (k − 1) vertices, in ascending order. The total number of expanded cliques for i is returned. In our example, this routine returns a counts array, newCliqueCounts = [1 0 0 0 0], indicating that only clique 0-3-4 could be expanded into a new 4-clique; Figure 4 illustrates the generation of this 4-clique.
In the second step, an inclusive Scan primitive is invoked on newCliqueCounts to compute the sum of the clique counts, numNewCliques (line 30 of algorithm 2). This sum is used to allocate a new array, cliques, of size numNewCliques · (k + 1) to store all of the (k + 1)-cliques, along with a new offset index array with increments of (k + 1). With these arrays, we invoke a parallel Map operation that is identical to the Map operation of the first step, except that, upon discovery of a new (k + 1)-clique, we write the clique out to its location in cliques (using the offset array), instead of incrementing a newCliques counter (line 32 of algorithm 2). For the running example, the new cliques array consists of the single 4-clique, 0-1-3-4.
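The two-step count-then-populate pattern described above can be sketched as follows. This is a serial illustration under several assumptions: the helper names are hypothetical, chains are given as explicit (start, end) ranges per table entry, and a dict of sorted neighbor lists stands in for the per-vertex segments of the v-graph edge list E. The table order comes from a toy hash, so the counts array is a permutation of the one in the running example.

```python
from bisect import bisect_left
from itertools import accumulate

def connected(neighbors, a, b):
    """Binary search for b in the sorted neighbor segment of a (with a < b)."""
    seg = neighbors.get(a, [])
    pos = bisect_left(seg, b)
    return pos < len(seg) and seg[pos] == b

def matches(table, chains, neighbors, i):
    """Yield each (k+1)-clique formed by table[i] and a clique in its chain."""
    lead, suffix = table[i][0], table[i][1:]
    start, end = chains[i]
    for j in range(start, end):
        other = table[j]
        # (a) larger leading vertex, (b) same trailing vertices, then adjacency.
        if other[0] > lead and other[1:] == suffix and connected(neighbors, lead, other[0]):
            yield (lead, other[0]) + suffix

def expand(table, chains, neighbors):
    """Two-step expansion: count per clique, scan, then write into a flat array."""
    n = len(table)
    # Step 1 (Map): number of new cliques each k-clique produces.
    counts = [sum(1 for _ in matches(table, chains, neighbors, i)) for i in range(n)]
    # Inclusive scan gives the total; subtracting counts gives write offsets.
    inclusive = list(accumulate(counts))
    num_new = inclusive[-1] if inclusive else 0
    offsets = [inc - c for inc, c in zip(inclusive, counts)]
    # Step 2 (Map): same search, but write each clique to its slot.
    width = len(table[0]) + 1 if table else 0
    flat = [None] * (num_new * width)
    for i in range(n):
        for m, clique in enumerate(matches(table, chains, neighbors, i)):
            base = (offsets[i] + m) * width
            flat[base:base + width] = clique
    return counts, flat

# Hypothetical inputs: hash-ordered 3-cliques, their chain ranges, and neighbors.
table = [(0, 1, 4), (1, 2, 3), (0, 3, 4), (1, 3, 4), (0, 1, 3)]
chains = [(0, 2), (0, 2), (2, 4), (2, 4), (4, 5)]
neighbors = {0: [1, 3, 4], 1: [2, 3, 4], 2: [3, 5], 3: [4]}
counts, flat = expand(table, chains, neighbors)
print(counts, flat)
```

Running the search twice avoids any dynamic allocation inside the parallel Map: every clique knows its output slot before anything is written.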
Clique Maximality Test: Finally, we assess whether each k-clique is maximal or not. Prior to Clique Expansion, a bit array, isMaximal, of length numCliques, is initialized with all 1s. During the first step of Clique Expansion, if a clique i merged with one or more cliques j, then they all are encompassed by a larger clique and are not maximal; thus, we set isMaximal[i] = isMaximal[j] = 0, for all j. Since each (k + 1)-clique includes (k + 1) distinct k-cliques, two of which are the ones that formed the (k + 1)-clique, we must ensure that the remaining k − 1 k-cliques are marked as non-maximal with a value of 0 in isMaximal. In our example, the 4-clique 0-1-3-4 is composed of 4 different 3-cliques: 1-3-4, 0-3-4, 0-1-4, and 0-1-3. The first two were already marked as non-maximal, but the remaining two are non-maximal as well, and need to be marked as such in this phase. Our approach for marking these remaining cliques as non-maximal is as follows.
The algorithm terminates when numNewCliques = 0 (line 17 of algorithm 2). The generated cliques array of new (k + 1)-cliques becomes the starting array (line 37 of algorithm 2) for the next iteration (line 38 of algorithm 2), if the termination condition is not met.
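The marking requirement for the maximality test can be illustrated with the sketch below. It shows one possible way to clear the flags and is not necessarily the approach taken in the implementation: each new (k + 1)-clique enumerates its (k + 1) sub-cliques by dropping one vertex at a time, and a Python dict lookup stands in for the hash-table probe. The function name mark_non_maximal is hypothetical.

```python
def mark_non_maximal(k_cliques, new_cliques):
    """Return a 0/1 list flagging which k-cliques remain maximal."""
    # Assumed lookup structure: clique tuple -> position in k_cliques.
    index_of = {c: i for i, c in enumerate(k_cliques)}
    is_maximal = [1] * len(k_cliques)
    for big in new_cliques:
        # Dropping each vertex in turn yields the (k+1) contained k-cliques.
        for drop in range(len(big)):
            sub = big[:drop] + big[drop + 1:]
            if sub in index_of:
                is_maximal[index_of[sub]] = 0
    return is_maximal

three_cliques = [(0, 1, 3), (0, 1, 4), (0, 3, 4), (1, 2, 3), (1, 3, 4)]
four_cliques = [(0, 1, 3, 4)]
flags = mark_non_maximal(three_cliques, four_cliques)
maximal_now = [c for c, f in zip(three_cliques, flags) if f]
print(flags, maximal_now)
```

For the running example's 4-clique 0-1-3-4, all four contained 3-cliques are cleared, including 0-1-4 and 0-1-3, which did not take part in the merge.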
EXPERIMENTAL OVERVIEW
We assess the performance of our MCE algorithm in two phases, using a collection of benchmark input graphs and both CPU and GPU systems. In the first phase, we run our algorithm, denoted as Hashing, on a CPU platform and compare its performance with three state-of-the-art MCE algorithms: Tomita [43], Eppstein [14], and Svendsen [42]. In the second phase, we evaluate portable performance by testing Hashing on a GPU platform and comparing the runtime performance with that of the CPU platform, using a common set of benchmark graphs. The following subsections describe our software implementation, hardware platforms, and input graph datasets.
Software Implementation
Both of our MCE algorithms are implemented using the platform-portable VTK-m toolkit [4], which supports fine-grained concurrency for data analysis and scientific visualization algorithms. With VTK-m, a developer chooses data-parallel primitives to employ, and then customizes those primitives with functors of C++-compliant code. This code is then used to create architecture-specific code for architectures of interest, i.e., CUDA code for NVIDIA GPUs and Threading Building Blocks (TBB) code for Intel CPUs. Thus, by refactoring an algorithm to be composed of VTK-m data-parallel primitives, it only needs to be written once to work efficiently on multiple platforms. In our experiments, the TBB configuration of VTK-m was compiled using the gcc compiler, the CUDA configuration using the nvcc compiler, and the VTK-m index integer (vtkm::Id) size was set to 64 bits. The implementation of this algorithm is available online [4], for reference and reproducibility.
Test Platforms
We conducted our experiments on the following two CPU and GPU platforms:
• CPU: A 16-core machine running 2 nodes, each with a 3.2 GHz Intel Xeon(R) E5-2667v3 CPU with 8 cores. This machine contains 256 GB of DDR4 RAM. All the CPU experiments use the Intel TBB multi-threading library for many-core parallelism.
• GPU: An NVIDIA Tesla K40 Accelerator with 2880 processor cores, 12 GB memory, and 288 GB/sec memory bandwidth. Each core has a base frequency of 745 MHz, while the GDDR5 memory runs at a base frequency of 3 GHz. All GPU experiments use NVIDIA CUDA V6.5. 
Test Data Sets
We applied our algorithm to a selected set of benchmark and real-world graphs from the DIMACS Challenge [1] and Stanford Large Network Dataset collections [2]. Table 1 lists a subset of these test graphs, along with their statistics pertaining to topology and clique enumeration. For each graph, we specify the number of vertices (V), edges (E), maximum clique size (Max size), number of maximal cliques (Cliques max), number of total cliques (Cliques all), and ratio of maximal cliques to total cliques (Max ratio). The DIMACS Challenge data set includes a variety of benchmark instances of randomly-generated and topologically challenging graphs, ranging in size and connectivity. The Stanford Large Network Data Collection contains a broad array of real-world directed and undirected graphs from social networks, web graphs, road networks, and autonomous systems, to name a few.
RESULTS
In this section, we present the results of our set of MCE experiments, which consists of two phases: CPU and GPU.
Phase 1: CPU
This phase assesses the performance of our Hashing algorithm on a CPU architecture with the set of graphs listed in Table 2 and Table 3. For each graph in Table 2, the total runtime (in seconds) of Hashing is compared with that of Tomita and Eppstein, two serial algorithms that have demonstrated state-of-the-art performance for MCE. The set of graphs used for comparison was adopted from the paper of Eppstein et al. [14], which compared the CPU results of three newly-introduced MCE algorithms with that of the Tomita algorithm. In this phase, we report the best total runtime among these three algorithms as Eppstein. Moreover, we only test on those graphs from [14] that are contained within the DIMACS Challenge and Stanford Large Network Data collections. Among these graphs, 9 were omitted from the comparison because our Hashing algorithm exceeded available shared memory on our single-node CPU system (approximately 256 GB). Each of these graphs has a very large number of non-maximal cliques relative to maximal cliques. Thus, most of these non-maximal cliques are progressively expanded and passed on to the next iteration of our algorithm, increasing the computational workload and storage requirements. Reducing our memory needs and formalizing the graph properties that lead to a high memory consumption by our algorithm will be investigated in future work.
From Table 2, we observe that Hashing performed comparably or better on more than half (8 out of 15) of the test graphs. Using the graph statistics from Table 1, it is apparent that our algorithm performs best on graphs with a high ratio of maximal cliques to total cliques, Max ratio. This is because, upon identification, maximal cliques are discarded from further computation in our algorithm. So, the larger the number of maximal cliques, the smaller the amount of computation and memory accesses that will need to be performed. Tomita and Eppstein do not perform as well on these types of graphs due to the extra sequential recursive branching and storage of intermediary cliques that is needed to discover a large number of maximal cliques. From Table 2 we see that Tomita exceeded the available shared memory of its CPU system (approximately 3 GB) for the majority of the graphs on which we possess the faster runtime.
Next, we compare Hashing to the CPU-based distributed-memory MCE algorithm of Svendsen et al. [42], which we refer to as Svendsen. We use the set of 12 test graphs from [42], 10 of which are from the Stanford Large Network Data Collection and 2 of which are from the DIMACS Challenge collection. As can be seen in Table 3, we attain significantly better total runtimes for 3 of the graphs. Each of these graphs has a high value of Max ratio, corroborating the findings from the CPU experiment of Table 2. For the remaining 9 graphs, one completed in a very slow runtime (loc-Gowalla) and 8 exceeded available shared memory. We do not report the graphs that failed to finish processing due to insufficient memory; each of these graphs has a low value of Max ratio. The loc-Gowalla graph just fits within available device memory, but possesses a low Max ratio (see Table 1), leading to a significantly slower runtime than Svendsen.
Phase 2: GPU
Next, we demonstrate and assess the portable performance of Hashing by running it on a GPU architecture, using the graphs from Table 4 . Each GPU time is compared to both the Hashing CPU time and the best time between the Tomita and Eppstein algorithms. From Table 4 we observe that, for 8 of the 9 graphs, Hashing GPU achieves a speedup over the CPU. Further, for 5 of these 8 graphs, Hashing GPU performs better than both Hashing CPU and Tomita/Eppstein. These speedups demonstrate the ability of a GPU architecture to utilize the highly-parallel design of our algorithm, which consists of many fine-grained and compute-heavy data-parallel operations. Moreover, this experiment demonstrates the portable performance of our algorithm, as we achieved improved execution times without having to write custom, optimized GPU functions within our algorithm; the same high-level algorithm was used for both the CPU and GPU experiments.
CONCLUSIONS AND FUTURE WORK
We have described a data-parallel primitive (DPP)-based algorithm for maximal clique enumeration that has shown good performance on both CPUs and GPUs. The algorithm performs well on graphs with a large ratio of maximal cliques to total cliques, and outperformed leading implementations on standard data sets. Overall, the contribution of the paper is not only a new algorithm for maximal clique enumeration, but also significant evidence that the DPP approach can work well for graphs. In terms of future work, the memory requirements for our approach prevented us from considering certain graphs, especially in a GPU setting. We would like to reduce overall memory usage and also explore the usage of host memory when running on the GPU, such as by leveraging the out-of-core MCE techniques presented in Cheng et al. [9]. We also hope to improve the algorithm to work better on denser graphs with low ratios of maximal cliques to total cliques, via pruning strategies for cliques that will eventually be covered by larger non-maximal cliques. Finally, when our algorithm cannot be further improved, we would like to formalize the conditions for which it outperforms the current leading algorithms.
ACKNOWLEDGMENT
[44], which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system [37], which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).
APPENDIX
The following lemmas and theorem relate to properties upon which our hashing and sorting-based algorithms are based.
Proof. Please refer to [21]. Figure 4 demonstrates this property by creating a 4-clique, 0-1-3-4, from two 3-cliques, 0-3-4 and 1-3-4, both of which share the 2-clique 3-4. Effectively, once two k-cliques with matching (and trailing) (k − 1) vertices are found, we only need to test whether the leading vertices are connected; if so, the two k-cliques can be merged into a new (k + 1)-clique.
Proof. In our algorithm, a k-clique is only matched with another k-clique that has a higher leading vertex Id. Both of the leading vertices of these two k-cliques have lower Ids and are distinct from the vertices of the matching (k − 1)-clique. By induction, this (k − 1)-clique must also be in ascending vertex order. Thus, the expanded (k + 1)-clique must possess an ascending vertex Id order.
Proof. We will prove by induction on the size k.
• Base case of k = 2. The starting set of 2-cliques are the edges from the v-graph edge list, all of which are unique, since duplicate edges were removed in the initialization routine. 
