The clustering coefficient and the transitivity ratio are concepts often used in network analysis, which creates a need for fast practical algorithms for counting triangles in large graphs. Previous research in this area focused on sequential algorithms, MapReduce parallelization, and fast approximations.
Introduction
The number of triangles, i.e. cycles of length three, in an undirected graph lays the foundation of the clustering coefficient and the transitivity ratio -concepts which are not only of theoretical interest but are often applied to networks analysis [7, 20] . This creates a need for fast practical algorithms capable of counting triangles in large graphs.
Previous research in this area focused on sequential algorithms [12, 15] , parallel algorithms for MapReduce model [17] , and various approximation approaches [11, 18] .
The recent emergence of frameworks for general-purpose computing on graphics processor units (GPU), such as Nvidia CUDA, started a new branch of research in parallel computing. General-purpose GPU offers the computing power of a small cluster for much lower price, but it comes at a cost of certain limitations imposed by the architecture of graphic cards. Single Instruction Multiple Data model, high latency global memory, and small cache size are obstacles particularly hard to overcome in case of memory-intensive and highly irregular graph computations.
Nevertheless, a number of studies show there are methods of dealing with these issues and certain graph problems can be solved effectively on GPU -examples include minimum spanning tree [19] , connected components [16] , breadth first search [10] , and strongly connected components [4] .
In this paper we propose a parallel triangle counting algorithm together with its CUDA implementation and discuss its performance. Running on the 448-core Nvidia Tesla C2050 graphic card we achieve 8 to 15 times speedup over a single-core CPU solution. We also examine a setup with multiple GPU. We are able to further speed up the computation up to 2.8 times when running on four cards instead of one.
A similar studies have been conducted already by Leist et al. [13] , and more recently by Chatterjee [5] and Green et al. [8] . As we argue later, our approach, despite being very simple, significantly outperforms the previous ones.
This paper is structured as follows. Section 2 outlines the algorithm we use. Section 3 provides the implementation details and discusses the optimizations we employ. Section 4 carefully describes the experiments performed to evaluate our implementation and presents the results of these experiments. Section 5 contains our conclusions.
Algorithm

Known Sequential Algorithms
Schank and Wagner [15] present an extensive list of known sequential algorithms for counting and listing triangles, analyze their theoretical time complexity, and evaluate them experimentally on both synthetic and real world graphs. Two algorithms, the edge-iterator and forward, appear to be the winners of this comparison. Their running times are O(m deg max ) and O(m √ m), respectively, where m denotes the number of edges. They perform similarly for graphs with low deviation from the average degree, but the latter is more robust to skewed degree distributions. Latapy [12] simplifies the forward algorithm, reducing its memory needs and running time, and makes its analysis more straightforward. This modified version can be seen as a variant of the edge-iterator algorithm with an additional preprocessing phase. This allows a tighter bound on the worst-case complexity and greatly reduces the amount of work to be done in the subsequent counting phase, especially in the case of graphs with the high degree deviation.
We choose forward as a starting point for our parallel algorithm. Both the preprocessing and counting phases are not only easily parallelizable but they also rarely access the memory in random fashion, which makes the algorithm a good fit for GPU.
Sequential Forward Algorithm
Now we briefly describe the sequential forward algorithm. First, the algorithm sets an arbitrary linear order ≺ on vertices which is consistent with their degrees, i.e. deg(u) < deg(v) implies u ≺ v. Then, for every vertex u, it filters its adjacency list removing vertices v such that v ≺ u. Since ≺ is antisymmetric, u remains unremoved in the adjacency list of v. This turns each undirected edge into a directed one, by choosing the orientation from the vertex with the lower degree to the vertex with the higher degree. In the final step of the preprocessing, the algorithm sorts adjacency lists according to some arbitrary, previously fixed, linear order, e.g. the natural order on vertices' numbers.
After this preprocessing is done, the algorithms iterates over all edges and, for each edge, calculates intersection of adjacency lists of both its ends. The total size of these intersections is the number of triangles in the graph. Since adjacency lists are sorted, each such intersection can be computed by a two pointers merge algorithm (as in merge-sort ) in time linear in the sum of lengths of both lists. Note that only edges that go forward according to ≺ are left. Thus, every triangle is counted exactly once, and it can be shown [12, 15] that all adjacency lists are not longer than √ m, which makes the whole algorithm run in O(m √ m) time.
Parallel Forward Algorithm
The preprocessing phase is easily parallelized with a few prefix sum and sort routines. We describe it in detail in the next section. The counting phase is parallelized by assigning a single thread to each edge. Each thread calculates adjacency lists intersection in sequential manner. This gives us an O( √ m) time algorithm running on O(m) processors, thus having the optimal work.
From a theoretical point of view, our algorithm does not have a polylogarithmic time complexity, and thus does not put the problem into the class NC. However, in practice, the speedup is always bounded by the number of available processors (or cores in the case of GPU). Our algorithm has the optimal work and O(m) speedup, thus it is optimal for graphs with the number of edges greater than the number of processors the algorithm is executed on. This is generally the case since modern graphic cards usually have only a few hundred cores.
Implementation
The implementation described in this section and all the tools used to run the experiments described in the next section are available on GitHub 1 .
Input Format
Before going to implementation details it is important to note what the input format of our algorithm is. It is often tempting to use data representation that is particularly suitable for a specific algorithm. However, in practice it is rarely the case that the data is available in the selected format and converting it may take a significant amount of time.
In most works on graph computation on GPU the input format is either an adjacency list [4, 10] or an edge array [16, 19] . We decide to use the latter. It is an array of structures, each structure composed of two values -identifiers of the two vertices a given edge connects. We assume there are no self-loops nor multiple edges, and each (undirected) edge appears exactly twice, once in each direction. We do not assume any particular order of the edges.
The rationale for of our choice is similar to the one presented in [16] . An adjacency list can be converted to an edge array with a fast and simple singlepass algorithm. The conversion in the opposite direction requires sorting, which makes it much more computationally expensive. Thus, using the edge array representation makes the algorithm more versatile in the sense that it can be used in various contexts without any significant overhead for a format conversion.
The above intuitions can be further supported by the following numbers. On the LiveJournal graph, which we use in the experiments (for more details, see Section 4), our CPU solution optimized for an adjacency list input runs about 12 seconds, while the solution optimized for an edge array input is only 2 seconds slower. On the other hand, converting this graph from the edge array representation to the adjacency list representation takes about 7 seconds.
Preprocessing Phase
The preprocessing phase consists of eight steps. They make a heavy use of the Trust library [9] .
1. Copy the edge array to the GPU memory.
2. Calculate number of vertices using thrust::reduce routine with thrust::maximum operator, which computes the largest vertex identifier across both ends of all edges.
3. Sort edges, according to the first vertex, in case of a tie according to the second vertex, using thrust::sort routine, which performs radix sort. This way the edge array becomes a concatenated adjacency list of subsequent vertices, each list sorted.
4. Calculate the node array: i-th element of this array points to the first edge in the edge array which first vertex is i. This is done by running m − 1 threads and letting k-th thread examine edges k and k + 1. If first vertices of these edges differ, the thread writes k + 1 to an adequate cell of the node array. It may happen that the thread stores this value in more than one cell when there is a vertex with an empty adjacency list.
5. Mark edges going "backwards", i.e. from a vertex with the higher degree to a vertex with the lower degree. In case of a tie, compare identifiers of vertices. The degree of a vertex can be calculated quickly by subtracting two subsequent cells of the node array.
6. Remove "backward" edges using thrust::remove if routine, which removes marked elements and compacts the edge array preserving the order of elements that are not removed.
7. Transform the edge array from an array of structures to a structure of arrays. We call this step unzipping.
8. Recalculate the node array.
Counting Phase
Triangles are counted with the following kernel performing the two pointers merge algorithm. The i-th thread deals with edges which index in the edge array modulo the number of threads equals i. For each such edge the thread sequentially computes the size of the intersection of the adjacency lists of the both ends of this edge.
__global_ _ void C o u n t T r i a n g l e s ( int m , int * edge , int * node , uint64_t * result ) { int start = blockDim . x * blockIdx . x + threadIdx . x ; int step = gridDim . x * blockDim . After the kernel is done, elements of result array are summed, using thrust::reduce routine, to obtain the total triangle count, and the algorithm terminates.
As usual, a careful choice of the number of threads and blocks to run is crucial to achieve high performance. After tuning these parameters experimentally, we conclude it is optimal to run 64 threads per block and 8 blocks per each multiprocessor. These numbers proved themselves to be optimal, or nearly optimal, for all the graphs we used in our experiments, as well as for both the Tesla C2050 and NVS 5200M devices.
Optimizations
Unzipping Edges.
CountTriangles kernel runs 13% to 32% faster when the edge array is given as a structure of arrays. Conversion from an array of structures is very fast, usually takes less than 20 milliseconds.
Sorting Edges as 64-bit Integers.
thrust::sort runs approximately 5 times faster when the edge array is passed to it as an array of 64-bit integers instead of an array of pairs of 32-bit integers. When using this optimization it is worth to remember that, because of the endianness, it produces a slightly different ordering -edges are ordered by second vertex then, in case of a tie, by first.
Avoiding Unnecessary Reads.
Compare our preliminary version of a while loop in CountTriangles kernel: The preliminary version reads two values form the edge array in every loop execution, while the final version reads only one value when no triangle is found. This difference seems unimportant because these unnecessarily read values are often cached and the preliminary version performs less work in divergent branches. Nevertheless, the final version runs 36% to 48% faster.
Reducing Warp Size.
We can simulate reducing the warp size by doubling the number of threads and making half of the threads within a warp idle. Although our final implementation does not benefit from this method, we find it worth noting since it allowed 30% faster kernel execution at earlier stages of the kernel's development. We believe it is due to the fact that, when a read misses the cache and a thread has to wait for the global memory, all other threads in the warp have to wait as well. Reducing the warp size makes less threads affected by a single cache miss. This effect is especially significant for our algorithm because cache misses happen at different moments for different threads.
CPU Preprocessing for Very Large Graphs.
Sorting in the step 3 of the preprocessing phase requires the largest amount of the global memory of the GPU. If the input graph is too large to fit into the memory in this step, we use a slightly modified version of the preprocessing. First, we use the CPU to calculate vertex degrees and remove backward edges. It runs slower than on the GPU but halves the input size. Then, we can move to GPU to sort and unzip edges and calculate the node array. This optimization allows to process graphs twice larger than without it.
Unsuccessful Optimization Attempts.
We have tried the virtual warp-centric method [10] , collaborative reading to shared memory, and assembler level prefetching. None of these optimizations increase the performance of our implementation, probably due to a high overhead compared to possible gains.
Multi-GPU Setup
We examine a simple extension of our algorithm to a multi-GPU setup. The preprocessing phase is run just on a single GPU, then the edge array and node array are copied to the remaining devices, and each device iterates over its allotted subset of edges. The speedup of this approach is limited by the Amdahl's law. The preprocessing time is roughly proportional to the number of edges, while the counting phase time appears to be proportional to the number of triangles. The fraction of the execution time spent on the preprocessing varies depending on the graph -for graphs that we use in our experiments this fraction ranges from 0.08 to 0.76, which gives the maximum speedup for 4 GPUs between 3.23 and 1.22. This is roughly consistent with our experimental results. The biggest speedups are obtained for Kronecker graphs which have large triangles to edges ratios.
Less trivial approach to the multi-GPU parallelization would probably require splitting the graph into (not necessarily disjoint) subgraphs, which then can be processed independently [6, 17] . However, it is not clear if the obtained speedup would compensate the overhead caused by the splitting phase. We do not cover this issue in this paper, but we find it a viable direction for future research.
Experiments
To evaluate the performance of our implementation we run it on a number of graphs and compare its running time with a baseline single-core CPU implementation. The baseline implementation is our own implementation of the forward algorithm, which is slightly faster than the one provided in [12] .
The graphs we use are: Internet topology graph, LiveJournal online social network, and Orkut online social network from Stanford Large Network Dataset Collection [14] ; Citeseer and DBLP co-paper networks and Kronecker R-MAT graphs from 10th DIMACS Implementation Challenge [2] ; Barabási-Albert network [3] ; Watts-Strogatz network [20] . The Table 1 summarizes basic properties of these graphs.
Experiments are run on the 448-core Nvidia Tesla C2050 GPU and the Intel Xeon X5650 CPU. All binaries are compiled with the -O3 optimization level. We measure wall clock time. We start the measurement just before the edge array is copied from CPU to GPU and finish it right after the final result is copied from GPU to CPU and all GPU memory is freed. Before the measurement we call cudaFree(NULL) to preinitialize CUDA context, because otherwise the first call to cudaMalloc takes additional 100 milliseconds.
Each experiment is run five times and we report the mean values. The standard deviation never exceeds 0.05 of the mean.
The results are presented in the Table 1 . All execution times are given in milliseconds. The first speedup column shows the GPU over CPU speedup, while the second speedup column shows the 4 GPU over 1 GPU speedup. Graphs marked with † are too large to fit into the GPU memory so part of the preprocessing phase is run on the CPU (as described in the Section 3.4.5), which accounts for lower performance in case of these graphs.
Conclusions and Future Work
In the past there were various attempts to count triangles faster than with sequential algorithms. MapReduce approach to the problem [17] has significant overhead, and even for moderately sized graphs the execution time is in the order of minutes. It is beneficial to use it for extremely large graphs, with the number of edges in the order of one billion. Another approach is to use an heuristic approximation algorithm [11, 18] . Such algorithms provide good speedups and usually need little memory, but it comes at the cost of getting only an approximate triangle count, which can differ from the actual count usually by a few percent.
CUDA implementation of the parallel triangle counting algorithm, presented in this paper, achieves 8 to 15 times speedup over our optimized single-core CPU solution. It is capable of finding 3.8 billion triangles in 89 million edges graph in less than 10 seconds on the Nvidia Tesla C2050 GPU. By using 4 graphic cards instead of one we can further speedup computation up to 2.8 times, especially when the number of triangles in the input graph is much larger than the number of edges.
Taking into account speedups achieved by using GPU to solve other graph problems -e.g. 50 times for minimum spanning tree [19] , 10 times for connected components [16] , 12 times for breadth first search [10] , and 5 to 40 times for strongly connected components [4] -our results seem satisfactory.
The first work we can try to directly compare to is [13] . Unfortunately, such comparison is not easy for a number of reasons. First, the source code for the previous approach is not available, so we can compare only to the numbers provided in the paper. Second, the paper solves a slightly different problem, which is computing the clustering coefficient. It requires computing the number of triangles but also the number of two-edge paths in the input graph. Fortunately, the latter part is not harder than the former, so we can assume this gives our algorithm at most two times advantage. Third, in the cited paper, experiments are run on the Nvidia GeForce GTX 480 graphic card, which is very similar to the Nvidia Tesla C2050. Nevertheless, they are different devices. Lastly, the execution times measured during these experiments are presented only on plots, so we can obtain only approximate numerical values. Having said that, our algorithm is 45 (or 20, for 4 graphic cards) times faster than the previous one on the Barabási-Albert network, and 7 (or 3, for 4 graphic cards) times faster on the Watts-Strogatz network. We believe this is largely due to our choice of the forward algorithm.
Another GPU algorithm, proposed in [5] , is evaluated on Nvidia Tesla C1060 which is a slightly slower device than the one we use. However, the author reports running times in the order of 20 seconds for graphs with 2000 nodes, which means our approach is orders of magnitude faster.
The most recent work on the topic [8] proposes much more elaborate algorithm, in which also the adjacency list intersection step is parallelized. The algorithm is evaluated on a number of real world graphs, two of which (Citeseer and DBLP) also appear in our experiments. The evaluation is performed on the Nvidia Tesla K40, which is a bit more powerful than our device. Despite this, our algorithm achieves roughly two times lower execution times for these two graphs.
Our algorithm, despite being very simple, significantly outperforms, to our best knowledge, all triangle counting algorithms for GPU up to date.
We plan on extending our research in two directions which now we will briefly discuss.
First, it would be interesting to check if methods from [6, 17] can be applied, without a too big overhead, to split the graph into subgraphs which can be processed independently. This could give a better multi-GPU solution, and what is more important, this would allow to count triangles in graphs which do not fit into the GPU memory -which is one of the biggest limitations of our current algorithm.
Second, it might be beneficial to use a different counting algorithm for a small subset of vertices with largest degrees. A natural candidate for such algorithm is matrix multiplication [1] .
