Abstract-Triangle counting is a building block for numerous graph applications and given the fact that graphs continue to grow in size, its scalability is important. As such, numerous algorithms have been designed for triangle counting -some of which are compute-bound rather than memory bound. Even for compute-bound algorithms, one of the key challenges is the limited control flow available on the processor. This is inpart due to the high dependency between the control flow, input data, and limited utilization of vector instructions. Not surprising, compilers are not always able to detect these data dependencies and vectorize the algorithms. Using the branchavoiding model we show to remove control flow restrictions by replacing branches with an equivalent set of arithmetic operations. More so, we show how these can be vectorized using Intel's AVX-512 instruction set and that our new vectorized algorithms are 2 × −5× faster than scalar counterparts. We also present a new load balancing method, Logarithmic Radix Binning (LRB) that ensures that threads and the vector data lanes execute a near equal amount of work at any given time. Altogether, our algorithm outperforms several 2017 HPEC Graph Challenge Champions such as the KOKKOS framework and a GPU based algorithm by anywhere from 1.5× and up to 14×.
I. INTRODUCTION
Triangle counting and enumeration is a widely used kernel for numerous applications. These include clustering coefficients, community detection, email spam detection, Jaccard indices, and finding maximal k-trusses. The key building block of triangle counting is adjacency list intersection. Numerous algorithms have been developed for triangle counting and these encapsulate a wide range of programming models: vertex centric, edge centric, gather-apply-scatter (GAS), and linear algebra. Some approaches require the adjacency arrays to be sorted whereas other approaches do not. Most previous algorithms focus on scalar execution as vectorizing these algorithms is typically challenging. In this paper, we show several news algorithm for vectorizing the computational kernels of triangle counting.
We also propose a new load-balancing technique, which we call Logarithmic Radix Binning (LRB), that ensures that all the threads and vector units get a near equal amount of work. LRB improves on previous techniques which only focus on thread level load-balancing and shows how to vectorize at the vector lane granularity. Thus, vector unit utilization.
In this paper, we present several new algorithms for triangle counting. These algorithm uses techniques developed for the branch-avoiding model discussed in Green et al. [9] , [7] . Specifically, Green et al. show that the cost of branch mis-prediction is high for data dependent applications (such as graph algorithms) and that these applications can typically be implemented in an alternative manner that avoids branches entirely. This eliminates mis-prediction and makes execution times more consistent, at the cost of additional operations.
Algorithmic Contributions
• We develop a two-tiered binning mechanism for triangle counting. In the first tier, for each edge we decide on an intersection method that will be applied to that edge. Our current implementation uses two different kernels (though it can be extended to more). Therefore, this tier consist of two bins. For the second tier, we present Logarithmic Radix Binning (LRB). For each intersection kernel, a 2D set of bins is maintained. Each of these bin stores edges with similar computational properties. Thus, our vectorized triangle counting algorithm can grab K edges, where K is the vector width, with similar computational requirements allowing for good load-balancing and utilization at the vector lane granularity. In practice, this offers good scalability.
• We show how to increase the number of control flows in software. For a multi-threaded processor with P physical hardware threads and vector instructions K elements wide, we show how to execute up to P · K concurrent software threads. Where each of these threads is executing a different intersections. Thus, our new algorithm increases the control flow on the processor by a factor of K. For the Intel Knights Landing processor (discussed in Sec. V) used in our experiments, P = 272 and K = 16, allows us to support up to 4352 concurrent software threads.
• We show two novel vectorized triangle counting algorithms: 1) sorted set intersection, and 2) binary search. The first of these approaches finds common neighbors by scanning across the two sorted adjacency arrays being intersected, similar to a merge. The second approach finds common neighbors by searching for elements of one array in the other. We also show a runtime mechanism for deciding which algorithm should be selected The LRB method is then used to ensure good execution.
• LRB is architecture agnostic and has been extended to the GPU [5] . On the GPU, different bins are executed using a different number of threads, thus ensuring good loadbalancing and trading off various overheads.
Performance Contributions
• We compare our new algorithm against several high performing triangle counting algorithms, including [23] , [31] , [2] , [27] . Several of these were the fastest triangle counting algorithms in the 2017 HPEC Graph Challenge [24] . Our algorithm is faster than all algorithms for a but small number of instances across a wide range of test graphs. This includes outperforming KOKKOS [31] by an average of 4× and as much as 10× and [2] by an average of 2× and as much as 6×.
• From a scalability perspective, we show that our algorithm scales to large thread counts. This highlights the fact LRB is frequently able to give each thread a near equal amount of work and ensure good workload balance.
• Our new vectorized algorithms offers 2×-5× performance speedup over their scalar counterparts (which also use the LRB load balancing mechanism).
II. RELATED WORK
The applications in which triangle counting and triangle listing are used is broad. It became an important metric to data scientists with the introduction of clustering coefficients [30] . Other applications for triangle counting are: finding transitivity [18] , spam detection in email networks [1] , finding tightly knit communities [21] , finding k-trusses [4] , [28] , [10] , and evaluating the quality of different community detection algorithms [17] , [32] . An extended discussion of triangle counting applications can also be found in [3] . Triangle counting was also a key kernel for the HPEC Graph Challenge [22] . a) Computational Approaches: Given a graph, G = (V, E), where V are the vertices and E are the edges in the graph, the three simplest and mostly widely used approaches for counting triangles in a graph [26] ; enumerating over all node triplets O(|V | 3 ), using linear algebra operations O(|V | w ) (where w < 2.376), and adjacency list intersection. Adjacency list intersection can be completed in multiple ways: sorted set intersections, hash tables, or binary searches to look up values. The time complexity of each of these approaches is data dependent, yet upper bounds can be given. Triangle can also be completed using the Gather-ApplyScatter (GAS) programming models [6] , [14] . In this work we focus on the adjacency intersection approaches.
b) Algorithmic Optimization: Numerous computational optimizations can be applied to triangle counting algorithms for static graphs to help reduce the overall execution time. For example, Green & Bader [8] present a combinatorial optimization that reduces the number of necessary intersections-offering a better complexity bound. Green et al. [12] show a scalable technique for load-balancing the triangle counting on shared-memory systems. Shun & Tangwongsan [23] , Polak [20] , and Pearce [19] show how to reduce the computational requirements by finding triangles in a directed graph rather than the undirected graph. Leist et al. [15] , Green et al. [13] , Wang et al. [29] , and Fox et al. [5] show several different strategies for implementing triangle counting on the GPU. c) Vector Instruction Sets: Vector instructions have been an integral part of commodity processors in the last twenty years, though the history of Single Instruction Multiple Data (SIMD) programs goes back even further. In SIMD instruction, each datum is placed in a separate lane and each lane executes the same instruction. Intel's AVX-512 ISA can operate on vectors of 512 bits. The AVX-512 instruction set has numerous conditional instructions referred to as masks in the AVX-512 ISA. Our algorithm makes extensive use of these instructions.
III. LOGARITHM RADIX BINNING
In this section, we present Logarithm Radix Binning (LRB for short), a method that effectively load balances the intersections across the thread, for both intersection algorithms. This method is efficient and works well for both the scalar and vectorized algorithms. LRB works by placing edges into bins based on the logarithmic value of the estimated amount of work for that edge. For triangle counting, we initially group the edges into two unique bins, one bin for the sorted set intersection and one for the binary search. We will then apply LRB to each to the edges and distribute them over a 2D grid of bins.
a) Initial binning: The intersection of the adjacency arrays can be done in multiple ways. Two popular approaches are 1) sorted set intersections and 2) using binary search. In the sorted set intersection, common elements are found by moving across the two sort adjacency arrays using a mergelike access pattern. Sorted set intersection performs well when the two adjacency arrays are of near equal lengthwe call this a balanced intersection. However, when one adjacency array is extremely large and the other is small, or the intersection is imbalanced, binary search is more efficient. For binary search, each element in the smaller array is looked up in the larger of the two arrays.
To determine which bin to place an edge (u, v) ∈ E into, we use the following estimations:
Intersecting (u, v) and (v, u) will find the same triangles, as such only one of these is necessary. For simplicity and performance reasons, we choose the edge (u, v) such that b) Finer grain binning: Fig. 1 depicts an edge list with the estimated amount of work for that edge. For each edge, the method with the minimal amount of estimated work is selected using Eq. 1 and Eq. 2. The yellow boxes denote edges that use the sorted set intersection approach and the blue boxes represent edges that use the binary search. The second row represents an edge ordering where the edges are placed into two bins-one for each approach. For the vectorized algorithm, this can lead to significant workload Algorithm 2: Branch-avoiding with conditional instructions. Variants of the branch-based and branch-avoiding algorithms can be found in [7] .
Algorithm 3: Vectorized sorted intersection kernel g) Storage Complexity Analysis: We use CSR (compressed sparse row) to represent the original graph and assume that the adjacency arrays are sorted. For triangle counting, CSR requires two arrays, one for the offsets of size O(|V |) and one for the indices (edges) which is of size O(|E|). The binning technique used by LRB stores the edges in a different order. We used an array of size O(|E|) to store these edges. While this new edge list does not increase the theoretical upper-bound; from a practical perspective, it does double the memory consumption. The edges in the new reordered edge-lists determine the order in which the edges will be intersected. However, the intersection process itself uses the sorted adjacency arrays in the CSR graph.
IV. VECTORIZED ALGORITHMS
In this section we present our new branch-avoiding and vectorized triangle counting algorithms. Alg. 2 depicts the sorted list intersection using the branch-avoiding programming model [9] , [7] . Note, the control flow for the branchavoiding algorithms is largely independent of input valuesthis is a key enabling factor for our vectorized approach. See Green et al. [9] , [7] for additional discussion on the branchavoiding programming model. Alg. 2 depicts a branch-avoiding algorithm for list intersection-this algorithm uses conditional instructions. Such instructions do not always exist in all architectures and are in fact designed for a single control flow systems where a single instruction is executed based on hardware flags (zeroflag, carry-flag, and overflow-flag). As such, a naïve vector implementation might be constrained to a single set of these flags or would a require a single control flow. We show how to overcome this hardware constraint using the AVX-512 instruction set. Specifically we show 1) how to increase the number of software control flows and 2) how to control the execution of each lane using masks (even though we do not have enough hardware flags).
Our experience with conditional instructions is that the compiler is not able to figure out how to use them. We strongly differentiate between compare and branch instructions. Compare instructions are used for evaluating different values whereas a branch typically uses compare output to decide on the next sequence of executable instructions. a) Vectorized Intersections: We started off by describing how to increase the control flow. The vectorized triangle counting can be implemented in a variety of ways using the branch avoiding model. In the first approach the different lanes work together on the same intersection (consisting of two arrays). This approach was shown to be effective on the GPU's SIMT programming model [11] , [13] ; however, this approach is significantly more challenging to implement for vector instructions. In the second approach, each lane in the vector unit is responsible for a different intersection. Thus, each vector unit requires 2 · K different adjacency arrays for K different intersections.
We choose the second of the approaches-each lane is responsible for a different intersection. This also removes the overhead of the partitioning scheme found in [13] . Thus, the maximal number of concurrent intersections (software threads) that can be executed on a system is Concurrent = P · K.
The branch-avoiding algorithm found in Alg. 2 depicts an initial "recipe" for implementing a vectorized triangle counting algorithm. Note that the number of data dependent branches has been significantly reduced, yet there still remains one condition in the control flow that is data dependent-the WHILE loop's condition which checks bounds of the two respective arrays. To vectorize the algorithm, this condition also needs to be vectorized and this is by no means trivial as this condition is responsible for the entire WHILE loop. The vectorized version of the algorithm ensures that certain data lanes will be ignored if the bounds of the indices for that lane are exceeded-each of the K lanes is responsible for managing its own bounds. Similar restrictions exist for the binary search based intersection (Alg. 4). Alg. 3 and Alg. 4 depict the vectorized code for the sorted set intersection approach and for the binary search approach, respectively. These algorithms show close-to-real vector code instructions (using Intel's AVX-512 instructions set) rather than pseudo-code. This allows highlighting:
• The vectorized algorithms require gathering the elements for K intersections (using 2 · K arrays) instead of just two arrays for a scalar execution. The introduction of efficient gather instructions has greatly simplified the process of collecting elements from random locations in memory.
• The AVX-512 instruction set introduces masked-vector instructions. These masks enable operating on a subset of the vectors lanes and updating the counters for each lane. Masked instructions are not conditional instructions. Specifically, the masked instructions are always executed across all the lanes; however, some data lanes might not be updated based on the value of the mask. Another key difference is that conditional instructions were designed for a single control flow (one per thread) whereas the masked operations allow a vector-wide control flow (for multiple control flows). This distinction is the reason that a single conditional instruction is replaced with two masked instructions. For example, the CADDEQ operations is replaced with vector CMPEQ instruction followed by a masked add vector operation. While this obviously incurs a performance penalty, it also enables increasing the scalability of the algorithm across the vector.
V. PERFORMANCE ANALYSIS a) Experiment System:
The experiments presented in this paper are primarily executed on an Intel Xeon Phi 7250 processor with 96GB of DRAM memory (102.4 GB/s peak bandwidth). This processor is part of the Xeon Knights Landing series of processors. In addition to the main memory, the processor has an additional 16GB of MCDRAM high bandwidth memory (400 GB/s peak bandwidth) which is used used as our primary memory -if the graph fits into main memory the lower latency DRAM memory is not utilized. The Intel Xeon Phi 7250 has 68 cores with 272 threads (4-way SMP). These cores run at a 1.3 GHz clock and share a 32MB L2 cache. Given these system parameters and using our new algorithms, we are able to execute up to 4352 b) Inputs: The algorithms are tested using real world graphs and networks taken from SNAP [16] and the HPEC Graph Challenge [24] , Table I . By default, all graphs are treated as undirected. Directed graphs are transposed and duplicate edges created in this phase are removed. Our algorithm can also utilize the optimization of finding triangles in a directed graph (where only half the edges exist). This concept is used in [23] , [20] , [19] and is referred to as the DOD graph in [19] -which is the terminology we use in this paper.
c) LRB Analysis: Our algorithm implementation incorporates multiple optimizations. To capture the benefits of each of these optimization, we execute our algorithm with several different optimizations. Table II describes the various optimizations we use. Fig. 3 depicts various performance characteristics of our new algorithms and the various optimizations for the socLiveJournal1 graph -similar results were seen for other graphs. Note the abscissa is log scale for all the sub-figures. Fig. 3 (a) depicts the execution time as a function of the number of threads and Fig. 3 (b) depicts the speedup for each of these configurations in comparison with a sequential execution of a specific algorithm. For all these configurations, the parallel scalability is near linear all the way up to 68 threads which is the number of physical cores on the KNL system used in our experiments. While there is some performance improvement beyond 68 threads, the scaling it is not linear. This is a well known artifact of multiple threads per core when resources are shared. Yet, it also shows that LRB is successful as a load-balancing mechanism. Fig. 3 (c) highlights the contributions of the different optimizations of our algorithm. For each thread count, all algorithms are normalized against the "Mixed edge-list" implementation for a given thread count. The typical speedup of going from the scalar execution to the vectorized execution increases performance by roughly 2.5× for both the regular graph as well the the DOD graph. For other graphs, the vectorization increased performance by as much as 5×. Applying all these optimizations together greatly improves performance over an already optimized algorithm (that selects an ideal intersection kernel for each edge). Specifically, for soc-LiveJournal this improves performance by an average 1 We note that parallelism may be limited in practice by the number of vector units. To the best of our knowledge 4 threads (single core) share 2 VPUs [25] . KOKKOS, a SpMV based implementation that uses vector instructions, HPEC Graph Challenge Champion by an average of 2.5×. Our new algorithm is also upto 4× faster than the fastest algorithm for the GPU (running an NVIDIA P100 GPU). There are numerous instances where our new algorithm is also 5 × −10× faster than these algorithm.
ACKNOWLEDGMENTS
Funding was provided in part by the Defense Advanced Research Projects Agency (DARPA) under Contract Number FA8750-17-C-0086. This work was partially funded by the Doctoral Studies Program at Sandia National Laboratories. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. The content of the information in this document does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.
