This paper quantifies the impact of branches and branch mispredictions on the single-core performance of certain graph problems, specifically for computing connected components. We show that branch mispredictions are costly and can reduce performance by as much as 30%-50%. This insight suggests that one should seek graph algorithms and implementations that avoid branches.
INTRODUCTION
This paper concerns computations on a graph G = (V, E), where V is a set of vertices and E = {(u, v) | u, v ∈ V} is a set of edges. Traditionally, the key challenges associated with creating high-performance graph implementations include computational demand, irregular memory access, difficulty of load balancing, storage, and optimization criteria that cause the problem to be intractable, among others. In this work, we consider an additional challenge, which is critical to practical implementation but largely unstudied: branch prediction, which is an important factor in single-core performance on essentially all modern multicore and emerging manycore processors. We show subtle and sometimes unexpected performance phenomena that suggest incorrectly predicted branches can reduce single-core efficiency. These observations suggest that a simple algorithmic redesign, in which branches are avoided, can improve performance and can even make it more consistent.
We study branches because exploiting instruction-level parallelism is critical to achieving high single-core throughput, which is the building block for all higher levels of parallelization, such as shared memory or distributed memory parallelism. The presence of a conditional branch interrupts the flow of instructions; if it is not known whether the branch will be taken, the processor cannot know which instruction to fetch next, creating stalls in the processor pipeline. To address this problem, a modern processor core tracks the history of a branch, and uses this state to speculatively fetch the next instruction in what it estimates is the most likely outcome. If it guesses incorrectly, any speculatively executed instructions must be cancelled, causing slowdowns in time and potential reductions in energy-efficiency.
We have analyzed two different graph algorithms with respect to their branching behavior: connected components, based on the classic Shiloach-Vishkin (SV) algorithm [46], and the classical form of breadth-first search (BFS) [18], sometimes referred to as the "top-down" algorithm [8]. SV is a propagation-based algorithm and BFS is a shortest-path algorithm. Our results can in principle be extended to other algorithms in both families, including all-pairs shortest-paths, betweenness centrality, and depth-first search, among numerous others [10, 24, 26, 32, 50]. This paper focuses primarily on our findings for connected components. Our complete set of results, which includes BFS, appears in an accompanying technical report [28]. We occasionally summarize and allude to the BFS results herein as needed.
Our analysis quantifies the effect of branch mispredictions, both analytically and empirically. Our empirical studies rely on our own highly-tuned assembly language implementations of the target algorithms. We show that SV, which performs an equal amount of work in every iteration, suffers a performance penalty in its early iterations due in part to an increase in the number of branch mispredictions, which are also called branch misses. In SV's later iterations, when the branch prediction accuracy increases, the performance increases as well. This observation motivates a branch-avoiding algorithm that reduces the number of branches and branch mispredictions that the algorithm incurs. This change yields overall speedups over the highly tuned branch-based assembly implementation. The variations in per-iteration performance and number of executed instructions of SV essentially go away in the branch-avoiding version as well, bringing with it more consistent and predictable performance.
BFS also exhibits branch mispredictions, and we develop a branch-avoiding algorithm for it, too. However, our specific algorithm increases the number of store operations by more than an order of magnitude. Consequently, there is no performance win for BFS [28]. Nevertheless, taken together we believe these two cases, SV and BFS, raise a number of intriguing new questions: what role branch-avoidance should play in algorithm design, whether compilers can produce our hand-generated transformations, and whether additional architectural support could exploit the branching behavior we observe and mitigate cases of performance loss.
RELATED WORK
Our work focuses on connected components (CC) and breadth-first search (BFS), in part because they are primitive building blocks of higher-level graph analytics. Such analytics include connected components itself [38, 45], computing modularity [42], detecting communities [42, 43], partitioning graphs [33], computing clustering coefficients [51], computing betweenness centrality [10, 26, 29], and computing closeness centrality [44], as well as computing a wide variety of distance-based analytics. A variety of packages implement these analytics, including STINGER [4, 22], GraphCT [1, 21], Ligra [47], Pregel [37], and the Combinatorial BLAS [13]. However, these packages focus on shared memory multicore, manycore, distributed memory parallelism [9, 12, 15, 30, 54], and massively multithreaded systems [5, 7]. Thus, our study of low-level single-core behavior and instruction-level parallelism complements this other work, and should also apply broadly thereto.
Branch predictors.
Much of the prior work on branch predictors has focused on their design and implementation in hardware; see Smith's survey of strategies [48], among other seminal references [20, 34, 35, 49, 52, 53]. Little is known publicly about the actual implementation of the branch predictors in modern processors, since these are vendor-specific and proprietary. As such, there is some ongoing empirical research that tries to demystify these implementations using synthetic benchmarks [25, 40]. However, with few exceptions, most of the other work on branch prediction evaluates against general benchmark suites, such as SPECint2006 and SPECfp2006 (see https://www.spec.org/benchmarks.html). Therefore, these studies do not provide the additional level of understanding possible with a focus on more specific and application-oriented kernels, as in our study.
It has been shown that, when merging two sorted arrays, branch mispredictions can increase the total execution time by up to 5× [27]. These results are relevant both to sorting algorithms and to triangle-counting in graphs, the latter of which is a building block for clustering coefficients when list intersection is used for finding the triangles [51]. Thus, these studies corroborate one another with respect to the effects of branch misprediction.
Performance engineering of graph computations.
There is some work on low-level performance engineering of graph computations. Green-Marl is a domain-specific language that targets shared-memory platforms [31]. It emits back-end code that manages shared variables using, for instance, atomic instructions; from published code samples, its implementations are branch-based. Cong and Makarychev describe techniques to implement graph algorithms that are more cache-friendly [17]. They also show how to use software prefetching to improve spatial locality on IBM Power7 and Sun Niagara2 platforms. Both platforms support multiple threads per core, which can help in memory latency hiding.
For BFS specifically, there are additional studies. Chhugani et al. present a shared-memory parallel BFS [16]. They focus on reducing cross-socket communication, and use lock-free techniques. Merrill and Garland have developed a highly-tuned GPU implementation [39]. Beamer et al. have proposed algorithmic changes, which they refer to as being direction-optimizing [8]. None of these studies considers the impact of branching per se, and so they largely complement our study.
Graph property characterizations.
Many researchers have characterized high-level properties of real-world graphs, like the existence of power-law degree distributions and small-world algorithmic effects [3, 6, 11, 23, 36, 41, 51]. Our analysis is justified in part by some of these findings, such as the existence of a large connected component [11], which has implications for how our target graph computations will behave.
At a lower level, Burtscher et al. develop metrics to quantify irregularity, with respect to both memory accesses and control flow [14]. They use these metrics to compare different computations, including graph computations, confirming some aspects of conventional wisdom about what we consider "regular" versus "irregular." However, it is not clear (to us) how to translate these metrics into actionable transformations of code that improve performance.
BRANCH PREDICTION
Given a particular (static) conditional branch in a graph algorithm, our analysis goal is to estimate how many times the branch predictor will mispredict it. We base our analysis on a simple 2-bit branch predictor [48]. The empirical evaluation of § 5 will justify this choice. Like most branch prediction techniques, it uses the history of previous executions of a given branch to predict the next outcome; as such, one may formalize the analysis of predictors mathematically using Markov chains and reason about expected branch misses, which we have done. However, for reasons of readability and space, this paper omits the details of such analysis, instead stating the key results and offering more intuitive high-level explanations.
Figure 1: A 2-bit branch predictor behaves as shown in this finite-state automaton. Each node is a state representing the next prediction, e.g., the strongly and weakly taken states predict "taken," the others, "not taken." Each edge shows how the state changes once the actual branch condition is resolved.
A model of 2-bit predictors
For each (static) conditional branch in the program, a 2-bit predictor maintains a 2-bit state value, which encodes four possible states. Each state value is a prediction for the next occurrence of this branch; once the true branch condition is known, this state is updated. The precise states and transitions appear in the finite-state automaton (FSA) of figure 1. In particular, there are four possible states, named Strongly-Taken, Weakly-Taken, Weakly-Not-Taken, and Strongly-Not-Taken. The "strong" states reflect that the last few branches were all the same, i.e., all "taken" or all "not taken," and so it is likely the next branch will be the same. The weak states allow for the predictor's bias to change if a new pattern emerges.
We will further assume that the processor has enough branch state storage to track, for each conditional branch of interest, its 2-bit state for the duration of the program. That is, we will not consider the case when the processor runs out of branch state storage and must "evict" (and therefore lose or reset) the branch state. Our target programs are sufficiently compact that this assumption is reasonable.
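The model above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that the FSA of figure 1 is the conventional saturating counter, which is consistent with the transitions used in the proofs of this section (three taken outcomes lead from Strongly-Not-Taken to Strongly-Taken, and one not-taken outcome leads from Strongly-Taken to Weakly-Taken). The integer encoding of the states and the function names are hypothetical.

```python
# Hypothetical encoding of the four states as a saturating counter 0..3.
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = range(4)


def predict(state):
    """Return True if the predictor guesses 'taken' in this state."""
    return state >= WEAKLY_TAKEN


def update(state, taken):
    """Move one step toward the observed outcome, saturating at the ends."""
    if taken:
        return min(state + 1, STRONGLY_TAKEN)
    return max(state - 1, STRONGLY_NOT_TAKEN)


def count_misses(outcomes, state):
    """Feed a sequence of branch outcomes (True = taken) to the predictor.

    Returns (number of mispredictions, final state).
    """
    misses = 0
    for taken in outcomes:
        if predict(state) != taken:
            misses += 1
        state = update(state, taken)
    return misses, state
```

For example, `count_misses([True, True, True], STRONGLY_NOT_TAKEN)` walks through both not-taken states before the prediction flips, so it reports two mispredictions and ends in the Strongly-Taken state.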
Algorithm 1: A simple sequential while-loop, which executes its body exactly n times
Analysis of simple loops
A common programming pattern in graph algorithms is a simple sequential loop. Such loops iterate over, for example, the set of vertices, edges, or neighbors of a vertex (the adjacency list).
Consider, for example, the simple sequential while-loop of algorithm 1. By "simple," we mean that (a) the iteration variable i increases monotonically by 1 at each iteration; (b) the loop bound n is constant as the loop executes; and (c) there are no early exits. Thus, this loop executes its body exactly n times. The conditional branch in this case depends on the condition, i < n. We will assume the convention, for this loop, that the branch is taken when the condition is true, and not taken when the condition is false (footnote 6). There will be exactly n + 1 evaluations of this branch, only the last of which is not taken, namely, when exiting the loop (footnote 7). We can state a number of facts about such loops, assuming the 2-bit branch predictor.
Lemma 1. When n ≥ 3, the final state of the 2-bit predictor is Weakly-Taken.
Proof. The conditional branch is taken n times. In the worst case, we begin the loop in the Strongly-Not-Taken state. According to the FSA of figure 1, after three taken state transitions, the predictor will be in the Strongly-Taken state. Since the final branch is not taken, the predictor must move into the Weakly-Taken state.
Lemma 2. When n ≥ 3, the maximum number of branch mispredictions incurred by the loop's conditional test (ignoring conditional branches in the body) is 3.
Proof. As with lemma 1, the initial state of the predictor may be Strongly-Not-Taken, which will cause 2 mispredictions before reaching either of the Taken states. For the last evaluation of the branch, when i = n ≥ 3, the predictor will be in the Strongly-Taken state but the branch will not be taken, incurring one more branch miss. Thus, there could be up to 3 misses. Furthermore, there must be at least 1 branch miss, which occurs on the last (not taken) branch; the reason is that the predictor must be in the Strongly-Taken state by iteration i = n − 1, independent of the initial state.
Lemma 3. Suppose we execute the same loop k ≥ 2 times, where n ≥ 3 on the first execution, and n ≥ 1 on every subsequent execution. An example is a nested loop, where k designates the outer-loop iteration count and n the inner-loop count. Then there may be up to k + 2 mispredictions for the inner loop: that is, up to 3 misses during the first execution and 1 additional miss on each of the k − 1 remaining executions.
Proof. Based on lemma 1 the branch predictor is in the Weakly-Taken state at the end of the first execution of the loop and may see up to 3 mispredictions. This state becomes the initial state for the next execution. If n ≥ 1 on every execution after the first, then the predictor will move to the Strongly-Taken state; on the last iteration, it will return to the Weakly-Taken state, incurring 1 misprediction. That is, we will bounce back-and-forth between Strongly-Taken and Weakly-Taken.
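The bounds in lemmas 1 through 3 can be checked numerically with a small simulation. The sketch below is illustrative only and assumes the 2-bit predictor is a saturating counter with states 0 (Strongly-Not-Taken) through 3 (Strongly-Taken); the helper names and the example trip counts are hypothetical.

```python
def run_branch(outcomes, state):
    """Simulate a 2-bit saturating counter (states 0..3, >= 2 predicts taken).

    Returns (mispredictions, final_state).
    """
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses, state


def loop_outcomes(n):
    """Outcomes of the test i < n: taken n times, then not taken once."""
    return [True] * n + [False]


def repeated_loop_misses(trip_counts, state):
    """Total misses when the same loop branch runs once per trip count."""
    total = 0
    for n in trip_counts:
        misses, state = run_branch(loop_outcomes(n), state)
        total += misses
    return total, state


# Lemmas 1 and 2: with n = 5, try every initial state and record
# (misses, final state) for each.
single = [run_branch(loop_outcomes(5), s) for s in range(4)]

# Lemma 3: k = 4 executions, n >= 3 on the first and n >= 1 afterwards.
# Take the worst case over all initial states.
worst = max(repeated_loop_misses([3, 1, 2, 7], s)[0] for s in range(4))
```

Under these assumptions, every entry of `single` ends in state 2 (Weakly-Taken) with between 1 and 3 misses, and `worst` equals k + 2 = 6, matching the stated bounds.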
Lemma 4. Suppose n = 0. Then the predictor will move toward the Strongly-Not-Taken state and cannot be in the Strongly-Taken state; furthermore, it will incur either 0 or 1 branch misses.
Lemma 5. Suppose n = 1. Then the predictor will return to its initial state, incurring either 1 or 2 branch misses.
Lemma 6. Suppose n = 2. Then the branch predictor must end in either the Weakly-Taken or Weakly-Not-Taken states, and will incur between 1 and 3 branch misses.
Footnote 6. This choice is arbitrary and depends on the specific code generated. There is an equivalent argument if one assumes code such that the branch is taken only when the condition is false.
Footnote 7. There is an additional branch at the bottom of the loop. However, this branch is unconditional, since it must jump back to the top of the loop.
Algorithm 3: Branch-avoiding Shiloach-Vishkin algorithm for finding connected components.
CONNECTED COMPONENTS
For the problem of finding connected components, we assume the Shiloach and Vishkin (SV) algorithm [46]. It has been implemented on numerous multiprocessor systems, including the massively threaded Cray XMT [1, 21] and a variety of x86 systems [38].
SV is based on a propagation technique, and its pseudocode appears in algorithm 2. It maintains for each vertex v a component label, CC_id[v], and updates this label to place adjacent vertices into the same connected component. Initially, each vertex v is placed into a connected component by itself, which by convention has a label equal to the vertex number. As such, there are a total of |V| connected components at this stage. In the first iteration, each vertex v compares its own label with that of each of its neighbors, u ∈ adj(v). Again by convention, the vertex replaces its own label with the minimum label among itself and its neighbors. The algorithm is iterative and stops when no further label changes occur, which it tracks with a flag.
Each iteration requires O(|V| + |E|) computations, since the algorithm accesses all vertices and their respective adjacencies. The maximal length of propagation is limited by the graph diameter d. As such, the total time complexity of the algorithm is O(d · (|V| + |E|)). Relative to algorithm 2, there is a shortcut that can reduce the number of iterations to d/2 [46]. However, this shortcut does not change our analysis, and we do not consider it further. Conceptually, the component labels propagate as figure 2 depicts. Initially (a), four of the components have minimal labels locally; these labels propagate gradually, and the label of a given node may change several times, (b)-(e), possibly even within the same iteration. Eventually, the algorithm reaches a final state (e) where, for a connected graph, there will be a single connected component.
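As a concrete reading of the description above, the following Python sketch implements the label-propagation scheme with an explicit comparison branch. It is not the listing of algorithm 2, which is not reproduced in this text, and it omits the d/2 shortcut. The adjacency-list input format and the function name are assumptions made for illustration.

```python
def sv_branch_based(adj):
    """Label-propagation connected components, following the prose above.

    adj is a hypothetical adjacency-list input: adj[v] lists the neighbors
    of vertex v, with every undirected edge stored in both directions.
    Returns (labels, number of passes over the graph).
    """
    cc = list(range(len(adj)))      # each vertex starts in its own component
    passes = 0
    change = True
    while change:                   # termination test on the change flag
        change = False
        passes += 1
        for v in range(len(adj)):   # loop over all vertices
            for u in adj[v]:        # loop over the neighbors of v
                if cc[u] < cc[v]:   # data-dependent label comparison
                    cc[v] = cc[u]
                    change = True
    return cc, passes


# A path 0-1-2 plus a separate edge 3-4.
graph = [[1], [0, 2], [1], [4], [3]]
labels, passes = sv_branch_based(graph)
```

On this small input the labels settle to `[0, 0, 0, 3, 3]` after two passes; the second pass makes no changes and only confirms termination. Because labels are updated in place, a label can move more than one hop within a single pass.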
Branch (Mis)predictions in SV
The standard version of the SV algorithm (algorithm 2) has four static conditional branches. To analyze the branch mispredictions, we assume the 2-bit branch predictor model of § 3.
The first conditional branch is the termination test of the while statement. This condition is evaluated d + 1 times, where d is the diameter of the graph. Per § 3, assuming d ≥ 3, it should incur at most 3 mispredictions, ignoring mispredictions in the body of the loop.
Next, consider the two conditional branches associated with the two for-loops. The first for-loop iterates over all vertices; the second for-loop iterates over all neighbors of each vertex, thereby effectively visiting all edges. From the facts of § 3.2, the first for-loop will incur up to 3 branch misses in total, assuming sufficiently large |V|. The second for-loop is an instance of a repeated loop (see lemma 3), which is executed |V| times. Though the exact behavior of the inner loop depends on the degree distribution, we can estimate the misses by applying corollary 1, which implies approximately |V| branch misses.
Finally, the if-statement is the hardest to analyze offline. The actual number of branch mispredictions will depend on the input graph. To get a qualitative idea of what to expect, consider the example in figure 2. In the first iterations, vertices are likely to "swap" their connected components multiple times, which complicates branch prediction as there may not be a regular pattern. As iterations proceed, labels begin to stabilize, making this condition more predictable. Thus, we should expect to see many mispredictions initially, gradually decreasing as iterations proceed.
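To make this qualitative expectation concrete, the sketch below profiles the label comparison on a toy graph. It is an illustration under stated assumptions, not a measurement: the predictor is modeled as a 2-bit saturating counter starting in Strongly-Not-Taken, "taken" is taken to mean that the comparison is true, and the graph, function name, and variable names are hypothetical.

```python
def if_branch_profile(adj):
    """Run label propagation and profile the data-dependent comparison.

    For each pass, record how often the comparison cc[u] < cc[v] is true
    and how often a 2-bit saturating counter (states 0..3, >= 2 predicts
    taken) would mispredict it. Returns a list of (taken, misses) pairs.
    """
    cc = list(range(len(adj)))
    state = 0                       # assumed initial state: Strongly-Not-Taken
    profile = []
    change = True
    while change:
        change = False
        taken_count = misses = 0
        for v in range(len(adj)):
            for u in adj[v]:
                taken = cc[u] < cc[v]
                if (state >= 2) != taken:
                    misses += 1
                state = min(state + 1, 3) if taken else max(state - 1, 0)
                if taken:
                    cc[v] = cc[u]
                    change = True
                    taken_count += 1
        profile.append((taken_count, misses))
    return profile


# A 6-vertex path whose vertex numbers are scrambled along the path:
# 3 - 0 - 4 - 1 - 5 - 2.
path = [[3, 4], [4, 5], [5], [0], [0, 1], [1, 2]]
profile = if_branch_profile(path)
```

On this input the modeled misprediction counts per pass are 5, 3, 1, and 0, which follows the trend described above. A toy graph of this size says nothing about magnitudes on real inputs.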
Branch-avoiding SV
Algorithm 3 shows the pseudocode for a branch-avoiding SV algorithm. This algorithm compares the values of the connected component labels; however, it does not branch based on the result of the comparison. Instead, it uses a conditional move that copies the label of u into the variable c_v if and only if the label of u is smaller than the value in c_v. The connected component label of v is held in c_v, which is a register, meaning that the number of writebacks (stores) is |V|. To ensure that the algorithm is correct and that it terminates, the variable change is updated with a bitwise OR of itself and the bitwise XOR of the initial value c_v^init and the updated value c_v. If the label of the current vertex changed, then c_v is not equal to c_v^init and their XOR is non-zero. Accordingly, if any connected component label changed, then change will be non-zero.
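The data flow just described can be sketched in Python as follows. This is only an illustration of the select-and-accumulate pattern, not the authors' assembly code: Python gives no control over machine-level branches, the loops themselves still branch, and the mask-based `select_min` helper is a hypothetical stand-in for a conditional-move instruction.

```python
def select_min(a, b):
    """Return min(a, b) for integers without an if-statement.

    (a < b) is False or True, so mask is 0 or -1 (all ones), which plays
    the role of a conditional move in this sketch.
    """
    mask = -(a < b)
    return b ^ ((a ^ b) & mask)


def sv_branch_avoiding(adj):
    """Label propagation with the comparison branch replaced by a select.

    adj[v] lists the neighbors of v (hypothetical input format).
    """
    cc = list(range(len(adj)))
    change = 1
    while change:
        change = 0
        for v in range(len(adj)):
            c_init = cc[v]
            c_v = c_init                      # kept in a local ("register")
            for u in adj[v]:
                c_v = select_min(cc[u], c_v)  # stands in for a conditional move
            cc[v] = c_v                       # one store per vertex per pass
            change |= c_v ^ c_init            # non-zero iff the label changed
    return cc
```

Calling `sv_branch_avoiding([[1], [0, 2], [1], [4], [3]])` yields `[0, 0, 0, 3, 3]`. The store to `cc[v]` happens unconditionally, once per vertex per pass, which is the trade the text describes: the comparison no longer steers control flow, at the cost of always writing the label back.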
EMPIRICAL RESULTS

Implementation Details
To carry out a carefully controlled experiment that would allow us to isolate the effect of branches, it was ultimately necessary to hand-code the implementations of the SV algorithm in assembly language.
Prior to doing so, we tried several implementations of the SV algorithm for both the x86-64 and ARM architectures. Unfortunately, compilers do not provide explicit control over the use of branches or conditional moves in the generated code, which complicated our analysis. The compilers on our x86-64 systems tended to generate conditional branches when conditional moves could be used. While we found two ways to make these compilers avoid branches, both of them involved unnecessary inefficiencies. The first was to use inline assembly, which allowed manual selection of instructions. However, the compilers generated suboptimal code around the inlined assembly. The second was to force the compiler to use SETcc instructions by storing the result of the comparison into a byte variable followed by extending the bit mask for conditional selection. This approach caused the compiler to generate multiple instructions where a single CMOVcc instruction sufficed. On the ARM system we had the reverse problem: instead of a conditional branch or move, compilers preferred to use conditional store instructions, which impose an especially big performance penalty on our Cortex-A15 platform.
Thus, we resorted to hand-coded x86-64 and ARM assembly. We used an in-house (but open-source) tool called PeachPy [19], which improves productivity over conventional assemblers; the assembly code generated by PeachPy is nearly equivalent to handwritten assembly code.
We performed our experiments on the seven systems shown in table 2, which vary by microarchitecture. On all systems the assembly implementations performed at least as well as the C implementations; typically, the assembly implementations significantly outperformed the C implementations. The algorithms were tested on graphs taken from the DIMACS 10 Graph Challenge [2], detailed in table 1. In each subplot, the total speedup of the branch-avoiding algorithm over the branch-based algorithm is shown as a text annotation (e.g., "1.20×" in the top-left subplot). For several of the iterations, the branch-avoiding algorithm is as much as 30% to 50% faster than the branch-based algorithm. In a handful of cases, specifically on the Bonnell system, the branch-based algorithm is 20% faster than the branch-avoiding algorithm.
Connected Components
Recall that as the connected component labels propagate, fewer vertices change their connected component. This fact makes the branch predictor's job easier. Figures 4 and 5 show the ratio of branches and number of branch mispredictions as a function of the iteration, respectively, which confirms this behavior.
On some systems, such as the Cortex-A15, the branch-avoiding algorithm offers better performance for all iterations of the algorithm over all the graphs. On other systems, the branch-avoiding algorithm offers better performance in the initial iterations and the branch-based algorithm gives better performance in the later iterations. This behavior appears to be both system- and input-graph-dependent. When a performance crossover point exists, it is a single crossover point, before which the branch-avoiding algorithm is faster and after which the branch-based algorithm is faster. The significance of a single crossover point is that it naturally suggests there could be a simple hybrid algorithm to run the right algorithm at the right time.
Figure 4 shows that the branch-based algorithm executes nearly double the number of branches as the branch-avoiding algorithm. For the Intel and AMD systems, the number of branches is constant throughout the iterations, while for the Cortex-A15 system, it is not. For the Intel and AMD systems, the hardware counter returns the number of retired branch instructions, while the ARM system returns the number of dispatched branches. Because dispatched branches include those that are later flushed, the higher misprediction rate in the first iterations also raises the number of dispatched branches.
The branch-based algorithm can potentially have as many as 4× the number of branch mispredictions as that of the branch-avoiding algorithm, as shown in figure 5. In all cases, the branch-avoiding algorithm has fewer branches and branch mispredictions. For most graphs, the ratio between the total number of mispredictions for the two algorithms, indicated by the number at the top-right corner of each subplot, falls within a narrow range across all systems. Figure 6 shows the ratio of the total number of branch mispredictions for the two algorithms versus the lower bound on the number of branch mispredictions. The lower bound is given in § 4 for the 2-bit branch predictor, and is shown in the figure by a black line at y = 1. For most systems, the branch-avoiding algorithm is near the lower bound, while the branch-based algorithm is well above this line. For the Cortex-A15 system, there are three different graphs in which the branch misprediction rate is well above the lower bound; for the auto graph, the branch misprediction rate is 50% above the lower bound. This means that the implemented branch predictor in fact increases the misprediction rate. Both the Bonnell and the Silvermont systems also have a miss rate above the lower bound for several of the graphs. However, these are lower than the miss rate of the Cortex-A15 system. While we are not able to show an upper-bound miss rate for the connected components algorithm, we are able to do so for the classic top-down algorithm for BFS. Those details can be found in our extended technical report [28].
The effects of misprediction
To get an idea of how strongly mispredictions influence performance, we show pairwise correlations among the a priori most likely predictors of execution time: instructions, loads, stores, and, given the subject of this paper, branches and branch mispredictions. Figure 7 shows this data for the branch-based versions of the algorithms. The 6 × 6 grid of subplots shows the correlations among time, instructions, branches, mispredictions, loads, and stores, measured per edge traversal. For example, the (row 1, column 2) subplot in each half is a scatter plot comparing time ("T") on the y-axis with instructions ("I") on the x-axis. The points are color-coded by platform, on the subset of platforms that supported all necessary hardware performance counters. For each (R, C) plot in the upper triangle, the computed correlation coefficient appears in the transposed (C, R) position of the lower triangle.
In the case of SV, mispredictions correlate more strongly with time than do instructions, branches, loads, and stores. Though not a strict proof of cause, this observation is nevertheless somewhat surprising, as it implies mispredictions may be nearly as important as, or even more important than, memory behavior. By contrast, in the case of BFS (not shown), the correlations of stores and mispredictions with time are roughly equal, with stores being slightly more strongly correlated; refer to our technical report for details [28].
CONCLUSIONS
On the one hand, our study is a positive result for the branch-avoiding technique in the case of SV, where mispredictions are more strongly correlated to time than even memory traffic, much to our surprise. This raises the question of whether branch-avoidance might be important in other computations, and whether increased microarchitectural support for predicated instructions might have more significant benefits.
For BFS, our study did not show significant speedups for the branch-avoiding algorithms [28]. Stores are as critical as branch mispredictions, so the tradeoff that reduces branches at the cost of significantly increasing stores cannot pay off. One open question is why the penalty is not larger: although total stores increased by as much as 100×, the actual slowdown was always 2× or less. Indeed, the extra stores are purely "local" in that they should mostly hit in cache, by design of the implementation. Thus, there is potential in the microarchitecture to address whatever resource constraints the additional stores impose, such as buffers for more outstanding operations. We will look into these in future work. We also hope to explore the impact of the branch predictor on additional graph building blocks.
An additional question is how compilers and programming languages could expose the choice between branch-based and branch-avoiding implementations to the programmer. Presently, programmers have no control over this aspect of code generation. In addition, the compilers we used relied on heuristics that, for our graph algorithms, we found to be sub-optimal and inconsistent across architectures. In our view, explicit control could elevate branch behavior from an implementation detail to a part of algorithm design and analysis. 
