Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures this information in a control flow graph (CFG) and performs subgraph matching across various kernel's CFGs to gain insights to an application's resource requirements, based on the shape and traversal of the graph, instruction operations executed and registers allocated, among other information.
INTRODUCTION
Thread divergence, which arises from the difficulty of mapping GPU programs onto SIMD (single instruction, multiple data) hardware units, presents a major challenge for applications that run on accelerated architectures (e.g. CUDA, OpenCL). When optimizing programs, compilers typically operate on the intermediate representation (IR) of a control flow graph, and not on the source code level, with the IR providing an abstraction from machine-specific intrinsics. A control flow graph (CFG) is a directed graph where nodes represent basic blocks of instructions and edges represent control flow paths. Identifying the base structured patterns of a CFG can shed light on the branch divergence problem [23] , where a poorly written GPU kernel can severely impact execution performance. Structured programming consists of base constructs that represent how programs are written [3, 27] . Thus, compilers optimizing code will benefit from knowledge of the structuredness of a CFG, rather than the structuredness of a program.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. Profiling-based approaches toward understanding thread divergence behavior [8, 13, 20, 21, 25] have been inefficient and lack scalability, since a flat profile not only consists of redundant information but also consumes unnecessary disk space per process p. Likewise, hardware performance counters have been prone to inaccuracies when monitoring divergence behavior [16, 17] . This research introduces how a control flow graph can capture sufficient information for a GPU kernel's execution performance, which could enable code transformations by hooking onto a JIT compilation pass and executing the modified, high performing code.
To this end, we present CUDAflow, a scalable toolkit for heterogeneous computing applications. Specifically, CUDAflow provides a new methodology for characterizing CUDA kernels using control flow graphs and instruction operations executed and performs kernel subgraph matching to gain insights to an application's resource requirements. To the knowledge of the authors, this work is a first attempt at employing subgraph matching for revealing thread divergence behavior and generating high performing code.
Contributions described in this paper include:
• Formalize control flow graphs for GPU kernels.
• Perform subgraph matching on various kernel CFG and GPUs.
• Reveal thread divergence behavior based on CFG properties.
The rest of the paper is organized as follows. Section 2 provides background information. Section 3 describes the methodology behind our CUDAflow tool and our implementation approach. Sections 4 and 5 summarizes the findings of our application characterization studies. Section 6 discusses prior work, and Section 7 outlines some directions for future work.
BACKGROUND
The next subsections cover preliminaries for structured programming and the thread divergence problem.
Structured Programming
Structured programming is a paradigm where programs are written using three base constructs: sequences of statements, if-thenelse blocks, and loops. Figure 1 displays the base patterns in a graph representation. Compilers benefit from the structuredness of a CFG, including guaranteed reducibility for better compiler optimizations, ease of decompilation, and reduced performance degradation caused by thread divergence on SIMD units.
The following definitions refer to Figure 1 and will aide in further discussion.
• Path: A path between nodes A and B is an ordered list of adjacent edges and vertices, where the list begins with an out-edge A and ends with an in-edge B. • Condition node: Any node with two or more out-edges.
• Region: A region between two nodes (edges) A and B contains all nodes and edges present on any path from A to B, where nodes are "internal" to a particular region.
• Dominator: A node (edge) P is a dominator of node (edge) Q if every path from the entry node of the CFG that reaches Q has to pass through P .
• Post-dominator: A node (edge) Q is a post-dominator of a node (edge) P if every path from P to the exit node of the CFG has to pass through Q.
-Immediate post-dominator (IPDOM): Parent node of given node in post-dominator tree.
-Dominator (post-dominator) relationships allow construction of a dominator (post-dominator) tree for the CFG.
• Single-entry-single-exit (SESE) region: A region between two nodes (edges) A and B is SESE if the following conditions hold: A dominates B, B post-dominates A, every cycle containing A also contains B and vice versa.
SIMD Thread Divergence
SIMD thread divergence originates from unstructured CFGs, further degrading thread execution performance on SIMD units. Unstructured CFGs are graphs that cannot be completely folded, and will be discussed in Section 3.2.4. For instance, consider the CFG in Figure 2 (left), where P and Q represent unstructured conditional nodes. For the two involved threads, T0 and T1, and for their execution pattern shown in Figure 2 (middle), condition P behaves divergently, where the two threads execute different branch targets. Such divergent execution is handled using a mechanism called a reconvergence stack, where diverged threads resume joint execution only at the IPDOM of a divergent node [9] . In addition, R is computationally expensive and gets executed twice in this mechanism. The structured equivalent of CFG (Fig 2, right) however achieves combined execution of R from both threads, lowering the overall execution time. In the structured version of a CFG, diverging threads at P are reconverged as early as possible, through the introduction of a new IPDOM node Z for P . The transformations require additional predicate variables and branch instructions. In structured CFGs, threads diverged on a condition cannot execute the same node before they reconverge at the IPDOM, which is key reason for reduction in divergent execution.
METHODOLOGY
In this section, our CUDAflow tool is presented whose overall organization is depicted in Figure 3 . The dashed lines represent CUDAflow's interaction with the current nvcc toolchain. Control flow graphs are constructed statically and the program counter is sampled dynamically, collecting counts of executed instructions and corresponding source code locations, among other information.
Parameter Space Characterization
This section describes the different components considered in the process of understanding control flow behavior in heterogeneous application codes.
Kernel Control Flow Graphs
One of the more complex parameters used to characterize SIMD thread divergence is with a control flow graph. DEFINITION 1. A CFG is constructed for each GPU kernel computation and can be represented as a directed graph G = (N, E, s), where (N, E) is a finite directed graph, and a path exists from the ST ART node s ∈ N to every other node. A unique ST OP node is also assumed in the CFG. A node in the graph represents a basic block (a straight line of code without jumps or jump targets), whereas directed edges represent jumps in the control flow.
Each basic block region is incremented with the number of times the node is visited. Upon sampling the program counter, the PC address is referenced internally to determine which basic block region the instruction corresponds to.
. L_41 : / * 04 a0 * / DSETP . LE .AND P0 , PT , | R6 | , + INF , PT ; / * 04 a8 * / @P0 BRA ' ( . L_43 ) ; / * 04 b0 * / LOP32I . OR R5 , R7 , 0 x80000 ; / * 04 b8 * / MOV R4 , R6 ; / * 04 c8 * / BRA ' ( . L_42 ) ;
The above SASS assembly code illustrates how a control flow graph is constructed. Each basic block is labeled in the left margin (e.g. ".L_41"), with branch instructions representing edges that lead to corresponding block regions (e.g. ".L_43," ".L_42"). The PC offsets are listed in hexadecimal between the comments syntax (/ * * /). In other words, ".L_41" represents a node ni, with ".L_43," and ".L_42" as its children. Example control flow graphs for selected SHOC (top) [7] and Rodinia (bottom) [5] GPU benchmarks are displayed in Table 1 , comparing Kepler, Maxwell and Pascal architectures. Section 4 discusses the differences in GPU architectures. To simplify the discussion, we will refer to each GPU variant by its architecture family (Table 2, bottom). Notice how the CFG is slightly different for each architecture, due in part to the architecture layout of the GPU and its compute capability (NVIDIA virtual architecture). Although it may appear as if Maxwell uses fewer nodes for its CFGs, calc_temp actually uses one more node than its Kepler counterpart. Also, note that similarities in structure exist with several CFGs, including csr_scalar and sum_kernel. Part of the goal of this research is to predict the required resources for the application by inferring performance through CFG subgraph matching, with the subgraphs serving as building blocks for more nested and complex GPU kernels. Next, we introduce several metrics that build on this CFG representation.
Transition probability
Transition probabilities represent frequencies of an edge to a vertex, or of branches to code regions, which describes the application in a way that gets misconstrued in a flat profile. A stochastic matrix could also facilitate in eliminating dead code, where states with 0 transition probabilities represent node regions that will never be visited. Kernels employing structures like loops and control flow increase the complexity analysis, and transition probabilities of kernels could assist compilers for code generation purposes. Figure 4 : Transition probability matrices for Pathfinder (dynproc_kernel) application, comparing Kepler (left) and Maxwell (right) versions.
off-diagonal entry is filled with the label of the corresponding edge, or zero if no edge exists [29] . The adjacency matrix describes the transition from Ni to Nj, and is a Markov chain Xt over a finite state space S. If the probability of moving from i to j in one time step is P r(j|i) = mi,j, the adjacency matrix is given by mi,j as the i th row and the j th column element: Since the total transition probability from a state i to all other states must be 1, this matrix is a right stochastic matrix, so that j Pi,j = 1. Figure 4 illustrates transition probability matrices for a kernel from the Pathfinder application (Tab. 1, bottom-right), comparing Kepler (left) and Maxwell (right) versions. The Pascal matrix was the same as the Maxwell version and was omitted for space purposes. The entries of the transition probability matrix were calculated by dividing the number of observed node transitions i to j by the sum of code(M ), shown as the lower triangular elements, for some mi,j. Note that entries of the upper triangular were discarded for convenience since the lower trangular entries represent the node transitions. Although the matrices differ in size, observe that a majority of the transitions take place in the upper-left triangle, with a few transitions in the bottom-right, for all matrices. The task is to match graphs of arbitrary sizes based on its transition probability matrix and instruction operations executed, among other information.
Metrics
The following subsections described the metrics employed in our CUDAflow approach.
Frequent Subgraph Mining
Frequent subgraph mining has been used in a variety of contexts, including protein sequence alignment, document clustering, anomaly detection and social network analysis [4, 12, 14] . For CUDA applications, we consider frequent subgraph mining as a method for characterizing code by identifying the computational patterns it comprises. Each subgraph represents building blocks for large scale kernels with specific resource requirements, where each block will have associated registers, threads, instruction mixes and source file location. DEFINITION 3. Given a pair of graphs G = (V, E) and G = (V , E ), G is a subgraph of G if and only if:
G is also referred to as a supergraph of G.
DEFINITION 5.
A graph G is subgraph isomorphic to a graph G , denoted by G ⊂ G , if and only if there exists a subgraph G of G such that G is isomorphic to G . DEFINITION 6. Given a set of graphs GD (referred as a graph database) and a threshold σ(0 < σ ≤ 1), the support of a graph G, denoted by supG is defined as the fraction of graphs in GD to which G is subgraph isomorphic.
G is frequent if and only if supG ≥ σ.
The frequent subgraph mining problem is given a threshold σ and a graph database GD, finding all frequent subgraphs in GD.
In order to perform subgraph matching, we first scaled the matrices to the same size by taking for graphs G1 and G2 the maximal proper submatrix, constructed by B(Gi) = max(|V1|, |V2|) for a given Gi = min(|V1|, |V2|). The similarities in the shapes of the control flow graphs (Table 1 ) and the activity regions in the transition probability matrices (Fig. 4 ) motivated this approach. In our case, the dense hotspots in the transition matrix should align with its counterpart if the matrices are similar enough. Bilinear interpolation was used to scale the transition matrix before performing the pairwise comparison.
Bilinear Interpolation
Bilinear interpolation is an extension of linear interpolation for functions of two variables (e.g., x and y) on a rectilinear 2D grid [11] . The idea is to perform linear interpolation first in one direction, and then again in the other direction. Interpolation works by using known data to estimate values at unknown points. Although each step is linear in the sampled values and in the position, the interpolation as a whole is not linear but rather quadratic in the sample location. DEFINITION 7 . Suppose that we want to find the value of the unknown function f at the point (x, y). It is assumed that we know the value of f at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), and Q22 = (x2, y2).
To find the value at P , for known values at Q, we first do linear interpolation in the x-direction. This yields
Then interpolate vertically
Unit square: If we choose a coordinate system in which the four points where f is known are (0, 0), (0, 1), (1, 0), and (1, 1), then the interpolation formula simplifies to 
Pairwise Comparison
Once the matrix is interpolated, the affinity scores (S1 and S2 for graphs G 1 and G 2 , respectively) are matched via a similarity measure. which includes the Kronecker product and the Euclidean distance. By definition, sim(Gi, Gj) = 1 when i = j, with the similarity measure placing progressively greater weights on objects that are further apart.
Kronecker product.
The Kronecker product, denoted by ⊗, is an operation on two matrices of arbitrary size resulting in a block matrix. It is a generalization of the outer product (which is denoted by the same symbol) from vectors to matrices, and gives the matrix of the tensor product with respect to a standard choice of basis.
If A is a m×n matrix and B is a p×q matrix, then the Kronecker product A ⊗ B is the mp × nq block matrix:
Euclidean distance.
The Euclidean distance between points p and q is the length of the line segment connecting them (pq). If p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance (d) from p to q, or from q to p is given by the Pythagorean formula:
CFG Folding
This section discusses the steps involved in folding a control flow graph to determine whether the graph is structured or unstructured. The pseudocode is described in Algorithm 1 and detects for base structured patterns (sequence, selection, loop) defined in Section 
if Nj 1 ⇔ Nj 2 then Unstructuredness detected 13:
return False. 14:
Ni ← fold(Ni, Nj).
16:
goto loop.
17: else if Nj references Ni then Loop detected 18:
19:
The first pass assigns to
Ni the root of the CFG. At each iteration, the base structured patterns are checked which entails conditions within itself. For instance, a child node's count that is being evaluated would determine whether the subgraph is a sequence, a selection or a loop. ... Folding determines whether a control flow graph is structured and involves replacing the base structured pattern with a single node in the CFG. During folding, any edge not belonging to base pattern but has its source (or sink) node in base pattern is redirected so a newly created single node is its source (sink). Maximal folding repeatedly applies folding to a CFG until no more base structured patterns exist. Maximal folding operates in constant time O(n), where n=#nodes in CFG. A CFG is structured if and only if the CFG is completely foldable; otherwise the CFG is unstructured.
Implementation
The methodology described in the previous section is implemented through a hybrid static and dynamic analysis approach. Refer to [18] for more information on the approach.
Hybrid static and dynamic analysis
We statically collect instruction mixes and source code locations from generated code and map the instruction mixes to the source locator activity as the program is being run. The static analysis of CUDA binaries produces an objdump file, which provides assembly information, including instruction operations, program counter offsets, and line information. The CFG structure was stored in iGraph format [6] . We attribute the static analysis from the objdump file to the profiles collected from the source code activity to provide runtime characterization of the GPU as it is being executed on the architecture. This mapping of static and dynamic profiles provides a rich understanding of the behavior of the kernel application with respect to the underlying architecture.
EXPERIMENTAL SETUP
This section describes the execution environment and the applications as part of this work.
Execution environment
To demonstrate our proposed CUDAflow methodology, we measured the performance of applications on a variety of GPU architectures. The hardware architecture and software environment configuration are displayed in Table 3 , whereas the graphic processor units are listed in Table 2 . Table 3 corresponds to the CPU entry in Table 2 . The selected GPUs reflect the various architecture family generations, and performance results presented in this paper represent GPUs belonging to the same family. For instance, we observed that the performance results from a K80 architecture and a K40 (both Kepler) were similar, and, as a result, did not include comparisons of GPU architectures within families. Also, note the changes in architectural features across generations (global memory, MP, CUDA cores per MP), as well as ones that remain fixed (constant memory, warp size, registers per block). For instance, while the number of multiprocessors increased in successive generations, the number of CUDA cores per MP (or streaming multiprocessors, SM) actually decreased. Consequently, the number of CUDA cores (MP×CUDAcores_per_mp) increased in successive GPU generations.
Applications
Rodinia and SHOC application suite are a class of GPU applications that cover a wide range of computational patterns typically seen in parallel computing. Table 4 provides a description of the applications used in this experiment along with source code statistics, including the number of kernel functions, the number of associated files and the total lines of code.
Rodinia
Rodinia is a benchmark suite for heterogeneous computing to help architects study emerging platforms such as GPUs (Graphics Processing Units), which includes applications and kernels that target multi-core CPU and GPU platforms [5] . Rodinia covers a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to architectural insights such as memory-bandwidth limitations and the consequent importance of data layout.
SHOC Benchmark Suite
The Scalable HeterOgeneous Computing (SHOC) application suite is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing [7] . SHOC provides implementation versions for CUDA, OpenCL, and Intel MIC, and supports execution on clusters with MPI as well as single-node hosts.
ANALYSIS
We analyzed the SHOC and Rodinia applications using our new methodology and discuss the results at different granularities. Table 5 displays total execution time for all applications, comparing different architectures. The execution readings were made with the TAU Performance System. Note that run times for each application was somewhat similar across architectures, although the Maxwell GPU performed poorly for all applications except for BFS. Figure 5 projects goodness as a function of efficiency (bottom figure zooms in on dense region of top figure), which displays the similarities and differences of the benchmark applications. The size of bubble represents the number of operations executed, whereas the shade represents the GPU type. Efficiency describes how gainfully employed the GPU floating-point units remained, or flops per second:
Application level
Goodness metric describes the intensity of the floating-point and memory operation arithmetic intensity: Figure 5 shows a positive correlation between the two measures, where the efficiency of an application increases along with its goodness. Table 6 provides statistics for various metrics from CUDAflow for randomly selected kernels, comparing Kronecker and Euclidean CFG subgraph matching strategies, with score representing the similarity of the CFGs. The naming convention adopted for each kernel is as follows: <architecture.suite.application.kernel>. CFG properties are also displayed with the number of vertices and edges representing V and E, respectively. The instruction mix percentages are kernel-specific, with the instruction mix values normalized according to its total instruction count. The selected kernels not only display the diverseness of the kernels, with V and E varying between the two as well as it operations intensity, but also the performance of each measure regardless of application or architecture. For instance, SPMV and Particlefilter kernels (Table 6 , (1, 3) and (2, 2)) scored 1.11 (Table 1, 2nd row) , which demonstrates that CUDAFlow is able to properly perform subgraph matching on arbitrary kernels. Figure 6 projects the differences in vertices |V | for all CFG kernel pairs as a function of the Kronecker measure (application, architecture, kernel), with shade representing the frequency of the result. Note that most matched CFGs had a similarity score of 1.8 to 2.8 and had vertex differences under 10. Figure 7 projects euclidean distance as a function of the Kronecker product, which shows a direct positive correlation with most similarities within 2.2 to 2.8. Table 7 displays how our similarity measure scores when comparing the same CFG kernel but with a different GPU. Ideally, the score should be 1, since by definition sim(Gi, Gi) = 1. Although there were some poor performers (SPMV), most kernels scored within reasonable similarity measures of 1.04 to 2.00, despite the architectural differences across GPU generations. Figure 8 displays how selected kernels score compared with all other kernels from the SHOC and Rodinia application suite. The xaxis is labeled with the application of the kernel for brevity, but note that the measure is of an individual kernel when compared with all other 162 kernel variants (54 kernels from Tab. 4 × 3 GPUs). From a bird's eye view, the plot illustrates which kernels are most similar with the rest of the observed kernels (particlef, scan) versus ones that are different (hotspot, spmv). This example demonstrates that by introducing an arbitrary (unseen) kernel that there is now a way to measure how "similar" or how "different" a kernel might measure compared with previously observed kernels.
CFG subgraph matching
These metrics can be used both for guiding manual optimizations and by compilers or autotuners. For example, human optimization effort can focus on the code fragments that are ranked high by kernel impact, but low by the goodness metric. An autotuner can also use metrics such as the goodness metric to explore the space of optimization parameters more efficiently, e.g., by excluding cases where we can predict a low value of the goodness metric without having to execute and time the actual generated code.
PRIOR WORK
Control flow divergence in heterogeneous computing applications is a well known and difficult problem, due to the lockstep nature of the GPU execution paradigm. Current efforts to address branch divergence in GPUs draw from several fields, including profiling techniques in CPUs, and software and hardware architectural support in GPUs. For instance, Sarkar demonstrated that the overall execution time of a program can be estimated by deriving the variances of basic block regions [24] . Control flow graphs for flow and context sensitive profiling were discussed in [1, 2] , where instrumentation probes were inserted at selected edges in the CFG, which reduced the overall profiling overhead with minimal loss of information. Hammock graphs were constructed [30] that mapped unstructured control flow on a GPU [9, 28] . By creating thread frontiers to identify early thread reconvergence opportunities, dynamic instruction counts were reduced by as much as 633.2%.
Both hardware and software approaches exist to address the control flow divergence problem. For hardware, thread-aware predication code aided the CUDA compiler in systematically mapping nested control flow structures to data-parallel architectures [15] . Similarly, a dual-path stack architecture approach increased path parallelism using GPGPU-sim, which traversed depth-first on the control flow tree followed by a single path traversal [22] .
The authors [10] have characterized PTX kernels by creating an internal representation of a program and running it on an emulator, which determines the memory, control flow and parallelism of the application. This work closely resembles ours, but differs in that we perform workload characterization on actual hardware during execution. The MIAMI toolkit [19] is an instrumentation framework for studying an application's dynamic instruction mix and control flow, but does not include support for GPUs.
Subgraph matching has been explored in a variety of contexts. For instance, the DeltaCon framework matched arbitrary subgraphs based on similarity scores [14] , which exploited the properties of the graph (e.g. clique, cycle, star, barbell) to support the graph matching. Similarly, frequent subgraph mining was performed on molecular fragments for drug discovery [4] , whereas document clustering was formalized in a graph database context [12] . However, none of these approaches apply frequent subgraph matching for code generation purposes.
CONCLUSION
We have presented CUDAflow, a control-flow-based methodology for analyzing the performance of CUDA applications. We combined static binary analysis with dynamic profiling to produce a set of metrics that not only characterizes the kernel by its computation requirements, whether memory or compute bound, but also provides pinpointed and detailed insights in application performance. Specifically, we provide an intuitive visualization and metrics display, and correlate performance hotspots with source line and file information, effectively guiding the end user to locations of interest. We have implemented this new methodology and demonstrated its capabilities on SHOC and Rodinia applications.
Future work includes incorporating memory reuse distance statistics of a kernel to characterize and help optimize the memory subsystem and compute/memory overlaps on the GPU. In addition, we want to generate robust models that will discover optimal block and thread sizes for CUDA kernels for specific input sizes without executing the application. Last, we are in the process of developing an online web portal [26] that will archive a collection of control flow graphs for all known GPU applications. For instance, the web portal would be able to make on-the-fly comparisons across various hardware resources, as well as other GPU kernels, without burdening the end user with hardware requirements or software package installations, and will enable more feature rich capabilities when reporting performance metrics. Similarity compared with all other Kernels Figure 8 : Kronecker similarity measure for selected kernels with respect to all other GPU kernels from SHOC and Rodinia suite, comparing architecture types.
REFERENCES

