We present challenges faced in making a domain-specific language (DSL) for graph algorithms adapt to the varying requirements of generating a spectrum of efficient parallel codes. Graph algorithms are at the heart of several applications, and achieving high performance on graph applications has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to auto-parallelize, because their access patterns depend on the input graph, which is unavailable until execution. Prior research has addressed this issue by designing DSLs for graph algorithms, which restrict generality but allow efficient code generation for various backends. Such DSLs are, however, too rigid and do not adapt to changes. For instance, these DSLs are incapable of changing the way of processing if the underlying graph changes. As another instance, most of the DSLs do not support more than one backend. We narrate our experiences in making an existing DSL, named Falcon, adaptive. The biggest challenge in the process is to retain the DSL code for specifying the underlying algorithm, and still generate different backend codes. We illustrate the effectiveness of our proposal by auto-generating codes for vertex-based versus edge-based graph processing, synchronous versus asynchronous execution, and CPU versus GPU backends from the same specification.
INTRODUCTION
Graphs model several real-world phenomena such as friendship, molecular interaction and co-authorship. Several graph algorithms have been designed across domains to compute such relationships between entities. Performance of these graph algorithms has become critical today due to the explosive growth of unstructured data. For instance, to simulate a simple physical phenomenon, an algorithm may have to work with billions of particles.
On the other side, computer hardware is witnessing rapid changes with new architectural innovations. Exploiting these architectures demands complex coding and good compiler support. The demand intensifies in the presence of parallelization. It is not uncommon to see a twenty-line textbook graph algorithm implemented using several hundred lines of optimized parallel code.
It would be ideal if a graph algorithm could be programmed at a high level without worrying about the nuances of the hardware. Domain-specific languages (DSLs) for graph data analytics allow programmers to write complex algorithmic codes with minimal hardware knowledge and less programming effort. The code generator of the DSL compiler emits efficient parallel code [3, 11]. Such DSLs often disallow writing arbitrary programs, trading off generality for performance. This makes programming parallel hardware easy, and adapting to changes manageable.
In this work, we focus on the Falcon [3] graph DSL, which supports a wide variety of backends: CPU, GPU, and multi-GPU systems. It extends the C language to allow graph processing to be specified at a high level. Falcon provides special data types (such as worklist and set) as well as constructs (such as foreach and reduction) for simplifying algorithm specification and aiding efficient code generation. Table 1 compares various graph processing frameworks.
Table 1: Comparison of graph processing frameworks. Columns: Tool; DSL; Hardware Support (CPU, GPU, multi-GPU); Iterators (Edge, Vertex, Worklist).
Graph algorithms are challenging to parallelize due to their inherent irregularity, which makes their data-access, control-flow, and communication patterns data-dependent. For instance, vertex-based processing works well for road networks, but social networks demand edge-based processing. Social networks have a skewed degree distribution and road networks have a large diameter. Sequential processing of parallel loops demands synchronous execution, but independent loops can be more efficient with asynchronous processing. Dense subgraphs can be efficiently processed using a topology-driven approach [17], whereas a data-driven worklist-based approach is better suited for sparse subgraphs. Similarly, backend optimizations are quite different for different targets such as CPU and GPU.
Falcon, and other graph DSLs, allow writing code for a particular kind of processing. The code written in the Falcon DSL needs modifications for an alternative way of processing: various syntactic elements in the program need to be changed. Thus, the code needs to be written separately for vertex-based and edge-based processing, for instance. It would be ideal if one could generate different kinds of code from the same DSL specification. Such a setup greatly simplifies the algorithmic specification, and also allows generating code for various situations, backends, and graph types from the same specification.
In this paper, we highlight the challenges faced in building the multi-skilled compiler Falcon-MS. In particular, we make the following contributions:
• We present Falcon-MS, a versatile compiler which generates different implementations for the same DSL program for a graph algorithm. In particular, the compiler can generate vertex-based or edge-based processing, synchronous or asynchronous codes, topology-driven versus data-driven processing, and CPU or GPU or multi-GPU codes.
• We illustrate the effectiveness of the Falcon-MS compiler using several graph algorithms and graphs of various types. The performance of Falcon-MS is compared against other frameworks.
APPROACH
Algorithm 1 shows the Falcon DSL code for Breadth First Search (BFS) computation. The algorithm is vertex-based and uses the points and outnbrs iterators. This Falcon program is for CPU. A programmer has to write a separate program for GPU [3]. Similarly, different programs need to be written for edge-based or worklist-based algorithms. It would be ideal if the programmer simply specifies what rather than how, and the DSL compiler takes care of the appropriate code generation. We support this in our work. In our proposal, the programmer simply needs to specify different compile-time arguments. This triggers generation of parallel C++ or CUDA code matching the output of the edge, vertex or worklist based BFS, targeting CPU or GPU respectively, from a single DSL code. We explain each of these transformations in the subsections below.
Vertex-based versus Edge-based
In graph algorithms, two important performance criteria are load-balance across threads and synchronization while dealing with shared data items. Depending upon how the graph is processed, the load-balance and synchronization requirements may vary. Two traditional ways of graph processing are edge-based and vertex-based. In the former, in each round, each thread operates on an edge (u, v) and applies the computation operator to the edge. The operator typically reads the attributes at the end-points and the attribute at the edge, and propagates information across the end-points. In contrast, in vertex-based processing, each thread operates on a vertex u, and applies the computation operator to every edge originating from u.
Edge-based processing improves load-balance, since the edges of a high-degree node can be spread across multiple threads. In contrast, vertex-based codes improve propagation of information across the graph and can also reduce synchronization requirements. On the negative side, vertex-based processing may incur load-imbalance due to a skewed degree distribution. For instance, a thread operating on a low-degree vertex would have to wait for another thread operating on a high-degree vertex. This issue is especially prominent in social networks. Thus, each of the two ways of processing has pros and cons, and it is imperative for the DSL to be able to generate either type of code.
The Falcon-MS compiler enables conversion from vertex-based to edge-based processing, and vice versa. This is done by modifying the intermediate representation (IR) of the compiler. Falcon uses a syntax tree as its IR. An important non-triviality in the conversion stems from edge-based processing being a single loop, while the corresponding vertex-based processing is a nested loop: the outer loop is over vertices, while the inner loop is over all the neighbors of each vertex. Pseudo-code for the vertex-based to edge-based transformation in Falcon-MS is presented in Algorithm 2. The reverse conversion is similar.
In vertex-based to edge-based conversion, the subtree of a node in syntax-tree (IR) is eligible for conversion if the following conditions are met:
• The subtree is rooted at a function node whose only child is a foreach node iterating through a point's neighbors.
• The function is the only statement in the body of a foreach iterating through the points.
The first condition ensures iteration through the neighbors, while the second one ensures iteration through points (vertices). If other statements are present, those may interfere with the conversion and it is unclear if those can be transformed into edge-based processing. If the above conditions are met, we switch the points iterator of the foreach statement from which the function is called to an edge iterator, and then remove the foreach statement inside the function. The formal parameter of the function is modified from Point type to Edge type. This also necessitates declaring two new variables at the beginning of the function, corresponding to the source and the destination vertices of the modified formal parameter of type Edge. These variables are mapped to the formal parameter of the function corresponding to the Point iterator (see Line 5, Algorithm 1) and to the iterator corresponding to the innbrs or outnbrs iterator (see Line 6, Algorithm 1) given in the input DSL code. The symbol table entries are modified to change the variable types from point iterators to Point. Such an implementation allows the rest of the processing in the iteration to be arbitrary, and reduces the number of alterations required on the intermediate representation.
An important artifact of this conversion is that it affects the graph representation in memory. In vertex-based code, the graph is stored in compressed sparse-row (CSR) format. Compared to the edge-list representation, the CSR format reduces the storage requirement and has the benefit of quick access to the out-neighbors of a vertex. In edge-based codes, on the other hand, the graph is stored in edge-list format (source-vertex, destination-vertex, edge-weight), which enables quick retrieval of the source and the destination vertices of an edge. Algorithm 3 shows the edges iterator based Falcon program for BFS.
Our novel compiler transformation modifies the intermediate representation of the points+outnbrs iterator BFS program to the one which matches the intermediate representation of the edges iterator based DSL code shown in Algorithm 3. 
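The difference in loop structure and graph layout can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the toy graph, the function names, and the sequential loops (which stand in for the parallel foreach) are hypothetical and are not output of Falcon-MS. The vertex-based version walks a CSR layout with a nested loop, while the edge-based version makes a single pass over an edge list.

```python
# Toy directed graph with 5 vertices, given as (source, destination) pairs.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
num_vertices = 5
INF = float("inf")


def build_csr(num_vertices, edges):
    """Build CSR arrays; row_offsets has num_vertices + 1 entries."""
    row_offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_offsets[src + 1] += 1
    for v in range(num_vertices):
        row_offsets[v + 1] += row_offsets[v]
    col_indices = [0] * len(edges)
    next_slot = row_offsets[:-1].copy()
    for src, dst in edges:
        col_indices[next_slot[src]] = dst
        next_slot[src] += 1
    return row_offsets, col_indices


def bfs_vertex_based(num_vertices, row_offsets, col_indices, source):
    """Nested loop: outer over points, inner over each point's out-neighbors."""
    dist = [INF] * num_vertices
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        for u in range(num_vertices):  # stands in for the parallel points loop
            for k in range(row_offsets[u], row_offsets[u + 1]):
                v = col_indices[k]
                if dist[v] > dist[u] + 1:
                    dist[v] = dist[u] + 1
                    changed = True
    return dist


def bfs_edge_based(num_vertices, edges, source):
    """Single loop over edges; source and destination are read from the edge."""
    dist = [INF] * num_vertices
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        for u, v in edges:  # stands in for the parallel edges loop
            if dist[v] > dist[u] + 1:
                dist[v] = dist[u] + 1
                changed = True
    return dist


row_offsets, col_indices = build_csr(num_vertices, edges)
vertex_result = bfs_vertex_based(num_vertices, row_offsets, col_indices, 0)
edge_result = bfs_edge_based(num_vertices, edges, 0)
```

Both functions apply the same relaxation to every edge; only the iteration space and the storage format differ, which is why the conversion can be expressed as a rewrite of the loop nest and of the function's formal parameter.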
Synchronous versus Asynchronous Processing
By default, most graph DSLs generate BSP-style synchronous code. That is, they insert a barrier at the end of a parallel construct. While this works well in several codes and eases arguing about correctness (since data races are restricted to within-iteration processing across threads), synchronous processing may be overly restrictive in certain contexts. In particular, in cases where processing across iterations is independent and the hardware does not necessarily demand single-instruction multiple-data (SIMD) execution, asynchronous processing may improve performance. Arguing about correctness guarantees gets so involved in asynchronous execution that graph DSLs strictly enforce only synchronous code generation. The Falcon-MS compiler is designed to allow the programmer to generate synchronous or asynchronous code without having to change the algorithm specification code in the DSL. Achieving this necessitates identifying independent processing in the code. Towards this, we maintain read and write sets of the global variables and the graph attributes used in each target function separately. A target function is a function which is called in the body of a foreach statement, where the function call is the only statement inside the body of the foreach. On CPU, it is the parallel loop body, while on GPU, this function becomes the kernel.
The code conversion in Falcon-MS is a two-step process, as shown in Algorithms 4 and 5. In Step 1, we mark nodes in the control-flow graph (CFG); and in Step 2, we generate the appropriate code.
A node corresponds to a statement and its properties in the input DSL program. The node type is KERNEL_LAUNCH if it is a foreach statement which calls a target function. The node stores relevant information such as the read set (rset) and the write set (wset) of a statement, in addition to barrier, visited, and predecessor_count. In Step 1, we construct the CFG of the target function. Using the read and the write sets corresponding to each target function, we mark each node of the CFG as barrier-free or requiring-barrier (Line 7, Algorithm 5). A barrier-free node signifies that the target function corresponding to the node can be executed concurrently with the children of this node. A node is barrier-free if it satisfies the following two conditions:
(1) There is no dependency (Read-Write, Write-Read and Write-Write) between the node and each of its children in the CFG.
(2) There is no dependency between the node and the codes between the node and its children.
If a node is barrier-free, we pass the read and the write sets of the node to its children (Lines 9-10, Algorithm 5). This ensures that a child is declared barrier-free only if its own children have no dependency with the child's barrier-free ancestors either: a grand-child must not depend on the grand-parent for its parent to be barrier-free, and so on. We follow this process to mark all the CFG nodes in breadth-first search order.
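The marking step can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the Falcon-MS implementation: the Node class, the attribute names other than rset and wset, and the three-kernel chain are hypothetical; condition (2) on intervening code is omitted; leaf nodes are treated as requiring a barrier; and each node is assumed to have a single predecessor.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str  # assumed unique; used as a dictionary key below
    rset: set
    wset: set
    children: list = field(default_factory=list)
    barrier_free: bool = False


def conflicts(r1, w1, r2, w2):
    """True if a Write-Read, Read-Write or Write-Write dependency exists."""
    return bool(w1 & r2) or bool(r1 & w2) or bool(w1 & w2)


def mark_barrier_free(root):
    """Mark nodes in breadth-first order, passing sets down from barrier-free parents."""
    # Effective sets: a node's own sets plus those inherited from barrier-free ancestors.
    effective = {root.name: (set(root.rset), set(root.wset))}
    visited = {root.name}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        reads, writes = effective[node.name]
        # Assumption: a node without children has nothing to overlap with.
        node.barrier_free = bool(node.children) and all(
            not conflicts(reads, writes, child.rset, child.wset)
            for child in node.children
        )
        for child in node.children:
            child_reads, child_writes = effective.setdefault(
                child.name, (set(child.rset), set(child.wset))
            )
            if node.barrier_free:
                child_reads |= reads    # in-place union updates the stored set
                child_writes |= writes
            if child.name not in visited:
                visited.add(child.name)
                queue.append(child)


# Hypothetical chain of three target functions on the same graph.
bfs = Node("bfs", rset={"dist"}, wset={"dist"})
sssp = Node("sssp", rset={"sdist", "weight"}, wset={"sdist"})
report = Node("report", rset={"dist"}, wset={"maxdist"})
bfs.children = [sssp]
sssp.children = [report]
mark_barrier_free(bfs)
```

In this chain, bfs and sssp touch disjoint attributes, so bfs is marked barrier-free. The sssp node is not, even though it shares nothing with report directly, because report reads dist, which its barrier-free grand-parent bfs writes.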
In Step 2, different procedures are followed to make the code asynchronous, depending on the target the programmer wants. The Falcon compiler supports only the BSP execution model, so we modified its code generator to generate asynchronous code. If the target is GPU, no cudaDeviceSynchronize() barrier call is generated after the kernel launch of any target function marked barrier-free. Also, each of the barrier-free kernels is launched in a different stream of the same GPU. On the other hand, if the target is CPU, the target function call corresponding to the barrier-free node is put in one section of an OpenMP parallel region, and its children and the code between the node and its children are put in another section. The compiler then recursively checks whether the child nodes are barrier-free. If they are, a new OpenMP parallel sections construct is introduced inside the section in which the child was placed earlier because of its barrier-free parent node. This recursive introduction of parallel sections continues until a non-barrier-free node is found, or until the processing reaches the end of the function where these target functions are called. The OpenMP constructs are introduced by adding new statements to the intermediate representation.
For a parallel region, two nodes are added: one each for the start and the end of the construct.
In a similar manner, for each section, a node for the start and another node for the end are added. Adding these nodes is easy if both the node and its children in the CFG lie in the same scope. We can then simply add a node prior to and another node right after the barrier-free node. The processing gets involved when a node and its children are in different scopes. In such cases, we need to find the predecessors of these nodes which lie in the same scope.
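The effect of placing a barrier-free target function and its successor in separate sections can be mimicked in Python with a thread pool. This is only an analogy under our own assumptions: the two functions and the toy graph are hypothetical, Python threads stand in for OpenMP sections, and no performance claim is implied.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

# Toy adjacency list; the two computations below write disjoint results.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}


def bfs_levels(adj, source):
    """First target function: BFS level of every reachable vertex."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def out_degrees(adj):
    """Second target function: independent of the BFS levels."""
    return {u: len(nbrs) for u, nbrs in adj.items()}


def run_synchronous(adj):
    # BSP style: the second call starts only after the first has finished.
    levels = bfs_levels(adj, 0)
    degrees = out_degrees(adj)
    return levels, degrees


def run_asynchronous(adj):
    # Each call goes into its own "section"; the only wait is at the end.
    with ThreadPoolExecutor(max_workers=2) as pool:
        levels_future = pool.submit(bfs_levels, adj, 0)
        degrees_future = pool.submit(out_degrees, adj)
        return levels_future.result(), degrees_future.result()


sync_result = run_synchronous(adj)
async_result = run_asynchronous(adj)
```

Because the two functions have no read-write, write-read or write-write dependency, both schedules produce the same results; only the point at which execution waits differs.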
Data-driven versus Topology-driven Processing
In topology-driven processing, each graph element (vertex or edge) is assigned a thread in every iteration. When there is no processing to be done at an element, the corresponding thread is idle. Such processing is not work-efficient, especially when only a small part of the graph is active. In contrast, data-driven processing maintains a worklist of active graph elements, and each such element is assigned a thread. This improves work-efficiency as the inactive graph elements are not processed.
On the other hand, the topology-driven implementation incurs no worklist overhead. Prior research has shown that both kinds of processing can be beneficial in different contexts [17]. Therefore, it would be helpful if a DSL compiler enabled switching from one kind to the other.
The Falcon-MS compiler enables this. It is a two-step process. In the first step, each kernel is checked to see whether it can be converted into a worklist-based one. In the second step, the compiler IR is modified such that the code generation module generates worklist-based code. The primary program analysis required in this transformation is to identify where point attributes are modified (e.g., the distance of nodes in SSSP) and to push such points (vertices) into the worklist. The Falcon-MS code generator goes through the IR of the foreach statement and looks for such attribute updates and the conditions guarding them. Based on the target architecture, the code generation module generates worklist-based code for the particular target. For CPU, the compiler generates Galois worklist-based code. For GPU, the compiler emits code which uses the Falcon library.
The Falcon-MS compiler transforms only those codes which follow a particular pattern. The code snippet in Figure 6 shows the pattern, where ⊕ is a relational operator.
The condition of the if-statement in Line 5 should contain the point attributes (here attr) which are changed inside the if-statement (Line 6). The if-statement should be enclosed inside a foreach-statement (Line 3) which iterates through the neighbors of a point. To make the transformation logically correct, the analysis needs to check that point attributes are not modified outside the if-statement. This check is required because the transformed code will not process all the points.
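To make the contrast concrete, the following Python sketch applies the same guarded update to SSSP in both styles. It is an illustration under our own assumptions rather than generated code: the toy graph, the function names, and the visit counters are hypothetical, and the loops are sequential stand-ins for parallel kernels. The guarded update has the shape described above: a comparison on a point attribute, and a modification of that attribute only inside the if-statement.

```python
from collections import deque

INF = float("inf")

# Toy weighted graph: vertex -> list of (neighbor, edge weight).
graph = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}


def sssp_topology_driven(adj, source):
    """Every vertex is visited in every round, active or not."""
    dist = {v: INF for v in adj}
    dist[source] = 0
    visits = 0
    changed = True
    while changed:
        changed = False
        for u in adj:
            visits += 1
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:  # guard on the point attribute
                    dist[v] = dist[u] + w  # attribute changes only here
                    changed = True
    return dist, visits


def sssp_data_driven(adj, source):
    """Only vertices whose attribute changed are pushed and processed."""
    dist = {v: INF for v in adj}
    dist[source] = 0
    visits = 0
    worklist = deque([source])
    while worklist:
        u = worklist.popleft()
        visits += 1
        for v, w in adj[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                worklist.append(v)  # push the point whose attribute changed
    return dist, visits


topo_dist, topo_visits = sssp_topology_driven(graph, 0)
data_dist, data_visits = sssp_data_driven(graph, 0)
```

Both versions reach the same distances, but the data-driven one visits fewer vertices on this input because it skips vertices whose distance did not change. If the attribute were also modified outside the guarded update, the worklist version could miss work, which is why the check described above is needed.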
Because of the restriction on the pattern, our method cannot transform algorithms which cannot be written in the mentioned pattern. One such example is the PageRank algorithm. Its implementation requires two kernels. In the first kernel, the current PageRank values of all nodes are initialized; in the second kernel, the PageRank values of nodes are updated based on the previous PageRank values of their neighbors. Since the update to PageRank should be done only after the initialization, the two kernels cannot be merged into one. As a result, it cannot be transformed by our method.
CPU, GPU and Multi-GPU Codes
We modified the Falcon grammar to remove the <GPU> tag. The modified compiler recognizes a device-independent version of the DSL code. Based on the command-line argument, our compiler generates code for the appropriate target device.
To generate multi-GPU code, the programmer has to use the parallel sections construct. The compiler assumes that the sections are independent of each other. We identify the number of sections in a parallel sections construct and map each of the sections to a different GPU. For each Graph, Set and Collection object used in a particular parallel section, a GPU copy is created on the mapped GPU. It may happen that the programmer has used a single graph in multiple sections. In such cases, the graph needs to be copied to each GPU. For each of those GPU copies, we keep track of the attributes of the graph or its points/edges used in the target functions where the graph is passed as an argument. This helps us replace a graph whose attribute is accessed on the CPU by the appropriate GPU copy in which the accessed attribute is present. If an attribute of a GPU graph is accessed on the CPU, the compiler generates code with a call to the cudaMemcpy() function to copy the attribute from GPU to CPU or from CPU to GPU, based on whether the programmer has read or written the attribute. One advantage of assuming independent sections is that an attribute accessed on the CPU can be changed on at most one GPU, which eases our analysis and code generation.
Data-Transfer Aggregation Optimization
Data-transfer aggregation deals with combining multiple calls for data transfer between CPU and GPU into a single call. Such an optimization is not present in Falcon. In Falcon, the custom properties of points/edges in a graph object are stored as an array whose n-th position stores the value corresponding to the n-th point/edge. If a property of a point/edge of a GPU graph object is read on the CPU, a data transfer from GPU to CPU is necessary. By default, n such reads require n data transfers. This is not always desirable and can quickly become a performance bottleneck due to the slow PCIe link between the devices. To overcome this, we combine multiple such data transfers into a single data transfer, where feasible. Such an optimization is particularly beneficial for code inside a loop. If the instructions inside the body of a loop only read a property at a particular location (and may write to other properties), we copy the array related to that property to the CPU before the beginning of the loop using a single copy operation. The instructions in the loop then use the CPU copy of the data. This often improves performance, as the number of iterations of a loop is expected to be high.
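The saving can be illustrated with a Python sketch that counts transfers instead of performing them. Everything here is a hypothetical stand-in: FakeDevice and its two copy methods are not part of Falcon or CUDA, and they merely model one device-to-host copy per call.

```python
class FakeDevice:
    """Stand-in for GPU memory that counts device-to-host copies."""

    def __init__(self, dist):
        self._dist = list(dist)
        self.transfers = 0

    def copy_element_to_host(self, index):
        # Models one small copy for a single property value.
        self.transfers += 1
        return self._dist[index]

    def copy_array_to_host(self):
        # Models one large copy for the whole property array.
        self.transfers += 1
        return list(self._dist)


def max_distance_unoptimized(device, n):
    """One transfer per loop iteration: n transfers in total."""
    best = 0
    for i in range(n):
        best = max(best, device.copy_element_to_host(i))
    return best


def max_distance_aggregated(device, n):
    """The copy is hoisted out of the loop: a single transfer."""
    host_dist = device.copy_array_to_host()
    best = 0
    for i in range(n):
        best = max(best, host_dist[i])
    return best


distances = [0, 1, 1, 2, 3]
slow_device = FakeDevice(distances)
fast_device = FakeDevice(distances)
slow_max = max_distance_unoptimized(slow_device, len(distances))
fast_max = max_distance_aggregated(fast_device, len(distances))
```

Both loops compute the same maximum, but the first issues as many transfers as there are elements, while the second issues one. The hoisting is valid here only because the loop body reads the property and never writes it, which matches the condition stated above.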
EXPERIMENTAL EVALUATION
The details of the Falcon-MS compiler transformations and code generation for CPU, GPU and multi-GPU targets were discussed in the previous section. For CPU, parallel code is generated with OpenMP (for vertex-based and edge-based processing) or Galois (for worklist-based processing). CUDA code is generated for GPU and multi-GPU machines. We used a range of graph types to assess the effectiveness of Falcon-MS. The dataset graphs in our experimental setup and their characteristics are presented in Table 2. We used four graph algorithms in our testbed: Breadth-First Search (BFS), Connected Components (CC), Minimum Spanning Tree computation (MST) and Single-Source Shortest Paths computation (SSSP). All these algorithms are fundamental in graph theory and form building blocks in various application domains. We compare the generated codes against the following frameworks: Galois [20], Totem [8], Green-Marl [12], LonestarGPU [18] and Gunrock [25]. The codes auto-generated by the Falcon-MS compiler perform similarly to hand-optimized codes written in Falcon. Therefore, in the sequel, we directly discuss our proposed techniques embedded into Falcon, unless otherwise stated.
The benchmarks are run on a multi-GPU machine with a Xeon E5-2650 v2 CPU with 32 cores clocked at 2.6 GHz, 100GB of RAM, 32KB of L1 data cache, 256KB of L2 cache and 20MB of L3 cache. The machine has eight Tesla K40C devices, each having 2880 cores clocked at 745 MHz and 12GB of global memory.
Baselines and Comparison with Other Frameworks
The baseline execution times of Falcon on GPU are listed in Table 3. The values correspond to the vertex-based implementation of the algorithms. We observe that the execution times on road networks are particularly high for propagation-based algorithms such as BFS, SSSP and CC.
Figures 2, 3, 4 and 5 present the performance benefit of Falcon-MS on GPU against other frameworks. The better of the edge-based and vertex-based executions of Falcon-MS is used in the results. The GPU baseline used for this comparison is Totem, whose speedup is taken to be 1.0 (hence not shown in the plots).
For SSSP, we observe that Falcon-MS provides consistently better speedups compared to LonestarGPU and Gunrock, except on the two social networks. Totem performs better on the social networks as well as on RMAT graphs due to its inbuilt edge-based processing and other optimizations to improve load-balancing across GPU threads.
For BFS, the results are mixed across the frameworks and there is no clear winner, but there are interesting patterns based on the graph types. Gunrock performs quite well on the road networks (USA-full and USA-CTR), primarily due to its work-efficient worklist-based processing. Totem again outperforms the others on social networks due to edge-based processing and better load-balancing. The performance of almost all the frameworks on RMAT graphs is quite similar, with LonestarGPU performing poorly. Falcon-MS stands out on random graphs with speedups close to 2× over all other frameworks.
For MST, Falcon-MS performs better than LonestarGPU on road networks and social networks. On the other hand, LonestarGPU performed better on two random graphs. LonestarGPU failed to run on the Rand-100M and Rand-125M graph inputs for all benchmarks, and Gunrock failed on the same inputs for BFS.
Vertex-based versus Edge-based Processing
Figure 6 presents results of edge-based versus vertex-based processing of Falcon-MS across various graphs for CC, BFS and SSSP.
Figure 6: Falcon-MS GPU: Speedup of edge-based over vertex-based processing
We observe that edge-based processing performs better on social networks (orkut and sinaweibo) and RMAT graphs. Both these kinds of graphs have a skewed (power-law) degree distribution, resulting in a large load imbalance with vertex-based processing. These graphs follow the small-world property due to this peculiar (and natural) degree distribution. The load imbalance manifests itself as thread divergence, as the number of iterations of each thread (based on the number of neighbors) has high variance. In other words, threads mapped to vertices having few neighbors have to wait for others mapped to high-degree vertices. This inhibits parallelism for the SIMT style of processing. In contrast, in edge-based processing, since threads are mapped to (a group of) edges, the load imbalance is relatively negligible. This results in less thread divergence among warp threads, leading to improved execution time. Road networks and random graphs, on the other hand, have a quite uniform degree distribution. Therefore, edge-based processing is not very helpful. In fact, for uniform degree distributions, edge-based processing may lead to inferior results (as seen in our experiments), due to the increased synchronization requirement.
Synchronous versus Asynchronous Processing
Table 4 presents the effect of asynchronous processing for various graphs on CPU. We used a combination of BFS and SSSP to perform independent processing on the same graph. We observe that the asynchronous version improves execution time by 38%. This occurs because threads do not have to wait for other threads. This is primarily true on CPUs, as threads are monolithically working on different parts of the graph and seldom require synchronization. In our experiments on GPU, since all the GPU resources were utilized by a single kernel, asynchronous processing performed similarly to synchronous execution.
Code Generation for Multiple Targets
As discussed in Section 2.4, our approach can seamlessly generate code for CPU, GPU or multi-GPU. The multi-GPU code works with different graphs for the same algorithm, or with the same graph for different algorithms. Table 5 presents results for the former, with CC as the algorithm and code generated for CPU with 16 threads, a single GPU, and two GPUs. We observe that the multi-GPU version took much less time than the other backends. In both the CPU and single-GPU versions, the graphs are processed one after another. In the multi-GPU version, on the other hand, both graphs are processed simultaneously on different GPUs, so the overall execution time is the larger of the two.
Data-Transfer Optimization
To illustrate the effect of the memcpy-optimization, we devised a simple Falcon program which computes BFS on GPU and then queries the distances of various vertices from the CPU to compute the maximum distance. We observe that the program with the optimization takes less than a second to find the maximum distance. Without the optimization, on the other hand, the same program takes more than five minutes. This happens because, without the optimization, the existing Falcon engine generates code to copy the distance of each vertex one by one as required in each iteration (a total of N small cudaMemcpys). In contrast, the optimized code copies all the distances at once and then uses them for finding the maximum distance (one large cudaMemcpy). This considerably reduces the communication overhead, leading to improved execution time. Figure 7 shows the speedup of the best-performing Falcon-MS code over the Falcon vertex-based code. Falcon-MS on GPU performs better on social network graphs due to the skewed degree distribution of these graphs, which results in much higher thread divergence in the vertex-based implementation.
RELATED WORK
The Green-Marl [12] graph DSL for multi-core CPUs is extended with a CUDA backend in [22]. Falcon [3] is a heterogeneous graph DSL targeting multi-GPU machines. Unlike other DSLs, it supports mutation of graph objects via cautious speculative execution. GPU and CPU devices are supported with a vertex-centric programming model in Abelian [9]. The frameworks [20, 23] support graph analytics on multi-core CPUs. Totem [8] is a heterogeneous framework for graph processing on multi-GPU machines.
The GraphIt DSL [26] targets multi-core CPUs. It has an algorithmic language and a scheduling language. The programmer has to specify the scheduling methodology for the algorithm using the scheduling language. Falcon-MS complements these efforts, and can be combined with them for mutual benefit. The Falcon DSL code is converted to synchronous, asynchronous, vertex-based or edge-based processing based on the command-line argument. It also supports GPUs and multi-GPU machines. None of the existing DSLs and frameworks supports this versatility.
Davidson et al. [6] suggested that worklist-based graph algorithms do not benefit much on the GPUs of the time. GasCL [7] is a graph processing framework built on top of OpenCL, and supports parallel work distribution. Dynamic graph algorithms on GPU are supported by [18]. A data-centric abstraction with GPU-specific optimizations is provided in [25]. IrGL [19] implements optimizations named iteration outlining, cooperative conversion, and nested parallelism.
Large graphs require processing on a computer cluster. The popular distributed graph analytics frameworks for CPU clusters are [4, 10, 15, 16]. The Green-Marl DSL is extended to generate Graph Processing System (GPS) [21] code for CPU clusters [14]. The Falcon compiler is extended for distributed systems with CPU and GPU by generating CUDA/C++ code [2] and OpenCL code [24]. Pregel-like systems [16, 21] outperform MapReduce systems [1] in graph analytics applications. Phoenix [5] supports fault tolerance for graph analytics on distributed systems. PGX.D [13] uses cooperative context switching to implement low-overhead and efficient communication for graph analytics.
CONCLUSION
Irregular codes have data-dependent access patterns. Therefore, compilers need to make pessimistic assumptions leading to very conservative code. While DSLs for irregular codes allow us the flexibility to make more informed decisions about the domain, the existing DSLs lack adaptability. Different graphs expect different kinds of processing to achieve the best performance. While existing DSLs do allow changing the algorithm specification to suit a purpose, it would be ideal if the specification remains intact and the compiler judiciously generates the necessary efficient code. We presented our experiences in achieving the same, for a graph DSL, Falcon. In particular, we auto-generated codes for vertex-based and edge-based, for synchronous versus asynchronous, for worklist-driven versus topology-driven, and for CPU versus GPU versus multi-GPU processing. We illustrated the effectiveness of our techniques using a variety of algorithms and several real-world graphs. We believe our techniques can be embedded in other DSLs as well. 
