Abstract-A novel parallel algorithm is presented for generating random scale-free networks using the preferentialattachment model. The algorithm, named cuPPA, is customdesigned for single instruction multiple data (SIMD) style of parallel processing supported by modern processors such as graphical processing units (GPUs). To the best of our knowledge, our algorithm is the first to exploit GPUs, and also the fastest implementation available today, to generate scale-free networks using the preferential attachment model. A detailed performance study is presented to understand the scalability and runtime characteristics of the cuPPA algorithm. In one of the best cases, when executed on an NVidia GeForce 1080 GPU, cuPPA generates a scale-free network of two billion edges in less than 3 seconds.
I. INTRODUCTION
Networks are prevalent in many complex systems, such as circuits, chemical compounds, protein structures, biological networks, social networks, the Web, and XML documents. Recently, there has been substantial interest in the study of a variety of random networks to serve as mathematical models of complex systems. Various network theories, metrics, topology, and mathematical models have been proposed to understand the underlying properties and relationships of these systems. Among the proposed network models, the first and the most studied model is the Erdős-Rényi model [1] . However, the Erdős-Rényi model does not exhibit the characteristics observed in many real-world complex systems [2] . Barabási and Albert [2] discovered a class of inhomogeneous networks, called scale-free networks, characterized by a power-law degree distribution P (k) ∝ k −γ , where k represents the degree of a vertex and γ is a constant. While high degree vertices are improbable in Erdős-Rényi networks, they do occur with statistically significant probability in scale-free networks. Furthermore, the work This paper has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the U.S. Department of Energy. Accordingly, the United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
of Albert et al. [3] suggests these high degree vertices appear to play an important role in the behavior of scalefree systems, particularly with respect to their resilience [4] . For example, the Barabasi-Albert model can be used for evaluating the North American electric grid with high reliability [4] . Additional related work is discussed later in Section V.
As these complex systems of today grow larger, the ability to generate progressively large random networks becomes all the more important. It is well known that the structure of larger networks is fundamentally different from that of small networks, and many patterns, such as communities, emerge only in massive networks [5] . Although various random network models have been used and studied over the last several decades, even efficient sequential algorithms for generating such networks were nonexistent until recently. The efficient sequential algorithms are able to generate networks with millions of edges in a reasonable amount of time, however, generating networks with billions of edges can take prohibitively large amount of time. This motivates the need for efficient parallel algorithms for generating such networks. Naïve parallelization of the sequential algorithms for generating random networks may not work due to the dependencies among the edges and the possibility of creating duplicate (parallel) edges.
Graphics processors (GPUs) are a cost-effective, energyefficient, and widely available parallel processing platform. GPUs are highly parallel, multi-threaded, many-core processors that have greatly expanded beyond graphics operations and are now widely used for general purpose computing. The use of GPUs is prevalent in many areas such as scientific computation, complex simulations, big data analytics, machine learning, and data mining. However, there is a lack of GPU-based graph/network generators, especially for scale-free networks such as those based on the preferential attachment model. In this paper, we present cuPPA, a novel GPU based algorithm for generating networks conforming to the preferential attachment model. With cuPPA, one can generate a network with two billion edges using a modern NVidia GPU in less than three seconds. To the best of our knowledge, this is the first GPU-based algorithm to generate networks using the exact preferential attachment model.
The rest of the report is organized as follows. In the following Section II, background material is provided in terms of preliminary information, notations, an outline of the network generation problem, and two leading sequential algorithms. In Section III, our parallel cuPPA algorithm for the GPU is presented. The experimental study and performance results using cuPPA are described in Section IV. We present a review of related works in Section V. Finally, Section VI concludes with a summary and an outline of future directions.
II. BACKGROUND

A. Preliminaries and Notations
In the rest of this report, we use the following notations. We denote a network G(V, E), where V and E are the sets of vertices and edges, respectively, with m = |E| edges and n = |V | vertices labeled as 0, 1, 2, . . . , n − 1. For any (u, v) ∈ E, we say u and v are neighbors of each other. The set of all neighbors of v ∈ V is denoted by N (v), i.e.,
If u and v are neighbors, sometimes we say that u is connected to v and vice versa.
We develop parallel algorithms using the CUDA (Compute Unified Device Architecture) framework on the GPU. A GPU contains multiple streaming multiprocessors (SMs). An SM is a group of core processors. Each core processor executes only one thread at a time. All core processors can execute their corresponding threads simultaneously. If some threads perform operations that have to wait for data fetches with high latencies, those are put into the waiting state and other pending threads are executed. Therefore, GPUs increase throughput by keeping the processors busy. All thread management, including the creation and scheduling of threads, is performed entirely in hardware with virtually zero overhead and requires negligible time for launching work on the GPU. For these advantages, modern supercomputers, such as Titan, the largest supercomputer in the USA, are build using GPUs in addition to conventional central processing units (CPUs).
We use K, M, and B to denote thousand, million, and billion, respectively; e.g., 2 B stands for two billion.
B. Preferential Attachment-Based Models
The preferential attachment model is a model for generating randomly evolved scale-free networks using a preferential attachment mechanism. In a preferential attachment mechanism, a new vertex is added to the network and connected to some existing vertices that are chosen preferentially based on some properties of the vertices. In the most common method, preference is given to vertices with larger degrees: the higher the degree of a vertex, the higher is the probability of choosing it. In this report, we study only the degree-based preferential attachment, and in the rest of the report, by preferential attachment (PA) we mean degreebased preferential attachment.
Before presenting our parallel algorithms for generating PA networks, we briefly discuss the sequential algorithms for the same. Many preferential attachment based models have been proposed in the literature. Two of the most prominent models are the Barabási-Albert model [2] and the copy model [6] as discussed below.
C. Sequential Algorithm: Barabási-Albert Model
One way to generate a random PA network is to use the Barabási-Albert (BA) model. Many real-world networks have two important characteristics: (i) they are evolving in nature and (ii) the network tends to be scale-free [2] . In the BA model, a new vertex is connected to an existing vertex that is chosen with probability directly proportional to the current degree of the existing vertex.
The BA model works as follows. Starting with a small clique ofd vertices, in every time step, a new vertex t is added to the network and connected to d ≤d randomly chosen existing vertices: F (t) for 1 ≤ ≤ d with F (t) < t; that is, F (t) denotes the -th vertex which t is connected. Thus, each phase adds d new edges (t, F 1 (t)), (t, F 2 (t)), . . . , (t, F d (t)) to the network, which exhibits the evolving nature of the model. Let F(t) = {F 1 (t), F 2 (t), . . . , F d (t)} be the set of outgoing vertices from t. Each of the d end points in the set F(t) are randomly selected based on the degrees of the vertices in the current network. In particular, the probability P i (t) that a outgoing edge from vertex t is connected to vertex i < t is given by P i (t) = di j dj , where d j represents the degree of vertex j. The networks generated by the BA model are called the BA networks, which bear the aforementioned two characteristics of a real-world network. BA networks have power-law degree distribution. A degree distribution is called powerlaw if the probability that a vertex has degree d is given
−γ , where γ ≥ 1 is a positive constant. Barabási and Albert showed that the preferential attachment method of selecting vertices results in a power-law degree distribution [2] .
A naïve implementation of network generation based on the BA model takes Ω(n 2 ) time where n is the number of vertices. Batagelj and Brandes give an efficient algorithm with a running time of O(m) where m is the number of edges [7] . This algorithm maintains a list of vertices such that each vertex i appears in this list exactly d i times. The list can easily be updated dynamically by simply appending u and v to the list whenever a new edge (u, v) is added to the network. Now, to find F (t), a vertex is chosen from the list uniformly at random. Since each vertex i occurs exactly d i times in the list, we have the probability Pr [
D. Sequential Algorithm: Copy Model
As it turns out, the BA model does not easily lend itself to an efficient parallelization [8] . Another algorithm called the copy model [6, 9] preserves preferential attachment and power-law degree distribution. The copy model works as follows. Similar to the BA model, it starts with a small clique ofd vertices and in every time step, a new vertex t is added to the network to create d ≤d connections to existing vertices F (t) for 1 ≤ ≤ d with F (t) < t. For each connection (t, F (t)) from vertex t the following steps are executed:
Step 1: First a random vertex k ∈ [0, t−1] is chosen with uniform probability.
Step 2: Then F (t) is determined as follows:
where l is a random outgoing connection from vertex k. We also denote F(t) = {F 1 (t), F 2 (t), . . . , F d (t)} to be the set of outgoing vertices from vertex t.
It can be easily shown that a connection from vertex t to vertex i is made with probability Proof: A vertex i can be selected in F(t) in two mutually exclusive ways: i) i is chosen in the first step and assigned to an outgoing edge of t in the second step (Equation 1); this event occurs with probability 1 t · p; or ii) a neighbor of i, v ∈ {u|i ∈ F(u)}, is chosen in the first step, and the outgoing edge to i is selected (out of d outgoing edges from v) in the second step (Equation 2); this event occurs with probability
where d i is the total degree of vertex i. Thus, we have the following equation.
Thus, the copy model is more general than the BA model. It has been previously shown [6] that the copy model produces networks with degree distribution that follows a power-law d −γ , where the value of the exponent γ depends on the choice of p. Further, it is easy to see the running time of the copy model is O(m). Copy model has been used to develop efficient parallel algorithms for generating preferential attachment networks in distributed and sharedmemory machines [8, 11] . In our work presented in this report, we adopt the copy model as a starting point to design and develop our GPU-based parallel algorithm.
III. GPU-BASED PARALLEL ALGORITHM: CUPPA
The PA model imposes a critical dependency that every new vertex needs to have the state of the previous network to compute its edges. This poses a major challenge in parallelizing preferential attachment algorithms. In phase v, to determine F (v), it requires that F i is known for each i < v. As a result, any algorithm for preferential attachment apparently seems to be highly sequential in nature: phase v cannot be executed until all previous phases are completed.
In [8] , a distributed-memory based algorithm was proposed that exploits the copy model to relieve this sequentiality and run in parallel. We reexamined that exploitation and designed cuPPA, an efficient parallel algorithm for generating preferential attachment based networks on the GPU as described next. Let T be the number of threads in the GPU. The set of vertices V is partitioned into
The graph is stored entirely in the GPU memory. Thread T i is responsible for computing and updating F (v) for all v ∈ V i . The algorithm starts with an initial network, which is a clique of the first d vertices labeled
and ensure that such edges are distinct without any parallel edges. We denote the set of vertices
The algorithm works in two phases. In the first phase of the algorithm (called Execute Copy Model), we execute the copy model for all vertices in parallel (using all threads). This phase creates all the direct edges and some of the "copy" edges (Equation 2). However, many copy edges might not be fully processed due to the dependencies. The incomplete copy edges are put in a waiting queue called Q. In the second phase of the algorithm (called Resolve Implete Edges), we resolve the incomplete edges from the waiting queue Q and finalize the copy edges. The pseudocode of cuPPA is given in Algorithm 1.
In the first phase (Line 3-21) the algorithm executes the copy model for all of its vertices. The edges that could not be completed are stored in a queue Q to be processed later. We call the queue a waiting queue. Each of the other vertices from d to n − 1 generates d new edges. There are fundamentally two important issues that need to be handled: i) how we select F (v) for vertex v where 1 ≤ ≤ d, and ii) how we avoid duplicate edge creation. Multiple edges for a vertex v are created by repeating the same procedure d times (Line 4), and duplicate edges are avoided by simply checking if such an edge already exists -such a check is done whenever a new edge is created.
For the -th edge of a vertex v, another vertex u is chosen from [1, v−1] uniformly at random (Line 5, 6). Edge (v, u) is l ← a uniform random integer in created with probability p (Line 7). However, before creating such an edge (v, u) in Line 8, the existence of such an edge is checked immediately before creating them in Line 9. If the edge already exists at that time, the edge is discarded and the process is repeated again (Line 5). With the remaining 1 − p probability, v is connected to some vertex in F(u); that is, we make an edge (v, F (u)), such that is chosen from [1, d] uniformly at random. After the first phase is completed, the algorithm starts to resolve all incomplete edges by processing the waiting queue (Lines 22-29). If an item in the current queue Q could not be resolved during this step, it is subsequently placed in another queue Q . After all incomplete edges on the queue Q are processed, the queues Q and Q are swapped and Q is cleared. We repeat this process until both the queues are empty.
A. Graph Representation
We use one array G of nd elements to represent and store the entire graph. Each vertex u connects to d existing vertices. The neighbors of u are stored between the indices inclusive from ud to (u + 1)d − 1 that represents the other end-point vertices. We call these indices
B. Partitioning and Load Balancing
Recall that we distribute the computation among the threads by partitioning the set of vertices V = {0, 1, . . . , n− 1} into T subsets V 0 , V 1 , . . . , V T −1 as described at the beginning of Section III, where T is the number of available threads. Although several partitioning schemes are possible, our study suggests that the Round Robin Partitioning (RRP) scheme best suits our algorithm. In this scheme, vertices are distributed in a round robin fashion among all threads. Partition V i contains the vertices i, i+T, i+2T, . . . , i+kT such that i + kT ≤ n < i + (k + 1)T ; that is, V i = {j|j mod T = i}. In other words, vertex i is assigned to set V i mod T . Therefore, the number of vertices in the sets are almost equal., i.e., the number of vertices in a set is either n T or n T . The round robin partitioning scheme is illustrated in Figure 1 . 
C. Segmented Round Robin Paritioning
However, the naïve round robin scheme discussed above also has some technical issues. As described in Section III, the first phase of the Algorihm 1 executes the copy model for every vertex assigned to it and stores any unresolved copy edge in the waiting queue. In the second phase, the algorithm takes out each unresolved edge from the waiting queue and tries to resolve them. To reduce the memory latency accessing the waiting queue, we store the waiting queue Q in the GPU shared memory that offers many folds faster memory access than the global GPU memory. Note that this memory is limited in capacity and is shared among all threads running within the same block. Modern GPUs such as NVidia GeForce 1080 have 48 KB of ultra-fast shared memory per block. Since the amount of the shared memory is very limited, it can only store a limited number of unresolved items in the queue. Let C denotes the total capacity of the waiting queue. For example, with a 48 KB of shared memory, we have a total capacity to store C = 48×1024 8 = 6144 items in the waiting queue where each item takes 8 bytes of memory. If we use τ threads per block, each thread will have a capacity of C τ items to be placed in the waiting queue. Therefore, if the number of vertices assigned to a thread is too large, it may generate a large number of unresolved copy edges to be placed in the waiting queue, essentially forcing the algorithm to use a large amount of GPU memory instead of the available shared memory. In order to exploit the faster shared memory without overflowing the waiting queue capacity, we use a modified round robin partitioning scheme called, Segmented Round Robin Partitioning (SRRP). In this scheme, the entire set of vertices V is first partitioned into some k consecutive subsets S 1 , S 2 , S 3 . . . S k called segments. From the definition of the copy model, it is clear that vertices on a segment S i may only depend on vertices on segment S j where i ≥ j but not vice versa. Therefore, the segments have to be processed in a consecutive fashion. Let B i = |S i | denotes the number of elements (also called the segment size) in segment S i where 1 ≤ i ≤ k. Next, the parallel algorithm is executed in k consecutive rounds where round i executes the parallel algorithm for all the vertices in segment S i . In round i, the B i vertices in segment S i are further partitioned into T subsets V 0 (S i ), V 1 (S i ), . . . V T −1 (S i ), using the round robin scheme discussed above and executed in parallel using the T threads. The technique is illustrated in Figure 2 .
Next, we need to determine the best segment size to avoid overflow while using the shared memory. From the copy model, it is easy to see that the lower the probability p is, the more likely it is to be in the waiting queue. In the worst case, when p = 0, all generated edges consist of copy edges. Therefore, at most d unresolved copy edges could be placed in the waiting queue per vertex. Additionally, as the value of d gets bigger, the number of copy edges increases and hence, the waiting queue size increases. Therefore, p and d both have a significant impact on the required size of the waiting queue. Having that in mind, we use two approaches for the segment size:
• Fixed Segment Size: The simplest way is to use a fixed sized segments in each round. From the previous discussion, it is clear that in the worst case we need d items per vertex to be placed on the waiting queue. Therefore, we can use up to τ = min C d , θ threads per block where C is the total queue capacity and θ is the maximum number of threads per block. Then the segment size is C dτ vertices per segment. Note that we can exploit the shared memory for d ≤ C, otherwise we need to use the global memory. However, in almost all practical scenarios we have d C, hence, we can take advantages of the shared memory.
• Dynamic Segment Size: Although the fixed segment size scheme ensures that the queue will not overflow in any round, it may not be the most efficient implementation. We use another scheme where the segment size is determined dynamically between two rounds based on the current state of the algorithm. In this scheme, we start with the number of threads per block τ and the segment size C dτ vertices per segment as was done in the Fixed Segment Size scheme. However, at the end of each round, we determine the maximum number of items that were placed in the waiting queue per thread. If the number of items placed in the waiting queue in the round is less than some f factor of the waiting queue capacity per thread C τ , we increase the total capacity C by a factor of f (typically, we set f = 2). Before the next round, we recompute the required number of threads per block and update the segment size accordingly.
D. CUDA-Specific Deadlock Scenario
In the round robin scheme, completion of a copy edge of a vertex in a thread T i may depend on some other thread T j where i = j. Due to the nature of dependency, T j also may have a copy edge that depends on another vertex that belongs to T i . Therefore, if any of these threads are not running simultaneously on the GPU, the other thread will not be able to complete and a deadlock situation may arise. To avoid such a situation, we must ensure that either all the GPU threads are running concurrently or the dependent threads are put to sleep for a while. In the current CUDA framework, the runtime engine schedules each kernel block to a streaming multiprocessor, and the blocks of running threads are non-preemptible. Therefore, to ensure that threads are running concurrently to avoid deadlock situation, we cannot use more blocks than the number of available streaming multiprocessors. Note that the upcoming CUDA runtime supports cooperative groups. On such future systems, the deadlock situation could be avoided using block sizes larger than the number of shared multiprocessors * .
IV. EXPERIMENTAL RESULTS
In this section, we evaluate our algorithm and its performance by experimental analysis. We demonstrate the accuracy of our algorithm by showing that our algorithm produces networks with power-law degree distribution as desired. We also compare the runtime of our algorithm using several sequential and parallel algorithms.
A. Hardware and Software
We used a computer consisting of 24 AMD Opteron(tm) 6174 processor with an 800 MHz clock speed and 64GB system memory. The machine also incorporates a NVidia 1080 GPU with 8GB memory. The operating system is Ubuntu 16.04 LTS, and all software on this machine was compiled with GNU gcc 4.6.3 with optimization flags -O3. The CUDA compilation tools V8 were used for the GPU code along with nvcc compiler.
B. Degree Distribution
To demonstrate the accuracy of cuPPA, we compared it with the Sequential Barabási-Albert (SBA) [7] and the Sequential Copy Model (SCM) algorithms. The degree distributions of the networks generated by SBA, SCM, and cuPPA are shown in Figure 3 in a log-log scale. We used n = 500M vertices, each generating d = 4 new edges with a total of two billion (2×10 9 ) edges. The distribution is heavy-tailed, which is a distinct feature of the powerlaw networks. The exponent γ of the power-law degree distribution is measured to be 2.7. This supports the fact that for the finite average degree of a scale-free network, the exponent should be 2 < γ < ∞ [12] . Also notice that the degree distributions of SBA and SCM are quite identical, experimentally verifying Theorem 1. The degree distribution of cuPPA is also similar to both SBA and SCM.
C. Visualization of Generated Graphs
In order to gain an idea of the structure and degree distributions, we obtained a visualization of some of the networks generated by our algorithm. We generated the visualizations using a popular network visualization tool called Gephi. Bearing aesthetics in mind and to minimize undue clutter, we focused on a few small networks by choosing n = 10000, p = 0. 
D. Effect of Edge Probability on Degree Distribution
As mentioned earlier, the strength of the copy model is the capability of generating other preferential attachment networks by simply varying one parameter, namely, the probability p. In Figure 7 we display the degree distribution of the generated networks by varying p. When p = 0, all edges are produced by copy edges, and thus the network Figure 6 : Visualization of networks generated by cuPPA using n = 10000, p = 0.5 and d = 4.
becomes a star network where all additional vertices connect to the d initial vertices. With a small value of p (p = 0.01), we can generate a network with a very long tail. When we set p = 0.5, we get the Barabási-Albert networks which exhibit a straight line in log-log scale. When we increase p to 1, we get a network consists entirely of direct edges that do not form any tail. 
E. Waiting Queue Size
As mentioned in Section III-C, the waiting queue size depends on p and d. To evaluate the impact of p and d, we ran simulations using 1280 CUDA threads (20 blocks and 64 threads per block) where each thread only executed one vertex. The value of p is varied from 0 to 1 with different probability values. We also varied the value of d from 1 to 4096 as increasing powers of 2. In Figure 8 , we show the number of items placed in the waiting queue per vertex for different combinations of p and d. We also added the worst case value as a line in the plot. As seen from the figure, in the worst case with p = 0, the maximum size of the waiting However, as the round progresses, the maximum size of the waiting queue decreases significantly as shown in Figure 9 . For this figure, we also ran cuPPA using 1280 CUDA threads (20 blocks and 64 threads per block) to generate networks with d = 512, 256, 128, 64 and p = 0.5. Each CUDA thread processes exactly one vertex per round. Only the first 100 rounds are shown for brevity. From Figure 9 , we can see that as the round progresses, the size of the waiting queue per round decreases dramatically for all different values of d. This indicates that we could process more vertices in later rounds using the same amount of queue memory. Therefore, we can dynamically change the size of the segments between two consecutive rounds to increase parallelization. Based on these observations regarding the size of the waiting queue, we designed an adaptive version of cuPPA that monitors the maximum size of the waiting queue and manages the segment size accordingly as discussed in Section III-C. We call this version cuPPA-Dynamic and use it for all other experiments.
F. Runtime Performance
In this section, we analyze the runtime and performance of cuPPA relative to other algorithms and show the variation of performances against various parameters.
1) Runtime Comparison with Existing Algorithms:
To the best of our knowledge, our algorithm is the first GPU-based parallel algorithm to generate preferential attachment networks. Therefore, it is not possible to compare the runtime with other GPU-based algorithms. Instead, we compare with the existing non-GPU algorithms. The runtimes of cuPPA and the existing algorithms are shown in Figure 10 for generating two billion edges (n = 500M , d = 4).
• Sequential Algorithms: We compare cuPPA with two efficient sequential algorithms: SBA [7] and SCM [6] . As shown in Figure 10 , cuPPA generates the network in just 2.72 seconds in the NVidia 1080 GPU with more than 100× speedup.
• Parallel Algorithms: We also compared cuPPA with a distributed-memory (PPA-DM) [8] and a sharedmemory (PPA-SM) [11] parallel algorithms. As shown in Figure 10 , cuPPA outperforms PPA-DM on a system with 24 processors. The main reason is that unlike PPA-DM, cuPPA does not require complex synchronizations and message communications. Due to the unavailability of the PPA-SM code, we compared the runtime to generate the largest graph (n = 10 7 , d = 10) reported in [11] with the corresponding runtime of cuPPA. PPA-SM generates the network using 16 cores of Intel Xeon CPU E5-2698 2.30 GHz in approximately 7.5 seconds, whereas cuPPA generates the same network in just 0.3 second. From Figures 11 and 12 , we can observe that for any fixed set of values for p and d, with increasing n, the runtime increases linearly, indicating that the algorithm scales very well with increasing value of n.
3) Runtime vs. Degree of Preferential Attachment: Next, we examine the runtime performance of cuPPA with increasing d. The runtime is shown in Figure 13 . Here, we set n = 7812500, vary p = {0, 0.00001, 0.001, 0.25, 0. Figure 14 Then the runtime reduces almost linearly up to p = 0.9 and then reduces sharply towards p = 1. With lower values of p, most of the edges are produced by copy edges. Therefore, the size of the waiting queue increases, thereby increasing the runtime. As the value of p increases towards 1, most of the edges are created using direct edges and, therefore, fewer items are stored in the waiting queue.
5) Runtime varied with the number of Threads:
To understand how the performance of cuPPA depends on the number of threads, we set p = 0.5 and used four different sets of n and d to generate networks. We also varied the number of CUDA threads per block from 64, 128, 256, 512, to 1024. The runtime (solid lines) and the relative speedup (dashed lines) of the experiments are shown in Figure 15 . Let t T be the runtime of cuPPA using T threads. Then, the relative speedup is defined as t T t64 in this experiments, i.e., the speedup gained compared to the runtime of cuPPA using 64 threads. Figure 15 is shown in two y-axes, the left and right axis correspond to the runtime and relative speedup respectively. From the figure, the best performance is observed with 512 threads per block for all cases. Therefore, in our final algorithm, we use up to 512 threads per block. Number of Threads (per Block) Runtime (seconds)
Relative Speed Up n=125M, d=16 n=250M, d=8 n=500M, d=4 n=62.5M, d=32 
V. RELATED WORK
Although the concepts of random networks have been used and well studied over the last several decades, efficient algorithms to generate the networks were not available until recently. The first efficient sequential algorithm to generate Erdős-Rényi and Barabási-Albert networks was proposed in [7] . A distributed memory-based algorithm to generate preferential attachment networks was proposed in [13] . However, their algorithm was not exact, rather an approximate algorithm and required manually adjusting several control parameters. The first exact distributed memorybased parallel algorithm using the copy model was proposed in [8] . Another distributed memory-based parallel algorithm using the Barabási-Albert model was proposed in [14, 15] . However, instead of using pseudorandom number generators, they used hash functions to generate the networks. A sharedmemory-based parallel algorithm using the copy model was proposed in [11] . Recursive matrix (RMAT) [16] and stochastic Kronecker graph (SKG) [17] are two popular models that can generate networks with power-law degree distributions. Due to its simpler parallel implementation SKG has been choosen in the Graph500 [18] benchmarks.
However, there exist very few GPU-based network generators. A GPU-based algorithm for generating Erdős-Rényi networks was presented in [19] . Another GPU-based algorithm for generating networks using the small-world model [20] was presented in [21] . However, no GPU-based algorithm exists for the preferential attachment model or for generating scale-free networks.
VI. CONCLUSION
A novel GPU-based algorithm, named cuPPA, has been presented, with a detailed performance study, and its combination of its scale and speed has been tested by achieving the ability to generate networks with up to two billion edges in under three seconds of wall clock time. The algorithm is customizable with respect to the structure of the network by varying a single parameter, namely, a probability measure that captures the preference style of new edges in the preferential attachment model. Also, a high amount of concurrency in the generator's workload per thread or processor is observed when that probability is at very small fractions greater than zero. In future work, we intend to exploit code profiling tools for further optimization of the runtime and memory usage on the GPU. Also, the algorithm needs to be extended to exploit multiple GPUs that may be co-located within the same node. This would require periodic data synchronization across GPUs, which can be efficiently achieved using the NVidia Collective Communication Library (NCCL). Additional future work involves porting to GPUs spanning multiple nodes, and also hybrid CPU-GPU scenarios in order to utilize unused cores of multi-core CPUs. Methods to incorporate other network generator models can also be explored with our cuPPA as a starting point. Finally, future work is needed in converting our internal, GPU-based graph representation to other popular network formats for usability.
