Accelerating the Gillespie Exact Stochastic Simulation Algorithm Using Hybrid Parallel Execution on Graphics Processing Units
The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques for simulating reaction kinetics in situations where the concentrations of the reactants are too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and, due to the stochastic nature of the simulation, the need for multiple runs in parameter sweep exercises. Even very efficient variants of the GSSA are prohibitively expensive for such parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration using graphics processing units (GPUs). We parallelize the execution of a single realization across the threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×–120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
Stoichiometric Data Structure.
The stoichiometric matrix is stored as a linear array. Since each reaction has at most 2 reactants and 2 products, each entry in the array (corresponding to one reaction) has 4 data fields. Each data field has 2 parts: the first is the index of the species and the second is the change in its molecular count. The stoichiometric data structure is common across all runs, and therefore only a single copy is maintained in global memory on the GPU.
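As a concrete illustration, here is a minimal CUDA sketch of this packed layout; the names (`SpeciesChange`, `StoichEntry`, `applyReaction`) are illustrative assumptions, not identifiers from the paper.

```cuda
// One packed entry per reaction; field names are illustrative.
struct SpeciesChange {
    int species;   // index of the species
    int delta;     // signed change in its molecular count
};

struct StoichEntry {
    // At most 2 reactants and 2 products per reaction -> 4 data fields.
    // Unused fields can be marked with species = -1.
    SpeciesChange change[4];
};

// Applying the state change when reaction r fires:
__device__ void applyReaction(const StoichEntry* stoich, int* counts, int r) {
    for (int f = 0; f < 4; ++f) {
        int s = stoich[r].change[f].species;
        if (s >= 0) counts[s] += stoich[r].change[f].delta;
    }
}
```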
Updating Reaction Block Partial Sums.
For sparsely connected systems, we use one thread per block to compute the change Δ<i>σ<sub>i</sub></i> in the sum of propensities within a block <i>B<sub>i</sub></i>. For densely connected systems, the whole thread warp is used. The changes in the partial block sums are found for all <i>i</i> = 1, 2, …, <i>p</i> in parallel using a parallel prefix sum. Finally, the block partial propensity sums are updated in parallel as <i>σ<sub>i</sub></i> ← <i>σ<sub>i</sub></i> + Δ<i>σ<sub>i</sub></i>.
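A hedged sketch of how such a warp-level update might look in CUDA, assuming lane <i>i</i> owns block <i>B<sub>i</sub></i> and that a shuffle-based inclusive scan serves as the parallel prefix sum; the names (`updateBlockSums`, `blockSum`, `cumBlockSum`, `deltaSum`) are illustrative:

```cuda
// Hedged sketch: lane i adds the change to its block sum sigma_i, then a
// shuffle-based inclusive warp scan rebuilds the cumulative block sums
// used later by the three-level search. All names are illustrative.
__device__ void updateBlockSums(float* blockSum,        // sigma_i per block
                                float* cumBlockSum,     // sum of sigma_1..sigma_i
                                const float* deltaSum,  // Delta sigma_i
                                int p)                  // number of blocks (<= 32)
{
    int lane = threadIdx.x & 31;
    float s = 0.0f;
    if (lane < p) {
        blockSum[lane] += deltaSum[lane];  // sigma_i <- sigma_i + Delta sigma_i
        s = blockSum[lane];
    }
    // Inclusive parallel prefix sum across the warp.
    for (int off = 1; off < 32; off <<= 1) {
        float up = __shfl_up_sync(0xffffffffu, s, off);
        if (lane >= off) s += up;
    }
    if (lane < p) cumBlockSum[lane] = s;
}
```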
Three Level Search for Finding Index <i>f</i> of the Reaction to be Fired.
In <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone-0046693-g002" target="_blank">Figure 2(a)</a>, we use a warp voting function to find the block <i>B<sub>l</sub></i> in which <i>r<sub>f</sub></i> occurs. The cumulative block sums are maintained through incremental updates during the simulation. In <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone-0046693-g002" target="_blank">Figure 2(b)</a>, we use reduction and warp voting functions to find the chunk <i>C<sub>m</sub></i> in block <i>B<sub>l</sub></i> in which <i>r<sub>f</sub></i> occurs. Finally, in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone-0046693-g002" target="_blank">Figure 2(c)</a>, we use a parallel prefix sum and warp voting to find the index of <i>r<sub>f</sub></i> in chunk <i>C<sub>m</sub></i>.
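The first level of this search might look like the following CUDA sketch, with `__ballot_sync` playing the role of the warp voting function (`findBlock` and `cumBlockSum` are illustrative names); levels two and three repeat the same vote-and-pick pattern within the chosen block and chunk:

```cuda
// Hedged sketch of level one: each lane votes on whether the search target
// falls at or below its block's cumulative propensity sum; the first lane
// whose vote succeeds owns the block B_l containing r_f.
__device__ int findBlock(const float* cumBlockSum, int p, float target) {
    int lane = threadIdx.x & 31;
    bool hit = (lane < p) && (target <= cumBlockSum[lane]);
    unsigned ballot = __ballot_sync(0xffffffffu, hit);
    return __ffs(ballot) - 1;  // index of first set bit; -1 if out of range
}
```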
Reaction-Reaction Dependency Graph.
Two arrays are used to represent the dependency graph for each reaction <i>r<sub>j</sub></i>. The first is an array of indices. The second is the array of dependent reactions, sorted by the block in which they occur. The index array indicates the end point of each block within the dependent reaction array. Each element of the dependent reaction array contains the type of the reaction <i>T</i>, the global index of the reaction <i>r<sub>m</sub></i>, the reaction constant <i>k<sub>m</sub></i>, and the indices of the reactant species <i>s<sub>a</sub></i>, <i>s<sub>b</sub></i>. The reaction-reaction dependency graph is common across all runs, and therefore only a single copy is maintained in global memory on the GPU.
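A hedged CUDA sketch of this two-array layout, assuming the index array stores cumulative end offsets into a single flat dependent reaction array, and showing the bimolecular propensity formula as one illustrative case (all names are assumptions):

```cuda
struct DependentReaction {
    int   type;       // reaction type T
    int   reaction;   // global reaction index r_m
    float k;          // reaction constant k_m
    int   sa, sb;     // reactant species indices s_a, s_b
};

// Visit the dependents of reaction rj that fall in reaction block b.
// blockEnd holds p cumulative end offsets per reaction into deps; both
// arrays are shared by all runs, so one copy lives in global memory.
__device__ void visitBlock(const DependentReaction* deps, const int* blockEnd,
                           int rj, int b, int p,
                           float* propensity, const int* counts)
{
    int idx   = rj * p + b;
    int start = (idx == 0) ? 0 : blockEnd[idx - 1];
    for (int d = start; d < blockEnd[idx]; ++d) {
        const DependentReaction& dep = deps[d];
        // Recompute the dependent's propensity from its constant and the
        // current molecular counts (bimolecular case shown; the type T
        // would select the proper formula).
        propensity[dep.reaction] = dep.k * counts[dep.sa] * counts[dep.sb];
    }
}
```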
Performance Benchmarks.
<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone-0046693-g007" target="_blank">Figure 7(a)</a> shows the results for the Colloidal Aggregation Model (the strongly connected system). For <i>N</i> chemical species, the number of reactions grows as <i>O</i>(<i>N</i><sup>2</sup>). The number of dependent reactions is 3<i>N</i>–7 and therefore scales with the number of species. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone-0046693-g007" target="_blank">Figure 7(b)</a> shows the results for the Cyclic Chain Model (the weakly connected system). For <i>N</i> chemical species, this network has <i>M</i> = <i>N</i> reactions. The initial molecular counts of all species [<i>s<sub>i</sub></i>] were set to 1, and all reaction constants <i>k<sub>i</sub></i> were set to 1; these initial conditions are the same as in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone.0046693-Ramaswamy1" target="_blank">[13]</a>. The graphs for PSSA-CR and SPDM were obtained from <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone.0046693-Ramaswamy1" target="_blank">[13]</a>. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone-0046693-g007" target="_blank">Figure 7(c)</a> shows the results for the randomly generated system. The size of the dependent reaction list varies from 8 to 16 for these systems. Note that running the CR algorithm on all 4 cores (one realization per core) gives a performance gain of less than 4× over a single-core run; a comparable gain could be expected of all other serial CPU algorithms as well.
Performance w.r.t. Number of Reaction Blocks.
<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone-0046693-g009" target="_blank">Figure 9(a)</a> shows the results for a weakly connected system. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046693#pone-0046693-g009" target="_blank">Figure 9(b)</a> shows the results for a strongly connected system. Note that in both instances, the warp-per-block performance initially increases with the number of blocks and then decreases. This is because the warp-per-block approach performs a parallel prefix sum over each reaction block, which becomes expensive as the number of blocks grows.
Performance w.r.t. Number of Realizations.
Both the strongly connected and the weakly connected systems show a decrease in the time per update as the number of realizations is increased. This decrease flattens out, indicating saturation of the GPU's computing power. The strongly connected system here had <i>N</i> = 252 species and the weakly connected system had <i>N</i> = 50,000.
Write-out process.
The thread id indicates (based on the computation __popc(B)) whether a given thread is writing out an element less than the pivot or greater than or equal to the pivot. Two offsets, maintained in shared memory and updated incrementally, give the locations in the global array of the last element written that is less than the pivot and of the last element written that is greater than or equal to the pivot. This operation involves at most two coalesced memory writes.
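A hedged CUDA sketch of one such warp-level write-out, assuming the two sides of the pivot go to two separate regions of the global array and that the shared-memory offsets `lowEnd`/`highEnd` track the end of each region (all names are illustrative):

```cuda
// Hedged sketch of the warp write-out. __popc over the warp ballot gives
// each lane its rank on its side of the pivot, so each side is written
// with one coalesced memory transaction.
__device__ void writeOut(float* globalArr, float val, float pivot,
                         int* lowEnd, int* highEnd)  // shared-memory offsets
{
    int lane = threadIdx.x & 31;
    bool below = (val < pivot);
    unsigned B = __ballot_sync(0xffffffffu, below);

    unsigned lowerLanes = (1u << lane) - 1u;   // mask of lower lanes
    int rankBelow = __popc(B & lowerLanes);    // rank among "< pivot" lanes
    int rankAbove = __popc(~B & lowerLanes);   // rank among ">= pivot" lanes

    if (below)
        globalArr[*lowEnd + rankBelow] = val;  // coalesced write, side 1
    else
        globalArr[*highEnd + rankAbove] = val; // coalesced write, side 2

    __syncwarp();                              // all lanes have read offsets
    if (lane == 0) {                           // advance both offsets
        *lowEnd  += __popc(B);
        *highEnd += __popc(~B);
    }
    __syncwarp();
}
```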
Finding vector norms.
Each thread block is assigned to compute the norm of one vector in the set. Each thread strides through the vector and computes the partial sum <i>s<sub>j</sub></i> = Σ<sub><i>k</i></sub> <i>x</i><sub><i>j</i>+<i>kT</i></sub><sup>2</sup>, where <i>T</i> is the number of threads in a thread block. Finally, an atomic add operation is used to add the partial sums from all threads into a single location in global memory on the GPU.
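A hedged sketch of such a kernel, launched with one thread block per vector (`vectorNorms` and its parameters are illustrative; the caller is assumed to zero `norms` beforehand, and the square root of each accumulated sum of squares would be taken afterwards):

```cuda
// Hedged sketch: one thread block per vector, strided accumulation of
// squared elements, atomicAdd to combine the per-thread partial sums.
__global__ void vectorNorms(const float* vecs, float* norms, int len) {
    const float* v = vecs + (size_t)blockIdx.x * len;  // this block's vector
    float s = 0.0f;
    // Thread t handles elements t, t + T, t + 2T, ... (T = blockDim.x).
    for (int k = threadIdx.x; k < len; k += blockDim.x)
        s += v[k] * v[k];
    atomicAdd(&norms[blockIdx.x], s);  // accumulate the squared norm
}
```

A launch such as `vectorNorms<<<numVectors, 256>>>(d_vecs, d_norms, len)` then yields one squared norm per vector.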