35 research outputs found

Accelerating the Gillespie Exact Stochastic Simulation Algorithm Using Hybrid Parallel Execution on Graphics Processing Units

Author: Ivan Komarov (120593)
Roshan M. D'Souza (120596)
Publication venue
Publication date: 09/11/2012
Field of study

The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8× to 120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
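For orientation, the classic serial direct method that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the textbook algorithm, not the GPU variant the abstract describes. The function names, the `(reactants, changes, k)` tuple layout, and the simple mass-action propensity are assumptions made for this example.

```python
import math
import random

def propensity(state, reactants, k):
    """Mass-action propensity for distinct reactant species (assumed form)."""
    a = k
    for s in reactants:
        a *= state[s]
    return a

def gillespie_direct(state, reactions, t_end, rng):
    """Run one realization; reactions are (reactants, changes, k) tuples (hypothetical layout)."""
    state = list(state)
    t = 0.0
    while True:
        props = [propensity(state, r, k) for r, _, k in reactions]
        total = sum(props)
        if total <= 0.0:
            break  # no reaction can fire any more
        # The waiting time to the next reaction is exponential with rate `total`.
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end:
            break
        # Select reaction f with probability props[f] / total (linear search).
        target = rng.random() * total
        acc = 0.0
        for f, a in enumerate(props):
            acc += a
            if target < acc:
                break
        # Apply the stoichiometric changes of the selected reaction.
        for species, delta in reactions[f][1]:
            state[species] += delta
    return state
```

The linear search over `props` and the full recomputation of every propensity at each step are the serial costs that the partial-sum and dependency-graph structures in the entries below are meant to reduce.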

Directory of Open Access Journals

PubMed Central

FigShare

Stoichiometric Data Structure.

Author: Ivan Komarov (120593)
Roshan M. D'Souza (120596)

The stoichiometric matrix is stored as a linear array. Since each reaction has at most 2 reactants and 2 products, each entry in the array (corresponding to one reaction) has 4 data fields. Each data field has 2 parts: the first is the index of the species and the second is the change in its molecular count. The stoichiometric data structure is common across all runs, and therefore only a single copy is maintained in the global memory on the GPU.
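The layout described above can be sketched in a few lines. This is an illustrative CPU-side version only. The names `pack_reactions`, `apply_reaction`, and `UNUSED`, and the use of a padding entry for reactions with fewer than four participants, are assumptions that do not come from the source.

```python
# Assumed layout: 4 fields per reaction, each field is (species_index, change).
FIELDS_PER_REACTION = 4
UNUSED = (-1, 0)  # hypothetical padding for reactions with fewer than 4 participants

def pack_reactions(reactions):
    """Flatten per-reaction (species, change) pairs into one linear list."""
    flat = []
    for fields in reactions:
        if len(fields) > FIELDS_PER_REACTION:
            raise ValueError("at most 2 reactants and 2 products are supported")
        padded = list(fields) + [UNUSED] * (FIELDS_PER_REACTION - len(fields))
        flat.extend(padded)
    return flat

def apply_reaction(flat, counts, r):
    """Update the molecular counts in place after reaction r fires."""
    base = r * FIELDS_PER_REACTION
    for species, change in flat[base:base + FIELDS_PER_REACTION]:
        if species >= 0:  # skip padding entries
            counts[species] += change

# Example: r0 is A + B -> C and r1 is C -> A + B (species indices 0, 1, 2).
stoich = pack_reactions([
    [(0, -1), (1, -1), (2, +1)],
    [(2, -1), (0, +1), (1, +1)],
])
counts = [5, 3, 0]
apply_reaction(stoich, counts, 0)
```

Because every reaction occupies the same number of fields, the entry for reaction `r` starts at a fixed offset, so no per-reaction pointer is needed.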

FigShare

Updating Reaction Block Partial Sums.

Author: Ivan Komarov (120593)
Roshan M. D'Souza (120596)

For sparsely connected systems, we use one thread per block to compute the change in the sum of propensities within a block Bi. For densely connected systems, the whole thread warp is used. The changes in the partial block sums are found for all i = 1, 2…p in parallel using the parallel-prefix sum. Finally, the block partial propensity sums are updated in parallel.
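A serial sketch of this update, under one reading of the caption: each block contributes the change in its own propensity sum, an inclusive prefix sum turns those changes into changes of the cumulative block sums, and the cumulative sums are then corrected element-wise. The function names and the assumption that blocks are contiguous ranges of `block_size` reactions are illustrative, not taken from the source.

```python
from itertools import accumulate

def compute_block_deltas(old_props, new_props, block_size):
    """Change in the propensity sum of each block (blocks assumed contiguous)."""
    n_blocks = (len(old_props) + block_size - 1) // block_size
    deltas = [0] * n_blocks
    for j, (old, new) in enumerate(zip(old_props, new_props)):
        deltas[j // block_size] += new - old
    return deltas

def update_cumulative_block_sums(cum_sums, deltas):
    """Add the prefix-summed per-block changes to the cumulative block sums."""
    # Inclusive prefix sum: entry i is the total change in blocks 0..i.
    prefix = list(accumulate(deltas))
    # On a GPU each element of this list could be updated by its own thread.
    return [c + d for c, d in zip(cum_sums, prefix)]
```

On the GPU the loop in `compute_block_deltas` would run over only the dependent reactions of the fired reaction rather than over all propensities; the full scan here just keeps the sketch short.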

FigShare

Three Level Search for Finding Index f of the Reaction to be Fired.

Author: Ivan Komarov (120593)
Roshan M. D'Souza (120596)

In Figure 2(a), we use a warp voting function to find the block Bl in which rf occurs. The cumulative block sums are maintained through incremental updates during simulation. In Figure 2(b), we use reduction and warp voting functions to find the chunk Cm in block Bl in which rf occurs. Finally, in Figure 2(c), we use a parallel prefix sum and warp voting to find the index of rf in chunk Cm.
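The three levels can be illustrated with a serial sketch in which binary search stands in for the warp voting functions. The function name, the `target` parameter (a uniform random number scaled by the total propensity), and the contiguous block and chunk layout are assumptions made for this example.

```python
from bisect import bisect_right
from itertools import accumulate

def three_level_search(props, cum_block_sums, block_size, chunk_size, target):
    """Return the index f of the first reaction whose running propensity sum exceeds target.

    Assumes 0 <= target < total propensity.
    """
    # Level 1: first block whose cumulative sum exceeds the target.
    l = bisect_right(cum_block_sums, target)
    offset = cum_block_sums[l - 1] if l > 0 else 0
    block_start = l * block_size
    block = props[block_start:block_start + block_size]
    # Level 2: reduce each chunk of the block to its sum, then locate the chunk.
    chunk_sums = [sum(block[i:i + chunk_size]) for i in range(0, len(block), chunk_size)]
    cum_chunks = [offset + c for c in accumulate(chunk_sums)]
    m = bisect_right(cum_chunks, target)
    if m > 0:
        offset = cum_chunks[m - 1]
    chunk = block[m * chunk_size:(m + 1) * chunk_size]
    # Level 3: prefix sum inside the chunk, then locate the reaction.
    cum_chunk = [offset + c for c in accumulate(chunk)]
    return block_start + m * chunk_size + bisect_right(cum_chunk, target)
```

Only the block sums need to be kept up to date between steps; the chunk sums and the in-chunk prefix sum are recomputed for the single block and chunk that the search actually enters.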

FigShare

Reaction-Reaction Dependency Graph.

Author: Ivan Komarov (120593)
Roshan M. D'Souza (120596)

Two arrays are used to represent the dependency graph for each reaction rj. The first is an array of indices. The second is the array of dependent reactions, sorted by the block in which they occur. The index array indicates the end point for each block in the dependent reaction array. Each element of the dependent reaction array contains the type of the reaction T, the global index of the reaction rm, the reaction constant km, and the indices of the reactant species sa, sb. The reaction-reaction dependency graph is common across all runs, and therefore only a single copy is maintained in the global memory on the GPU.
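The two-array representation for a single reaction can be sketched as below. The field names mirror the caption (T, rm, km, sa, sb). The helper names, the rule that reaction `rm` lies in block `rm // block_size`, and the choice of a one-past-the-end convention for the index array are assumptions made for illustration.

```python
from collections import namedtuple

# One entry of the dependent reaction array, with the fields named in the caption.
Dependent = namedtuple("Dependent", ["T", "rm", "km", "sa", "sb"])

def build_dependency_arrays(dependents, block_size, n_blocks):
    """Sort one reaction's dependents by block and record each block's end point."""
    ordered = sorted(dependents, key=lambda d: d.rm // block_size)
    ends = []
    pos = 0
    for b in range(n_blocks):
        while pos < len(ordered) and ordered[pos].rm // block_size == b:
            pos += 1
        ends.append(pos)  # one past the last dependent that lies in block b
    return ends, ordered

def dependents_in_block(ends, ordered, b):
    """Slice out the dependents that fall in block b."""
    start = ends[b - 1] if b > 0 else 0
    return ordered[start:ends[b]]
```

Grouping the dependents by block means that the work for one reaction block touches a contiguous slice of the array, which fits the per-block update of the propensity sums.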

FigShare

Performance Benchmarks.

Author: Ivan Komarov (120593)
Roshan M. D'Souza (120596)

Figure 7(a) shows the results for the Colloidal Aggregation Model (the strongly connected system). For N chemical species, the number of dependent reactions is 3N–7 and therefore scales with the number of species. Figure 7(b) shows the results for the Cyclic Chain Model (the weakly connected system). For N chemical species, this network has M = N reactions. The initial molecular counts of all species [si] were set to 1. All reaction constants ki were set to 1. These initial conditions are the same as in [13]. The graphs for PSSA-CR and SPDM were obtained from [13]. Figure 7(c) shows the results for the randomly generated system. The size of the dependent reactions list varies from 8 to 16 for these systems. Note that running the CR algorithm on all 4 cores (one realization per core) gives a performance gain of less than 4× over a single-core run. A similar gain could be expected of all other serial CPU algorithms as well.

FigShare

Performance w.r.t. Number of Reaction Blocks.

Author: Ivan Komarov (120593)
Roshan M. D'Souza (120596)

Figure 9(a) shows the results for a weakly connected system. Figure 9(b) shows the results for a strongly connected system. Note that in both instances, the warp-per-block performance initially increases with the number of blocks and then decreases. This is because the warp-per-block approach includes a parallel-prefix sum for each reaction block, which can become expensive.

FigShare

Performance w.r.t. Number of Realizations.

Author: Ivan Komarov (120593)
Roshan M. D'Souza (120596)

Both the strongly connected and the weakly connected systems show a decrease in time per update as the number of realizations is increased. This decrease flattens out, indicating saturation of the GPU computing power. The strongly connected system here had N = 252 species and the weakly connected system had N = 50,000.

FigShare

Write out process.

Author: Ali Dashti (461880)
Ivan Komarov (120593)
Roshan M. D'Souza (120596)

The thread id indicates (based on the computation __popc(B)) whether a given thread is writing out an element less than the pivot or greater than or equal to the pivot. Two values, which are maintained in shared memory and updated incrementally, indicate the locations in the global array of the last element that is less than the pivot and of the last element that is greater than or equal to the pivot. This operation involves at most two coalesced memory writes.
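One common way to realize this kind of write-out is ballot-and-popcount compaction, sketched serially below. It is offered as an illustration of the idea rather than as the authors' exact scheme. The names `popc` and `write_out`, and the use of two separate output lists in place of two regions of one global array, are assumptions.

```python
def popc(x):
    """Count set bits, standing in for CUDA's __popc."""
    return bin(x).count("1")

def write_out(warp_values, pivot, out_lt, out_ge):
    """Append one warp's values to the two output regions, preserving their order."""
    # Ballot: bit i is set when lane i holds an element less than the pivot.
    ballot = 0
    for lane, v in enumerate(warp_values):
        if v < pivot:
            ballot |= 1 << lane
    n_lt = popc(ballot)
    # The lengths of the lists play the role of the two running offsets.
    lt_base, ge_base = len(out_lt), len(out_ge)
    out_lt.extend([None] * n_lt)
    out_ge.extend([None] * (len(warp_values) - n_lt))
    for lane, v in enumerate(warp_values):
        # Number of earlier lanes whose element is less than the pivot.
        below = popc(ballot & ((1 << lane) - 1))
        if (ballot >> lane) & 1:
            out_lt[lt_base + below] = v
        else:
            out_ge[ge_base + lane - below] = v
```

Each lane derives its destination from the shared bitmask alone, so lanes writing to the same region land in consecutive slots, which is what makes the two writes coalesced on a GPU.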

FigShare

Finding vector norms.

Author: Ali Dashti (461880)
Ivan Komarov (120593)
Roshan M. D’Souza (159250)

Each thread block is assigned to compute the norm of one vector. Each thread strides through the vector, with a stride equal to the number of threads in a thread block, and computes a partial sum over the elements it visits. Finally, an atomic add operation is used to add the partial sums of all threads into a location in global memory on the GPU.
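The strided accumulation can be sketched serially as follows. The caption does not say which norm is computed, so the Euclidean norm (a sum of squares followed by a square root) is an assumption here, as are the function name `block_norm` and the parameter `n_threads`.

```python
import math

def block_norm(vector, n_threads):
    """Euclidean norm of one vector, accumulated the way one thread block would."""
    total = 0.0  # stands in for the location in global memory
    for tid in range(n_threads):
        # Thread tid visits elements tid, tid + n_threads, tid + 2 * n_threads, ...
        partial = sum(x * x for x in vector[tid::n_threads])
        total += partial  # performed as an atomic add on the GPU
    return math.sqrt(total)
```

With this access pattern, neighbouring threads read neighbouring elements on every pass, so the reads from the vector are coalesced.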

FigShare