The Reduction Problem in CUDA and Its Simulation with P Systems by Ceterchi, Rodica et al.
The Reduction Problem in CUDA and Its
Simulation with P Systems
Rodica Ceterchi1, Miguel A´ngel Mart´ınez-del-Amor2, Mario J. Pe´rez–Jime´nez2
1 Faculty of Mathematics and Computer Science
University of Bucharest
14 Academiei st. 010014 Bucharest, Romania
E-mail: rc@fmi.unibuc.ro, rceterchi@gmail.com
2 Research Group on Natural Computing
Department of Computer Science and Artificial Intelligence
University of Sevilla
Avda. Reina Mercedes s/n, 41012 Sevilla, Spain
E-mail: mdelamor@us.es, marper@us.es
Summary. We introduce P systems with dynamic communication graphs which simu-
late the functioning of the CUDA architecture when solving the parallel reduction prob-
lem.
1 Introduction
Introduced in [13], P systems are powerful computational devices, with a high de-
gree of parallelism, whose functioning is inspired by biological processes at the level
of the cells, and of their membranes ([13],[14]). Among these processes, rewriting
and communication play an important role.
It is of interest to compare P systems with other, classical, parallel paradigms.
We have begun such a study in the form of simulating parallel classical architec-
tures with P systems, in particular, the perfect shuﬄe architecture [4], [5] and
the mesh architecture [6]. The reduction problem, being one of the most simple,
primary ones to be solved in different contexts, was used as an illustration.
In [7] some general guidelines were developed along which a wide class of par-
allel architectures can be simulated with P systems. P systems with dynamic
communication graphs were introduced. In their functioning, rewriting steps and
communication steps are separated and made more visible. They differ from the
systems introduced in [3] in that the communication graphs are inspired by the
particular parallel network architecture being simulated.
CUDA stands for Compute Unified Device Architecture [15, 10], and is a tech-
nology proprietary of NVIDIA Corp. Since its introduction in 2007, CUDA has
92 R. Ceterchi, M. A´ngel Mart´ınez-del-Amor, M.J. Pe´rez–Jime´nez
allowed programmers to take advantage of the inherent parallel architecture of
GPUs, which ranges from 240 cores (Tesla C1060, released on 2008) to 2880 cores
(Tesla K40, released on 2013). This is performed by using a threaded, shared-
memory, abstracted model of the GPU, which is implemented by C/C++ exten-
sions. CUDA has helped to establish GPU computing [9] as a sub-framework of
High Performance Computing. In fact, it has been successfully applied to a broad
spectrum of research areas, including Systems Biology and Population Dynamics
[12], and Membrane Computing [1, 2, 11], among others.
In the present paper we propose to simulate with P systems the reduction
problem as solved in CUDA. Section 2 is devoted to the presentation of several
improved versions of solving the reduction problem in CUDA. Section 3 is devoted
to the presentation of its simulation with P systems with dynamic communication
graphs.
2 Solving the Reduction Problem in CUDA
The GPU (Graphics Processor Unit) is the core of graphics cards. A GPU to-
day contains thousands of computing processors devoted for graphics. However,
novel techniques enable programmers to take advantage of this highly parallel ar-
chitecture for scientific computing. These are called GPGPU (General Purpose
computing on the GPU) [9].
A new era of GPGPU started with the introduction of CUDA (Compute Unified
Device Architecture) [15, 10] by NVIDIA. It offers a programming model that
abstracts the GPU architecture to programmers, so it is enough to learn some
extensions to C/C++ language (CUDA extensions), whereas the CUDA driver
will execute the code on the GPU. In the following sections we will introduce some
concepts and terminology of CUDA which are necessary to understand the work
presented in this paper.
2.1 CUDA programming model
The CUDA programming model assumes that the CPU (or host) takes control
of the execution flow, and permit the GPU (or device) to run many instances
of the same code in parallel. This code is called kernel, and it is executed by a
grid of threads. Typically, a grid is composed of thousands of threads, since the
creation of a sufficient number of threads to use all hardware resources requires a
large amount of data parallelism. The threads are arranged within the grid in a
two-level hierarchy, as seen in Figure 1. At the higher level, each grid consists of
one or more thread blocks. At the lower level, each block is organized as a three
dimensional array of threads. All blocks in a grid have the same number and
organization of threads. Each block is identified by a two dimensional identifier,
and each thread within its block by a three dimensional identifier (ID). Therefore,
any thread can be unequivocally identified by the union of both thread and thread
The Reduction Problem in CUDA and Its Simulation with P Systems 93
block identifiers. The execution of threads inside a block can be synchronized
by barrier operations ( syncthreads()), and threads of different blocks can be
synchronized only by finishing the execution of the kernel.
Fig. 1. Threading model in CUDA. Threads are executed in a grid, and they are orga-
nized in blocks.
The memory hierarchy is explicitly and manually managed in CUDA. This
memory model is composed in several levels, each one offering different speeds
and storage properties. We highlight the two most important ones: global memory
and shared memory. Global memory is the largest but the slowest memory in the
system. It is accessed by the host (where the input and output data are allocated)
and by any thread in execution. Shared memory is the smallest but fastest memory.
It is accessed by threads belonging to the same block. Normally, performance of
CUDA applications depends on how much shared memory is exploited. Thus, an
efficient way to structure an algorithm is as follows:
94 R. Ceterchi, M. A´ngel Mart´ınez-del-Amor, M.J. Pe´rez–Jime´nez
1. The threads of each block read its corresponding data portion from global
memory to shared memory (which is inevitable because the host only can put
the data in global memory).
2. Threads work with the data directly on the shared memory.
3. Threads copy these data back to global memory (so the host can retrieve the
result).
2.2 Modern GPU architecture
The GPU architecture has evolved in the last years, offering even more com-
pute capabilities. In general terms, it consists of a scalable processor array, orga-
nized in Streaming Multiprocessors (SMs) of Streaming Processors (SPs, or cores).
The number of them depends on the GPU. SMs are based on the SIMT (Single-
Instruction Multiple-Thread) model. Basically, in the SIMT model all the threads
execute the same instruction on different piece of data. SMs create, manage, sched-
ule and execute threads in groups of 32 threads (which is the branching granularity
of NVIDIA GPUs). This set of 32 threads is called warp, and each SM can han-
dle many of them. Individual threads of the same warp must start together at
the same program address. However, they are free to branch and execute indepen-
dently, but at cost of serialization and performance (in fact, SIMT is really applied
to the warp). If a warp is broken (because of branching or memory stall), the real
parallelism in CUDA is not achieved.
2.3 Performance considerations
Although CUDA programming model is flexible enough to run any kind of algo-
rithm, the achieved performance depends on how the programmer had designed
the code, and on the target GPU running the program. A CUDA programmer has
to perfectly know the CUDA programming model, but also the idea of the GPU
architecture, since it provides the restrictions to be considered in order to achieve
peak performance. There are several strategies to accomplish it. Next, we stand
out two of them:
• Emphasize parallelism: the warp is the branching granularity on CUDA; that
is, the parallelism unit. Thus, warps must be maximized with active threads,
but minimizing branch divergence between thread: they must be executing the
same instruction simultaneously to reach peak performance.
• Exploit memory bandwidth: the peak bandwidth of using both global and shared
memories is achieved mainly by an access pattern: coalesced access to contigu-
ous (aligned) memory positions. Data is transferred from memory to the GPU
hardware in blocks, which is formed by contiguous bytes in memory. Thus, we
must maximize these blocks with the access to contiguous memory addresses
by contiguous threads within a warp.
The Reduction Problem in CUDA and Its Simulation with P Systems 95
2.4 Parallel Reduction in CUDA
The reduction problem consists in applying an operator to a set of elements. Let us
assume a set of n elements {a1, . . . , an}, and the binary and associative, reduction
operator ⊕. The result of applying reduction to the set of elements is another
element a = a1 ⊕ a2 ⊕ . . . an. Reduction is a well-known primitive in Parallel
Computing, since it resides inside many important algorithms. For example, it can
be used to compute the sum or the maximum of an array of numbers. Nowadays,
reduce is part of the most used algorithm in Big Data and No-SQL data bases,
which is Map-Reduce.
A common way to solve this problem in parallel is by using a tree, in which
partial solutions are computed to reach the final one. The time complexity of this
solution is O(logn). The process is summarized in figure 2.
Fig. 2. Scheme of a parallel reduction solution
The most popular CUDA implementations for the reduction primitive can be
found on the speech given by Harris [8] in 2007. This document introduces seven
kernels from a didactic perspective, in a performance-increasing order. Next, we
discuss the first four ones.
Interleaved addressing
Let us assume that the input array of elements is of size n, and that the maximum
amount of threads per block is nt. That means that we will need to use nb = nnt
thread blocks to process the whole array. However, if each block compute reduce
for nt elements, what would we do with the nb partial results? The answer is to
have a second kernel, with a single block having nb threads, which will compute
96 R. Ceterchi, M. A´ngel Mart´ınez-del-Amor, M.J. Pe´rez–Jime´nez
again reduce. It is straightforward to add more kernels in this way until having
nb < nt.
In what follows, we will assume that the size of the input array of elements is
less than nt (maximum number of threads per block). This will mean that just one
thread block is enough to compute reduce. We will disregard the second kernel for
partial results.
A naive implementation of the tree solution in CUDA is to launch n threads
and correspond each one with an element. First, each thread with even ID (called
active threads) will compute the partial result with the next element, and the
process will continue by halving the number of active threads. This process, called
reduce0 (interleaved addressing), is shown in Figure 3 (white-arrowed threads are
inactive).
Fig. 3. Reduce0: interleaved addressing.
The main drawbacks of this solution are:
• Warps are not fulfilled. Since the addressing is interleaved, warps can be filled
by active threads up to the half. Therefore, we are not maximizing memory.
• Access to memory is also interleaved, so the peak bandwidth cannot be reached.
Sequential addressing
A solution to the drawbacks found in interleaved addressing is to compact the
memory accesses to contiguous threads. In order to implement this approach, it
The Reduction Problem in CUDA and Its Simulation with P Systems 97
will be necessary to change the tree of the reduce primitive solution. Now, the
first half of threads will access to the first half of the array to compute the partial
solutions with the second half. It can be seen that the access is coalesced in this
way, as well as warps are also fulfilled. Figure 4 shows the scheme of this approach,
called reduce 3, sequential addressing. Again, white-arrowed threads are inactive
in the beginning of the process.
Fig. 4. Reduce3: sequential addressing.
As it can be noticed, the main drawback of this solution is that half of the
threads are idle on first loop iteration, what is a waste of resources.
First add during load
The third approach, called reduce4 first add during load, is based on taking ad-
vantage of the threads which are inactive at the beginning of reduce3. The idea is
to halve the number of blocks, and to use all the threads at the beginning to apply
the reduction operation to the corresponding element with the element that would
have corresponded to the avoided extra block. That is, reduce4 will compute first
the array {a0,8, a1,9, a2,10, a3,11, a4,12, a5,13, a6,14, a7,15}, and proceed as in reduce3.
98 R. Ceterchi, M. A´ngel Mart´ınez-del-Amor, M.J. Pe´rez–Jime´nez
3 The Simulation with P Systems
In this section we use the formal tools developed in [7] to produce a straightforward
simulation with P systems of the reduction problem solved in CUDA as presented
in section 2. In [7] only SIMD machines were considered, and algorithms for
which communication took place only via a network of communication and not
via a shared memory. The present case is different, we have communication via
shared memory.
As a first step we construct for the CUDA model a P system Π(C) in the
spirit of Theorem 5 of [7]. The system must reflect in its membrane structure
the particular CUDA architecture. As a second step, we follow Theorem 7 of [7],
and construct Π(C, Y ) for Y a reduction algorithm. We must specify for each
algorithm the specific sequence of pairs (graph, rules) which compose Rµ(Y ).
Let Graphs denote the set of all possible graphs having n vertices labeled
P1, · · · , Pn. Having fixed the vertices, each element of Graphs will be uniquely
identified by the specific set of edges.
A distinguished element of Graphs is the identity graph, denoted in the sequel
Id: (the set of vertices is fixed as mentioned above) the set of edges is defined as
Id = {(i, i) | 1 ≤ i ≤ n}.
Another distinguished element of Graphs is the total graph, denoted Gtotal:
the set of edges is defined as
Gtotal = {(i, j) | 1 ≤ i, j ≤ n}.
One can also consider the strict total graph, denoted G+total: with set of edges
defined as
G+total = {(i, j) | 1 ≤ i, j ≤ n, i 6= j} = Gtotal \ Id.
We recall from [7] the following two definitions.
Definition 3.1 A P system with dynamic communication graphs is a construct
Π =< V,P1, · · · , Pn, Rµ >,
where P1, · · · , Pn are elementary membranes, and V is an alphabet of symbols used
to codify the contents of the membranes.
Rµ is a set of pairs [graph, rules], with graph ∈ Graphs and such that:
(i) if graph ⊆ Id then its associated rules are rewriting rules;
(ii) if graph ⊆ G+total then its associated rules are communication rules.
Definition 3.2 A P system with dynamic communication graphs will be called
with finite sequential support iff the set Rµ is both finite and totally ordered, i.e.,
if it is a finite sequence.
The Reduction Problem in CUDA and Its Simulation with P Systems 99
Let V be an alphabet of symbols with which we will codify the contents of the
membranes. An integer n will be codified as an (n apparitions of the symbol a, with
a ∈ V ). We assume we solve the reduction problem for a binary commutative and
associative operation ∗, and we assume we can compute n ∗m inside a membrane
by rewriting: i.e. we have symbols a, b ∈ V and a rewriting rule r∗(a, b) such that
r∗(a, b)(anbm) = an∗m.
We associate a hierarchical membrane structure to the CUDA components
in the following manner: (1) each thread is an elementary membrane; (2) each
block is a membrane containing the elementary ones associated to its threads; (3)
the global memory is a separate elementary membrane. The presentation of this
membrane structure is
µ = (B0(P00, P01, · · ·P0n), · · · , Bk(Pk0, Pk1, · · ·Pkn),Mg),
where n = 2t is the number of threads per block, Mg is the global memory mem-
brane, Pij is the membrane corresponding to thread j of block i, and when we
reason inside a block the subscript corresponding to the block may be omitted.
If we have a set Rµ of pairs (graph, rules) to obey the conditions of the defi-
nition, then the construct
Π(C) = (V, (B0(P00, P01, · · ·P0n), · · · , Bk(Pk0, Pk1, · · ·Pkn),Mg), Rµ)
= (V, µ,Rµ)
is a P system with dynamic communication graphs.
Here a discussion may start, comparing the P systems devised in [7] for SIMD-
X machines, and a potential similar candidate for the CUDA paradigm. Such a
candidate depends on a good definition for Rµ, or, at least, the formulation of
criteria for ’admissible’ candidates.
We open the way for this discussion, which is also a reflection on the power
and the limitations of the formalism introduced in [7], by simulating the solving
of the reduction problem in CUDA. More precisely, in the following we construct
sets Rµ(Y ) with the property that Π(C, Y ) = (V, µ,Rµ(Y )) is a P system with
dynamic communications graph, with finite sequential support, which simulates
Y , a reduction algorithm among the ones presented in Section 2.
The admissible communication graphs will have to reflect the communication
properties of CUDA. We will have communication edges between Mg and each
thread membrane Pji to simulate the reading from and the writing to global mem-
ory. Between the individual elementary membranes which simulate the threads
we can have communication only inside the same block, so the communication
graph will have separate connected components for each block. Rewriting rules
inside membranes will be associated to subgraphs of Id and communication rules
to subgraphs of G+total.
We use the shorthand notation (A,B, x) for symbol x traveling on an oriented
edge (A,B) from A to B, (G, x) for symbol x travelling on all oriented edges of G,
100 R. Ceterchi, M. A´ngel Mart´ınez-del-Amor, M.J. Pe´rez–Jime´nez
and ({Gj}j , x) for symbol x traveling on all oriented edges of the family of graphs
{Gj}j .
The membrane Mg contains integers nij each codified with a symbol aij
Mg = {anijij | i = 0, · · · , k, j = 0, · · · , n}
The graph for loading from global memory will be {(Mg, Pij) | i, j}, and writing
from thread Pij to global memory will use the edge (Pij ,Mg).
Loading the integers from global memory into the membranes corresponding
to threads will be simulated by the sequence
({(Mg, Pij , aij)}ij , {(Pij , aij → a)}ij),
where the first step is a communication step, and the second a rewriting step.
Writing to global memory from block k will be simulated by the sequence
((Pk0, a→ ak0), (Pk0,Mg, ak0))
where the first step is a rewriting step, and the second a communication step.
We now construct the graph for interleaved addressing and the sequence of
rules which simulate the procedure reduce0 of section 2. We assume we are inside
a block and we omit the block index. For a fixed stride s the graph of interleaved
addressing will be
Gs = {(Pi+s, Pi) | i mod (2s) = 0}.
On one edge of this graph the sequence of rules to be applied is
((Pi+s, a→ ab), (Pi+s, Pi, b), (Pi, r∗(a, b))).
We first rewrite a to ab in Pi+s, then the b symbol travels to Pi, and finally in Pi
the application of the rewriting rule r∗ produces the desired result.
For the entire block, the sequence of rules for stride s will be
Rs = (({Pi+s}i, a→ ab), (Gs, b), ({Pi}i, r∗(a, b))).
To finish the simulation of reduce0 we have to iterate Rs corresponding to the
sequence of strides for this case, i.e. we consider the sequence
(Rs | s = 1, s <= n, s = 2 ∗ s).
For sequential addressing the communication graph inside a block is
G′s = {(Pi+s, Pi) | i = 0, 1, · · · , s− 1}.
For the entire block, the sequence of rules for stride s will be
R′s = (({Pi+s}i, a→ ab), (G′s, b), ({Pi}i, r∗(a, b))).
The Reduction Problem in CUDA and Its Simulation with P Systems 101
To finish the simulation we iterate R′s corresponding to the sequence of strides
for this case, i.e. we consider the sequence
(R′s | s = n = 2t, s > 0, s = s div 2).
We can analogously simulate the remaining versions of the procedure reduce
of section 2. We illustrate with the improvement first add during load.
In this case we halve the number of blocks, and thus of threads. Equivalently,
we can consider that Mg contains a double number of integers, codified with a
number of symbols doubled compared to the number of threads. We denote aij
and bij the symbols which will correspond to the thread Pij . The previous loading
sequence will be replaced by the load-and-add sequence
({(Mg, Pij , aij , bij)}ij , {(Pij , r∗(aij , bij), aij → a)}ij).
References
1. J.M. Cecilia, J.M. Garc´ıa, G.D. Guerrero, M.A. Mart´ınez-del-Amor, I. Pe´rez-
Hurtado, M.J. Pe´rez-Jime´nez. Simulation of P systems with Active Membranes on
CUDA, Briefings in Bioinformatics, 11, 3 (2010), 313–322.
2. J.M. Cecilia, J.M. Garc´ıa, G.D. Guerrero, M.A. Mart´ınez-del-Amor, I. Pe´rez-
Hurtado, M.J. Pe´rez-Jime´nez, Simulating a P system based efficient solution to SAT
by using GPUs, Journal of Logic and Algebraic Programming, 79, 6 (2010), 317–325.
3. R. Ceterchi, C. Mart´ın–Vide: Dynamic P Systems. In ”Membrane Computing”, In-
ternational Workshop, WMC-CdeA 2002, Curtea de Arges¸, Romania, August 2002,
Revised Papers, (G. Pa˘un, G. Rozenberg, A. Salomaa, C. Zandron eds.), LNCS 2597,
Springer 2003, p. 146-186
4. R. Ceterchi, M.J. Pe´rez Jime´nez: Simulating Shuﬄe–Exchange Networks with P Sys-
tems. In Proceedings of the Second Brainstorming Week on Membrane Computing
(Gh. Pa˘un, A. Riscos, F. Sancho and A. Romero, eds.), Report RGNC 01/04, Uni-
versity of Seville 2004, 117-129.
5. R. Ceterchi, M.J. Pe´rez Jime´nez: A Perfect Shuﬄe Algorithm for Reduction Processes
and its Simulation with P Systems. In Proceedings of the International Conference
on Computers and Communications ICCC 2004 (I. Dzitac, T. Maghiar, C. Popescu,
eds.), Editura Univ. Oradea, 2004, 92-97
6. R. Ceterchi, M.J. Pe´rez Jime´nez: On Two-Dimensional Mesh Networks and Their
Simulation with P Systems. In Membrane Computing, 5th International Workshop,
WMC 2004, Revised Selected and Invited Papers (G. Mauri, Gh. Pa˘un, M. J. Pe´rez–
Jime´nez, G. Rozenberg, A. Salomaa, eds.), LNCS 3365 (2005), 259-277.
7. R. Ceterchi, M.J. Pe´rez Jime´nez: On Simulating a Class of Parallel Architectures,
International Journal of Foundations of Computer Science, Vol. 17, No. 1 (2006)
91–110
8. M. Harris: Optimizing Parallel Reduction in CUDA, NVIDIA Developer Technology
(2007).
9. M. Harris. Mapping computational concepts to GPUs, ACM SIGGRAPH 2005
Courses, NY (USA), 2005.
102 R. Ceterchi, M. A´ngel Mart´ınez-del-Amor, M.J. Pe´rez–Jime´nez
10. D. Kirk, W. Hwu. Programming Massively Parallel Processors: A Hands On Ap-
proach, MA (USA), 2010.
11. M.A. Mart´ınez-del-Amor, J. Pe´rez-Carrasco, M.J. Pe´rez-Jime´nez. Characterizing the
parallel simulation of P systems on the GPU. International Journal of Unconven-
tional Computing, 9, 5-6 (2013), 405-424.
12. M.A. Mart´ınez-del-Amor, I. Pe´rez-Hurtado, A. Gastalver-Rubio, A.C. Elster, M.J.
Pe´rez-Jime´nez. Population Dynamics P systems on CUDA. In 10th Conference on
Computational Methods in Systems Biology, CMSB2012, (D. Gilbert, M. Heiner,
eds.), LNBI 7605 (2012), 247-266.
13. Gh. Pa˘un: Computing with Membranes. Journal of Computer and System Sciences,
61, 1 (2000), 108–143, and Turku Center for CS-TUCS Report No. 208, 1998
14. Gh. Pa˘un: Membrane Computing. An Introduction, Springer-Verlag, Berlin, 2002.
15. NVIDIA CUDA website, 2014. https://developer.nvidia.com/cuda-zone
