Improved Implementation of Simulation for Membrane Computing on the Graphic Processing Unit  by Maroosi, Ali et al.
 Procedia Technology  11 ( 2013 )  184 – 190 
2212-0173 © 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of the Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia.
doi: 10.1016/j.protcy.2013.12.179 
The 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013) 
Improved Implementation of Simulation for Membrane Computing 
on the Graphic Processing Unit 
Ali Maroosi*, Ravie Chandren Muniyandi, Elankovan A. Sundararajan,
Abdullah Mohd. Zin
Research Center for Software Technology and Management, Faculty of Technology and Information Science
Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia 
Abstract 
Membrane computing is a theoretical model of computation inspired by the structure and functioning of cells. Membrane
computing models are naturally parallel. Most simulations of membrane computing, however, have been carried out serially
on a machine with a central processing unit (CPU), which neglects this parallelism.
This paper uses the Graphic Processing Unit (GPU) as a parallel platform to implement membrane computing. The method minimizes
data transfer between the device and the host, which is time consuming, by performing all computation on the GPU and
transferring only the final results to the CPU. Simulations show that speed increases up to 15 times compared to sequential simulation,
and that using shared memory increases speed up to 38 times.
Keywords: Membrane computing; Graphic Processing Unit; Parallel processing
* Corresponding author.  
E-mail address: ali.maroosi@gmail.com 
1. Introduction 
Membrane computing is a recent branch of computer science introduced by Paun [1]. The basic model
consists of a hierarchical structure composed of several membranes, embedded in a main membrane called the
skin. Membranes divide the Euclidean space into regions that contain objects, represented by symbols of
an alphabet, and evolution rules. Using these rules, the objects may evolve and/or move from a region to
an adjacent region [2].
Software applications for membrane computing normally implement sequential simulation algorithms adapted to
common CPU architectures [3-6]. These algorithms do not scale well as the problem size
increases. To exploit the parallelism of membrane computing, other efforts have implemented
membrane computing on parallel platforms. For example, membrane computing has been implemented on computer
clusters [7], on reconfigurable hardware such as field-programmable gate arrays (FPGAs) [8], and on GPUs [9-11].
However, previous approaches have some limitations and need more investigation. Computer clusters are not
cost effective, and programming and changing code on an FPGA is very time consuming. The GPU therefore
offers an economical and comparatively easy route to parallel processing compared to computer clusters and FPGAs. In this
paper a variant of membrane computing, namely P systems with active membranes, is implemented on the GPU. This
implementation performs better than a sequential implementation. Because the latency of shared memory is
lower than that of global memory on the GPU, this paper further improves the implementation by storing the
data to be processed in shared memory instead of global memory.
2. P systems with Active membranes 
P systems with active membranes are formed by a membrane structure in which a label and a polarization are
associated with each membrane. The elements of a P system with active membranes are
shown in Fig. 1 (active membranes are described in more detail by Paun [12]).
 
[Figure 1: diagram of a membrane structure inside the environment, annotated with the skin, the membranes, their labels and charges, the multisets of objects, and the rules.]

Fig. 1 Membrane computing structure
Active membranes are formally defined by a tuple π = (O, H, μ, w1, ..., wm, R), where:
(1) m ≥ 1 (the initial degree of the system);
(2) O is the alphabet of objects;
(3) H is a finite set of membrane labels;
(4) μ is a membrane structure that consists of m membranes with initially neutral polarizations, labeled with
elements of H;
(5) w1, ..., wm are strings over O, describing the multisets of objects placed in the m regions of μ;
(6) R is a finite set of developmental rules defined as follows:
(a) Object evolution rules: $[a \to u]_h^{\alpha}$ for h ∈ H, α ∈ {+, −, 0} (electrical charges), a ∈ O, where u is a string
over O that describes a multiset of objects. These rules are associated with membranes and depend on the label and
the charge of the membrane;
(b) "In" communication rules: $a\,[\ ]_h^{\alpha} \to [b]_h^{\beta}$ for h ∈ H, α, β ∈ {+, −, 0}, a, b ∈ O. An object is introduced into
the membrane, possibly modified, and the initial charge α is changed to β;
(c) "Out" communication rules: $[a]_h^{\alpha} \to [\ ]_h^{\beta}\,b$ for h ∈ H, α, β ∈ {+, −, 0}, a, b ∈ O. An object is released
from the membrane, possibly modified, and the initial charge α is changed to β;
(d) Dissolving rules: $[a]_h^{\alpha} \to b$ for h ∈ H, α ∈ {+, −, 0}, a, b ∈ O. A membrane with a specific charge is
dissolved in reaction to an object (possibly modified);
(e) Division rules: $[a]_h^{\alpha} \to [b]_h^{\beta}\,[c]_h^{\gamma}$ for h ∈ H, α, β, γ ∈ {+, −, 0}, a, b, c ∈ O. A membrane is divided into
two membranes. The objects inside the membrane are replicated, except for a, which may be
modified in each membrane.
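As an illustration of the definition above, the following is a minimal Python sketch of how a membrane and an object evolution rule of type (a) might be represented. The class names `Membrane` and `EvolutionRule`, the helper `applicable`, and the example labels and objects are assumptions made for this sketch; they are not taken from the simulator described in this paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Membrane:
    """One membrane of the structure mu (hypothetical representation)."""
    label: str                                          # element of H
    charge: str = "0"                                   # "+", "-" or "0"
    objects: Counter = field(default_factory=Counter)   # multiset w_i over O
    children: list = field(default_factory=list)        # inner membranes


@dataclass(frozen=True)
class EvolutionRule:
    """Rule of type (a): [a -> u] for membranes with a given label and charge."""
    label: str
    charge: str
    lhs: str      # the object a
    rhs: tuple    # the multiset u, written as a tuple of symbols


def applicable(rule, membrane):
    """A rule can fire if label and charge match and the object a is present."""
    return (rule.label == membrane.label
            and rule.charge == membrane.charge
            and membrane.objects[rule.lhs] > 0)


# Example: a skin membrane labeled 0 containing membrane 1 with charge +.
skin = Membrane(label="0")
inner = Membrane(label="1", charge="+", objects=Counter("abb"))
skin.children.append(inner)
rule = EvolutionRule(label="1", charge="+", lhs="a", rhs=("b", "c"))
```

Communication, dissolving and division rules would need further fields (the target charge β and, for division, γ), which are omitted here for brevity.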
The rules are applied according to the following principles: 
(1) All the elements which are not involved in any of the operations to be applied remain unchanged. 
(2) Rules associated with label h are used for all membranes with this label, no matter whether the 
membrane is an initial one or it was generated by division during the computation. 
(3) Rules from (a) to (e) are used as usual in the framework of membrane computing, i.e. in a maximally
parallel way. In one step, each object in a membrane can be used by at most one rule
(non-deterministically chosen), but any object which can evolve by a rule must do so (with the restrictions
indicated below).
(4) Rules (b) to (e) cannot be applied simultaneously in a membrane in one computation step.
(5) If a membrane is dissolved, its content (multiset and interior membranes) becomes part of the
immediately external one. The skin is never dissolved or divided.
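Principle (3) can be sketched for object evolution rules alone as follows. This is a simplified illustration under stated assumptions, not the simulator of this paper: charges, labels and rules of types (b) to (e) are ignored, and the function name `evolve_step` and the dictionary layout of `rules` are hypothetical.

```python
import random
from collections import Counter


def evolve_step(objects, rules, rng=random):
    """Apply evolution rules to one multiset in a maximally parallel way.

    objects: Counter mapping each symbol to its multiplicity.
    rules:   dict mapping a symbol a to a list of alternative products u,
             each product given as a tuple of symbols.
    """
    produced = Counter()    # objects created in this step
    remaining = Counter()   # objects for which no rule exists
    for symbol, count in objects.items():
        options = rules.get(symbol)
        if not options:
            remaining[symbol] += count
            continue
        # Every copy that can evolve must evolve; each copy is consumed by
        # exactly one rule, chosen non-deterministically among the options.
        for _ in range(count):
            produced.update(rng.choice(options))
    # Products are added only at the end, so they cannot react in this step.
    return remaining + produced


result = evolve_step(Counter("aab"), {"a": [("b", "c")]})
print(result)  # b occurs 3 times, c occurs 2 times
```

Collecting the products in a separate multiset is the design choice that keeps a step atomic: an object produced in the current step becomes available to rules only in the next one.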
 
3. Proposed Parallel implementation 
3.1. Graphics Processing Unit 
GPUs are now a solid alternative for high performance computing. The way GPUs exploit
parallelism differs from that of multi-core CPUs, which raises new challenges in taking advantage of their computing
power. The GPU is especially well suited to problems that can be expressed as data-parallel computations. GPUs
can support several thousand concurrent threads, providing a massively parallel environment. This parallel
computation model makes the GPU a suitable technology on which a parallel simulator can run
efficiently. The smallest parts of a GPU are cores. A group of cores is called a streaming multiprocessor (SMP).
Cores inside each SMP are synchronized to execute the same instructions, while each SMP works asynchronously from the other
SMPs. Each core has a very small amount of memory called local memory [13,14]. A small amount of
shared memory is dedicated to each SMP, and all SMPs can access a large amount of memory, the global RAM.
From the application point of view, threads, blocks and kernels are used in place of cores, SMPs and groups of SMPs
respectively. A program contains one or several kernels, and each kernel may contain one or more blocks. Each
block runs on a single SMP, and all threads within a block can use the same shared memory and barrier
synchronization. Synchronization and sharing of shared memory are impossible across blocks. The programmer
writes the device program as a kernel. As illustrated in Fig. 2, a kernel grid is subdivided into blocks and each block
is subdivided into threads.
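The grid, block and thread hierarchy can be mimicked sequentially on a CPU to make the indexing explicit. The Python sketch below is only an emulation written for illustration: `launch` and `increment_kernel` are hypothetical names, and real threads of a block run concurrently rather than in a loop. The flat index `block_idx * block_dim + thread_idx` follows the usual CUDA convention of combining the block index, the block size and the thread index.

```python
def launch(kernel, grid_dim, block_dim, global_mem):
    """Emulate a kernel launch: grid_dim blocks of block_dim threads each."""
    for block_idx in range(grid_dim):
        # One shared buffer per block; other blocks never see it.
        shared = [0] * block_dim
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, shared, global_mem)


def increment_kernel(block_idx, thread_idx, block_dim, shared, global_mem):
    """Each thread stages one element in shared memory and writes it back + 1."""
    gid = block_idx * block_dim + thread_idx   # flat index into global memory
    shared[thread_idx] = global_mem[gid]       # global -> shared
    global_mem[gid] = shared[thread_idx] + 1   # shared -> global


data = list(range(8))
launch(increment_kernel, grid_dim=2, block_dim=4, global_mem=data)
print(data)  # [1, 2, 3, 4, 5, 6, 7, 8]
```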
 
 
Fig. 2. Memory and program architecture for GPU [15] 
 
 
3.2.  Implementation of active membrane computing on GPU 
Two implementations on the GPU have been developed in this paper: one that uses global
memory and one that uses shared memory. Both were developed using CUDA in the Microsoft
Visual Studio 2008 (C++) environment. The program is divided into two parts: the host (CPU side) and the device
(GPU side). The host part of the code is generally responsible for controlling the program execution flow,
allocating memory on the host and on the device, and obtaining the results from the device.
A simulation algorithm for the GPU is given in Algorithm 1. Inputs are first loaded on the host, and memory is
allocated on the device to receive them.
 
Algorithm 1: Pseudo code for active membrane systems on GPU
 Step 0: (Host and Device) Input data and initialize the structure and content of the membranes
 Step 1: (Device) Select the rules that can be applied
 Step 2: (Device) Execute the applicable rules
 Step 3: (Device) Repeat steps 1 and 2 until the termination criterion is met
 Step 4: (Host and Device) Transfer the output from the Device to the Host
 
This algorithm takes as input parameters the initial multisets (w1, ..., wm), the number of regions or membranes
(m), the structure of membranes (μ) and the termination criterion. The inputs are then moved from the host to the device, and the
device executes parallel computations on the input data. This procedure includes two main steps: selecting rules
and executing rules. In our simulator each membrane is assigned to one block on the GPU.
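The flow of Algorithm 1 can be sketched sequentially in Python as follows. This is a CPU emulation under simplifying assumptions, not the CUDA code of this paper: every rule rewrites one object into exactly one object, so the number of object slots per membrane stays fixed, the termination criterion is a fixed number of steps, and the names `simulate` and `rules` are hypothetical. The outer list index plays the role of the block (membrane) and the inner index the role of the thread (object).

```python
def simulate(membranes, rules, steps):
    """Emulate Algorithm 1 for one-to-one rewriting rules.

    membranes: list of lists of symbols, one inner list per membrane (block),
               one symbol per object slot (thread).
    rules:     dict mapping (membrane index, symbol) to the new symbol.
    steps:     number of iterations used as the termination criterion.
    """
    # Step 0: copy the input once ("host to device").
    state = [list(objs) for objs in membranes]
    for _ in range(steps):                       # Step 3: repeat
        # Step 1: each (block j, thread i) selects the rule for its object.
        selected = [[rules.get((j, obj)) for obj in objs]
                    for j, objs in enumerate(state)]
        # Step 2: each thread executes its selected rule, if any.
        state = [[new if new is not None else obj
                  for obj, new in zip(objs, sel)]
                 for objs, sel in zip(state, selected)]
    # Step 4: only the final configuration is returned ("device to host").
    return state


rules = {(0, "a"): "b", (0, "b"): "a", (1, "c"): "c"}
print(simulate([["a", "b"], ["c", "a"]], rules, steps=3))
# [['b', 'a'], ['c', 'a']]
```

Keeping selection and execution as separate passes mirrors Steps 1 and 2: all threads decide on the basis of the same configuration before any object is overwritten.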
 
[Figure 3: the objects a1 ... an of each membrane 1 ... m in the membrane computing model are mapped to threads 1 ... n of blocks 1 ... m on the GPU.]

Fig. 3: Assigning each thread to the objects in one membrane
Each thread checks the applicability of the rules related to its object. If a rule is selected as executable,
the thread applies it in the execution step. Finally, when execution has finished, the result is transferred to the host.
Transferring output data to the host (CPU) memory across the peripheral component interconnect express
(PCI-Express) bus negatively affects the simulation time. This paper therefore minimizes data transfer between the
device and the host by moving the processing of the output from the host to the GPU.
4. Simulations and results 
This paper presents three types of simulation: i) sequential on the CPU, ii) parallel on the GPU using global
memory and iii) parallel on the GPU using shared memory. We show how to improve the efficiency of executing
membrane systems on the GPU by making full use of the GPU's computational resources. To evaluate the effect of using
the available resources of the GPU, this paper defines the following membrane computing structure as a benchmark:
 
mjniaa jii ddo 0;0;][:rule evolving                                                  (1) 
where n is the number of rules inside each membrane and m is the number of membranes. In this structure, the
number of membranes and the number of objects in each membrane can be changed, so the fraction of the
computational resources in use can also be changed.
To use all the computational resources of the GPU, the number of objects and the number of membranes should be large.
Results (elapsed time in milliseconds) for simulating the benchmark for 1000 iterations on a computer with
an Intel Core i7-3820 CPU at 3.60 GHz, 8 GB of RAM and an NVIDIA GTX 480 GPU are given in
Table 1, Table 2, Fig. 4 and Fig. 5. As shown in Fig. 4, when the number of objects (n) inside each membrane is small,
for example n = 2, the CPU performs better (smaller execution time) than the GPU (in this
simulation the number of membranes is m = 8192). When the number of objects is high, for example n = 512, the GPU
performs better than the CPU. This paper also uses a shared memory strategy, which improves the performance of the
GPU implementation because the latency of access to global memory is about 400-800 cycles while the latency
of access to shared memory is about 8-22 cycles [13].
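To put the quoted latencies in perspective, the short Python calculation below derives the range of the per-access latency ratio from the two cycle ranges. It is a back-of-the-envelope illustration only; the constant and function names are hypothetical, and the ratio is not a prediction of the overall speedup, since the total run time also includes arithmetic, synchronization and kernel launch costs.

```python
GLOBAL_LATENCY_CYCLES = (400, 800)   # range quoted for global memory
SHARED_LATENCY_CYCLES = (8, 22)      # range quoted for shared memory


def latency_ratio_bounds(slow, fast):
    """Smallest and largest possible ratio slow/fast for two (min, max) ranges."""
    return slow[0] / fast[1], slow[1] / fast[0]


low, high = latency_ratio_bounds(GLOBAL_LATENCY_CYCLES, SHARED_LATENCY_CYCLES)
print(f"{low:.1f}x to {high:.1f}x")  # 18.2x to 100.0x
```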
Table 1. Comparison of the performance of CPU and GPU for different numbers of objects inside each membrane, with the
number of membranes equal to 8192

Number of   Sequential on   GPU shared    GPU global    Speedup, GPU    Speedup, GPU
objects     CPU (ms)        memory (ms)   memory (ms)   shared memory   global memory
2           62              129           145           0.48            0.42
4           140             130           157           1.07            0.89
8           296             131           186           2.25            1.59
16          608             133           236           4.57            2.57
32          1217            135           240           9.01            5.07
64          2449            137           248           17.87           9.87
128         4898            140.6         319           34.83           15.35
256         10077           264           652           38.17           15.45
512         20155           518           1244          38.90           16.20
1024        40300           1067          2520          37.76           15.99
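The speedup columns of Table 1 are the ratio of the sequential CPU time to the corresponding GPU time. The Python snippet below recomputes them from the timing columns as a consistency check; the variable and function names are hypothetical, and the timings are copied from Table 1.

```python
# (objects, CPU ms, GPU shared ms, GPU global ms), copied from Table 1
TABLE_1 = [
    (2, 62, 129, 145), (4, 140, 130, 157), (8, 296, 131, 186),
    (16, 608, 133, 236), (32, 1217, 135, 240), (64, 2449, 137, 248),
    (128, 4898, 140.6, 319), (256, 10077, 264, 652),
    (512, 20155, 518, 1244), (1024, 40300, 1067, 2520),
]


def speedups(rows):
    """Return (objects, CPU/shared, CPU/global) for each row of timings."""
    return [(n, cpu / shared, cpu / glob) for n, cpu, shared, glob in rows]


for n, s_shared, s_global in speedups(TABLE_1):
    print(f"n={n:5d}  shared: {s_shared:6.2f}x  global: {s_global:6.2f}x")
```

For n = 512, for example, this gives 20155 / 518 ≈ 38.91 and 20155 / 1244 ≈ 16.20; the recomputed values agree with the table when cut to two decimals. The same rows also show that the shared memory version is about 1244 / 518 ≈ 2.4 times faster than the global memory version at that size.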
 
 
Fig. 4. Comparison of the execution time of the membrane computing model on CPU and GPU, with and without shared memory, for different numbers of
objects
Fig. 5 shows the effect of the number of membranes on the performance of the GPU. As the number of
membranes increases, the usage of the GPU's computational resources increases and the GPU performs better than
the CPU.
 
Fig. 5. Comparison of the execution time of membrane computing model on CPU and GPU with and without shared memory for different 
number of membranes 
 
Table 2. Comparison of the performance of CPU and GPU for different numbers of membranes, with the number of objects
equal to 512

Number of   Sequential on   GPU shared    GPU global    Speedup, GPU    Speedup, GPU
membranes   CPU (ms)        memory (ms)   memory (ms)   shared memory   global memory
2           4.7             2.01          2.02          2.3             2.3
4           16              2.05          2.3           7.8             6.9
8           31              2.15          2.7           14.4            11.4
16          47              2.25          3.4           20.8            13.8
32          78              2.37          5.6           32.9            13.9
64          156             4.7           11.1          33.1            14.0
128         296             8.5           20.9          34.8            14.1
256         624             17.4          44            35.8            14.1
512         1201            35.07         84.4          34.2            14.2
1024        2449            69.26         167           35.3            14.6
2048        4914            137.7         331           35.6            14.8
4096        10031           272           637           36.8            15.7
8192        20161           520           1252          38.7            16.1
16384       40286           1033          2488          38.9            16.1
32768       80589           2047          4977          39.3            16.1
 
5. Conclusion 
This paper used shared memory instead of global memory, which improved
execution performance. For 256 objects per membrane, the GPU with global memory and the GPU with shared
memory achieve speedups of 15.45 and 38.17 times respectively over the CPU implementation. In future work, we
plan to use a matrix representation of membrane computing, and in this way to remove the current limitation
that the membrane structure implemented on the GPU cannot have more than two levels.
Acknowledgements  
This work was supported by the Exploratory Research Grant Scheme (ERGS) of the Ministry of Higher Education
(Malaysia; Grant code: ERGS/1/2011/STG/UKM/03/
References 
[1] Paun Gh. Computing with Membranes. Journal of Computer and System Sciences 2000; 61:108–143.
[2] Leporati A, Zandron C, Ferretti C, Mauri G. Solving Numerical NP-Complete Problems with Spiking Neural P Systems. Lecture Notes in
Computer Science 2007; 4860:336–352.
[3] Muniyandi R, Abdullah MZ. Modeling hormone-induced calcium oscillations in liver cell with membrane computing. Romanian
Journal of Information Science and Technology 2012; 15:63–76.
[4] Muniyandi R, Abdullah MZ. Membrane computing as a modeling tool for discrete systems. Journal of Computer Science 2012; 7:1667–1673.
[5] Garcia-Quismondo M, Gutierrez-Escudero I, Perez-Hurtado R, Perez-Jimenez MJ, Riscos-Nunez A. An overview of P-lingua 2.0. Membrane
Computing, 10th Intl. Workshop, LNCS 2010; 5957:264–288.
[6] Gutierrez-Naranjo MA, Perez-Jimenez MJ, Riscos-Nunez A. Available membrane computing software. In: Applications of
Membrane Computing 2006. p. 411–436.
[7] Ciobanu G, Wenyuan G. P Systems Running on a Cluster of Computers. Lecture Notes in Computer Science 2004; 2933:123–139.
[8] Nguyen V, Kearney D, Gioiosa G. A Region-Oriented Hardware Implementation for Membrane Computing Applications and Its Integration
into Reconfig-P. Lecture Notes in Computer Science 2010; 5957:385–409.
[9] Cecilia JM, Garcia JM, Guerrero GD, Martinez-del-Amor MA, Perez-Hurtado I, Perez-Jimenez MJ. Simulating a P system based efficient
solution to SAT by using GPUs. Journal of Logic and Algebraic Programming 2010; 79:317–325.
[10] Cecilia JM, Garcia JM, Guerrero GD, Martinez-del-Amor MA, Perez-Hurtado I, Perez-Jimenez MJ. Simulation of P systems with active
membranes on CUDA. Briefings in Bioinformatics 2010; 11:313–322.
[11] Cecilia JM, Garcia JM, Guerrero GD, Martinez-del-Amor MA, Perez-Jimenez MJ, Ujaldon M. The GPU on the simulation of cellular
computing models. Soft Computing 2012; 16:231–246.
[12] Paun Gh. Membrane Computing: An Introduction. Springer, Berlin; 2002.
[13] NVIDIA Corporation. NVIDIA CUDA C Programming Guide, Version 4.2; 2012. http://docs.nvidia.com/cuda/index.html
[14] Takizawa H, Koyama K, Sato K, Komatsu K, Kobayashi H. CheCL: Transparent Checkpointing and Process Migration of OpenCL
Applications. International Parallel and Distributed Processing Symposium (IPDPS11), Anchorage, USA, 2011, pp. 864–876.
[15] Zeller C. CUDA C Basics, Supercomputing 2010 Tutorial. NVIDIA Corporation; 2010.
http://www.nvidia.com/object/sc10_cuda_tutorial.html
 
