To investigate the behavior of biochemical systems, many runs of Gillespie's Stochastic Simulation Algorithm (SSA) are generally needed, causing excessive computational costs on Central Processing Units (CPUs). Since all SSA runs are independent, the Intel Xeon Phi coprocessors based on the Many Integrated Core (MIC) architecture can be exploited to distribute the workload. We considered two execution modalities on MIC: one consisted in running exactly the same CPU code of SSA, while the other exploited MIC's vector instructions to reuse the CPU code with only few modifications. MIC performance was compared with Graphics Processing Units (GPUs), specifically implemented in CUDA to optimize the use of memory hierarchy. Our results show that GPU largely outperforms MIC and CPU, but required a complete redesign of SSA. MIC allows a relevant speedup, especially when vector instructions B Andrea Tangherloni
are used, with the additional advantage of requiring minimal modifications to CPUmechanism-based model of prokaryotic gene regulation [16] . A family of synthetic models with different size (i.e., different numbers of molecular species and reactions) was then used as test case to investigate the benefits of exploiting MIC's vector instructions and offload capabilities. In addition to the computational time, we evaluate the costs and power consumption of the hardware employed, and discuss the effort to port the existing code on parallel architectures.
The previous works already focused on the comparison of the performance of Intel Xeon Phi coprocessors against other parallel architectures, showing different results according to the specific problem under investigation. In the context of the simulation of spin systems, a comparison between Intel Xeon Phi 5110P and Nvidia Tesla K20s video card was presented in [2] , highlighting that a careful implementation of the C code allows the MIC to compete with the GPU. On the contrary, in [5] , it was shown that a Nvidia Tesla K20x outperforms an Intel Xeon Phi 5110P for the parallelization of non-bonded electrostatic computation for Virtual Screening; this work pointed out, in particular, the importance of OpenMP source code optimization. The work presented in [9] described the performance comparison among a multi-core Intel CPU, an Nvidia Tesla K20c GPU, and an Intel Xeon Phi 7120P coprocessor for the execution of a tracking algorithm based on the Hough transform: the results highlighted that in this case, the CPU performs better than both GPU and MIC coprocessors. Moreover, the authors suggested that an implementation with offloaded calculations to the coprocessors might help in achieving better performances. A multi-threaded version of an algorithm to tackle the tensor transpose problem was then presented in [13] . In this case, the multi-core CPU and the MIC achieved a relevant speedup with respect to the GPU, since the optimization of L1 cache is easier than the implementation of a coalesced global memory access on the GPU. As a final example, a comparison of the acceleration on an Intel Xeon Phi 5110P and a Nvidia Tesla K20x for protein docking calculation based on the fast Fourier transform was introduced in [22] . The GPU resulted to be five times faster than the MIC, considering the comparable implementation costs required by these architectures.
In this work, we show that, for the problem of executing increasing batches of SSA runs, GPU largely outperforms the other architectures because of its large number of computing units. Nevertheless, GPU required a complete redesign and specific programming of this simulation algorithm. On the other side, we show that MIC coprocessors allow a relevant speedup, especially when the simulated biochemical system is large and MIC vector instructions are used, and have the additional advantage of requiring minimal modifications to CPU code.
The paper is structured as follows. In Sect. 2, we briefly introduce the formalism of mechanism-based stochastic models and the functioning of Gillespie's SSA. We also provide the definition of the biochemical systems that will be considered as case studies to determine the computational performances of CPU, GPU and MIC. In Sect. 3, we show the speedup obtained by Xeon Phi and Nvidia Tesla K80 with respect to CPU, and discuss the main benefits of MIC architecture. Finally, in Sect. 4, we conclude the work with some final remarks about the selection of the proper architecture to execute SSA.
Mechanism-based modeling and Gillespie's SSA
According to the stochastic formulation of chemical kinetics [7] , a mechanismbased model of a biochemical system can be formalized by specifying the set S = {S 1 , . . . , S N } of molecular species occurring in the system and the set R = {R 1 , . . . , R M } of chemical reactions taking place between these species. Each reaction can be formally defined as R j :
β ji S i , where α ji , β ji ∈ N are the stoichiometric coefficients associated, respectively, to the ith reactant and to the ith product of the jth reaction, with i = 1, . . . , N , j = 1, . . . , M. The value c j ∈ R + is the so-called stochastic constant, encompassing all physical and chemical properties of reaction R j .
Mechanism-based stochastic models are able to account for any macroscopic effect in the behavior of biochemical systems caused by biological noise, that is, the random collision and reaction events among the molecular species that are present in very low amounts inside cells. In these cases, the classical deterministic modeling approach can fail in capturing the effects of biochemical stochastic processes [24] . The seminal procedure used to describe the dynamics of stochastic models is Gillespie's SSA [7] , which is able to realize an exact reproduction of the temporal evolution of biochemical networks, under the following assumptions: (i) reactions and species are contained within a single volume, whose physical conditions (e.g., pressure, temperature) remain constant during the whole time of simulation; (ii) the reaction volume is well-stirred, that is, molecules are uniformly distributed in space; (iii) the amount of each molecular species S i is discrete, i.e., it is represented by an integer number x i ∈ N.
Briefly, SSA works as follows. Given the state of the system at time t, represented by the vector x = x(t) ≡ (x 1 (t), . . . , x N (t)), SSA first identifies the reaction to execute in the next time interval [t, t + τ ). To this aim, the probability of each reaction R j to occur in the next infinitesimal time step [t, t + dt) has to be evaluated. This probability is proportional to the so-called propensity function of reaction R j , defined
is the number of distinct combinations of the reactant molecules in R j occurring in state x. Then, SSA computes the time τ before a reaction takes place:
, where a 0 (x) = M j=1 a j (x) and ρ 1 is a random number sampled in [0,1] with a uniform probability. The reaction R j to be actually executed is then chosen by taking the smallest integer in
where ρ 2 is a second random number sampled in [0,1] with a uniform probability. The interested reader is referred to [7] for more details about SSA.
In what follows, we will consider two different test cases of mechanism-based stochastic models to the aim of evaluating the computational performance of MIC, CPU, and GPU when executing large batches of SSA runs.
The first test case is a stochastic model describing the gene expression regulation network in prokaryotes (PGN, in short). In the PGN model, a gene is transcribed into messenger RNA and translated into a protein. The gene itself is inhibited by the binding with a dimer of the protein. Full details of this model, consisting in five species and eight reactions, can be found in [16] .
The second test case is a family of synthetic stochastic models of increasing size (SynSM, in short), which are randomly generated according to the methodology proposed in [20] . Namely, SynSM are characterized by a number of species N and of reactions M ranging from 20 × 20 to 240 × 240; the values of the stochastic constants are randomly sampled with uniform distribution in (0, 1). SynSM are specifically exploited here to evaluated the impact of the size of the model on the performance of the MIC architecture.
Computational results
In this section, we start by showing the comparison between the computational performance achieved with Intel Xeon Phi coprocessors, CPUs, and GPUs for the execution of an increasing number of SSA runs for the PGN model. Then, we investigate some specific MIC features, such as vector instructions and offload capability, to simulate a set of SynSM with increasing size. To evaluate the computational costs of running sequential and parallel SSA runs, we exploited GALILEO, a supercomputer created by the Italian consortium CINECA that combines multiple state-of-the-art accelerators. Namely, it consists of: (a) 768 Intel Xeon Phi 7120P coprocessors, each one with 61 cores, 1.238 GHz; (b) 516 compute nodes, each one equipped with 2 Intel Xeon Haswell E5-2630 v3 (8 cores, 2.40 GHz), for a total of 8256 cores; and (c) 20 Nvidia Tesla K80 GPUs (4992 cores with a dual-GPU design, 560 MHz). Because of its peculiar hybrid architecture, this supercomputer represents an ideal machine for a direct comparison between the three architectures considered here.
To exploit GALILEO, we developed three different implementations of SSA. The first implementation was designed for the x86 architecture, that is, the Intel Xeon Haswell E5-2630 v3 and the Intel Xeon Phi 7120P. The second implementation consisted in reusing the CPU code, modified to exploit the advantages of MIC's vector instructions. The third version was specifically developed for the Nvidia Tesla K80 architecture, and optimized to fully exploit the memory hierarchy of the GPU: the state of the biochemical system is stored in the shared memory, while the matrices of stoichiometric coefficients are stored in the constant memory. All SSA implementations were written using the C++ language and exploit the MRG32K32a pseudorandom numbers generator [12] , which is available in the Intel Math Kernel Library for the CPU and MIC and in the CURAND Library for the GPU.
Simulation of the PGN model For each architecture and its respective SSA implementation, every SSA run of the PGN model was executed for a total simulation time of t = 80 time units, storing the amount of all molecular species along 16 time points of the dynamics. Figure 1 reports the running time (in seconds) required to execute increasing batches of SSA runs on the MIC (light gray bars), MIC using vector instructions (dark gray bars), CPU (black bars), and GPU (white bars).
These results show that the CPU running time linearly increases with the number of simulations and that the GPU largely outperforms the other architectures. In particular, the GPU running time remains almost constant while increasing the number of parallel SSA runs: this is due to the high number of cores available on the GPU used in this work, which allows to distribute the simulations over individual computing units. During this batch of tests, anyway, the GPU Nvidia K80 was far from a full usage of its computing and memory resources. Thus, although for 320 parallel SSA runs, the GPU achieved a speedup of 15× with respect to the CPU, we argue that its computational performances would be even better by running a larger number of simulations. It is also worth noting that, despite the branching of CUDA threads due to the stochastic nature of all (independent) simulations, the simplicity of SSA makes coarse-grain simulation perfectly suitable for GPU's architecture.
In the case of MIC architecture, we observed an acceleration with respect to the sequential SSA execution on the CPU. According to our results, in the case of 240 SSA runs, the speedup achieved by this architecture-without the use of vector instructions-is 4.1× with respect to the CPU (Fig. 1, light gray bars) . Since Xeon Phi coprocessors can execute up to four concurrent threads on each of the 60 available cores, the acceleration provided by MIC scales up to 240 simulations only. To highlight this limitation, we estimated by linear regression the theoretical running time of MIC to execute 280 and 320 simulations, using the running times obtained in the case of 1, . . . , 240 simulations. In Fig. 1 , we show the estimated running times with and without the use of MIC's vector instructions (dotted and dashed lines, respectively). We observe that, using both SSA execution modalities on MIC, the measured running times with 280 and 320 simulations are higher than expected because of the limitation of resources.
The use of MIC vector instructions allows an additional improvement of computational performances (Fig. 1, dark gray bars) , and highlights that this peculiar capability of MIC is necessary to fully leverage the computational power of Intel (Fig. 2) , the running time of the MIC with and without the use of vector instructions changes from 0.15 to 0.27 s, respectively, corresponding to an overall reduction of about 42 % when vector instructions are enabled. Figure 2 also shows the break-even between MIC, CPU, and GPU. It is worth noting that when only a few SSA runs need to be executed, the CPU outperforms both GPUs and MICs, because of its higher clock frequency. The MIC is faster than the CPU when about 15 SSA runs are performed (10 when using vector instructions). The break-even between CPU and GPU is around 20 simulations.
Simulation of SynSM
In the second batch of tests, we specifically focused on MIC architecture to investigate the impact of the size of the simulated model. To evaluate the performances of the Xeon Phi coprocessor, we randomly created a set of synthetic models of increasing size, i.e., 20 × 20, 40 × 40, 80 × 80, and 160 × 160, whose dimensions correspond to the number of chemical species and the number of reactions, respectively. The models were simulated using the SSA implementation, with and without the vector instructions enabled. For every model, each SSA run was executed for a total simulation time of t = 100 time units, storing the amount of 10 molecular species along 11 time points of the dynamics.
The results, summarized in Fig. 3 , show two relevant trends. First, although the running time constantly increases, it is not directly proportional to the size of the model: a single simulation takes 0.11 s for size 20 × 20 and 0.74 s for size 160 × 160. This means that a biochemical system that is eight times bigger (160×160 vs. 20×20) requires a running time that is 6.7× higher. If we consider the case of 240 simulations, the running times are 0.6 and 1.5 s, respectively, corresponding to an increment of just 3.4×. The second observation is that, using the vector instructions, the scalability is Fig. 3 show that, in the case of 240 simulations, if we compare sizes 20 × 20 and 160 × 160, we see that the running time of the largest model is only 2.2× greater than the smallest one. In the same situation, but without using vector instructions (empty bars), the running time of the largest model is only 3.38× greater than the smallest one. Stated otherwise, the use of vector instructions achieves better computational performances when the size of SynSM increases.
MIC's offload capability
As last test, we exploited the explicit offload capabilities of the Xeon Phi coprocessors. To do so, we modified the SSA implementation in two ways: (i) we linearized all data structures, so that the arrays could be automatically transferred to the Xeon Phi; (ii) we added the offload compiler pre-directives, which mark the regions of the source code that must be offloaded (in our case, the initialization of the system, the calculation of the propensity functions and the system's states update). According to our results, even in the case of "large" stochastic models of biochemical systems (e.g., 240×240 chemical species and reactions), the offload does not provide a relevant speedup. The rationale behind this is that SSA is a relatively simple algorithm, so that the reduced number of calculations distributed on MIC's cores-combined with the overhead due to MIC initialization and data transfersaffect the overall performances, making CPU more efficient for this task. Although it is possible that MIC could provide a more relevant speedup to simulate larger models, it is usually not trivial to define stochastic models of such complexity, due to the lack of quantitative parameters and of the initial molecular amounts that usually affect the availability of large-scale mechanistic models of biochemical systems. two alternative architectures: CPU and GPU. According to our results, MICs provide better performances than CPUs when more than 10 SSA runs are executed, although GPUs outperform both MICs and CPUs when more than 80 SSA runs are executed. It is worth noting that the running time on GPUs remains constant until computing resources (i.e., the cores) are available; in the case of MICs, the performances scale well until the maximum number of supported threads (i.e., 240) is reached. Moreover, MIC's performances in native mode are strongly improved when vector instructions are exploited, especially in the case of larger synthetic stochastic models. We also tested the offload capability of the Xeon Phi coprocessors, that is, the possibility of automatically distributing independent calculations over multiple threads. Our results showed that the offload of SSA does not provide a speedup, since the portions of code that can be offloaded consist in a few instructions only; therefore, the offload parallelization is affected by overheads and data transfers.
The empirical analyses shown in this work might facilitate the selection of the proper parallel architecture and SSA implementation when a large number of mutually independent simulations are needed, as is the case of many computationally expensive tasks typically carried out for in-depth investigations of biological systems (see [1, 3, 16, 19, 21] ). However, we highlight that some additional issues should be considered for the selection of a proper parallel architecture. Specifically, we discuss hereby also the efforts of code porting, the power consumption of the three architectures and the financial costs items.
Concerning the cost of code porting, in the case of general-purpose GPU computing on Nvidia video cards, the effort necessary to re-implement any algorithm using the CUDA programming technique is relevant. Since this issue would increase the time-to-market, it should be seriously taken into account. On the contrary, Xeon Phi coprocessors are supposed to be fully compatible with CPUs based on the x86 ISA; however, according to our experience, the code has to be slightly adapted to be correctly executed on MIC (considering the native mode). On the other hand, a comparison between the three architectures should also take into consideration the evaluation of costs, power consumption, and theoretical peak performance. At the time of writing, the CPU Intel Xeon Haswell E5-2630 v3 has a cost of around e600, with a power consumption of 80W and theoretical peak performance in double precision of about 500 GFlops, which is only reachable when fused multiply-add (FMA) and advanced vector instructions (AVX) are simultaneously exploited. The characteristics of the other devices are: e5800, 300W, and 1208 GFlops for Intel Xeon Phi 7120P; e4200, 300W, and 2910 GFlops for Nvidia Tesla K80. According to these data, the same theoretical peak of the Tesla K80 GPU can be achieved using either 6 CPUs or 3 Xeon Phi 7120P, with the consequent increment in terms of financial cost and power consumption. However, to fully leverage the computational power of the CPU, the implementation would require the extensive use of FMA, AVX, and multi-threading, requiring relevant modifications of the code.
As a final remark, we highlight that the performance of GPUs and MICs are affected by the Error Correcting Code (ECC), used to avoid any error caused by natural radiations [6] . This functionality is enabled on both accelerators on the GALILEO supercomputer and introduces a relevant overhead, mainly due to bits verification. According to the tests performed by Fang et al., the ECC on MIC coprocessors causes a bandwidth reduction greater than 20 % [6] . Tesla GPUs exploit ECC over the whole memory hierarchy, including the global memory, L1 and L2 caches, and registers [15] . In addition, in the case of GPUs, the bandwidth reduction is around 20 % [11], so we expect that the speedup obtained using GPUs and MIC for stochastic simulations could be further improved by disabling the ECC.
