Cortical synapse organization supports a range of dynamic states on multiple spatial and temporal scales, from synchronous slow wave activity (SWA), characteristic of deep sleep or anesthesia, to fluctuating, asynchronous activity during wakefulness (AW). Such dynamic diversity poses a challenge for producing efficient large-scale simulations that embody realistic metaphors of short-and long-range synaptic connectivity. In fact, during SWA and AW different spatial extents of the cortical tissue are active in a given timespan and at different levels, which implies a wide variety of loads of local computation and communication. A balanced evaluation of simulation performance and robustness should therefore include tests of a variety of cortical dynamic states. Here, we demonstrate performance scaling of our proprietary Distributed and Plastic Spiking Neural Networks (DPSNN) simulation engine in both SWA and AW for bidimensional grids of neural populations, which reflects the modular organization of the cortex. We explored networks up to 192×192 modules, each composed of 1250 integrate-and-fire neurons with spike-frequency adaptation, and exponentially decaying inter-modular synaptic connectivity with varying spatial decay constant. For the largest networks the total number of synapses was over 70 billion. The execution platform included up to 64 dual-socket nodes, each socket mounting 8 Intel Xeon Haswell processor cores @ 2.40GHz clock. Network initialization time, memory usage, and execution time showed good scaling performances from 1 to 1024 processes, implemented using the standard Message Passing Interface (MPI) protocol. We achieved simulation speeds of between 2.3×10 9 and 4.1×10 9 synaptic events per second for both cortical states in the explored range of inter-modular interconnections.
Introduction
At the large scale, the neural dynamics of the cerebral cortex result from an interplay between local excitability and the pattern of synaptic connectivity. This interplay results in the propagation of neural activity. A case in point is the spontaneous onset and slow propagation of low-frequency activity waves during the deep stages of natural sleep or deep anesthesia [1] [2] [3] [4] .
The brain in deep sleep expresses slow oscillations of activity at the single-neuron and local network levels which, at a macroscopic scale, appear to be synchronized in space and time as traveling waves (slow-wave activity, SWA). The "dynamic simplicity" of SWA is increasingly being recognized as an ideal test bed for refining and calibrating network models composed of spiking neurons. Understanding the dynamical and architectural determinants of SWA serves as an experimentally grounded starting point to tackle models of behaviorally relevant, awake states [5] [6] [7] . A critical juncture in such a logical sequence is the description of the dynamic transition between SWA and asynchronous, irregular activity (AW, asynchronous wake state) as observed during fade-out of anesthesia, for instance, the mechanism of which is still a partially open problem [7] [8] [9] . To help determine the mechanism of this transition, it may be of interest to identify the factors enabling the same nervous tissue to express global activity regimes as diverse as SWA and AW. Understanding this repertoire of global dynamics requires highresolution numerical simulations of large-scale networks of neurons which, while keeping a manageable level of simplification, should be realistic with respect to both nonlinear excitable local dynamics and to the spatial dependence of the synaptic connectivity (as well as the layered structure of the cortex) [10] [11] [12] [13] .
Notably, efficient brain simulation is not only a scientific tool, but also a source of requirements and architectural inspiration for future parallel/distributed computing architectures, as well as a coding challenge on existing platforms. Neural network simulation engine projects have focused on: flexibility and user friendliness, biological plausibility, speed and scalability (e.g. NEST [14, 15] , NEURON [16, 17] , GENESIS [18] , BRIAN [19, 20] ). Their target execution platforms can be either homogeneous or heterogeneous (e.g. GPGPU-accelerated) high-performance computing (HPC) systems, [21] [22] [23] , or neuromorphic platforms, for either research (e.g. SpiNNaker [24] , BrainScaleS [25] ) or application purposes (e.g. TrueNorth [26] ).
From a computational point of view, SWA and AW pose different challenges to simulation engines, and comparing the simulator performance in both situations is an important element in assessing the general value of the choices made in the code design. During SWA, different and limited portions of the network are sequentially active, with a locally high rate of exchanged spikes, while the rest of the system is almost silent. On the other hand, during AW the whole network is homogeneously involved in lower rate asynchronous activity. In a distributed and parallel simulation framework, this raises the question of whether the computational load on each core and the inter-process communication traffic are limiting factors in either cases. We also need to consider that activity propagates for long distances across the modeled cortical patch, therefore the impact of spike delivery on the execution time depends on the chosen connectivity. Achieving a fast and flexible simulator, in the face of the above issues, is the purpose of our Distributed and Plastic Spiking Neural Networks (DPSNN) engine. Early versions of the simulator [27] originated from the need for a representative benchmark developed to support the hardware/software co-design of distributed and parallel neural simulators. DPSNN was then extended to incorporate the event-driven approach of [28] , implementing a mixed time-driven and event-driven strategy similar to the one introduced in [29] . Here, we report the performances of DPSNN in both slow-wave (SW) and AW states, for different sizes of the network and for different connectivity ranges. Specifically, we discuss network initialization time, memory usage and execution times, and their strong and weak scaling behavior.
Materials and methods

Description of the Distributed Simulator
DPSNN has been designed to be natively distributed and parallel, with full parallelism also exploited during the creation and initialization of the network. The full neural system is represented in DPSNN by a network of C++ processes equipped with an agnostic communication infrastructure, designed to be easily interfaced with both Message Passing Interface (MPI) and other (custom) software/hardware communication systems. Each C++ process simulates the activity of one or more clusters of neighboring neurons and their set of incoming synapses. Neural activity generates spikes with target synapses in other processes; the set of "axonal spikes" is the payload of the associated exchanged messages. Each axonal spike carries the identity of its producing neuron and its original time of emission (AER, Address Event Representation [30] ). Axonal spikes are only sent to target processes where at least one target synapse exists.
The memory cost of point-like neuron simulation is dominated by the representation of recurrent synapses which, in the intended biologically plausible simulations, are numbered in thousands per neuron. When plasticity support is switched off, the local description of each synapse includes only the identity of the target and source neurons, the synaptic weight, the transmission delay from the pre-to the post-synaptic neuron, and an additional optional identifier for possible different types of synapse (see Table 1 ). When the synaptic plasticity support is switched on, each synapse also takes note of its own previous activation time and a variable is computed which is a low-pass filter of the sequence of spike-timingdependent plasticity (STDP) potentiation and depression contributions, with a longer timescale than that of neural dynamics, which is used to compute the update of synaptic weights. In the present work, synaptic plasticity is kept off.
During the initialization phase only, each synapse is represented at both the processes storing the source and target neuron. Indeed this is the moment of peak memory usage. Afterwards, the initialization synapses are stored exclusively in the process-hosting target neurons. For each process, the list of all incoming synapses is maintained. The synaptic list is double-ordered: synapses with the same delay are grouped together and are further ordered by presynaptic neuron index as in [28] . Incoming spikes are ordered according to the identity of the presynaptic neurons. The ordering of both the synaptic list and incoming spikes helps a faster execution of the demultiplexing stage that translates incoming spikes into synaptic events because it increases the probability of contiguous memory accesses while exploring the synaptic list.
The neuron data structure is relatively small. It mainly contains the parameters used to describe the dynamics of the single-compartment point-like neuron. Moreover, it also contains the queue of input spikes (both recurrent and external) that arrived during the previous simulation time step, with their associated post-synaptic current value and time of arrival. External spikes are generated as a Poissonian train of synaptic inputs.
In DPSNN there is no memory structure associated with external synapses: for each neuron, for each incoming external spike the associated synaptic current is generated on the fly from a Gaussian distribution with assigned mean and variance (see Section 2.2); therefore external spikes have a computational cost but a negligible memory cost. The described queuing system ensures that the full set of synaptic inputs, recurrent plus external, are processed using an event-driven approach.
Execution Flow: Overview of the Mixed Time-and Event-driven Simulation Scheme
There are two phases in a DPSNN simulation: (1) the creation of the structure of the neural network and its initialization; (2) the simulation of the dynamics of neurons (and synapses, if plasticity is switched on). For the simulation phase we adopted a combined event-driven and time-driven approach partially inspired by [29] . Synaptic events drive the neural dynamics, while the message passing of spikes among processes happens at regular time steps, which must be set shorter than the minimum synaptic delay to guarantee the correct causal relationship in the distributed simulation. A 1 ms time-step was adopted here for the message passing, a timescale comparable to the action potential duration setting a kind of absolute refractory period. Figure 1 describes the main blocks composing the execution flow and the event-or time-driven nature of each block. The simulation phase can be broken down into the following phases: (1) identification of the subset of neurons that spiked during the previous time step, and (when plasticity is switched on) computation of an event-driven STDP contribution; (2) spikes are sent to the cluster of neurons where target synapses exist (inter-process communication blocks in the figure); (3) the list of incoming spikes to each process is placed into the double-ordered synaptic list, waiting for a number of time steps that match the synaptic delays, at which point the corresponding target synapses are activated; (4) synapses inject their event into queues that are local to their post-synaptic neuron and compute the STDP plasticity contribution; (5) each neuron sorts the lists of input events produced by recurrent and external synapses; (6) for each event in the queue, the neuron integrates its own dynamic equations using an event-driven solver. Periodically, at a slower time step (1 s in the current implementation), all synapses modify their efficacies using the integrated plasticity signal described above. Later sections describe the individual stages and data representation in further detail. Figure 1 : Execution flow. Each software process composing a DPSNN simulation represents one (or more) clusters of spatially neighboring neurons and their incoming synapses. Each process iterates over the blocks listed here which simulates the activity of local neurons and incoming synapses, their plasticity and the exchange of messages through axo-dendritic arborizations. It is a sequential flow of event-driven and time-driven computational and communication (inter-and intra-process) blocks (see the label on the left of each block). The measure of the relative execution times of the blocks is used to guide the optimization effort and to drive the co-design of dedicated hardware/software architectures.
Spike Messages: Representation and Communication
Spike messages are defined according to an AER, where each spike is represented by the identity of the spiking neuron, and the spiking time. Spikes sharing a target process (i.e., targeting the cluster of neurons managed by a process) are packed into "axonal spikes" messages that are delivered from the source process to its target processes during the communication phase. During execution, the "axon" arborization is managed by the target process in order to reduce the network communication load. Its construction during the initialization phase involves the following steps: in a "fan-out" step, each process notifies the processes which will host the target synapses of its own neurons. In the next "fanin" step, each target process stores its own list of source neurons, and the associated double-ordered list of incoming synapses. For details about the initial construction of the connectivity infrastructure and the delivery of spiking messages see Supplementary Materials.
Bidimensional Grids of Cortical Columns and their Mapping onto Processes
Neural networks are organized in this study as bidimensional grids of modules (mimicking cortical columns) as in [31] . Each module is composed of 1250 point-like neurons, and further organized into subpopulations. The set of cortical columns can be distributed over a set of processes (and processors): each process can either host a fraction of a column (e.g. ½, ¼. . . ), a whole single column, or several columns (up to 576 columns per process for the experiments reported in this paper). In this respect, our data distribution strategy aims to memorize on the same memory neighboring neurons and their incoming synapses. As usual in parallel processing, for a given problem size the performance tends to saturate and then to worsen, as it is mapped onto an excessively large number of processors, so that for each grid size an optimal partitioning onto processes is heuristically established, in terms of the performance measure defined in Section 2.4.
Each process runs on a single processor; if hyperthreading is active, even more than one process can be simultaneously executed on the same processor. For the measures in this paper, hyperthreading was deactivated, so that one hardware core always runs a single process. In DPSNN, distributing the simulation over a set of processes means that both the initialization and the run phases are distributed, allowing the scaling of both phases with the number of processes, as demonstrated by the measures reported in Section 3.
Model Architecture and Network States
For each population in the modular network, specific parameters for the neural dynamics can be defined, as well as specific intra-and inter-columnar connectivity and synaptic efficacies. Connectivity among different populations can be modeled with specific laws based on distance-dependent probability, specific to each pair of source and target subpopulation. By suitably setting the available interconnections between different populations, cortical laminar structures can also be potentially modeled in the simulation engine.
The grids (see Table 2 ) are squares in a range of sizes (24×24, 48×48, 96×96, 192×192). Each local module is always composed of K = 1250 neurons, further subdivided into subpopulations. We implemented a ratio of 4:1 between excitatory and inhibitory neurons (K I =250 neuron/module); in each module, excitatory neurons (E) were divided into two subpopulations: K F =250 (25%) strongly coupled "foreground" neurons (F), having a leading role in the dynamics, and K B =750 (75%) "background" neurons (B) continuously firing at a relatively low rate. Populations on the grid are connected to each other through a spatial connectivity kernel. The probability of connection from excitatory neurons decreases exponentially with the inter-module distance d:
More specifically, d is the distance between the source (s = {F, B}), and target (t = {F, B, I}) module, and λ a characteristic spatial scale. Simulations are performed considering different λ values (0.4, 0.5, 0.6, 0.7), but C 0 tsλ is set so as to generate the same mean number of projected synapses per neuron (M t = 0.9 * K t , t = {F, B, I}) for all λ values. Connections originating from inhibitory neurons are local (within the same local module) and also in this case M t = 0.9 * K t , t = {F, B, I}. All neurons of the same type (excitatory/inhibitory) in a population share the same mean number of incoming synapses. The connectivity has open boundary conditions on the edges of the two-dimensional surface.
Synaptic efficacies are randomly chosen from a Gaussian distribution with mean J ts and SD ∆J ts , chosen in different experiments so as to set the system in different working regimes and simulate different states. The procedure for the selection of the efficacies is based on a mean-field method described below. Each neuron also receives spikes coming from neurons belonging to virtual external populations, collectively modeled as a Poisson process with average spike frequency ν ext and synaptic efficacy J ext . Excitatory neurons are point-like leaky integrate-and-fire (LIF) neurons with spike frequency adaptation (SFA) [31, 32] . SFA is modeled as an activity-dependent self-inhibition, described by the fatigue variable c(t). The time evolution of the membrane potential V(t), and c(t), of excitatory neurons between spikes is governed by:
τ m is the membrane characteristic time, C m the membrane capacitance, and E the resting potential. SFA is not considered for inhibitory neurons; that is, in (2), the second equation and the g c c C m term in the first are absent. Incoming spikes, generated at times t i , reached the target neuron with delay δ i and provoked instantaneous membrane potential changes of amplitude J i . Alike, external stimuli produce a J i ext increment in the membrane potential, with t poiss i
representing the spike times generated by a Poisson distribution of average ν ext . An output spike at time t
sp was triggered if the membrane potential exceeded a threshold V θ . On firing, the membrane potential was reset to V r for a refractory period τ arp , whereas c was increased by the amount α c . Once the network connectivity and the neural dynamics have been defined, synaptic efficacies and external stimuli can be set to determine the dynamical states accessible to the system, by means of mean-field theory. The mean-field dynamics for the average activity ν i , i = {F, B, I} is determined by the gain functions φ i as follows:
where τ E and τ I are phenomenological time constants. The interplay between the recurrent excitation embodied in the gain function, and the activity-dependent self-inhibition, is the main driver of the alternation between a high-activity (Up) state and a low-activity (Down) state [31, 33] . In the mean-field description of the neural module, synaptic connectivity is chosen so as to match the total average synaptic input the neurons would receive inside the multi-modular network. The fixed points of the dynamics expressed by equations (3) can be analyzed using standard techniques [34] . The nullclines of the system (whereν = 0 orċ = 0) cross at fixed points that can be predicted to be either stable or unstable. In the simulations described here, recurrent synapses J t,s connecting source population s and target population t, and external synapses J t,ext , were randomly chosen from a Gaussian distribution with mean J t,s and variance ∆J t,s = 0.25 × J t,s .
We relied on mean-field analysis to identify neural and network parameters setting the network's modules in SW or AW dynamic regimes. Figure 2A shows an example of nullclines for the mean-field equations (3) of a system displaying SW. The black S-shaped line is the nullcline for the rate ν while the red straight line is the one for the fatigue variable c (for details see [31, 33] ). The stable fixed point, at the intersection of the nullclines, has a low level of activity and is characterized by a small basin of attraction: the system can easily escape from it thanks to the noise, and it gets driven towards the upper branch of the ν nullcline from which, due to fatigue, it is attracted back to the fixed point, thus generating an oscillation (see Figure 2B) .
Network parameters can also be set in order to have an asynchronous state, mainly by setting a lower Foreground-to-Foreground (FF) synaptic efficacy, which generates a more linear ν nullcline close to the fixed point (see Figure 2C ). In this case the basin of attraction of the fixed point is larger, and oscillations do not occur, resulting in a stationary asynchronous state (see Figure 2D ), in which neural activity fluctuates around the mean-field fixed point. 4 in Supplementary Materials reports the complete list of synaptic parameters for both SW and AW states. 
Neural Activity Analysis
The simulation generates the spikes produced by each neuron in the network. From these, the time course of the average firing rate for each subpopulation in the network can be computed using an arbitrary sampling step, which here we set to be 5 ms. Power spectra are computed using the Welch method (see Figures 6B and 6E for examples of SW and AW power spectra including delta band).
Wavefront propagation is analyzed using techniques already developed in a previous paper [31] . A proxy to the multi-unit activity (MUA) of each cortical module is computed as the average firing rate of its subpopulations, with added white noise (with mean 0 and variance 0.5) to emulate unspecific background fluctuations from neurons surrounding the module. This methodological choice allows to map quantitatively the simulated network activity onto the MUA experimentally measured.
For SW, this signal altered between high and low activity states (Up and Down states). T(x, y), the Down-to-Up transition time, is determined by a suitable threshold and is detected for each point in the spatial grid, defining the propagation wavefront. V(x, y), the absolute value of wavefront speed, can be computed as
The average speed during the collective Down-to-Up transitions is obtained by averaging V(x, y) over all the simulated positions (x, y).
Performance Measures
Performances are quantified in terms of "strong scaling" and "weak scaling": the former refers to keeping the problem size fixed and varying the number of hardware cores, while the latter refers to the increase in equal proportions of the problem size and the number of hardware cores.
As a performance measure we computed the equivalent synaptic events per second as the product of the total number of synapses (recurrent and external) and the number of spikes their received across the whole simulation, divided by the elapsed execution time. This way, a comparison of the simulation cost among different problem sizes and hardware/software resources (core/processes) can be captured in a single graph. Similarly, we can define a convenient metric to evaluate the memory efficiency of a simulation: by dividing the total memory required by the simulation by the number of recurrent synapses to be represented. Indeed, as stated in Section 2.1 we expect the memory usage to be dominated by the representation of synapses which are thousands per neuron. Scalability measurements were taken on different neural network sizes, varying the size of the grid of columns and, for each size, distributing it over a different span of MPI processes. We selected four grid sizes: 24×24 columns, including 0.7M neurons and 1.1G synapses; 48×48 columns including 2.8M neurons and 4.4G synapses; 96×96 columns including 12M neurons and 17.6G synapses; 192×192 columns including 46M neurons and 70G synapses. The number of processes over which each network size is distributed varied from a minimum, bounded by memory, and a maximum, bounded by communication or HPC platform constraints (see Table 2 ). For the 192×192 configuration, only one measure was taken because of memory requirements, corresponding to a run distributed over 1024 MPI processes/hardware cores.
Execution times for SW were measured across the time elapsed between the rising edges of two subsequent waves, and for AW across a time span of 3 s. In both cases, initial transients were omitted. 
Validation of Results and Comparison of Performances
We used NEST version 2.12, the high-performance general purpose simulator developed by the NEST Initiative, as a reference for the validation of results produced by the specialized DPSNN engine and for the comparison of its performances (speed, initialization time, memory footprint). The comparison was performed for both SW and AW dynamical states, for a network of 4.4G synapses (48×48 grid of columns). We assessed the correctness of results using power spectral density (PSD) analysis of the temporal evolution of the firing rates of each subpopulation. Figure 3 reports an example of this comparison, showing the PSD match of NEST and DPSNN for each single subpopulation in a randomly chosen position inside the grid of columns. 
Execution Platforms
The benchmark measures assessing the DPSNN scalability in terms of run time, initialization time, and memory usage herein described, have been performed on the Galileo server platform provided by the CINECA consortium. Galileo is a cluster of 516 IBM nodes, each of which includes 16-cores, distributed on two Intel Xeon Haswell E5-2630 v3 octa-core processors clocked at 2.40GHz. The nodes are interconnected through an InfiniBand network. Due to the specific configuration of the server platform, the maximum-allowed partition of the server includes 64 nodes. Hyperthreading is disabled on all cores, therefore the number of MPI processes launched during each run exactly matches the hardware cores, with a maximum of 1024 hardware cores (or equivalently MPI processes) available for any single run.
In the present version of the DPSNN simulator, parameters cannot be changed at runtime, and dynamic equations are explicitly coded (i.e., no meta-description is available as an interface to the user). Simulations used for validation and comparison between DPSNN and NEST are run on a smaller cluster of eight nodes interconnected through a Mellanox InfiniBand network; each node is a dualsocket server with a six-core Intel(R) Xeon(R) E5-2630 v2 CPU (clocked at 2.60GHz) per socket with HyperThreading enabled, so that each core is able to run two processes.
Results
Initialization Times
The initialization time, in seconds, is the time required to complete the setup phase of the simulator, which is necessary to build the whole neural network. Measures reported in Figure 4A show the scaling of the DPSNN initialization time for four network grid sizes, distributed over a growing number of hardware cores, which also correspond to MPI processes. Dashed colored lines represent strong scaling. Weak scaling is represented by the four points connected by the dotted black line (each point refers to a four-fold increase in the network size and number of used cores with respect to the previous one). For the explored network sizes and hardware resources, the initialization time speedup is almost ideal for fewer cores, while for the highest numbers of cores is sub-ideal by a factor between 10% and 20%. Figure 4B reports the scaling of the initialization time for two different values of the average distance between interconnected modules (λ), for a single grid size (96×96 columns). As already explained, for all simulations we kept the total number of synapses per neuron constant. Going from λ = 0.4 to λ = 0.6, each excitatory neuron in a column has synaptic connections with neurons in a number of other columns growing from 44 to 78 (neglecting variations to proximity to columns' boundaries). Almost the same scaling is observed for both values of λ, with expected higher values for the initialization of a network with longer-range connectivity. Indeed, a dependence of the initialization time on synaptic connectivity range is expected, because it affects the proportion of target synapses residing in different columns, for which MPI messaging is needed for the connectivity setup, as explained in Materials and Methods. Table 2 . The value of λ is kept constant, equal to 0.4 for the four networks. The nearly linear decrease in initialization time for fixed problem sizes indicates an efficient strong scaling, while the quality of weak scaling can be evaluated by observing the moderate increase of the initialization time for a four-fold increase in network size and hardware resources (black dotted line). (B) Initialization time scaling for a network of 17.6G synapses (96x96 grid of columns) using two different λ values: 0.4 (light green) and 0.6 (dark green). For higher connectivity ranges, an increase in the initialization time is registered.
Memory Occupation
In DPSNN, we expect the memory usage to be dominated by the representation of synapses, which are thousands per neuron, with only a minor impact of the representation of neurons and external synapses. Therefore, we defined a memory consumption metric as the total required memory divided by the number of recurrent synapses (see Section 2.4).
Each static synapse needs 12 bytes (see Table 1 ). As described in Section 2.1, during network initialization synapses are generated by the process storing the source neuron. Subsequently, their values are communicated to the process containing the target neuron. The result is that, only during the initialization phase, each synapse is represented at both source and target processes. Therefore, we expect a minimum of 24 bytes/(recurrent synapse) to be allocated during initialization, which is the moment of peak memory usage.
The measured total memory consumption is between 25 and 32 bytes per recurrent synapse, for all simulations performed, from 1 to 1024 MPI processes ( Figure 5 ). From Figure 5A we observe a different trend of the memory cost vs number of cores, below and above the threshold of single-node platforms (16 cores, see Section 2.6). Beyond a single node, additional memory is mainly required by the buffers allocated by MPI interconnect libraries that adopt different strategies for communications over shared memory (inside a node) or among multiple nodes ( Figure 5A ). The memory occupation per synapse has been measured for different network sizes. Figure 5A shows that, for a given number of cores, the memory overhead typically decreases for increasing size of the network, as expected.
Figure 5B shows the impact of the average distance λ between interconnected modules on the memory footprint. Here the network has size 96×96 columns (for a total of 17.6G synapses), and λ = 0.4, 0.6. As expected the memory footprint, mainly dependent on the total number of synapses, essentially remains constant. Note that the relative differences for λ = 0.4, 0.6 decrease as the number of cores increases; this is consistent with the fact that, given the network size, for larger core numbers more columns get distributed on processes residing in different cores. This dilutes the effect of λ, which goes in the same direction. 
Simulation Speed and its Scaling
Impact of Simulated State (SW or AW)
DPSNN execution times for SW and AW state simulations are reported in Figure 6 . Figures 6A and 6B show example snapshots of the simulated network activity in the SW and AW states, respectively; Figures 6C and 6D show corresponding power spectra of the network activity, confirming the main features predicted by theory for such states (such as the low-frequency power increase for SW, the highfrequency asymptote proportional to the average firing rate, the spectral resonances related to SFA, and delays of the recurrent synaptic interaction). Figures 6E and 6F represent strong scaling for SW and AW: the wall-clock time required to simulate 1s of activity, vs number of processes, for four network sizes (see Table 2 ). In the same plots, weak scaling behavior is measured by joining points referring to a four-fold increase in both the network size and the number of processes.
Simulation speeds are measured using the equivalent synaptic-events-per-second metric, defined in Section 2.4. Departures from the ideal scaling behavior (linearly decreasing strong scaling plots, horizontal weak scaling plots) are globally captured in Figure 7 , for both SW and AW states and for all the problem sizes.
As numbers representative of the good scalability of the simulator, we remark that in the worst cases (1024 processes; 96×96 for SW, 192×192 for AW), the speedup with respect to the reference case (24×24 on a single process) is about 590 for AW and 460 for SW, compared with the ideal speedup of 1024.
We also notice that performance scaling is slightly better for AW than SW, which is understood in terms of workload balance (due to the more homogeneous use of resources in AW); it is in any case remarkable that, although SW and AW dynamic states imply very different distributions of activity in space and time, the simulator provides comparable performance scaling figures.
Impact of Communication
The impact of communications on DPSNN performances has been evaluated both on SW and AW state simulations, using two different approaches. In the case of SW simulations, the analysis has been carried out with varying λ (therefore varying the ratio of local versus remote excitatory synaptic connections, at a fixed total number of synapses per neuron): λ = 0.4 (60 % local connectivity), λ = 0.6 (35 % local . (E-F) Scaling of wall-clock execution time for 1 s of SW (E) and AW (F) simulated activity. In both SW and AW states, the scaling has been measured on four different network sizes, as in Table 2. connectivity); clearly, higher λ results in higher payload in communication between processes. Also, Figure 8A shows the known linear dependence of the slow-wave speed on λ [35, 36] . As the wave speed increases, the duration of the Up states stays approximately constant (not shown), so that an increasing portion of the network is simultaneously activated, which in turn may impact the simulation performance. In physical units, for an inter-modular distance (imd) of 0.4 mm, λ = 0.6 imd implies a spatial decay scale of 0.24 mm, and the corresponding speed is approximately 15 mm/s, which is in the range of biologically plausible values [31, [37] [38] [39] [40] . Figure 8B shows that the impact of λ on SW simulations is almost negligible. Figure 9 shows, instead, the impact of different mean firing rates on AW simulations. Higher firing rates imply higher payloads. DPSNN also demonstrates good scaling behavior in this case, with a slight performance increase for systems with a higher communication payload; this latter feature is due to communication costs being typically dominated by latencies and not bandwidth in spiking network simulations. Table 3 reports a performance comparison between DPSNN and NEST using the configuration, described in Section 2.5. DPSNN is about 130 times faster than NEST for SW simulations and about 80 times faster for AW cases, and its initialization is about 19 times faster. The memory footprint of DP-SNN is about 2.5 times lower. For the comparison of execution speed, we selected the best execution time for each simulator, for a fixed amount of used hardware resources; that is, the number of nodes of the cluster server. On the NEST simulator, we explored the space of all possible combination of MPI processes and number of threads during the AW simulations, in order to find the configuration performing better on a fixed number of hardware resources. In the same configuration, we also compared the initialization phase, the memory usage, and the SW execution times. Table 3 : Memory footprint, initialization and execution times for 10 s of activity in SW and AW states required by DPSNN and NEST to simulate a neural network made of 2.8M neurons and 4.4G synapses (48×48 grid of cortical columns). The second column reports the number of processes over which simulations are distributed. In case of DPSNN it corresponds to the number of pure MPI processes, while for NEST it is the number of Virtual Processes (VP). Each VP is calculated as MPI processes × number of threads, where six threads are used for all NEST simulations. 
DPSNN and NEST: Comparison of Performances
Discussion
We presented a parallel distributed neural simulator, with emphasis on the robustness of its performance and scaling with respect to quite different collective dynamical regimes. This mixed time-and eventdriven simulation engine ( Figure 1 ) has been used to simulate large-scale networks including up to 46 million point-like spiking neurons interconnected by 70 billion instantaneous current synapses. The development of DPSNN originated from the need for a simple, yet representative benchmark (i.e. a mini-application) developed to support the hardware/software co-design of distributed and parallel neural simulators. Early versions of DPSNN [27] have been used to drive the development of the EURETILE system [41] in which a custom parallelization environment and the APEnet hardware interconnect were tested. DPSNN was then extended to incorporate the event-driven approach of [28] , implementing a mixed time-driven and event-driven strategy inspired by [29] . This simulator version is also currently included among the mini-applications driving the development of the interconnect system of the EXANEST ARM-based HPC architecture [42] . In the framework of the Human Brain Project (https://www.humanbrainproject.eu), DPSNN is used to develop high-speed simulation of SW and AW states in multi-modular neural architectures. The modularity results from the organization of the network into densely connected modules mimicking the known modular structure of cortex. In this modeling framework, the inter-modular synaptic connectivity decays exponentially with the distance.
In this section, we first discuss the main strengths of the simulation engine and then put our work into perspective by comparing the utility and performances of such a specialized engine with those of a widely used general-purpose neural simulator (NEST).
Speed, Scaling and Memory Footprint at Realistic Neural and Synaptic Densities
DPSNN is a high-speed simulator. The speed ranged from 3 × 10 9 to 4.1 × 10 9 equivalent synaptic events per wall-clock second, depending on the network state and the connectivity range, on commodity clusters including up to 1024 hardware cores. As an order of magnitude, the simulation of a square cortical centimeter (∼ 27 × 10 9 synapses) at realistic neural and synaptic densities is about 30 times slower than real time on 1024 cores ( Figure 7) . DPSNN is memory parsimonious: the memory required for the above square cortical centimeter is 837 GB (31 byte/synapses, including all libraries overheads), which is in the range of commodity clusters with few nodes. The total memory consumption ranges between 25 and 32 byte per recurrent synapse for the whole set of simulated neural networks and all configurations of the execution platform (from 1 to 1024 MPI processes, Figure 5 )
The engine has very low initialization times. DPSNN takes about 4 s to set up a network with ∼ 17 × 10 9 synapses (Figure 4) . We note that the initialization time is relevant, especially when many relatively short simulations are needed to explore a large parameter space.
Its performances are robust: good weak and strong scaling have been measured in the observed range of hardware resources and for all sizes of simulated cortical grids. The simulation speed was nearly independent from the mean firing rate (Figure 9 ), the range of connectivity (Figure 8) , and from the cortical dynamic state (AW/SW) (Figure 7 ).
Comparison with a State-of-the-art User-friendly Simulator and Motivations for Specialized Engines
There is a widely felt need for versatile, general-purpose neural simulators that offer a user-friendly interface for designing complex numerical experiments and provide the user with a wide set of models of proven scientific value. This boosted a number of initiatives (notably the NEST initiative, now central to the European Human Brain Project). However, such flexibility comes at a price. Performanceoriented engines, missing all the layers required for offering user generality and flexibility, contain the bare minimum code. In the case of DPSNN, this resulted in higher simulation speed, reduced memory footprint, and diminished initialization times (see Section 3.4 and Table 3 ). In addition, optimization techniques developed for such engines on use-cases of proven scientific value can offer a template for future releases of general-purpose simulators. Indeed, this is what is happening in the current framework of cooperation with the NEST development team. Finally, engines stripped down to essential kernels constitute more easily manageable mini-application benchmarks for the hardware/software co-design of specialized simulation systems, because of easier profiling and simplified customizations on system software environments and hardware targets under development.
Future Works
The present implementation of DPSNN demonstrated to be efficient for homogeneous bidimensional grids of neural columns and for their mapping of up to 1024 processes, and this facilitates a set of interesting scientific applications. However, further optimization could improve DPSNN performances, either in the perspective of moving simulations toward million-core exascale platforms or targeting realtime simulations at smaller scale ( [43]), in particular addressing sleep-induced optimizations in cognitive tasks like classification ( [44] ). For instance, we expect that the delivery of spiking messages will be a key element to be further optimized (e.g., using a hierarchical communication strategy). This will also result beneficial for an efficient support of white-matter long-range connectivity (brain connectomes) between multiple cortical areas.
[ [44] C. Capone, E. Pastorelli, B. Golosio, and P. S. Paolucci, "Sleep-like slow oscillations improve visual classification through synaptic homeostasis and memory association in a thalamo-cortical model," Preprint available at https://arxiv.org/abs/1810.10498, 2018.
Supplementary Material
Initial Construction of Connectivity Infrastructure
During the initialization phase, each process contributes to an awareness about the subset of processes that should be listened to during subsequent simulation iterations. At the end of this construction phase, each "target" process should know about the subset of "source" processes that need to communicate with it, and should have created its database of locally incoming axons and synapses. A simple implementation of the construction phase can be realized using two steps. During the first step, each source process informs other processes about the existence of incoming axons and about the number of incoming synapses to be established. A single word, the synapse counter, is communicated among pairs of processes. Under MPI, this can be achieved by an MPI Alltoall(). Performed once, and with a singleword payload, the cost of this first step creates a cumulative network load proportional to the square of the number of processes. The cost of this operation is negligible in the range of processes explored by this paper. The second step transfers the identities of synapses to be created on each target process. Under MPI, the payload, a list of synapses specific for each pair in the subset of processes to be connected, can be transferred using a call to the MPI alltoallv() library function. The cumulative load created by this second step is proportional to the product of the total number of processes and the subset of target processes reached by each source process. The first step produces two effects: (1) it reduces the cost of the initial construction of synapses, the second step of the construction phase; and (2) the knowledge about the non-existence of a connection between a pair of processes can be used to reduce the cost of spiking transmission during the simulation iterations.
Delivery of Spiking Messages during the Simulation Phase
Here, we describe the present implementation of the delivery of spiking messages. In this first implementation, we did not take advantage of the possibility of delivering spikes to targets just before the deadline imposed by the synaptic-specific delay. Instead, we used a synchronous approach: all spikes are delivered to target processes before proceeding to the simulation of the next time iteration of the neural dynamic. The delivery of spiking messages can be split into two steps, with communications directed toward subsets of decreasing sizes. During the first step, single-word messages (spike counters) are sent to the subset of potentially connected target processes. On each pair of source-target process subsets, the individual spike counter gives information about the actual payload (i.e., axonal spikes) that will have to be delivered, or about the absence of spikes to be transmitted between the pair. The knowledge of the subset has been created during the first step of the initialization phase, described in a previous section. The second step uses the spiking counter information to establish a communication channel solely between pairs of processes that actually need to transfer an axonal spike payload during the current simulation time iteration. In MPI, both steps can be implemented using calls to the MPI Alltoallv() library function. However the two calls establish actual channels among sets of processes of decreasing size, as described previously.
Simulation Parameters
The connectivity of a single module can be fully described by setting the values of recurrent synaptic efficacies (J 0 ts ) and of external stimuli (J 0 t,ext ). According to mean-field theory, specific values must be set depending on the network state that has to be simulated. Table 4 summarizes the values of synaptic efficacies, both for recurrent and external connectivity and for each simulated state. The dynamic of the LIF neurons with SFA, used throughout all the simulations reported in this paper, is described by equation 2 in the main article. The values of all the parameters are summarized in Table  4 . 
