arXiv:1906.02757v1 [q-bio.NC] 6 Jun 2019
Telling neuronal apples from oranges: analytical
performance modeling of neural tissue simulations
Francesco Cremonesi1 and Felix Schürmann1*
1 Blue Brain Project, Brain Mind Institute, École polytechnique fédérale de
Lausanne, Switzerland
* felix.schuermann@epfl.ch
Abstract
Computational modeling and simulation have become essential tools in the quest
to better understand the brain’s makeup and to decipher the causal interrela-
tions of its components. The breadth of biochemical and biophysical processes
and structures in the brain has led to the development of a large variety of model
abstractions and specialized tools, oftentimes requiring high performance com-
puting resources for their timely execution. What has been missing so far is
an in-depth analysis of the complexity of the computational kernels, hindering
a systematic approach to identifying bottlenecks of algorithms and hardware,
and their specific combinations. If whole brain models are to be achieved on
emerging computer generations, models and simulation engines will have to be
carefully co-designed for the intrinsic hardware tradeoffs. For the first time, we
present a systematic exploration based on analytic performance modeling. We
base our analysis on three in silico models, chosen as representative examples
of the most widely employed modeling abstractions: current-based point neu-
rons, conductance-based point neurons and conductance-based detailed neurons.
We identify that the synaptic modeling formalism, i.e. current or conductance-
based representation, and not the level of morphological detail, is the most sig-
nificant factor in determining the properties of memory bandwidth saturation
and shared-memory scaling of in silico models. Even though general purpose
computing has, until now, largely been able to deliver high performance, we
find that for all types of abstractions, network latency and memory bandwidth
will become severe bottlenecks as the number of neurons to be simulated grows.
By adapting and extending a performance modeling approach, we deliver a first
characterization of the performance landscape of brain tissue simulations, allow-
ing us to pinpoint current bottlenecks for state-of-the-art in silico models, and
make projections for future hardware and software requirements.
Introduction and motivation
In the field of computational neuroscience, simulations of biological neural net-
works represent one of the fundamental tools for hypothesis testing and explo-
ration. A widely used scale of representation is that of neuron-based approaches, i.e.
models of brain tissue in which the fundamental unit is represented by a neuronal
cell. This representation is important as it allows for a faithful matching of the
model with a range of anatomical and electrophysiological data [18,19,47,58,59].
While determining the adequate level of detail is a formidable challenge with
respect to the system modeled, so is achieving an efficient implementa-
tion of the simulation in software. The size of such networks, both in terms of
number of neurons and synapses and rate of synaptic events, as well as the level
of biological detail required to answer meaningful questions about the brain,
mean that these simulations come at a large computational cost.
One axis of exploration to mitigate the computational cost of brain tissue sim-
ulation has been to increase the number of compute nodes and benefit from the
improved performance of individual processors over time, leveraging a constant
trend of ever more powerful computing capabilities of microprocessors dictated
by Moore’s law. Methods that exploit the distributed nature of computer clus-
ters were first introduced in [51] for point neurons and [50] for morphologically
detailed models. Such techniques were then refined and extended to exploit
shared-memory parallelism in [14,57], while recently developed simulators such
as CoreNEURON [40], Auryn [73] and Arbor [36] have these fundamental ca-
pabilities built-in from the beginning.
A second approach has been to design efficient numerical algorithms and
software able to optimize resource usage. A review of numerical methods with a
focus on computational neuroscience was given in [48], while a staggered time
integration method and a neuron representation scheme that allow a large class
of morphologically detailed neuron models to be solved with linear complexity
were introduced in [23]. Time integration techniques that allow adaptation of the
timestep to maximize accuracy and efficiency were explored in [30] for point
neurons and in [43] for small networks of morphologically detailed neurons. To
further improve simulation speed, a method that preserves numerical stability
while allowing individual morphologically detailed neurons to be split across
multiple parallel processes was developed in [25]. A technique to convert
morphologically detailed neurons into equivalent point neuron models, which
increases simulation speed while preserving the dynamics of the original network
of detailed neurons, was introduced in [59].
In scaling to very large networks of neurons, in silico neuroscientists soon
realized that interprocess communication was becoming an important bottleneck,
so simulation strategies using hybrid parallelism [57], non-blocking communica-
tion [4], explicit-implicit numerical schemes [38] and selective sending [24] were
introduced. Finally, more efficient representations of the connection infrastruc-
ture [35, 42] and techniques for assembling the network in parallel [33] proved
to be essential in scaling simulation code to petascale and exascale regimes.
Despite many years of research into efficient implementations of neuron
models, we still lack a quantitative treatment of the actual computational
characteristics of a given level of detail and of how a particular level of detail
may be limited by specific hardware trade-offs. How much more costly
is a morphologically detailed neuron simulation compared to a representation
modeling the same neuron as a point? What is the influence of how the synapses
are being modeled? Can we expect that point neuron models can scale to
massively parallel computers in a similar way to detailed neuron models?
For the first time, we extend performance modeling techniques to the field
of computational neuroscience, allowing us to establish a quantitative relation-
ship between the parameters dictated by the biophysical model, the complexity
properties of the simulation algorithm and the details of the hardware specifica-
tions. Although we require a reasonable level of accuracy and validation against
benchmarks, our goal is not to obtain highly accurate performance predictions,
but rather to design a tool with sufficient generality to identify current and
future bottlenecks for different levels of abstraction on the spectrum of models
of neural cells’ dynamics. Based on our requirements, we choose a performance
modeling approach that is neither purely based on first principles (see e.g. [70])
nor purely empirical (see e.g. [9]), but is instead a hybrid, gray-box analytical
approach, focused on state-of-the-art high performance computing (HPC)
hardware architectures and applied to three published neural network
simulations that have been selected to represent the diversity of neuron mod-
els in the literature. Our analysis shows that features related not only to the
neuron abstraction but also to the scientific question under analysis can cause
variations in the hardware bottlenecks that are most influential to simulation
performance. Additionally, we find instances where the level of morphological
detail is not a factor in distinguishing between models’ performance, while the
synaptic formalism is, allowing us to identify the key factors that determine a
model’s computational profile. Finally, we show that our analysis is strongly
conditioned on the published models’ parameters and simulation dynamics, such
that changes in some values, notably the firing frequency, can significantly alter
the performance profile.
Our analysis demonstrates that while general-purpose computing has, un-
til now, largely been able to deliver high performance, the next generation of
brain tissue simulation will be severely limited by hardware bottlenecks. Indeed
a shift toward specialized hardware has already started: accelerators, specifically General
Purpose Graphics Processing Units (GPGPUs), have become more widespread
and easier to program, and several simulators have tried to leverage the poten-
tially large speedup promised by such devices, notably the NeMo library [15],
the code generation framework GeNN [72] and others. We refer the interested
reader to [6] for a review. Going even further than acceleration of simulations,
intrinsic limitations in general-purpose hardware have led researchers to design
custom architectures to be more similar to the brain, introducing the field of
neuromorphic hardware. One approach, used by SpiNNaker, has been to em-
ploy relatively simple, low-energy digital cores linked by a tightly connected
low-latency network [63], and has been successfully applied to a simulation of
a neural microcircuit composed of 80,000 neurons and 0.3 billion synapses [69].
A different approach was taken by the designers of the Spikey chip [56], who
opted for an analog representation of neural circuits able to achieve simulations
faster than real time by a factor of 10^3 [71]. Although the constant improvement
in computing capabilities of modern hardware has supported this growth over
many years, it is likely that the rate at which this happens will soon begin
to decrease, and eventually might even come to a halt [35]. Additionally, the
strategy of using more and better hardware can be difficult to sustain from an
economic point of view. Therefore, we believe that the situation calls for a
better, deeper understanding of how hardware capabilities interact with brain
tissue simulation algorithms. In turn, this would allow a stronger collaboration
between in silico modelers, developers and hardware specialists to orchestrate
the co-design of software and hardware architectures.
In conclusion, our analysis shows that there are significant differences in the
performance profiles of in silico models falling within the same category of cell-
based representations. The tools that we developed for this work constitute a
quantitative tool with which modelers, developers and hardware designers can
collaborate in the task of designing and optimizing future software and hardware
for the next generation of brain tissue simulations.
Related work
As a testament to the growing interest of the community in the performance
of brain tissue simulations, several studies on this topic have been published in
the literature. To our knowledge, however, none of them have used performance
modeling as a tool to explain the empirically observed performance properties
of simulations, nor have they tried to analyze such a wide scope of models as
we do in this paper.
The review work of Brette et al. [7] presented a large number of different sim-
ulators and corresponding in silico models, but included only basic formulas for
asymptotic complexity, without exploring the complicated effect that implemen-
tation and hardware have on measured performance. In a series of papers, the
developers and users of the NEST software have investigated the issues related to
scaling simulations of neurons to very large scales [33,35,42,55] and proposed so-
lutions to avoid the performance bottlenecks they encountered. Similarly, new
simulators have been implemented by Modha et al. [4] and Kozloski et al. [38],
and new spike communication strategies have been explored [24] that address
specific scalability issues. All these studies provide very useful data to com-
pare against our own model and conclusions, but are restricted to distributed
simulations and focus on optimizing efficiency using novel algorithmic and im-
plementation techniques. A performance model for a NEST simulation was
fully described in [62], but it focuses only on distributed simulations, neglecting
single-node performance, and uses a different performance modeling approach
based on interpolating empirical observations instead of the semi-analytical ap-
proach used in this work. Focusing on very small clusters composed of only a
few nodes with shared-memory capability, Hines et al. [14] demonstrated how
to exploit multicore processors efficiently in simulations of morphologically de-
tailed neurons, while an analysis by Gerstner et al. [73] concluded that real-time
simulations of medium-sized plastic networks are not feasible on small clusters
of CPUs, but will require accelerators or dedicated hardware. The work on
the multisplit method [25] demonstrated that efficient acceleration of individual
neurons on single compute nodes is difficult beyond a restricted number of cores,
and recent work on micro-parallelism [44, 45] has found significant limitations
to strong-scaling due to Amdahl’s law.
Several GPU implementations of brain tissue simulations have been proposed
in the literature [6, 15, 40, 72]. A comparison between GPUs, HPC hardware
and neuromorphic hardware by Nowotny et al. [37] found that, under certain
conditions, GPUs can beat neuromorphic hardware in terms of energy efficiency
but not an HPC server in terms of performance. However, the authors did
not perform a detailed performance analysis nor use a performance model to
explain this comparison, thus providing valuable yet anecdotal evidence. Their
work is tightly linked to the neuron model, the configuration of the simulation
as well as the hardware being analyzed.
In the process of creating the neuromorphic hardware SpiNNaker, the de-
signers were deeply interested in the consequences of their decisions in terms
of performance. The analysis in [54] showed that, in the context of real-time
simulations, SpiNNaker is inherently limited in its scope to a restricted sub-
set of synaptic formalism, because the single node’s memory bandwidth puts
a hard limit on the number of synaptic parameters that can be streamed at
every timestep. In designing the network for SpiNNaker, a performance model
was used to determine the ideal topology and network connectivity [52], proving
that performance modeling can be an extremely valuable tool in optimization
and co-design for brain tissue simulations.
In silico models and experiments
The approach of identifying and singling out recurrent computational patterns
within a scientific field has been applied with great success in the domain of
parallel computing, leading to the definition of the dwarfs of computing [5]. In
computational neuroscience, the review by Brette et al. [7] proposed a similar ap-
proach and introduced some fundamental concepts, such as Conductance Based
(COBA) and Current Based (CUBA) formalisms for synaptic models, or point
and detailed representations of neuronal morphology. Using the nomenclature
introduced in that review, we base our analysis of the performance landscape
on three published models, chosen as representative of the extent of neuron
models covered in literature. We include an example of detailed morphology,
point neuron, Conductance Based (COBA) and Current Based (CUBA) mod-
els. We denote the set of these representative use cases as in silico models and
experiments.
Fig 1 summarizes the in silico models and experiments considered in this
work. The Brunel model is a randomly connected network reproducing the
property of balanced excitation-inhibition that can be observed in the brain
cortex. It is based on integrate-and-fire (IAF) point neurons with CUBA dy-
namics [8, 17]. As a representative example of this model, we consider here a
very large-scale implementation that served as a proof of concept for the feasi-
bility of human brain scale simulations [42].
The Reconstructed microcircuit is based on the reconstruction of neocortical mi-
crocircuit from a mix of experimental data and first principles [47]. This model
uses morphologically detailed neurons and short-term plastic COBA synapses
to capture a large amount of biological detail.
The Simplified model was obtained from [61] by reducing the detailed models
of the Reconstructed microcircuit to pointwise GIF models [59] with similar
transfer functions, while retaining the complexity from COBA synapses.
The scope of this paper is restricted to the inference phase of simulating a
neural network, thus neglecting the simulation of learning, because of the addi-
tional layer of complexity that would arise from including long-term plasticity
in our analysis. Moreover, to allow for reproducibility and ease of modeling,
we have replaced the random synaptic release mechanism in the Reconstructed
and Simplified models with a deterministic implementation of the same rule,
based on an average representation. This allows us to remove the uncertainty
in the performance model due to probabilistic release, as well as the overhead
from the random number generation, ultimately leading to better accuracy in
our predictions. Although synaptic noise can be considered a primary
driver of cortical dynamics [53], we consider the complexity associated with ran-
dom number generation in performance modeling outside of the scope of this
paper.
Hardware-agnostic metrics to describe in silico models and experi-
ments As a first approach to navigating the performance properties landscape
of in silico models and experiments we propose a collection of hardware-agnostic
metrics that can be computed directly from inspection of the model specifica-
tions. These metrics have the same value regardless of the underlying simulation
hardware, which can be an interesting property if one wants to compare intrinsic
model features. To set a common ground on which we define these metrics we
identify the following features shared by all in silico models and experiments: i)
all neural networks can be represented by a graph where neurons are nodes and
synaptic connections are edges; ii) synaptic connections can be approximated
by a perfect delayed transmission of information; iii) simulation algorithms in-
clude a clock-driven portion to integrate neuronal states and an event-driven
portion to integrate synaptic events (see Fig 1B1,B2); iv) all simulation algo-
rithms considered here follow the Bulk Synchronous Parallel (BSP) paradigm,
where phases of computation carried out independently by each parallel rank alter-
nate with global synchronization steps, happening at fixed time intervals called
minimum network delay (denoted by δmin). We are thus excluding from this
analysis asynchronous communication schemes such as [4,46], variable timestep
schemes [43] and models that explicitly represent axons [38].
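To make the shared algorithmic structure concrete, the sketch below renders the BSP loop of Fig 1B1,B2 in Python pseudocode. The kernels are trivial stand-ins for the model-specific computations (they do not correspond to any simulator's actual API); only the loop nesting and the synchronization at δmin boundaries follow the description above.

```python
# Minimal sketch of the BSP simulation loop shared by all three models.
# The per-neuron kernels are trivial stand-ins, not any simulator's API;
# only the loop structure and the synchronization points follow the text.

def simulate(neurons, t_stop, dt, delta_min):
    n_steps = round(delta_min / dt)  # the coupling ratio delta_min / dt
    t = 0.0
    while t < t_stop:
        for _ in range(n_steps):              # clock-driven portion
            for nrn in neurons:
                for w in nrn["pending"]:      # event-driven portion:
                    nrn["v"] += w             # integrate queued synaptic events
                nrn["pending"].clear()
                nrn["v"] *= 0.99              # stand-in for the state update
                if nrn["v"] > 1.0:            # threshold detection
                    nrn["v"] = 0.0
                    nrn["spikes"].append(t)
            t += dt
        # Global synchronization (BSP superstep): spikes become visible
        # to their targets only at minimum-network-delay boundaries.
        for nrn in neurons:
            for _ in nrn["spikes"]:
                for tgt in nrn["targets"]:
                    neurons[tgt]["pending"].append(0.1)
            nrn["spikes"].clear()

neurons = [{"v": 0.5, "pending": [], "spikes": [], "targets": [1]},
           {"v": 0.0, "pending": [], "spikes": [], "targets": [0]}]
simulate(neurons, t_stop=0.1, dt=1e-4, delta_min=1.5e-3)
```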
We evaluate the performance metrics of in silico models on three aspects:
memory, serial complexity and information propagation.
Fig 1. In silico models and experiments.
Presentation and summary of the in silico models and experiments examined in
this paper. A Main features of in silico models and experiments. B1,B2 CUBA
(resp. COBA) simulation algorithm. The simulation kernels within light grey
boxes are included for completeness but are not considered in our analysis be-
cause they are not part of the computation loop. The larger boxes denote
a synchronization point for distributed simulations. C1,C2,C3,C4 Hardware-
agnostic descriptive metrics capture different features of the models that
are relevant for performance. They plot, respectively, the coupling ratio, the in-
formation transmitted by a connection, the sequential compressibility limit and
the iteration compressibility limit. In the iteration compressibility limit plot,
the darker bars represent event-driven computations while the light bars repre-
sent clock-driven computations. D Variability within in silico models, measured
along three axes: the number of variables required to represent a neuron, the
number of connections per neuron and the number of variables to represent a
connection.
In the memory dimension we consider aspects of the model that can affect the memory capacity
footprint such as the number of parameters and degrees of freedom required
to represent a neuron. Specifically we count the number of state variables and
parameters per neuron, the number of state variables and parameters per con-
nection and the fan-in (i.e. the number of incoming connections) per neuron. In
the serial complexity dimension we consider aspects tied to sequential iterations
such as the timestep and the number of state variables updated at each iteration.
We define the sequential compressibility limit as the inverse of the timestep ∆t
and the iteration compressibility limit as the number of state variables updated
in a single time iteration. Finally, in the information propagation dimension, we
consider aspects tied to communication of information between neurons such as
the frequency at which a global synchronization must happen and the amount
of information exchanged in this step. Here we define the coupling ratio as the
number of timesteps that can be taken before a global synchronization point
must happen, given by the ratio δmin/∆t, and the information transmitted by a
connection as the average number of variables communicated by a connection
during a minimum network delay period.
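For concreteness, these metrics reduce to a few arithmetic expressions over model parameters. In the following sketch the numerical values are hypothetical placeholders; only the formulas implement the definitions just given.

```python
# Hardware-agnostic metrics from model parameters. All numerical values
# below are hypothetical placeholders; the formulas follow the text.

dt        = 0.1e-3   # integration timestep [s]
delta_min = 1.5e-3   # minimum network delay [s]
rate      = 5.0      # mean firing rate of a presynaptic neuron [spikes/s]
vars_per_iteration = 4  # state variables updated per neuron per timestep

# Serial complexity metrics.
sequential_compressibility_limit = 1.0 / dt  # iterations per simulated second
iteration_compressibility_limit = vars_per_iteration

# Information propagation metrics.
coupling_ratio = delta_min / dt  # timesteps per global synchronization
# Average number of variables a connection transmits during one
# minimum-network-delay period (here: one variable per spike event).
info_per_connection = rate * delta_min
```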
Each of the metrics described above can be associated with one or more per-
formance aspects and hardware features. For example, the low values of the
coupling ratio for the Simplified and Reconstructed models (see Fig 1C1) can be
associated with poor strong scaling properties, while the large information trans-
mitted by a connection (see Fig 1C2) for the Brunel model translates to higher
pressure on the network interconnect hardware. Concerning time itera-
tions, the large event-driven component of the iteration compressibility limit for
the Brunel model (see Fig 1C4) points to the fact that it could potentially be
bounded by hardware latency aspects (either memory latency or critical paths
in the execution) as well as dynamic imbalance, while the Simplified and Re-
constructed models are more likely affected by the throughput of hardware features.
The large number of connections per neuron in the Brunel model (see Fig 1D)
and the large number of variables required to represent a neuron in the
Reconstructed model entail that these in silico models will be bounded by
memory capacity. Finally,
the large variability of individual neurons in the Reconstructed model poses a
potential static load-balancing problem, as was empirically found in [41] in the
context of manycore processors.
Using performance modeling to deliver a quantitative characteriza-
tion of the performance landscape of in silico models and experiments
The metrics described previously can provide an insightful summary of the per-
formance profile of in silico models and experiments, but lack the power to give
quantitative performance predictions and the connection with specific hardware
properties. Therefore, we use performance modeling as a way of bridging the
gap between biophysical models, simulation algorithms and hardware specifi-
cations. In particular, we split the performance prediction into a single-node
component and an interprocess communication component. We address the
single-node performance modeling using the Execution-Cache-Memory (ECM)
model [67] and the interprocess communication part using the LogGP model [2].
Both are well-established approaches that have been extensively validated on
several hardware platforms [21, 22, 28]; however, significant work was required to
extend and adapt them to the simulation kernels in our analysis, for example
accounting for indirect memory accesses in the single-node predictions and the
representation of spikes in the communication component. For this reason, we
also present a validation of runtime predictions specific to our use-case in the
Methods section. Performance modeling enables us to explore the performance
landscape of brain tissue simulations, explain current performance profiles and
make predictions about future bottlenecks and trends. Our work can provide
a common framework allowing modelers, developers and hardware designers to
discuss and share ideas, laying the foundations for the next generation of high
performance brain tissue simulations.
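In outline, the hybrid prediction composes an ECM-style compute time per minimum-delay period with a LogGP-style communication term. The sketch below shows only this composition: all parameter values are illustrative assumptions, and the spike-exchange cost is a deliberately simplified LogGP expression, not the full model developed in the Methods section.

```python
# Gray-box composition of single-node (ECM-style) and communication
# (LogGP-style) predictions for one minimum-network-delay period.
# All parameter values are illustrative assumptions.

def node_time(n_neurons, cycles_per_neuron_step, n_steps, freq_hz):
    """ECM-style compute time on one node for n_steps timesteps."""
    return n_neurons * cycles_per_neuron_step * n_steps / freq_hz

def comm_time(ranks, spike_bytes, L, o, G):
    """Simplified LogGP cost of one spike exchange among `ranks` ranks:
    latency + send/receive overheads + per-byte gap for the payload."""
    return L + 2 * o + (ranks - 1) * spike_bytes * G

def period_time(n_neurons_per_rank, ranks):
    compute = node_time(n_neurons_per_rank, cycles_per_neuron_step=500,
                        n_steps=15, freq_hz=2.3e9)
    comm = comm_time(ranks, spike_bytes=4096,
                     L=1.5e-6, o=0.5e-6, G=1e-11)
    return compute + comm  # BSP: compute phase, then synchronization

print(period_time(n_neurons_per_rank=100_000, ranks=64))
```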
Results and discussion
Our goal is to provide a quantitative appraisal of the performance landscape
of brain tissue simulations and analyze in detail the relationship between an in
silico experiment, the underlying neuron and connectivity model, the simulation
algorithm and the hardware platform being used. We carry out this analysis
with the tool of performance modeling, allowing us to quantify and explain per-
formance bottlenecks without the need of time-consuming and narrowly-scoped
benchmarks. We consider four different simulation regimes, arising from the
combination of two axes. The first axis is defined by the strategy for paralleliza-
tion: shared-memory or distributed-memory. The second axis is defined by the
strategy for scaling the problem size with the available hardware: maximum-
filling or real-time, corresponding respectively to weak-scaling or strong-scaling.
Given that the results of this analysis are tightly dependent on the hardware
platform under consideration, the reference single-node architecture is Intel’s
Skylake-X processor with AVX512 vectorization, considered to be a prototypi-
cal example of state-of-the-art HPC microarchitectures. For distributed simula-
tions, the reference network architecture is a vendor-optimized HPE implemen-
tation of the MPI standard over an Infiniband EDR 100 Gb/s fabric.
Shared-memory maximum filling
One of the most common simulation configurations involves scaling the number
of neurons until the memory capacity limit is reached. This configuration has
been used as proof-of-concept for brain tissue simulations to the scale of brain
regions [3, 35] and even the full brain [34], and constitutes a fundamental tool
for neuroscientists to simulate networks whose sizes are representative of the
neural systems they are studying.
Memory bandwidth limits shared-memory parallelism Modern archi-
tectures are typically designed with memory bandwidth as the most relevant bottleneck for shared-memory parallelism [49].
Fig 2. Memory bandwidth saturation of in silico neuron models on a
state-of-the-art HPC architecture.
The memory interface utilization and total runtime of different neuron models,
as predicted by the ECM performance model on a Skylake-X architecture with
AVX512 vectorization. A Percentage of memory bandwidth utilization for dif-
ferent neuron models, as a function of the number of shared memory threads.
The dashed black line denotes the threshold for saturation set at 90% utiliza-
tion. B Simulation performance per neuron in simulated seconds per wallclock
second under conditions of full saturation of the memory bandwidth. To obtain
the performance prediction for a whole network divide the performance values
by the number of neurons in the network. Note the logarithmic scale on the y
axis. The inset shows the percent of the total runtime required by each kernel
within a model’s simulation loop. C To mitigate the effect of memory band-
width saturation, a smart ordering of time and neuron loops is implemented by
state-of-the-art simulators, as shown in the diagram on the right. We plot the
number of threads required to reach 90% saturation of memory bandwidth as a
function of the coupling ratio, clipping the total to the maximum allowed by the
reference architecture. The inset shows the maximum bandwidth utilization as
a percent of the theoretical peak bandwidth, as a function of the coupling ratio.
This means that if all the shared
memory parallel threads are used, it is very likely that performance will be
bounded by the memory bandwidth. Indeed, this has been demonstrated to be
the case for simulations of detailed neurons [12] and strongly suspected in the
case of point neurons [73]. To verify the hypothesis that memory bandwidth
could indeed be the main bottleneck in the maximum filling regime we extend
the ECM performance model following our previous work [12] to predict the
utilization of the memory interface, computed as a fraction of the theoretical
maximum memory bandwidth, for the three in silico models considered in this
work. The results are shown in Fig 2A. We find that all models pass the threshold
of 90% utilization well before all available parallel threads are utilized, meaning
that memory bandwidth is indeed a bottleneck in the maximum filling scenario,
under the assumption that data must be pulled from main memory at every
time iteration. Moreover, this also implies that the parallelism exposed by the
architecture cannot be fully exploited in the maximum filling regime. However,
we also find, surprisingly, that regardless of the level of morphological detail,
COBA models share a similar pattern of early saturation, while the CUBA IAF
model requires more parallelism to achieve memory bandwidth saturation.
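The shape of the curves in Fig 2A can be pictured with the scaling assumption used by the ECM model: the bandwidth demand of one thread adds up linearly with the thread count until it hits the chip's peak. The sketch below encodes this assumption; the peak bandwidth and the per-thread demands are hypothetical numbers, not measurements of the reference architecture.

```python
# Sketch of the ECM multi-thread scaling assumption behind Fig 2A:
# per-thread memory traffic adds up linearly until it reaches the peak
# bandwidth. All numbers are hypothetical placeholders.

PEAK_BW = 115e9  # assumed peak memory bandwidth [bytes/s]

def utilization(threads, bytes_per_s_one_thread):
    """Fraction of peak memory bandwidth consumed by `threads` threads."""
    demand = threads * bytes_per_s_one_thread
    return min(demand, PEAK_BW) / PEAK_BW

def threads_to_saturation(bytes_per_s_one_thread, threshold=0.9):
    """Smallest thread count whose demand exceeds `threshold` of peak."""
    t = 1
    while utilization(t, bytes_per_s_one_thread) < threshold:
        t += 1
    return t

# A model with a heavier per-thread memory demand saturates earlier:
print(threads_to_saturation(30e9), threads_to_saturation(10e9))  # 4 11
```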
State-of-the-art HPC memory chips can sustain fast simulations of
the Brunel and Simplified models at full saturation From a practical
point of view, in addition to analyzing the scaling behaviour of simulations,
computational neuroscientists are also interested in predicting the actual run-
time for a given model. Therefore, we predict the simulation performance for
the three in silico models under the assumption that memory bandwidth is fully
saturated. The results are plotted in Fig 2B, and Table 1 reports the memory
traffic requirements to simulate one second of activity, alongside the predicted
memory capacity. As the unit of measure for performance we chose simulated sec-
onds per wallclock second per neuron, in order to present our results in a way
that is independent from the network size and the duration of the simulation.
Our results indicate that the modern, fast memory chips on the reference archi-
tecture are able to sustain faster-than-real-time simulations of up to roughly 10^4
neurons in the Brunel model, and 10^3 neurons in the Simplified model, while
faster-than-real-time simulations of the Reconstructed model are predicted to
be impossible. As an important remark, these predictions only consider the
computational and communication kernels of an in silico model, and notably ne-
glect event bookkeeping efforts, random number generation and computation
of stimuli. In a real world scenario, the empirically measured performance of
a model could be much lower if the model’s execution is not dominated by
computational aspects.
Event-driven synaptic integration dominates CUBA performance, while
clock-driven kernels dominate COBA performance. We also predict the
breakdown of the relative importance of different kernels that constitute the
simulation algorithm of a model, shown in the inset of Fig 2B, allowing us to
highlight some interesting differences between models.
Table 1. Predicted memory traffic and capacity requirements.

Model           Volume per biological millisecond [kB]   Capacity [kB]
Brunel          8                                         4.5 × 10^3
Simplified      1.3 × 10^2                                9.9 × 10^2
Reconstructed   5.9 × 10^4                                3.9 × 10^4

The requirements in this Table are computed considering only the data
structures strictly relevant to computation, thus neglecting overhead from
implementation details.
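Under full saturation the wallclock time is simply memory traffic divided by achievable bandwidth, which is how the volumes in Table 1 relate to the predictions of Fig 2B. The sketch below makes two assumptions that should be read as such: the Table 1 volumes are treated as per-neuron traffic (consistent with the per-neuron units of Fig 2B), and the sustained bandwidth value is illustrative.

```python
# At full memory-bandwidth saturation, wallclock time = traffic / bandwidth.
# Assumptions: Table 1 volumes are per-neuron traffic, and the sustained
# bandwidth below is an illustrative value, not a measurement.

SUSTAINED_BW = 100e6  # assumed sustained memory bandwidth [kB/s]

def perf_at_saturation(traffic_kb_per_bio_ms):
    """Simulated seconds per wallclock second, per neuron."""
    traffic_per_sim_s = traffic_kb_per_bio_ms * 1e3  # kB per simulated second
    return SUSTAINED_BW / traffic_per_sim_s

# Brunel row of Table 1 (8 kB per biological millisecond):
print(perf_at_saturation(8))  # ~1.25e4 simulated s / wallclock s per neuron
```

With these assumptions the Brunel row yields about 1.25 × 10^4 simulated seconds per wallclock second per neuron, of the same order as the roughly 10^4 real-time neurons quoted above.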
In the maximum filling scenario
the Reconstructed model is not dominated by a single kernel. Instead, synaptic
and ionic current kernels constitute almost 60% of the execution time, while
state update kernels, which are commonly regarded as more costly, are a few
points short of taking up the remaining 40%. This has been extensively vali-
dated in [12] and can be explained by the fact that current kernels have stronger
data requirements and lower computational requirements, and thus poorer per-
formance when the bottleneck is constituted by the memory bandwidth. The
Simplified model is similarly dominated by the computation of synaptic cur-
rent kernels. Since the Simplified and the Reconstructed model share the same
COBA synaptic formalism, this can explain why some of their performance prop-
erties are very similar, in spite of the fact that their representation of neurons’
morphologies is extremely different. Conversely, the performance of the Brunel
model is determined for more than 60% by a single kernel: the event-driven
integration of synaptic events. In the following sections we explore in detail
how this characteristic has an impact on determining which hardware feature is
most relevant for the performance of in silico models.
Ordering of loops to avoid memory bandwidth saturation. State-of-
the-art simulators employ a specific ordering of the loops over neurons, timesteps
(∆t) and minimum network delay steps (δmin) to minimize the impact of memory
bandwidth by maximizing data locality. This strategy is exemplified on the right
of Fig 2C, and consists of allowing a neuron to step in time from the beginning of
the minimum network delay interval to its end, before initiating the stepping of
the second neuron. This approach is equivalent to swapping the neuron and ∆t
loops in Fig 1B1,B2, and has the benefit of maximizing data locality. Depending
on the memory traffic requirements, data could be stored in the cache after
being read once from main memory during the first timestep, ready to be used
for the next time iterations within the same minimum network delay period.
Throughout this work we make the conservative assumption that, when using
this strategy, data must be fetched from main memory on the first timestep and
from the L3 cache on consecutive timesteps. In special cases where the memory
traffic requirements of a single neuron are very low and the number of synaptic
events integrated in a timestep is sufficiently small, data could potentially come
instead from higher levels in the hierarchy such as the L2 cache; however for
the sake of simplicity we make the assumption that reused data must always
be fetched from L3. The number of timesteps within a minimum delay period
has of course a great influence on the effectiveness of this strategy in terms
of reducing pressure on the memory bandwidth. To quantify this, we predict
the maximum memory bandwidth utilization as a function of the coupling ratio
defined by δmin/∆t. For COBA models this follows a steep decline for small values
of the coupling ratio, indicating that even just a few timesteps exploiting data
locality can provide great benefits in terms of shared memory scaling, while in
the case of the CUBA model this decline is much slower, which could explain
why the published Brunel model [42] has such a large value of the coupling ratio
while the Reconstructed and Simplified models have low values (see Fig 1C1).
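The reordering is easiest to see side by side. The sketch below contrasts the two orderings; the state update is a trivial placeholder, and the spike exchange is omitted because the swap only affects the loops inside one minimum-delay period.

```python
# The two loop orderings compared in Fig 2C. The state update is a
# placeholder; n_steps is the coupling ratio delta_min / dt.

def update_state(nrn):
    nrn["v"] *= 0.99  # stand-in for the real per-timestep kernels

def timestep_outer(neurons, n_steps):
    # Naive ordering (Fig 1B1,B2): every timestep sweeps all neurons,
    # so each neuron's data is streamed from main memory n_steps times.
    for _ in range(n_steps):
        for nrn in neurons:
            update_state(nrn)

def neuron_outer(neurons, n_steps):
    # Locality-friendly ordering: one neuron is stepped through the whole
    # minimum-delay period while its data stays cached; main-memory
    # traffic is paid once and later steps hit the L3 (our assumption).
    for nrn in neurons:
        for _ in range(n_steps):
            update_state(nrn)
```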
Distributed maximum filling
To overcome the limit on network size imposed by the memory capacity of
individual compute nodes, computational neuroscientists have begun executing
parallel distributed simulations on a cluster of compute nodes. In the distributed
maximum filling regime we still consider that the memory capacity limit of each
individual compute node will be maxed out, but we introduce the possibility of
increasing the number of interconnected compute nodes as a means of increasing
the computational power and capacity of the architecture. In terms of scaling
the problem size proportionally to the number of compute nodes, the fact that
we assume the memory capacity limit to be always maxed out translates to
the concept of weak scaling, on which we will base our analysis in this section.
To keep our initial analysis simple we ignore the loop-ordering optimization
described above and assume instead that the memory bandwidth of individual
compute nodes will always be saturated. We will lift this assumption in later
sections.
Weak scaling properties and performance predictions of in silico mod-
els We predict the performance of distributed simulations in a weak scaling,
maximum filling scenario for different numbers of neurons per rank and all in
silico models, using our performance model. Results are presented in Fig 3A,
where the solid lines correspond to 10^5 neurons per rank while the dashed lines
correspond to a small number of neurons per rank, computed such that its mem-
ory footprint barely exceeds the 25 MB of the reference architecture’s L3 cache;
smaller values would not be possible because they would break the memory
bandwidth saturation assumption. As expected, for a fixed configuration and a
small cluster size the Brunel model has the best predicted performance, beating
by roughly a factor of 10 the performance of the Simplified model and by roughly
a factor of 10^4 the performance of the Reconstructed model. Interestingly, these
differences are much less pronounced for large cluster sizes, where the difference
between the Brunel and Reconstructed model can be reduced to less than a fac-
tor of 10. Networks with the largest numbers of neurons considered here possess
excellent weak scaling properties regardless of the underlying modeling
abstraction, at least up to the cluster sizes considered here.
Fig 3. Performance of distributed max-filling scenario and most
relevant hardware bottlenecks.
A): predicted performance of the three in silico models in a weak-scaling, max-
filling scenario. We plot several values of the number of neurons per rank, rang-
ing from the smallest number that would exceed a typical L3 size of 25 MB to
10^5 neurons per rank. The predicted performance is plotted as the number of
model seconds for which the whole network can be simulated in 1 wallclock
second. We make the assumption of full memory-bandwidth saturation. We
take the SKX-AVX512 node architecture connected by an Infiniband EDR 100
Gb/s network as representative of an HPC cluster. A1,A2,A3): represent the
detailed COBA, simplified COBA and point CUBA models, respectively. B):
hardware bottlenecks predicted by the performance model as a function of the
number of neurons per rank (y-axis) and the number of distributed ranks (x-
axis). The grey regions denote an area where the number of neurons per rank
would require less memory than a typical L3 cache size, thus invalidating the
saturation assumption. B1,B2,B3): represent the detailed COBA, simplified
COBA and point CUBA models, respectively. C): Speedup predicted by the
model when a single hardware feature is improved by a factor of 10x, as com-
pared to the baseline performance using the HPC architecture described in A.
Again, since the number of neurons per rank plays an important role, the shaded
areas represent the speedups for different values of the number of neurons per
rank, in the range described in A. C1,C2,C3): represent the detailed COBA,
simplified COBA and point CUBA models, respectively.
Conversely, only small
networks of the Reconstructed model retain the same quasi-optimal scaling be-
haviour, while both the Brunel and Simplified models suffer from performance
degradation at large cluster sizes, with the average rate of degradation being
significantly larger for the Brunel model.
Explaining weak scaling performance properties through bottleneck
analysis Fig 3B explains the degradation of performance by identifying the
most relevant bottlenecks for different configurations. Here a hardware bot-
tleneck is defined as the most relevant hardware feature as computed by our
performance model, as explained in the Methods Section. The first striking
property revealed by this analysis is that network bandwidth is never a bottle-
neck for the in silico models considered here. Instead, large-scale simulations
are dominated by the latency of the collective communication. This points to
the fact that investigating spike communication strategies such as neighborhood
collectives [35], nonblocking point-to-point schemes [4] or asynchronous execu-
tion [45] is essential to reach brain-scale simulations.
Distributed simulations of large networks can be improved by increas-
ing memory bandwidth, small networks by decreasing network la-
tency The aforementioned bottleneck analysis is useful to understand which
is the most important hardware feature in a given simulation configuration, but
it does not provide any information about the relative importance of the other
features. Therefore we predict the speedup corresponding to a 10× improvement
of a single hardware feature, for different numbers of neurons per distributed
rank, as a function of the number of ranks. The results are plotted in Fig 3C,
which shows that at small cluster sizes all models would benefit from an im-
provement in the nodes’ memory bandwidth, but not from an improvement in
network latency, while the situation is reversed at large cluster sizes. The pre-
dicted speedup obtained by improving the memory bandwidth degrades much
faster for simulations with a small number of neurons per rank versus a large
number of neurons per rank, while the opposite is true for the speedup coming
from an improved network latency.
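The speedup estimates behind Fig 3C follow from decomposing the predicted runtime into additive per-feature contributions, scaling one contribution down tenfold and recomputing the total. A minimal sketch with hypothetical contribution values:

```python
# Speedup from a 10x improvement of one hardware feature, assuming the
# predicted runtime decomposes additively into per-feature contributions.
# The contribution values below are hypothetical placeholders.

baseline = {                   # seconds attributed to each feature
    "memory_bandwidth":  0.50,
    "network_latency":   0.30,
    "cpu_throughput":    0.15,
    "network_bandwidth": 0.05,
}

def speedup(contributions, feature, factor=10.0):
    """Total-runtime speedup when `feature` becomes `factor` times faster."""
    total = sum(contributions.values())
    improved = dict(contributions)
    improved[feature] /= factor
    return total / sum(improved.values())

print(speedup(baseline, "network_latency"))  # ~1.37 for these numbers
```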
Single-node simulations in real time
Another widespread simulation regime is not focused on simulating the largest
possible network, but in simulating a fixed size network as fast as possible. We
call this the real time regime, because real time represents an ideal target for the
performance of a simulation. Currently, on the one hand, it is unclear whether real
time is realistically achievable on modern hardware [73], and on the other hand
special hardware that breaks this limit by design has already been conceived
and tested for small networks [1].
Memory bandwidth dominates the shared-memory strong scaling of
Brunel and Simplified models, while a mix of hardware features in-
fluences the performance of the Reconstructed model Given that it
is possible to observe superlinear speedup as the dataset is made increasingly
small by virtue of it fitting into faster cache memory, we predict the performance
per neuron of all in silico models assuming the dataset could be fully contained
in different levels of the memory hierarchy. For simplicity, we neglect the fact
that some of these model and cache combinations are infeasible in practice, e.g.,
due to the memory footprint of a single neuron in the Reconstructed model
exceeding the L1 cache size. For DRAM, we assume that the ordering of loops
technique is used to avoid saturation of the memory bandwidth, thus extending
our analysis by discarding the saturation hypothesis made in previous sections.
For all performance predictions we assume that all available threads in the refer-
ence architecture (18 in total) are being utilized. Results are reported in Table 2,
where we give the raw performance value when data is in DRAM, and the cor-
responding (superlinear) speedup factor as the dataset becomes small enough
to fit in higher levels of the cache hierarchy.
Table 2. Predicted performance per neuron without the memory bandwidth
saturation assumption, considering all available threads are used.

Model           Performance (DRAM)   Speedup in L3   Speedup in L2   Speedup in L1
Brunel          3.9 × 10^4           2.5             8.7             33.9
Simplified      7.9 × 10^2           4.7             6.5             6.6
Reconstructed   3.9                  1.8             2.4             2.4

Performance is measured in simulated seconds per wallclock second.
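The cache-level speedups in Table 2 follow the ECM logic: when the working set fits in a given cache level, only the transfers between the core and that level remain in the data path. The sketch below illustrates this under simplified assumptions (illustrative per-cache-line cycle counts, and no modeling of overlap between transfers and in-core execution):

```python
# ECM-style sketch of why performance improves when the dataset fits
# closer to the core. Cycle counts per cache line are illustrative, and
# overlap between data transfers and in-core execution is ignored.

CYCLES_PER_LINE = {"L1": 0.0, "L2": 2.0, "L3": 6.0, "DRAM": 14.0}
LEVELS = ["L1", "L2", "L3", "DRAM"]

def kernel_cycles(core_cycles, lines, data_level):
    """Predicted cycles per iteration when the data resides in data_level."""
    transfer = sum(CYCLES_PER_LINE[lvl]
                   for lvl in LEVELS[:LEVELS.index(data_level) + 1])
    return core_cycles + lines * transfer

# Speedup of an L1-resident working set over a DRAM-resident one:
print(kernel_cycles(10, 3, "DRAM") / kernel_cycles(10, 3, "L1"))  # 7.6
```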
Fig 4 shows the predicted performance breakdown into simulation kernels
as well as hardware features for all in silico models, assuming that the dataset
fits in different levels of the memory hierarchy. When data is in the highest
level of the cache hierarchy (L1), the most important kernels for all models
are state update kernels, and the most relevant hardware feature is the CPU
throughput. Additionally, in the COBA models the computation of the expo-
nential (for updating the synaptic states) constitutes a significant portion of the
overall execution time. As the dataset increases in size and is only able to fit
in lower levels of the cache (L2 or L3) the predicted performance of the Brunel
model rapidly degrades, while that of the COBA models remains more stable. In prac-
tice, this is an indication that the Brunel model is bounded by the data path
while the COBA models are bounded by the maximum achievable FLOPs. Our
breakdown analysis confirms this, although for the reference architecture COBA
models are best represented by a mix of core-bound and data-bound kernels, es-
pecially when the dataset fits only in the L3 cache. Complementarily, in COBA
models the core-bound state update kernels gradually lose relative weight in
favour of data-bound current kernels, while in the Brunel model
the weight of the spike delivery kernel gradually increases, eventually becom-
ing the most relevant kernel in the execution. In spite of this technique, both
point neuron models are clearly dominated by the saturation of the memory
bandwidth. In particular, the fact that memory bandwidth is the only factor
in determining the performance of the Simplified model can be directly related
to the fact that its coupling ratio has a value of 1, as shown in Fig 1. The
performance profile of the Reconstructed model is more diverse, and while 60%
of the execution time is still dominated by memory bandwidth, the data trans-
fers between the caches, arithmetic instructions, and throughput of exponential
function evaluations also take up a significant portion of the runtime. Again we
stress that this phenomenon is tightly linked to the hardware architecture used
as reference, specifically to its balance of compute power, memory bandwidth
and number of threads. However, we can assume that these conclusions
generalize because most modern architectures are designed
with a similar balance point [49].
Distributed real time
An effective strategy for improving simulation performance or handling larger
networks is to dedicate more hardware to the task, distributing the simulated
neural network across multiple compute nodes. For a fixed problem size, this
translates to the concept of strong scaling. The limits to strong scaling of
medium sized plasticity networks have been empirically explored in [73]. Here
we generalize their analysis to other in silico models as well as provide clear
explanations for the causality of bottlenecks, backed by our performance model.
Performance predictions for strong scaling of arbitrarily sized net-
works We predict the performance and scaling bottlenecks of in silico models
in a strong scaling scenario, using our performance model, and present the re-
sults in Fig 5. As in the distributed maximum filling scenario, the problem size
has a major impact on performance, so we include several possible sizes in our
analysis. Some researchers have explored the possibility of splitting a neuron
across more than a single parallel process, such as the multisplit method [25],
branch-parallelism [44] and domain decomposition [38]. We do not include such
techniques in our analysis because their granularity falls outside the scope of
this work. In practice, this imposes a limit on how many distributed ranks a
given neural network could be scaled on. In Fig 5A we present raw performance
predictions for different network sizes in a strong scaling scenario. The dashed
lines correspond to the minimum network size such that strong scaling can be
carried out until occupancy of the full cluster (set here at 10^4 distributed ranks),
while thinner solid lines represent smaller network sizes that cannot be scaled to
the full cluster. For all in silico models, as long as the network size is sufficiently
large, the performance initially improves as we distribute the problem over in-
creasingly more ranks. However, for all in silico models there exists a threshold
number of ranks after which the benefits from adding hardware become less
prominent. Scaling to larger cluster sizes after this threshold can be counter-
productive, and even result in performance degradation. The threshold value
itself is a function of the hardware architecture, in silico model and problem size.
Fig 4. Overall runtime contributions from algorithmic kernels and
hardware features.
For each level of the cache hierarchy we show the relative contribution of al-
gorithmic kernels to the total runtime, and the corresponding breakdown into
contributions from hardware features. We assume a single node of the SKX
architecture with AVX512 vectorization and using the maximum number of
threads. The bar plots show the predicted speedup relative to DRAM perfor-
mance when the number of neurons is computed to fill exactly the corresponding
cache level. For DRAM, the table presents the performance per neuron mea-
sured in [simulated s / wallclock s], assuming the loop ordering technique to
avoid saturation is used.
Interestingly, the striking differences in performance between in silico models at
small cluster sizes can be evened out quite significantly at large cluster sizes in
this scenario. For example, simulating a large Brunel network on 10 distributed
ranks can be roughly four orders of magnitude faster than a Reconstructed net-
work on the same hardware, but the difference between models goes down to
two orders of magnitude at large cluster sizes.
Network latency and memory bandwidth are the main bottlenecks
in strong scaling Following the same procedure of the maximum filling sce-
nario we investigate the reasons for performance degradation by plotting the
most significant hardware bottlenecks for all combinations of network size and
cluster size in Fig 5B. We assume that the time and neuron loops are ordered
to minimize pressure on the memory bandwidth. Even though we do not make
the explicit assumption of memory bandwidth saturation, this hardware feature
is still among the most relevant for all in silico models (as was also shown in
Fig 4). Moreover, network bandwidth is never the dominating bottleneck for any
in silico model or configuration, while network latency always becomes
the most important bottleneck at large cluster sizes. At very small cluster sizes,
we recover the results from Fig 4.
Improving network latency gives significant performance improvements for point
neuron models, but no improvement in a single factor would be suffi-
cient to increase performance in the Reconstructed model To further
investigate the relevant bottlenecks, we predict the expected speedup corre-
sponding to a tenfold increase in a single hardware feature and present the
results in Fig 5C. Both point neuron models show a similar structure: an im-
provement in the features of a single node such as CPU frequency or cache
throughput yields an improvement in performance only for very small networks,
while a tenfold improvement in network latency would guarantee a significant
improvement in performance for networks of all sizes and sufficiently large clus-
ter sizes. The only difference between the two point neuron models in this
analysis are the benefits to be gained from improving the memory bandwidth:
while there would be almost no benefit for the Brunel CUBA model, it has a
moderate influence in the performance of the Simplified COBA model, espe-
cially for large networks. The situation for the Reconstructed model differs,
and there is no single factor that would result in a significant performance im-
provement at any network size, except for strong scaling of very small networks
where the CPU throughput is the dominating hardware feature. This can be
explained by the diversity of relevant hardware factors identified in Fig 4: when
a single hardware feature is improved, another bottleneck is quickly reached and
the total resulting performance improvement is suboptimal. For large cluster
sizes the situation normalizes to that of the other in silico models, and network
latency becomes the dominant factor, such that a tenfold increase results in a
nearly-equivalent performance boost, especially for small networks.
Fig 5. Performance of distributed real time strong-scaling scenario
and most relevant hardware bottlenecks.
A): predicted performance of the three in silico models in a strong-scaling sce-
nario, as a function of the number of distributed parallel processes. We restrict
our analysis to clusters of up to 10^4 distributed parallel processes, based on the
SKX-AVX512 architecture connected by an Infiniband EDR 100 Gb/s network.
Since the total number of neurons can be an important factor for performance,
we plot performance for different network sizes. The dashed and solid lines
represent the performance of simulations with networks of 10^4 and 10^8 neurons
respectively. We plot the predicted performance measured as the number of
model seconds for which the whole network can be simulated in 1 wallclock sec-
ond. Contrary to the max-filling scenario, we do not impose the hypothesis of
memory bandwidth saturation, although we still assume that, whenever possi-
ble, all shared memory threads available will be used. A1,A2,A3): represent
the detailed COBA, simplified COBA and point CUBA models, respectively.
B): hardware bottlenecks predicted by the performance model as a function
of the number of neurons in the whole network (y-axis) and the number of
distributed ranks (x-axis). The grey areas denote an impossible configuration
that would require splitting of individual neurons. B1,B2,B3): represent the
detailed COBA, simplified COBA and point CUBA models, respectively. C):
Speedup predicted by the model when a single hardware feature is improved by
a factor of 10x, as compared to the baseline performance using the reference
HPC architecture. The dashed and solid lines have the same meaning as in A.
C1,C2,C3): represent the detailed COBA, simplified COBA and point CUBA
models, respectively.
Dependence of performance on model parameters
Parameters of the in silico models have an important, yet often difficult to
explain, impact on performance. We test the impact of the firing frequency,
minimum network delay and fan-in using our performance model, and present
the results in this section. We omit the timestep because its relationship
with performance is generally straightforward.
Firing frequency’s differential effect on communication and computa-
tion Firing frequency is commonly cited as one of the most impactful param-
eters on simulation performance [72]. In this work we consider it a parameter
even though it usually cannot be explicitly set by the user, and is instead an
emerging property of the simulation. Fig 6A shows the predictions of our perfor-
mance model for the three in silico models, for values of the firing frequency in a
physiological range. To take into account the obvious fact that the total number
of neurons and distributed ranks can introduce a large variability in our predic-
tions, we randomly generate 1000 value pairs of [number of neurons, number
of distributed ranks] and plot the median predicted performance, additionally
broken down into its two components of inter-node communication and on-node
computation. Firing frequency has an effect on communication by changing the
size of the spike message as well as on computation by changing the amount
of events that must be integrated by neurons. In Fig 6A1 it is noticeable that
there exists a threshold frequency below which f does not affect performance,
but once this threshold is passed firing frequency becomes a primary factor,
inducing a linear, almost unit-slope degradation in performance. This effect is
clearly visible in the Simplified model, and even more so in the Brunel models.
For the Reconstructed model this threshold value exists, but is so large that
we can safely assume that, in the median case, firing frequency has no effect on
performance. To investigate the reasons for performance degradation we look at
the breakdown of relative importance of different hardware features as a func-
tion of the firing frequency, plotted in Fig 6A2,A3,A4. Similarly to before, we
randomly generate 1000 pairs of [number of neurons, number of distributed
ranks], but we plot the mean relative importance instead of the median so that
the contributions always sum to 100%. Our analysis shows an interesting behaviour:
as the firing frequency becomes larger, the relative pressure on the memory
bandwidth (and eventually the network bandwidth) becomes larger, while the
relative pressure on the network latency becomes smaller. So not only is there
more computation to be done as firing frequency gets larger, but the mix
of hardware bottlenecks also changes. This behaviour was observed empirically not
only on general-purpose CPUs [73] but also on GPUs, which are additionally
more susceptible to dynamic load imbalance because of their extremely large
number of parallel cores [72]. We remark that the large variability in the perfor-
mance predictions in Fig 6A1,B1,C1 can be at least partially explained by our
choice of sampling strategy. We generate random numbers of neurons and ranks
following a log-uniform distribution, with the effect that all orders of magnitude
are equally likely, thus introducing a very large variability.
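The sampling procedure itself can be sketched as follows, where predict_performance is a hypothetical stand-in for our analytical model:

# Sketch of the log-uniform sampling strategy described above: every order
# of magnitude of network size and cluster size is equally likely.
# `predict_performance` is a hypothetical placeholder for the full model.
import numpy as np

rng = np.random.default_rng(seed=0)

def log_uniform(low, high, size):
    return 10.0 ** rng.uniform(np.log10(low), np.log10(high), size)

neurons = log_uniform(1e2, 1e8, size=1000)
ranks = log_uniform(1e0, 1e4, size=1000)

def predict_performance(n, p, firing_hz):
    # placeholder: performance degrades with work per rank and event rate
    return p / (n * (1.0 + 0.01 * firing_hz))

perf = predict_performance(neurons, ranks, firing_hz=10.0)
print("median:", np.median(perf), "IQR:", np.percentile(perf, [25, 75]))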
Fig 6. Effect of model parameters on performance.
A Effect of firing frequency on the performance of distributed simulations. A1
Median performance (total and broken down into communication and on-node
computation components) as a function of firing frequency. The median is com-
puted over one thousand randomly generated samples of number of neurons
and number of distributed ranks. The shaded area surrounding the total perfor-
mance represents the 25th and 75th percentiles. A2,A3,A4 Stacked plot of the
median relative contributions from hardware features as a function of firing fre-
quency, respectively for the Brunel, Simplified and Reconstructed model. The
median is computed over one thousand randomly generated samples of number
of neurons and number of distributed ranks as above. B Effect of δmin on the
performance of distributed simulations. B1 Median performance as a function
of δmin. The distribution for computing the median and the shaded areas are
the same as in A1. B2,B3,B4 Stacked plot of the median relative contribu-
tions from hardware features as a function of δmin, respectively for the Brunel,
Simplified and Reconstructed model. The range of acceptable values for δmin
changes across different in silico models because these values were computed as
multiples of the model’s timestep, i.e. by using the definition of coupling ratio.
C Effect of fan-in on the performance of distributed simulations. C1 Median per-
formance (total and broken down into communication and on-node computation
components) as a function of fan-in. Fan-in has a direct effect on the memory
requirements of an in silico model, which in turn is potentially an important fac-
tor for performance because it affects the maximum allowed number of neurons
per rank. For this plot, we assumed that all randomly generated configurations
were possible, i.e. that memory capacity was not an issue. C2 Contour plot
of predicted memory requirements of the connections table, as a function of
the total number of neurons (x-axis) and number of distributed ranks (y-axis).
The contour levels corresponding to 1kB, 1MB and 1GB are shown for differ-
ent values of the fan-in. C3 Memory requirements as a function of the fan-in,
normalized by the memory requirements of a model with 10 incoming synapses.
Minimum network delay affects the relative importance of hardware
features Another parameter of interest is the minimum network delay, de-
noted δmin. In terms of communication, δmin affects the number of times that
global communication must happen to simulate one second of biological time,
although it doesn’t affect the total number of spikes communicated. In terms
of computation, assuming that the loop ordering strategy to minimize pressure
on the memory bandwidth is employed, δmin affects the number of time
iterations in which data locality can be exploited. Fig 6B1 shows the predictions
of our performance model for the three in silico models, for different values of
the minimum network delay. Since this delay can only be an integer multiple
of the timestep, we exploit the concept of coupling ratio to define a range of
plausible minimum delay values by setting a range of values for the coupling
ratio and obtaining the corresponding δmin by multiplication with ∆t. By look-
ing at the breakdown of performance, we see that larger δmin improves the
performance of inter-node communication but also, somewhat surprisingly, on-
node computation. However, while the communication performance seems to
improve indefinitely, the improvement of on-node computation saturates quite
quickly. For the point neuron models, within the range of δmin values consid-
ered here, there is a transition from a regime dominated by communication to
one dominated by computation, while the Reconstructed model is dominated by
computation for all δmin values. To investigate the reasons for changes in per-
formance, we plot the breakdown of relative importance of different hardware
features in Fig 6B2,B3,B4. In the case of COBA models, a larger δmin corresponds
to decreased pressure on the memory bandwidth and network latency, and larger
pressure on more scalable hardware features such as CPU instruction through-
put and cache throughput. This points to the fact that simulations based on
COBA models with a large δmin could strongly benefit from shared-memory
parallelism. On the other hand, a larger δmin in the CUBA model results in
decreased pressure on the network latency, but a higher pressure on memory
bandwidth.
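A small worked sketch of the two roles of δmin (assuming, as above, that δmin is a multiple of the timestep and that one collective exchange happens per min-delay period; the timestep value is illustrative):

# delta_min sets both the number of collective exchanges per simulated
# second and the number of timesteps over which the loop ordering strategy
# can keep state data in cache.
dt = 0.1e-3  # timestep [s], illustrative
for coupling_ratio in (1, 4, 16, 64):
    delta_min = coupling_ratio * dt           # minimum network delay [s]
    exchanges_per_second = 1.0 / delta_min    # collective rounds per simulated second
    cache_reuse_window = coupling_ratio       # timesteps of exploitable data locality
    print(coupling_ratio, exchanges_per_second, cache_reuse_window)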
Large fan-in can be advantageous for performance of point neuron
models, but has almost no effect on Reconstructed model Finally, we
examine the effect of fan-in, defined as the average number of incoming con-
nections per neuron and denoted by K. This parameter has subtle effects on
performance that are difficult to analyze. For COBA models, a larger fan-in
technically means more synapses to simulate thus an expected degradation of
performance. Additionally, for all models a larger K determines an increase in
event-driven computation, thus once again an expected degradation of perfor-
mance. These hypotheses are confirmed in Fig 6C1, which shows that K does
not affect communication but has a very strong effect on the Reconstructed
model. Unexpectedly, fan-in seems to only marginally affect the performance of
the Simplified model, in spite of it being a COBA model too. This can be easily
explained by the fixed number of synaptic instances in this model (28 excitatory
and 8 inhibitory [61]), such that much like the CUBA Brunel model, ultimately
the fan-in affects only the average number of events a neuron must integrate
within a certain time period. Another important point should be made about
the effect of fan-in, because changing the number of connections of a neuron
has an impact on the in silico model's memory requirements. Fig 6C3 shows
the ratio of neurons that can fit in a Gigabyte of memory, according to our
performance model, as a function of fan-in. In a strong scaling scenario, this
information sheds new light on the conclusions above, because if only 1/x
neurons fit in a GB, this can potentially result in an x-fold increase in performance from
parallelism (disregarding potential communication bottlenecks). Therefore for
the Brunel and Simplified model it can be advantageous to have a large num-
ber of incoming connections per neuron, because the performance price paid is
more than compensated by the required increase in parallelism. Conversely, in
the Reconstructed model, these two effects appear to balance out almost evenly.
For very large scale simulations, another subtle effect of fan-in is represented
by the size of the connection table, an issue that was raised and investigated
in [42]. The connections table of a given rank contains all the GIDs of presy-
naptic neurons that are relevant for at least one neuron in that rank. The size
of the connection table depends, among other things, on K as well as the total
number of neurons and number of distributed ranks. In Fig 6C2 we plot the
values of the expected total size of the connection table on the (total number of
neurons, number of distributed ranks)-plane as a contour plot, highlighting the
contours corresponding to a total size of 1kB, 1MB and 1GB. Although in some
in silico models connectivity may be determined by complex rules influenced by
cell type and spatial locality, for simplicity we compute here the expected size
of the connections table assuming uniform connection probability and we also
assume that neurons are randomly distributed across ranks. In a strong scal-
ing scenario the size of the connections table steadily decreases as the number
of ranks increases, starting from the minimum of two ranks required by a dis-
tributed simulation. This can be explained by the fact that fixing the network
size and increasing the number of ranks entails that there will be fewer incoming
connections to a given rank. In a weak scaling scenario, such as the maximum
filling regime, the size of the connections table transiently increases when the
number of distributed ranks is low, but at large scale reaches a steady-state
value determined only by K.
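Under the same uniform-connectivity and random-placement assumptions, the expected table size can be sketched as follows (the 4 B per entry, one 32-bit GID, is our own illustrative assumption):

# Expected connection-table size per rank: a presynaptic GID enters the
# table if it targets at least one of the n = N/P neurons on the rank.
def expected_table_bytes(N, P, K, bytes_per_entry=4):
    n_local = N / P                           # neurons hosted per rank
    p_conn = K / N                            # uniform connection probability
    p_in_table = 1.0 - (1.0 - p_conn) ** n_local
    return N * p_in_table * bytes_per_entry

for P in (2, 100, 10_000, 1_000_000):
    mb = expected_table_bytes(N=1e8, K=1e4, P=P) / 1e6
    print(P, mb, "MB per rank")  # decreases as P grows (strong scaling)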
Conclusions
In this work we have delivered a quantitative characterization of the performance
properties of different published in silico models at the core of state-of-the-art
brain tissue simulations. Using a grey-box model that combines biological and
algorithmic properties with hardware specifications we have identified perfor-
mance bottlenecks under different simulation regimes, corresponding to a vari-
ety of prototypical scientific questions that can be answered by simulations of
biological neural networks.
General purpose computing has sustained a diverse performance land-
scape up to now Our results show that there exists a large diversity of
performance profiles and bottlenecks that shape the landscape of brain tissue
simulations, corresponding to the diversity of sizes and scales at which research
questions in simulation neuroscience can be asked. Thus, our research highlights
that the computational neuroscience community is currently greatly benefitting
from the adaptability of general purpose computing, exploiting the ease of de-
velopment and high performance capability to explore different areas of the
modeling landscape.
Memory bandwidth and network latency severely limit maximum fill-
ing and real time strong scaling Using a state-of-the-art HPC server CPU
and cluster as a reference, our analysis revealed that all the in silico models
saturate the memory bandwidth using quite a small number of shared memory
threads. Even when algorithmic improvements are put into place to mitigate this
effect we have identified that the coupling ratio, a dimensionless number that
counts the number of timesteps in a minimum network delay period, strongly reg-
ulates the saturation of memory bandwidth and, in the extreme case of the Sim-
plified model analysed here, effectively prevents any benefit to be gained from
the effort of developing a more efficient algorithm. Additionally, we discovered
that it is not the level of morphological detail, but rather the formalism used to
represent synapses, that is the most important factor in explaining the memory
bandwidth saturation profile, with COBA models saturating much faster than
the CUBA model. In distributed simulations we identified the network latency,
and not the network bandwidth, as the major bottleneck for scaling to very
large networks or very large cluster sizes. This provides a new motivation and
justification for the extensive efforts described in [52] in designing a specific
communication infrastructure for the SpiNNaker neuromorphic system.
Model-specific features have a significant impact on shared-memory
performance Inspection of our performance model allowed us to pinpoint
which kernels, hardware specifications and model parameters have the largest
impact on performance. The Brunel model based on the CUBA formalism and
IAF neurons is mainly bounded by the spike delivery kernel, which exhibits
a good shared-memory scaling behaviour and, in the case of extreme strong
scaling, a strong dependence on the inter-cache data paths for good perfor-
mance. The two COBA models we analyzed, i.e. Simplified and Reconstructed,
have a similar shared-memory scaling behaviour, mainly driven by the current
kernels required to compute the contributions of individual synapses (and ion
channels) to the membrane potential equation. However, while the Simplified
model is 100% dominated by memory bandwidth, the morphologically detailed
Reconstructed model is dominated partially (around 40%) by other hardware
components such as caches and CPU throughput. It becomes clear that a per-
formance model and a detailed performance analysis are fundamental tools to
disentangle the complex web of relationships between in silico models, their soft-
ware implementation and hardware concretization.
Static and dynamic model parameters affect performance in signifi-
cant but subtle ways Finally, we examined the impact of model parameters
on the performance profiles described above. We found that firing frequency,
but surprisingly also minimum network delay, can have a large impact on deter-
mining which hardware features may constitute a performance bottleneck. For
firing frequency it is obvious that larger values correspond to more operations
required by the simulation algorithm, and thus a lower performance, but our
analysis shows that different values of the firing frequency also change the rela-
tive importance of hardware features. Interestingly we found that the minimum
network delay, in spite of it not affecting the total number of operations per
simulated second, can have an effect on performance simply by shifting the im-
portance of the hardware bottlenecks. We also found that the average number
of incoming connections per neuron plays a subtle role in influencing perfor-
mance. Trivially, a larger fan-in increases the computational requirements of
a single neuron. However, it also increases the memory capacity requirements,
thus requiring a larger degree of parallelism to handle the same network size.
This creates a tradeoff between performance degradation arising from larger
computational requirements and performance improvement from parallelism re-
quirements.
Discussion and future improvements In this work we have concentrated
solely on the aspect of maximizing performance, without considering limitations
such as cost or energy. However, it must be stated that energy efficiency is a
central issue in the computational neuroscience community, and one of the main
selling points of neuromorphic hardware [11, 65]. Therefore, a meaningful ex-
tension to this work would be to incorporate a model for power consumption
alongside performance prediction, as a way to constrain the feasibility and effi-
ciency of certain simulation configurations. To achieve this, one could exploit
already established power consumption models that are easily integrated with
the ECM and have been shown to provide valuable insight into the power and
performance properties of simulation kernels [22, 29]. From the modeling point
of view, an important aspect that we have neglected in this analysis is synaptic
plasticity. A large portion of research questions that require brain tissue simu-
lations involve learning and synaptic plasticity, so this represents an important
extension to our analysis. However, in this work we decided to concentrate on
the inference part of brain tissue simulations because the diversity and complex-
ity of plasticity models warrants a separate analysis. Given that the performance
modeling infrastructure is already in place, we believe that the addition of plas-
ticity would not be a technical challenge, but it would considerably complexify
the resulting analysis. Even though we already considered potential hardware
improvements in our analysis, it would be interesting to extend this study to in-
clude hardware with a different design space such as the non-overlapping caches
of AMD CPUs [20] or the massive SIMD parallelism of GPGPUs [37]. Finally,
the methods developed in this paper can be extended to different simulation
approaches ranging from different communication strategies [4, 38] up to fully
asynchronous executions [46].
Performance modeling is required to enable next-generation high per-
formance brain tissue simulations Our analysis shows that, if future it-
erations of general-purpose hardware architectures maintain the same balance
as the current state-of-the-art, it will be very difficult to achieve fast, large
scale simulations of brain tissue. Even if hardware peak performance were to
improve significantly over the next years, the required speedup could only be
achieved via specifically targeted advancements and under very restrictive simu-
lation and model configurations. To support the next generation of brain tissue
simulations, the community must therefore focus on the design of dedicated
hardware.
In this work we show that the task of designing the hardware and software
necessary to achieve high performance simulations is made difficult by the fact
that neurons are not just neurons, but that there exists instead a large diversity
within the performance landscape. The level of diversity means that design
tradeoffs will inevitably require very restrictive decisions about the scale, model
and configuration of the target simulation, as in e.g. [37, 52]. For example,
while the high-bandwidth delivered by GPGPUs can provide significant speed-
up in simulations of medium to large networks of neurons, our analysis points out
that this improvement can be undermined by their extremely large number of
parallel cores, which makes them susceptible to dynamic load imbalance caused
by irregular firing patterns (see Fig 6), or by static load imbalance caused by
different neuronal morphologies in the case of detailed neurons (see Fig 1). These
phenomena have been observed empirically in the literature, by Yavuz et al. [72]
and Kumbhar et al. [39] respectively. To give another example, in the context of
accelerating small networks of neurons up to real-time, we predict that CUBA
models would benefit from architectures with a simple but fast cache hierarchy,
while for COBA models the cache bandwidth as well as the throughput of CPU
arithmetic operations must be improved in parallel in order to observe the desired
speedup. Additionally, simulation parameters can influence the performance
profile in a tangible way. In the case of the Simplified model, for example,
simulations are on average bounded by memory bandwidth and network latency
at the current value of the minimum network delay, but the profile would change
drastically if the δmin were increased to the same value of the Brunel model, with
memory bandwidth losing importance in favour of cache and CPU throughput
(see Fig 6). Even in ideal cases where the in silico experiment falls perfectly
within the design space of the hardware and software being used, simulation
dynamics outside of the control of the modeler such as firing frequency can
rapidly push the performance of simulations towards suboptimal regimes.
Ultimately, we stress that the diversity and complexity of the performance
landscape warrants the need for performance modeling to achieve high perfor-
mance simulations of brain regions and eventually the whole brain. We believe
that our work embodies a concrete step in defining and understanding key per-
formance properties of a wide variety of in silico models, necessary to enable
the next generation of brain tissue simulations.
Methods
Single-node performance model
The Execution-Cache-Memory model The Execution-Cache-Memory (ECM)
model is used to obtain runtime predictions at the granularity of individual clock
cycles. The applicability of the ECM model to brain tissue simulations has been
demonstrated in the context of morphologically detailed neurons in [12], and this
work represents an extension of that analysis.
The ECM model, introduced by [67] and refined in [29, 64], is based on
computing two potential bottlenecks separately, then making a prediction based
on the slowest one. The two bottlenecks are the “overlapping” execution time,
represented by the execution of code in the core, and the data-traffic time.
Several assumptions are required in order to reduce the complexity of estimating
these two contributions. For code execution in the core, we assume that either all
instructions in a loop are executed at their maximum throughput, or a Critical
Path (CP) dominates the runtime. In both cases we make extensive use of
the IACA [32] analysis tool provided by Intel to obtain reasonably accurate
predictions. For estimating data-traffic time, we combine the knowledge about
the speed of transfers between caches provided by vendors with the empirical
observation that, on all recent Intel server microarchitectures, the best accuracy
in predictions is obtained assuming that there is no temporal overlap between
any cache transfers [67]. Finally, Intel’s Skylake architecture requires some extra
care in modeling cache transfers because of the L3 victim cache setup, for which
an ECM model was never previously published. In this work, we found that the
assumptions that both clean and dirty cache lines evicted from L2 are copied
in L3, while all read traffic from DRAM goes straight to L2, lead to the most
accurate predictions. An in-depth analysis yielding satisfactory accuracy in the
predictions has been presented in [12, 21].
Reference architecture: Intel Skylake-X Even though the methodology
proposed in this paper is general, to obtain concrete performance predictions and
bottleneck analysis we are required to focus on a target architecture. To uphold
a satisfactory level of relevance for the high performance computing community
as well as generalizability to future architectures, we picked a modern, general-
purpose Intel server architecture, an Intel Skylake (SKX) Intel(R) Xeon(R) Gold
6140 with AVX512 vectorization and Sub-NUMA clustering turned off. The
most relevant hardware characteristics required by the performance model are
summarized in Table 3, but we refer to [12] for the full details. We obtained
these values either directly from the vendor’s spec sheets, by custom-designed
benchmarks or from the reference tables in [16].
Table 3. Hardware characteristics of reference architecture
SKX-AVX512.
value unit
CPU freq 2.3 GHz
Peak DP performance 1324.8 Gflop/s
Mem BW 105 GB/s
L1-L2 BW per core 64 B/cy
L2-L3 BW per core 2 × 16 B/cy
vector exp() throughput 1.5 cy/scalar iter
scalar exp() latency 22.2 cy/scalar iter
Simulation of in silico models and experiments We used the CoreNEU-
RON software presented in [40] to benchmark and validate the in silico models
and experiments analyzed in this paper. CoreNEURON is a biological neural
network simulation tool based on the compute engine of the NEURON simulator,
optimised both for memory usage and computational speed. Both NEURON
and CoreNEURON allow scientists to expand the repertoire of available ion
channel, synapse, point neuron models and stimuli by using a domain-specific
language called NMODL. A source-to-source compiler is then in charge of trans-
lating files from NMODL to C, to be loaded as a dynamic library by the simu-
lation engine at execution time.
For the Reconstructed model, a NEURON implementation was available for
download from the Blue Brain NMC portal introduced in [60]. We chose a cor-
tical layer 5 thick-tufted pyramidal cell as a representative example (specifically
the L5_TTPC1_cADpyr232_1 model), because it is one of the largest cells in the
reconstructed circuit and contains several different types of ion channels. In
order to maintain generality, we reversed a hand-tuned optimization that comes
in the downloaded code for the synaptic state update: it exploits the fact that
the differential equations describing the state update are extremely simple to
bypass the code generation step, forcing it to use a precomputed decay value
while avoiding the call to the expensive exponential function. However, we be-
lieve this optimization would significantly hinder the generality of our results,
thus we rolled back the hand-tuned code and reverted to a standard definition
of the synaptic state differential equations, leading to the presence of several
calls to the exponential function in the kernel. For the ion channel state and
ion channel current kernels, to avoid complexity and gain generality we decided
to compute an average of the different ion channel types in the neuron model,
weighted by the number of compartments where that ion channel type is present.
The linear algebra kernel requires an in-depth analysis because its structure pre-
vents vectorization, see [12]. A few additional tweaks to improve readability
and allow better benchmarking were made, but none of them impacted the in-
struction count nor the computational properties of the kernels. We refer the
interested reader to [12] for more details on the ECM model for individual ion
channel and synapse models.
For the Simplified model, a NEURON implementation was provided to us by
the authors of [61]. The Simplified model is based on the concept of simplifica-
tion of the Reconstructed neural microcircuit, by reducing the morphologically
detailed neurons to COBA GIF point neuron model as described in [59]. In the
Simplified model all synaptic connections between neurons are “routed” through
one of 36 synaptic processes on each neuron (28 excitatory and 8 inhibitory), de-
pending on their transmission delay and physiological properties. This allows a
significant reduction of memory and computational footprints. Since the synap-
tic model is based on that of the Reconstructed model, we performed the same
optimization rollback described above to maintain generality in our discussion
and results.
For the Brunel model, we implemented the kernels ourselves using the NMODL
language, based on the reference implementation iaf_psc_alpha.cpp in the NEST
software. The underlying neuron model is the integrate-and-fire model with
alpha-shaped current [17]. Since the Brunel model is based on the CUBA for-
malism, it does not contain any current kernels, although it still has a PSC
kernel in which the total current contributions from synaptic events is added to
the equations governing the membrane potential dynamics.
ECM model of clock-driven kernels and validation We computed the
ECM model for all the relevant kernels constituting the simulation workflow
presented in Fig 1. We present the results in Table 4, following the notation
introduced in [67]:
$$T_{\mathrm{contributions}} = \{T_{OL} \,\|\, T_{nOL} \,|\, T_{L1L2} \,|\, T_{L2L3} \,|\, T_{L3Mem}\}\ \mathrm{cy}$$
$$T_{\mathrm{predictions}} = \left\{T^{ECM}_{L1} \,\rceil\, T^{ECM}_{L2} \,\rceil\, T^{ECM}_{L3} \,\rceil\, T^{ECM}_{Mem}\right\}\ \mathrm{cy}.$$
The units for all runtime predictions are cycles per scalar iteration, denoted
as [cy], while the units for performance are Giga-scalar iterations per second,
denoted by [Giga/s].
Table 4. ECM model of clock-driven kernels.
kernel model [cy] predictions [cy]
syn current {7.20 ‖ 3.50 | 3.21 | 8.33 | 4.50} {7.20 ⌉ 7.20 ⌉ 15.04 ⌉ 19.54}
ion channel current {4.68 ‖ 2.56 | 1.92 | 5.07 | 2.70} {4.68 ⌉ 4.68 ⌉ 9.55 ⌉ 12.25}
linear algebra {8.10 ‖ 6.00 | 1.40 | 4.00 | 1.90} {8.10 ⌉ 8.10 ⌉ 11.40 ⌉ 13.30}
syn state {9.70 ‖ 1.70 | 1.50 | 4.00 | 2.10} {9.70 ⌉ 9.70 ⌉ 9.70 ⌉ 9.70}
ion channel state {15.82 ‖ 2.16 | 1.59 | 3.56 | 2.24} {15.82 ⌉ 15.82 ⌉ 15.82 ⌉ 15.82}
filtered synapse current {7.44 ‖ 3.50 | 3.51 | 9.03 | 4.92} {7.44 ⌉ 7.44 ⌉ 16.03 ⌉ 20.95}
gif current {9.88 ‖ 3.38 | 3.75 | 11.00 | 5.26} {9.88 ⌉ 9.88 ⌉ 18.12 ⌉ 23.38}
filtered synapse state {13.31 ‖ 2.62 | 2.00 | 5.50 | 2.80} {13.31 ⌉ 13.31 ⌉ 13.31 ⌉ 13.31}
gif state {28.50 ‖ 6.25 | 2.38 | 6.50 | 3.33} {28.50 ⌉ 28.50 ⌉ 28.50 ⌉ 28.50}
iaf update {2.41 ‖ 1.75 | 2.62 | 8.00 | 3.68} {2.41 ⌉ 4.38 ⌉ 12.38 ⌉ 16.05}
iaf psc {0.62 ‖ 0.38 | 1.25 | 3.00 | 1.75} {0.62 ⌉ 1.62 ⌉ 4.62 ⌉ 6.38}
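The predictions in Table 4 follow from the contributions through the no-overlap assumption described above: at each memory level, the prediction is the maximum of the in-core time and the load time plus all transfer contributions down to that level. The following sketch reproduces the synaptic current row:

# ECM composition rule under the no-overlap assumption for cache transfers.
def ecm_predictions(T_OL, T_nOL, T_L1L2, T_L2L3, T_L3Mem):
    cumulative_transfers = [0.0, T_L1L2, T_L1L2 + T_L2L3,
                            T_L1L2 + T_L2L3 + T_L3Mem]
    return [max(T_OL, T_nOL + t) for t in cumulative_transfers]

# syn current contributions {7.20 || 3.50 | 3.21 | 8.33 | 4.50} cy:
print(ecm_predictions(7.20, 3.50, 3.21, 8.33, 4.50))
# -> [7.2, 7.2, 15.04, 19.54], matching the predictions column of Table 4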
To validate our predictions we benchmarked and measured the performance
of the individual kernels in a simulation that is representative of a typical work-
load. For simplicity, in all the benchmarks presented here we made sure that the
dataset was large enough to only fit in DRAM; however, a more general valida-
tion was conducted in [12] for the kernels constituting the Reconstructed model.
For all in silico models we validated our predictions for the serial execution as
well as for shared memory scaling. Validation results for the Brunel, Simplified
and Reconstructed model are presented in Table 5 and Fig 7A1,A2,A3 respec-
tively. In the serial case, the prediction errors are all within 20-30% of the
measured runtime. Obtaining more accurate predictions is not an easy task,
and the reasons for this can vary across kernels. For ion channel state and cur-
rent kernels, our model contains an inevitable bias since it represents an average
of different ion channel types. For the linear algebra kernel, a large variability
was observed in the serial case. While it is still possible to obtain reasonably
accurate predictions for the state kernels, it must be noted that ultimately it is
extremely hard to predict the dynamic behaviour of the out-of-order engine in
a complex, modern architecture. Finally, given that most of the predictions are
optimistic, it is reasonable to assume that performance limiting factors such as
dynamic CPU throttling, as well as intrinsic factors such as critical paths and
instruction latencies, could be impacting the performance negatively. Digging
deeper into the complexity of the computer architecture to obtain more accurate
predictions is outside the scope of this paper, whose purpose is to employ per-
formance modeling as a means of explaining bottlenecks and predicting future
trends, not to implement optimizations at the granularity of individual cycles.
Table 5. Validation of ECM performance model for all kernels.
Measurements are shown as median values ± inter-quartile range from a
dataset of 10 independent benchmark executions.
serial parallel
kernel pred [cy] meas [cy] pred [cy] meas [cy]
syn current 19.5 24.6±1.5 4.5 4.4±0.1
ion channel current 12.2 15.2±0.3 2.7 3.3±0.1
linear algebra 13.3 18.8±5.3 1.9 2.2±0.2
syn state 9.7 13.8±0.2 2.1 2.0±0.0
ion channel state 15.8 20.6±0.1 2.2 2.3±0.0
filtered synapse current 21.0 28.9±1.8 4.9 4.9±0.1
gif current 23.4 33.7±1.7 5.3 6.0±0.2
filtered synapse state 13.3 21.1±0.4 2.8 2.8±0.1
gif state 28.5 25.4±0.0 3.3 3.1±0.0
iaf update 16.1 26.4±0.9 3.7 4.5±0.0
iaf psc 6.4 8.2±0.6 1.8 1.7±0.0
Fig 7. Validation of performance modeling of in silico models and
experiments. Lines represent our predictions using the ECM model, while X
markers represent median benchmark measurements and error bars represent
the 25%-75% percentiles. To improve readability we used dashed lines for state
update kernels and solid lines for current kernels. All benchmarks were
designed with big enough datasets to ensure data was always coming from
DRAM. A1,A2,A3 Computational kernels of the Brunel, Simplified and
Reconstructed model, respectively.
Spike delivery kernel The spike delivery kernel is arguably the most distin-
guishing aspect of brain tissue simulations when compared to models of other
biological or physical phenomena. Regarding its performance aspects, we have
shown that this kernel has a very different impact on overall performance for
CUBA and COBA models.
In terms of algorithm design, all state-of-the-art software use some sort of
priority queue or ring buffer to store synaptic events to be delivered within a
timestep. For benchmarking, we have decided to separate the operations related
to bookkeeping of events inside the queue from the actual kernel execution, and we
only consider the latter because the scope of this paper is restricted to com-
putational and communication kernels. We slightly changed the benchmarking
approach from [12] because we found that, given the implementation infrastruc-
ture in CoreNEURON, the runtime overhead due to programming design was
comparable to the runtime of a CUBA kernel. Therefore we first estimated this
overhead by executing an empty kernel, and then subtracted it from the actual
benchmarks.
The spike delivery kernel is characterized by erratic memory accesses, be-
cause the order of activation of synapses is unpredictable. In this paper we
always consider the worst possible case in which every spike to be delivered
could not be cached and thus must come from main memory. In [12] we have
shown that the synapses in the Reconstructed model, and therefore also the
Simplified one since they have the same delivery kernel, were affected by the
latency of the memory system but not completely dominated by it. The reason
is that while the CPU is handling the delivery of one spike, it can “look ahead”
and schedule requests for the data of the next one. This is different from the
classic purely latency bound kernels in which the CPU is allowed to begin an
iteration only after the previous is fully completed. On the other hand, all the
requests for data are noncontiguous and therefore none of the pipelining and
prefetching techniques are very effective in hiding the latency of fetching the
data.
For the COBA delivery kernel, we found that the strategy of designing a
synthetic benchmark with a similar access pattern to estimate an adjusted la-
tency worked very well. On our validation architecture, the adjusted latency
was around 20 cy per memory access. This led us to a single-thread prediction
of 460 cy per delivered spike, to be compared to the benchmark measurement of
571.6 ± 14.9 cy, which represents roughly a 20% error. For multiple threads,
we assumed the performance scales linearly with the number of threads until
the bottleneck of memory bandwidth is exhausted. It should be noted in this
case that the predicted memory traffic is quite large because we assume that
for every memory access, a full cache line (64 B) must be pulled from memory
even though only a single double-precision variable (i.e. 8 B) is required. At
the maximum number of threads, this amounts to a predicted runtime of 32 cy per
delivered spike, against benchmark measurements of 45.0 ± 0.1 cy, i.e. a 28%
error. Applying the same strategy to the CUBA delivery kernel yielded accept-
able but not perfect predictions. For a single thread, we predicted a runtime of
60 cy but measured 38.0 ± 1.8 cy, giving an error of 58%, while at the maximum
thread count we predicted a runtime of 4 cy but measured 2.8 ± 0.04 cy, i.e. a 43% error.
Given our current understanding of this kernel, and the complexity introduced
by the out-of-order execution and memory access scheduling, we deem these
predictions quite satisfactory. In Fig 8 we present the predicted and measured
performance and memory traffic for the spike delivery kernel.
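The prediction scheme can be summarized in a short sketch; the figure of 23 memory accesses per delivered spike is our reading of the ratio 460 cy / 20 cy, not a number stated explicitly above:

# Latency-adjusted prediction for the COBA spike delivery kernel: linear
# thread scaling until the implied DRAM traffic (one 64 B line per access)
# hits the memory bandwidth ceiling of the SKX reference architecture.
ADJ_LATENCY_CY = 20        # adjusted latency per memory access [cy]
ACCESSES_PER_SPIKE = 23    # inferred from 460 cy/spike / 20 cy/access
CPU_HZ = 2.3e9             # CPU frequency, Table 3
MEM_BW = 105e9             # memory bandwidth [B/s], Table 3

def cycles_per_spike(threads):
    latency_bound = ACCESSES_PER_SPIKE * ADJ_LATENCY_CY / threads
    bytes_per_spike = ACCESSES_PER_SPIKE * 64
    bandwidth_bound = bytes_per_spike / MEM_BW * CPU_HZ
    return max(latency_bound, bandwidth_bound)

print(cycles_per_spike(1), cycles_per_spike(18))  # ~460 cy and ~32 cy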
Fig 8. Validation of the spike delivery kernel.
A Performance predictions (dashed lines) and benchmark measurements (X
markers). Performance is measured in 10^6 spikes delivered per wallclock
second. Error bars represent the 25% and 75% percentiles. B Measured
memory traffic per delivered spike in Bytes. Dashed lines represent the
predictions from our model.
ECM model predictions and inference During our analysis, we exten-
sively used the ECM model to perform inference on the performance of in silico
models and experiments. For completeness we report here our strategy for com-
puting derived metrics or quantities from the raw output of the ECM model. To
compute the memory utilization, shown e.g. in Fig 2A, we first computed the
memory bandwidth as the ratio between the total memory traffic in a timestep
divided by the predicted runtime to execute that timestep. Then we divided the
memory bandwidth by the theoretical peak bandwidth of the system to obtain
the memory utilization.
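As a minimal sketch of this metric, using the peak bandwidth of the SKX reference architecture from Table 3:

# Memory utilization as defined above: achieved bandwidth over peak.
PEAK_BW = 105e9  # B/s, SKX-AVX512 (Table 3)

def memory_utilization(traffic_bytes_per_timestep, runtime_s_per_timestep):
    achieved_bw = traffic_bytes_per_timestep / runtime_s_per_timestep
    return achieved_bw / PEAK_BW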
Another derived quantity that we make extensive use of is the concept of hard-
ware bottleneck. To define this, we first introduce the concept of hardware
contribution as the sum of all the predicted runtimes in which a particular hard-
ware feature was the determining factor for performance, as shown in Fig 4. To
give an example, let us take the ECM model for the synaptic current, expressed
in notation as: {7.20 ‖ 3.50 | 3.21 | 8.33 | 4.50} cy. Taking the case where the data
fits in L2, the ECM prediction corresponds to $T^{ECM}_{L2} = T_{OL} = 7.20$ cy, which
are all spent in the core executing arithmetic operations. So even though 3.21
cy are spent in transfers between the L1 and L2 cache, we do not consider this
because the arithmetic operations in the core are the only determining factor.
We can further refine this metric by splitting the 7.20 cy into 1.5 cy for the
exponential (per scalar iteration, assuming full throughput) and 5.7 cy for the
rest of the arithmetic operations. Thus the contributions to the synaptic current
kernel, when data is in L2, would be exp: 1.5 cy and CPU: 5.7 cy. Conversely,
when data is in DRAM, the ECM prediction for a serial execution is given by
$T^{ECM}_{Mem} = T_{nOL} + T_{L1L2} + T_{L2L3} + T_{L3Mem} = 19.54$ cy. Therefore in this case,
since the only dominating factors correspond to the data transfer, the hardware
contributions consist of load: 3.5 cy, L1→L2: 3.21 cy, L2→L3: 8.33 cy, DRAM
bw: 4.5 cy. Considering a parallel execution using 18 threads, if we assume full
memory bandwidth saturation then the only dominating factor for performance
is the memory bandwidth, and therefore the hardware contributions would be
only DRAM bw: 4.5 cy. Finally, the situation becomes more complicated if we
wish to take into account the special loop ordering to avoid bandwidth satura-
tion. In this case, taking again the example of the synaptic current kernel in
the detailed model, in the first out of four time iterations data comes from mem-
ory, and in the following three out of four time iterations data comes from L3.
Thus the hardware contributions, averaged over time iterations, would be load:
3
4
×
3.5cy
18
= 0.15 cy, L1→L2: 3
4
×
3.21cy
18
= 0.13 cy, L2→L3: 3
4
×
1.57cy
18
= 0.35
cy, DRAM bw: 1
4
× 4.5cy = 1.13 cy.
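The same bookkeeping, written compactly as a sketch (18 threads, coupling ratio of 4, contributions of the synaptic current kernel):

# Averaged hardware contributions under the special loop ordering: data
# comes from DRAM in 1 of 4 time iterations and from L3 in the other 3;
# the scalable terms are shared among 18 threads, while the saturated
# DRAM bandwidth term is not.
threads, ratio = 18, 4
scalable = {"load": 3.50, "L1->L2": 3.21, "L2->L3": 8.33}
avg = {k: (ratio - 1) / ratio * v / threads for k, v in scalable.items()}
avg["DRAM bw"] = 1 / ratio * 4.50
print(avg)  # load ~0.15, L1->L2 ~0.13, L2->L3 ~0.35, DRAM bw ~1.13 cy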
Interprocess communication performance model
Spike exchange in brain tissue simulations All the in silico models and
experiments considered in this paper are based on the Bulk Synchronous Parallel
(BSP) model [68], which prescribes a clear distinction between an on-node com-
putation phase (happening in a distributed parallel fashion) and an inter-node
communication phase. For brain tissue simulations, the inter-node communica-
tion phase corresponds to the spike exchange step in Fig 1B1,B2. Moreover, we
make the assumption that the distributed processing is implemented in MPI,
because it represents the current state of practice in the HPC community. In
all the widely-used state-of-the-art simulators, the spike exchange step is im-
plemented by a blocking collective call, typically to a variant of the Allgather
operation. This entails that all the parallel ranks have, at the end of the commu-
nication step, knowledge of all the spikes produced by the simulation during the
last min-delay period. Recent work has shown that at extremely large scales,
this implementation can become prohibitively expensive in terms of memory re-
quirements, and proposed to use instead the Alltoall operation to deliver spikes
only to the ranks where they are required [35]. Other alternative implementa-
tions have been suggested, using nonblocking point-to-point communication [4]
or spatial decomposition [38]. All these fall outside of the scope of this paper,
which is focused on medium-to-large cluster sizes and well established, widely
used software solutions.
The LogGP model We use the LogGP model [2] to predict and explain
the performance of the spike exchange simulation step. The LogGP model is
an extension to the LogP model [13] that uses an additional parameter, the
gap per byte denoted G, as a way to account for the sending and receiving of
long messages. The main feature of all models based on LogP is that their
parameters are easily relatable to hardware characteristics, thus allowing a high
degree of interpretability. In the LogGP model, a distinction is made between
two bottlenecks: the processor overhead and the networking hardware. The
L, g, G parameters are all related to the network and represent, respectively,
an upper bound on the latency, the minimum gap between messages (i.e. the
reciprocal of the maximum injection rate) and gap per byte (i.e. the reciprocal
of the network bandwidth). On the processor side, even though the original
model prescribes a single parameter o representing a fixed overhead, we followed
the approach in [28] and considered two parameters oi, os that represent the
intercept and slope of the linear relationship between message size and processor
overhead. Finally, the last parameter of relevance is P , i.e. the total number of
parallel processes.
Reference architecture: Infiniband EDR with HPE-MPI Even though
the performance modeling tools considered here can generalize to several types
of architectures [28], to validate our performance predictions we restrict our
focus to a representative example of high performance network architecture:
a vendor (HPE) provided MPI implementation, based on MPT 2.16 and the
MPI 3.0 standard, over an Infiniband EDR 100 Gb/s fabric. We used the
Netgauge v2.4.6 tool introduced in [27] to make a first assessment of the LogGP
parameters, and found that according to the recommended interpretation of the
output from the tool, two different sets of parameters should be used for small
(less than 16kB) and large message sizes. We report the parameters in Table 6.
Table 6. LogGP parameters.
  small sizes  large sizes  unit
L  1.54  1.54  µs
oi  0.133  0.0249  µs
os  4.59 × 10^-5  6.48 × 10^-5  µs/B
g  0.526  6.12  µs
G  1.42 × 10^-4  2.07 × 10^-4  µs/B
The LogGP framework allows us to directly model the latency of point-
to-point communication using the formula $(L + 2o_i) + (G + 2o_s)\,m$, where $m$
is the message size. However, additional work is required to model collective
communication because different algorithms can be used to disseminate the
messages across the network. For the Allgather operation there are several
implementations available [66], and the decision of which one to use can be a
complex function of the static network characteristics such as the topology as
well as dynamic specifications such as the message size and number of ranks
involved. In this work we consider only the ring algorithm, because it is the
most commonly used [66] (especially for large message sizes) and we wish to
keep our analysis simple. Therefore, the total latency to perform an Allgather
operation among P parallel ranks with a total message size of m is:
$$T_{\mathrm{Allgather}} = (P - 1)(L + 2o_i) + \frac{P-1}{P}(G + 2o_s)\,m.$$
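As a sketch, the point-to-point and ring-Allgather latencies with the small-message parameters of Table 6:

# LogGP latency formulas from above (times in microseconds, sizes in bytes),
# using the small-message parameters of Table 6.
L, o_i, o_s, G = 1.54, 0.133, 4.59e-5, 1.42e-4

def t_point_to_point(m):
    return (L + 2 * o_i) + (G + 2 * o_s) * m

def t_allgather_ring(P, m):
    # m is the total message size gathered over all P ranks
    return (P - 1) * (L + 2 * o_i) + (P - 1) / P * (G + 2 * o_s) * m

print(t_point_to_point(1024), t_allgather_ring(32, 4096))  # us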
Validation and inference Firstly, it must be noted that the original CoreNEU-
RON implementation of the spike exchange algorithm uses an MPI custom
datatype to represent a spike as an aggregate of a double-precision variable
representing the time of spike and an integer variable representing the ID of
the source neuron. This implementation, however, can turn out to be unsat-
isfactory in terms of performance, in particular when one tries to predict the
runtime of a collective communication, because hidden data shuffling and copy
operations can take place [10]. To address the performance issues deriving from
custom datatypes, we reimplemented the spike exchange operation to perform
two consecutive MPI allgatherv operations: the first on the array of timings
and the second on the array of source neuron IDs. To measure the latency of
the communication itself, we executed a simulation where we artificially filled
every communication buffer on each distributed rank with the same number of
spikes, and measured the communication time per rank.
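A minimal sketch of this two-phase exchange (written here with mpi4py for brevity; the actual reimplementation lives in CoreNEURON's compute engine):

# Two consecutive Allgatherv operations, one for spike times and one for
# source neuron IDs, replacing the single exchange with a custom datatype.
import numpy as np
from mpi4py import MPI

def exchange_spikes(comm, times, gids):
    # times: float64 array, gids: int32 array of spikes local to this rank
    nlocal = np.array([times.size], dtype=np.int32)
    counts = np.empty(comm.Get_size(), dtype=np.int32)
    comm.Allgather([nlocal, MPI.INT], [counts, MPI.INT])
    displs = np.insert(np.cumsum(counts[:-1]), 0, 0)
    all_times = np.empty(counts.sum(), dtype=np.float64)
    all_gids = np.empty(counts.sum(), dtype=np.int32)
    comm.Allgatherv([times, MPI.DOUBLE],
                    [all_times, counts, displs, MPI.DOUBLE])
    comm.Allgatherv([gids, MPI.INT],
                    [all_gids, counts, displs, MPI.INT])
    return all_times, all_gids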
In the process of validation, we discovered that the raw LogGP model based
on the small-messages parameters from Netgauge was only valid in the context
of very small messages, while the large-messages parameters never seemed rel-
evant for our use case. In particular, we found that for messages larger than
P × 65B the Allgather operation incurred a penalty in both latency and band-
width. The source of this penalty is unclear, as it could be due to a switch
in point-to-point protocols, a change in the underlying algorithm or additional
communication to ensure synchronization. More refined performance models
exist such as the LogGPS [31] that try to take global synchronization latencies
into account, but we found that a simpler yet effective way to improve our
model was to introduce two penalty terms: a latency penalty $p_L = 0.593$ µs and
a bandwidth penalty $p_G = 1.875 \times 10^{-4}$ µs/B. In light of this, the formula to
obtain a prediction for the latency of the Allgather operation becomes:
$$T_{\mathrm{Allgather}} = \begin{cases} (P - 1)(L + 2o_i) + \frac{P-1}{P}(G + 2o_s)\,m & m < 65P \\ (P - 1)(L + 2o_i + p_L) + \frac{P-1}{P}(G + 2o_s + p_G)\,m & m \geq 65P. \end{cases}$$
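Extending the ring-Allgather sketch with these two penalty terms gives the piecewise prediction that we validate below (parameters repeated so the snippet is self-contained):

# Penalty-adjusted Allgather prediction (times in us, sizes in bytes).
L, o_i, o_s, G = 1.54, 0.133, 4.59e-5, 1.42e-4   # Table 6, small messages
p_L, p_G = 0.593, 1.875e-4                       # empirical penalty terms

def t_allgather(P, m):
    if m < 65 * P:
        return (P - 1) * (L + 2 * o_i) + (P - 1) / P * (G + 2 * o_s) * m
    return (P - 1) * (L + 2 * o_i + p_L) + (P - 1) / P * (G + 2 * o_s + p_G) * m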
We show in Fig 9A1,A2 the measured and predicted $T_{\mathrm{Allgather}}$ for P = 4 and
P = 32 respectively. The model has a good overall agreement with the mea-
sured data, rarely exceeding 10% relative error and often within the intrinsic
variability of the data itself. This grants a significant degree of confidence in
our predictions, at least for cluster sizes that remain within reasonable distance
from our maximum validation size of 32 nodes.
Fig 9. Validation of interprocess communication performance
modeling.
Latency of the Allgatherv operation in µs, defined as the median latency
across all ranks involved in the communication, for different values of the total
message size. The error bars represent the maximum and minimum latency
over all parallel ranks. The black lines represent the prediction from our
implementation of the LogGP model. The total number of ranks involved in the
communication in A1 was 4, and in A2 was 32.
Several shortcomings that can impact the modeling of MPI communication
runtimes have been reviewed in [26]. Although our application of the LogGP
model addresses some of them, namely the custom datatypes and the over-
lapping of communication and computation, others such as synchronization,
congestion and topology may negatively affect the accuracy of our predictions,
especially at very large cluster sizes. For example, one of the main points advo-
cating for a multicast spike communication strategy presented in [52] is tightly
linked with the average number of hops in the underlying network hardware,
i.e. a topological parameter which is not considered by our model. We believe,
however, that within the design space of medium to large clusters considered
here our predictions are still able to deliver the sufficient accuracy required by
a bottleneck analysis.
Acknowledgments
This work has been funded by the EPFL Blue Brain Project (funded by the
Swiss ETH board). The authors gratefully acknowledge the compute resources
and support provided by the Erlangen Regional Computing Center (RRZE),
and in particular Georg Hager for the fruitful discussions regarding the ECM
model and the interpretation of performance predictions. We are also indebted
to the BlueBrain HPC team, and in particular Bruno Magalhaes and Pramod
Kumbhar, for helpful support and discussion regarding CoreNEURON.
References
1. Syed Ahmed Aamir, Yannik Stradmann, Paul Müller, Christian Pehle,
Andreas Hartel, Andreas Grübl, Johannes Schemmel, and Karlheinz
Meier. An accelerated lif neuronal network array for a large-scale mixed-
signal neuromorphic architecture. IEEE Transactions on Circuits and
Systems I: Regular Papers, (99):1–14, 2018.
2. Albert Alexandrov, Mihai F Ionescu, Klaus E Schauser, and Chris
Scheiman. Loggp: Incorporating long messages into the logp model
for parallel computation. Journal of parallel and distributed computing,
44(1):71–79, 1997.
3. Rajagopal Ananthanarayanan, Steven K Esser, Horst D Simon, and Dhar-
mendra S Modha. The cat is out of the bag: cortical simulations with 10^9
neurons, 10^13 synapses. In Proceedings of the Conference on High Per-
formance Computing Networking, Storage and Analysis, page 63. ACM,
2009.
4. Rajagopal Ananthanarayanan and Dharmendra S Modha. Anatomy of a
cortical simulator. In Proceedings of the 2007 ACM/IEEE conference on
Supercomputing, page 3. ACM, 2007.
5. Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt
Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik
Sen, John Wawrzynek, et al. A view of the parallel computing landscape.
Communications of the ACM, 52(10):56–67, 2009.
6. Romain Brette and Dan FM Goodman. Simulating spiking neural net-
works on gpu. Network: Computation in Neural Systems, 23(4):167–182,
2012.
7. Romain Brette, Michelle Rudolph, Ted Carnevale, Michael Hines, David
Beeman, James M Bower, Markus Diesmann, Abigail Morrison, Philip H
Goodman, Frederick C Harris Jr, et al. Simulation of networks of spik-
ing neurons: a review of tools and strategies. Journal of computational
neuroscience, 23(3):349–398, 2007.
8. Nicolas Brunel. Dynamics of sparsely connected networks of excitatory
and inhibitory spiking neurons. Journal of computational neuroscience,
8(3):183–208, 2000.
9. Alexandru Calotoiu, Torsten Hoefler, Marius Poke, and Felix Wolf. Us-
ing automated performance modeling to find scalability bugs in complex
codes. In Proc. of the ACM/IEEE Conference on Supercomputing (SC13),
Denver, CO, USA, pages 1–12. ACM, November 2013.
10. Alexandra Carpen-Amarie, Sascha Hunold, and Jesper Larsson Träff. On
expected and observed communication performance with mpi derived
datatypes. Parallel Computing, 69:98–117, 2017.
11. Andrew S Cassidy, Rodrigo Alvarez-Icaza, Filipp Akopyan, Jun Sawada,
John V Arthur, Paul A Merolla, Pallab Datta, Marc Gonzalez Tallada,
Brian Taba, Alexander Andreopoulos, et al. Real-time scalable cortical
computing at 46 giga-synaptic ops/watt with ~100x speedup in time-to-solution
and ~100,000x reduction in energy-to-solution. In Proceedings of the inter-
national conference for high performance computing, networking, storage
and analysis, pages 27–38. IEEE Press, 2014.
12. Francesco Cremonesi, Georg Hager, Gerhard Wellein, and Felix
Schürmann. Analytic performance modeling and analysis of detailed neu-
ron simulations. arXiv preprint arXiv:1901.05344, 2019. under review.
13. David Culler, Richard Karp, David Patterson, Abhijit Sahay,
Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten
Von Eicken. Logp: Towards a realistic model of parallel computation. In
ACM Sigplan Notices, volume 28, pages 1–12. ACM, 1993.
14. Hubert Eichner, Tobias Klug, and Alexander Borst. Neural simulations
on multi-core architectures. Frontiers in neuroinformatics, 3:21, 2009.
15. Andreas K Fidjeland, Etienne B Roesch, Murray P Shanahan, and Wayne
Luk. Nemo: a platform for neural modelling of spiking neurons using
gpus. In 2009 20th IEEE International Conference on Application-specific
Systems, Architectures and Processors, pages 137–144. IEEE, 2009.
16. Agner Fog. Instruction tables: Lists of instruction latencies, throughputs
and micro-operation breakdowns for intel, amd and via cpus. Technical
University of Denmark, 2017.
17. Wulfram Gerstner, Werner M Kistler, Richard Naud, and Liam Panin-
ski. Neuronal dynamics: From single neurons to networks and models of
cognition. Cambridge University Press, 2014.
18. Espen Hagen, David Dahmen, Maria L Stavrinou, Henrik Lindén, Tom
Tetzlaff, Sacha J van Albada, Sonja Grün, Markus Diesmann, and
Gaute T Einevoll. Hybrid scheme for modeling local field potentials from
point-neuron networks. Cerebral Cortex, pages 1–36, 2016.
19. Espen Hagen, Solveig Næss, Torbjørn V Ness, and Gaute T Einevoll.
Multimodal modeling of neural network activity: Computing lfp, ecog,
eeg, and meg signals with lfpy 2.0. Frontiers in neuroinformatics, 12,
2018.
20. Georg Hager. Benchmarking the memory hierarchy of the new
amd ryzen cpu using the vector triad. Georg Hager’s blog, 2017.
https://blogs.fau.de/hager/archives/7810.
21. Georg Hager, Jan Eitzinger, Julian Hornich, Francesco Cremonesi,
Christie L Alappat, Thomas Röhl, and Gerhard Wellein. Applying the
execution-cache-memory model: Current state of practice. Poster at
Supercomputing 2018, 2018.
22. Georg Hager, Jan Treibig, Johannes Habich, and Gerhard Wellein. Ex-
ploring performance and power properties of modern multi-core chips via
simple machine models. Concurrency and Computation: Practice and
Experience, 28(2):189–210, 2016.
23. Michael Hines. Efficient computation of branched nerve equations. Inter-
national journal of bio-medical computing, 15(1):69–76, 1984.
24. Michael Hines, Sameer Kumar, and Felix Schürmann. Comparison of neu-
ronal spike exchange methods on a blue gene/p supercomputer. Frontiers
in computational neuroscience, 5:49, 2011.
25. Michael L Hines, Henry Markram, and Felix Schürmann. Fully implicit
parallel simulation of single neurons. Journal of computational neuro-
science, 25(3):439–448, 2008.
26. Torsten Hoefler, William Gropp, Rajeev Thakur, and Jesper Larsson
Träff. Toward performance models of mpi implementations for under-
standing application scaling issues. In European MPI Users’ Group Meet-
ing, pages 21–30. Springer, 2010.
27. Torsten Hoefler, Andre Lichei, and Wolfgang Rehm. Low-overhead loggp
parameter assessment for modern interconnection networks. In 2007
IEEE International Parallel and Distributed Processing Symposium, pages
1–8. IEEE, 2007.
28. Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine. Loggp in the-
ory and practice–an in-depth analysis of modern interconnection networks
and benchmarking methods for collective operations. Simulation Mod-
elling Practice and Theory, 17(9):1511–1521, 2009.
29. Johannes Hofmann, Georg Hager, and Dietmar Fey. On the accuracy and
usefulness of analytic energy models for contemporary multicore proces-
sors. In International Conference on High Performance Computing, pages
22–43. Springer, 2018.
30. Michael Hopkins and Steve Furber. Accuracy and efficiency in fixed-point
neural ode solvers. Neural computation, 27(10):2148–2182, 2015.
31. Fumihiko Ino, Noriyuki Fujimoto, and Kenichi Hagihara. Loggps: a paral-
lel computational model for synchronization analysis. In ACM SIGPLAN
Notices, volume 36, pages 133–142. ACM, 2001.
32. Intel. Intel Architecture Code Analyzer, November 2017.
33. Tammo Ippen, Jochen M Eppler, Hans E Plesser, and Markus Diesmann.
Constructing neuronal network models in massively parallel environments.
Frontiers in neuroinformatics, 11:30, 2017.
34. Eugene M Izhikevich and Gerald M Edelman. Large-scale model of mam-
malian thalamocortical systems. Proceedings of the national academy of
sciences, 105(9):3593–3598, 2008.
35. Jakob Jordan, Tammo Ippen, Moritz Helias, Itaru Kitayama, Mitsuhisa
Sato, Jun Igarashi, Markus Diesmann, and Susanne Kunkel. Extremely
scalable spiking neuronal network simulation code: from laptops to exas-
cale computers. Frontiers in neuroinformatics, 12:2, 2018.
36. Wouter Klijn, Stuart Yates, Vasileios Karakasis, Alexander Peyser, and
Benjamin Cumming. Arbor: A morphologically detailed neural network
simulator for modern high performance computer architectures. In 26th
Computational Neuroscience Meeting, number FZJ-2017-05647. Jülich Su-
percomputing Center, 2017.
37. James Courtney Knight and Thomas Nowotny. Gpus outperform cur-
rent hpc and neuromorphic solutions in terms of speed and energy when
simulating a highly-connected cortical model. Frontiers in neuroscience,
12:941, 2018.
38. James Kozloski and John Wagner. An ultrascalable solution to large-scale
neural tissue simulation. Frontiers in neuroinformatics, 5:15, 2011.
39. Pramod Kumbhar and Michael Hines. Coreneuron: Morphologically de-
tailed neuron simulator optimisations on gpus. Presented at GTC: GPU
Technology Conference, 2016.
40. Pramod Kumbhar, Michael Hines, Jeremy Fouriaux, Aleksandr
Ovcharenko, James King, Fabien Delalondre, and Felix Schürmann.
Coreneuron: An optimized compute engine for the neuron simulator.
arXiv preprint arXiv:1901.10975, 2019.
41. Pramod S Kumbhar, Subhashini Sivagnanam, Kenneth Yoshimoto,
Michael Hines, Ted Carnevale, and Amit Majumdar. Performance analy-
sis of computational neuroscience software neuron on knights corner many
core processors. In Workshop on Software Challenges to Exascale Com-
puting, pages 67–76. Springer, 2018.
42. Susanne Kunkel, Maximilian Schmidt, Jochen M Eppler, Hans E Plesser,
Gen Masumoto, Jun Igarashi, Shin Ishii, Tomoki Fukai, Abigail Morrison,
Markus Diesmann, et al. Spiking network simulation code for petascale
computers. Frontiers in neuroinformatics, 8:78, 2014.
43. William W Lytton and Michael L Hines. Independent variable time-step
integration of individual neurons for network simulations. Neural compu-
tation, 17(4):903–921, 2005.
44. Bruno Magalhães, Michael Hines, Thomas Sterling, and Felix Schürmann.
Asynchronous branch-parallel simulation of detailed neuron models.
Frontiers in Neuroinformatics, 2019. Under review.
45. Bruno Magalhães, Michael Hines, Thomas Sterling, and Felix Schürmann.
Exploiting flow graph of system of odes to accelerate the simulation of
biologically-detailed neural networks. In Proceedings of 2019 IEEE Inter-
national Parallel & Distributed Processing Symposium (IPDPS). IEEE,
2019.
46. Bruno Magalhães and Felix Schürmann. Fully-asynchronous cache-
efficient simulation of detailed neural networks. 2019.
47. Henry Markram, Eilif Muller, Srikanth Ramaswamy, Michael W Reimann,
Marwan Abdellah, Carlos Aguado Sanchez, Anastasia Ailamaki, Lidia
Alonso-Nanclares, Nicolas Antille, Selim Arsever, et al. Reconstruction
and simulation of neocortical microcircuitry. Cell, 163(2):456–492, 2015.
48. Michael V Mascagni, Arthur S Sherman, et al. Numerical methods for
neuronal modeling. Methods in neuronal modeling, 2, 1989.
49. John D. McCalpin. Memory bandwidth and machine balance in current
high performance computers. IEEE Computer Society Technical Commit-
tee on Computer Architecture (TCCA) Newsletter, pages 19–25, Decem-
ber 1995.
50. Michele Migliore, C Cannia, William W Lytton, Henry Markram, and
Michael L Hines. Parallel network simulations with neuron. Journal of
computational neuroscience, 21(2):119, 2006.
51. Abigail Morrison, Carsten Mehring, Theo Geisel, AD Aertsen, and
Markus Diesmann. Advancing the boundaries of high-connectivity
network simulation with distributed computing. Neural computation,
17(8):1776–1801, 2005.
52. Javier Navaridas, Mikel Luján, Luis A Plana, Jose Miguel-Alonso, and
Steve B Furber. Analytical assessment of the suitability of multicast com-
munications for the spinnaker neuromimetic system. In 2012 IEEE 14th
International Conference on High Performance Computing and Commu-
nication & 2012 IEEE 9th International Conference on Embedded Soft-
ware and Systems, pages 1–8. IEEE, 2012.
53. Max Nolte, Michael W Reimann, James G King, Henry Markram, and
Eilif B Muller. Cortical reliability amid noise and chaos. bioRxiv, page
304121, 2018.
54. Eustace Painkras, Luis A Plana, Jim Garside, Steve Temple, Francesco
Galluppi, Cameron Patterson, David R Lester, Andrew D Brown, and
Steve B Furber. Spinnaker: A 1-W 18-core system-on-chip for massively-
parallel neural network simulation. IEEE Journal of Solid-State Circuits,
48(8):1943–1953, 2013.
55. Alexander Peyser and Wolfram Schenck. The nest neuronal network sim-
ulator: Performance optimization techniques for high performance com-
puting platforms. In Society for Neuroscience Annual Meeting, number
FZJ-2015-06261. Jülich Supercomputing Center, 2015.
56. Thomas Pfeil, Andreas Grübl, Sebastian Jeltsch, Eric Müller, Paul Müller,
Mihai A Petrovici, Michael Schmuker, Daniel Brüderle, Johannes Schem-
mel, and Karlheinz Meier. Six networks on a universal neuromorphic
computing substrate. Frontiers in neuroscience, 7:11, 2013.
57. Hans E Plesser, Jochen M Eppler, Abigail Morrison, Markus Diesmann,
and Marc-Oliver Gewaltig. Efficient parallel simulation of large-scale neu-
ronal networks on clusters of multiprocessor computers. In European
Conference on Parallel Processing, pages 672–681. Springer, 2007.
58. Tobias C Potjans and Markus Diesmann. The cell-type specific cortical
microcircuit: relating structure and activity in a full-scale spiking network
model. Cerebral cortex, 24(3):785–806, 2014.
59. Christian Pozzorini, Skander Mensi, Olivier Hagens, Richard Naud,
Christof Koch, and Wulfram Gerstner. Automated high-throughput char-
acterization of single neurons by means of simplified spiking models. PLoS
computational biology, 11(6):e1004275, 2015.
60. Srikanth Ramaswamy, Jean-Denis Courcol, Marwan Abdellah, Stanis-
law R Adaszewski, Nicolas Antille, Selim Arsever, Guy Atenekeng, Ahmet
Bilgili, Yury Brukau, Athanassia Chalimourda, et al. The neocortical
microcircuit collaboration portal: a resource for rat somatosensory cortex.
Frontiers in neural circuits, 9:44, 2015.
61. Christian Rössert, Christian Pozzorini, Giuseppe Chindemi, Andrew P
Davison, Csaba Eroe, James King, Taylor H Newton, Max Nolte,
Srikanth Ramaswamy, Michael W Reimann, et al. Automated point-
neuron simplification of data-driven microcircuit models. arXiv preprint
arXiv:1604.00087, 2016.
62. Wolfram Schenck, AV Adinetz, YV Zaytsev, Dirk Pleiter, and A Morri-
son. Performance model for large-scale neural simulations with nest. In
Extended Poster Abstracts of the SC14 Conference for Supercomputing,
2014.
63. Thomas Sharp, Cameron Patterson, and Steve Furber. Distributed config-
uration of massively-parallel simulation on spinnaker neuromorphic hard-
ware. In The 2011 International Joint Conference on Neural Networks,
pages 1099–1105. IEEE, 2011.
64. Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein. Quantify-
ing performance bottlenecks of stencil computations using the Execution-
Cache-Memory model. In Proceedings of the 29th ACM International Con-
ference on Supercomputing, ICS ’15, New York, NY, USA, 2015. ACM.
65. Evangelos Stromatias, Francesco Galluppi, Cameron Patterson, and Steve
Furber. Power analysis of large-scale, real-time neural networks on spin-
naker. In The 2013 International Joint Conference on Neural Networks
(IJCNN), pages 1–8. IEEE, 2013.
66. Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of
collective communication operations in mpich. The International Journal
of High Performance Computing Applications, 19(1):49–66, 2005.
67. Jan Treibig and Georg Hager. Introducing a performance model for
bandwidth-limited loop kernels. In Parallel Processing and Applied Math-
ematics, pages 615–624. Springer, 2010.
68. Leslie G Valiant. A bridging model for parallel computation. Communi-
cations of the ACM, 33(8):103–111, 1990.
69. Sacha Jennifer van Albada, Andrew G Rowley, Johanna Senk, Michael
Hopkins, Maximilian Schmidt, Alan Barry Stokes, David R Lester,
Markus Diesmann, and Steve B Furber. Performance comparison of the
digital neuromorphic hardware spinnaker and the neural network simula-
tion software nest for a full-scale cortical microcircuit model. Frontiers
in neuroscience, 12:291, 2018.
70. Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An
insightful visual performance model for multicore architectures. Commun.
ACM, 52(4):65–76, 2009.
71. Timo Wunderlich, Akos F Kungl, Andreas Hartel, Yannik Stradmann,
Syed Ahmed Aamir, Andreas Grübl, Arthur Heimbrecht, Korbinian
Schreiber, David Stöckel, Christian Pehle, et al. Demonstrating ad-
vantages of neuromorphic computation: A pilot study. arXiv preprint
arXiv:1811.03618, 2018.
72. Esin Yavuz, James Turner, and Thomas Nowotny. Genn: a code gen-
eration framework for accelerated brain simulations. Scientific reports,
6:18854, 2016.
73. Friedemann Zenke and Wulfram Gerstner. Limits to high-speed simula-
tions of spiking neural networks using general-purpose computers. Fron-
tiers in neuroinformatics, 8:76, 2014.
