CoreNEURON : An Optimized Compute Engine for the NEURON Simulator by Kumbhar, Pramod et al.
CoreNEURON
An Optimized Compute Engine for the NEURON Simulator
Kumbhar, Pramod 1* · Hines, Michael 2* · Fouriaux, Jeremy 1 ·
Ovcharenko, Aleksandr 1 · King, James 1 · Delalondre, Fabien 1 ·
Schürmann, Felix 1
Abstract The NEURON simulator has been developed over the past three decades and is widely used by neuroscientists to model the electrical activity of neuronal networks. Large network simulation projects using NEURON have supercomputer allocations that individually measure in the millions of core hours. Supercomputer centers are transitioning to next generation architectures and the work accomplished per core hour for these simulations could be improved by an order of magnitude if NEURON were able to better utilize those new hardware capabilities. In order to adapt NEURON to evolving computer architectures, the compute engine of the NEURON simulator has been extracted and optimized as a library called CoreNEURON. This paper presents the design, implementation and optimizations of CoreNEURON. We describe how CoreNEURON can be used as a library with NEURON and then compare performance of different network models on multiple architectures including IBM BlueGene/Q, Intel Skylake, Intel MIC and NVIDIA GPU. We show how CoreNEURON can simulate existing NEURON network models with 4-7x less memory usage and 2-7x less execution time while maintaining binary result compatibility with NEURON.
Keywords NEURON, CoreNEURON, Neural Network Simulations, Supercomputing, Performance Optimization
Corresponding author:
Felix Schürmann, Campus Biotech, B1.04, Ch. des Mines 9,
CH-1202 Genève
E-mail: felix.schuermann@epfl.ch
1Blue Brain Project, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
2Yale University, USA
(*) shared first author
1 Introduction
Simulation in modern neuroscientific research has become a third pillar of the scientific method, complementing the traditional pillars of experimentation and theory. Studying models of brain components, brain tissue or even whole brains provides new ways to integrate anatomical and physiological data and allows insights into causal mechanisms crossing scales and linking structure to function. Early studies covered, for example, the levels from channels to cell behavior, accounting for detailed morphology (e.g. [1], [2]) and integrating this detail into models of networks (e.g. [3]). More recently, studies have accounted for increased electrophysiological detail and diversity in the tissue model (e.g. [4], [5]), giving a glimpse of the functional importance of the underlying connectome (e.g. [6], [7]) and allowing, for example, the reinterpretation of aggregate brain signals such as LFP (e.g. [8]). At the same time, computational studies have striven to look even deeper into the biochemical workings of the cell, studying the role of intracellular cascades in neuromodulation (e.g. [9]) or metabolism (e.g. [10]), to abstract some of the detail while maintaining cell type diversity (e.g. [11], [12], [13]), or to move the integrated and modeled data all the way to fMRI ([14]).
As the biochemical and biophysical processes of the brain span many orders of magnitude in space and time, different simulator engines have been established over time incorporating the appropriate idioms, computational representations and numerical methods (e.g. at the biochemical level: STEPS [15]; at the detailed cellular level: NEURON [16]; using simplified neuron representations: NEST [17]; or even more abstract: TVB [18], to name a few).
arXiv:1901.10975v1 [q-bio.NC] 30 Jan 2019
Table 1: Summary of Network Models
Name                     Summary                                                                   #Neurons   #Compartments     #Synapses
Traub [19]               A single column thalamocortical network model                                3,560         46,5740     1,099,820
Dentate [20]             Dentate Gyrus model including Granule cells with dendritic compartments      5,137         175,719     1,199,988
Ring [21]                Ring network of branching cells                                             32,768       9,535,488        33,280
Cortex + Plasticity [4]  Somatosensory cortex model with synaptic plasticity                        219,422      99,581,138   872,922,040
Hippocampus [22]         Rat Hippocampus CA1 model                                                  789,595     565,495,731   361,937,388
The more detail is included in these models and the larger the models become, the greater the computational requirements of these simulation engines, making it necessary to embrace advanced computational concepts and faster computers [23] [24] [25]. Table 1 lists five network models used in this paper for benchmarking and indicates their size and complexity. A single-column thalamocortical network
model [19] is used to better understand population phe-
nomena in thalamocortical neuronal ensembles. It has
3,560 multi-compartment neurons with soma, branch-
ing dendrites and a portion of axon. It consists of 14 dif-
ferent neuron types, 3,500 gap junctions and 1.1 million
connections. A scaled-down variant of the full-scale den-
tate gyrus model [20] developed in the Soltesz lab [26]
is used to understand hippocampal spatial information
processing and field potential oscillations. It consists
of 5,143 multi-compartment neurons and 4,121 Pois-
son spike sources, and includes 6 different cell types,
1.2 million connections and about 600 gap junctions. A
synthetic model with specific computational character-
istics is often needed to evaluate target hardware based
on number of cells, branching patterns, compartments
per branch etc. For this purpose, a multiple ring net-
work model of branching neurons and minimal spike
overhead is used [21]. The Blue Brain Project published a first-draft digital reconstruction of the microcircuitry of somatosensory cortex in 2015 [4]. This model
contains about 219,000 neurons, with 55 layer-specific
morphological and 207 morpho-electrical neuron sub-
types. Together with other partners in the European
Human Brain Project, this group is also working on a
full-scale model of a rat hippocampus CA1 [22]. A first
draft of this model contains about 789,000 neurons with
13 morphological types and 17 morpho-electrical types.
The number of neurons and synapses, however, is not always the best indicator of the computational complexity of a model. In the model of [4] each neuron requires on average about 20,000 differential equations to represent its electrophysiology and connectivity. To simulate the microcircuit of 31,000 neurons, it is necessary to solve over 600 million equations every 25 microseconds of biological time, a requirement far beyond the capabilities of any standard workstation. Such simulations require massively parallel systems, but fully exploiting the capabilities of these systems is a challenging task for a large number of scientific codes, including NEURON. Significant efforts are necessary to prepare scientific applications to fully exploit the massive amount of parallelism and hardware capabilities offered by these new systems [27].
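The figure of "over 600 million equations" follows directly from the two numbers quoted above. The short sketch below only redoes that arithmetic; the function and constant names are illustrative and do not come from NEURON or CoreNEURON.

```python
# Figures quoted in the text; the names are illustrative only.
NEURONS_IN_MICROCIRCUIT = 31_000
EQUATIONS_PER_NEURON = 20_000   # "about 20,000" per neuron on average
TIMESTEP_US = 25                # microseconds of biological time per step


def equations_per_step(neurons: int, equations_per_neuron: int) -> int:
    """Total number of equations solved in one integration step."""
    return neurons * equations_per_neuron


def steps_per_biological_second(timestep_us: int) -> int:
    """Number of integration steps needed for one second of biological time."""
    return 1_000_000 // timestep_us


total = equations_per_step(NEURONS_IN_MICROCIRCUIT, EQUATIONS_PER_NEURON)
steps = steps_per_biological_second(TIMESTEP_US)
print(f"{total:,} equations per step, {steps:,} steps per biological second")
```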
In this paper we present our efforts to re-engineer
the internal computational engine of the NEURON
simulator, CoreNEURON, to adapt to emerging archi-
tectures while maintaining compatibility with existing
NEURON models developed by the neuroscience com-
munity. Our work was guided by the goal to leverage
the largest available supercomputers for neuroscientific
exploration by scaling the simulator engine to run on
millions of threads. A key design goal was to reduce
the memory footprint compared to NEURON as total
memory and memory bandwidth are scarce and costly
resources when running at scale. Lastly, for this capabil-
ity to be easily usable by the normal NEURON commu-
nity, we endeavored to tightly integrate CoreNEURON
with NEURON.
2 NEURON Simulation Environment
NEURON is a simulation environment developed over the last 35 years for modeling networks of neurons with complex branched anatomy and biophysical membrane properties. This includes extracellular potential near membranes, multiple channel types, inhomogeneous channel distribution and ionic accumulation. It can handle diffusion-reaction models and integrate diffusion functions into models of synapses and cellular networks. Morphologically detailed models simulated using NEURON are able to represent the spatial diversity of electrical and biophysical properties of neurons.
Individual neurons are treated as a tree of un-
branched cables called sections. Each section can have
its own set of biophysical parameters, independently
from other sections, and is discretized as a set of adja-
cent compartments (see e.g. [28]). Compartmental mod-
els of neurons take into account not only the connectiv-
ity between neurons but also the individual morpholo-
gies and inhomogeneities of each neuron. The electrical
activity of neurons is modeled using the cable equation
(see e.g. [29]) applied to each section, where the quan-
tity representing the state of a neuron at a given point
in space and instant in time is the membrane potential.
The general form of the cable equation for a section, in
the case of constant parameters and conductance based
synapse modeling, is given by:
\[
\frac{d}{4 R_a} \frac{\partial^2 v}{\partial x^2} = c_m \frac{\partial v}{\partial t} + I_{pas} + I_{ion} + I_{syn} \tag{1}
\]
where
• $d$ [µm], $R_a$ [Ω cm], $c_m$ [µF/cm²], $I_{pas}$ [mA/cm²] are biophysical parameters contributing to the passive component of the cable equation (unit conversion factors are not shown but each term has the units of mA/cm²).
• $I_{ion}$ [mA/cm²] is the active contribution arising from ion channels along the section, whose conductances $g_i$ and resting potentials $e_i$ might depend in a nonlinear fashion upon a set of state variables representing those channels.
• $I_{syn}$ [mA/cm²] is the contribution from the synapses placed at positions $x_j$, whose conductances $g_j$ and resting potentials $e_j$ might depend in a nonlinear fashion upon a set of state variables and which take effect in a strongly localized manner. Individual synapses have units of nA and conversion to mA/cm² involves a Dirac delta function, $\delta(x - x_j)$, with units 1/µm, and the diameter; i.e. conversion of absolute current to current per unit area implies division by the compartment area where the synapse is located.
One needs to couple (1) to a set of additional differential equations that describe the evolution of the states of ion channels and synapses, thus giving rise to a system of PDEs/ODEs as the final problem. Spatial discretization of the PDEs results in a tree-topology set of stiff coupled equations, which is most effectively solved by implicit integration methods. In particular, direct Gaussian elimination with minimum degree ordering is computationally optimal [30]. The general structure of the hybrid clock-event driven algorithm [28] in NEURON can be divided into a set of operations that are performed at every integration time step and an interprocess spike exchange operation in which a list of spike generation times and identifiers is synchronized across all processors every minimum spike delay interval. The operations performed at every integration step are:
• Event-driven spike delivery step where the callback
function of each synapse activated by a spike at a
given timestep is executed.
• Matrix assembly step where the Iion and Isyn con-
tributions are computed and included in the matrix.
• Matrix resolution step where the membrane poten-
tial for the current step is obtained by solving a
linear system.
• State variables update step where the evolution
equations for the states of ion channels and synapses
are solved to advance to the current timestep.
• Threshold detection step where each neuron is
scanned to see if it has met a particular firing condi-
tion, and if so a particular list of events is updated.
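To make the matrix resolution step concrete, the following is a minimal sketch of direct Gaussian elimination on a tree-structured system. It assumes nodes are numbered so that every parent has a smaller index than its children; the function name and the array names (`parent`, `d`, `a`, `b`, `rhs`) are illustrative and are not taken from the NEURON sources.

```python
from typing import List, Sequence


def solve_tree_system(
    parent: Sequence[int],
    d: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    rhs: Sequence[float],
) -> List[float]:
    """Solve a tree-structured linear system by direct Gaussian elimination.

    Row i of the system reads
        d[i]*x[i] + b[i]*x[parent[i]] + sum(a[c]*x[c] for children c of i) = rhs[i]
    Node 0 is the root (parent[0], a[0] and b[0] are unused) and the ordering
    must satisfy parent[i] < i (assumption of this sketch).
    """
    n = len(d)
    d = list(d)      # work on copies so the caller's arrays are untouched
    rhs = list(rhs)

    # Triangularization: eliminate each node from its parent's row, leaves first.
    for i in range(n - 1, 0, -1):
        p = parent[i]
        factor = a[i] / d[i]
        d[p] -= factor * b[i]
        rhs[p] -= factor * rhs[i]

    # Back-substitution: root first, then every node from its parent's value.
    x = [0.0] * n
    x[0] = rhs[0] / d[0]
    for i in range(1, n):
        x[i] = (rhs[i] - b[i] * x[parent[i]]) / d[i]
    return x
```

Because each node touches only its parent, both sweeps visit every node once, so the cost grows linearly with the number of compartments.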
Although the simulator has demonstrated scaling up
to 64,000 cores on the IBM Blue Gene/P system [25],
with the emerging computing architectures (like GPUs,
many-core architectures) the key challenges are numer-
ical efficiency and scalability. The simulator needs to :
1) expose fine grain parallelism to utilize the massive
number of hardware cores, 2) be optimized for mem-
ory hierarchies and 3) fully utilize processor capabili-
ties such as vector units. To simulate models with bil-
lions of neurons on a given computing resource, mem-
ory capacity is another major challenge. In order to
address these challenges, the compute algorithm of the
NEURON simulator was extracted and optimized into
a standalone library called CoreNEURON.
3 CoreNEURON Design and Implementation
The integration interval operations (listed in Section
2) consume most of the simulation time [31]. The goal
of CoreNEURON is to efficiently implement these op-
erations considering different hardware architectures.
This section describes the integration of CoreNEURON
with the NEURON execution workflows, major data
structure changes to reduce memory footprint, mem-
ory transfer between NEURON-CoreNEURON and a
checkpoint-restore implementation to facilitate long
running simulations.
[Figure 1 diagram. Panels: A] NEURON Mode (Model Description and HOC/Python inputs, Model Setup, NEURON Simulation, Result); B] CoreNEURON Online Mode (Model Setup, Memory Setup, CoreNEURON Simulation, Result); C] CoreNEURON Offline Mode (Model Building Loop with Model Setup of partition i, Model on Disk, Memory Setup, CoreNEURON Simulation, Result). Legend: Input, Intermediate Data, Output, NEURON, CoreNEURON.]
Fig. 1: Different execution workflows supported by the NEURON simulator with CoreNEURON: A] shows the existing simulation workflow used by all NEURON users; B] shows the new CoreNEURON based workflow where the in-memory model is transferred using direct memory access and then simulated by CoreNEURON; C] shows the new CoreNEURON based workflow where NEURON partitions a large network model into smaller chunks, builds the in-memory model iteratively and then CoreNEURON loads the whole model for simulation
3.1 NEURON to CoreNEURON Workflow
One of the key design goals of CoreNEURON is to be compatible with existing NEURON models and user workflows. With the integration of the CoreNEURON library, the NEURON simulator supports the three different workflows depicted in Figure 1.
• NEURON mode
• CoreNEURON Online mode
• CoreNEURON Offline mode
Existing users are familiar with the default
NEURON mode. The model descriptions written in
NMODL [32] are used to build a dynamically loadable
shared library. The HOC/Python scripting interface is
used to build a network model in memory (Model Setup
phase). This in-memory model is then simulated us-
ing the hybrid clock-event driven algorithm described
in Section 2 (Simulation phase). Users have full control
over model structure and can introspect or record all
events, states, and model parameters using the script-
ing or graphical user interface (Result phase).
CoreNEURON Online Mode allows users to run
their models efficiently with minimal changes. After
the Model Setup phase, the in-memory representation is
copied into CoreNEURON’s memory space. CoreNEU-
RON then re-organizes the memory during Memory
Setup phase for efficient execution (see Section 4.2).
The Simulation phase is executed in CoreNEURON and
spike results are written to disk. Note that the same NMODL model descriptions are used in both NEURON and CoreNEURON.
CoreNEURON Offline mode is intended for large
network models that cannot be simulated with NEU-
RON due to memory capacity constraints. In this mode,
instead of loading the entire model at once, the Model
Setup phase builds a subset of the model that fits into
available memory. That subset is written to disk, the
memory used by the subset is freed, and the Model
Setup phase constructs another subset. After all sub-
sets are written by NEURON, CoreNEURON reads the
entire model from the disk and begins the Simulation
phase. Because CoreNEURON’s cell and network con-
nection representations are much lighter weight than
NEURON’s, 4-7x larger models than NEURON can be
simulated with CoreNEURON (see Section 5).
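The Offline mode loop described above can be pictured with the following sketch. It is a schematic stand-in, not the NEURON or CoreNEURON file format: `build_subset`, the JSON files and all other names are hypothetical, and a real model subset would hold cells, mechanisms and connections rather than a small dictionary.

```python
import json
import os
import tempfile


def build_subset(gids):
    """Hypothetical stand-in for the Model Setup phase on one subset of cells."""
    return {"cells": [{"gid": gid, "compartments": 3} for gid in gids]}


def write_model_in_pieces(all_gids, max_cells_in_memory, out_dir):
    """Build one subset at a time, write it to disk, then release its memory."""
    paths = []
    for start in range(0, len(all_gids), max_cells_in_memory):
        subset = build_subset(all_gids[start:start + max_cells_in_memory])
        path = os.path.join(out_dir, f"subset_{len(paths)}.json")
        with open(path, "w") as handle:
            json.dump(subset, handle)
        paths.append(path)
        del subset  # only one subset is ever held in memory
    return paths


def load_whole_model(paths):
    """Stand-in for the simulator reading every subset before simulating."""
    cells = []
    for path in paths:
        with open(path) as handle:
            cells.extend(json.load(handle)["cells"])
    return cells


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        files = write_model_in_pieces(list(range(10)), 4, out_dir)
        print(len(files), "files,", len(load_whole_model(files)), "cells")
```

The point of the design is that the builder's peak memory is bounded by the subset size, while the reader only has to hold the lighter-weight representation of the whole model.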
Users can adapt existing models to the CoreNEURON Online Mode workflow by replacing the psolve function call with the nrncore_run function of the ParallelContext class [33]. Presently, however, event callbacks from CoreNEURON to NEURON interpreter code are not implemented (see Section 6).
3.2 Data Structure Changes
NEURON is used as a general framework for designing and experimenting with neural models of varying anatomical detail and membrane complexity. Users can interactively create cells with branches of varying diameters and lengths, insert ionic channels, create synapses, and record and visualize different properties using a GUI. To provide this flexibility and introspection capability, a large number of complex data structures are created. But, once the users are satisfied with the behavior of the model, they run larger or longer simulations on workstations or clusters where interactivity or detailed introspection capabilities are often not required. In this type of batch execution, memory overhead from many large, complex data structures with many mutual pointers can be significantly reduced by replacing them with fixed arrays of data structures in which the few necessary pointers are replaced by integers. For example, the network connection object (Netcon) and the common synapse base class (Point_process), which are responsible for a significant portion of memory usage in NEURON, were reduced from 56 to 40 bytes and from 56 to 8 bytes respectively in CoreNEURON. Table 2 lists the important data structures and compares their memory usage in NEURON and CoreNEURON. CoreNEURON eliminates the Python/HOC interpreter, so data structures like Node, Section and Object are no longer needed. The memory usage improvements from these optimizations for different network models are discussed in Section 5.

Table 2: Memory footprint comparison for different data structures (in bytes)
Data Structure      Purpose                               NEURON   CoreNEURON
Node                Compartment of the neuron                128            -
Section             Unbranched cable of the neuron            96            -
Object              High level HOC object                     64            -
Presyn              Synapse object at origin                 208           64
InputPresyn         Similar to Presyn                          -           24
Point_process       Synapse overhead                          56            8
Prop                Property object in compartment            48            -
Netcon              Connection between neuron                 56           40
Pointer             Memory address                             8            4
Memb_list           List of mechanisms or channels            56           64
NrnThreadMembList   Mechanism list for group of neurons       34           40
PreSynHelper        Helper object for PreSyn                   -            4
Symbol              Token parsed by HOC interpreter           56            -
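As a rough illustration of what the per-object reductions mean in aggregate, the sketch below multiplies the byte sizes quoted for the network connection object and the synapse base class by an object count. The count of one million of each is purely hypothetical, and the calculation ignores every other data structure, so it is not a prediction of the total footprint of any model.

```python
# Per-object sizes in bytes as quoted in the text and in Table 2.
BYTES_PER_OBJECT = {
    "connection": {"NEURON": 56, "CoreNEURON": 40},
    "synapse_base": {"NEURON": 56, "CoreNEURON": 8},
}


def footprint(simulator: str, counts: dict) -> int:
    """Bytes used by the listed object kinds for the given object counts."""
    return sum(BYTES_PER_OBJECT[kind][simulator] * n for kind, n in counts.items())


# Hypothetical model: one million connections, each with one synapse object.
counts = {"connection": 1_000_000, "synapse_base": 1_000_000}
neuron_bytes = footprint("NEURON", counts)
core_bytes = footprint("CoreNEURON", counts)
print(f"NEURON: {neuron_bytes:,} B, CoreNEURON: {core_bytes:,} B, "
      f"ratio {neuron_bytes / core_bytes:.2f}x")
```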
3.3 Pointer Semantics
NEURON users can define their own data structures and allocate memory through the use of the POINTER and VERBATIM constructs of NMODL [34]. Many internal data structures of NEURON use pointer variables to manage various dynamic properties, connections, event queues etc. As a model is built incrementally using the scripting interface, various memory pools are allocated during the Model Setup phase. Because the data structures of NEURON and CoreNEURON differ, serializing memory pools becomes one of the major memory management challenges of the CoreNEURON implementation. With serialization, pointer variables need to be augmented with meta information that indicates the pointer semantics and allows proper decoding by CoreNEURON. All data variables that are potential pointer targets are grouped into a contiguous memory pool and pointer variables are converted to an integer offset into the memory pool. When the NEURON pointers are copied to CoreNEURON’s memory space, the semantic type associated with the pointer variable is used to compute the corresponding integer offset. Table 3 enumerates the different semantic types introduced to facilitate memory serialization. For example, area represents compartment area (8 byte double) but pntproc represents the larger Point_process object and hence needs a different decoding mechanism.
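The idea of replacing a pointer by a semantic type plus an integer offset can be sketched as follows. This is a deliberately simplified model in which every semantic type owns one region of a single contiguous pool; the `Pool` class and both functions are hypothetical, and the actual CoreNEURON decoding differs per semantic type.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Pool:
    """Contiguous pool with one region per semantic type, laid out back to back."""
    regions: Dict[str, Tuple[int, int]]  # semantic name -> (start offset, element count)


def build_pool(sizes: Dict[str, int]) -> Pool:
    """Assign each semantic type a region in the order the types are given."""
    regions, start = {}, 0
    for name, count in sizes.items():
        regions[name] = (start, count)
        start += count
    return Pool(regions)


def to_offset(pool: Pool, semantic: str, element_index: int) -> int:
    """Convert 'element k of the region for this semantic type' to a pool offset."""
    start, count = pool.regions[semantic]
    if not 0 <= element_index < count:
        raise IndexError(f"{semantic}[{element_index}] is outside its region")
    return start + element_index


# Four compartments: their areas occupy offsets 0..3, their diameters 4..7.
pool = build_pool({"area": 4, "diam": 4})
print(to_offset(pool, "area", 2), to_offset(pool, "diam", 1))
```

An integer offset stays valid when the pool is copied or written to disk, which a raw memory address does not.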
3.4 Checkpoint-Restart Support
The network simulations for studying synaptic plastic-
ity can run from a week to a month. Enabling such sim-
ulations of long biological time-scales is one of the im-
portant use cases for CoreNEURON. Most of the cluster
and supercomputing resources have a maximum wall
clock time limit for a single job (e.g. up to 24 hours).
[Figure 2 diagram: Initial Model Setup, Model on Disk, Simulation Execution with Stim File 1 and RNG Seed 1, SaveState, Checkpoint data, then RestoreState and Simulation Execution with Stim File 2 / RNG Seed 2 and Stim File 3 / RNG Seed 3 (explore from consistent state with other stimulus or seeds). Legend: User Input, Intermediate Data, NEURON, CoreNEURON.]
Fig. 2: Simulation workflow with the checkpoint-restart feature with the flexibility to re-launch simulations with
different stimuli or random number streams
Table 3: Semantic types and their purpose
Semantic name   Purpose
area            area of the compartment
iontype         type of ion used (ca, na, k)
cvodeieq        element on event queue
netsend         network send event
pointer         pointer used in NEURON
pntproc         point process
corepointer     pointer used in CoreNEURON
watch           element used as watch statement
diam            diameter of the compartment
Checkpoint-restart [35] is a commonly used technique to enable long running simulations and has been implemented in CoreNEURON. Since the checkpoint operation could take place at any time with varying degrees of cell firing activity, it was necessary to account for generated yet undelivered synaptic events in addition to saving the in-memory state of the simulator. When a cell fires, it may have many connections to other cells with different delivery delays. During the checkpoint operation, any undelivered messages are collapsed back into the original event of the firing cell so that a single event can be saved. Once the network simulation is checkpointed, users have the flexibility to launch multiple simulations with different stimuli or random number streams in order to explore network stability and robustness. The execution workflow of such simulations is shown in Figure 2.
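The collapsing of undelivered messages can be illustrated with a small sketch. It assumes each pending delivery remembers the source cell and the time it fired, so that only `(gid, spike time)` pairs need to be saved and the per-connection deliveries can be regenerated on restore. The event tuple layout and function names are assumptions for illustration, not CoreNEURON's checkpoint format.

```python
def expand(spikes, connections):
    """Turn (source gid, spike time) pairs into per-connection delivery events.

    Each event is (delivery time, target, source gid, spike time).
    """
    return sorted(
        (t_spike + delay, target, gid, t_spike)
        for gid, t_spike in spikes
        for target, delay in connections[gid]
    )


def collapse(pending):
    """Collapse undelivered events back into the firing events they came from."""
    return sorted({(gid, t_spike) for _, _, gid, t_spike in pending})


def restore(saved_spikes, connections, t_checkpoint):
    """Regenerate deliveries from saved spikes, dropping those already delivered."""
    return [ev for ev in expand(saved_spikes, connections) if ev[0] > t_checkpoint]


# Cell 1 projects to targets 10 and 11 with different delays; cell 2 to target 10.
connections = {1: [(10, 1.0), (11, 5.0)], 2: [(10, 2.0)]}
spikes = [(1, 0.5), (1, 3.5), (2, 3.0)]
t_checkpoint = 4.0

pending = [ev for ev in expand(spikes, connections) if ev[0] > t_checkpoint]
saved = collapse(pending)
print(len(pending), "pending deliveries saved as", len(saved), "spike events")
```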
3.5 Portability Considerations
CoreNEURON can transparently handle all spiking network simulations including gap junction coupling with the fixed time step method. The model descriptions written in NMODL need to be thread safe [36] to exploit vector units of modern CPUs and GPUs. This can be achieved with the help of NEURON’s mkthreadsafe tool. New keywords like COREPOINTER and CONDUCTANCE have been added to NMODL to facilitate serialization and performance optimization respectively. These keywords are also backported to NEURON so that the models remain compatible with either NEURON or CoreNEURON execution. For scalability and portability of random numbers on platforms like GPUs, CoreNEURON supports the Random123 pseudo-random generator [37].
4 Optimizations
In order to improve the performance of CoreNEU-
RON on different architectures, different optimization
schemes are implemented for multi-threading, memory
layout, vectorization and code generation. These opti-
mizations are described in this section.
4.1 Parallelism
Both NEURON and CoreNEURON use the Mes-
sage Passing Interface (MPI) to implement distributed
memory parallelism. Although NEURON supports multi-threading based on Pthreads [38], users commonly use pure MPI execution because it scales better. However, pure MPI execution limits scalability due to
[Figure 3 diagram. A] Dendritic structure of neuron cell with mechanisms (Im, Na, K) inserted into compartments. B] Mechanisms grouped by type: Im Im Im Im Na Na Na K K K K. C] Properties (m, h, ena, ek) of mechanism instances in AoS layout, where property access (e.g. m) results in strided memory access: m h ena ek m h ena ek m h ena ek. D] Properties of mechanism instances in SoA layout, where property access (e.g. m) is contiguous in memory: m m m m h h h h ena ena ena ena.]
Fig. 3: A schematic representation of the dendritic structure of a neuron with different mechanisms inserted into the compartments is shown on the left. On the right: B] shows how NEURON and CoreNEURON group the mechanism instances of the same type; C] shows how NEURON stores properties of individual mechanisms in the AoS layout; D] shows the new SoA layout in CoreNEURON for storing mechanism properties
MPI communication and the memory overhead of internal MPI buffers when executing at scale [39]. To address this scalability and parallelism challenge, CoreNEURON relies on three distinct levels of parallelism. First, at the highest level, a set of neurons that have equivalent computational cost are grouped together and assigned to each MPI rank on the compute node. Second, within a node, an individual neuron group is assigned to an OpenMP [40] thread executing on a core. This thread simulates the given neuron group for the entire simulation, ensuring data locality. Finally, vector units of the core are utilized for executing groups of channels in parallel. This allows simulations with a single MPI process per compute node to meet the scalability challenge.
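The text does not say how neurons are grouped into sets of equivalent computational cost. One common way to do this, shown here purely as an assumption, is a greedy "heaviest first, to the currently lightest group" assignment. The cost values stand for any per-neuron estimate (for example, a compartment count), and the function name is hypothetical.

```python
import heapq
from typing import Dict, List


def group_by_cost(costs: Dict[int, float], n_groups: int) -> List[List[int]]:
    """Greedy balancing: place each neuron, heaviest first, in the lightest group."""
    heap = [(0.0, g) for g in range(n_groups)]  # (accumulated cost, group id)
    heapq.heapify(heap)
    groups: List[List[int]] = [[] for _ in range(n_groups)]
    # Sort by descending cost; ties are broken by gid so the result is deterministic.
    for gid, cost in sorted(costs.items(), key=lambda item: (-item[1], item[0])):
        total, g = heapq.heappop(heap)
        groups[g].append(gid)
        heapq.heappush(heap, (total + cost, g))
    return groups


# Five neurons with hypothetical costs, split into two groups (e.g. two threads).
costs = {0: 5, 1: 4, 2: 3, 3: 3, 4: 1}
groups = group_by_cost(costs, 2)
print(groups, [sum(costs[gid] for gid in group) for group in groups])
```

Each resulting group would then stay on one thread for the whole run, which is what preserves data locality in the scheme described above.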
4.2 Memory Layout and Vectorization
Processor memory bandwidth is a scarce resource and often the major impediment to improving the performance of many applications including NEURON. The compute kernels of channels and synapses are bandwidth limited and can reach close-to-peak memory bandwidth [41]. The dendritic structures of a neuron are divided into small compartments and different membrane channels or mechanisms are inserted into different compartments (Figure 3A). For memory locality, both NEURON and CoreNEURON group the channels by their type as shown in Figure 3B. But NEURON organizes properties of individual mechanisms (like m, h, ena) in the Array of Structs (AoS) memory layout (Figure 3C). When a specific property is accessed, for example m, it results in strided memory accesses with inefficient memory bandwidth utilization and hence poor performance. To address this issue, CoreNEURON organizes channel properties into the Structure of Arrays (SoA) memory layout (Figure 3D). This allows efficient compiler vectorization and efficient memory bandwidth utilization for all channel and synapse computations. The performance improvements from this optimization are discussed in [41].
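The difference between the two layouts comes down to how a (mechanism instance, property) pair maps to a position in one flat array. The sketch below uses the four properties named in Figure 3; the function names are illustrative and the flat-array model is a simplification of the real data structures.

```python
PROPS = ("m", "h", "ena", "ek")  # properties of one mechanism instance (Figure 3)


def aos_index(instance: int, prop: str, n_instances: int) -> int:
    """Array of Structs: all properties of one instance are stored together."""
    return instance * len(PROPS) + PROPS.index(prop)


def soa_index(instance: int, prop: str, n_instances: int) -> int:
    """Structure of Arrays: one property of all instances is stored together."""
    return PROPS.index(prop) * n_instances + instance


def access_pattern(index_fn, prop: str, n_instances: int):
    """Flat-array positions touched when reading `prop` for every instance."""
    return [index_fn(i, prop, n_instances) for i in range(n_instances)]


print("AoS m:", access_pattern(aos_index, "m", 3))  # stride of len(PROPS)
print("SoA m:", access_pattern(soa_index, "m", 3))  # consecutive positions
```

With SoA, a loop over all instances of one property walks consecutive memory, which is the access pattern that vector units and GPU threads handle most efficiently.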
4.3 NMODL Source-to-Source Translator
NEURON has had support for code generation through the model description language, NMODL, since version 2 released in 1989 [42]. The code generation program of NEURON has been modified into a standalone tool called MOD2C [43]. This tool is used by CoreNEURON to support all NEURON models written in NMODL. Figure 4 shows the high level workflow of MOD2C. The first step of the source-to-source translator is lexical analysis, where lexical patterns in the NMODL code are detected and tokens are generated. The syntax analysis step uses those tokens and determines whether the series of tokens is valid in the language. The semantic analysis step checks that syntactically valid sentences are meaningful as part of the model description. Code generation is the step in which a C++ file is created with compiler hints for auto-vectorization and GPU parallelization with the OpenACC programming model [44]. MOD2C also takes care of code generation for AoS and SoA memory layouts. MOD2C uses the open source flex and bison tools [45] for this implementation. More information about the NMODL code generation pipeline can be found in [42].
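As a small illustration of the lexical analysis step only, the following sketch splits one NMODL-style assignment into tokens using regular expressions. It is a stand-in written for this explanation: MOD2C itself is built on flex and bison, and the token categories and names below are assumptions, not MOD2C's.

```python
import re

# Token categories are illustrative; MISMATCH catches any unexpected character.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d*)?"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/^=(){}]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str):
    """Return a list of (kind, text) tokens for a fragment of model code."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue  # whitespace carries no meaning here
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        tokens.append((kind, match.group()))
    return tokens


print(tokenize("ina = gna*(v - ena)"))
```

A syntax analyzer would then consume this token stream and check it against the grammar of the language.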
[Figure 4 diagram: hh.mod (NMODL) is passed through mod2c (Lexical Analyzer, Syntax Analyzer, Semantic Analyzer, Code Printer) to produce hh.cpp (C++ with OpenACC).]

NMODL input (hh.mod):
    NEURON {
            SUFFIX hh
            ….
    }
    BREAKPOINT {
            SOLVE states METHOD cnexp
            gna = gnabar*m*m*m*h
            ina = gna*(v - ena)
            …
    }

Generated C++ (hh.cpp):
    namespace coreneuron {
    void nrn_cur(….) {
        ….
        #pragma ivdep
        #pragma acc parallel loop
        for( i = 0; i < node_count; i++) {
            int id = node_index[i];
            v = voltage[id];
            gna = gnabar*m*m*m*h ;
            ina = gna * (v-ena ) ;
            ….
        }
    }
    }
Fig. 4: Code generation workflow for CoreNEURON: the phases of the source-to-source compiler shown in the middle translate the model description file (hh.mod) to C++ code (hh.cpp) and insert compiler hints for CPU/GPU parallelization
4.4 GPU Porting
Prior to the CoreNEURON project, a substantial effort
was made to port NEURON to the GPU architecture
using the CUDA programming model [46]. One of the
two major components of this implementation was the
extension of the NMODL source-to-source compiler to
emit CUDA code. The other major component man-
aged an internal memory transformation from NEU-
RON’s thread efficient AoS memory layout to a more
GPU memory efficient SoA layout. This experimental
NEURON version [47] was quite efficient for matrix
setup and channel state integration for cellular simula-
tions but did not reach network simulation capability.
The project foundered on software administration dif-
ficulties of maintaining two completely separate code-
bases, the difficulty of understanding the data struc-
ture changes involved for memory layout transforma-
tion from AoS to SoA, and the difficulty of manag-
ing pointer updates in the absence of pointer seman-
tics information. It became clear that a more general
view was required that could not only alleviate these
problems for the GPU but had a chance of evolving
to work on future architectures. This view is embodied
in CoreNEURON development. As discussed in Section
4.2, CoreNEURON data structures and memory lay-
out have been optimized for efficient memory access.
MOD2C supports code generation with the OpenACC
programming model that helps to target different accel-
erator platforms. Users need to compile the CoreNEU-
RON library with a compiler that supports OpenACC
(e.g. PGI, Cray).
One of the performance challenges for a GPU implementation
is irregular memory access due to the non-homogeneous
tree structure of neurons. For example, Figure 5A shows
three different morphological types, and Figure 5B shows
their compartmental tree connection topology as represented
in the simulator. The GPU delivers better performance when
consecutive threads (in groups of 16 or 32) perform the same
computations and load data from consecutive memory addresses.
When there are a large number of cells per morphological type,
it is straightforward to achieve optimal performance by
interleaving the compartments of identical cells. With few
cells per morphological type, however, Gaussian elimination
suffers from the non-contiguous layout of parents relative to
a group of nodes. This results in irregular, strided memory
accesses and hence poor performance [48]. To address this,
two alternative node ordering schemes, Interleaved layout and
Constant Depth layout, are illustrated in Figure 5D and
Figure 5E for the example topologies of Figure 5C. All
cells have the same number of compartments but each
has a different branching pattern. Nodes (representing
compartments) within a cell are numbered with successive
integers. In the case of Interleaved layout, a compartment
from each of N cells forms an adjacent group of N
compartments. The groups are in any root-to-leaf order but
corresponding compartments in identical cells are adjacent.
As an example, for a group of three threads the vertical
square braces highlight parent indices that have the same
order as the nodes. This results in either contiguous memory
loads (CL) or strided memory loads (SL). For each Gaussian
elimination operation the number of threads that can compute
in parallel is equal to the number of cells, and hence this
scheme is referred to as the one-cell-per-thread layout. For
Constant Depth layout, all nodes at the same depth from the
root are adjacent. For a given depth, corresponding nodes of
identical cells are adjacent. Children of branch nodes in
the same cell are kept as far apart as possible to minimize
contention while updating the same node from
different threads.
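The Interleaved layout can be made concrete with a minimal Python sketch. It is not CoreNEURON code: the function names and the three four-node topologies (star, fork, chain) are assumptions chosen for illustration. The sketch places node k of cell c at global index k * N + c, derives the parent index array, and labels each group of N parent loads as contiguous (CL) or strided (SL).

```python
def interleave(parents_per_cell):
    """Return global parent indices for an interleaved node ordering.

    parents_per_cell[c][k] is the local parent of node k in cell c
    (-1 for the root). Node k of cell c is stored at k * n_cells + c.
    """
    n_cells = len(parents_per_cell)
    n_nodes = len(parents_per_cell[0])
    if any(len(p) != n_nodes for p in parents_per_cell):
        raise ValueError("all cells must have the same number of nodes")
    par = []
    for k in range(n_nodes):        # local node index, root first
        for c in range(n_cells):    # one cell per thread
            p = parents_per_cell[c][k]
            par.append(-1 if p < 0 else p * n_cells + c)
    return par


def classify_loads(par, n_threads):
    """Label each group of n_threads non-root nodes as CL or SL."""
    labels = []
    for start in range(0, len(par), n_threads):
        group = par[start:start + n_threads]
        if any(p < 0 for p in group):
            continue  # roots have no parent to load
        contiguous = all(b - a == 1 for a, b in zip(group, group[1:]))
        labels.append("CL" if contiguous else "SL")
    return labels


# Hypothetical 4-node topologies, given as local parent arrays.
star = [-1, 0, 0, 0]
fork = [-1, 0, 1, 1]
chain = [-1, 0, 1, 2]

mixed = interleave([star, fork, chain])
identical = interleave([chain, chain, chain])
mixed_labels = classify_loads(mixed, n_threads=3)
identical_labels = classify_loads(identical, n_threads=3)
print(mixed, mixed_labels)
print(identical, identical_labels)
```

With three identical cells every group of parent indices is contiguous, whereas with three distinct branching patterns only the first group is. This is the effect the alternative orderings try to reduce.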
To analyse the impact of node ordering schemes on
the execution time, we used a multiple Ring network
model of cells with random tree topology [21]. This
test allows us to evaluate the performance impact when the
parents of a contiguous group of 32 nodes are not contiguous
and are executed by a warp of 32 threads. We used a
[Figure 5 panels: A] Morphological types and their dendritic structure; B] Dendrograms showing in-memory tree representation; C] Different cell topologies; D] Interleaved; E] Constant Depth; F] Execution comparison of D] and E]]
Fig. 5: The top row shows three different morphological types with their dendritic tree structure in A] and
dendrograms showing the in-memory tree representation of these types in CoreNEURON in B]. The bottom row
shows different node ordering schemes to improve the memory access locality on GPUs: C] Example topologies
of three cells with the same number of compartments; D] Interleaved Layout, where a compartment from each of
N cells forms an adjacent group of N compartments. For the ith node, ni is the node index and par[i] is its parent index.
With three executor threads, square braces highlight parent indices that result in a contiguous memory load (CL)
or a strided memory load (SL); E] Constant Depth Layout, where all nodes at the same depth from the root are adjacent;
F] Comparison of the two node ordering schemes for the Ring network model, showing the execution time of the whole
simulation and of the Gaussian elimination step.
multiple Ring model with a total of 131,072 cells comprising
10,878,976 nodes running for 10 ms on an NVIDIA
K20X GPU [49]. Every cell has the same number (83)
of nodes but different cell types have a different random
branching pattern of the 40 dendrites. The number of
identical cells per type ranges from 1 (131,072 distinct
branching patterns) to 32 (4096 distinct branching pat-
terns). Note that regardless of the branching pattern,
Gaussian elimination takes exactly the same number of
arithmetic operations. Figure 5F shows the performance of
Interleaved Layout and Constant Depth Layout. For
both node ordering schemes, performance is optimal
with regard to parent ordering when there are at least
32 cells of each type corresponding to the 32 threads
operating in Single Instruction Multiple Data (SIMD)
mode. With fewer cells per type, parent node ordering
becomes less than optimal and the performance of In-
terleaved layout suffers by up to a factor of two. Note
that the total runtime deteriorates more rapidly than
Gaussian elimination time due to the fact that the par-
ent contiguity also affects the performance of tree ma-
trix setup during evaluation of a node’s current balance
equation. The execution time of Constant Depth layout
shows that it is possible to permute node ordering so
that parent nodes are more likely to be in significant
contiguous order relative to their children. The con-
stant ratio between total runtime and Gaussian elimi-
nation is due to negligible time contribution of passive
dendrites to matrix setup in combination with the sig-
nificant role of parent ordering in computing the effect
of topologically adjacent nodes on matrix setup of the
current balance equations.
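The Gaussian elimination discussed above operates on a tree-structured matrix indexed through a parent array. The following Python sketch shows one common formulation of such a solver. It is an illustration under stated assumptions, not the CoreNEURON implementation: the names `hines_solve` and `max_residual`, the coefficient arrays `d`, `a`, `b` and the test values are all hypothetical. Both loops run exactly n - 1 times whatever the branching pattern, which is consistent with the remark that the arithmetic operation count does not depend on the topology.

```python
def hines_solve(parent, d, a, b, rhs):
    """Solve a tree-structured linear system in O(n).

    parent[i] < i is the parent of node i (parent[0] == -1 for the root).
    Row i reads:
      b[i]*x[parent[i]] + d[i]*x[i] + sum(a[c]*x[c] for children c) = rhs[i]
    """
    n = len(parent)
    d = list(d)      # work on copies so the inputs stay untouched
    x = list(rhs)
    # Triangularization: eliminate each node from its parent's row,
    # visiting nodes from the leaves towards the root.
    for i in range(n - 1, 0, -1):
        p = parent[i]
        f = a[i] / d[i]
        d[p] -= f * b[i]
        x[p] -= f * x[i]
    # Back-substitution: root first, then every node after its parent.
    x[0] /= d[0]
    for i in range(1, n):
        x[i] = (x[i] - b[i] * x[parent[i]]) / d[i]
    return x


def max_residual(parent, d, a, b, rhs, x):
    """Largest absolute error of the original equations for solution x."""
    n = len(parent)
    worst = 0.0
    for i in range(n):
        acc = d[i] * x[i]
        if parent[i] >= 0:
            acc += b[i] * x[parent[i]]
        acc += sum(a[c] * x[c] for c in range(n) if parent[c] == i)
        worst = max(worst, abs(acc - rhs[i]))
    return worst


# Two hypothetical 4-node cells with identical coefficients.
d = [4.0, 4.0, 4.0, 4.0]
a = [0.0, -1.0, -1.0, -1.0]
b = [0.0, -1.0, -1.0, -1.0]
rhs = [1.0, 1.0, 1.0, 1.0]
star = [-1, 0, 0, 0]
chain = [-1, 0, 1, 2]

res_star = max_residual(star, d, a, b, rhs, hines_solve(star, d, a, b, rhs))
res_chain = max_residual(chain, d, a, b, rhs, hines_solve(chain, d, a, b, rhs))
print(res_star, res_chain)
```

The reads of `x[parent[i]]` and the updates of `d[p]` and `x[p]` are the parent accesses whose memory locality the node ordering schemes aim to improve.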
5 Benchmarks and Performance
Not all network models are compute intensive or benefit
equally from CoreNEURON optimizations. In order to
evaluate the performance improvements from the optimizations
discussed in the previous section, we ran several
published network models listed in Table 1 on different
computing architectures. This section describes
the benchmarking platforms and compares performance
between NEURON and CoreNEURON.
The benchmarking systems with hardware details,
compiler toolchains and network fabrics are summa-
rized in Table 4. The Blue Brain IV (BB4) and Blue
Brain V (BB5) systems are based on IBM BlueGene/Q [50]
and HPE SGI 8600 [51] platforms respectively, hosted at
the Swiss National Supercomputing Centre (CSCS) in Lugano,
Switzerland. The BB4 system has 4,096 nodes comprising
65,536 PowerPC A2 cores. The BB5 system has three types
of compute nodes: Intel KNLs with low clock rate but
high-bandwidth MCDRAM, Intel Skylakes with high clock rate,
and NVIDIA Volta GPUs. Vendor-provided compilers and
MPI libraries are used on both systems. The BB4 system
is used for strong scaling benchmarks as it has a
larger core count than the BB5 system.
We compared the memory footprint of different net-
work models listed in Table 1. Figure 6 on the left shows
memory usage reduction with CoreNEURON simula-
tion compared to NEURON simulation. The memory
reduction factor depends on various model properties
(e.g. number of compartments, sections, synapses, etc.)
but one can expect 4-7x reduction with the use of
CoreNEURON. Note that CoreNEURON Online mode
will need 1/7x to 1/4x more memory during the Memory
Setup phase. But once the model is transferred to
CoreNEURON for simulation, NEURON can free its allocated
memory.
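As a small worked example of this overhead, the Python sketch below estimates the peak footprint during the Memory Setup phase. It assumes that NEURON's model and its CoreNEURON copy coexist in memory and that the copy is smaller by the 4-7x reduction factor quoted above. The 100 GB figure and the function name are hypothetical.

```python
def online_peak_memory(neuron_gb, reduction_factor):
    """Peak memory while NEURON's model and its CoreNEURON copy coexist."""
    coreneuron_gb = neuron_gb / reduction_factor
    return neuron_gb + coreneuron_gb


# Hypothetical model that occupies 100 GB in NEURON.
low = online_peak_memory(100.0, 7.0)   # 7x reduction: about 1/7 extra
high = online_peak_memory(100.0, 4.0)  # 4x reduction: 1/4 extra
print(f"{low:.1f} GB to {high:.1f} GB")
```

Under these assumptions the peak lies roughly 14% to 25% above the NEURON footprint, and it falls to the CoreNEURON footprint once NEURON frees its memory.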
Figure 6 on the right shows the speedup achieved
on single node for different models with CoreNEURON
compared to NEURON. Note that the Cortex and Hip-
pocampus models are very large in terms of memory ca-
pacity requirement. For single node performance anal-
ysis we used a smaller subset of these two models.
The memory layout and code vectorization optimizations
described in Section 4.2 show the greatest improvement
when most of the computation time is spent
in channel and synapse computations. The Cortex,
Cortex+Plasticity and Hippocampus models have cells
with 200 to 800 compartments and 20 different channel
types. This makes these models compute intensive,
and they benefit most from CoreNEURON. The Ring
network model has computations only from passive
dendrites and active soma.
Intel KNL has 512-bit SIMD vectors and high band-
width memory (MCDRAM). One needs to efficiently
utilize these hardware features to achieve best perfor-
mance. In the case of CoreNEURON, NMODL gen-
erated code is auto-vectorized by the compiler and
has SoA memory layout to provide uniform, contigu-
ous memory access. NEURON uses AoS memory lay-
out which results in strided memory accesses. Due
to the lower clock frequency of KNL cores, the per-
formance impact of non-vectorized code and strided
memory accesses is high compared to other architectures.
Hence CoreNEURON delivers better performance
on KNL compared to NEURON. Note that the
Cortex+Plasticity and Hippocampus models show a
smaller improvement (2-4x) than the Cortex
model (3-7x). This is because some of the channel
and synapse descriptions explicitly request integration
methods that present compilers cannot efficiently vec-
torize. Alternative code generation for these methods is
being considered.
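The difference between the two layouts can be illustrated with a short Python sketch. It is schematic only: the field names `m`, `h` and `gbar` and all function names are assumptions, and Python is used to show indexing rather than performance. In the AoS layout, the same field of consecutive instances is separated by a stride equal to the number of fields. In the SoA layout it is contiguous, which is the access pattern that vector units load efficiently.

```python
from array import array

# Hypothetical per-instance variables of one channel type.
FIELDS = ("m", "h", "gbar")


def aos_index(i, f, n_instances):
    """Array-of-Structures: all fields of instance i are stored together."""
    return i * len(FIELDS) + f


def soa_index(i, f, n_instances):
    """Structure-of-Arrays: all instances of field f are stored together."""
    return f * n_instances + i


def pack(instances, index_fn):
    """Copy the instances into one flat buffer using the given layout."""
    n = len(instances)
    buf = array("d", [0.0] * (n * len(FIELDS)))
    for i, inst in enumerate(instances):
        for f, name in enumerate(FIELDS):
            buf[index_fn(i, f, n)] = inst[name]
    return buf


def stride(index_fn, f, n_instances):
    """Distance in elements between field f of consecutive instances."""
    return index_fn(1, f, n_instances) - index_fn(0, f, n_instances)


instances = [{"m": 0.1 * i, "h": 1.0 - 0.1 * i, "gbar": 0.12} for i in range(4)]
n = len(instances)
aos = pack(instances, aos_index)
soa = pack(instances, soa_index)

# Reading field "m" of every instance: strided in AoS, contiguous in SoA.
m_from_aos = [aos[aos_index(i, 0, n)] for i in range(n)]
m_from_soa = list(soa[0:n])
aos_stride = stride(aos_index, 0, n)
soa_stride = stride(soa_index, 0, n)
print(aos_stride, soa_stride)
```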
On the BlueGene/Q platform the speedup with
most of the models is limited to 2x. This is because
the IBM XL compiler is not able to vectorize most of
the channel and synapse kernels. Observed performance
improvement on this platform is due to more efficient
memory accesses from the SoA layout discussed in
Section 4.2.
GPU support has been recently added to CoreNEU-
RON and not all models are adapted for GPU yet. The
Ring network model has a large number of identical cells
which suits SIMD computations on GPU and hence
shows significant performance improvement compared
to other architectures. The Traub model has a small
number of cells exposing limited parallelism and the
Dentate model has gap junctions which require copy-
ing of voltages between CPU and GPU every timestep.
This limits the performance improvement on GPU.
Due to the large memory requirement of Cortex
+ Plasticity and Hippocampus models, a minimum of
2048 nodes of the BB4 system are required when NEU-
RON is used. By using CoreNEURON Offline Mode,
users can now simulate the Cortex+Plasticity model
with 128 nodes and the Hippocampus model with 256
Table 4: Details of Benchmarking Systems
BlueGene/Q (BB4)
    Processor:          IBM PowerPC A2, 16 cores @ 1.6 GHz, 16 GB DRAM
    Compiler toolchain: IBM XL 12.1 and IBM MPI
    Network:            Integrated 5-D torus
Intel Skylake (BB5)
    Processor:          2 x Xeon 6140, 36 cores @ 2.3 GHz, 384 GB DRAM
    Compiler toolchain: Intel 2018.1 and HPE-MPI (MPT)
    Network:            InfiniBand EDR
Intel KNL (BB5)
    Processor:          Xeon Phi (7230), 64 cores @ 1.3 GHz, 16 GB MCDRAM, 96 GB DRAM
    Compiler toolchain: Intel 2018.1 and HPE-MPI (MPT)
    Network:            InfiniBand EDR
NVIDIA GPU (BB5)
    Processor:          NVIDIA GPU V100 SXM2
    Compiler toolchain: PGI 18.10, OpenMPI 2.0
    Network:            InfiniBand EDR
[Figure 6: two bar charts over the models Ring, Traub, Dentate, Cortex, Cortex+Plasticity and Hippocampus. Left: Memory Reduction Factor. Right: Speedup on BG-Q, Skylake, KNL and GPU.]
Fig. 6: Memory usage reduction with CoreNEURON compared to NEURON (measured on BB4) on the left and
speedup of CoreNEURON simulations compared to NEURON on various architectures on the right
nodes of BB4. This is a significant usability improve-
ment because of limited compute partition size and
long job queues on this shared computing resource. Fig-
ure 7 shows strong scaling of Cortex+Plasticity and
Hippocampus models simulating one second of biolog-
ical time on the BB4 system with CoreNEURON. As
these models are compute intensive and a small fraction
of execution time is spent in spike communication, the
scaling behavior depends on how well a given number of
cells can be distributed across the available number of
ranks to yield good load balance. Both models show
excellent strong scaling behavior up to 2,048 nodes.
Due to the large size range of morpho-electrical neuron
types, at least 7-10 cells per MPI process are required to
achieve good load balance. With 32,000 MPI processes
(16 ranks per node) and about 219,000 cells of Cor-
tex+Plasticity, the load balance is not as good as with
the Hippocampus model of about 789,000 cells. Hence,
the Cortex+Plasticity model exhibits poorer scaling be-
havior compared to the Hippocampus model.
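The load balance argument can be checked with simple arithmetic, sketched here in Python. The cell counts, node count and ranks per node are taken from the text above. The function name is hypothetical, and the threshold of 7 cells per rank is the lower end of the 7-10 range stated above. Note that 2,048 nodes with 16 ranks each give 32,768 MPI processes, which the text rounds to 32,000.

```python
def cells_per_rank(n_cells, n_nodes, ranks_per_node):
    """Average number of cells assigned to each MPI rank."""
    return n_cells / (n_nodes * ranks_per_node)


MIN_CELLS_PER_RANK = 7  # lower end of the 7-10 cells quoted in the text

cortex = cells_per_rank(219_000, 2048, 16)
hippocampus = cells_per_rank(789_000, 2048, 16)
print(f"Cortex+Plasticity: {cortex:.1f} cells per rank")
print(f"Hippocampus:       {hippocampus:.1f} cells per rank")
```

At 2,048 nodes the Cortex+Plasticity model averages just under 7 cells per rank, which is below the stated minimum, while the Hippocampus model averages about 24.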
[Figure 7: two line plots of Solver Time (Sec) against #Nodes, comparing Ideal and CoreNEURON scaling. Left: 128 to 2048 nodes. Right: 256 to 2048 nodes.]
Fig. 7: Strong scaling of CoreNEURON on the BB4 system for the Cortex+Plasticity model on the left and the
Hippocampus CA1 model on the right
6 Discussion
Modern compute architectures can significantly boost
application performance, and the study of the brain in
silico urgently needs to embrace this capability and
trend. Accordingly, the widely used NEURON simulator,
which supports a large variety of models, has over
the years been successfully adapted to massively
parallel architectures, but its primary design goals were
to allow for a flexible definition of models and interactive
introspection thereof. It was designed neither for
ultimate memory efficiency nor for maximal performance.
However, the larger and more detailed the models get,
the larger are the resource requirements to simulate
those models. Eventually, the costs of a system re-
quired for an un-optimized simulator should be weighed
against the effort of reworking the simulator to make
more efficient use of resources. In the context of the
Blue Brain Project we took the decision to contribute to
making the NEURON simulator more efficient for large
models, effectively leading to reduced resource require-
ments, faster time-to-solution, or simply the capability
to run bigger models on a given resource.
Compatibility with existing NEURON models
As the neuroscience community has developed and
shared thousands of models with NEURON, compatibility
and reproducibility have been among the primary
design goals. To maintain maximal compatibility, we
chose the path of extracting the computationally relevant
parts of NEURON into a library called CoreNEURON
and adapting it to exploit the computational features
of modern compute architectures. This differs from
the path taken by, for example, the Arbor initiative
[52], [53], which started its development from
scratch. While such a fresh start has its benefits in
terms of designing for future architectures from the
start, we can show that the transformation approach we
took immediately gives compatibility with a large number
of existing NEURON models with minimal modification.
Currently, CoreNEURON does not handle non-thread-safe
models and requires NMODL modifications
if constructs like POINTER [54] are used. We are working
on handling such models transparently.
Flexibility for Model Building and Efficiency
for Model Simulation
Many modeling workflows related to detailed brain
models require flexibility for quickly inspecting and
changing the models. By extracting the compute en-
gine from the NEURON simulator environment and
providing different methods of how it can interact with
the NEURON simulator, one maintains the flexibility
of NEURON for the construction of the models and
can more easily apply optimizations to the compute
engine for the costly simulation phase. The Offline ex-
ecution mode of CoreNEURON provides flexibility to
build and simulate large network models that cannot be
simulated with NEURON. The use of MPI together
with the OpenMP and OpenACC programming models
provides portability across different architectures such
as multi-core and many-core CPUs and GPUs.
Reduced Memory and Faster Time-to-Solution
The data structure changes allow CoreNEURON to
use significantly less memory compared to NEURON.
The SoA memory layout and code vectorisation al-
low CoreNEURON to simulate models efficiently. We
benchmarked five different network models on differ-
ent architectures showing 4-7x memory usage reduction
and 2-7x execution time improvement.
Future Work
We discussed the implementation of the most significant
changes and optimizations in NEURON and CoreNEU-
RON. Although CoreNEURON can be used transpar-
ently within NEURON, users cannot currently access
or modify model properties during integration. Work is
ongoing in regard to bidirectional data copy routines ac-
tivated by normal NEURON variable name evaluation
and assignment syntax ranging in granularity from the
entire model, to specific named arrays, down to individ-
ual variables. On the numerical side, CoreNEURON to-
day supports network simulations using the fixed time
step method but not the variable time step integra-
tion method (CVODE) [55]. The latter is rarely used
in network simulations because state or parameter dis-
continuities in response to synaptic events demand con-
tinuous re-initialization of variable step integrators. Re-
search is ongoing on how to improve the applicability of
variable time step schemes in network simulation and
can be considered for inclusion at a later stage. Cur-
rently, mapping of multiple MPI ranks to GPUs is not
optimal and this will be addressed in future releases.
Lastly, the NMODL source-to-source translator will be
improved to generate efficient code for stiff, coupled,
nonlinear gating state complexes that require the de-
rivimplicit integration method as well as the generation
of optimal code for GPUs.
Availability
CoreNEURON and the code generation program MOD2C
are open source and available on GitHub [43], [56].
Acknowledgements
This work has been funded by the EPFL Blue Brain
Project (funded by the Swiss ETH board), NIH grant
number R01NS11613 (Yale University), the European
Union Seventh Framework Program (FP7/2007-2013)
under grant agreement n◦ 604102 (HBP) and the Eu-
ropean Union’s Horizon 2020 Framework Programme
for Research and Innovation under Grant Agreement n◦
720270 (Human Brain Project SGA1) and Grant Agree-
ment n◦ 785907 (Human Brain Project SGA2). The au-
thors would like to thank Bruno Magalhaes, Francesco
Cremonesi, Sam Yates and Timothee Ewart for contri-
butions to the CoreNEURON development.
References
[1] Erik De Schutter and James Bower. “An Ac-
tive Membrane Model of the Cerebellar Purkinje
Cell I. Simulation of Current Clamps in Slice”.
In: Journal of Neurophysiology 71 (Feb. 1994),
pp. 375–400. doi: 10.1152/jn.1994.71.1.375.
[2] Zachary F. Mainen and Terrence J. Sejnowski.
“Influence of dendritic structure on firing pat-
tern in model neocortical neurons”. In: Nature
382.6589 (1996), pp. 363–366. issn: 1476-4687.
doi: 10.1038/382363a0. url: https://doi.
org/10.1038/382363a0.
[3] S. N. Davies. “Neural Networks of the Hip-
pocampus. By Roger D. Traub and Richard
Miles. Pp. 281. Cambridge University Press, 1991.
27.50/$39.50 hardback. ISBN 0 521 36481 7”. In:
Experimental Physiology 77.1 (1992), pp. 238–
238. doi: 10.1113/expphysiol.1998.sp004235.
eprint: https://physoc.onlinelibrary.wiley.com/doi/pdf/10.1113/expphysiol.1998.sp004235.
url: https://physoc.onlinelibrary.wiley.com/doi/abs/10.1113/expphysiol.1998.sp004235.
[4] Henry Markram et al. “Reconstruction and sim-
ulation of neocortical microcircuitry”. In: Cell
163.2 (2015), pp. 456–492.
[5] Anton Arkhipov et al. “Visual physiology of the
layer 4 cortical circuit in silico”. In: PLOS Com-
putational Biology 14.11 (Nov. 2018), pp. 1–47.
doi: 10.1371/journal.pcbi.1006535. url:
https://doi.org/10.1371/journal.pcbi.1006535.
[6] Michael W. Reimann et al. “Cliques of Neurons
Bound into Cavities Provide a Missing Link be-
tween Structure and Function”. In: Frontiers in
Computational Neuroscience 11 (2017), p. 48.
issn: 1662-5188. doi: 10.3389/fncom.2017.00048.
url: https://www.frontiersin.org/article/10.3389/fncom.2017.00048.
[7] Eyal Gal et al. “Rich cell-type-specific network
topology in neocortical microcircuitry”. In: Na-
ture neuroscience 20 (June 2017). doi: 10.1038/
nn.4576.
[8] Costas A. Anastassiou et al. “Cell type- and
activity-dependent extracellular correlates of in-
tracellular spiking”. In: Journal of Neurophysiol-
ogy 114.1 (2015). PMID: 25995352, pp. 608–623.
doi: 10.1152/jn.00628.2014. eprint:
https://doi.org/10.1152/jn.00628.2014. url:
https://doi.org/10.1152/jn.00628.2014.
[9] Robert Lindroos et al. “Basal Ganglia Neuromod-
ulation Over Multiple Temporal and Structural
Scales - Simulations of Direct Pathway MSNs Investigate
the Fast Onset of Dopaminergic Effects
and Predict the Role of Kv4.2”. In: Frontiers in
Neural Circuits 12 (2018), p. 3. issn: 1662-5110.
doi: 10.3389/fncir.2018.00003. url: https:
//www.frontiersin.org/article/10.3389/
fncir.2018.00003.
[10] Renaud Jolivet et al. “Multi-timescale Modeling
of Activity-Dependent Metabolic Coupling in the
Neuron-Glia-Vasculature Ensemble”. In: PLOS
Computational Biology 11.2 (Feb. 2015), pp. 1–
23. doi: 10.1371/journal.pcbi.1004036. url:
https://doi.org/10.1371/journal.pcbi.
1004036.
[11] Eugene M. Izhikevich and Gerald M. Edelman.
“Large-scale model of mammalian thalamocor-
tical systems”. In: Proceedings of the National
Academy of Sciences 105.9 (2008), pp. 3593–
3598. issn: 0027-8424. doi: 10.1073/pnas.0712231105.
eprint: https://www.pnas.org/content/105/9/3593.full.pdf.
url: https://www.pnas.org/content/105/9/3593.
[12] Tobias C. Potjans and Markus Diesmann. “The
Cell-Type Specific Cortical Microcircuit: Relat-
ing Structure and Activity in a Full-Scale Spiking
Network Model”. In: Cerebral cortex. 2014.
[13] David Dahmen et al. “Hybrid Scheme for Mod-
eling Local Field Potentials from Point-Neuron
Networks”. In: Cerebral Cortex 26.12 (Dec. 2016),
pp. 4461–4496. issn: 1047-3211. doi: 10.1093/cercor/bhw237.
eprint: http://oup.prod.sis.lan/cercor/article-pdf/26/12/4461/8657457/bhw237.pdf.
url: https://dx.doi.org/10.1093/cercor/bhw237.
[14] Gustavo Deco et al. “The Dynamic Brain: From
Spiking Neurons to Neural Masses and Corti-
cal Fields”. In: PLOS Computational Biology 4.8
(Aug. 2008), pp. 1–35. doi: 10.1371/journal.pcbi.1000092.
url: https://doi.org/10.1371/journal.pcbi.1000092.
[15] Stefan Wils and Erik De Schutter. “STEPS: mod-
eling and simulating complex reaction-diffusion
systems with Python”. In: Frontiers in Neuroinformatics
3 (2009), p. 15. issn: 1662-5196. doi: 10.3389/neuro.11.015.2009.
url: https://www.frontiersin.org/article/10.3389/neuro.11.015.2009.
[16] Michele Migliore et al. “Parallel Network Simu-
lations with NEURON”. In: Journal of Compu-
tational Neuroscience 21.2 (2006), pp. 119–129.
doi: 10.1007/s10827-006-7949-5.
[17] Marc-Oliver Gewaltig and Markus Diesmann.
“NEST (NEural Simulation Tool)”. In: Scholar-
pedia 2.4 (2007), p. 1430.
[18] Paula Sanz-Leon et al. “Mathematical framework
for large-scale brain network modeling in The Virtual
Brain”. In: NeuroImage 111 (2015), pp. 385–430.
issn: 1053-8119. doi: 10.1016/j.neuroimage.2015.01.002.
url: http://www.sciencedirect.com/science/article/pii/S1053811915000051.
[19] Roger D. Traub et al. “Single-Column Thala-
mocortical Network Model Exhibiting Gamma
Oscillations, Sleep Spindles, and Epileptogenic
Bursts”. In: Journal of Neurophysiology 93.4
(2005). PMID: 15525801, pp. 2194–2232. doi: 10.
1152/jn.00983.2004.
[20] Jonas Dyhrfjeld-Johnsen et al. “Topological De-
terminants of Epileptogenesis in Large-Scale
Structural and Functional Models of the Dentate
Gyrus Derived From Experimental Data”. In:
Journal of Neurophysiology 97.2 (2007). PMID:
17093119, pp. 1566–1587. doi: 10.1152/jn.00950.2006.
eprint: https://doi.org/10.1152/jn.00950.2006.
url: https://doi.org/10.1152/jn.00950.2006.
[21] Michael Hines. Ring Network Model of Ball-and-
Stick neurons. 2017. url: https://github.com/
nrnhines/ringtest (visited on 08/19/2018).
[22] Community Models of Hippocampus. url:
https://www.humanbrainproject.eu/en/brain-simulation/hippocampus/
(visited on 09/01/2018).
[23] Iain Hepburn, Weiliang Chen, and Erik De Schut-
ter. “Accurate Reaction-Diffusion Operator Split-
ting on Tetrahedral Meshes for Parallel Stochastic
Molecular Simulations”. In: The Journal of chem-
ical physics 145 5 (2016), p. 054118.
[24] Tammo Ippen et al. “Constructing Neuronal
Network Models in Massively Parallel Environ-
ments”. In: Frontiers in Neuroinformatics 11
(2017), p. 30. issn: 1662-5196. doi: 10.3389/fninf.2017.00030.
url: https://www.frontiersin.org/article/10.3389/fninf.2017.00030.
[25] Michael Hines, Sameer Kumar, and Felix Schürmann.
“Comparison of Neuronal Spike Exchange
Methods on a Blue Gene/P Supercomputer”.
In: Frontiers in Computational Neuroscience 5.49
(2011). doi: 10.3389/fncom.2011.00049.
[26] Soltesz Lab. url: http://med.stanford.edu/
ivansolteszlab/front-page.html (visited on
09/01/2018).
[27] E. Ábrahám et al. “Preparing HPC Applications for
Exascale: Challenges and Recommendations”. In:
2015 18th International Conference on Network-
Based Information Systems. 2015, pp. 401–406.
doi: 10.1109/NBiS.2015.61.
[28] Michael Hines. “NEURON, a program for sim-
ulation of nerve equations”. In: Neural systems:
Analysis and modeling. Springer, 1993, pp. 127–
136.
[29] Henry C Tuckwell. Introduction to theoretical
neurobiology: Volume 2, nonlinear and stochas-
tic theories. Vol. 8. Cambridge University Press,
2005.
[30] Michael Hines. “Efficient computation of
branched nerve equations”. In: International
journal of bio-medical computing 15.1 (1984),
pp. 69–76.
[31] Pramod Kumbhar et al. “Leveraging a Cluster-
Booster Architecture for Brain-Scale Simula-
tions”. In: High Performance Computing. Lecture
Notes in Computer Science 9697 (2016), pp. 18.
363–380. doi: 10.1007/978-3-319-41321-1_19.
url: http://infoscience.epfl.ch/record/
223109.
[32] Michael L Hines and Nicholas T. Carnevale. “Ex-
panding NEURON’s repertoire of mechanisms
with NMODL”. In: Neural Computation 12.5
(2000), pp. 995–1007.
[33] ParallelContext : NEURON User Guide. url:
https://www.neuron.yale.edu/neuron/static/docs/help/neuron/neuron/classes/parcon.html#psolve
(visited on 09/01/2018).
[34] NMODL User Guide. url:
https://www.neuron.yale.edu/neuron/static/py_doc/modelspec/programmatic/mechanisms/nmodl.html
(visited on 11/08/2018).
[35] M Schulz et al. “Implementation and Evalua-
tion of a Scalable Application-Level Checkpoint-
Recovery Scheme for MPI Programs”. In: Dec.
2004, pp. 38–38. isbn: 0-7695-2153-3. doi: 10.
1109/SC.2004.29.
[36] Threads : NEURON User Guide. url:
https://www.neuron.yale.edu/neuron/static/py_doc/modelspec/programmatic/network/parcon.html#ParallelContext.Threads
(visited on 01/01/2019).
[37] John K. Salmon et al. “Parallel Random Num-
bers: As Easy As 1, 2, 3”. In: Proceedings of 2011
International Conference for High Performance
Computing, Networking, Storage and Analysis.
SC ’11. Seattle, Washington: ACM, 2011, 16:1–
16:12. isbn: 978-1-4503-0771-0. doi: 10.1145/
2063384.2063405.
[38] Shashi Prasad. Multithreading Programming
Techniques. New York, NY, USA: McGraw-Hill,
Inc., 1996. isbn: 0079122507.
[39] Michael Lange et al. “Achieving Efficient Strong
Scaling with PETSc using Hybrid MPI/OpenMP
Optimisation”. In: ISC. 2013.
[40] L. Dagum and R. Menon. “OpenMP: an industry
standard API for shared-memory programming”.
In: IEEE Computational Science and Engineering
5.1 (1998), pp. 46–55. issn: 1070-9924. doi: 10.
1109/99.660313.
[41] Pramod Kumbhar et al. “Leveraging a Cluster-
Booster Architecture for Brain-Scale Simula-
tions”. In: High Performance Computing: 31st In-
ternational Conference, ISC High Performance
2016, Frankfurt, Germany, June 19-23, 2016,
Proceedings. Springer International Publishing,
2016, pp. 363–380. doi: 10.1007/978-3-319-
41321-1_19.
[42] Inga Blundell et al. “Code generation in compu-
tational neuroscience: a review of tools and tech-
niques”. In: Frontiers in Neuroinformatics 12.68
(2018), pp. 1–35. doi: 10.3389/fninf.2018.00068.
url: http://sro.sussex.ac.uk/79306/.
[43] Blue Brain Project. MOD2C - converter for mod
files to C code. 2015. url: http://github.com/
BlueBrain/mod2c (visited on 08/18/2018).
[44] OpenACC. https://www.openacc.org/. Accessed:
2018-05-14. 2012.
[45] John Levine. Flex & Bison.
1st. O’Reilly Media, Inc., 2009. isbn: 0596155972,
9780596155971.
[46] John Nickolls et al. “Scalable Parallel Program-
ming with CUDA”. In: Queue 6.2 (Mar. 2008),
pp. 40–53. issn: 1542-7730. doi: 10.1145/1365490.1365500.
url: http://doi.acm.org/10.1145/1365490.1365500.
[47] Michael Hines. NEURON (GPU). 2014. url:
https://bitbucket.org/nrnhines/nrngpu
(visited on 08/18/2018).
[48] Pedro Valero-Lara et al. “cuHinesBatch: Solving
Multiple Hines systems on GPUs”. In: Procedia
Computer Science 108 (2017). International Con-
ference on Computational Science, ICCS 2017,
12-14 June 2017, Zurich, Switzerland, pp. 566–575.
issn: 1877-0509. doi: 10.1016/j.procs.2017.05.145.
[49] TESLA K20X GPU ACCELERATOR. url:
https://www.nvidia.com/content/pdf/kepler/tesla-k20x-bd-06397-001-v07.pdf
(visited on 01/04/2019).
[50] R. Haring et al. “The IBM Blue Gene/Q Com-
pute Chip”. In: IEEE Micro 32.2 (2012), pp. 48–
60. issn: 0272-1732. doi: 10.1109/MM.2011.108.
[51] HPE SGI 8600 System. url:
https://www.hpe.com/us/en/product-catalog/servers/proliant-servers/pip.hpe-sgi-8600-system.1010032504.html
(visited on 09/01/2018).
[52] The Arbor multi-compartment neural network
simulation library. url: https://github.com/
arbor-sim (visited on 01/18/2019).
[53] Wouter Klijn et al. “Arbor: A morphologically de-
tailed neural network simulator for modern high
performance computer architectures”. In: 26th
Computational Neuroscience Meeting, Antwerp
(Belgium), 15 Jul 2017 - 20 Jul 2017. July 15,
2017. url: http://juser.fz-juelich.de/record/836542.
[54] POINTER - NMODL User Guide. url:
https://www.neuron.yale.edu/neuron/static/py_doc/modelspec/programmatic/mechanisms/nmodl.html#pointer
(visited on 11/08/2018).
[55] Scott D Cohen and Alan C Hindmarsh. “CVODE,
a stiff/nonstiff ODE solver in C”. In: Computers
in physics 10.2 (1996), pp. 138–143.
[56] Blue Brain Project. CoreNeuron - simulator opti-
mized for large scale neural network simulations.
https://github.com/BlueBrain/CoreNeuron.
2015. (Visited on 08/18/2018).
