An optimizing multi-platform source-to-source compiler framework for the
  NEURON MODeling Language by Kumbhar, Pramod et al.
An optimizing multi-platform source-to-source compiler
framework for the NEURON MODeling Language
Pramod Kumbhar
Ecole Polytechnique Fédérale de
Lausanne (EPFL)
Omar Awile
Ecole Polytechnique Fédérale de
Lausanne (EPFL)
Liam Keegan
Ecole Polytechnique Fédérale de
Lausanne (EPFL)
Jorge Blanco Alonso
Ecole Polytechnique Fédérale de
Lausanne (EPFL)
James King
Ecole Polytechnique Fédérale de
Lausanne (EPFL)
Michael Hines
Yale University
Felix Schürmann
felix.schuermann@epfl.ch
Ecole Polytechnique Fédérale de
Lausanne (EPFL)
ABSTRACT
Domain-specific languages (DSLs) play an increasingly important
role in the generation of high performing software. They allow the
user to exploit specific knowledge encoded in the constructs for the
generation of code adapted to a particular hardware architecture;
at the same time, they make it easier to generate optimized code for
a multitude of platforms as the transformation has to be encoded
only once. Here, we describe a new code generation framework
(NMODL) for an existing DSL in the NEURON framework, a widely
used software for massively parallel simulation of biophysically
detailed brain tissue models. Existing NMODL DSL transpilers lack
either essential features to generate optimized code or capability to
parse the diversity of existing models in the user community. Our
NMODL framework has been tested against a large number of previ-
ously published user models and offers high-level domain-specific
optimizations and symbolic algebraic simplifications before tar-
get code generation. Furthermore, rich analysis tools are provided
allowing the scientist to introspect models. NMODL implements
multiple SIMD and SPMD targets optimized for modern hardware.
Benchmarks were performed on Intel Skylake, Intel KNL and AMD
Naples platforms. When comparing NMODL-generated kernels
with NEURON we observe a speedup of up to 20x, resulting into
overall speedups of two different production simulations by ∼10x.
When compared to a previously published SIMD optimized version
that heavily relied on auto-vectorization by the compiler still a
speedup of up to ∼2x is observed.
KEYWORDS
NEURON, HPC, neuroscience, DSL, compiler, code generation
1 INTRODUCTION
The use of large scale simulation in modern neuroscience is becom-
ing increasingly important (e.g. [1–3]) and has been enabled by
substantial performance progress in neurosimulation engines over
the last decade and a half (e.g. [4–9]).
While excellent scaling has been achieved on a variety of plat-
forms with the conversion to Single Instruction Multiple Data
(SIMD) implementations, domain specific knowledge expressed
in the models is not yet optimally used. In other fields, the use of
DSLs and subsequent code-to-code translation have been effective
in generating high performing codes and allowing easy adaptation
to novel architectures ([10–14]). This is becoming more impor-
tant as the architectural diversity of computers is increasing as a
reaction to mutually exclusive design trade offs forced by more
evident physical and economic limitations when going to the next
generation of computing systems.
Motivated by these observations, we have revisited the widely
adopted NEURON simulator [15], which enables simulations of bio-
physically detailed neuron models on computing platforms ranging
from desktop to petascale supercomputers, and which has over
2,000 reported scientific studies using it. NEURON supports net-
works of neurons with complex branched anatomy and biophysical
properties such as multiple channel types, inhomogeneous channel
distribution, ionic accumulation and diffusion-reactions. One of the
key features of the NEURON simulator is extendability via a domain
specific language (DSL) layer called the NEURON Model Descrip-
tion Language (NMODL) [16]. NMODL allows model authors to
extend NEURON by incorporating a wide range of membrane and
intracellular submodels such as voltage and ligand gated channels,
ionic accumulation and diffusion, and synapse models. The domain
scientists can easily express these channel properties in terms of
algebraic and ordinary differential equations, kinetic schemes, and
finite state machines in NMODL. Working at a descriptive level
allows easy development of models while focusing on biological
aspects rather than worrying about lower level implementation
details such as solver methods, threads, memory layout, parallel
execution etc.
The rate limiting aspect for performance of NEURON simulations
is the execution of channels and synapses written in the NMODL
DSL. The code generated from NMODL often accounts for more
than 80% of overall execution time. For example, Figure 1 shows
the execution profile of a Hippocampus model [17] with about 670
ar
X
iv
:1
90
5.
02
24
1v
1 
 [c
s.M
S]
  6
 M
ay
 20
19
Kumbhar P., et al.
Current Update 
(~45%)
Spike
Communication (~7%)
Linear 
Algebra 
(~3%)
Other (~1%)
State Update 
(~44%)
Figure 1: Execution profile of hippocampus CA1 model
showing the percentage of time spent in various parts of the
NEURON simulator. More than 90% of the execution time is
spent in the kernels generated from the NMODL DSL.
thousand compartments and about 4.3 million synapses simulated
for one second of biological time (see section 7 for details). This
model has 17 morpho-electrical types (me-types) where electrical
firing behavior of an metype is governed by the distribution of vari-
ous channel types across the morphological cell body. Each channel
is described in a separate NMODL file and typically contains two
kernels : state update and current update. The outer ring of the pie
chart shows the percentage of time spent in the different parts of
the NEURON simulator. The current update and state update from
the NMODL DSL accounts for more than 90% of the simulation time.
The execution time of individual mechanisms are shown by arcs in
the inner circle. Even in a strong scaling scenario, it has been shown
that the spike communication remains small and channel/synapse
computations dominate the overall execution time [18].
There aremore than six thousandNMODLfiles that are shared by
theNEURONuser community on theModelDB platform [19]. As the
type and number of mechanisms differ from model to model, hand-
tuning of the generated code is not feasible. The goal of our NMODL
Framework is to provide a tool that can parse all existing models,
and generate optimized code from NMODL DSL code, which is
responsible for more than 80% of the total simulation time.
2 RELATEDWORK
The reference implementation for the NMODL DSL specification
is found in nocmodl [20], which is a component in the NEURON
simulator. Over the years nocmodl underwent several iterations of
development and gained support for a number of newer language
constructs and a C code generator backend for the NEURON simu-
lator. Nocmodl uses lex/flex [21] and yacc/bison [22] for its scanner
and parser implementation respectively. One of the major limita-
tions of nocmodl is its lack of flexibility. Instead of constructing an
intermediate representation, such as an Abstract Syntax Tree (AST),
it performs many code generation steps on the fly, while parsing
input as part of the bison production rules. This leaves little flexi-
bility for performing global analysis, optimizations, adapting code
generation for different languages or targeting a different simulator
altogether. The CoreNEURON library uses a modified version of
nocmodl called mod2c [23]. Mod2c duplicates most of the legacy
code with the modifications needed to generate code with new data
structures present in CoreNEURON, different memory layouts and
GPU support based on the OpenACC programming model. But,
as mod2c shares most of its implementation with nocmodl, it still
has some of the same limitations as nocmodl. Pynmodl [24] is a
Python based parsing and post-processing tool for NMODL. The
primary focus of pynmodl is to parse and translate NMODL DSL to
other computational neuroscience DSLs such as LEMS [25]. Pyn-
modl supports a subset of the NMODL DSL specification and does
not support lower level code generation for a particular simulator.
The modcc source-to-source compiler is being developed as part
of the Arbor simulator [26]. It is able to generate from NMODL
DSL code, optimized C++/CUDA to be used with the Arbor sim-
ulator. Its lexer and parser have been hand-implemented in C++
and generate an intermediate AST representation. However, it only
implements a subset of the NMODL DSL specification and hence
is only able to process a modest number of existing models avail-
able in ModelDB [19]. For a more comprehensive review of current
code-generator techniques in computational neuroscience we refer
the reader to Blundell et al [27]. In summary, we conclude that
current tools either lack support for the full NMODL DSL specifi-
cation, lack the necessary flexibility to be used as a generic code
generation framework, or are unable to adequately take advantage
of modern hardware architectures, and thus are missing out on
available performance from modern computing architectures.
3 NMODL DSL
To understand important code generation aspects for improving
performance, a simple example of a voltage-gated calcium channel
written in NMODL DSL is presented in Figure 2. Due to space
constraints the example is modified to highlight important language
constructs.
Each named block in the NMODL DSL has the general form of
KEYWORD { statements }. There are more than 90 different con-
structs or keywords in the language. Keywords are printed in all up-
percase andmarkedwith boldface in the example. The NEURON block
specifies the name of the mechanism, ion usage, and variables used
in the model. The SUFFIX keyword identifies this as a density mech-
anism (as opposed to POINT_PROCESSwhich would instead identify
it as a synapse or electrode class whose instances are localized
to a single point on the neuron). The USEION statement indicates
the interaction of this mechanism with those other mechanisms at
the same location sharing the same ion type. The RANGE keyword
indicates that the encompassed variables are functions of position
(i.e. each of these variables can have a different value in each com-
partment). LOCAL variables are block scoped variables in the block.
Internal to the mod file, variables that are used as model parameters
are defined in the PARAMETER block. The ASSIGNED block is used to
declare variables that can appear on the left hand side of assignment
statements (i.e. modifiable variables). If a model uses differential
equations, algebraic equations, or kinetic reaction schemes, their
An optimizing multi-platform source-to-source compiler framework for the NEURON MODeling Language
1   TITLE T-calcium channel (adapted)
2   
3   NEURON {
4     SUFFIX cat
5     USEION ca READ cai, cao WRITE ica
6     RANGE gcatbar, ica, gcat, hinf, minf, mtau, htau
7   }
8   
9   PARAMETER {
10    celsius = 25 (degC)
11    gcatbar=.003 (mho/cm2)
12    cai = 50.e-6 (mM)
13    cao = 2 (mM)
14  }
20   
21  STATE {
22    m h
23  }
24   
25  ASSIGNED {
26    ica gcat hinf htau minf mtau
27  }
28   
29
30  BREAKPOINT {
31    SOLVE states METHOD cnexp
32    gcat = gcatbar*m*m*h
33    ica = gcat*ghk(v,cai,cao)
34  }
35  DERIVATIVE states {
36    rates(v)
37    m' = (minf - m)/mtau
38    h' = (hinf - h)/htau
39  }
40  
41  PROCEDURE rates(v(mV)) {
42    LOCAL a, b, qt
43    qt= q10^((celsius-25)/10)
44    a = 0.2*(-1.0*v+19.26)
45    b = 0.009*exp(-v/22.03)
46    minf = a/(a+b)
47    mtau = betmt(v)/(qt*a0m*(1+alpmt(v)))
48    a = 1.e-6*exp(-v/16.26)
49    b = 1/(exp((-v+29.79)/10.)+1.)
50    hinf = a/(a+b)
51    htau = beth(v)/(a0h*(1+alph(v)))
52  }
53  
54  FUNCTION ghk(v(mV), ci(mM), co(mM)) (mV) {
55    LOCAL nu,f
56    f = KTF(celsius)/2
57    nu = v/f
58    VERBATIM
59    // C code implementation
60    ENDVERBATIM
61    ghk = -f*(1.0-(ci/co)*exp(nu))*efun(nu)
62  }
often compute  bound
elemental function
if inlined into DERIVATIVE 
minf, mtau, hinf, htau could 
become local variables 
unsafe for optimisations, 
need C lexer/parseroften memory bound
constant variables with 
limited precision
modifiable variables
channel dependency
memory footprint
Figure 2: NMODL example of voltage-gated calcium channel showing different NMODL constructs and summary of optimiza-
tion information available at DSL level
dependent variables are listed in the STATE block. The BREAKPOINT
block is used to update current and conductance at each time step.
The DERIVATIVE block contains differential equations of the form
y′ = expr that are used to assign values to the derivatives of STATE
variables. The SOLVE statement is used to specify the integration
scheme (see section 5) The PROCEDURE and FUNCTION represents
callable functions where the only difference is that the FUNCTION
variation can return a value. Users can also use VERBATIM constructs
to embed C code directly into the DSL. The enclosing statements
will be copied to the generated code in place. This offers flexibility
to implement lower level functionality not exposed by NMODL
DSL, but also makes it difficult to generate portable code. More
information about NMODL specification can be found in [16].
At the DSL level there is a lot of information implicitly expressed
that can be used to generate efficient code and expose more paral-
lelism. Some of the examples are:
• USEION statement describes the dependency of this mecha-
nism with other ion channels (e.g. ca). This information can
be used to build the runtime dependency graph to exploit
micro-parallelism [28]
• PARAMETER block describes the variables that are constant
during runtime, often with limited precision requirements
and often with small range of values (e.g. gcatbar). This
information can be used to improve memory access cost and
reduce memory footprint.
• ASSIGNED statement describes modifiable variables. These
variables can be allocated in fast memory to reduce access
latency (e.g. minf ).
• DERIVATIVE, KINETIC and SOLVE describes ODEs which can
be analyzed and solved analytically to improve the perfor-
mance as well as accuracy.
• BREAKPOINT describes current and conductance update at
each time step. If the current and voltage relation is ohmic
then one can use analytical expression instead of numerical
derivatives to improve the accuracy as well as performance.
• PROCEDURE does not allow a return value and hence often
users use RANGE variables (e.g.minf, mtau). If plotting of such
variables is not required, procedures can be inlined at DSL
level (e.g. rates) to eliminate RANGE variables and thereby
significantly reduce memory access cost as well as memory
footprint.
To use this information and perform such optimizations, often
global analysis of the NMODL DSL is required. For example, to
perform inlining of a PROCEDURE one needs to find all function calls
and recursively inline the function bodies. To verify if a variable can
be made const, one needs -to find all its usages.. As nocmodl lacks
the intermediate AST representation, this type of analysis is diffi-
cult to perform and these optimizations are not implemented. The
NMODL code generation framework is designed to exploit all such
information from DSL specification and perform optimizations.
4 DESIGN AND IMPLEMENTATION
The overall design of the NMODL Framework can be broken down
into six components: lexer/parser implementation, AST and DSL
level optimisation passes, ODE solvers, Python bindings, and finally
Kumbhar P., et al.
Python Bindings
C
ODEs Units
MOD
Collection
hh
mod
na
mod
ca
mod
Lexer
Parser
NMODL
driver
tokens
Detail
User
Preferred
Interface
Abstract Syntax 
Tree (AST) Initial
Improved Abstract 
Syntax Tree
 Inlining
 DefUse
LoopUnroll
Localizer
Renaming
Look Up
Performance
 Kinetic
Conductance
ODESolver
Intermediate
AST
Optimization
Passes
Passes
UtilitySolver
Passes
SymPy
Code Generation Backends
AST to Custom 
Transformer
AST to NMODL 
Transformer
OpenMP
OpenACC
CUDA
ISPC
DefaultAST to C++Transformer
NMODL
Figure 3: Architecture of the NMODL Code Generation Framework showing: A) Input NMODL files are processed by different
lexers & parsers generating the AST; B) Different analysis and optimisation passes further transform the AST; and C) The
optimised AST is then converted to low level C++ code or other custom backends
code generation passes. Figure 3 summarizes the overall architec-
ture.
4.1 Lexer and Parser
As in any compilation process, the first two steps performed on an
input NMODL DSL are lexing and parsing. The lexer implementa-
tion is based on the popular flex package and bison is used as the
parser generator. To make the lexer and parser fully reentrant [29],
we makes use of the C++ Bison Interface parser implementation.
The bison grammar is based on nocmodl, which is a reference imple-
mentation for the NMODL DSL, making our implementation fully
compatible with the NMODL language specification. As parser rules
are executed, appropriate C++ classes are instantiated to construct
an Abstract Syntax Tree (see subsection 4.2). As opposed to noc-
modl and mod2c our approach allows us to keep parsing and code
generation steps completely separated and an arbitrary number
of intermediate analysis and optimization steps can be interposed.
The ODEs, units and inline C code (from VERBATIM blocks) need
extra processing during further distinct compiler passes. To parse
these constructs, we have implemented separate lexers and parsers
using flex and bison.
4.2 Abstract Syntax Tree Representation
The goal of the NMODL Framework is to implement a general
purpose parsing and code generation tool not tied to a particular
simulator. For this, all language semantics need to be preserved
in the Abstract Syntax Tree (AST) representation so that different
higher level tools can use this information for different purposes.
With more than 90 keywords, 130 different AST node types are
required to represent all language semantics. Each node of the AST
is implemented as a C++ class and represents a syntactic element,
or a group of syntactic or semantic elements of the DSL. To avoid
having to implement more than 130 C++ classes by hand, we used
the declarative approach for AST design using a YAML specifica-
tion [30]. Figure 4 shows an excerpt of the YAML specification of
the DSL. This YAML specification is translated to C++ classes using
the Python jinja2 template engine [31]. Using this approach we are
able to generate most of the AST representation and AST visitor
framework in a compact code base. This approach provides flexi-
bility to extend the AST design with minimal changes and hence
higher productivity.
The hierarchy in the YAML specification represents inheritance
relationships with C++ AST classes. The member specification de-
scribes member variables with their types in C++ classes.
Figure 5 shows an example of the AST in-memory representation
constructed by the NMODL parser for the example discussed in
section 3. As the NMODL language supports block scopes, scoped
symbol tables [32] are created at each block scope and attached to
the corresponding AST node.
4.3 Optimization Passes
Modern compilers implement various passes for code analysis, code
transformation, and optimized code generation. Optimizations such
as constant folding [33], inlining [34], and loop unrolling [35] are
commonly found in all of today’s major compilers. For example, the
An optimizing multi-platform source-to-source compiler framework for the NEURON MODeling Language
1  - Node:
2    description: "Base class for nodes in the AST"
3    children:
4     - Expression:
5       description: "Represents an expression"
6       children:
7         - Number:
8           description: "Represents a number"
9           children:
10            - Float:
11              description: "Represents a float number"
12              members:
13                - value:
14                  type: float
Figure 4: An example of YAML specification for AST base
nodes like Node, Expression and child node Float
S
S
S
S
SS
S
S
S S
S Global Symbol Table S Local Symbol Table 
Figure 5: InmemoryAST representation of theNMODL code
from Figure 2 showing different AST node types and symbol
tables created.
LLVM compiler framework [36] features more than one hundred
compiler passes [37].
In the context of the NMODL Framework, we focus on a few
optimization passes with very specific objectives. By taking advan-
tage of domain-specific and high-level information that is avail-
able in the DSL but later lost in the lower level C++ or CUDA
code, we are able to provide additional significant improvements
in code performance. For example, all NMODL RANGE, ASSIGNED,
and PARAMETER variables are translated to double type variables in
C++. Once this transformation is done, C/C++ compilers can no
longer infer these high-level semantics from these variables. An-
other example is RANGE to LOCAL transformations with the help of
PROCEDURE inlining discussed in section 3. All RANGE variables in
the NMODL DSL are converted to array variables and are dynami-
cally allocated in C++. Once this transformation is done, the C/C++
compiler can only do limited optimizations.
To facilitate the DSL level optimizations summarized in section 3,
we have implemented the following optimization passes. As most
of these optimization passes are also commonly used in compilers
today, we only summarize their role in the context of NMODL DSL
optimizations.
Inlining : To facilitate optimizations such as RANGE to LOCAL
conversion and facilitate other code transformations, the Inlining
pass performs code inlining of PROCEDURE and FUNCTION blocks at
their call sites.
Variable Usage Analysis : Different variable types such as
RANGE, GLOBAL, ASSIGNED can be analysed to check where and how
often they are used. The Variable Usage Analysis pass implements
Definition-Use (DU) chains [38] to perform data flow analysis.
Localiser : Once function inlining is performed, DU chains can
be used to decide which RANGE variables can be converted to LOCAL
variables. The Localiser pass is responsible for this optimization.
Constant Folding and Loop Unrolling : The KINETIC and
DERIVATIVE blocks can contain coupled ODEs in WHILE or FOR
loop statements. In order to analyse these ODEs with SymPy (see
section 5), first we need to perform constant folding to know the
iteration space of the loop and then perform loop unrolling to make
all ODE statements explicit.
All above-described optimization passes operate on the AST and
are implemented using the Visitor Pattern.
4.4 Code Generation
Once DSL and symbolic optimizations (see section 5) are performed
on the AST, the NMODL Framework is ready to proceed to the
code generation step (cf. Figure 3). Table 1 summarizes the various
target languages supported for different hardware platforms. The
C++ code generator plays a special role, since it constitutes the base
code generator extended by all other implementations. This allows
easy implementation of a new target by overriding only necessary
constructs of the base code generator.
While our initial code generation target was C++, allowing us
to validate the code against the original implementation, our pri-
mary target is the Intel SPMD Program Compiler (ISPC) [39]. ISPC
is built on top of the LLVM compiler framework, leveraging its
modular architecture and powerful code generation backend sup-
porting a multitude of hardware platforms. We chose ISPC for its
performance portability and support for all major vector extensions
on x86 (SSE4.2, AVX, AVX2, AVX-512), ARM NEON and NVIDIA
GPUs (using NVPTX) giving us the ability to generate optimized
SIMD code for all major computing platforms. Since the standard
library of ISPC is lacking double precision implementations for
several transcendental functions such as exp, we provide our own
implementations based on VDT [40]. Measurements of our ISPC im-
plementation of exp show on-par performance with state-of-the-art
mathematical libraries such as Intel’s SVML.
Kumbhar P., et al.
Target Language Hardware Platform Features
C++ CPUs x86 SIMD using ivdep
C++ + OpenMP CPUs multithreading
C++ + OpenACC CPUs, GPUs accelerators
ISPC x86, ARM native SIMD
NVIDIA GPGPUs native SPMD
CUDA NVIDIA GPGPUs native SPMD
Table 1: Summary of supported target platforms by NMODL
Framework. In addition to the standard C++ target, ISPC
is used to target various hardware platforms with optimal
SIMD performance
We have, furthermore, extended the C++ target with an OpenMP
and an OpenACC. These two code generators emit code that is
largely identical to the C++ code generator but add appropriate
pragma annotations to support OpenMP shared-memory paral-
lelism andOpenACCGPU acceleration. Finally, our code-generation
framework supports CUDA as a main backend to target NVIDIA
GPUs. We were able to integrate NMODL as a code generation
backend for CoreNEURON with only a few minor modifications to
the build system. The CoreNEURON build process calls NMODL to
generate C++/ISPC/OpenACC/CUDA files, which are subsequently
compiled and linked to the simulator binary using the appropriate
compilers. We can thus use NMODL as a drop-in replacement for
the current mod2c transpiler, allowing us to easily perform numeri-
cal validation and performance benchmarking. The OpenACC and
CUDA backend targeting GPUs require changes in CoreNEURON
and this is an ongoing effort.
4.5 AST to NMODL Transformation
Various optimization passes discussed in subsection 4.3 are per-
formed at the DSL level by keeping the languange semantics in-
tact as part of the AST. We have implemented a AST-to-NMODL
transformation compiler pass to convert optimized AST back to
the NMODL DSL form. This is not only useful to track the AST
transformations during development/debugging but also to gen-
erate optimized NMODL DSL that can be consumed by existing
transpilers like nocmodl and mod2c which lack such optimization
capabilities.
5 ODE SOLVERS
5.1 Overview
NMODL allows the user to specify the equations that define the
system to be simulated in a variety of ways.
• The KINETIC block describes the system using a mass action
kinetic scheme of reaction equations.
• The DERIVATIVE block specifies a system of coupled ODEs
(note that any kinetic scheme can also be written as an equiv-
alent system of ODEs.)
• Users can also specify systems of algebraic equations to
be solved. The LINEAR and NONLINEAR blocks respectively
specify systems of linear and nonlinear algebraic equations
(applying a numerical integration scheme to a system of
ODEs typically also results in a system of algebraic equations
to solve.)
To reduce duplication of functionality for dealing with these
related systems of equations, we implemented a hierarchy of trans-
formations as shown in Figure 6. First, any KINETIC blocks of mass
action kinetic reaction statements are translated to DERIVATIVE
blocks of the equivalent ODE system. Linear and independent ODEs
are solved analytically. Otherwise a numerical integration scheme
such as implicit Euler is used which results in a system of algebraic
equations equivalent to a LINEAR or NONLINEAR block. If the system
is linear and small, it is solved analytically at compile time using
symbolic Gaussian elimination. Optionally, Common Subexpression
Elimination (CSE) [41] can then be applied.
If the system is linear and large, it is solved (at run time) using
a lower–upper (LU) matrix decomposition. Non-linear systems of
equations are solved at run time by Newton iteration, which makes
use of the analytic Jacobian calculated at compile time.
The use of external library solvers for analytic and numerical
ODEs offers far superior efficiency, precision, and simplicity com-
pared to the legacy implementation. The analytic ODE solver uses
SymPy [42], a Python library for symbolic calculations, which can
simplify, differentiate and integrate symbolic mathematical expres-
sions. Our analytical solver replaces the purely numerical approach
used in other NMODL source-to-source compilers and simulators.
It allows us to perform some computations analytically at com-
pile time that were previously carried out at run time at each time
step using approximate numerical differentiation. This increases
both the numerical accuracy and performance of simulations. The
numerical ODE solver uses the Eigen [43] numerical linear alge-
bra C++ template library, which produces highly optimized and
vectorized routines for solving systems of linear equations.
5.2 SymPy
We use the SymPy library to perform algebraic simplifications,
differentiate and integrate expressions, and identify common sub
expressions. Performing these operations analytically and exactly
at compile time removes a source of numerical errors and increases
the overall run time performance.
Linear and independent ODEs have been typically replaced by
an analytic solution that treats the coefficients as constant over
a time step. NMODL increases the runtime performance by per-
forming algebraic simplification and optionally replacing compu-
tationally expensive exponential calculations with the (1, 1) Pade
approximant [44], consistent with the overall second order correct
simulation accuracy (as suggested in [45], and implemented in [26]).
For coupledODEs, the implicit Euler numerical integration scheme
is applied which results in a set of simultaneous algebraic equations.
For a linear systems of equations, such as those that arise from a set
of linear reaction equations in a KINETIC block, the sparse solver
method is used. For non-linear systems, the derivimplicit solver
method is used.
The sparse solver chooses from two solution methods, depend-
ing on the size of the system to be solved. For small systems (three
or less equations), the system is solved by symbolic Gaussian elimi-
nation at compile time. Additionally the solver will also optionally
An optimizing multi-platform source-to-source compiler framework for the NEURON MODeling Language
NMODL
KINETIC
Visitor LINEAR
NONLINEAR
DERIVATIVE
N ≤ 3
CSE
sparse
SymPy Based Code 
Generation
Run-Time
Linear System
Compile-Time
Analytical Solution
Newton Iteration
Abstract 
Syntax Tree
cnexp/euler
derivimplicit
Equations solved 
analytically at compile time
Analytic Jacobian for Newton 
solver constructed at compile time
1   KINETIC kin {
2     ~ x <-> y (a,b)
3     ~ z << (c)
4   }
5   
6   DERIVATIVE kin {
7     x' = (-1*(a*x-b*y))
8     y' = (1*(a*x-b*y))
9     z' = (c)
10  }
11  
12  NRN_STATE SOLVE kin METHOD sparse {
13    LOCAL old_x, old_y, old_z, tmp0, tmp1, tmp2
14    old_x = x
15    old_y = y
16    old_z = z
17    tmp0 = a*dt
18    tmp1 = b*dt
19    tmp2 = 1/(tmp0+tmp1+1)
20    x = tmp2*(old_x*tmp1+old_x+old_y*tmp1)
21    y = tmp2*(old_x*tmp0+old_y*tmp0+old_y)
22    z = c*dt+old_z
23  }
KINETIC block as input
KINETIC visitor pass 
transforms it to 
DERIVATIVE 
Intermediate code in the AST 
using solution from SymPy 
Figure 6: On the left, unified ODE solver workflow showing ODEs from different NMODL constructs either produces compile–
time analytical solutions, or run–time numerical solutions. On the right, example of KINETIC block and its transformation to
SymPy based solution.
perform CSE. Figure 6 contains a simple example of this process.
For larger systems, performing Gaussian elimination at compile
time may not result in a numerically stable solution. Instead we con-
struct a linear system of equations, which can be solved numerically
using Eigen.
The derivimplicit solver constructs a system of non–linear
equations, which we solve using Newton’s method at run time. We
therefore compute the system’s Jacobian, which is then used in
the iterative solver. The existing solver in NEURON calculates the
Jacobian by numerical approximation at run time. Using Sympy
we are, however, able to analytically differentiate the system of
non–linear equations to construct the exact Jacobian at compile
time. By using the exact Jacobian we are able to reduce the number
of Newton solver iterations required, as well as reduce the time per
iteration.
A further improvement which makes use of SymPy involves the
conductance, which is the derivative with respect to voltage of a
current. By default in NEURON these derivatives are numerically
approximated at run time. This introduces an additional numer-
ical error and generally doubles the computation time. To avoid
this issue, the CONDUCTANCE keyword was previously introduced
in NMODL [46], which allows the user to manually indicate the
presence of an ASSIGNED variable that represents the current de-
rivative. Since the AST can be analyzed to determine that a con-
ductance calculation needs to be performed, our implementation
automatically generate an appropriate local conductance variable
and CONDUCTANCE statement to further reduce unnecessary run
time computation.
5.3 Eigen
To solve linear systems of equationswe use the Eigen PartialPivLU()
routine [47], which performs LU decomposition with partial pivot-
ing, a fast and stable algorithm for square invertible matrices.
For non–linear systems, we offer a simple implementation of
Newton’s method, which is an iterative solver that requires a linear
system to be solved at each iteration. For large system matrices this
linear solve also uses PartialPivLU(). For 4×4 or smaller matrices,
however, we directly invert the matrix using the invert() routine.
For such small matrices this is numerically safe. Additionally this
method does not involve any pivoting and hence does not incur
any branching, which leads to significantly faster code than LU
decompositions.
In general, using Eigen routines in favor of our own implemen-
tations provides two main benefits. First, the code becomes more
maintainable: often a long and complex routine can be replaced
with a single Eigen call, which renders the code easier to read and
maintain. Second, the code performance is improved: Eigen offers
highly tuned and parameterized routines and specialized imple-
mentations for different matrix sizes through generic programming.
Replicating or beating the performance offered by a specialized
numerical library is hardly practical and beyond the aim of this
work.
6 ANALYSIS CAPABILITIES
An important goal of the NMODL Framework is to not only gener-
ate fast code but to also provide the computational neuroscientist
with a versatile and easy-to-use tool for introspecting and program-
matically modifying models written in NMODL DSL. Currently, the
NEURON NMODL database counts over 6,300 mechanism descrip-
tions contributed by a world wide community of researchers. Until
now, any meta-analysis of these NMODL files had to be done either
manually or using home-made tools. The problemwith such tools is
that their parsing capabilities are often fragile since they rely only
on regular expressions or simple manually implemented parsers.
In contrast, the NMODL Framework features a robust parser along
with a complete intermediate representation and a powerful visitor
framework. Using the AST and visitor classes, a programmer can
do any analysis or modification they wish. The downside is that us-
ing the framework requires knowledge of modern C++ and ideally
some experience with compiler frameworks. We therefore provide
Kumbhar P., et al.
A) State Update B) Current Update C) State Update - AMMPANMDA
ADD: Addition CM-R : Constant Memory Read GM-R/W : Global Memory Read/WriteDIV: Division MUL: Multiplication SUB: Subttraction exp: exponential
Figure 7: Performance characterization of different mechanisms using high level AST analysis from NMODL Framework : A)
and B) shows global view of different operations performed by State Update and Current Update kernels respectively and in
C) for AMMPANMDA one can zoom into a specific mechanism to understand detailed performance characteristics
1   import nmodl.dsl as nmodl
2   from nmodl import ast, visitor
3
4   class OpVisitor(visitor.AstVisitor):
5     def __init__(self):
6       self.ops = {'+':0, '-': 0, '*': 0, '/': 0}
7       super().__init__()
8
9     def visit_binary_operator(self, node):
10      op = node.eval()
11      if op in self.ops:
12        self.ops[op] += 1
13
14  def analyze(nmodl_string):
15    driver = nmodl.NmodlDriver()
16    nmodl_ast = driver.parse_string(nmodl_string)
17    return OpVisitor().visit_program(nmodl_ast).ops
Figure 8: A simple example of a user-defined AST visitor to
count FLOPs showing how C++ visitors can be extended us-
ing Python API
Python bindings to all AST classes and the visitor framework us-
ing the pybind11 library [48]. Additionally, we expose the ODE
solver tools described in section 5 in the Python API. This gives
the user access to all the ODE solving and simplification function-
ality available within the framework. When combined with SymPy,
this yields a comprehensive analysis framework for mechanism
dynamics described in the NMODL DSL.
Figure 8 shows an example of a simple AST visitor that keeps
track of binary arithmetic operator counts. The programm is self
explanatory and shows how the C++ AST and visitor classes are
exposed in Python and can not only be instantiated, but also sub-
classed and extended. Using these capabilities, we can build rich
analysis tools to introspect various aspects of the models. For exam-
ple, Figure 7 shows a Bipartite graph with theoretical performance
characteristics of different mechanisms from benchmarks discussed
in section 7. Figure 7A shows that the State Update kernels have
more high latency operations like division and exponential (compute
bound) compared to the more memory and bandwidth bound Cur-
rent Update kernels of Figure 7B. This has been previously shown
with hardware counter analysis in other studies [46, 49]. The de-
tailed validation of these theoretical performance metrics is beyond
the scope of this paper but, as discussed in section 8, these high
level performance characteristics can be directly correlated with
the observed performance.
7 BENCHMARKS
To evaluate the achieved performance gains through NMODL, we
have performed comprehensive benchmarks on three major produc-
tion hardware platforms summarised in Table2. Since in this work
we are primarily interested in the on-node performance increase
from the generated code, we focus on single node benchmarks. We
utilize all cores by pinning one MPI process on each core. From
An optimizing multi-platform source-to-source compiler framework for the NEURON MODeling Language
0 5 10 15 20 25
cur-AMPANMDA
cur-GABAAB
cur-kca
cur-can
cur-cal
cur-cat
cur-Ca_HVA2
cur-Ca_LVAst
cur-cacum
cur-cacumb
cur-SKv3_1
cur-Ih
state-AMPANMDA
state-GABAAB
state-kca
state-can
state-cal
state-cat
state-Ca_HVA2
state-Ca_LVAst
state-cacum
state-cacumb
state-SKv3_1
state-Ih CPU
0 5 10 15 20 25 30
KNL
0.0 2.5 5.0 7.5 10.0 12.5
AMD
SpeedUp
Figure 9: Speedup of different channels selected from neocortex and hippocampus circuits on Intel Skylake, Intel KNL and
AMD Naples platform
Intel Skylake
Processor 2 Xeon 6140, 36 cores @2.3GHz, 384GB DRAM
Toolchain Intel 2018.1 and HPE-MPI (MPT)
Intel KNL
Processor Xeon Phi (7230), 64 cores @1.3GHz, 16GB MCDRAM, 96GB DRAM
Toolchain Intel 2018.1 and HPE-MPI (MPT)
AMD Naples
Processor 2 AMD EPYC 7451, 24 cores @2.3GHz, 128GB DRAM
Toolchain GCC 7.3.0, Intel MPI 2019
Table 2: Details of Benchmarking Systems
previous work on the parallel scalability of NEURON and CoreNEU-
RON, those single node improvements will translate directly into
equivalent improvements for parallel execution.
Benchmarks performed on the Intel platforms were compiled
with Intel Parallel Studio 2018.1, while those on the AMD platform
were compiled with GCC 7.3.0. All benchmarks have been com-
piled with -O2 -xHost and -O2 flags respectively. Note that these
benchmarks are not meant to compare one platform with other
and hence we did not investigate the full spectrum of hardware
capabilities on each platform (e.g. hyperthreading).
We selected two brain tissue models: a somatosensory microcir-
cuit and a hippocampus region model. The first, a somatosensory
cortex microcircuit of a young rat published by the Blue Brain
Project has 55 layer-specific morphological types and 207 morpho-
electrical types [1]. The second, a model of a rat hippocampus
CA1 [17] is built as part of the European Human Brain Project and
has 13 morphological types and 17 morpho-electrical types. These
models are selected because they are computationally expensive
and have a large number of mechanisms which allow us to assess
performance benefits for different types of kernels used in produc-
tion simulations. Based on these two models, we presented results
for two benchmarks:
Channel Benchmark : This benchmark consists of 21 different
morpho-electrical types selected to include all the unique mecha-
nisms from somatosensory cortex and hippocampus CA1 models. A
total of 2,304 cells are created without network connectivity as this
benchmark is designed to measure performance of code generation
for individual mechanisms.
Simulation Benchmark : This benchmark measures the over-
all performance improvement for production simulations. We used
subcircuits of 1,000 cells from the two models above simulating one
second of biological time using a timestep of 0.025 ms.
8 RESULTS
In order to understand the impact of the various optimizations
and performance of generated code, we implemented a first draft
of code generation backend for the CoreNEURON library. The
same code generation backend can be adapted for the original
NEURON simulator in the future. As the CoreNEURON library is a
compute engine executed under the NEURON simulator producing
binary identical results, the performance results can be directly
compared betweenNEURON and CoreNEURON.More details about
CoreNEURON library can be found in Kumbhar et al., 2010 [6].
We have performed benchmarks for a representative set of NMODL
mechanisms as described in section 7.
For calculating the speedup, we compare the runtimes of the
mechanisms translated using NMODL Framework with ISPC back-
end and simulated using CoreNEURON with the same mechanisms
compiled with the NEURON simulator using nocmodl. The results
of these benchmarks are shown in Figure 9. We restrict ourselves
Kumbhar P., et al.
1   void nrn_state(.....) {
2      for (int id = start; id < end; id++)  {
5       rates(.....);
6       m[id] = m[id]+(1.0-exp(dt*((((-1.0)))/mTau[id])))*
7               (-(((mInf[id]))/mTau[id])/((((-1.0)))/mTau[id])-m[id]);
8       h[id] = h[id]+(1.0-exp(dt*((((-1.0)))/hTau[id])))*
9               (-(((hInf[id]))/hTau[id])/((((-1.0)))/hTau[id])-h[id]);
10    }
11  }
12  void rates(.....) {
13      mAlpha[id] = (0.055*(-27.0-v))/(exp((-27.0-v)/3.8)-1.0);
14      mBeta[id] = (0.9399999999999999*exp((-75.0-v)/17.0));
15      mInf[id] = mAlpha[id]/(mAlpha[id]+mBeta[id]);
16      mTau[id] = 1.0/(mAlpha[id]+mBeta[id]);
17      v = v+12.3;
18      hAlpha[id] = (0.000457*exp((-13.0-v)/50.0));
19      hBeta[id] = (0.0065/(exp((-v-15.0)/28.0)+1.0));
20      hInf[id] = hAlpha[id]/(hAlpha[id]+hBeta[id]);
21      hTau[id] = 1.0/(hAlpha[id]+hBeta[id]);
22  }
25  export void nrn_state(.....) {
27    int uniform start = 0;
28    int uniform end = nodecount;
29    foreach (id = start ... end)  {
32      double mInf, mTau, mAlpha, mBeta, hInf, hTau, hAlpha, hBeta;
       {
34        mAlpha = (0.055*(-27.0-v))/(exp((-27.0-v)/3.8)-1.0);
35        mBeta = (0.9399999999999999*exp((-75.0-v)/17.0));
36        mInf = mAlpha/(mAlpha+mBeta);
37        mTau = 1.0/(mAlpha+mBeta);
38        v = v+12.3;
39        hAlpha = (0.000457*exp((-13.0-v)/50.0));
40        hBeta = (0.0065/(exp((-v-15.0)/28.0)+1.0));
41        hInf = hAlpha/(hAlpha+hBeta);
42        hTau = 1.0/(hAlpha+hBeta);
43        v = v-12.3;
44      }
45      m[id] = mInf-(-m[id]+mInf)*exp(-dt/mTau);
46      h[id] = hInf-(-h[id]+hInf)*exp(-dt/hTau);
47    }
48  }
strength reduction
inlined function
localizer : memory optimisation
Figure 10: Comparison of unoptimized and optimized code (from calcium mechanism, simplified): On the left, nrn_state ker-
nel generated from DERIVATIVE block calling rates function as generated by nocmodl andmod2c. On the right, corresponding
ISPC code generated by the NMODL Framework after the DSL optimizations and SymPy transformations described in subsec-
tion 4.3 and section 5.
here to the 12 most expensive channels and highlight timing for the
two main computational parts: the State Update (denoted as state-
channel-name) and the Current Update (denoted as cur-channel-
name). The three columns correspond to the three tested hardware
platforms. Finally, we show as dotted lines the average speedups
achieved for the shown channels. Generally we observe a higher
speedup on State Update kernels than on Current Update kernels
(e.g. on Intel Skylake 9.6× vs. 5.5×). This is due to the fact that
State Update kernels are typically computationally more expensive,
with a higher FLOP per byte ratio than for Current Update kernels
(see section 6). When comparing the three different benchmark
platforms, we notice that on KNL, and particularly for State Update
kernels, the best results are achieved. We attribute this to the rather
poor performance of the nocmodl code generation backend with
NEURON instead of particularly higher-than-expected performance
of the here-presented framework. Finally, we notice that especially
in the State Update kernel the availability of AVX-512 vector units,
with optimal memory layout offers a performance advantage as
can be seen in the higher performance of the two Intel platforms
compared with the AMD EPYC platform, which only offers AVX2.
When looking at top performers we notice that several of the
high-level optimizations described in sections 4.3 and 5 are at least
equally if not more important than the generation of vectorized
code. We observe that particularly our optimizations on the ODE
statements using SymPy based solvers (e.g. state-cacum) can lead
to speedups of more than 20× (12× on AMD EPYC).
In Figure 10 we present an example State Update kernel generated
by the NMODL Framework. For brevity, we have removed some of
the unnecessary variables and initializations. nocmodl and mod2c
generate code equivalent to the one shown on the left but with
more complex data structures and an indirect addressing scheme. In
practice we have observed that auto-vectorization for these kernels
is highly dependent on the compiler and hardware platform and
cannot be relied upon. On the right side we show the generated
ISPC SPMD code. The code is compiled in a separate object file
and kernel routines must be denoted with export to be callable
from the corresponding C wrappers. Furthermore, we note the
Intel Skylake Intel KNL
0
2
4
6
8
10
Sp
ee
dU
p
Hippocampus CN-NMODL
Hippocampus CN-MOD2C
Plasticity CN-NMODL
Plasticity CN-MOD2C
Figure 11: Speedup ofCN-MOD2C andCN-NMODL compared
toNRN-NOCMODL for somatosensory cortex and hippocam-
pus CA1 models.
foreach keyword available in ISPC, allowing the loop body to be
executed in so-called gangs of vector-elements. At a higher level
this example illustrates the result of three NMODL Framework opti-
mization passes. First, the rates function has been inlined. Second,
the localizer pass transforms global variables to intermediate local
variables, drastically reducing memory requirements and memory
bandwidth. Third, the sympy pass provides analytic integration
further reducing the computational cost of the kernel. Several chan-
nels from benchmarks (e.g. CaT ) enjoy these optimizations and
show speedups of more than 6× on all benchmark platforms.
Figure 11 shows the speedup achieved for whole simulations for
the hippocampus CA1 and somatosensory cortex models. We com-
pared the performance with three different configurations. The first
configuration (NRN-NOCMODL) uses the NEURON simulator with
nocmodl as the code generation backend. The second configuration
(CN-MOD2C) uses the CoreNEURON library withMOD2C as a code
An optimizing multi-platform source-to-source compiler framework for the NEURON MODeling Language
Simulation Component Intel Skylake Intel KNL
NRN-NOCMODL CN-MOD2C CN-NMODL NRN-NOCMODL CN-MOD2C CN-NMODL
Hippocampus
State Update 1298.83 316.11 138.15 4786.48 713.7 233.64
Current Update 1102.56 239.84 165.53 2732.88 181.94 348.72
Other 154.13 44.08 40.78 828.94 236.11 203.24
Total 2555.52 600.04 344.46 8348.29 1131.74 785.6
Speedup wrt NEURON 1 4.26 7.42 1 7.38 10.63
Plasticity
State Update 199.8 33.17 25.41 846.5 63.77 43.86
Current Update 179.93 36.2 21.58 422.39 38.43 61.73
Other 47.71 19.1 17.12 310.93 89.43 78.49
Total 427.44 88.46 64.11 1579.82 191.63 184.08
Speedup wrt NEURON 1 4.83 6.67 1 8.24 8.58
Table 3: Absolute time(s) and speedup of the hippocampus and somatosensory cortex simulations on Intel Skylake and Intel
KNL platform using NEURON with nocmodl (NRN-NOCMODL), CoreNEURON with MOD2C (CN-MOD2C) and CoreNEURON
with NMODL Framework (CN-NMODL)
generation backend. The third configuration (CN-NMODL) uses the
CoreNEURON library with the here-presented NMODL Framework
as a code generation backend.
The NMODL Framework shows up to 7× speedup on Skylake
and up to 11× speedup on the KNL platform. The hippocampus
model shows a larger speedup compared to the somatosensory
cortex model because it uses cacum, cacumb and kca mechanisms
with the derivimplicit integration scheme. The Eigen based solver
implementation in NMODL brings additional performance improve-
ments. When compared with CN-MOD2C, CN-NMODL shows up to
2× performance improvement. Note that the CN-MOD2C is heavily
dependent on auto-vectorization capabilities of the compiler. For
example, if the GCC compiler is used instead of Intel, CN-NMODL
becomes 5× faster compared to CN-MOD2C. The breakdown of the
execution times is shown in Table 3. As discussed before, State Up-
date and Current Update represent the time spent in the execution of
code generated from NMODL DSL and the rest of the time is shown
as Other. On the Intel KNL platform the Current Update kernel is
∼2× slower in CN-NMODL comapred to CN-MOD2C. This is due to
a performance issue found with ISPC when atomic reductions are
used in the kernel. This performance issue will be addressed in a
future release of NMODL Framework.
9 CONCLUSIONS
Having real-world scientific applications make efficient use of mod-
ern computer architectures’ performance features is an involved
task and heavily relies on the successful combination of using
optimized libraries, auto-vectorizing compilers, and exposure of
parallelism and hints by the programmer. Good performance on one
architecture does not imply automatically good performance on
the next architecture. In times of increasing architectural diversity
this poses real challenges.
Additionally, many scientific applications do not encode only a
single mathematical problem, but the scientific users provide the
mathematical equations that need to be integrated by the solvers on
a case by case basis. In the worst case, this can severely impact the
success of auto-vectorization. In the best case, the way the users
express their specific equations, e.g. through DSLs may help in
producing optimized code.
In this paper we presented a novel NMODL Code Generation
Framework for the DSL of the widely used NEURON simulator.
The DSL constructs are translated into an AST that lends itself
to specific optimization passes before it is handed off to different
backends for generation of optimized code for the target platform.
We have implemented optimization passes that relate to straight-
forward transformation of the DSL code, but also more advanced
optimization passes that intercept ODE statements for which an
analytical solution can be used instead of having to resort to nu-
merical integration. This functionality is built on top of the SymPy
and Eigen libraries.
For code generation we have developed backends for C++ and
OpenMP targeting CPUs and ISPC to target a wide variety of CPU
architectures providing optimal SIMD performance and reducing
the dependency on auto-vectorization capabilities of the compiler.
Furthermore, we have developed both a CUDA backend specifically
with NVidia GPUs in mind as well as a more generic OpenACC.
Both GPU backends will, however, require more integration work
with the simulators and benchmarking.
We have benchmarked kernels from production simulations of
two different large-scale brain tissue models on Intel SKL, Intel KNL
and AMD EPYC platforms. On those individual kernels, we saw
performance improvements from 5× to 20×. In order to test how
those kernel improvements translate into speedup of the entire
simulations (which use the kernels in different ratios or not at
all), we benchmarked production simulations on Intel KNL and
Intel SKL platforms. Compared to the regular NEURON simulation
environment, a speedup of 6 − 10× has been observed. Compared
to an optimized version of the NEURON simulator, CoreNEURON,
which heavily relies on auto-vectorization of the compiler, the work
presented here nonetheless resulted in a speedup of up to 2×.
Beyond the absolute performance, a central goal of our effort
was the ability to parse all previously published models. By using
the grammar specification from the original NEURON NMODL
language, we were able to demonstrate compatibility with 6,370
Kumbhar P., et al.
channels from the public model repository ModelDB. We further-
more took care to maintain the language semantics of the DSL in
the AST, allowing retranslation of the AST optimization into the
DSL constructs, thus making the optimizations available to the
regular NEURON simulator without having to rely on our code gen-
eration backends (albeit with reduced overall speedup). Lastly, our
framework is exposed through a Python interface, providing great
flexibility to use NMODL as a generic NMODL parsing framework
and build new tools on top of it.
Availability
NMODL Framework is open sourced and available on GitHub[50].
ACKNOWLEDGMENTS
This work has been funded by the EPFL Blue Brain Project (funded
by the Swiss ETH board), NIH grant number R01NS11613 (Yale
University) and partially funded by the European Union’s Horizon
2020 Framework Programme for Research and Innovation under
Grant Agreement number 785907 (Human Brain Project SGA2).
We would like to thank Antonio Bellotta, Francesco Cremonesi,
Ioannis Magkanaris, Matthias Wolf, Samuel Melchior and Tristan
Carel for fruitful discussions and their contributions to the NMODL
development. The AMD system for benchmarking was provided
by Erlangen Regional Computing Center (RRZE).
REFERENCES
[1] H. Markram, E. Muller, S. Ramaswamy, M. W. Reimann, M. Abdellah, C. A.
Sanchez, A. Ailamaki, L. Alonso-Nanclares, N. Antille, S. Arsever, et al., “Re-
construction and simulation of neocortical microcircuitry,” Cell, vol. 163, no. 2,
pp. 456–492, 2015.
[2] A. Arkhipov, N. W. Gouwens, Y. N. Billeh, S. Gratiy, R. Iyer, Z. Wei, Z. Xu,
R. Abbasi-Asl, J. Berg, M. Buice, N. Cain, N. da Costa, S. de Vries, D. Denman,
S. Durand, D. Feng, T. Jarsky, J. A. A. Lecoq, B. Lee, L. Li, S. Mihalas, G. K. Ocker,
S. R. Olsen, R. C. Reid, G. Soler-Llavina, S. A. Sorensen, Q. Wang, J. Waters,
M. Scanziani, and C. Koch, “Visual physiology of the layer 4 cortical circuit in
silico,” PLOS Computational Biology, vol. 14, pp. 1–47, Nov. 2018.
[3] M. Schmidt, R. Bakker, C. C. Hilgetag, M. Diesmann, and S. J. van Albada, “Multi-
scale account of the network structure of macaque visual cortex,” Brain Structure
and Function, Nov. 2017.
[4] M. Migliore, C. Cannia, W. W. Lytton, H. Markram, and M. L. Hines, “Parallel
network simulations with NEURON,” Journal of Computational Neuroscience,
vol. 21, pp. 119–129, Oct. 2006.
[5] M. Hines, “Comparison of neuronal spike exchange methods on a Blue Gene/P
supercomputer,” Frontiers in Computational Neuroscience, vol. 5, 2011.
[6] P. Kumbhar, M. Hines, J. Fouriaux, A. Ovcharenko, J. King, F. Delalondre, and
F. Schürmann, “CoreNEURON : An Optimized Compute Engine for the NEURON
Simulator,” arXiv:1901.10975 [q-bio], Jan. 2019.
[7] A. Morrison, C. Mehring, T. Geisel, A. Aertsen, and M. Diesmann, “Advanc-
ing the Boundaries of High-Connectivity Network Simulation with Distributed
Computing,” Neural Computation, vol. 17, pp. 1776–1801, Aug. 2005.
[8] S. Kunkel, M. Schmidt, J. M. Eppler, H. E. Plesser, G. Masumoto, J. Igarashi, S. Ishii,
T. Fukai, A. Morrison, M. Diesmann, and M. Helias, “Spiking network simulation
code for petascale computers,” Frontiers in Neuroinformatics, vol. 8, Oct. 2014.
[9] J. Jordan, T. Ippen, M. Helias, I. Kitayama, M. Sato, J. Igarashi, M. Diesmann, and
S. Kunkel, “Corrigendum: Extremely Scalable Spiking Neuronal Network Simula-
tion Code: From Laptops to Exascale Computers,” Frontiers in Neuroinformatics,
vol. 12, p. 34, July 2018.
[10] A. Deutsch, J. Starruß, L. Brusch, and W. de Back, “Morpheus: A user-friendly
modeling environment for multiscale and multicellular systems biology,” Bioin-
formatics, vol. 30, pp. 1331–1332, Jan. 2014.
[11] Z. DeVito, N. Joubert, F. Palacios, S. Oakley, M. Medina, M. Barrientos, E. Elsen,
F. Ham, A. Aiken, K. Duraisamy, E. Darve, J. Alonso, and P. Hanrahan, “Liszt: A
Domain Specific Language for Building Portable Mesh-based PDE Solvers,” in
Proceedings of 2011 International Conference for High Performance Computing,
Networking, Storage and Analysis, SC ’11, pp. 9:1–9:12, ACM, 2011.
[12] F. Rathgeber, G. R. Markall, L. Mitchell, N. Loriant, D. A. Ham, C. Bertolli, and
P. H. J. Kelly, “PyOP2: A High-Level Framework for Performance-Portable Simu-
lations on Unstructured Meshes,” in Proceedings of the 2012 SC Companion: High
Performance Computing, Networking Storage and Analysis, SCC ’12, pp. 1116–1123,
IEEE Computer Society, 2012.
[13] C. Schmitt, S. Kuckuk, F. Hannig, H. Köstler, and J. Teich, “ExaSlang: A Domain-
Specific Language for Highly Scalable Multigrid Solvers,” in 2014 Fourth Interna-
tional Workshop on Domain-Specific Languages and High-Level Frameworks for
High Performance Computing, pp. 42–51, Nov. 2014. ISSN:.
[14] R. Membarth, F. Hannig, J. Teich, and H. Kostler, “Towards Domain-Specific
Computing for Stencil Codes in HPC,” in Proceedings of the 2012 SC Companion:
High Performance Computing, Networking Storage and Analysis, SCC ’12, pp. 1133–
1138, IEEE Computer Society, 2012.
[15] M. L. Hines and N. T. Carnevale, “The NEURON simulation environment.,” Neural
computation, 1997.
[16] M. L. Hines andN. T. Carnevale, “Expanding NEURON’s repertoire of mechanisms
with NMODL,” Neural computation, vol. 12, pp. 995–1007, 2000.
[17] H. B. P. , “Community Models of Hippocampus.”
https://www.humanbrainproject.eu/en/brain-simulation/hippocampus/.
[18] A. Ovcharenko, P. Kumbhar, M. Hines, F. Cremonesi, T. Ewart, S. Yates, F. Schuer-
mann, and F. Delalondre, “Simulating Morphologically Detailed Neuronal Net-
works at Extreme Scale,” Advances in Parallel Computing, 2015.
[19] P. M. Nadkarni, G. M. Shepherd, P. L. Miller, M. D. Healy, and B. E. Peterson,
“ModelDB: An Environment for Running and Storing Computational Models
and Their Results Applied to Neuroscience,” Journal of the American Medical
Informatics Association, vol. 3, pp. 389–398, Nov. 1996.
[20] M. Hines, “Nocmodl - NEURON Simulator,” 1989.
[21] V. Paxson, “Flex : Scanner generator for lexing in C and C++.”
https://github.com/westes/flex, 1987.
[22] R. Corbett, “Bison : Parser Generator.” https://www.gnu.org/software/bison/,
1985.
[23] B. B. P. , “MOD2C - converter for mod files to C code.”
http://github.com/BlueBrain/mod2c, 2015.
[24] B. Marin, “Pynmodl: Python infrastructure for parsing and generating code from
NMODL,” 2018.
[25] R. C. Cannon, P. Gleeson, S. Crook, G. Ganapathy, B. Marin, E. Piasini, and
R. A. Silver, “LEMS: A language for expressing complex biological models in
concise and hierarchical form and its use in underpinning NeuroML 2,” Frontiers
in Neuroinformatics, vol. 8, Sept. 2014.
[26] N. A. Akar, B. Cumming, V. Karakasis, A. Küsters, W. Klijn, A. Peyser, and S. Yates,
“Arbor – amorphologically-detailed neural network simulation library for contem-
porary high-performance computing architectures,” in Eprint arXiv:1901.07454,
2019.
[27] I. Blundell, R. Brette, T. A. Cleland, T. G. Close, D. Coca, A. P. Davison, S. Diaz-Pier,
C. Fernandez Musoles, P. Gleeson, D. F. M. Goodman, M. Hines, M. W. Hopkins,
P. Kumbhar, D. R. Lester, B. Marin, A. Morrison, E. Müller, T. Nowotny, A. Peyser,
D. Plotnikov, P. Richmond, A. Rowley, B. Rumpe, M. Stimberg, A. B. Stokes,
A. Tomkins, G. Trensch, M. Woodman, and J. M. Eppler, “Code Generation in
Computational Neuroscience: A Review of Tools and Techniques,” Frontiers in
Neuroinformatics, vol. 12, 2018.
[28] B. Magalhaes, “Exploiting Implicit Flow Graph of System of ODEs to Acceler-
ate the Simulation of Neural Networks,” in International Parallel & Distributed
Processing Symposium, (Rio de Janeiro, Brazil), 2019.
[29] S. Pankevich, “Reentrant parser using Flex and Bison.”
https://stanislaw.github.io/2016/03/29/reentrant-parser-using-flex-and-
bison.html, 2016.
[30] B. Ingerson, “YAML : YAML Ain’t Markup Language.” https://yaml.org, 2001.
[31] D. Lord, “The Jinja2 template engine.” http://jinja.pocoo.org/, 2008.
[32] R. Pivak, “Scopes and Scoped Symbol Tables.” https://ruslanspivak.com/lsbasi-
part14/#scopes-and-scoped-symbol-tables, 2016.
[33] M. N. Wegman and F. K. Zadeck, “Constant propagation with conditional
branches,” ACM Transactions on Programming Languages and Systems, vol. 13,
pp. 181–210, Apr. 1991.
[34] D. Das, “Function inlining versus function cloning,”ACM SIGPLANNotices, vol. 38,
p. 23, June 2003.
[35] V. Sarkar, “Optimized Unrolling of Nested Loops,” International Journal of Parallel
Programming, vol. 29, no. 5, pp. 545–581, 2001.
[36] C. Lattner and V. Adve, “LLVM: A Compilation Framework for Lifelong Program
Analysis & Transformation,” in Proceedings of the International Symposium on
Code Generation and Optimization: Feedback-Directed and Runtime Optimization,
CGO ’04, pp. 75–, IEEE Computer Society, 2004.
[37] L. P. , “LLVM’s Analysis and Transform Passes,” 2019.
[38] K. Kennedy, “Use-definition chains with applications,” Computer Languages, vol. 3,
pp. 163–179, Jan. 1978.
[39] M. Pharr and W. R. Mark, “Ispc: A SPMD compiler for high-performance CPU
programming,” in 2012 Innovative Parallel Computing (InPar), (San Jose, CA, USA),
pp. 1–13, IEEE, May 2012.
[40] D. Piparo, V. Innocente, and T. Hauth, “Speeding up HEP experiment software
with a library of fast and auto-vectorisable mathematical functions,” Journal of
Physics: Conference Series, vol. 513, p. 052027, June 2014.
An optimizing multi-platform source-to-source compiler framework for the NEURON MODeling Language
[41] J. Cocke, “Global common subexpression elimination,” ACM SIGPLAN Notices,
vol. 5, pp. 20–24, July 1970.
[42] A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, A. Ku-
mar, S. Ivanov, J. K. Moore, S. Singh, T. Rathnayake, S. Vig, B. E. Granger, R. P.
Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M. J. Curry, A. R.
Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, and A. Scopatz,
“SymPy: Symbolic computing in Python,” PeerJ Computer Science, vol. 3, p. e103,
Jan. 2017.
[43] G. Guennebaud, B. Jacob, et al., “Eigen v3.” http://eigen.tuxfamily.org, 2010.
[44] W. H. Cambridge, ed., Padé Approximants (Section 5.12) : Numerical Recipes: The
Art of Scientific Computing. Cambridge, UK ; New York: Cambridge University
Press, 3rd ed ed., 2007. OCLC: ocn123285342.
[45] F. Casalegno, F. Cremonesi, S. Yates, M. L. Hines, F. Schürmann, and F. Delalondre,
“Error analysis and quantification in NEURON simulations,” in Proceedings of the
VII European Congress on Computational Methods in Applied Sciences and Engineer-
ing (ECCOMAS Congress 2016), (Crete Island, Greece), pp. 1366–1380, Institute
of Structural Analysis and Antiseismic Research School of Civil Engineering
National Technical University of Athens (NTUA) Greece, 2016.
[46] P. Kumbhar, M. Hines, A. Ovcharenko, D. A. Mallon, J. King, F. Sainz, F. Schür-
mann, and F. Delalondre, “Leveraging a Cluster-Booster Architecture for Brain-
Scale Simulations,” in High Performance Computing: 31st International Conference,
ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings,
pp. 363–380, Springer International Publishing, 2016.
[47] B. Jacob and G. Guennebaud, “PartialPivLU : LU
decomposition of a matrix with partial pivoting.”
https://eigen.tuxfamily.org/dox/classEigen_1_1PartialPivLU.html.
[48] W. Jakob, J. Rhinelander, and D. Moldovan, “Pybind11 – Seamless operability
between C++11 and Python,” 2017.
[49] F. Cremonesi, “Analytic Performance Modeling and Analysis of Detailed Neuron
Simulations,” arXiv:1901.05344, Jan. 2019.
[50] P. Kumbhar, “NMODL Framework.” https://github.com/BlueBrain/nmodl.
