Objective: The advent of High-Performance Computing (HPC) in recent years has led to its increasing use in brain studies through computational models. The scale and complexity of such models are constantly increasing, leading to challenging computational requirements. Even though modern HPC platforms can often deal with such challenges, the vast diversity of the modeling field does not allow a single acceleration (or homogeneous) platform to effectively address the complete array of modeling requirements. Approach: In this paper we propose and build BrainFrame, a heterogeneous acceleration platform incorporating three distinct acceleration technologies: a Dataflow Engine, a Xeon Phi and a GP-GPU. The PyNN framework is also integrated into the platform. As a challenging proof of concept, we analyze the performance of BrainFrame on different instances of a state-of-the-art neuron model, modeling the Inferior-Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley representation. The model instances take into account not only the neuronal-network dimensions but also different network-connectivity circumstances that can drastically change application workload characteristics. Main results: The combination of three HPC technologies demonstrated that BrainFrame is better able to cope with the modeling diversity encountered. Our performance analysis shows clearly that the model instance directly affects performance and that all three technologies are required to cope with all the model use cases. Significance: The BrainFrame framework is designed to transparently configure and select the appropriate back-end accelerator technology for each simulation run. The PyNN integration provides a familiar bridge to the vast array of modeling work already conducted. Additionally, it gives a clear roadmap for extending the platform support beyond the proof of concept, with improved usability and features directly useful to the computational-neuroscience community, paving the way for wider adoption.
Introduction
Through the efforts of biologists and computational neuroscientists in recent decades, advanced models of cortical neurons were developed using Spiking Neural Networks (SNNs) [1]. These models do not just abstractly mirror aspects of biological processes, like Artificial Neural Networks (ANNs), but directly emulate them. Information between nodes in SNNs is conveyed using spikes at the input of each node; the information is thus encoded not just in the firing rates of the spikes but also in other waveform characteristics such as spike amplitude. The more complex information encoding gives SNNs a greater computational capacity [2, 3] than ANNs and allows computational scientists to begin building biologically accurate models of brain subsystems, furthering their study. A greater understanding of brain functionality can lead to leaps in medical technology concerning brain disease and brain-rescue implant techniques, but also to leaps in engineering, concerning more refined Artificial Intelligence or even new non-von-Neumann computer architectures. As a result, SNNs are widely used in neuroscientific research to complement in-vivo and in-vitro experiments.
In-vivo and in-vitro experiments are a traditional tool of neuroscientific research. They are powerful experimentation methods, but they are also time-consuming and not always reliable. A number of factors can contaminate results, for example the influence of anesthesia in in-vivo experiments. What is more, most systemic phenomena require the monitoring of biological systems of very large scale, and many such techniques do not allow for this kind of study. Computational neuroscientists use SNNs to circumvent such issues. By incorporating SNN models of varied complexity (which are themselves derived from biological experiments), they create predictive simulators that can test their scientific hypotheses and drive more targeted, and thus more reliable and refined, biological experimentation [4].
A major challenge has to do with the sheer computational complexity that many SNN models involve, compared to less accurate modeling. Even the less complex types of models have significant demands as the studied neuronal network increases in size, both in terms of computation and in terms of data transfer or storage. Traditional methods of simulation, using tool-flows like MATLAB or specific neuro-modeling tools (e.g. NEURON or BRIAN), are not up to the task of simulating many large-scale or very accurate neural networks in practical amounts of time, limiting the benefits that SNNs can provide. High-Performance Computing (HPC) has recently been recognized as being able to provide a variety of solutions to cope with this limitation [5-10]. Unfortunately, the challenge of executing these simulation applications does not stop at providing the necessary computational power.
In scientific applications, such as neuronal simulations, the model's aspects have an immense effect on simulation performance. The variety of viable SNN models used in studies is significant. Every single type of model has scientific merit, depending on the subject under study, and models have significantly different characteristics when treated as computational workloads [9, 11]. Extra modeling features, like interconnection between neurons of varying density (the modeling of which also varies according to the biological system under study), can break the dataflow nature that most neuron models have, significantly changing the behavior of the application.
In addition to this, there are also two distinct general types of simulations that are relevant in computational neuroscience. The first has to do with very highly accurate (biophysically accurate and even accurate to the molecular level) modeling of generally smaller-sized networks, which requires real-time or close to real-time performance. These kinds of experiments can be used with artificial real-time set-ups or brain-machine interfaces (BMI) and are closely related to brain-rescue studies (TYPE I experiments). The second type revolves around the simulation of large or very large-scale networks (in which accuracy often needs to be sacrificed). These experiments attempt to simulate network sizes and connection densities closely resembling their biological counterparts (TYPE II experiments) [8, 12]. This, in combination with the variety of models commonly used, makes for a field that includes applications that vary greatly in terms of workload while also, depending on the case, requiring high throughput, low latency or both. A single type of HPC node, either software- or hardware-based, cannot cover all possible use cases with optimal efficiency.
A better approach is to provide scientists with an acceleration platform that has the ability to adjust, on the fly, to the aforementioned variety of workload use cases. A heterogeneous system that integrates multiple HPC technologies, instead of just one, would be able to provide this. In addition, a framework for a heterogeneous system using a common user interface for all integrated technologies can also provide the flexibility to easily shift the choice of accelerator, depending on the availability, cost and evolution of each available accelerator technology.
Such a heterogeneous platform must overcome additional challenges if it is to be used in the field. It requires a front-end which should provide two very important features:
• An easy-to-use and familiar interface through which neuroscientists can use the platform, without the constant mediation of an acceleration engineer.
• A front-end that can reuse the vast amount of models already present in the field.
Developing and executing experiments with SNN models is a very rigorous process since experimenting with the models presupposes their careful fitting to experimental data. The neuroscientist needs to be able to interface with the acceleration platform directly.
Having an engineer as a mediator between the scientist and the technology is standard practice today but incurs significant delays in the research process. Lastly, the ability to program the accelerator platform in familiar coding languages, and the lack of a need to redevelop what scientists are already using, are essential for wide adoption of the technology by the community.
In this paper we propose BrainFrame, a framework for a heterogeneous acceleration platform for computationally challenging neuroscientific simulations. Using this system, we demonstrate the effect of model characteristics on performance, thus making a concrete case for the significance of heterogeneity in an HPC system used for the computational-neuroscience field. To that end, we use a state-of-the-art, extended Hodgkin-Huxley neuron model of the Inferior-Olive Nucleus (InfOli) as a benchmark to evaluate the framework. We chose this model as a representative workload of biophysically meaningful neuron representations, as their efficient simulation poses a significant engineering challenge. We also evaluate three different instances of the workload, each differentiated by the presence and complexity of the neuron-interconnectivity modeling, leading to vastly different computational requirements while still reflecting plausible neuroscientific-experiment use cases. We propose a front-end for the framework based on the PyNN language. PyNN is widely used in the computational-neuroscience community and has direct integration with many other well-known neuron-modeling frameworks, covering both features that such a front-end requires.
Neuron Modeling Background
The best choice of SNN model depends on the subject of the study a modeler needs to conduct [13]. There are three main categories of SNNs (although not the only ones): (A) Integrate-and-Fire models, (B) Izhikevich models, and (C) conductance-based models.
The simplest version of SNNs are Integrate-and-Fire (I&F) models. They emulate the most basic operation of a biological neuron, which is the integration of spikes and firing using a threshold mechanism. From this most basic version, extensions are derived which add more features to the model's behavior, such as the Leaky I&F, I&F-or-Burst [14] and quadratic I&F [15]. I&F models have extremely low computational demands but also very limited biological plausibility. They are, thus, useful for exploring large-scale network dynamics in relation to the very basic features they can emulate. Izhikevich neurons [16] are a special type of model which, although featuring similar complexity to I&F models, emulates an impressive fraction of biological-neuron behavior. This model type boasts the ability to emulate most possible input/output spiking activity found in the biological neuron. Although it treats the neuron as a black box, its flexibility permits the creation of very accurate high-level representations of large-scale, biological-neural-network behavior.
If, on the other hand, a researcher seeks to explore the electrochemical characteristics that produce the neuron's response, they require a biophysically-meaningful neuron model, such as conductance-based models. These capture closely the electrochemical behavior that produces the neuron activity by modeling the various ion channels observed in biological neurons. The most prominent conductance-based model is the one originally presented by Hodgkin and Huxley (HH) in 1952 [16].
HH models, and their variants, make heavy use of differential equations and are quite scalable, making the design of multi-compartmental models possible (the term "compartment" is used for the distinct parts of an accurate, white-box neuron representation). The computational complexity of conductance-based models is orders of magnitude higher than that of the aforementioned types, posing a significant challenge for their efficient simulation. As a result, a number of simplifications exist with reduced features but also reduced complexity, such as the FitzHugh-Nagumo model [17].
Methods

The Inferior Olive
The Inferior-Olive Nucleus is an intricate part of the Olivocerebellar system which is one of the most dense brain regions and plays an important role in sensorimotor control. It does not initiate movement by itself but it does provide rhythm and coordination for motor functions. It is considered to be imperative for the instinctive learning and smooth completion of motor actions and the perception of rhythm [18] .
The Inferior Olive provides one of the two main inputs to the Olivocerebellar system through the so-called Climbing Fibers. The Inferior-Olive neurons are also heavily interconnected to one another through electrical connections called Gap Junctions (GJs). The gap junctions define the synchronization behavior between the Inferior-Olive cells and, subsequently, influence the synchronization and learning properties of the overall Olivocerebellar system [18].
The InfOli Workload
In this work a detailed Inferior Olive (InfOli, for short) model is considered, which was originally developed by De Gruijl et al. [4]. It is an extended-HH (eHH) representation of the inferior-olive cell. It implements a neuron with three distinct compartments: the dendrite, the soma and the axon. Within the dendrite, the model also includes gap junctions. It is these gap junctions that complicate the model further and lead to the term "extended" HH model. The dendrites represent the cell input stage, the soma is the cell part wherein most of the neural processing takes place, and the axon represents the cell output stage towards the climbing fibers (Figure 1b). The GJs are associated with important aspects of cell behavior, as they are not just simple connections; rather, they involve significant and intricate electrical processes, which is reflected in their software implementation. Every compartment includes a number of state parameters denoting its electrochemical state and the neuron state as a whole. The neuron states are updated at each simulation step; every new state update is based upon: the neuron state of the previous simulation step of the executed neuron, the previous dendritic states coming from the GJ connectivity, and the externally evoked input to the network, representing the input coming from the rest of the cerebellar circuit.
The three compartments and GJs are evaluated/updated concurrently at each simulation step. The model is calibrated with a simulation time step of δ = 50 µsec. This simulation step also defines real-time performance: every simulation step for the entirety of the network must be completed within 50 µsec for the execution to be considered real-time. Figure 1a depicts a representation of the InfOli network model. The GJs are part of the dendritic compartment, thus the compartment receives the extra input coming from the inter-neuron connections. The network model works in lock-step, computing discrete output axon values (with a 50-µsec time step) which, when aggregated in time, recreate the output response of the axon (Figure 1c). The InfOli network must be synchronized in order to guarantee the correct exchange of previous dendritic data within a step. Thus, the execution can only be parallelized in space (simultaneous execution of neurons within a step), but not in time (parallelization of simulation steps). The cells, even when not actively spiking, present oscillatory behavior, thus affecting network synchronization. As a result, event-driven execution of the network model is not an option.
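The lock-step structure described above can be summarized by the following Python sketch. It is only an illustration of the execution order, not the production kernel; the two helper functions passed in are hypothetical stand-ins for the actual compartment and gap-junction computations.

```python
def run_network(state, weights, i_evoked, n_steps,
                gap_junction_current, update_compartments):
    """Lock-step InfOli network loop (a sketch, not the production kernel).
    state:    dict of per-cell state arrays (dendrite, soma and axon variables)
    weights:  N x N gap-junction weight matrix (0.0 denotes no connection)
    i_evoked: (n_steps x N) externally evoked input currents"""
    n_cells = weights.shape[0]
    for step in range(n_steps):
        # Every cell reads the *previous* step's dendritic voltages (network synchronization)
        v_dend_prev = state['v_dend'].copy()
        # Space parallelism: the cells within one step can be updated concurrently ...
        for cell in range(n_cells):
            i_gj = gap_junction_current(v_dend_prev, weights[cell], cell)
            update_compartments(state, cell, i_gj, i_evoked[step, cell])
        # ... but time is sequential: the next step starts only after all cells are updated
    return state
```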
By profiling a C-language implementation of the InfOli application, it is revealed that the GJs have a great impact on the total model complexity. As seen in Table 1, the total number of floating-point (FP) operations needed for simulating a single step of a single cell including a single GJ is 871‡. Of those, more than 35% are operations required just for the GJ. In an N-cell network, assuming that each neuron maintains a constant number of connections C to neighboring cells, the overall GJ computation cost exhibits linear complexity O_gj(C × N). For many complex experiments, it is not the number of connections C but, rather, the connectivity density that is indicative of neuron interconnectivity; that is, the average percentage of the total neuron inventory to which neuron cells are connected (measured in % units). As a result, the previous complexity can be expressed as O_gj(N × N × K), where K is the connectivity density.
‡ Table numbers have been updated to amend a profiling mistake reported in previous work [11].
It is clear that the worst-case interconnectivity scenario occurs when K = 1, i.e. all neurons are connected to all other neurons, whereby the complexity becomes O_gj(N²). All remaining, non-GJ-related computation increases in a linear fashion, O_cell(N), and the remainder of the application is of a purely dataflow nature. This makes GJ computations the dominating factor in eHH models when GJ functionality is being modeled, as they break the dataflow nature of the application and dominate computational demands. This is true even for small-scale networks. As an example, for a 96-cell, all-to-all connected network (Table 1), the GJs comprise almost 70% of the overall computations.
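As a back-of-the-envelope way to reason about this scaling, the sketch below expresses the per-step FP-operation count as the sum of a linear cell term and a quadratic GJ term. The two cost constants are placeholders standing in for the Table 1 profiling figures, not exact values.

```python
def fp_ops_per_step(n_cells, density, ops_cell, ops_per_gj):
    """Estimated FP operations for one simulation step of an N-cell network with
    connectivity density K in [0, 1]: O_cell(N) + O_gj(N * N * K).
    ops_cell and ops_per_gj are per-cell and per-connection costs taken from profiling."""
    gj_connections = density * n_cells * n_cells   # each cell evaluates roughly K*N neighbors
    return n_cells * ops_cell + gj_connections * ops_per_gj

# With K = 1 (all-to-all), the GJ term dominates even for small networks,
# since it grows with N^2 while the cell term grows only with N.
```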
Application Use Cases
For our analysis, we use three use cases, which are representative of the memory and computational requirements of the InfOli workload.
All of the use cases are realistic instances of the InfOli application and have neuroscientific merit. They can also be considered as plausible instances of multi-compartmental modeling using HH models with various cases of neuron connectivity modeling.
The biology of each neuron is characterized by the internal conductances of the ion channels modeled in each compartment. In all use cases, the user can set each ion-channel conductance separately for every experiment and for each cell, giving the greatest possible control over the biological behavior of the simulated network. Additionally, the application allows the connectivity of the InfOli network to be programmed by the user before the simulation is deployed. The network connectivity (when present) is defined by an N × N connectivity matrix (where N is the network size) of FP weights signifying the weight of each connection. The weight value is used in the GJ computations to calculate the connection's impact on the neuron. A weight of 0.0 denotes the absence of the corresponding GJ connection. The three use cases are focused around the biological complexity of the GJs:
(i) InfOli with Realistic Gap Junctions (RGJ) - InfOli cells modeled with (biophysically) realistic GJ interconnectivity, as presented in [4]. The highest amount of detail is included in the GJ modeling.
(ii) InfOli with Simplified Gap Junctions (SGJ) - InfOli cells modeled with GJs replaced by simplified, passive connections. This constitutes a simpler connectivity in comparison to the previous use case.
(iii) InfOli with No Gap Junctions (NGJ) - InfOli cells modeled without accounting for GJs and without any interconnectivity implementation. This is the simplest use case, whereby the neurons are modeled as separate computational islands.
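A minimal way to construct the programmable connectivity described before this list is sketched below; the uniform random weight range and the density-based pruning are illustrative assumptions about how a user might populate the N × N matrix.

```python
import numpy as np

def build_connectivity(n_cells, density, seed=0):
    """Build an N x N gap-junction weight matrix with a given connectivity density K.
    A weight of 0.0 denotes the absence of the corresponding GJ connection."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(0.01, 0.1, size=(n_cells, n_cells))   # assumed weight range
    mask = rng.random((n_cells, n_cells)) < density             # keep roughly K of all pairs
    np.fill_diagonal(mask, False)                               # no self-connections
    return np.where(mask, weights, 0.0)

# density = 1.0 approximates the all-to-all (worst) case; density = 0.0 yields the NGJ case
```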
In Figure 2, we see the number of FP operations, based on the profiling of a sequential C implementation of the InfOli application. The FP operations are counted for each of the aforementioned use cases for different connectivity densities. In Figure 3, the compute (in FLOPS) to memory (in single-FP memory accesses) ratio is calculated as a metric to judge whether each use case is memory- or computation-bound.
InfOli with Realistic Gap Junctions (RGJ)
The computations in this use case increase quadratically, as we simulate more neurons and increase their connectivity density. The GJ computation greatly disrupts the dataflow nature of the neuron model, as connectivity computations must complete before the rest of the compartment where the GJ influence is calculated (e.g., dendrites). The compute-to-memory ratio also suggests that this use case is strongly computation-bound for all connectivity cases: the computations increase at a much faster pace than the memory-access requirements with increasing problem sizes.
InfOli with Simplified Gap Junctions (SGJ)
The level of detail of the RGJ case is useful for many modeling experiments but is also overkill in many other cases where more rudimentary connections are involved (like simple synapses that accumulate inputs). Such lighter workloads are represented by the SGJ case. We assume a use case of the InfOli application that simplifies the connections between neurons to simple input accumulators. The accumulation is parameterized using the weight that is assigned to each connection between two neurons, thus the connectivity information needs to be accessed in the same way as in the RGJ case. This use case has significantly lower processing requirements. Even though increasing the network size leads to similar scaling trends as in the RGJ case, the actual FP operations are reduced by about one order of magnitude compared to the previous use case (Figure 2). Yet, the connectivity feature still breaks the dataflow nature of the main neuron modeling. A similar trend as before is seen also in the compute-to-memory ratio, since the computation still increases at a faster pace than the memory requirements.
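To make the difference between the two connection models concrete, the sketch below contrasts plausible forms of the two per-connection computations. The voltage-dependent conductance expression and its constants in the RGJ function are illustrative assumptions, a simplification rather than the published model parameters; only the general structure (an exponential term per connection versus a plain weighted accumulation) is what matters here.

```python
import numpy as np

def gj_current_rgj(v_dend, weights, cell):
    """Realistic GJ (RGJ): every potential connection contributes a current with a
    voltage-difference-dependent conductance (exponential term). Functional form
    and constants are illustrative assumptions."""
    dv = v_dend - v_dend[cell]                    # voltage differences to all other cells
    g = 0.8 * np.exp(-dv * dv / 100.0) + 0.2      # assumed nonlinear conductance term
    return np.sum(weights * g * dv)               # weight 0.0 -> connection absent

def gj_current_sgj(v_dend, weights, cell):
    """Simplified GJ (SGJ): the connection collapses to a weighted accumulation of
    neighbor contributions; no exponentials or divisions per connection."""
    return np.sum(weights * (v_dend - v_dend[cell]))
```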
InfOli with No Gap Junctions (NGJ)
This use case represents the minimally featured version of the InfOli application. This is the case where the application becomes a purely dataflow workload and can achieve the greatest parallelism within a single simulation step. The processing requirements scale almost linearly with the network size and, compared to the other use cases, fewer computations are needed, as shown in Figure 2. Similarly, the memory accesses are significantly lower, as in this case the connectivity matrix is not used at all. As we can see in Figure 3, although the NGJ use case shows that computation is still the most important aspect of the application, both computation and memory accesses scale linearly at a similar pace. Thus, the compute-to-memory ratio remains constant as the problem size increases.
HPC Technologies and Implementation
Our heterogeneous platform incorporates three acceleration nodes: two software-based and one hardware-based. The hardware platform is a Maxeler Maia DataFlow Engine (DFE) board. The second node is an Intel Xeon Phi 5110P system [19]. The third is a Maxwell-based Titan X GPU by NVidia [20]. All three boards are PCIe-based, which is how they communicate with the host system. The three very different acceleration nodes provide broad enough features to cover the variety of characteristics of our use-case instances, discussed in the previous section.
The Maia DFE is a Maxeler HPC node based on reconfigurable hardware. Its tool flow is designed and optimized to accommodate the acceleration of dataflow applications; that is, applications with the bulk of their implementation using purely raw computations with the absence (partially or totally) of branching execution or feedback paths. The Maxeler tools can exploit the nature of dataflow applications to implement very fine-grain pipelined designs, maximizing the throughput and overall performance.
The DFE boards also incorporate a high-speed design for the communication between the reconfigurable chip and large on-board (DRAM) memory resources (several GBs), making them ideal for scientific applications manipulating large amounts of data in batch mode. What makes Maxeler DFEs stand out from the rest of the FPGA-based solutions is the high-level programming language employed for kernel coding (Java with Maxeler-related extensions) [21]. Additionally, this makes the interfacing with the PyNN front-end a much more straightforward process. The DFE board used in our experimental setup is a 4th-generation Maia DFE board implemented using an Altera Stratix FPGA chip.
The Xeon Phi is a Many Integrated Core (MIC) architecture co-processor which features 61 cores, each capable of supporting up to 4 instruction streams. The generation of Phi cards used in this work, named Knights Corner, uses an Intel Xeon host processor which can offload work to the Phi, much like a GPU, by using well-known programming tools such as OpenMP and OpenCL. However, and in contrast to the GPU mentality, the Phi can also be thought of as an accelerator that can act as a stand-alone processor and even features its own operating system. This allows an application to run natively on the platform, which is what our InfOli implementation does. The main disadvantage of this generation of the Xeon Phi is the interconnect architecture of the chip cores, which communicate over a bidirectional ring bus. This is expected to increase memory-consistency and cache-coherency delays.
GPGPUs have also been aggressively entering the HPC and, specifically, the scientific-computing domain. The Titan X includes 3,072 CUDA cores, which are used to parallelize computation execution, and 12 GB of on-board RAM. GPU implementations also benefit from the generally good adoption of the open NVidia CUDA environment, which allows porting of applications with similar ease to the Phi OpenMP and OpenCL frameworks. GPUs also come at a relatively lower cost than the other two node types. However, as opposed to the Xeon Phi node, a GPU cannot act as its own host, increasing communication delays between host and acceleration node during execution.
The performance capabilities of each HPC technology are fully exposed in this paper for the case of neuron-simulation workloads. An overview of the specifications of all three evaluation systems is presented in Table 2.
Infoli on the Maia DFE
The DFE implementation of the InfOli application can be seen in Figure 4 and is a more advanced version of the work done in [22]. Added features include programmable connectivity and programmable neuron state, settable by the user between experiment runs without the need to re-synthesize the design. The design implements three pipelines on the DFE hardware to accelerate the application, one for each part of a neuron (dendrite, soma, axon), executing the respective computations. The state parameters for each neuron are stored in separate BRAM blocks for fast read/update of the network state, as they are the data most frequently used throughout the experiment execution. Since every new neuron state depends only on the network state of the previous simulation step, only one copy of each neuron state is required at any point in time. The input stream to the DFE kernel originates in the on-board RAM and represents the evoked (external) inputs used in the dendritic computations comprising the network input. The initial state and neighbor (gap-junction) influence are also streamed in from the on-board memory only once (initialization data). The size of the connectivity matrix makes it impossible to store on-chip. It is, thus, stored in the on-board RAM as well and streamed in batches as required by the computations. The kernel output is streamed back to the on-board memory and, at the same time, is updated in the (on-chip) BRAM blocks of the DFE.
The program flow is tracked using hardware counters monitoring the GJ loop iterations (where applicable), as the GJ loop is implicitly unrolled throughout execution, the neurons executed and the number of simulation steps concluded. The data flows through the DFE pipelines with each kernel execution step (or tick) consuming the respective input or producing the respective output and new state at the correct execution points, according to the hardware counters. Thus, the whole application can be described and implemented in a fully dataflow fashion, allowing for very fine-grain pipelining of the compartment execution, greater than what could be achieved by a traditional FPGA and its respective tool-flow [7]. DFE execution naturally pipelines the execution of different neurons within one simulation step. Simulation steps are not themselves directly parallelizable, as every neuron must have the previous state of all other neurons ready for its GJ computations (either RGJ or SGJ) before a new step begins. The DFE pipeline is, thus, flushed before a new simulation step begins execution. This dependency is lifted in the NGJ case.
The GJ presence hurts hardware-usage efficiency by breaking the dataflow compatibility of the application. The GJ calculation is a loop that must finish before the rest of the dendrite-compartment state is calculated. The rest of the dendrite pipeline does not produce valid data for the operation ticks during which the GJ influence is being calculated. This delay is somewhat amortized by applying hardware loop unrolling, but only to the extent that the available chip area allows. Additionally, in use cases where programmable connectivity is included, the ticks for the evaluation and execution of a GJ connection are always spent, regardless of whether a connection actually exists or not. Thus, this implementation cannot benefit from a smaller connectivity density in terms of performance. On the other hand, since one synthesized design can account for all possible connectivity scenarios, the DFE implementation can guarantee predictable performance.
Infoli on the Xeon Phi
The InfOli application on the Intel Xeon Phi co-processor, depicted in Figure 5, is based on a shared-memory implementation, which is a more typical, software-based solution for HPC workloads. The application uses the OpenMP library to spawn threads, which work in parallel. As the Xeon Phi 5110P uses one core to handle OS-related tasks and each core features multithreading technology that can service up to 4 instruction streams simultaneously, the InfOli application on the Xeon Phi uses up to 60 × 4 = 240 OpenMP threads. Each thread handles a part of the neuronal network, partitioned as uniformly as possible to prevent workload imbalances.
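As an illustration of this partitioning step, the following sketch divides N neurons as evenly as possible over at most 240 threads. It is an assumption about how the split could be computed, not the actual OpenMP code of the implementation.

```python
def partition_network(n_neurons, max_threads=240):
    """Split n_neurons into contiguous, near-equal chunks, one per OpenMP thread.
    Uses at most max_threads threads and at least one neuron per thread."""
    n_threads = min(n_neurons, max_threads)
    base, extra = divmod(n_neurons, n_threads)     # the first `extra` threads get one more neuron
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks
```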
In each simulation step, the OpenMP thread is charged with computing its sub-network's state. This process is further broken down into two tasks. Initially, the sub-network needs updated information on the rest of the network, specifically the dendritic-membrane voltage levels of other neurons connected to this sub-network. These connections, facilitating the GJ connectivity, are formed based on an input map provided by the user at the initialization stage of the application and are assumed to be stable throughout the simulation. Thus, the OpenMP thread accesses memory space shared by all threads in order to collect data from other neurons, with the purpose of re-evaluating the state of its sub-network's Gap Junctions. In this task, shared-memory accesses can cause stalls in thread operations due to issues such as memory contention.
Upon completion of its first task, the OpenMP thread updates the compartmental states of each neuron in the sub-network. Each of the neuron's three compartments is re-calculated (dendrite, soma and axon). The dendritic compartment specifically uses the updated GJ states evaluated in the previous task in order to assess the incoming current from connected neurons. This particular process demands an amount of operations that increases significantly with neuron population size in the case of densely connected networks, as we would expect with the increasing computational demands of Gap Junctions.
After performing its two tasks for the entirety of its sub-network, the OpenMP thread begins the process anew for the next simulation step, until there are none left. It should be noted that the threads perform in lockstep; the threads synchronize before the execution of a new simulation step, so that stale data cannot be exchanged during GJ computation. This limitation is enforced by precision requirements. Relaxing said requirements could lead to a performance increase, at the cost of accuracy, by avoiding excess thread synchronization. Furthermore, it should be noted that the implementation described assumes that the entire network is large enough so that it can be partitioned into 240 parts. When dealing with smaller networks, the implementation utilizes less than the maximum amount of the platform's assets, since it is designed to require at least one neuron for each OpenMP thread to operate on.
Infoli on the Titan X GPU
Figure 6 depicts the GPU simulation flow. In the precompute stage, the host initializes the neuron states and the external input currents for the entire simulation duration. It allocates global memory on the device to store the present neuron states, the next neuron states and the external input currents. To reduce memory latency, it also binds the present dendritic voltage of the cells (which is part of the neuron state) to the GPU texture memory. These voltage values are accessed frequently, as they are used to determine the GJ influence. At the end of this stage, the host copies the data required for the simulation onto the GPU device. Note that, after this stage, no data is transferred from the host to the device, as the device contains all the necessary information for the simulation.
During the compute stage, the neuron calculations are conducted and the new states are persistently stored throughout the simulation duration.
To compute the new states for a single simulation step, the host launches a CUDA kernel on the device. Before the simulation, the kernel is configured for a particular use case (RGJ, SGJ or NGJ) and inter-neuron connectivity scheme (if applicable). The kernel is executed by a 2-dimensional grid of CUDA threads on the device. Every InfOli cell of the model is mapped to a corresponding thread that calculates the states of that neuron. On completion of the kernel, the host copies the calculated result of the simulation step from the device. As Figure 6 shows, the host uses two streams to issue the kernel-execution and data-transfer operations to the GPU. By using synchronization points, it launches a kernel in one stream only when the kernel in the other stream has completed. While the kernel is executing in the current stream, the calculated neuron states of the previous simulation step are copied asynchronously from the device to the host by the other stream. Thus, while maintaining synchronization between consecutive simulation iterations, computation of the current neuron states and data transfer of the previously computed states overlap, hiding host-to-GPU transfer delays.
BrainFrame & the PyNN Front-End
As mentioned previously, PyNN is a Python package that facilitates the interchangeability and the study of different simulation environments within the computational-neuroscience community [23]. It allows for simulator-independent specification of neuronal-network models and already supports many popular simulators like NEURON, NEST, PCSIM, Brian, etc.
The PyNN API supports modeling at multiple levels of abstraction, both at the neuron level and the network level. It provides a library of standard neuron, synapse and synaptic-plasticity models and a set of commonly used connectivity algorithms, while also supporting custom, user-defined connectivity in a simulator-independent fashion.
We integrated the three accelerators as back-ends of the BrainFrame system, using PyNN as a front-end, in order to compare and determine the efficiency of these heterogeneous solutions under different simulation scenarios, such as scenarios including sparse/dense networks, small/large populations, etc. Additionally, the PyNN integration provides the neuroscientific community with easy access to the accelerators, as mentioned before, without the constant mediation of an acceleration engineer.
As a proof of concept for the front-end of the BrainFrame platform, we have integrated the InfOli model into PyNN's standard models. Following the PyNN paradigm, the user initially selects the simulator (in our case, the BrainFrame simulator) and then proceeds to select the neuron model (in our case, the Inferior-Olive model). A population of neurons using the chosen model is then generated, determining the inter-neuron connectivity type and, finally, a projection of the specified neuronal network is created.
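A minimal PyNN-style script following these steps might look as follows. The back-end module name (pyNN.brainframe), the InfOli cell-type handle and the receptor label are illustrative assumptions rather than the exact names exposed by our module; the surrounding calls (setup, Population, Projection, record, run) are standard PyNN API.

```python
# A sketch of the user-facing flow; module, cell-type and receptor names are assumptions.
import pyNN.brainframe as sim                     # hypothetical BrainFrame back-end module

sim.setup(timestep=0.05)                          # 50 µsec step, expressed in ms

infoli = sim.native_cell_type('InfOli')           # hypothetical handle to the eHH InfOli model
population = sim.Population(96, infoli())         # TYPE-I-sized network of 96 cells

# All-to-all gap-junction connectivity with user-defined weights (RGJ/SGJ use cases)
gap_junctions = sim.Projection(population, population,
                               sim.AllToAllConnector(),
                               receptor_type='gap_junction')  # assumed receptor label

population.record('v')                            # record the membrane (axonal) response
sim.run(6000.0)                                   # 6 s of brain time = 120,000 steps
sim.end()
```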
The main difference between the proposed PyNN back-end substrate and the typical simulator back-ends within the PyNN environment is an additional selection step. In this step, a decision is taken about which of the three candidate acceleration nodes will be used for a specific experiment, based on the available hardware and the characteristics of the simulated neural network. An abstract view of the architecture of the PyNN BrainFrame module is shown in Figure 7. In order for the simulators to communicate with the PyNN front-end, a back-end PyNN module is required that implements and extends common methods and objects, like the neuron models, synapse models and projection methods and objects. In the case of the proposed BrainFrame module, we implemented objects and methods for: i) initialization of the simulator, ii) the description of the neuronal network in PyNN (standardmodels.py, populations.py and projections.py), and iii) controlling the simulation execution (recording.py, simulator.py). In some cases, an additional module is needed to translate these Python objects and parameters to each simulator's native parameters and language. For our system we developed PyHet, the BrainFrame-specific Python interpreter, which fills this role and also implements the accelerator selection.
Currently, this decision is implemented as an offline check, but we intend to implement an online version of the selection middle-ware to serve experiments in which the problem parameters are updated dynamically during execution. Figure 8 shows an abstract depiction of the added back-end system within the typical PyNN paradigm.
Results
In this section we present a thorough performance analysis of our heterogeneous BrainFrame platform. The goal is to evaluate the platform and give a clear picture of how each acceleration node copes with each of the various instances of the use cases defined previously, validating the applicability of a heterogeneous HPC approach for computational neuroscience. Additionally, the analysis is used to guide the implementation of the selection algorithm used for the back-end choice. As our selected workloads are drawn from a representative, biophysically-meaningful model with various inter-connectivity characteristics, the evaluation is valuable not only for this specific model but for the computational-neuroscience field in general, when similar modeling classes are used. To validate the correct functionality of the implementations, we use a simple experiment that recreates a typical response found in the inferior-olive network (axon response). The experiment produces a so-called complex spike, seen in Figure 1c, from all simulated cells. The experiment simulates 6 seconds of brain time, which translates to 120,000 simulation steps. The complex spike is produced by applying a small pulse as input to all InfOli cells at the same point after program onset, for about 500 simulation steps (or 25 ms in brain time). Despite being rudimentary, this experiment is easy to validate, provided all neurons are initialized with the same state, and also gives a good indication of whether synchronization between neurons is functioning correctly, validating the correct function of the interconnectivity (when present).
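The stimulus for this validation experiment can be expressed compactly, as in the sketch below. The step count and pulse duration follow the description above, while the pulse onset and amplitude are illustrative assumptions.

```python
import numpy as np

STEPS     = 120_000        # 6 s of brain time at a 50 µsec step
PULSE_LEN = 500            # 500 steps = 25 ms of stimulation
ONSET     = 20_000         # assumed onset step (1 s after program start)
AMPLITUDE = 6.0            # assumed pulse amplitude (model-specific units)

def build_stimulus(n_cells):
    """Evoked-input matrix (steps x cells): a small pulse applied to all cells at the
    same point in time, producing one complex spike per cell."""
    i_app = np.zeros((STEPS, n_cells))
    i_app[ONSET:ONSET + PULSE_LEN, :] = AMPLITUDE
    return i_app
```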
As mentioned in the introduction, there are two distinct tracks that can be followed in conducting neuroscientific experiments, both covered in this evaluation. We, thus, conduct one batch of measurements ranging from 96 to 960 neurons (representing TYPE-I experiments) and a second batch ranging from 960 to 7,680 neurons (representing TYPE II experiments).
We consider (having consulted our neuroscience peers) the minimum meaningful network size for experiments to be around 100 neurons, thus our measurements for TYPE I experiments begin at 96 neurons.
Performance Evaluation
Starting with the analysis for TYPE I experimentation, in Figure 9 we can see the simulation-step execution time for the most demanding use case, that of the RGJ with 100% connectivity density. Even though it is not the most common case, a platform must support such high interconnectivity densities for certain TYPE I experiments. Here, we can see that the DFE node has the most efficient performance for all our measurement points. The Xeon Phi follows closely, although it is still held back by local-memory delays and the less-than-efficient use of its parallel threads. These network sizes are not large enough to provide enough parallel tasks for the Phi threads to be fully exploited. The GPU, on the other hand, struggles to cope with the computational intensity of the gap junctions, which involve mostly division and exponent calculations.
The inefficiency of the Titan X GPU in conducting the realistic GJ computations is even more pronounced in the SGJ case (Figure 10). In this use case, in which the most demanding calculations are dropped, the GPU presents greatly improved scalability as the problem size increases, compared to the RGJ case. The Xeon Phi, on the other hand, still suffers the delays of core-to-local-memory synchronization, even though the actual calculations are much simpler now. The DFE node, similarly to the RGJ case, needs to spend the same number of operation ticks to evaluate the connection influence, even though it does enjoy performance gains because of the simpler calculations (by being able to achieve higher operation frequencies and shorter pipelines). Thus, both nodes show a similar scalability trend to the RGJ case. The improvement in performance allows the Titan X GPU to perform better than even the DFE for network sizes above 480 neurons.
An important aspect for this evaluation, and for correctly guiding the design of the PyNN selection step, is the behavior of the three acceleration nodes for less-than-all-to-all connectivity. Even though this is not relevant for the DFE, as it cannot exploit performance benefits from less dense connectivities, smaller densities can influence the Phi and GPU performance considerably. In Figure 11, we can see the same simulation-step performance for the three nodes for 25%, 50% and 75% connectivity densities, for the RGJ case. The GPU delivers significant gains, but the inefficient GJ execution still does not allow its performance to surpass that of the DFE, even though the latter operates as in a 100%-density simulation. The Xeon Phi, on the other hand, gains enough from the reduced GJ number to be faster than the DFE from specific problem sizes onwards: specifically, above 960 neurons for 75% density, and above 864 and 672 neurons for 50% and 25% connectivity densities, respectively.
For the SGJ use case (Figure 12), we see similar trends as in the RGJ case. Here, though, the end results are also the same as with the all-to-all SGJ connectivity runs. The GPU presents great scalability and, thus, becomes the most efficient option for network sizes larger than 480 neurons. The DFE node remains the most beneficial option for networks smaller than 480 neurons. For the NGJ case, for TYPE I experiments, the results point to the DFE as the uniformly best option. In the complete absence of inter-neuron connectivity, the application becomes a purely dataflow task, fully compatible with acceleration on a DFE, which is tailor-made for such cases, providing significant benefits over both the Phi and the GPU (Figure 13).
Lastly, for TYPE I experiments, the real-time capability of each platform is specifically important. Table 3 presents the network sizes for which real-time execution is achievable for each use case.
The results show that, for real-time experimentation, the DFE accelerator is the most beneficial option for all cases. GPU and Xeon Phi parallel threads tend to be underutilized at such low network sizes, even though most of the delays of using them are still present. Thus the DFE, using a fine-grain pipelined kernel, can achieve greater benefits at the problem sizes for which real-time execution is computationally achievable. It is interesting to note that the DFE can even support real-time experimentation for TYPE II experiments in the NGJ case.
For TYPE-II experiments the trends for the RGJ case with 100% connectivity change significantly ( Figure 14) . Here, the massive explosion of the GJ computations begins to stress the parallelization capabilities of both the Phi and the DFE. The DFE's efficient parallelization of the GJs relies mostly on the ability to unroll the GJ loop on the FPGA hardware, allowing for more iterations to finish per operation tick.
But the degree to which this unrolling can be accomplished is limited by the available chip area. For network sizes above 1,000 neurons, the number of iterations becomes so large that the speed-up gained by the unrolling starts to vanish, making the application less scalable on the DFE. The Xeon Phi follows a similar trend, as the communication overhead between cores (which are organized in a moderately efficient ring topology [5]) causes the same effect on scalability. Even though the scalability of the two nodes worsens as the problem size increases, the GPU scalability remains largely the same as before. As a result, the GPU becomes the better-performing node for network sizes of 4,800 neurons and above.
For lower connectivity densities, we see a similar trend, although the Xeon Phi scalability is slightly better because of the lower interconnectivity (Figure 15). Thus, the Phi retains the advantages it has for lower-than-100% densities, compared to the DFE node. Still, the effect of the inter-core communication is present, allowing the GPU to overtake the Phi for network sizes above 4,800 neurons (for densities of 50% and 75%) and above 3,840 neurons (for the 25% density).
The same effect is also observed in the SGJ case for the DFE and the Xeon Phi, although it is less pronounced. As a result, the GPU has no problem outperforming the other two node types for any tested network size and connectivity density (Figures 16 and 17). Finally, in the NGJ case the situation is the same as with TYPE I experiments. The dataflow nature of the application again allows the DFE to present significantly better performance than the other two node types (Figure 18).
Accelerator Selection Algorithm
The performance analysis discussed above can now be used to formulate a simple selection algorithm that will choose the acceleration node based on the problem parameters, mainly connectivity detail (biophysically realistic: RGJ, simple: SGJ and not present: NGJ), density and network size. Figure 19 shows the selection for our use-case instances. The RGJ case selection, which presents the most complex case in terms of acceleration node choice, shifts between all three nodes depending on the connectivity density. For the SGJ case, the GPU is always the node of choice, while for the NGJ case the DFE is always selected. Lastly, if the experiment is flagged as a real-time experiment the algorithm chooses the DFE to accelerate the application, as it is the only clearly viable accelerator for real-time experiments.
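A simplified version of this selection step, in the spirit of the PyHet middle-ware, is sketched below. The structure follows the description above; the numeric thresholds in the RGJ branch are illustrative values standing in for the crossover points reported in the performance analysis and summarized in Figure 19.

```python
def select_backend(use_case, density, n_neurons, real_time=False):
    """Choose an acceleration node from the experiment parameters (a sketch).
    use_case is 'RGJ', 'SGJ' or 'NGJ'; density is in [0, 1]."""
    if real_time:
        return 'DFE'                      # only node shown to sustain real-time execution
    if use_case == 'NGJ':
        return 'DFE'                      # purely dataflow workload
    if use_case == 'SGJ':
        return 'GPU'                      # simplified connections scale best on the GPU
    # RGJ: the choice shifts with density and network size (illustrative cut-offs)
    if density >= 1.0:
        return 'GPU' if n_neurons >= 4800 else 'DFE'
    dfe_to_phi = {0.75: 960, 0.50: 864, 0.25: 672}   # DFE -> Phi crossover points
    keys = [k for k in dfe_to_phi if density >= k]
    phi_cutoff = dfe_to_phi[max(keys)] if keys else dfe_to_phi[0.25]
    if n_neurons <= phi_cutoff:
        return 'DFE'
    phi_to_gpu = 3840 if density <= 0.25 else 4800   # Phi -> GPU crossover points
    return 'GPU' if n_neurons > phi_to_gpu else 'PHI'
```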
As a simple example of how this selection can speed up experiments, we can assume a scenario in which several batches of RGJ experiments need to be executed for various network sizes. Let us assume that each batch includes 5 experiments, each with gradually decreasing connectivity density (100%-75%-50%-25%-0%), and that each experiment in a batch simulates 40 seconds of brain time. The time savings in this example, obtained by using the BrainFrame system compared to homogeneous systems that integrate only a single accelerator type, can be seen in Table 4.
BrainFrame can achieve significant benefits compared to the single-node systems, ranging up to 86% faster execution. On average, considering the total runtime of all batches, the BrainFrame system achieves a 40% speed-up compared to a pure DFE system, a 10.7% speed-up compared to a pure GPU system and a 20.2% speed-up compared to a purely Phi-based system. This selection can be easily extended/updated as new features and more generalized model libraries are added for acceleration (making the selection predictive for general cases), or as each acceleration technology is updated in the future.
Discussion
There are numerous related works that propose employing HPC nodes for the acceleration of SNNs. Such nodes include hardware-based solutions, like reconfigurable hardware, as well as software solutions using GPUs and, less often, many-core processor platforms, such as the Xeon Phi. Most simpler modeling has found a good match on GPU-based systems, such as Izhikevich and I&F modeling [10, 24]. More biophysically meaningful modeling, like the HH model, seems to be a much more difficult problem to solve with GPUs, especially for real-time experimentation [6]. Similar difficulties in accelerating HH models are identified with Xeon Phi platforms for less densely interconnected networks [5]. Yet, for higher-density networks this platform provides much better efficiency.
FPGA-based solutions have been especially prolific in accelerating neuron applications, with impressive results specifically for biophysically-meaningful modeling and real-time performance for such networks [7, 8, 25, 26] . It is also revealed in related works conducting performance analysis that an FPGA's potential benefit varies greatly between SNN types, even without taking into account connectivity modeling that can decisively change the workload characteristics [9] .
Recently, we have also seen use of the DFE for accelerating computational neuroscience. On purely dataflow neuromodeling applications, the DFE can have great benefits both in large scale networks but also in real-time network performance [27] . Even in the cases of HH neurons that include highly accurate interconnectivity modeling (breaking the pure dataflow nature), the DFEs can accomplish greater benefits than traditional FPGA acceleration [22] .
These works, though, present just a one-off implementation of a specific application instance, on a specific acceleration platform and most also ignore the variety of synapse modeling and its influence on the applications. As a result, these solutions have limited use for the scientific community at large. To the best of our knowledge no prior work has considered a heterogeneous acceleration system for coping with the variability of the applications in the field.
Additionally, most related works seem to suffer from limited re-usability value. They ignore the challenge of the neuroscientific community adopting the proposed platform, and very few propose solutions to that end. Beuler et al. [28] developed a graphical interface alongside their FPGA-based simulator. Although it does provide ease of use in experiments, it is still confined to only one platform and only one application, with generally too limited flexibility to be the basis of a more widely adopted system. Weinstein et al. [29, 30] took the approach of developing their own modeling language to interface to their FPGA library, the DYNAMO compiler. Besides the limitation of using only FPGAs as the back-end platform, the DYNAMO compiler is a technically complete solution. Unfortunately, it failed to achieve wide adoption by the scientific community, as it requires learning a new language and, additionally, the non-trivial process of porting older, established neuron models to the new coding paradigm. The most promising solution was proposed by Cheung et al. [31] in NeuroFlow. In this work, the researchers integrated PyNN with their DFE-based hardware library. NeuroFlow also provides a very complete library of IPs in the back-end, covering a great portion of possible applications. Yet, the system still integrates a single acceleration platform. What is more, the performance and efficiency analysis is only presented for a single use case of a generally simpler model (Izhikevich), with connectivity modeling of medium complexity (STDP) and relatively low density (about 10%). The behavior and performance of the system for the rest of the supported features is not self-evident and should be significantly different, especially for accurate modeling such as HH and with high connectivity densities, as shown by our performance analysis on the DFE platform. Furthermore, many of the performance benefits are accomplished using event-driven simulation (each node processes only when an event is happening at its input). Many accurate models require cycle-accurate simulation; thus, event-driven solutions and their acceleration benefits cannot be employed.
Biophysically accurate models of biological systems, such as ones using the HH description, are comprised mostly of a set of computationally challenging differential equations, often implementing an oscillatory behavior. If neurons are simulated as independent computational islands (NGJ case), then dependencies between the equations do not arise, allowing divide-and-conquer, dataflow and event-driven acceleration strategies to be used very efficiently. The moment interconnectivity between oscillating neurons is modeled, either complex (like GJs) or simpler (like input integrators or STDP synapses), the cells become coupled oscillators. The embarrassingly parallel and dataflow nature of the application is then broken. All neuron states need to be completely updated at each simulation step to retain correct functionality. This requirement, in turn, enforces the use of cycle-accurate, transient simulators and forbids event-driven implementations.
The above discussion makes it obvious why the majority of the computational-neuroscience community has so far avoided employing HH models and multi-compartmental models with complex connections on large problem sizes using conventional computing machines. The eventual use of biophysically plausible neurons and connections on a larger scale is, however, required to explain biological behavior. Even though the most important systemic behaviors of the modeled systems may revolve around a few specific network characteristics, and could thus be revealed by generally simpler representations, the computational neuroscientist cannot know beforehand which of the numerous dynamics revealed by the biological measurements (from which the models arise) can safely be abstracted away. Thus, studies seeking to reveal systemic behavior need to start with complex representations before enough is known to apply more simplified modeling.
As a result, a single HPC platform is unable to cover all the aforementioned requirements to support a complete study, as our analysis also reveals, making a heterogeneous HPC platform with PyNN support extremely useful.
Conclusions
In this paper, we propose BrainFrame, a heterogeneous acceleration platform to serve computational-neuroscience studies in conducting the variety of experimentation often required for the study of brain functionality. We focus our analysis on biophysically-accurate neuron modeling, as such modeling is essential for understanding the system properties of biological brain networks. In order for the system to cope with the inherent flexibility and variety of the field, we present a proof-of-concept heterogeneous HPC system that integrates three HPC technologies already proven useful for brain simulations. The performance analysis of the system, using use cases that take into account connectivity density and connectivity-modeling complexity, reveals that all three nodes are required to be integrated within a single system in order to efficiently serve all possible experimentation cases. The platform is, thus, able to provide efficiency for both TYPE I and TYPE II experiments, but also to provide real-time performance for meaningful network sizes.
Based on these observations, we have combined the three accelerators with a PyNN front-end and implemented a selection algorithm identifying the most suitable HPC node, depending on the parameters of each desired experiment. Since all the acceleration nodes use PCIe slots to be integrated into the host system, great flexibility is also provided for the practical deployment of such systems, as the composition of hardware can be adjusted on a per-case basis, depending on the availability of funds and hardware resources. Finally, the PyNN front-end creates a direct link between the simulation platform and a multitude of prior modeling works, which is essential for wide adoption of the platform, while providing a clear roadmap for further development of our framework.
