Machine learning applications that are implemented with a spike-based computation model, e.g., Spiking Neural Network (SNN), have a great potential to lower the energy consumption when executed on a neuromorphic hardware. However, compiling and mapping an SNN to the hardware is challenging, especially when compute and storage resources of the hardware (viz. crossbar) need to be shared among the neurons and synapses of the SNN. We propose an approach to analyze and compile SNNs on resource-constrained neuromorphic hardware, providing guarantees on key performance metrics such as execution time and throughput. Our approach makes the following three key contributions. First, we propose a greedy technique to partition an SNN into clusters of neurons and synapses such that each cluster can fit on to the resources of a crossbar. Second, we exploit the rich semantics and expressiveness of Synchronous Dataflow Graphs (SDFGs) to represent a clustered SNN and analyze its performance using Max-Plus Algebra, considering the available compute and storage capacities, buffer sizes, and communication bandwidth. Third, we propose a fast technique, based on self-timed execution, to compile and admit SNN-based applications to a neuromorphic hardware at run-time, adapting dynamically to the available resources on the hardware. We evaluate our approach with standard SNN-based applications and demonstrate a significant performance improvement compared to current practices.
Introduction
While machine learning techniques have been important for decades, this field is suddenly experiencing major breakthroughs in diverse domains due to the enormous increase of data, significantly improved algorithms, and substantially more-powerful computer hardware. Compared to analog and rate models, machine learning techniques implemented with spike model [13] and brain-inspired learning algorithms [19], e.g., Spiking Neural Network (SNN) [44], have a great potential to lower the energy consumption when they are executed on a neuromorphic hardware such as DYNAP-SE [48], TrueNorth [28], Neurogrid [9], SpiNNaker [32], and Loihi [26]. This makes SNNs attractive for implementing machine learning applications in resource- and power-constrained environments, ones where sensor and edge devices of the Internet-of-Things (IoT) [35] typically operate.
LCTES'20, June 15-20, 2020, London, UK
Executing a program on a hardware involves several steps: compilation, resource allocation, and run-time mapping. Although apparent for mainstream computers, these steps are challenging and not very well defined when executing an SNN-based machine learning application on a neuromorphic hardware. This is because a neuromorphic hardware implements accumulation-based alternate computing, where neural computations and synaptic storage are co-located inside each crossbar and distributed in the hardware. This is different from a conventional computer where CPUs compute by exchanging data centrally from the memory.
Prior research efforts such as [8, 38] have only addressed design-time analysis of an application with unlimited hardware resources, e.g., arbitrarily large crossbars or as many interconnected crossbars as needed to accommodate all neurons and synapses of the application. While these efforts are still relevant when designing the hardware, they cannot provide a realistic guarantee of performance when executing these applications on an off-the-shelf neuromorphic hardware. This is because prior efforts fail to answer how to share compute and storage resources of the hardware to guarantee performance when not all neurons and synapses of an SNN can fit on the hardware at once. Table 1, shown in Section 6, lists the number of neurons and synapses in standard machine learning applications, which are on the order of thousands of neurons and hundreds of thousands of synapses. A neuromorphic hardware such as DYNAP-SE [48] has four crossbars and each crossbar can accommodate a maximum of 128 fanin synapses per neuron. Clearly, the four crossbars must be shared when executing an SNN model, which can lead to lower performance. Figure 1 illustrates the throughput impact due to limited resources on the DYNAP-SE hardware for standard SNN applications (see Section 6). We observe that throughput obtained on the hardware using current practices is on average 64% lower than the throughput analyzed using unlimited resources. Our objective is to reduce this performance impact when compiling an SNN-based application on a neuromorphic hardware with limited resources. Using our approach, which we describe next, throughput is 78% higher than current practices and only 21% lower than the throughput using unlimited resources.
A second limitation of existing approaches is that they do not address run-time aspects, i.e., how to compile and admit machine learning applications to the hardware in the least possible time based on the available resources.
To address these limitations, we propose a systematic and predictable approach to compile and map SNN-based machine learning applications on a resource-constrained neuromorphic hardware, providing performance guarantees.
Contributions: Following are our key contributions.
• We propose to partition an SNN into clusters of neurons and synapses, where each cluster can fit on to the resources of a crossbar in the hardware. We pose this as a bin-packing problem and propose a greedy strategy to solve it, maximizing crossbar utilization.
• We exploit the rich semantics and expressiveness of Synchronous Data Flow Graphs (SDFGs) to represent a clustered SNN and use Max-Plus Algebra to analyze its performance, e.g., throughput.
• We model resource constraints such as limited crossbars, input and output buffer sizes, and communication channel bandwidth into the SDFG representation. We extend the Max-Plus Algebra and use Self-Timed Execution to construct static-order schedules to estimate performance of this hardware-aware SDFG.
• We exploit a property of Self-Timed Scheduling to derive the schedule for each tile at run-time, starting from a single static-order schedule, without having to construct these schedules from scratch. This reduces the time to compile and admit a machine learning application to the hardware at run-time and allows the application to adapt dynamically to the available hardware resources.
Figure 2 shows a high-level overview of our proposed approach. The colored boxes in this figure are our key contributions. We also annotate the section in this paper where these contributions are described. We evaluate performance and scalability of our approach using standard SNN-based applications. Our results, detailed in Section 7, demonstrate a significant performance improvement compared to standard practices.
2 Crossbar-Aware Clustering of SNNs
Introduction to Spiking Neural Networks
An SNN is a computation model with spiking neurons and synapses. Neurons are typically implemented using the Integrate-and-Fire (I&F) model [16]. They communicate with each other by sending short impulses of infinitesimally small duration, called spikes, via synapses. Spiking neurons can be organized into feedforward layers. A typical feedforward SNN has one input layer, one or more hidden layers, and one output layer (e.g., deep learning models [41]). Spiking neurons can also be organized in recurrent topologies [45]. Without loss of generality, we focus on SNN-based supervised machine learning applications, where an SNN learns to perform tasks by considering examples from the field, without being explicitly programmed with any task-specific rules. SNN-based machine learning applications, especially those that are deployed on sensor and edge devices of an IoT network, typically operate on streaming data, i.e., these applications are iterative in nature. For these applications, real-time performance is measured in terms of throughput. We formulate throughput in Section 3.2.
Crossbar Resource Constraints
A typical neuromorphic hardware (see Figure 6 ) consists of crossbars, which are interconnected using an interconnection fabric. A crossbar implements neuron dynamics and facilitates synaptic storage. Therefore, each neuron and synapse of an SNN must be mapped to one of these crossbars.
In terms of constraints, a crossbar can accommodate only a fixed number of synapses per neuron. This is illustrated in Figure 3 with three examples using a small 4 × 4 crossbar. In Figure 3 (a), the crossbar implements a single 4-input neuron. In this example, 5 out of 8 (62.5%) input and output (IO) ports are utilized, and 4 out of 16 (25%) crosspoints are utilized. In Figure 3 (b), the crossbar implements one 3-input neuron; the IO and crosspoint utilization are 50% and 18.75%, respectively. Finally, in Figure 3 (c), the crossbar implements two 2-input neurons, resulting in IO and crosspoint utilization of 75% and 25%, respectively. Clearly, utilization varies based on how neurons and synapses of an SNN are mapped to a crossbar. The SNN of a machine learning application can have many neurons with many synapses per neuron. Take the example of LeNet [40] , a state-of-the-art convolutional neural network (CNN) to classify handwritten digits (Figure 4 ). This application has 4,634 neurons and 1,029,286 synapses, much beyond what a single crossbar can accommodate. To map such a large SNN to the hardware, the SNN needs to be partitioned into clusters of neurons and synapses, where each cluster can fit on to the resources of a crossbar in the hardware. We discuss how to form clusters from an SNN in Sec. 2.3 and how to share crossbars among clusters in Sec. 3.
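The utilization figures quoted for the three examples of Figure 3 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `crossbar_utilization` is hypothetical, and it assumes that neurons do not share inputs, so every synapse occupies its own input port and every neuron occupies one output port.

```python
def crossbar_utilization(fanins, size=4):
    """Return (IO utilization, crosspoint utilization) of a size x size crossbar.

    fanins: number of input synapses of each neuron placed on the crossbar.
    Assumption: inputs are not shared between neurons.
    """
    inputs = sum(fanins)    # one input port per synapse (no sharing assumed)
    outputs = len(fanins)   # one output port per neuron
    if inputs > size or outputs > size:
        raise ValueError("neurons do not fit on the crossbar")
    io = (inputs + outputs) / (2 * size)       # used ports / all ports
    crosspoints = sum(fanins) / (size * size)  # used cells / all cells
    return io, crosspoints


print(crossbar_utilization([4]))     # (0.625, 0.25)   one 4-input neuron
print(crossbar_utilization([3]))     # (0.5, 0.1875)   one 3-input neuron
print(crossbar_utilization([2, 2]))  # (0.75, 0.25)    two 2-input neurons
```

The three results match the 62.5%/25%, 50%/18.75%, and 75%/25% figures given in the text.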
SNN Partitioning
The SNN partitioning problem is a classic bin packing problem and we propose a greedy strategy to solve this. Algorithm 1 shows the pseudo-code of this clustering algorithm. We first sort (in ascending order) neurons based on each neuron's fanin synapses and store them in a list (neuron_list). For each neuron in this sorted list, we check to see if this neuron can be merged in one of the existing clusters in the cluster_list. A neuron can be merged in a cluster if the total number of IOs and crosspoints of the cluster after merging can still fit on a crossbar. If the neuron can be merged, we assign the neuron and its fanin synapses to the cluster. Otherwise, we form a new cluster. The cluster_list is sorted in descending order of utilization so that the less utilized clusters can be used for merging neurons with higher number of fanin synapses.
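Algorithm 1: Crossbar-aware SNN clustering.
neuron_list = sort neurons of the SNN in ascending order of their fanin synapses;
cluster_list = {};
foreach n ∈ neuron_list do
    find C ∈ cluster_list such that n can be merged in C;
    if no such C exists then create a new cluster C and add it to cluster_list;
    assign n and its fanin synapses to C;
    sort cluster_list in descending order of utilization;
end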
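The greedy strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cluster_snn`, the dictionary-based cluster representation, and the fit test (distinct inputs and neuron count bounded by the crossbar dimensions, which also bounds the crosspoints) are all hypothetical choices.

```python
def cluster_snn(fanin, n_inputs=128, n_outputs=128):
    """Greedy crossbar-aware clustering (illustrative sketch).

    fanin: dict mapping neuron id -> set of pre-synaptic source ids.
    Returns a list of clusters, each a dict with 'neurons' and 'inputs' sets.
    """
    def utilization(c):
        # IO utilization of a cluster on an n_inputs x n_outputs crossbar.
        return (len(c["inputs"]) + len(c["neurons"])) / (n_inputs + n_outputs)

    def fits(c, n):
        # The cluster still fits if its distinct inputs and its neurons
        # stay within the crossbar's input and output ports.
        inputs = c["inputs"] | fanin[n]
        return len(inputs) <= n_inputs and len(c["neurons"]) + 1 <= n_outputs

    neuron_list = sorted(fanin, key=lambda n: len(fanin[n]))  # ascending fanin
    cluster_list = []
    for n in neuron_list:
        if len(fanin[n]) > n_inputs:
            raise ValueError(f"neuron {n} exceeds the crossbar fanin limit")
        # Keep the most utilized clusters at the front of the list.
        cluster_list.sort(key=utilization, reverse=True)
        target = next((c for c in cluster_list if fits(c, n)), None)
        if target is None:  # no existing cluster can take n
            target = {"neurons": set(), "inputs": set()}
            cluster_list.append(target)
        target["neurons"].add(n)
        target["inputs"] |= fanin[n]
    return cluster_list


# Hypothetical 4-neuron SNN mapped to 4 x 4 crossbars.
fanin = {"a": {1, 2}, "b": {3, 4}, "c": {1, 2, 3}, "d": {5, 6, 7, 8}}
clusters = cluster_snn(fanin, n_inputs=4, n_outputs=4)
print([sorted(c["neurons"]) for c in clusters])  # [['a', 'b', 'c'], ['d']]
```

In this toy example, neurons a, b, and c share inputs 1 to 4 and fit on one crossbar, whereas d needs four further inputs and opens a second cluster.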
Analyzing Inter-cluster Communication
Once an SNN is clustered, the next step is to analyze the inter-cluster communication, i.e., the number of spikes that are expected between these clusters when the SNN model is deployed in the field on a neuromorphic hardware. Figure 2 (left) shows a high-level overview of this step. An SNN is first analyzed using an application-level simulator such as CARLsim [17] with training examples. The aim is to compute the average number of spikes that are expected on each synapse of the SNN. We then compute the number of spikes between each cluster pair using the neuron-to-cluster mapping obtained from Algorithm 1. At the output of this stage, a clustered SNN graph is generated, with information on the number of spikes expected between each cluster pair. In the next section we describe how to analyze this clustered SNN.
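The aggregation step of this analysis can be sketched as follows. The per-synapse spike counts would come from an application-level simulation; here they are given as a plain dictionary, and the function name `inter_cluster_spikes` and the data layout are assumptions made for illustration.

```python
from collections import defaultdict


def inter_cluster_spikes(synapse_spikes, cluster_of):
    """Aggregate per-synapse spike counts into per-cluster-pair counts.

    synapse_spikes: dict (pre, post) -> average number of spikes on the synapse.
    cluster_of: dict neuron id -> cluster id (neuron-to-cluster mapping).
    Synapses inside one cluster are skipped: they never leave the crossbar.
    """
    traffic = defaultdict(float)
    for (pre, post), spikes in synapse_spikes.items():
        src, dst = cluster_of[pre], cluster_of[post]
        if src != dst:
            traffic[(src, dst)] += spikes
    return dict(traffic)


# Hypothetical spike counts for a 4-neuron SNN split into two clusters.
spikes = {("n0", "n2"): 5, ("n1", "n2"): 3, ("n0", "n1"): 7, ("n2", "n3"): 2}
mapping = {"n0": 0, "n1": 0, "n2": 1, "n3": 1}
print(inter_cluster_spikes(spikes, mapping))  # {(0, 1): 8.0}
```

The resulting dictionary corresponds to the edge annotations of the clustered SNN graph.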
Dataflow Modeling of SNN Clusters
We model a clustered SNN as a Synchronous Data Flow Graph (SDFG) for predictable performance analysis.
Operational Semantics of SDF Graphs
Synchronous Data Flow Graphs (SDFGs, see [42] ) are commonly used to model streaming applications that are implemented on a multi-processor system-on-chip [55] . Both pipelined streaming and cyclic dependencies between tasks can be easily modeled in SDFGs. These graphs are used to analyze a system in terms of throughput and other performance properties, e.g. execution time and buffer requirements [58] .
Nodes of an SDFG are called actors. Each node is a cluster of the SNN. Actors are computed by reading tokens (spikes) from their input ports and writing the results of the computation as tokens on the output ports. The number of tokens produced or consumed in one execution of an actor is called the port rate. They represent the number of spikes per unit time at the input and output of different clusters in the SNN. Port rates are visualized as annotations on edges. Actor execution is also called firing, and it requires a fixed amount of time to execute on a crossbar. Edges in the graph are called channels and they represent dependencies among actors. Figure 5 shows the example of an SDFG constructed using our SNN2SDF tool [3] for the LeNet CNN model used in handwritten digit classification [40] . There are 7 actors and 13 channels in this graph. For instance, actor_4 has two outgoing channels. The channel going to actor_6 has a port rate of 2 spikes per unit time, and the one going to actor_1 has a port rate of 3 spikes per unit time. From Fig. 5 we also see that there are cycles in the graph. For instance actor_6→actor_2→actor_1→actor_6 is a cycle in the SDFG. Due to the presence of these cycles, standard Directed Acyclic Graphs (DAGs) cannot be used to represent and analyze clustered SNNs. This justifies our choice of using SDFGs.
An actor is called ready when it has sufficient input tokens on all its input channels and sufficient buffer space on all its output channels; an actor can only fire when it is ready. A channel may also contain an initial token, shown as annotation. For instance, the channel between actor_0 and actor_6 in the figure has 1 initial token. A set Ports of ports is assumed, and with each port p ∈ Ports, a finite rate Rate(p) ∈ N \ {0} is associated. Formally, an SDFG is defined as follows.
Definition 1. (Actor) An actor $a_i$ is the tuple $(I_i, O_i, \tau_i, \mu_i)$ consisting of a set $I_i \subseteq Ports$ of input ports, a set $O_i \subseteq Ports$ of output ports, the execution time $\tau_i$ of $a_i$, and its state space $\mu_i$, i.e., the buffer space needed for communicating spikes on all of its channels.
Definition 2. (SDFG) An SDFG is a directed graph $G_{app} = (A, C)$ consisting of a finite set $A$ of actors and a finite set $C \subseteq Ports^2$ of channels. The source of channel $ch_i^j \in C$ is an output port of actor $a_i$, the destination is an input port of actor $a_j$. All ports of all actors are connected to precisely one channel, and all channels are connected to ports of some actors. The channels connected to the input and output ports of an actor $a_i$ are denoted by $InC(a_i)$ and $OutC(a_i)$, respectively.
Before an actor $a_i$ starts its firing, it requires $Rate(q_i)$ tokens from all $(p, q_i) \in InC(a_i)$. When the actor completes execution, it produces $Rate(p_i)$ tokens on every $(p_i, q) \in OutC(a_i)$. One important property of an SDFG is throughput, which is defined as the inverse of its long-term period. A period is the average time needed for one iteration of the SDFG. An iteration is defined as the minimum non-zero execution such that the original state of the SDFG is obtained. Throughput is the performance parameter used in this paper. The following definitions are introduced to formulate throughput.
Definition 3. (Repetition Vector) The Repetition Vector RptV of an SDFG is defined as the vector specifying the number of times actors in the SDFG are executed in one iteration.
In the SDFG representation of a clustered SNN, all spikes generated on a channel are consumed by the destination actor. This means that all actors are fired exactly once during one iteration of the application. So, for the seven actors of Figure 5, RptV = [1 1 1 1 1 1 1].
Computing Performance on Infinite Resources
We present an approach to compute the application period of an SDFG by analyzing its maximum cycle mean (MCM) and assuming infinite hardware resources. For this, we use Max-Plus Algebra [36]. The Max-Plus semiring $\mathbb{R}_{max}$ is the set $\mathbb{R} \cup \{-\infty\}$ defined with two basic operations $\oplus$ and $\otimes$, which are related to linear algebra as
$$a \oplus b = \max(a, b) \quad \text{and} \quad a \otimes b = a + b$$
To use Max-Plus Algebra to analyze an SDFG, it is customary to express the time at which an actor fires in terms of preceding firings and then use standard analysis techniques for Max-Plus Algebra to estimate timing performance. For the SDFG in Figure 5, the firing end times of all 7 actors in the $k$-th iteration can be written (in linear algebra) as a system of recurrences, which is expressed in Max-Plus Algebra through a matrix $T$ that captures the execution times $\tau_n$. The following definitions are introduced to estimate latency.
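As an illustration of how such recurrences are evaluated, the following sketch implements the two Max-Plus operations and a matrix-vector product. The recurrence form $t_k = T \otimes t_{k-1}$ is the common Max-Plus formulation and is assumed here; the two-actor matrix is a hypothetical example and is not the matrix of Figure 5.

```python
NEG_INF = float("-inf")  # the Max-Plus "zero" element


def oplus(a, b):
    """Max-Plus addition: a (+) b = max(a, b)."""
    return max(a, b)


def otimes(a, b):
    """Max-Plus multiplication: a (x) b = a + b."""
    return a + b


def maxplus_matvec(T, t):
    """Compute T (x) t, i.e. result[i] = max_j (T[i][j] + t[j])."""
    result = []
    for row in T:
        acc = NEG_INF
        for T_ij, t_j in zip(row, t):
            acc = oplus(acc, otimes(T_ij, t_j))
        result.append(acc)
    return result


# Hypothetical two-actor cycle with execution times 2 and 3:
#   t0(k) = t1(k-1) + 2,   t1(k) = t0(k) + 3 = t1(k-1) + 5
T = [[NEG_INF, 2],
     [NEG_INF, 5]]
t = [0, 0]
t = maxplus_matvec(T, t)
print(t)  # [2, 5]
t = maxplus_matvec(T, t)
print(t)  # [7, 10]
```

Successive firing end times of each actor differ by 5 time units, which is the period of this toy graph.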
Definition 4. (Digraph) The digraph $\Gamma(T)$ of an $n \times n$ matrix $T$ with entries defined in $\mathbb{R}_{max}$ is the tuple $\langle A, E \rangle$, where $A$ is the set of vertices, i.e., $A = \{1, 2, \cdots, n\}$, and $E$ is the set of connected ordered arcs between vertices, i.e., $E = \{(i, j) \mid T_{i,j} \neq -\infty\}$.
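Definition 5. (Walk) A walk $w$ in the digraph $\Gamma(T)$ is a sequence of arcs $(x_1, x_2)(x_2, x_3) \cdots (x_{k-1}, x_k)$; the head of an arc in the sequence is either the start vertex of the walk or the tail vertex of a preceding arc, and the tail vertex of an arc in the sequence is either the end vertex of the walk or the head vertex of a succeeding arc. The weight of the walk is given by
$$|w|_T = T_{x_1 x_2} + \cdots + T_{x_{k-1} x_k}$$
Definition 6. (Cycle) A cycle $c$ in the digraph $\Gamma(T)$ is a walk whose start and end vertices coincide, i.e., $x_k = x_1$.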
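Definition 7. (Maximum Cycle Mean) The maximum cycle mean, $\rho_{max}(T)$, is the maximum of the weight-to-length ratio of all cycles $c$ in $\Gamma(T)$, i.e.,
$$\rho_{max}(T) = \max_{c \in \Gamma(T)} \frac{|c|_T}{|c|} \quad (6)$$
where $|c|_T$ is the weight of cycle $c$ and $|c|$ is its length, i.e., its number of arcs.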
In this paper, performance of an SNN is defined in terms of throughput of the equivalent SDFG, measured as the inverse of its maximum cycle mean (Equation 6).
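One standard way to compute a maximum cycle mean is Karp's algorithm, sketched below. The paper does not state which algorithm it uses, so this choice is an assumption made for illustration; the function name `max_cycle_mean` and the example matrix are hypothetical. An arc $(i, j)$ exists when `T[i][j]` is finite, following Definition 4.

```python
NEG_INF = float("-inf")


def max_cycle_mean(T):
    """Maximum cycle mean of the digraph of T using Karp's algorithm.

    Returns -inf when the digraph has no cycle.
    """
    n = len(T)
    # D[k][v]: maximum weight of a walk with exactly k arcs that ends at v.
    # Starting every vertex at 0 acts like a zero-weight super source.
    D = [[0.0] * n] + [[NEG_INF] * n for _ in range(n)]
    for k in range(1, n + 1):
        for u in range(n):
            if D[k - 1][u] == NEG_INF:
                continue
            for v in range(n):
                if T[u][v] != NEG_INF:
                    D[k][v] = max(D[k][v], D[k - 1][u] + T[u][v])
    best = NEG_INF
    for v in range(n):
        if D[n][v] == NEG_INF:
            continue  # no walk with n arcs ends at v
        ratios = [(D[n][v] - D[k][v]) / (n - k)
                  for k in range(n) if D[k][v] != NEG_INF]
        best = max(best, min(ratios))
    return best


# Hypothetical 3-vertex digraph: cycle 0 -> 1 -> 0 with weights 2 and 4,
# plus an arc 1 -> 2 of weight 1 that is not on any cycle.
T = [[NEG_INF, 2, NEG_INF],
     [4, NEG_INF, 1],
     [NEG_INF, NEG_INF, NEG_INF]]
period = max_cycle_mean(T)
print(period)      # 3.0
print(1 / period)  # throughput as the inverse of the maximum cycle mean
```

The only cycle has weight 6 and length 2, so its mean is 3 and the corresponding throughput is one iteration every 3 time units.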
Hardware-Aware Performance Analysis
This section describes how to extend the Max-Plus formulation to analyze performance of an SNN on a resource-constrained hardware.
Platform Description
Performance of an SNN, computed using Equation 6, gives the minimum period (i.e., the maximum throughput) achievable with infinite hardware resources in terms of crossbars, buffer sizes, and communication bandwidth. For off-the-shelf neuromorphic hardware, however, these resources are limited. Figure 6 shows a typical tile-based neuromorphic hardware, where tiles are connected via an interconnection fabric. Each tile consists of a crossbar (C), input and output buffers, and a network interface (NI). A crossbar is a two-dimensional organization of horizontal and vertical electrodes. At every cross-point, there is a synaptic element implemented using a non-volatile memory device; the figure illustrates Oxide-based Resistive RAM (OxRAM) as the memory element [34].
Binding Actors (Clusters) to Tiles
Similar in vein to PYCARL [5] , we use a load balancing strategy to bind clusters of an SNN to the tiles of the hardware. We first formulate the load of a tile as follows:
$$load(tile) = a \cdot crossbar + b \cdot buffer + c \cdot connection + d \cdot bandwidth \quad (7)$$
where $a$, $b$, $c$, and $d$ are user-defined constants used to prioritize different hardware resources on a tile. Next, we propose a greedy approach to balance the load on each tile. For this, we first distribute the clusters evenly to the tiles and calculate the standard deviation of tile loads. Then, for every cluster pair that is bound to two different tiles, we swap the clusters to see if the standard deviation reduces. If it reduces, we retain this swap as the new binding and continue analyzing other cluster pairs.
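The swap-based load balancing can be sketched as follows. Several details are assumptions made for illustration: the load of a tile is taken as the sum of Equation 7 over the clusters bound to it, the initial even distribution is round-robin, and the pairwise pass is repeated until no swap reduces the standard deviation. The names `tile_load` and `bind_clusters` are hypothetical.

```python
from itertools import combinations
from statistics import pstdev


def tile_load(usage, weights=(1.0, 1.0, 1.0, 1.0)):
    """Equation 7 for one cluster: usage = (crossbar, buffer, connection, bandwidth)."""
    return sum(w * u for w, u in zip(weights, usage))


def bind_clusters(cluster_usage, n_tiles, weights=(1.0, 1.0, 1.0, 1.0)):
    """Greedy binding of clusters to tiles that reduces the spread of tile loads."""
    clusters = list(cluster_usage)
    binding = {c: i % n_tiles for i, c in enumerate(clusters)}  # even split

    def spread(b):
        loads = [0.0] * n_tiles
        for c, t in b.items():
            loads[t] += tile_load(cluster_usage[c], weights)
        return pstdev(loads)

    best = spread(binding)
    improved = True
    while improved:  # repeat the pairwise pass until nothing improves
        improved = False
        for x, y in combinations(clusters, 2):
            if binding[x] == binding[y]:
                continue
            binding[x], binding[y] = binding[y], binding[x]  # try the swap
            s = spread(binding)
            if s < best:
                best, improved = s, True  # keep the swap
            else:
                binding[x], binding[y] = binding[y], binding[x]  # undo
    return binding, best


# Hypothetical demand of four clusters on a two-tile hardware.
usage = {"c0": (4, 0, 0, 0), "c1": (1, 0, 0, 0),
         "c2": (4, 0, 0, 0), "c3": (1, 0, 0, 0)}
binding, deviation = bind_clusters(usage, n_tiles=2)
print(deviation)  # 0.0
```

In this example the round-robin start places both heavy clusters on tile 0 (loads 8 and 2); a single swap brings both tiles to a load of 5.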
Executing a Cluster on a Crossbar
A cluster is executed by placing its neurons and synapses on to the crossbar of a tile. Figure 7 illustrates this execution mechanism. Synaptic weights $w_1$ and $w_2$ are programmed into OxRAM cells P1 and P2, respectively. The output spike voltages, $v_1$ from N1 and $v_2$ from N2, inject current into the crossbar, which is obtained by multiplying a pre-synaptic neuron's output spike voltage with the OxRAM cell's conductance at the cross-point of the pre- and post-synaptic neurons (following Ohm's law). Current summations along columns are performed in parallel using Kirchhoff's current law, and implement the sums $\sum_i w_i v_i$ needed for forward propagation of neuron excitation. The execution time of a cluster is the current propagation delay through an OxRAM synapse and is obtained from Malik et al. [46]. Note that though we use OxRAM synapses as an example, the execution technique applies to any resistive non-volatile synapse.
Figure 7. Executing a cluster on a crossbar.
Although a crossbar implements analog computations, spikes at the output are converted into digital packets before communicating on the interconnect. We use the Address Event Representation (AER) protocol [11]. Figure 8 shows an example explaining the principles behind AER. Here, four neurons in a crossbar spike at 3, 0, 1, and 2 time units, respectively. The encoder encodes these four spikes in order to be communicated on the interconnect. As can be seen from this figure, a spike is encoded uniquely with its source and time of spike. Therefore, each token in the SDFG is simply a spike packet with a header encoding the address and time, and zero payload.
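The AER principle described above can be sketched in a few lines. The packet layout (an address and time pair with no payload), the neuron addresses 0 to 3, and the ordering of packets by spike time are assumptions made for illustration; the function names are hypothetical and the sketch does not reproduce the exact encoding of Figure 8.

```python
def aer_encode(spike_times):
    """Encode spikes as (address, time) packets, ordered by spike time.

    spike_times: dict neuron address -> time at which the neuron spikes.
    """
    return sorted(spike_times.items(), key=lambda packet: packet[1])


def aer_decode(packets, n_neurons):
    """Recover the spike time of each neuron (None if it did not spike)."""
    times = [None] * n_neurons
    for address, time in packets:
        times[address] = time
    return times


# Four neurons spiking at 3, 0, 1, and 2 time units, as in the text.
packets = aer_encode({0: 3, 1: 0, 2: 1, 3: 2})
print(packets)                 # [(1, 0), (2, 1), (3, 2), (0, 3)]
print(aer_decode(packets, 4))  # [3, 0, 1, 2]
```

Each packet identifies a spike uniquely by its source and time, which is why a token in the SDFG needs no payload.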
Computing-Resource Aware Performance
To compute performance of an SNN on a resource-constrained neuromorphic hardware, we first construct its hardware-aware SDFG and then compute the maximum cycle mean using the Max-Plus Algebra formulation of Equation 6. The construction proceeds in the following three steps.
Step 1 (Buffer Modeling): Limited input and output buffer sizes are modeled as back-edges with initial tokens in the hardware-aware SDFG. The number of tokens on this back-edge indicates the buffer size available. When an actor generates spikes on a channel, the available size reduces; when the receiving actor consumes the spike, the available buffer is released. Figure 9 shows such an example, where the buffer size of the channel from actor_4 to actor_1 in Figure 5 is shown as five. Before actor_4 can be executed, it has to check if enough buffer space is available. This is modeled by requiring tokens from the back-edge to be consumed. Since it produces three tokens per firing, three tokens from the back-edge are consumed, indicating reservation of three buffer spaces. On the consumption side, when actor_1 is executed, it frees three buffer spaces, indicated by a release of three tokens on the back-edge. In this model, the output buffer space is claimed at the start of execution and released only at the end of firing to ensure atomic execution of actors. Figure 10 shows the final hardware-aware SDFG of LeNet-based handwritten digit classification on a neuromorphic hardware with four tiles. For simplicity of representation, we have omitted the back-edges from the figure.
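The back-edge model of Step 1 can be mimicked with two token counters, as in the sketch below. The class name `BoundedChannel` is hypothetical, and each firing is collapsed into a single call, so the start and end of a firing are not separated in time; this is a simplification of the atomic-execution model described above.

```python
class BoundedChannel:
    """A channel whose buffer size is modeled by tokens on a back-edge."""

    def __init__(self, capacity, rate):
        self.rate = rate        # tokens produced and consumed per firing
        self.tokens = 0         # spikes waiting on the forward edge
        self.free = capacity    # initial tokens on the back-edge

    def producer_ready(self):
        return self.free >= self.rate

    def producer_fire(self):
        if not self.producer_ready():
            raise RuntimeError("not enough buffer space")
        self.free -= self.rate    # reserve buffer space (back-edge tokens)
        self.tokens += self.rate  # spikes written to the channel

    def consumer_ready(self):
        return self.tokens >= self.rate

    def consumer_fire(self):
        if not self.consumer_ready():
            raise RuntimeError("not enough input tokens")
        self.tokens -= self.rate  # spikes read from the channel
        self.free += self.rate    # buffer space released on the back-edge


# Channel from actor_4 to actor_1: buffer size 5, 3 tokens per firing.
ch = BoundedChannel(capacity=5, rate=3)
ch.producer_fire()          # actor_4 fires: 3 spaces reserved, 2 left
print(ch.producer_ready())  # False: 2 free spaces < 3 required
ch.consumer_fire()          # actor_1 fires and frees 3 spaces
print(ch.producer_ready())  # True
```

With a buffer of five and a rate of three, actor_4 cannot fire twice in a row; it must wait for actor_1 to release space, which is exactly the dependency the back-edge introduces.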
Step 2 (Actor Ordering): The number of crossbars in a neuromorphic hardware is limited and therefore they may have to be shared between actors of an SNN. However, on a tile, only one instance of an actor can be executing at the same moment in time. We use time-division multiple-access (TDMA) to allocate time slices to actors mapped to the same tile. During the allocated time slice, an actor is executed on the crossbar of the tile and generates spikes, which are stored in the output buffer for communication on the interconnect. Next, we generate the order in which the actors bound to a tile are fired to provide a guarantee on performance, i.e., throughput. For this, we apply our Max-Plus Algebra formulation (Equation 6) on the hardware-aware SDFG of Figure 10. This is our static-order schedule. We construct this schedule at design time.
Step 3 (Actor Execution): Once the static-order schedule is constructed for all tiles of the hardware, we use the self-timed execution strategy [49] for executing these actors at run-time. In this strategy, the exact firing times of actors are discarded, retaining only the assignment and ordering of actors on each tile as obtained from the design-time analysis (Step 2). At run-time, ready actors are inserted in a list and fired in the same order as determined at design time.
Run-time Resource Management
A modern neuromorphic hardware is expected to execute many SNN applications simultaneously. When a new application is to be admitted to a hardware, which is currently running other applications, the incoming application needs to be compiled and mapped to the hardware within a short time window, based on resources currently available on the hardware. Furthermore, when an existing application finishes execution, its hardware resources are freed, meaning that such resources can now be allocated to other running applications to improve their performance. Clearly, a dynamic compilation strategy is needed to address them.
We observe that over 75% of the total compilation time of an SNN application is due to the time consumed in constructing the static-order schedule for each tile of the neuromorphic hardware (see Section 7.3). To address this, we exploit the basic property of Max-Plus Algebra and self-timed scheduling, which is expressed as the following lemma.
Lemma 1. If the schedule of actors on a single-tile system is used to derive the schedule for a multi-tile system by keeping the actor firing order unchanged, the resultant multi-tile schedule is free of deadlocks [10].
Based on this lemma, we propose the following. First, we construct the static-order schedule for all actors of an SNN on a single tile at design-time. This is achieved using our proposed Max-Plus Algebra formulation of Equation 6. Next, we discard the exact timing information, retaining only the actor firing orders for run-time use. At run-time, we first construct the actor binding to tiles (Section 4.2), considering the available resources. Next, we use the single-tile static-order schedule to fire actors when they are ready. Figure 11 illustrates our run-time methodology.
Figure 11. Our approach to run-time resource management (design-time and run-time stages: actor binding, Section 4.2; self-timed execution; available resources).
Figure 12 illustrates the construction of per-tile schedules for an SNN application with seven run-time actors, with two different bindings of actors to tiles and the same single-tile static-order schedule. We illustrate two scenarios in this example. In the first scenario (left), the application uses two tiles of the hardware. In the second scenario (right), the application uses three tiles of the hardware. In both scenarios, the actor order on each tile is the same as that in the single-tile schedule. Since tile schedules are not constructed from scratch, the schedule construction time is much lower (see Table 3).
Figure 12. Two schedules constructed from the same single-tile static-order schedule using two and three tiles, respectively.
However, performance obtained using this single-tile schedule can be lower than the maximum performance of a multi-tile schedule constructed independently. As long as this performance deviation is bounded, the actor schedule for any tile can be easily derived from the binding of actors to this tile and a given single-tile static-order schedule. In Section 7.6, we evaluate the performance of this scheduling.
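Deriving per-tile schedules from a single-tile static-order schedule amounts to projecting the global actor order on to each tile, as sketched below. The actor names, the firing order, and the two bindings are hypothetical and are not those of Figure 12; the function name `per_tile_schedules` is also an assumption.

```python
def per_tile_schedules(single_tile_order, binding):
    """Project a single-tile static-order schedule on to each tile.

    single_tile_order: list of actors in their single-tile firing order.
    binding: dict actor -> tile id.
    The relative order of actors on every tile is left unchanged.
    """
    schedules = {}
    for actor in single_tile_order:
        schedules.setdefault(binding[actor], []).append(actor)
    return schedules


# Hypothetical single-tile order of seven actors.
order = ["a0", "a3", "a1", "a5", "a2", "a6", "a4"]

# Scenario 1: two tiles are available.
two_tiles = {"a0": 0, "a1": 0, "a2": 0, "a3": 1, "a4": 1, "a5": 1, "a6": 0}
print(per_tile_schedules(order, two_tiles))
# {0: ['a0', 'a1', 'a2', 'a6'], 1: ['a3', 'a5', 'a4']}

# Scenario 2: three tiles are available; the same order is reused.
three_tiles = {"a0": 0, "a1": 1, "a2": 2, "a3": 0, "a4": 1, "a5": 2, "a6": 0}
print(per_tile_schedules(order, three_tiles))
# {0: ['a0', 'a3', 'a6'], 1: ['a1', 'a4'], 2: ['a5', 'a2']}
```

The projection is a single pass over the actor list, which is why re-deriving the schedules after a change of binding is much cheaper than constructing them from scratch.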
Evaluation Methodology
We conduct all simulations on a system with 8 CPUs, 32GB RAM, and NVIDIA Tesla GPU, running Ubuntu 16.04.
Hardware Models
We model the DYNAP-SE neuromorphic hardware [48] with the following configurations.
• A tiled array of four neuromorphic cores, with each core integrating neurons and synapses locally. A neuromorphic core is a crossbar with 128 input and 128 output neurons. There are 65,536 crosspoints (i.e., synaptic devices) in each crossbar.
• Spikes are digitized and communicated between cores through a mesh routing network using the Address Event Representation (AER) protocol.
• Each synaptic element is an HfO2-based OxRAM device. In this device, multiple low-resistance states are used for emulating long-term potentiation (LTP; cumulative increase of conductance) and multiple high-resistance states are used for long-term depression (LTD; cumulative and gradual decrease of conductance). Each crossbar uses a 1T-nR structure, where one access transistor is used to access n OxRAM devices vertically stacked in the back-end of line (BEOL). Timing parameters are modeled from [34].
To test scalability of our compilation technique, we also evaluate hardware models with 9 and 16 neuromorphic cores, organized in a 3 × 3 and 4 × 4 mesh network, respectively.
S. Song, A. Balaji, A. Das, N. Kandasamy, and J. Shackleford
Evaluated Applications
We evaluate eight standard SNN-based machine learning applications: 1) image smoothing (ImgSmooth) [17] on 64 × 64 images; 2) edge detection (EdgeDet) [17] on 64 × 64 images using difference-of-Gaussian; 3) multi-layer perceptron (MLP)-based handwritten digit recognition (MLP-MNIST) [30] on 28 × 28 images of handwritten digits from the MNIST dataset [29]; 4) heart-rate estimation (HeartEstm) using electrocardiogram (ECG) data [24] from the Physionet database [47]; 5) ECG-based heart-beat classification (HeartClass) [6]; 6) handwritten digit classification with standard CNN (CNN-MNIST) [52, 54]; 7) handwritten digit classification with the LeNet CNN (LeNet-MNIST) [52]; and 8) image classification with LeNet CNN (LeNet-CIFAR) [52] with images from the CIFAR dataset [39]. The LeNet CNN model is described in [40]. Table 1 summarizes the topology and the number of neurons, synapses, and spikes of these applications. Image-based applications are iteratively executed on test images.
Table 1. Applications used to evaluate our approach (columns: Applications, Synapses, Neurons, Topology, Spikes).
Evaluated State-of-the-art Techniques
We evaluate the following three approaches.
• SpiNeMap [8] : This technique clusters an SNN and distributes these clusters on to the tiles of a hardware such that the number of spikes on the interconnect is minimized. Clusters mapped to the same tile are executed in random order.
• PYCARL [5] : This technique maps SNN clusters to tiles, balancing the tile load. Clusters mapped to the same tile are executed in a random order.
• Our proposed approach uses SDFGs to analyze the performance of an SNN on a neuromorphic hardware. Clusters are allocated to tiles based on this analysis. Overall, our approach balances load on each tile and uses static-order schedule to improve throughput.
Evaluated Metrics
We evaluate the following performance metrics.
• Performance: This is the throughput of each application on the hardware.
• Compilation Time: This is the time to compile and map each application on the hardware.
• Resource utilization: This is the tile, buffer size, IO, and bandwidth utilization on the hardware for each application.
7 Results and Discussion
7.1 Performance
Figure 13 reports the throughput obtained on the DYNAP-SE neuromorphic hardware for each of our applications and each of the evaluated techniques, normalized to SpiNeMap. We make the following three observations.
Figure 13. Throughput, normalized to SpiNeMap.
First, throughput obtained using SpiNeMap is the lowest among all the evaluated techniques. This is because SpiNeMap places SNN clusters on tiles to minimize the number of inter-tile spikes. Therefore, some tiles need to execute many SNN clusters. As cluster ordering on a tile is not addressed in SpiNeMap, throughput is significantly low. Second, throughput obtained using PYCARL is better than SpiNeMap by an average of 41%. Although PYCARL also orders cluster execution on a tile randomly, throughput of PYCARL is higher than SpiNeMap. This is due to PYCARL's strategy to balance the load on each tile, resulting in lower number of clusters mapped per tile than SpiNeMap. Third, throughput obtained using our approach is the highest (78% higher than SpiNeMap and 28% higher than PYCARL). This improvement is due to our static-order schedule, which we analyze and construct at design-time for every tile of the hardware to decide the exact order in which clusters mapped to the same tile need to be executed to improve performance.
Cluster Binding
We reason that balancing the load on the tiles of a hardware is essential to achieving high throughput. Figure 14 reports throughput of each of our applications on the DYNAP-SE hardware. We compare our proposed approach against baseline SpiNeMap with random cluster order on each tile and SpiNeMap with static-order schedule on each tile. Throughput results are normalized to SpiNeMap. We make the following two observations. First, throughput of SpiNeMap improves by an average of 39% when static-order scheduling is enabled for each tile of the hardware. Second, our approach improves throughput further by an average of 27%. Although the static-order scheduling remains the same, our proposed approach, which balances the load on each tile, improves throughput compared to SpiNeMap.
Figure 14. Throughput, normalized to SpiNeMap.
Figure 15 reports the fraction of total compilation time of each of our applications using our proposed approach for the DYNAP-SE hardware, distributed into the time to bind clusters to tiles and the time to construct the static-order schedule on each tile. The number on each bar reports the absolute time in ms to compile these applications on the DYNAP-SE hardware. We observe that the time consumed to create the static-order schedule on each tile is on average 75% of the total time to compile these applications on the hardware. For some applications such as HeartEstm, the scheduling time is over 95% of the total time to compile the application. These results suggest that for run-time use, the schedule construction time needs to be reduced, which justifies our fast self-timed execution based scheduling. We present these run-time results in Section 7.6.
Figure 15. Fraction of total compile time, distributed into binding and scheduling time.
Table 2 reports the utilization of hardware resources (tile resources, buffer size, connections, and input and output bandwidth) on the DYNAP-SE neuromorphic hardware for each application. The average utilization of hardware resources is 92.5% for the crossbar IOs on each tile, 9.0% for buffer space, 42.6% for connections, and 15% for input and output tile bandwidth. Since we perform hardware-aware analysis, resource utilization never exceeds 100%.
Table 2. Resource utilization on DYNAP-SE.
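The interplay between load balancing and bounded resource utilization can be sketched as follows. This is an illustrative greedy heuristic under stated assumptions, not the binding algorithm of this paper: the function `bind_balanced`, the per-cluster execution times and buffer demands, and the per-tile buffer capacity are hypothetical. Clusters are placed heaviest first on the least-loaded tile that still has buffer space, so the reported utilization cannot exceed 100% by construction.

```python
from typing import Dict, List, Tuple


def bind_balanced(clusters: Dict[str, Tuple[int, int]],
                  num_tiles: int,
                  buffer_capacity: int):
    """Bind clusters to tiles, balancing execution-time load.

    `clusters` maps a cluster name to (execution time, buffer demand).
    Returns (binding per tile, load per tile, buffer utilization in %).
    """
    binding: List[List[str]] = [[] for _ in range(num_tiles)]
    load = [0] * num_tiles   # accumulated execution time per tile
    used = [0] * num_tiles   # accumulated buffer demand per tile
    # Heaviest clusters first, so large items are spread out early.
    for name, (time, demand) in sorted(clusters.items(),
                                       key=lambda kv: -kv[1][0]):
        feasible = [t for t in range(num_tiles)
                    if used[t] + demand <= buffer_capacity]
        if not feasible:
            raise ValueError(f"cluster {name} does not fit on any tile")
        tile = min(feasible, key=lambda t: load[t])
        binding[tile].append(name)
        load[tile] += time
        used[tile] += demand
    utilization = [100.0 * u / buffer_capacity for u in used]
    return binding, load, utilization


# Hypothetical clusters: name -> (execution time, buffer demand).
clusters = {"c0": (5, 2), "c1": (4, 2), "c2": (3, 1),
            "c3": (3, 1), "c4": (2, 1), "c5": (1, 1)}
binding, load, utilization = bind_balanced(clusters, num_tiles=2,
                                           buffer_capacity=5)
print(binding)      # [['c0', 'c3', 'c5'], ['c1', 'c2', 'c4']]
print(load)         # [9, 9]
print(utilization)  # [80.0, 80.0]
```

If the capacity is too small for the given clusters, the sketch rejects the binding instead of over-committing a tile, which mirrors the idea that a hardware-aware analysis refuses infeasible mappings.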
Compilation Time
Resource Utilization
These results illustrate that our approach can also be used for designing neuromorphic hardware, not only in terms of the number of tiles but also in terms of other resources such as buffer space, connections, and input and output bandwidth.
Figure 16 reports throughput of each of our applications for our proposed approach, normalized to SpiNeMap. We compare throughput obtained on three hardware models: 4 tiles (our baseline configuration), 9 tiles (arranged in a 3×3 mesh), and 16 tiles (arranged in a 4×4 mesh). We make the following two observations.
Figure 16. Throughput, normalized to SpiNeMap.
Performance Scalability
First, throughput generally increases with the number of neuromorphic tiles. With 9 and 16 tiles, the average throughput is higher than the baseline configuration by 11% and 15%, respectively. This improvement arises because, with more tiles in the hardware, each tile is shared among fewer clusters. Second, for applications such as ImgSmooth, four tiles are sufficient to map the application. There is therefore no significant improvement in throughput when the number of tiles in the hardware is increased. For other applications such as EdgeDet, throughput increases with the number of tiles.
Figure 17 reports throughput of each of our applications for our proposed approach, normalized to SpiNeMap. We compare throughput obtained at design-time, where cluster schedules are independently constructed for each tile, against throughput obtained at run-time using our proposed single-tile schedule. We make the following three observations. First, throughput obtained at run-time from a single-tile static-order schedule is on average 15% lower than when schedules are constructed independently, that is, by using our design-time analysis method. This verifies Lemma 1. Second, for some applications such as HeartEstm and HeartClass, throughput obtained at run-time is exactly the same as that obtained at design-time. Third, throughput at run-time is still higher than SpiNeMap by an average of 51.4%.
Table 3 compares the compilation time for each application using our approach at design-time against that at run-time. On average, the run-time approach achieves a 67.5% reduction in compilation time. This is due to the reduction of schedule construction overhead using the single-tile static-order schedule along with the self-timed execution approach.
[26]. The chip is designed in a 14 nm FinFET process. There are also many other neuromorphic chips such as Neurogrid [9], BrainScaleS [53], Braindrop [50], and ODIN [31]. These architectures are all tiled arrays, with each tile integrating neurons and synapses locally. This is similar to the architecture of DYNAP-SE [48], which we model in this paper.
Run-time Performance
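One way to read the run-time scheme evaluated with Figure 17 and Table 3 is that a single static order over all clusters is constructed once, and each tile later keeps only the clusters bound to it, in the same relative order. The sketch below illustrates that projection step under this assumption; the function `project_schedule` and the example data are hypothetical and are not taken from our implementation.

```python
from typing import Dict, List


def project_schedule(single_tile_order: List[str],
                     binding: Dict[str, int]) -> Dict[int, List[str]]:
    """Derive per-tile static orders from one global static order.

    `single_tile_order` lists all clusters in the order they would run
    on a single tile; `binding` maps each cluster to its tile index.
    The relative order of clusters on each tile is preserved.
    """
    per_tile: Dict[int, List[str]] = {}
    for cluster in single_tile_order:
        per_tile.setdefault(binding[cluster], []).append(cluster)
    return per_tile


# Hypothetical example: 4 clusters, bound to 2 tiles at run-time.
single_tile_order = ["A", "C", "B", "D"]
binding = {"A": 0, "B": 1, "C": 0, "D": 1}
print(project_schedule(single_tile_order, binding))
# {0: ['A', 'C'], 1: ['B', 'D']}
```

The projection is linear in the number of clusters, so no schedule has to be reconstructed when the binding changes. The price is that the per-tile orders are no longer optimized for the specific binding.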
Mapping SNNs to Neuromorphic Hardware
Corelet is a proprietary tool from IBM to map SNNs to TrueNorth [1]. PACMAN is used to map SNNs to SpiNNaker [33]. Beyond these hardware-specific tools, there are also general-purpose ones. For instance, PyNN [27] is used to map SNNs on Loihi, BrainScaleS, SpiNNaker, and Neurogrid by balancing the load on each tile. The PSO-based technique developed in Das et al. is used to map SNNs to a hardware, reducing the energy consumption between tiles [25]. SpiNeMap reduces the communication between tiles [8]. PYCARL is proposed to perform hardware-software co-simulation of SNNs [5]. We compare our approach against PYCARL and SpiNeMap and find that it performs significantly better. There are also other approaches that use a single large crossbar to map SNNs [2, 43, 60-63]. [46] to design neuromorphic tiles.
Similar Concept in Related Domain
SDFGs are widely used for predictable mapping of applications to multiprocessor systems. Numerous approaches to throughput analysis of SDFGs have been previously proposed [18, 57, 59, 64]. Bonfietti et al. evaluated mappings of SDFGs to multiprocessor systems, maximizing the throughput [12]. Stemmer et al. propose to use probabilistic analysis to allocate and schedule SDFGs on multiprocessor systems [56]. Das et al. evaluated the fault-tolerant mapping of SDFGs to multiprocessor systems [21-23]. Recently, SDFG-based analysis has also been proposed for analyzing machine learning applications [4, 7, 15, 20, 37]. However, none of these approaches address application analysis with limited hardware resources, both at design-time and at run-time.
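As a concrete instance of throughput analysis in this family, the steady-state period of a system described by a max-plus recurrence x(k+1) = A ⊗ x(k) equals the maximum cycle mean of A, and throughput is its reciprocal. The sketch below computes that value with Karp's algorithm, a standard technique that we use here purely for illustration; it is not the formulation of any specific work cited above. The matrix values are hypothetical, and the matrix is assumed to be irreducible (every node reachable from every other).

```python
import math
from typing import List

NEG_INF = -math.inf  # max-plus "zero": no edge between two nodes


def max_cycle_mean(A: List[List[float]]) -> float:
    """Maximum cycle mean of the graph of a max-plus matrix (Karp).

    A[v][u] is the weight of edge u -> v, or NEG_INF if absent.
    Assumes an irreducible matrix, so node 0 reaches every node.
    """
    n = len(A)
    # D[k][v]: maximum weight of a k-edge walk from node 0 to node v.
    D = [[NEG_INF] * n for _ in range(n + 1)]
    D[0][0] = 0.0
    for k in range(1, n + 1):
        for v in range(n):
            D[k][v] = max(D[k - 1][u] + A[v][u] for u in range(n))
    best = NEG_INF
    for v in range(n):
        if D[n][v] == NEG_INF:
            continue
        worst = min((D[n][v] - D[k][v]) / (n - k)
                    for k in range(n) if D[k][v] != NEG_INF)
        best = max(best, worst)
    return best


def throughput(A: List[List[float]]) -> float:
    """Iterations per time unit in steady state."""
    return 1.0 / max_cycle_mean(A)


# Hypothetical 2-node system: self-loops of weight 3 and 4, and a
# two-edge cycle 0 -> 1 -> 0 with weights 2 and 7 (mean 4.5).
A = [[3.0, 7.0],
     [2.0, 4.0]]
print(max_cycle_mean(A))  # 4.5
```

In this toy case the two-edge cycle dominates both self-loops, so the period is 4.5 time units per iteration.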
Conclusions
We introduce an approach for predictable compilation of SNN-based applications on state-of-the-art neuromorphic hardware. Prior works have only addressed design-time mapping, considering unlimited resources in the underlying hardware. These approaches present significant limitations when used to compile and execute machine learning applications on a resource-constrained hardware. Our approach makes three contributions. First, we propose a technique to generate neuron and synapse clusters, where each cluster can fit on to the resources of a tile of the hardware. Second, we exploit the rich semantics of SDFG to model SNN clusters and analyze performance on a resource-constrained hardware. Finally, we propose a scheduling approach based on self-timed execution to reduce the time to compile and admit an application to a hardware at run-time, adjusting to dynamic resource availability. We conducted experiments with standard SNN-based applications and demonstrate a significant increase in performance and reduction in compilation time over current practices.
