Abstract-Neuromorphic hardware like SpiNNaker offers massive parallelism and efficient communication of small payloads to accelerate the simulation of spiking neurons in neural networks. In this paper, we demonstrate that this hardware is also beneficial for other applications that require massive parallelism and the large-scale exchange of small messages. More specifically, we study the scalability of PageRank on SpiNNaker and compare it to an implementation on traditional hardware. In our experiments, we show that PageRank on SpiNNaker scales better than on traditional multicore architectures.
I. INTRODUCTION
Unlocking the value in the masses of data stored and available to us has become a primary concern. Medical data, banking data, shopping data and others are all analyzed in great detail to find patterns, to classify behavior or phenomena and finally to predict behavior and progression. The outcome of these analyses is ultimately hoped to predict the behavior of customers, to increase sales in marketing [1] , optimize diagnostic tools in medicine to detect disease earlier, optimize medical treatments for better outcomes [2] and so on.
As the amounts of data are growing rapidly, however, increasingly big scale-out architectures need to be used to analyze data in due time. Doing so works very well for embarrassingly parallel problems, i.e., problems with no or little shared state. For problems with shared state, however, massive numbers of messages need to be sent between cores to exchange state, making IPC (inter-process communication) the bottleneck (or more precisely the bandwidth of the network or memory). One such application is the simulation of spiking neural networks (SNNs) as they are used to understand or replicate how the brain works.
To still support the efficient simulation of SNNs, research has recently started to develop neuromorphic hardware. Neuromorphic hardware designs typically feature massive parallelism (a massive number of cores) with a very fast interconnect for small messages (to send spikes between cores). While some of the proposed designs can only be used to simulate neural activity (e.g., Spikey [3] ), others can be programmed to run general applications as well (e.g., SpiNNaker [4] ).
In this paper, we want to show that for the class of problems which require massive parallelism as well as the exchange of small messages, neuromorphic hardware provides better scalability than traditional multicore architectures. We demonstrate this with the specific example of PageRank on SpiNNaker. We chose PageRank because it is a broadly used graph analytics algorithm in which each core works on a subset of the graph and exchanges edge counts in small messages. More specifically, a single iteration of PageRank needs to send as many messages as there are edges in the graph. In each iteration, each core sends its weighted rank along its outgoing edges and builds up a new rank by summing the ranks coming from inbound edges. Processing big graphs requires the computation to run in parallel on multiple cores or machines. The computation is typically not very CPU intensive; the bottleneck generally lies in the communication needed to exchange ranks in each iteration. Because the SpiNNaker board is designed for the efficient, low-latency exchange of small messages, communication is precisely where it may have an advantage over traditional multicore architectures or distributed setups.
We further chose SpiNNaker as it uses general purpose ARM cores connected with quick IPC to process data. As we show with the example of PageRank, compared to traditional multicore architectures, SpiNNaker enables better scalability for massively parallel applications which require the exchange of large numbers of small messages.
II. BACKGROUND
In the following, we will provide background on neuromorphic hardware in general, SpiNNaker in particular and then discuss the PageRank algorithm.
A. Neuromorphic Hardware
The brain is a massively parallel system of highly interconnected but computationally simple neurons. Simulating brain activity as an SNN is typically not done efficiently at scale on traditional von Neumann architectures. First, to simulate an SNN, a very high number of very small messages (i.e., the spikes) must be sent between neurons. While spikes are encoded with a few bits, most communication protocols require an order of magnitude more bits for the header and routing information alone. Moreover, today's communication protocols lack efficient means for multicast communication, which is crucial to send the same spike to a large set of neurons. Second, while neurons are only active when receiving and processing spikes, traditional CPUs or GPUs are always on, rendering the simulation of SNNs energy inefficient.
Each core has the relatively simple task of simulating several neurons and it can thus be comparatively wimpy (few FLOPS).
Communication: Unlike traditional communication between cores or computers, the payload sent between neurons is very small and can be encoded in a few bits (40-72). Key to neuromorphic hardware thus is efficient communication optimized for small payloads. Given the massive number of connections between neurons, multicast communication must also be efficiently supported.
2) State of the Art Devices:
The prototypes developed share key features like a massively parallel infrastructure with a fast interconnect for small messages. For example, the SpiNNaker architecture is based on a large number of ARM processors [4] . Memory with little capacity is local to each processor (which itself has several cores) while there is also slower, global memory used for communication between processors. More important is a very efficient communication infrastructure for small packets (∼24 Bytes for a spike). IBM's TrueNorth project is a platform with 4096 cores each simulating 256 neurons [5] . Similar to SpiNNaker, memory, computation, and networking are handled locally by each core, therefore moving past the von Neumann architecture. With an event-driven model where cores are only powered when needed, TrueNorth reduces energy consumption similarly to SpiNNaker. Intel's Loihi platform is designed to reduce energy consumption by using spin devices to simulate the neurons (thereby restricting the neuron models that can be used) as well as memristors as local memory (to store synapse weights) [6] . Spikey [3] takes energy efficiency even further by using a mixed-signal approach in which neurons are implemented using analog elements. While this makes them more bio-realistic, it limits the flexibility to use arbitrary neuron models as new models cannot simply be implemented through programming.
B. SpiNNaker
Several different SpiNNaker boards and devices have been developed. They range from small development boards attached to a desktop and featuring four chips, to large-scale deployments featuring 500'000 cores accessed remotely. The architecture components, i.e., chips and communication infrastructure, are the same across boards.
1) Chip Architecture:
The SpiNNaker chip [7] has 18 ARM968 cores running at 200 MHz. Each core has 96 KB of Tightly Coupled random-access Memory (TCM), which is mounted on top of the core to minimize latency (5 ns) [7] . Each core has a DMA controller to copy from and to shared memory at low latency (15 ns) [7] and without CPU overhead. Additionally, each core comes with a communications controller (to interface with the router). 128 MB of DDR SDRAM is shared between all the cores of the chip. It is located off-chip, but usually in the same package; hence the SDRAM controller has an on-chip interface to access it. The router is used to direct messages. SpiNNaker cores communicate via a packet-switched network optimized for large numbers of small fixed-sized data packets sent over UDP/IP (message passing). At boot time, each core is assigned one of three roles: 16 cores become application cores, 1 becomes the monitoring core, and 1 is kept as a spare core. The application cores run the user code while the monitoring core is responsible for coordinating the work of the application cores. The spare core is used in case of hardware failures.
2) Software Architecture:
The software stack used to run simulations on SpiNNaker comprises several modules, the most relevant of which are described below.
Host Machine: hosts the higher-level software used to define simulations and to generate the C code they need. The sPyNNaker8 framework is an implementation of PyNN, a simulator-independent language for building neuronal network models [8] , and is used to define SNNs, while general graphs can be specified through the SpiNNakerGraphFrontEnd. At a lower level, the graph is mapped to SpiNNaker cores using PACMAN. At the lowest level, the compiled C application binaries are sent to SpiNNaker through SpiNNMachine via the SpiNNaker Datagram Protocol (SDP) over UDP/IP.
Monitoring Processor: runs the SpiNNaker Control and Monitor Program [4] . At the center of any SpiNNaker core is the SpiNNaker Application Runtime Kernel (SARK) [9] . It is the lowest layer of software that provides functionality such as core and application bootstrapping, memory management, interrupt control, handling of host machine commands on the monitor processor (e.g. remote memory alteration for debugging, application load / unload).
Application Processor: runs the simulation binary compiled from C on the host machine by the user. Two libraries providing support to the application code can also be statically linked with the binary: SARK, which is compulsory as it supports the application cores in the SpiNNaker environment and spin1, an event-based API that operates at a higher level and lets the user define callbacks for specific events.
3) Execution Model:
Execution of code on SpiNNaker is primarily event-driven, with the most important events being received messages. SpiNNaker also features synchronous execution, where code is executed periodically to batch process buffered messages. To make use of the asynchronous execution model, an asynchronous callback is registered with the associated event. The callback will be executed once the event occurs. The synchronous model executes a defined piece of code after the timer tick period has expired. The timer tick period is defined globally and is a means to synchronize the execution of code on different cores.
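To illustrate the two execution styles, the following minimal Python sketch mimics a core that buffers packets in an asynchronous callback and drains the buffer when a timer tick fires. It is only a toy model under our own assumptions: the class ToyCore and the event names "packet_received" and "timer_tick" are hypothetical and do not correspond to the spin1 API.

```python
from collections import defaultdict, deque


class ToyCore:
    """Hypothetical stand-in for one core; this is not the spin1 API."""

    def __init__(self):
        self.callbacks = defaultdict(list)  # event name -> registered callbacks
        self.inbox = deque()                # packets buffered between ticks

    def on(self, event, callback):
        """Register a callback for an event."""
        self.callbacks[event].append(callback)

    def emit(self, event, *args):
        """Run every callback registered for the event."""
        for callback in self.callbacks[event]:
            callback(*args)


core = ToyCore()
processed = []

# Asynchronous part: runs whenever a packet arrives and only buffers it.
core.on("packet_received", lambda key, payload: core.inbox.append((key, payload)))


# Synchronous part: runs once per timer tick and batch processes the buffer.
def on_tick(tick):
    while core.inbox:
        processed.append((tick, core.inbox.popleft()))


core.on("timer_tick", on_tick)

core.emit("packet_received", 0x101, 0.25)
core.emit("packet_received", 0x102, 0.50)
core.emit("timer_tick", 1)
```

After the tick, both buffered packets have been processed in arrival order and the buffer is empty.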
C. PageRank
PageRank is an algorithm developed and used by Google to assign relevance to web pages to ultimately rank them. The relative weight and thus relevance (also referred to as rank) of a page depends on the relevance of the pages pointing to it. The definition of relevance is thus recursive, i.e., pages may point (indirectly) to themselves, and execution of the PageRank algorithm is iterative and will stop once the weights of pages converge.
The distributed approach to PageRank is based on the Bulk Synchronous Parallel (BSP) model [10] . Each core/node in a distributed setting is assigned multiple vertices for which it calculates the ranks. Once ranks are calculated for vertex A, the ranks are sent to all cores/nodes which are responsible for vertices directly connected to A through edges. True to the BSP model, overall computation works in two globally synchronized phases: (a) a communication phase in which the vertex ranks are exchanged and (b) a computation phase where the ranks are updated based on the incoming ranks. The two phases are repeated until the weights of the vertices converge.
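The two BSP phases can be sketched in a few lines of Python. This is a sequential illustration of the standard PageRank update, not the implementation evaluated in this paper; the function name pagerank_bsp, the damping factor of 0.85 and the uniform initial rank are assumptions, and the sketch assumes a graph without dangling vertices.

```python
def pagerank_bsp(edges, num_vertices, damping=0.85, iterations=25):
    """Run PageRank as alternating communication and computation phases.

    edges is a list of (source, target) pairs. Every vertex is assumed to
    have at least one outgoing edge (no dangling vertices).
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    ranks = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # (a) Communication phase: one message per edge carries the
        # sender's rank divided by its number of outgoing edges.
        inbox = [0.0] * num_vertices
        for src, dst in edges:
            inbox[dst] += ranks[src] / out_degree[src]

        # (b) Computation phase: each vertex derives its new rank from
        # the sum of the ranks it received.
        ranks = [(1.0 - damping) / num_vertices + damping * received
                 for received in inbox]
    return ranks


cycle = pagerank_bsp([(0, 1), (1, 2), (2, 0)], 3)
skewed = pagerank_bsp([(0, 1), (0, 2), (1, 2), (2, 0)], 3)
```

On the three-vertex cycle all ranks stay equal, while in the second graph vertex 2 ends up with a higher rank than vertex 1 because it has more inbound edges.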
III. IMPLEMENTING PAGERANK ON SPINNAKER
On an abstract level, executing PageRank and simulating SNNs are very similar as both operate in a massively parallel fashion on a large graph. Furthermore, both are distributed by assigning vertices (web pages or neurons) to nodes/cores and by exchanging small messages (weights/ranks or spikes) between cores. Given the similarities, the execution of PageRank lends itself very well to neuromorphic hardware.
A. Implementation
The sPyNNaker framework lets developers implement custom neuron models in addition to those already provided in sPyNNaker. This allows us to keep all the benefits of sPyNNaker while at the same time defining our own logic in the neurons, implemented as C code. In this implementation, we first express a PageRank graph in terms of a population of interconnected neurons and then rewrite the neural interactions in C as a distributed, locally synchronous, globally asynchronous PageRank implementation using message passing.
a) Mapping PageRank to SpiNNaker: Simulating an SNN on SpiNNaker requires defining the neural network with all its properties before moving it to the device and running it. We implement a mapping that defines PageRank graphs in terms of an SNN. We map between the abstractions of PageRank and SpiNNaker/SNNs as follows:
PageRank Vertex: each vertex is represented by a neuron. sPyNNaker already allows for a neuron to maintain an internal state, so we reuse this structure to store the internal state of a vertex, such as its rank/weight. Reusing the neuron structure has the advantage that we can delegate resource usage planning, i.e., resource use (memory and others) of the neurons, to PACMAN which computes the best placement of neurons/vertices on SpiNNaker cores.
PageRank Directed Edge: edges are represented by a directed synaptic connection between two neurons. This is used by sPyNNaker to pre-compute all the multiplexing / demultiplexing metadata needed to distribute packets once they reach their destination core.
Vertex Rank: the rank/weight is represented by the membrane potential of a neuron. This is a voltage measurement at the core of SNNs, and support is already implemented to record this value. We reuse this implementation for vertex ranks, which can be recorded at each time step of the simulation to get a trace of the successive rank values assigned to each vertex.
This mapping has the advantage that we can reuse the tools (data structures etc.) provided by sPyNNaker to serialize and deserialize the data structures needed to move the graph to the device as well as the results back from it.
b) PageRank Neuron Model: At its core, the mapping defined before allows us to feed the inputs to PageRank and to move back the results (ranks) at the end of the simulation. With it, messaging between cores, i.e., the message processing facility (multiplexing / demultiplexing), can simply be reused and left untouched. Crucially, however, the logic of the neurons needs to be adapted to execute the PageRank computations. We map the neural components in the sPyNNaker C code to PageRank as follows:
Neuron Model: defines the state of a neuron. We redesign it as a vertex manager that defines the state of a vertex, comprising fixed parameters such as the damping factor (probability that a random user will click on another link) and state variables such as the current rank of the page.
Spike Processing: in the sPyNNaker implementation, neurons (or cores) process incoming spikes with no payload. We redesigned it to be a message processor supporting multicast messages with a payload. Our implementation handles packet reception, queries synapse records through DMA transfers and calls the synapse manager with the (routing) key, payload (rank) and buffered synapse_row (recipients).
Synapse Manager: used to handle the last message delivery step. We use it as a message dispatcher for vertices by removing the additional logic specific to SNNs.
Receiving messages in our implementation is handled via asynchronous callbacks. The code is executed to handle message processing and buffering once a message arrival event is triggered. This step of the computation is concerned with receiving incoming messages. First, an incoming packet is buffered by the message processor (step 1) while its associated recipient records (mapping routing key to recipient neuron/vertex) are loaded from SDRAM (step 2). Upon completion (step 3), the message dispatcher is called with the packet key/payload (green) and the recipient records (white) are used to distribute the message to the specified PageRank vertices (step 4).
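The four reception steps can be pictured with the following Python sketch. It is a simplified model under our own assumptions: the class MessageProcessor and its methods are hypothetical names, and the DMA fetch of recipient records from SDRAM is replaced by a plain dictionary lookup.

```python
from collections import defaultdict, deque


class MessageProcessor:
    """Hypothetical model of packet reception on a single core."""

    def __init__(self, recipient_table):
        # Maps a routing key to the local vertices that expect its payload.
        # This dictionary stands in for the recipient records kept in SDRAM.
        self.recipient_table = recipient_table
        self.pending = deque()
        self.inboxes = defaultdict(list)

    def on_packet(self, key, payload):
        # Step 1: buffer the incoming packet.
        self.pending.append((key, payload))

    def process_pending(self):
        while self.pending:
            key, payload = self.pending.popleft()
            # Steps 2 and 3: load the recipient records for this key
            # (modeled as a lookup instead of a DMA transfer).
            recipients = self.recipient_table.get(key, ())
            # Step 4: dispatch the rank to every recipient vertex.
            for vertex in recipients:
                self.inboxes[vertex].append(payload)


processor = MessageProcessor({0x0101: [3, 7]})
processor.on_packet(0x0101, 0.5)
processor.on_packet(0x0999, 0.1)  # no local recipients for this key
processor.process_pending()
```

A single packet is thus fanned out to all local vertices registered for its key, which is why one network message suffices per destination core.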
This message processing strategy is derived from the original spike processing logic in sPyNNaker and therefore benefits from the robustness and efficiency of the sPyNNaker implementation.
The code block executed synchronously at the end of the timer tick period, on the other hand, is used to compute the rank of pages as well as to send out messages along the edges of the core. All vertices managed by a core send their ranks to connected vertices. The vertex manager loops through its vertices and for each of them uses the Spin1 API to send a packet with a 32-bit routing key. The first 24 bits are computed by PACMAN and uniquely identify the core the message is sent from. SpiNNaker routers keep track of which core expects messages from which other cores and route the messages based on the sender. The remaining 8 bits correspond to a vertex identifier (in the range 0..255 in the figure) that is used for demultiplexing on the receiving core to dispatch the rank to the correct vertices.
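The key layout can be illustrated with simple bit operations. The sketch below assumes that the 24 core-identifying bits are the most significant bits of the key and the vertex identifier occupies the lowest 8 bits; the helper names make_key and split_key are hypothetical.

```python
VERTEX_BITS = 8
CORE_BITS = 24
VERTEX_MASK = (1 << VERTEX_BITS) - 1


def make_key(core_prefix, vertex_id):
    """Pack a 24-bit core prefix and an 8-bit vertex id into a 32-bit key."""
    if not 0 <= core_prefix < (1 << CORE_BITS):
        raise ValueError("core prefix does not fit in 24 bits")
    if not 0 <= vertex_id <= VERTEX_MASK:
        raise ValueError("vertex id does not fit in 8 bits")
    return (core_prefix << VERTEX_BITS) | vertex_id


def split_key(key):
    """Recover (core_prefix, vertex_id) from a 32-bit routing key."""
    return key >> VERTEX_BITS, key & VERTEX_MASK


key = make_key(0xABCDEF, 0x12)
```

Routing only needs the upper 24 bits, while the receiving core uses the lower 8 bits to pick the recipient records for the sending vertex.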
c) Locally Synchronous, Globally Asynchronous:
The execution of PageRank requires synchronization on some level. Neither the Spin1 API nor the SARK library, however, provides a way to globally synchronize all cores across the system.
In PageRank, the results are computed iteratively until convergence is reached. In a traditional single-threaded implementation this is simply achieved by executing each iteration after the previous one. In a traditional multi-threaded implementation, it is accomplished with a synchronization barrier that waits for all workers to complete an iteration before moving to the next iteration of PageRank. In SpiNNaker, there are no synchronization barriers, so we choose a locally synchronous, globally asynchronous approach.
Locally, each core waits for all its vertices to complete their iteration before moving to the next one, synchronously as a group. In practice, this is done by using the semaphore API, by having each vertex increase the semaphore at the beginning of the iteration and proceeding when all expected messages have been received (as many as there are incoming edges).
Globally, cores are not synchronized and progress at their own pace. As a result, some messages may arrive early, sent by cores that are some iterations ahead. To address this issue, all messages are tagged by the sender with an iteration number and the receiver maintains distinct per-iteration buffers. Consequently, early packets are buffered until the core executes the correct iteration. Because a core only moves to the next iteration when all its vertices have received all their expected messages, these cross-core dependencies ensure (particularly on dense networks) that all cores progress at approximately the same pace. Per-iteration buffers can potentially lead to a large number of buffers which need to be maintained, i.e., one for each iteration of the simulation. In practice, however, four per-iteration buffers are sufficient, i.e., execution lags at most four iterations behind.
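The buffering scheme can be sketched as a small ring of per-iteration buffers. The Python below is an illustration under our own assumptions, not the C code running on the device: the class IterationBuffers is hypothetical, it tracks a single vertex, and it rejects messages that are more than four iterations ahead instead of handling them.

```python
NUM_BUFFERS = 4  # number of per-iteration buffers reported as sufficient


class IterationBuffers:
    """Hypothetical per-iteration message buffers for one vertex."""

    def __init__(self, expected_messages):
        self.expected = expected_messages  # one message per incoming edge
        self.current = 0                   # iteration being executed locally
        self.slots = [[] for _ in range(NUM_BUFFERS)]

    def receive(self, iteration, payload):
        """Store a message tagged with the sender's iteration number."""
        lead = iteration - self.current
        if not 0 <= lead < NUM_BUFFERS:
            raise ValueError("message outside the buffered iteration window")
        self.slots[iteration % NUM_BUFFERS].append(payload)

    def try_advance(self):
        """Return the summed ranks once the current iteration is complete."""
        slot = self.slots[self.current % NUM_BUFFERS]
        if len(slot) < self.expected:
            return None  # still waiting for messages of this iteration
        total = sum(slot)
        slot.clear()     # the slot is reused four iterations later
        self.current += 1
        return total


buffers = IterationBuffers(expected_messages=2)
buffers.receive(0, 0.25)
buffers.receive(1, 0.5)           # early message from a faster core
first_try = buffers.try_advance() # iteration 0 is still incomplete
buffers.receive(0, 0.5)
first_sum = buffers.try_advance()
```

The early message for iteration 1 is kept in its own slot and does not contaminate the sum for iteration 0.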
Overall, this implementation achieves good performance speed-ups, which are discussed in Section IV, and allows PageRank to scale at a minimal overhead.
IV. EVALUATION
In the following we analyse the efficiency and scalability of our PageRank implementation on SpiNNaker. We compare it with the scalability on a traditional multicore CPU.
A. Experimental Setup
We use the following hardware, data and methodology to analyze our approach.
1) Hardware:
For the experiments, we use two different hardware devices. For development, testing and small-scale experiments, we use a SpiNNaker 102 board with 72 cores (4 chips) locally. To test scalability on a large scale, we use a 106 board with 1 million cores. For the comparison of the SpiNNaker based implementation of PageRank with that on traditional hardware, we use a dual-socket Intel Ivy Bridge server (IVB) with E5-2697 v2 processors, similar to the one used by the authors of [11] . Table I shows details about our test machine. Hyper-Threading and Turbo-boost are disabled through the BIOS. With both disabled, there are 24 cores in the system operating at a frequency of 2.7 GHz.
2) Data:
We generate synthetic graphs to test PageRank. We use a simple graph generation method that aims to maximize network usage and thus produces numerous edges (along which a large number of messages needs to be sent). For a graph size of N vertices, we create N × 10 edges to connect them. We do not generate dangling nodes. To generate an edge, we randomly select the source and target vertices and repeat the process until N × 10 distinct edges are generated. Because core-to-core communication is multiplexed, the network cost is the same when sending a message to one off-core vertex or to all vertices on that distant core. Therefore, for graph sizes that are mapped to tens of cores as in our experiments, each vertex will on average have outbound edges to 10 distinct off-core vertices and thus send 10 distinct network messages.
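A generator along these lines could look as follows in Python. The text does not state how dangling nodes are avoided or whether self-loops are allowed, so this sketch makes two assumptions of its own: it first gives every vertex one outgoing edge and it rejects self-loops. The function name generate_graph and the seed parameter are hypothetical.

```python
import random


def generate_graph(num_vertices, edges_per_vertex=10, seed=0):
    """Return num_vertices * edges_per_vertex distinct random directed edges."""
    if edges_per_vertex < 1 or edges_per_vertex > num_vertices - 1:
        raise ValueError("need 1 <= edges_per_vertex <= num_vertices - 1")
    target = num_vertices * edges_per_vertex
    rng = random.Random(seed)
    edges = set()

    # Assumption: avoid dangling vertices by giving each vertex one
    # outgoing edge to a random other vertex before the random fill.
    for src in range(num_vertices):
        dst = rng.randrange(num_vertices - 1)
        if dst >= src:
            dst += 1  # skip over src so that no self-loop is created
        edges.add((src, dst))

    # Randomly pick source and target until enough distinct edges exist.
    while len(edges) < target:
        src = rng.randrange(num_vertices)
        dst = rng.randrange(num_vertices)
        if src != dst:
            edges.add((src, dst))
    return sorted(edges)


graph = generate_graph(255)
```

For 255 vertices this yields 2,550 distinct edges, i.e., 2,550 messages per PageRank iteration before multiplexing.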
3) Experimental Methodology:
As we can only fit a limited number of vertices on each core, we need to use more cores to analyze bigger graphs. For experiments analyzing scalability, we increase the number of cores along with the size of the randomly generated graph. We always run 25 iterations of PageRank. We use 10 edges for each vertex to maximize network usage as one message needs to be sent per edge (and per iteration). Due to limitations of SpiNNaker related to addressing recipient vertices within a core as well as semaphores, each SpiNNaker core can manage at most 255 vertices (or, equivalently, at most 255 neurons). Each chip, therefore, manages 15 × 255 = 3,825 vertices (the remaining application core is used to re-inject dropped packets). The Spin3 board which we use for the tests contains 4 SpiNNaker chips, so the maximum number of vertices we can simulate with it is 4 × 3,825 = 15,300. The graphs, or more precisely their vertices, are randomly assigned to cores. Due to the fundamentally different characteristics of neuromorphic hardware compared to a traditional multicore CPU, we do not compare raw execution times; instead, we compare normalized execution times and the scaling trends of the two machines.
As SpiNNaker only supports fixed-point arithmetic operations, implementing PageRank on it is inherently less precise than comparable implementations using floating point operations. Consequently, to ensure results are precise, we use a fixed-point arithmetic library to represent floating point numbers on SpiNNaker. We verify the correctness of the results with a Python based implementation.
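The capacity figures follow directly from the per-core limit, as the short Python calculation below shows. The constant and function names are our own; the numbers are those given in the text.

```python
VERTICES_PER_CORE = 255     # per-core limit stated above
WORKER_CORES_PER_CHIP = 15  # 16 application cores minus 1 for re-injection
CHIPS_ON_BOARD = 4          # chips on the small board used locally


def cores_needed(num_vertices):
    """Smallest number of cores that can hold the given number of vertices."""
    return -(-num_vertices // VERTICES_PER_CORE)  # integer ceiling division


vertices_per_chip = WORKER_CORES_PER_CHIP * VERTICES_PER_CORE
vertices_per_board = CHIPS_ON_BOARD * vertices_per_chip
```

A graph with 256 vertices therefore already needs two cores, and the largest graph of 15,300 vertices occupies all 60 worker cores of the board.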
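To give an idea of what fixed-point arithmetic entails, the Python sketch below models numbers with 15 fractional bits. The choice of 15 fractional bits and the helper names are assumptions made for illustration; the text does not state which format the fixed-point library uses, and Python integers do not reproduce 32-bit overflow.

```python
FRACTION_BITS = 15        # assumed number of fractional bits
ONE = 1 << FRACTION_BITS  # fixed-point representation of 1.0


def to_fixed(value):
    """Convert a float to the nearest fixed-point integer."""
    return int(round(value * ONE))


def to_float(fixed):
    """Convert a fixed-point integer back to a float."""
    return fixed / ONE


def fixed_mul(a, b):
    """Multiply two fixed-point numbers, truncating extra fraction bits."""
    return (a * b) >> FRACTION_BITS


def fixed_div(a, b):
    """Divide two fixed-point numbers, keeping 15 fractional bits."""
    return (a << FRACTION_BITS) // b


damping = to_fixed(0.85)
share = fixed_div(to_fixed(1.0), to_fixed(4.0))  # rank 1.0 over 4 out-edges
```

The representable resolution in this format is 1/32768, which bounds the rounding error of each stored rank and explains why results need to be checked against a floating point reference.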
B. Parameter Considerations
For the simulation of SNNs, several parameters, like time scale factor, time step and others, need to be set to define the duration of the execution of the synchronous code block. For the execution of PageRank, however, this fine-grained configuration is not necessary. Instead, we fix all parameters bar one and use it, the time scale factor, to increase or decrease the time in which the synchronous code can be executed in each step/iteration of the simulation/execution of PageRank. We refer to this period of time as the time allowance.
Clearly, the bigger the time allowance, the more processing time a core has during each execution of the synchronous code. When SpiNNaker is not given enough time to process messages during an iteration, it starts dropping them. This can occur at two different layers: (a) at the network layer with router buffers overflowing or (b) at the core level. In a first experiment, we therefore analyze the time SpiNNaker needs to process one iteration so that the time allowance can be set such that no messages are dropped. In Figure 1 , we show an attempt to prevent message drops. We fill an entire chip (3,825 vertices) and increase the time allowance iteratively. One can see the transient behavior of messages getting dropped before a specific threshold of time allowance is reached. During this phase, the simulations also crash. Once the threshold is crossed, the transient behavior disappears and we can measure the computation time. We tune the time allowance in all our performance-related experiments.
C. Performance Analysis
In the next experiment, we study the basic scalability of PageRank on SpiNNaker. More precisely, we use a fixed size graph and vary the number of cores used for the execution. This is done by setting a constant at run-time that defines how many vertices of the graph should be mapped by PACMAN to a single core. The results in Figure 2a show that the greater the number of cores for the same graph, the shorter the execution time. However, this trend seems to taper off from 6 cores onwards. For example, splitting this 255-vertex graph between 4 cores, each managing 255/4 ≈ 64 vertices, reduces the running time by more than half. The results in Figure 2b show the same experiment but using a larger number of cores. When we increase the number of cores used for the simulation by a factor of 4 from 15 to 60, the execution time is only reduced by roughly 20% (from 590 ms down to 480 ms) whereas it decreased by over 50% in the first case. This is a result of (a) the network overhead of a larger graph and (b) the added cost of communication between chips. The latter was not needed previously as all cores used were located on the same chip.
D. Scalability Analysis
We next compare the scaling potential of PageRank on SpiNNaker versus an implementation running on a traditional multicore. We adopt the benchmark from the PRPACK multicore implementation [12] to run on increasingly bigger graphs and hardware: 255 vertices on 1 core, 510 vertices on 2 cores etc. up to 18 cores. We run the benchmark 100 times on a dual-socket server with 24 cores and average the execution times. The trace of execution times of both machines is normalized relative to their first execution time (for the smallest graph) so that both scaling curves have the same starting point. The normalized execution time makes the specific multicore CPU version less relevant as the scaling curve will have the same shape regardless of how fast it runs on a given CPU.
As the results in Figure 3 show, the implementations scale on a graph that gradually fills up an entire chip, that is 3,825 vertices and ten times as many edges. SpiNNaker shows smoother scaling and a significant speedup in terms of scaling trend. Scaling of the execution on the multicore is also smooth initially and the speedup grows to about 2× up to 12 cores. At 12 cores, however, it rapidly increases when the computation starts to span both NUMA sockets of the 24-core server, as is indicated by the sudden jump in the curve. A similar trend can be observed for SpiNNaker, although it is not as pronounced. Its scaling curve becomes slightly steeper after 15 cores when the computation uses two chips/processors. Overall, this evaluation confirms that PageRank on SpiNNaker and on a multicore have a similar scaling behavior apart from the point where the traditional multicore spans multiple NUMA sockets.
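The normalization step can be expressed in a few lines of Python. The function name normalize is our own and the timing values in the example are made up purely to illustrate the effect; they are not measurements from our experiments.

```python
def normalize(times):
    """Scale a trace of execution times so that its first entry becomes 1.0."""
    if not times or times[0] <= 0:
        raise ValueError("need a non-empty trace with a positive first entry")
    base = times[0]
    return [t / base for t in times]


# Made-up traces: the second machine is 100 times slower in absolute terms
# but follows the same scaling trend.
fast_machine = normalize([2.0, 3.0, 5.0])
slow_machine = normalize([200.0, 300.0, 500.0])
```

Both traces normalize to the same curve, which is why the absolute speed of a machine does not affect the comparison of scaling trends.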
E. Limitations
The graph size our implementation can process is limited by the size of the hardware used because, as discussed previously, only 255 vertices can be managed by one core. The largest graph that can run on the Spin3 board has 15,300 vertices. Further, we do not take into account the start-up time of simulations. The vast majority of this time is spent in the pre-processing stage, which includes constructing the data specification that is loaded on SpiNNaker. Only long-running algorithms will justify the upfront cost required by this pre-processing stage, and PageRank usually converges in a few hundred iterations, which are executed fast.
V. RELATED WORK
The earlier version of the examples shipped with the module has a SpiNNakerGraphFrontEnd based implementation of a heat equation solver [13] . This type of simulation is defined by a 2D grid where each cell iteratively propagates its heat to its neighbors according to a fixed set of rules. This essential requirement of state propagation between the nodes disqualifies sPyNNaker8. In [14] , the authors also use SpiNNakerGraphFrontEnd to execute Markov chain Monte Carlo inference on SpiNNaker. In another study [15] , the authors develop a key-value store on SpiNNaker and explore the energy efficiency of SpiNNaker devices by moving data between cores of the key-value store and aggressively using the power efficient wait for interrupt mode of the SpiNNaker CPU cores. Sugiarto et al. [16] proposed a method based on fine-grained parallelism to implement Sobel edge detection on SpiNNaker. James et al. [17] use a SpiNNaker machine to implement a model of the auditory nerve and the ear's inner hair cell, which is shown to be comparable with a MATLAB version of the same model algorithms but has better scalability.
VI. CONCLUSION
With our PageRank implementation, we demonstrate how massively parallel hardware with low-latency inter-process communication for small messages, designed for the execution of SNNs, can be used for other problems as well. With the specific example of PageRank, we show that we obtain smoother scalability than with traditional multicore CPUs where, once execution spans multiple sockets, communication through NUMA slows down execution considerably and adversely affects scalability.
Clearly, we picked PageRank because it requires a large number of small messages to be sent between cores and thus lends itself perfectly to this type of hardware. Still, the experimental results indicate the potential of neuromorphic hardware not only for PageRank but for the class of problems which require massive parallelism and the exchange of small messages.
