Abstract. This paper presents a novel method for implementing Factor Graphs on a SpiNNaker neural computing system. The SpiNNaker system provides resources for fine-grained parallelism and is designed for distributed computing. We present a framework that uses the available SpiNNaker resources to implement a discrete Factor Graph, a powerful graphical model for probabilistic inference. Our framework maps and routes a Factor Graph on the SpiNNaker hardware using SpiNNaker's event-based communication system. We give an example application of the proposed framework in a real-world robotics scenario; the result shows that the framework can handle a computation of 26.14 MFLOPS in only 30.5 ms. We demonstrate that the framework extends easily to larger Factor Graph networks on a bigger SpiNNaker system, which makes it suitable for complex and challenging computational intelligence tasks.
Introduction
A Factor Graph (FG) is a popular graphical model for probabilistic inference. It is designed to complement existing models such as the Bayesian Network and the Markov Random Field, and it provides a convenient mechanism for transforming between those models [1] [2]. From a probabilistic perspective, an FG is an appropriate model for representing the factorization of joint as well as conditional probabilities. The standard FG is a bipartite graph, since it is composed of two different types of node: variable nodes and factor nodes. To perform exact inference on an FG (i.e. to compute marginal probabilities), one usually uses belief propagation, implemented as a message passing algorithm. The FG is a powerful tool for probabilistic inference in many machine learning and signal processing applications [3].
The SpiNNaker (Spiking Neural Network Architecture) system is a distributed computing system designed to simulate spiking neural networks [4]. It is developed by the Advanced Processor Technologies (APT) Research Group at the University of Manchester (UK). The SpiNNaker system is composed of many SpiNNaker chips in a torus network, which allows the simulation of thousands of artificial neurons in real time. Each chip is a multi-core system consisting of 18 ARM968-based cores together with several internetworking elements and supporting modules. Compared with using a standard PC (or even a mainframe) to simulate spiking neural networks, the SpiNNaker system offers benefits such as smaller size and lower power consumption. Hence it is a very promising platform for mobile robot applications.
FG-based inference can be used to build control applications in robotics. Here we use our previous work on kinematic control of a mobile robot using an FG as an example case for our new framework [8]. In an FG, each variable can represent a belief about a specific robot state, and the variables exchange information by sending messages to each other to update the overall belief. In this work, we implement an FG on a SpiNNaker system and demonstrate its promising performance, especially in comparison with an implementation on a standard computer (PC), as we believe that conventional PCs are particularly poorly matched to instantiating such graphical networks. This paper does not describe in detail the FG performance for robot kinematic control, since this has been presented in [8]; rather, we focus on the development of the embedded FG framework. Our contribution in this research is the SpiNNaker-FG framework. It is comparable to "PACMAN" [6], the framework for emulating spiking neural networks on SpiNNaker hardware, in the sense that it provides mapping and configuration functionality for FG-based applications implemented on a SpiNNaker system.
Design and Specifications
Design Consideration
Deploying a program on dedicated hardware, especially hardware with intrinsic parallelism, requires different treatment and explicit consideration [9] [10]. The same challenge applies to our SpiNNaker-FG framework. We develop our framework according to the following criteria.
• Scalability: The framework should be able to work with a variable number of chips, allowing us to resize the networks.
• Flexibility: The framework should be flexible enough to be reconfigured for many general purpose applications without too much modification in the framework.
• Cross-boundary: The framework should be able to connect the separated elements of the factor graph seamlessly.
In the following sub-sections we describe in more detail how our proposed framework meets these criteria.
SpiNNaker Infrastructure
The SpiNNaker chip contains interconnected microcontrollers (ARM968) with a specific routing mechanism. This routing mechanism involves several modules inside the chip and a special look-up table which is maintained by a Packet Router. The key feature of the chip lies in this mechanism: by properly configuring the Packet Router, the developer can create an efficient, massively distributed computing system [5]. Several communication protocols are available to an application program. In this work, two of them are used for developing the embedded FG: the neural event multicast (MC) and the SpiNNaker Datagram Protocol (SDP). MC packets are used for transferring "messages" between nodes, and SDP packets are used for communication between the FG in the SpiNNaker system and the host PC.
Four elements need to be specified in advance when implementing an FG on a SpiNNaker system: the SpiNNaker cores which handle the nodes (variable or factor nodes), the routing mechanism which transfers messages from node to node, the memory layout for vectors such as messages and local functions, and the conversion mechanism between real-valued data and its discrete probabilistic representation. In this work, we use a SpiNN-3 board, which consists of four SpiNNaker chips, and develop the mapping and routing framework, which we call SpiNNaker-FG.
SpiNNaker-FG Building Block
There are two important aspects of an FG that require distributed computation. The first is the state distribution of input/output values and the second is the sum-product computation of the message passing algorithm.
Neurons Population Mapping
In this work, following our previous work, we use the population coding principle to discretize the input/output values [7]. The chip "0,0" (see Fig. 1b, box colored in green) is used for this discretization. The rest of the chips are used for distributing the nodes and the sum-product computing engine (see Fig. 1b, boxes colored in pink).
Fig. 1. The SpiNN-3 board (a) and its chip layout (b). The chip "0,0" is chosen as the population encoder since it has a direct Ethernet connection to the external system.
In the theory of population coding, a group of homogeneous neurons generates spikes in synchrony and produces a distribution that is specific to the input stimulus.
Although the SpiNNaker system was originally designed to emulate spiking neurons, we do not use this emulation mechanism, since we are not interested in neuron-by-neuron spike generation. Instead, our SpiNNaker-FG only uses SpiNNaker's abundant resources to implement population coding principles over fully connected homogeneous networks, as described in [7]. Fig. 2 shows how population coding with a Gaussian response is mapped onto SpiNNaker cores.
Fig. 2. Mapping a neuron population onto SpiNNaker cores in one chip (note: the white and the black cores are reserved for the SpiNNaker kernel)
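As a minimal sketch of population coding with a Gaussian response, the following Python fragment discretizes a real value into a normalized distribution over a fixed number of states. The function name, the value range and the tuning width `sigma` are hypothetical choices made for illustration; the actual encoding is the one described in [7].

```python
import math

def population_encode(value, lo, hi, n_states, sigma):
    """Discretize a real value into a normalized Gaussian response.

    Each of the n_states "neurons" has a preferred value (center) spread
    evenly over [lo, hi]; its response falls off as a Gaussian of the
    distance between the input value and that center.
    """
    step = (hi - lo) / (n_states - 1)
    centers = [lo + i * step for i in range(n_states)]
    responses = [math.exp(-0.5 * ((value - c) / sigma) ** 2) for c in centers]
    total = sum(responses)
    # Normalize so the responses can be read as a discrete belief.
    return centers, [r / total for r in responses]

# Hypothetical example: a velocity of 0.25 encoded over 60 states.
centers, belief = population_encode(0.25, lo=-1.0, hi=1.0, n_states=60, sigma=0.1)
peak = belief.index(max(belief))
```

The resulting vector `belief` plays the role of the discretized value that is later passed to the FG as a message.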
The mapping shown in Fig. 2 uses 15 cores, which are controlled by core "1"; core "1" also acts as an I/O port for sending and receiving data to/from external devices (e.g. a robot or the host PC) via SDP. Cores "0" and "17" are used by the SpiNNaker kernel for monitoring; neither core can be used by an application program.
When data arrives from the external device, core "1" distributes the data to the other cores (except cores "0" and "17") as MC packets, and those cores immediately start the partition process to discretize the data and store the result internally. Later on (or when requested), they transfer the discretized value to the other chips through links "0", "1" and "2" of chip "0,0", so that those chips can use it for FG inference. Conversely, when core "1" receives a message (a vector) from the other chips, it splits the vector and distributes it to cores "2" to "16", and those cores start computing the expected value using the mechanism explained in [7].
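The decoding path described above can be sketched as follows. This is an illustration only: it assumes that the expected value is the probability-weighted mean of the preferred values, and that the vector is split into contiguous slices of near-equal size, one per worker core. Neither assumption is taken from [7], and all function names are hypothetical.

```python
def split_for_cores(vector, n_cores=15):
    """Split a vector into n_cores contiguous (offset, chunk) slices."""
    base, extra = divmod(len(vector), n_cores)
    chunks, start = [], 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        chunks.append((start, vector[start:start + size]))
        start += size
    return chunks

def partial_expectation(offset, chunk, centers):
    """Work of one worker core: weighted sum over its own slice."""
    return sum(p * centers[offset + i] for i, p in enumerate(chunk))

def decode(vector, centers, n_cores=15):
    """Work of the controlling core: split, then combine the partial sums."""
    chunks = split_for_cores(vector, n_cores)
    return sum(partial_expectation(off, ch, centers) for off, ch in chunks)

# Hypothetical example: a uniform belief over 60 states with centers 0..59.
centers = list(range(60))
belief = [1.0 / 60] * 60
estimate = decode(belief, centers)
```

With 60 states and 15 worker cores, each core handles four states, so the per-core workload stays constant as long as the number of states is a multiple of the number of worker cores.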
FG-Nodes Mapping
In our SpiNNaker-FG, every application core in a chip can be assigned as either a factor node or a variable node. We also define a Region as a subset of an FG that can be mapped efficiently onto one SpiNNaker chip. A Region may contain one or more factor nodes together with as many of their associated variable nodes as possible. An example of this Region splitting is shown in Fig. 3. The design constraint is that, as far as possible, all associated variable nodes should reside on the same chip as their factor node. The reason is that we want to minimize the traffic overhead of "messages" in the Region and to balance the load between cores. We envision further improvement of this load distribution in future work (see section 4).
In Region-1 in Fig. 3, the factors F_A and F_B occupy only one core each, since these factors are essentially inputs for node A and node B in the case that both nodes are observed. These factors also have no vector value; their local function is simply '1'. The factor F_ABD, however, occupies 10 cores, since it is the only factor in the Region that requires the heavy computation of (2), due to its links to three nodes. The factors F_A and F_B can also be assigned the task of communicating with the chip "0,0" (see Fig. 1b), both to get the input and to send the message out to the chip "0,0" before it is forwarded to the external system (e.g. the robot). The only computation that might be performed by A and B is marginalization, hence we assign each of these nodes only two cores. If the application does not require marginalization in nodes A and B, they can use only one core each. The local function of F_ABD is stored in the internal SDRAM.
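The core allocation of Region-1 can be checked against the per-chip budget with a short sketch. The budget of 16 application cores follows from the 18 cores per chip minus the two kernel cores; the dictionary layout and the function name are hypothetical and are not part of the framework's actual interface.

```python
# 18 cores per chip, minus cores "0" and "17" reserved for the kernel.
APP_CORES_PER_CHIP = 16

# Core counts for Region-1 as described in the text.
region_1 = {"F_A": 1, "F_B": 1, "F_ABD": 10, "A": 2, "B": 2}

def check_region(allocation, budget=APP_CORES_PER_CHIP):
    """Return the number of free cores, or raise if the Region does not fit."""
    used = sum(allocation.values())
    if used > budget:
        raise ValueError(f"Region needs {used} cores but only {budget} are available")
    return budget - used

free_cores = check_region(region_1)
```

For Region-1 the allocation uses the chip's application cores exactly, which is why dropping marginalization in A and B frees cores that can be handed to busier nodes.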
In Region-2, the nodes C and E also occupy two cores each, since they might need to compute their marginals (if they do not, they can be reduced to one core each and the remaining cores assigned to the "busier" nodes). The node D, the factor F_CD and the factor F_DE each occupy four cores, since they compute message products intensively. As in Region-1, the local functions of F_CD and F_DE are stored in the internal SDRAM of the chip. Node D resides in Region-2 but also receives messages from F_ABD in Region-1, so it is reached across the chip boundary in the following way. If Region-1 is placed on the chip "0,1" and Region-2 on the chip "1,0" (see Fig. 1b), then the routing table entry for the output of F_ABD towards node D is assigned link "1" of the chip "0,1" and, correspondingly, the routing table entry for the input of node D from F_ABD must be assigned link "4" of the chip "1,0".
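The pairing of an outgoing link with the corresponding incoming link can be sketched as below. The sketch assumes that the six inter-chip links are numbered 0 to 5 in the order east, north-east, north, west, south-west, south, and that the four chips of the board form a 2x2 torus with wrap-around. Both points are assumptions of this illustration and should be checked against the board documentation; the function names are hypothetical.

```python
# Assumed link numbering: 0=E, 1=NE, 2=N, 3=W, 4=SW, 5=S, as (dx, dy) offsets.
LINK_OFFSETS = {0: (1, 0), 1: (1, 1), 2: (0, 1),
                3: (-1, 0), 4: (-1, -1), 5: (0, -1)}

def neighbour(chip, link, width=2, height=2):
    """Chip reached from `chip` through `link` on a width x height torus."""
    x, y = chip
    dx, dy = LINK_OFFSETS[link]
    return ((x + dx) % width, (y + dy) % height)

def opposite_link(link):
    """Link on which the receiving chip sees the packet arrive."""
    return (link + 3) % 6

# Region-1 on chip "0,1" sends the output of F_ABD through link 1.
source_chip, out_link = (0, 1), 1
target_chip = neighbour(source_chip, out_link)
in_link = opposite_link(out_link)
```

Under these assumptions the sketch yields chip "1,0" and incoming link "4", which matches the assignment given in the text.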
Mapping and Routing Factor Graph in the SpiNNaker
An FG is a bipartite graph and it has two types of node: variable nodes and factor nodes. With this bipartite nature, using FG for inference in a message passing mechanism means that there will be two types of message: variable to factor node message (as expressed in (1)) and factor to variable node message (as expressed in (2)).
These messages are encoded and sent using the multicast (MC) mechanism, as the payload of the corresponding MC packet. Specific to factor nodes, f_F in (2) is the local function of the corresponding factor node. In a discrete FG it usually takes the form of a vector. This vector function has to be provided before the FG executes its inference, and it is normally learned off-line.
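For reference, in the standard sum-product formulation the two message types take the following form. The neighbourhood notation N(·) and the symbol names are assumptions of this sketch and may differ from the notation of the original equations (1) and (2).

```latex
\begin{align}
\mu_{x \to F}(x) &= \prod_{G \in N(x) \setminus \{F\}} \mu_{G \to x}(x) \tag{1} \\
\mu_{F \to x}(x) &= \sum_{X_F \setminus \{x\}} f_F(X_F) \prod_{y \in N(F) \setminus \{x\}} \mu_{y \to F}(y) \tag{2}
\end{align}
```

Here X_F denotes the set of variables attached to the factor F, and the sum in (2) runs over all of them except x.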
Regarding the routing mechanism, each node maintains its own routing table and registers it only once (due to SpiNNaker's constraints). Each node also has its own input-output matrix, which reflects its neighborhood and records which neighbors have already sent a message and which messages are still pending. This is important, since the node computes the outgoing message only when all neighboring nodes have sent their messages.
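The bookkeeping described above can be sketched for a variable node as follows. The class is a hypothetical stand-in for the per-node input-output matrix: it records which neighbors have delivered a message and releases an outgoing message only once none is pending. The outgoing message is computed as the element-wise product of the messages from all neighbors other than the target, which is the standard variable-to-factor rule and is assumed here.

```python
class VariableNode:
    """Minimal variable node that waits for all neighbors before sending."""

    def __init__(self, name, neighbours, n_states):
        self.name = name
        self.n_states = n_states
        # None marks a message that is still pending.
        self.inbox = {nb: None for nb in neighbours}

    def receive(self, sender, message):
        if sender not in self.inbox:
            raise KeyError(f"{sender} is not a neighbour of {self.name}")
        self.inbox[sender] = list(message)

    def ready(self):
        return all(m is not None for m in self.inbox.values())

    def outgoing(self, target):
        """Message to `target`, or None while any input is still pending."""
        if not self.ready():
            return None
        out = [1.0] * self.n_states
        for sender, msg in self.inbox.items():
            if sender != target:
                out = [o * m for o, m in zip(out, msg)]
        return out

# Hypothetical example with node D and its three neighboring factors.
node_d = VariableNode("D", ["F_ABD", "F_CD", "F_DE"], n_states=2)
node_d.receive("F_ABD", [0.5, 0.5])
node_d.receive("F_CD", [0.2, 0.8])
before = node_d.outgoing("F_DE")   # one message is still pending
node_d.receive("F_DE", [0.9, 0.1])
after = node_d.outgoing("F_DE")    # product of the other two messages
```

Waiting for every neighbor is stricter than the rule needs for a single outgoing message, but it follows the behavior stated in the text and keeps the per-node state simple.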
Example Application
As the test case for our new SpiNNaker-FG, we use the scenario from our previous work [8]. In this example, an FG model for kinematic control of an omnidirectional mobile robot is developed. The task is to compute the correct robot command given the desired translational and rotational velocities. The model has been trained using data from a camera tracking system which provides the absolute pose of the robot. This example should be viewed as a proof of concept which demonstrates a small subset of the features of our proposed SpiNNaker-FG framework.
The robot (see Fig. 4a) has three wheels, and the complete FG model of the robot involves at least 12 nodes. As explained in [8], the model is broken down into three similar networks; the kinematics model for each wheel is shown in Fig. 4b. This decomposition also makes it easier to fit the model into three Regions. The models are implemented on the chip "0,1" for wheel-1, on the chip "1,0" for wheel-2, and on the chip "1,1" for wheel-3 (see Fig. 1b). We use maximum likelihood estimation (MLE) to train the network (e.g. to update the factor f_XYR). After the training has been completed, the vector value of the factor f_XYR is sent to the SpiNNaker system via SDP.
To evaluate the performance of our embedded FG, we send the desired velocities of the robot (represented as factors f_X, f_Y and f_R in Fig. 4b) and observe the motor command computed by the model (represented as node M_2, which reflects the motor command for the second wheel of the robot in Fig. 4a). We measure the time needed to complete one such inference to see how effective the proposed parallelism strategy is. The result is shown in Table 1. As expected, the execution time grows linearly with the number of states per node; notably, for the highest number of states in the scenario, the system needs just 30.5 ms to complete one full inference computation.
Using 60 states, the system performs 26.14 MFLOPS over one complete cycle (from the discretization to the final message decoding), which is fast considering that each core runs at only 150 MHz and has no dedicated floating-point unit. For comparison, our previous work, which used a standard PC with an Intel i5 processor at 3.30 GHz and 16 GB of DDR3 memory running at 1.3 GHz, takes 5 ms to complete one full inference computation. The SpiNNaker-FG also offers two additional important advantages:
1. The SpiNNaker version consumes much less energy than the PC implementation.
2. If we increase the problem size so that the network in Fig. 4b is replicated three times using the remaining chips on the SpiNNaker board, the execution time in Table 1 remains the same for this particular example, whereas on a PC it takes three times as long.
These advantages show that our SpiNNaker-FG framework has very promising features for future real robotics applications.
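The off-line training step mentioned in this section can be illustrated with a minimal sketch. For a discrete factor, maximum likelihood estimation reduces to normalized co-occurrence counts of the discretized training samples. The function name, the sample layout (one tuple of state indices per observation) and the tiny two-state example are hypothetical; the training procedure actually used is the one reported in [8].

```python
from collections import Counter
from itertools import product

def mle_factor(samples, n_states, n_vars):
    """Estimate a discrete factor table as normalized co-occurrence counts.

    samples: list of tuples, each holding one state index per variable.
    Returns a dict that maps every joint state to its relative frequency.
    """
    if not samples:
        raise ValueError("at least one training sample is required")
    counts = Counter(samples)
    total = len(samples)
    return {idx: counts[idx] / total
            for idx in product(range(n_states), repeat=n_vars)}

# Hypothetical training data: three variables with two states each.
samples = [(0, 1, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
table = mle_factor(samples, n_states=2, n_vars=3)
```

Flattened in a fixed index order, such a table corresponds to the vector value of a factor that is sent to the SpiNNaker system before inference starts.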
Result and Discussion
This paper describes a new implementation strategy for a Factor Graph on a SpiNNaker neural computing system. In this work, we explore one possible configuration of a SpiNNaker chip as a Region: we fit the Region with arbitrary nodes and split the CPU cores accordingly. Another possible configuration, which can increase the computational efficiency, is to introduce a generic Region that contains only a small number of nodes (e.g. three variable nodes and one factor node). This is similar to the idea of the binary DAG used in [10]. For example, the network in Fig. 4b can be decomposed into the network shown in Fig. 5a. Although this introduces hidden nodes which need to be learned beforehand, the sum-product algorithm will run faster because it has fewer items to process. This is preferable for future implementations on a bigger SpiNNaker system.
We are currently also targeting the SpiNN-4 board, which has 48 SpiNNaker chips (see Fig. 5b). At this stage, we are still developing an application test case with which we can demonstrate the applicability of our SpiNNaker-FG framework to more complex scenarios on that massive SpiNNaker system. As a further extension of the framework, we will use our SpiNNaker interface board [9] so that the SpiNNaker system can be used for real-time robotics control as a standalone application.
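The benefit of decomposing a large factor into generic three-variable factors can be estimated with a short calculation. The sketch assumes that the cost of the sum-product step is proportional to the number of entries in the factor table, that a hypothetical four-variable factor is replaced by two three-variable factors joined through one hidden node, and that the hidden node has the same number of states as the other nodes. None of these figures is taken from Fig. 4b or Fig. 5a.

```python
def factor_table_size(n_states, n_vars):
    """Number of entries in the local function of a discrete factor."""
    return n_states ** n_vars

N_STATES = 60  # highest number of states used in the example application

# One factor over four variables versus two factors over three variables each.
full = factor_table_size(N_STATES, 4)
split = 2 * factor_table_size(N_STATES, 3)
reduction = full // split
```

Under these assumptions the decomposed network touches 30 times fewer table entries per inference, at the price of learning the hidden node beforehand.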
