Abstract-We present a hardware architecture that uses the neural engineering framework (NEF) to implement large-scale neural networks on field programmable gate arrays (FPGAs) for performing massively parallel real-time pattern recognition. NEF is a framework that is capable of synthesising large-scale cognitive systems from subnetworks and we have previously presented an FPGA implementation of the NEF that successfully performs nonlinear mathematical computations. That work was developed based on a compact digital neural core, which consists of 64 neurons that are instantiated by a single physical neuron using a time-multiplexing approach. We have now scaled this approach up to build a pattern recognition system by combining identical neural cores together. As a proof of concept, we have developed a handwritten digit recognition system using the MNIST database and achieved a recognition rate of 96.55%. The system is implemented on a state-of-the-art FPGA and can process 5.12 million digits per second. The architecture and hardware optimisations presented offer a resource-efficient means of performing high-speed, neuromorphic, and massively parallel pattern recognition and classification tasks.
Index Terms-MNIST, neuromorphic engineering, neural engineering framework, pattern recognition, pseudoinverse, time-multiplexing.
I. INTRODUCTION
NEURAL networks have proven to be powerful tools for real-world tasks such as pattern recognition, classification, regression, and prediction. However, these methods are computationally demanding and are not ideally suited to modern computer architectures. This constraint has often prohibited their use in applications that need real-time control, such as interactive robotic systems. To solve this problem, scientists have been developing hardware implementations of conventional artificial neural networks over the past two decades [1]-[7]. Neuromorphic systems, inspired by biological nervous systems, have also achieved significant breakthroughs [8]-[19]. However, most of these systems (with the exception of the SpiNNaker system [13]) are not capable of efficiently synthesising large-scale neural networks for these real-world tasks from subnetworks and are therefore not very helpful for algorithmic development, as pointed out by Tapson et al. [20]. The main contribution of this paper is a neuromorphic architecture for hardware implementation of large-scale neural networks for massively parallel pattern recognition.
The Neural Engineering Framework (NEF) [21] , first introduced in 2003, is a framework that is capable of building large systems from subnetworks with a standard three-layer neural structure (the first layer contains the input neurons; the second layer is a hidden layer, which consists of a large number of non-linear neurons; and the third layer is the output layer, which consists of linear neurons). The NEF has been used to construct SPAUN, a brain model that is capable of solving cognitive tasks in a comparable way to how humans do this [22] . This demonstrates that the NEF is a powerful tool for synthesising large-scale cognitive systems.
We have previously presented a compact neural core architecture specifically for the FPGA implementation of large NEF networks [23]. In this paper, we present the architecture that uses this neural core to build a pattern recognition system. The outline of this paper is as follows: Section II-A introduces the basic concepts of the NEF; the algorithm and theory are presented in Section II-B; the hardware implementation is presented in Section II-C; the performance of different design choices is compared in Section III; and in Section IV, we conclude our work by comparing it with other solutions.
II. MATERIALS AND METHODS

A. Background
In this section, we review the theoretical framework for mapping computations onto heterogeneous populations of spiking neurons for a typical NEF system. The topology of the NEF network is illustrated in Fig. 1. A NEF network performs three tasks to calculate a desired function f(X).

Fig. 1. A typical NEF network. The stimulus X(t) is encoded into a large number of nonlinear hidden layer neurons N using randomly initialised connection weights. The output of the system, Y(t), is the linear sum of the weighted firing rates of the hidden layer neurons.
1) Encoding:
An encoder will have a fixed random weight (RW) for each hidden layer neuron, and multiplies the input stimulus by this weight. The firing rate of individual neurons is a nonlinear function of the input stimulus weighted by the random weights. The parameters of the neurons are also randomised, so that each neuron in the hidden layer exhibits a distinct tuning curve. An example of such tuning curves is shown in Fig. 2 .
2) Decoding: The activity, H, of the hidden neurons (i.e. the spike rate of each neuron) can be measured over the desired range of input values X. The output of each neuron will be multiplied by its decoding weights such that $HW = f(X) = Y$. Since this is a linear system, these weights can be found by calculating $W = H^{+}Y$, where $H^{+}$ is the Moore-Penrose pseudoinverse [24] of H.
3) Averaging: The output of the system, Y(t), is the linear sum of the weighted firing rates of the hidden layer neurons.
The three-layer feedforward network is similar to the Extreme Learning Machine (ELM) [27] known in the machine learning field.
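The encode/decode steps above can be illustrated with a small software sketch. This is not the authors' model: the rectified-linear tuning curves, the parameter ranges, the target function f(x) = x², and all variable names are assumptions chosen only to show how the decoding weights follow from W = H⁺Y.

```python
import numpy as np

def hidden_activity(X, enc_w, gain, bias):
    """Stand-in tuning curves (assumed rectified-linear, not the paper's rate neuron)."""
    return np.maximum(0.0, gain * (X @ enc_w) + bias)

rng = np.random.default_rng(0)
n_hidden = 200                                        # hypothetical hidden layer size
X = np.linspace(-1.0, 1.0, 101).reshape(-1, 1)        # sampled input range
enc_w = rng.choice([-1.0, 1.0], size=(1, n_hidden))   # fixed random encoders
gain = rng.uniform(0.5, 2.0, size=n_hidden)           # randomised neuron parameters
bias = rng.uniform(-1.0, 1.0, size=n_hidden)

H = hidden_activity(X, enc_w, gain, bias)   # hidden activity, shape (101, 200)
Y = X ** 2                                  # desired function f(X), assumed for illustration
W = np.linalg.pinv(H) @ Y                   # decoding weights, W = H^+ Y
Y_hat = H @ W                               # decoded output, linear sum of weighted rates
rmse = float(np.sqrt(np.mean((Y_hat - Y) ** 2)))
```

Only the decoding weights W are solved for; the encoders and neuron parameters stay at their random initial values, which is the property the NEF shares with the ELM.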
B. Algorithm and Theory
1) Methodology:
Recognition or classification of handwritten digits is a standard machine learning problem, and it has become a benchmark problem in the form of the MNIST dataset of handwritten digits [25]. As a proof of concept, we used the NEF to implement a system to recognise the MNIST digits (see Fig. 3). The focus of the work presented here is on the hardware implementation and techniques, and the dataset is used primarily to verify the approach and to provide a means to compare and validate against existing systems. The MNIST dataset is well suited to this, since it is well understood and characterised, and has been used by many other authors. It should be noted, however, that the techniques and optimisations presented here can be applied to other pattern recognition applications as well.

Fig. 3. System topology. The inputs are the pixels; they are connected to a higher-dimensional hidden layer with 8 k neurons, using randomly weighted connections. The output layer consists of linear neurons and the output layer weights are solved analytically using the pseudoinverse operation.
The digit recognition system uses the three-layer feedforward neural network described in the previous section. It consists of 784 input layer neurons (pixels), 8 k hidden layer neurons and ten output layer neurons. The input layer neurons are connected to the hidden layer neurons using randomly weighted all-to-all connections. The hidden layer neurons are also connected to the output layer neurons using all-to-all connections but with the weights calculated using a pseudoinverse operation. The response of the output layer neurons is given by:

$$Y = HW$$

where H is the output of the hidden layer neurons for each input digit, W is the decoding weight and Y represents the corresponding value of the input digit. The weights can be obtained by calculating $W = H^{+}Y$, where $H^{+}$ is the pseudoinverse of H and is computed on a computer.
Compared to our previous work [26] , we have made three major modifications: the grey-scale pixels in the input images of MNIST were replaced by black & white (binary) pixels; tanh neurons in the hidden layer were replaced by rate neurons (see Fig. 2 ); and 64-bit floating-point numbers for the decoding weights were replaced by 6-bit fixed-point numbers. We will address these modifications in the following sections.
2) Modelling: Our aim is to develop a fast pattern recognition system implemented on hardware and running in real time, rather than a system with the lowest test error. Thus, we have adopted a hardware-driven method to implement our system, which will achieve the best trade-off between performance and hardware resources. This method first considers the hardware constraints, and then optimises all the building blocks.
For FPGA implementations, there will be a significant difference in hardware cost (logic gates) between fixed-point and floating-point implementations, as the latter requires many more digital signal processor units (DSPs). More importantly, each floating-point number is represented by 64 bits, which would lead to a large data storage requirement and become a bottleneck for the system. Thus, we have implemented our system using fixed-point fractional numbers ([−1, 1) for weights and neuron input, [0, 1] for hidden neuron output).
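As an illustration of the fixed-point fractional representation, the following sketch quantises a value to a signed fraction in [−1, 1). The 6-bit default matches the decoding weight width stated earlier; the function name, the round-to-nearest step, and the saturation behaviour are assumptions, since the rounding scheme is not specified here.

```python
import numpy as np

def quantise(x, bits=6):
    """Map x to a signed fixed-point fraction in [-1, 1) using `bits` total bits.

    Assumed scheme: scale by 2**(bits-1), round to the nearest integer code
    (NumPy rounds exact halves to the nearest even code), then saturate.
    """
    scale = 2 ** (bits - 1)                 # 32 codes per unit for 6 bits
    q = np.round(np.asarray(x, dtype=float) * scale)
    q = np.clip(q, -scale, scale - 1)       # representable codes: -32 .. 31
    return q / scale
```

With 6 bits the step size is 1/32, so the largest representable value is 31/32 and any input at or above 1 saturates to it.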
Before implementing the design on hardware, we have modelled our system using the fixed-point representation in Python, which is a popular software programming language. This software model performs exactly the same as the hardware implementation. This will ensure that the software and the hardware results are the same, and will avoid any performance drop or malfunctioning of the system in hardware due to conversion from software to hardware. The models presented in the remaining part of this section were all software models unless otherwise specified.
3) Input Layer: The input layer reads digits from the MNIST database and maps them onto the input layer pixels (one-by-one). This task consists of not only converting the dimension from 28 × 28 to 784 × 1, but also converting the grey-scale value (an 8-bit number that ranges from 0 to 255) of the pixels to a binary value. The latter is a major difference between our system and existing solutions [2], [3], [6], [17], [28]. This conversion reduces the hardware cost significantly (see Section II-C2) while resulting in only a small performance loss (see Section III-A). There are many different methods to convert grey-scale images to a binary representation, such as thresholding and normalisation. However, since our focus is on the hardware implementation here, we use the simplest approach and convert any non-zero values in the image to 1. It is expected that more complex conversion methods may produce better results, especially when tuned to specific datasets.
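The input conversion described above can be sketched in a few lines. The function name `binarise` is hypothetical; the rule itself (flatten 28 × 28 to 784 and map every non-zero grey value to 1) is the one stated in the text.

```python
import numpy as np

def binarise(image):
    """Flatten a 28 x 28 grey-scale digit to 784 values and set non-zero pixels to 1."""
    img = np.asarray(image)
    return (img.reshape(-1) != 0).astype(np.uint8)
```

A threshold other than zero could be substituted in the comparison if a dataset-specific conversion were preferred.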
With the input values limited to 0 or 1, each hidden layer neuron simply receives as input the sum of the random weights between it and the non-zero pixels. For verification of our hardware system, the random weights used in the software and in the hardware models should be the same and produce identical results. One option is to use a look-up table (LUT) in the FPGA to store the random weights generated by the software model. The major drawback of this solution is that it requires a significant amount of memory, which scales linearly with the number of input neurons and hidden layer neurons. For FPGA implementations, the most efficient way to generate random numbers is to use linear feedback shift registers (LFSRs), as we have previously used to implement a randomly weighted all-to-all connectivity in a spiking neural network [29]. Based on that work, we have developed an encoder that uses LFSRs to perform the nonlinear projection. We use the same LFSR encoder in software to ensure that the random weights are identical in both implementations. We have highly optimised the encoder for hardware implementation, and details of this will be presented in Section II-C.
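The observation that binary inputs turn the weighted sum into a plain selection can be checked numerically. In this sketch the weights come from NumPy rather than from LFSRs, and the signed 5-bit range and the four example neurons are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
pixels = rng.integers(0, 2, size=784)             # binary input digit (random stand-in)
weights = rng.integers(-16, 16, size=(784, 4))    # assumed signed 5-bit weights, 4 neurons

stim_mul = pixels @ weights                       # multiply-accumulate formulation
stim_sel = weights[pixels == 1].sum(axis=0)       # select-and-add formulation, no multipliers
```

Both formulations give the same stimulus, which is why the encoder needs no multipliers once the pixels are binary.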
4) Hidden Layer:
The standard building block of the hidden layer is a compact neural core that consists of 64 neurons and is capable of performing non-linear mathematical computations as presented in [23] . By using the NEF, we can easily combine multiple identical neural cores together to build the hidden layer since each neural core has the same set of known tuning curves and this method allows future improvements without changing the hardware.
The hidden layer was implemented with 128 identical neural cores, for a total of 8 k neurons and 8 k × (784 + 10) ≈ 6.5 M synaptic connections. This hidden layer size has achieved the best trade-off between performance and memory usage as shown in Section III-B. Given an input image, the encoder will generate, via the random weight projection, a different Vin for each neuron in each core, even if each core contains identical neurons, because the random weights will be different.
C. Hardware Implementation
1) Topology:
To efficiently implement the system on an FPGA, we use a time-multiplexing approach [12], [15], [29]-[35], which leverages the high clock speed of the digital circuit. State-of-the-art FPGAs can easily run at a clock speed of 266 MHz (clock period 3.75 ns). Therefore, we can time-multiplex a single physical neuron to simulate many virtual neurons [29] such that up to 256 k virtual neurons can be simulated, each one updated every millisecond. We refer to these neurons as time-multiplexed (TM) neurons. This means that on every clock cycle, a TM neuron will be processed and each TM neuron is updated every 256 k/266 MHz ≈ 943 µs. A sub-millisecond resolution is generally acceptable for real-time control, such as interactive robotic systems.
The time-multiplexing approach is, however, constrained by its data storage requirement since the on-chip SRAM is limited in size (usually only tens of MBs). Due to bandwidth constraints it is difficult to use off-chip memory with the time-multiplexing approach, as new values need to be available from memory every clock cycle to provide real-time simulation. Furthermore, the architecture of the system will be more complex when using offchip memory because it needs a dedicated memory controller. Nevertheless, using off-chip memory promises the ability to implement much larger networks and we will investigate this option for future designs. However, we chose to use on-chip memory for the current work to keep the architecture simple.
As our system is still only a proof-of-concept, we have also used only one clock domain and the only peripheral connection used is the JTAG interface, via which the computed weights are loaded into the FPGA.

Fig. 4 shows the topology of the FPGA implementation of the system with an input layer (the encoder), a hidden layer with 128 neural cores, and an output layer with 10 neurons. The encoder and the hidden layer are both implemented to use time-multiplexing and Fig. 4(b) shows their internal structure. It consists of a physical encoder, a physical neuron, a global counter and a weight buffer. The global counter processes the time-multiplexed (TM) encoders and neurons sequentially. The decoding weights of the physical neuron are stored in the weight buffer. For simplicity, let us assume that each TM encoder and TM neuron are processed in only one clock cycle. This means that in every clock cycle, a TM encoder will generate the stimulus for an input digit, and the corresponding TM neuron will generate a firing rate with that stimulus and then multiply it with the decoding weights.

The decoding weights are obtained by calculating the pseudoinverse of H using our online pseudoinverse update method (OPIUM) [26], which is an incremental method. We have also developed simplified versions of OPIUM, such as OPIUM lite [26] and SOL [36], which are fast online methods for calculating an approximation to the pseudoinverse.
The input digit remains available until all the TM neurons finish their processing. The output of every TM neuron will be ten weighted firing rates, each of which will be accumulated by its corresponding output neuron. Using a pipelined architecture, the result from calculating one time step for a TM encoder and neuron only has to be available just before the turn of that TM encoder and TM neuron comes around again. The above description assumes that it only takes one clock cycle to process one TM encoder and TM neuron, but this timing requirement is quite difficult to meet in a practical design. We will address this issue in detail in next section.
2) Physical Encoder: The encoder will generate a uniformly distributed random weight for each pixel of the input digit, and then sum these weighted pixels to generate the stimulus for each neuron in the hidden layer. We have pre-processed the input digit by converting the grey-scale value of each pixel to a binary value. This saves significant hardware resources in the FPGA, since otherwise we would need 784 multipliers to compute the multiplication between all pixels and their corresponding random weights. Each binary pixel is used to control a 2-input multiplexer: one input is connected to the corresponding random weight and the other is tied to zero. If the value of a pixel is high, the corresponding random weight will be accumulated for the generation of the stimulus for one hidden layer neuron.
The major challenge in implementing the encoder in hardware using the time-multiplexing approach is to meet the timing requirement. We need to sum all the 784 weighted pixels in 3.75 ns, since each TM neuron needs to be processed in one clock cycle. Moreover, this operation will require 784 adders, which will cost a significant amount of hardware resources. The introduction of pipelines will mitigate the critical timing requirement, but will need even more adders. As a compromise we chose to process each TM encoder and TM neuron in a time slot of four clock cycles. So the encoder will perform this sum operation in four cycles, each of which will sum 784/4 = 196 weighted pixels. This modification not only mitigates the critical timing requirement, but also reduces the number of adders that are needed. The price paid is that the time-multiplexing rate has to be divided by four. Hence, we can only time-multiplex 64 k neurons rather than 256 k neurons in 1 ms.
The complete system has a 13-stage pipeline that runs without halting. This means that each TM neuron will access each computing module, such as the parallel adder and the multipliers (described in the next section), for up to one time slot of 4 clock cycles but at different clock cycles. On any one clock cycle, the parallel adder and the multipliers are all in use, but by different TM neurons. The overhead of the pipeline is negligible: 26 clock cycles, the first 13 cycles for setting up the pipeline and the last 13 cycles for waiting for the last TM neuron to finish computing.
Since a regular LFSR will cycle through all its possible values, its output will at times be unbalanced (the numbers of 0's and 1's are not approximately the same), which will in turn make some of the hidden layer neurons always generate either low or maximum activations. This would affect the performance of the classifier significantly. To generate more balanced random weights, we use multiple small LFSRs, each of which generates a random number. The probability for these small LFSRs to be all 0's or 1's simultaneously is then negligible.

When a new input digit arrives, it is stored in the input buffer. In each time slot, the global counter sends that stored digit to the multiplexers to generate the weighted pixels. The lowest 196 bits are sent in the first clock cycle (of that time slot) and the highest 196 bits in the fourth clock cycle, and the two other sets in between.
Each RW generator generates a 20-bit random number, which is divided into four 5-bit random numbers. Hence, 49 RW generators will provide in total 49 × 4 = 196 5-bit random weights; each is sent to its corresponding multiplexer. All these LFSRs will reload their own initial seed on the arrival of a new input digit. After that, each LFSR keeps generating random numbers until the next input digit arrives. In this way, we can guarantee that the encoder will generate the exact same set of random weights (for all incoming digits) using a given seed. This "on the fly" generation scheme reduces the usage of the memory significantly, as there is no longer any requirement for storing the random weights; only the seeds need to be stored.
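The reseed-and-regenerate idea can be sketched in software as follows. This is a simplification under stated assumptions: each RW generator is modelled as a single 20-bit Fibonacci LFSR rather than the combination of small LFSRs used in the design, and the tap positions (20, 17), the seed values, and all function names are hypothetical, since the polynomial and seeds are not given here.

```python
def lfsr20_step(state):
    """Advance an assumed 20-bit Fibonacci LFSR (taps at bits 20 and 17) by one step."""
    bit = ((state >> 19) ^ (state >> 16)) & 1
    return ((state << 1) | bit) & 0xFFFFF

def split_weights(word):
    """Split a 20-bit word into four 5-bit fields, least significant field first."""
    return [(word >> (5 * i)) & 0x1F for i in range(4)]

def rw_bank_step(states):
    """Step a bank of generators once; returns the new states and 4 weights per generator."""
    new_states = [lfsr20_step(s) for s in states]
    weights = [w for s in new_states for w in split_weights(s)]
    return new_states, weights

seeds = list(range(1, 50))                  # 49 non-zero seeds (hypothetical values)
_, weights_a = rw_bank_step(seeds)          # weights for the first clock cycle of a digit
_, weights_b = rw_bank_step(seeds)          # after reloading the same seeds
```

Because the generators are reloaded from the same seeds for every digit, the two weight lists are identical, and only the 49 seeds would need to be stored.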
The accumulator module sums the 784 weighted pixels (in four clock cycles) for generating the input to that TM neuron. A naive implementation would need a 196-input 5-bit parallel adder and create a large delay (∼20 ns). To mitigate this critical timing requirement, we use a 2-stage pipeline, which consists of fourteen 14-input 5-bit parallel adders and one 14-input 9-bit parallel adder. Since this is a pipelined design, the input for each TM neuron is still being generated every time slot, but with a latency of two clock cycles.
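The two-stage accumulation can be modelled as a pair of nested sums. The sketch below is only a behavioural model with hypothetical function names; it treats the 5-bit weights as unsigned magnitudes, which is an assumption, and under that assumption a first-stage partial sum is at most 14 × 31 = 434, which fits in 9 bits.

```python
def stage_one(values):
    """First pipeline stage: fourteen 14-input adders, one per group of 14 weighted pixels."""
    if len(values) != 196:
        raise ValueError("expected 196 weighted pixels per clock cycle")
    return [sum(values[i:i + 14]) for i in range(0, 196, 14)]

def two_stage_sum(values):
    """Second pipeline stage: one 14-input adder over the fourteen partial sums."""
    return sum(stage_one(values))
```

The result equals a direct 196-input sum; the split only shortens the critical path at the cost of one extra cycle of latency per stage.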
3) Physical Neuron: The rate neuron computes its output (F_rate) from its input (Stim) and its index in the core (N_index), as shown in Fig. 6. Neither of these requires memory access; memory is accessed only to read the decoding weights. The neuron then multiplies F_rate with ten decoding weights (for the ten output neurons). A naïve implementation would instantiate ten identical neurons, each with one decoding weight (for each output neuron), and would cost 10 multipliers. The whole operation requires 11 multiplications: one to compute F_rate and ten for the decoding weights. Since the time slot consists of four clock cycles, we can distribute these 11 multiplications over the four clock cycles so that only ⌈11/4⌉ = 3 multipliers are needed. Based on this strategy, the neuron has been efficiently implemented with three identical 9-bit multipliers as shown in Fig. 6. The number of implementable multipliers is usually one of the bottlenecks of large-scale FPGA/ASIC design. The multiplier's inputs A and B are 9 bits wide and the output result is 18 bits wide. All three multipliers need four clock cycles to process the algorithm. For multiplier (0), the first cycle computes F_rate, which is a 7-bit number, by multiplying N_index and T, which is a Boolean function of Stim and N_index [23]; the second cycle latches F_rate at input A of the multiplier; the third and fourth cycles multiply F_rate with decoding weights (0) and (1), respectively. For multiplier (1), the first, second, third and fourth cycles multiply F_rate with decoding weights (2), (3), (4) and (5), respectively. For multiplier (2), the first, second, third and fourth cycles multiply F_rate with decoding weights (6), (7), (8) and (9), respectively. Again, since it is a pipelined design, the output of each TM neuron is updated only once in its time slot (with a latency of four clock cycles).
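The multiplier schedule described above can be written down as a small table and checked in software. This is a behavioural sketch only: the names `SCHEDULE` and `decode_products` are hypothetical, `None` marks the two cycles in which multiplier (0) computes and latches F_rate, and pipeline timing between multipliers is not modelled.

```python
import math

# Decoding weight index handled by each multiplier in each of the four cycles.
SCHEDULE = {
    0: [None, None, 0, 1],   # cycle 1 computes F_rate, cycle 2 latches it
    1: [2, 3, 4, 5],
    2: [6, 7, 8, 9],
}

N_MULTIPLICATIONS = 1 + 10                         # F_rate plus ten decoding weights
N_MULTIPLIERS = math.ceil(N_MULTIPLICATIONS / 4)   # four cycles per time slot

def decode_products(f_rate, weights):
    """Return the ten weighted firing rates produced in one time slot."""
    products = [None] * 10
    for slots in SCHEDULE.values():
        for w_idx in slots:
            if w_idx is not None:
                products[w_idx] = f_rate * weights[w_idx]
    return products
```

Every decoding weight appears exactly once in the table, so three multipliers suffice for the eleven multiplications of one time slot.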
4) Output Layer:
The output layer consists of ten neurons (see Fig. 4 ) that linearly sum the results of all the 8 k TM (hidden) neurons. In a time-multiplexed system, this sum is just an accumulation of the outputs of the TM neurons of each time slot. Hence, the implementation of each output neuron will only need a register and an adder. When all the 8 k neurons have been processed, the index of the output neuron with the maximum value will be sent out as the result, which will indicate the most likely input digit. After that, the values of the ten output neurons will be cleared.
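A behavioural sketch of the output layer follows: ten accumulators are updated once per TM neuron, and the index of the largest accumulator is reported after the last neuron. The function name and the use of floating-point NumPy arrays are assumptions; the hardware uses fixed-point registers and adders.

```python
import numpy as np

def classify(firing_rates, decoding_weights):
    """Accumulate weighted firing rates neuron by neuron and return the winning digit.

    firing_rates: shape (n_hidden,); decoding_weights: shape (n_hidden, 10).
    """
    acc = np.zeros(10)                       # one register per output neuron
    for rate, w in zip(firing_rates, decoding_weights):
        acc += rate * w                      # one accumulation per time slot
    return int(np.argmax(acc))               # accumulators are then cleared for the next digit
```

The loop is equivalent to a single matrix product `firing_rates @ decoding_weights`; it is written sequentially here to mirror the time-multiplexed accumulation.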
5) Utilisation:
The system was developed using the standard ASIC design flow, and can thus be easily implemented with state-of-the-art manufacturing technologies, should an integrated circuit implementation be desired. We have successfully implemented 128 of the proposed neural cores, yielding 8 k neurons, on an Altera Cyclone V FPGA (on a Terasic Cyclone GX starter kit). The design uses less than 6% of the hardware resources (with the exception of the RAMs; see Table I). Note that this utilisation table includes the circuits that carry out other tasks such as the JTAG interface.
III. RESULTS
The results presented here will focus on how different design choices affect the performance of the proposed system, keeping in mind our goal is to develop a hardware system running in real time, rather than exploiting an algorithm that is as accurate as possible. The performance results were obtained using the full test set of 10 000 handwritten digits after training on the full 60 000 digit training set, unless otherwise specified. The results presented in Section III-A and Section III-B were obtained using the software (Python) models, while results presented in Section III-C were obtained from the hardware implementation.
A. Comparison Across Different Configurations
We investigated the effects of the three modifications that we have made using four configurations:
Configuration 1: is the configuration used in our previous work [26] with grey-scale images, tanh hidden neurons and floating-point output weights.
Configuration 2: uses black and white images.
Configuration 3: uses black and white images and rate neurons.
Configuration 4: uses black and white images, rate neurons, and fixed-point output weights.

The hidden layer consisted of 8 k neurons in all four configurations. For each configuration, 100 test runs were conducted, each with a different random seed. The same set of 100 seeds was used for all four configurations, so that the encoder would generate the same 100 sets of random weights. Since the goal of this exercise was simply to investigate the impact of the three modifications on performance, rather than to find the best possible performance, we only used OPIUM lite to calculate the decoding weights and the test error. It significantly reduces the simulation time needed for these tests and still provides a fair comparison between the four configurations. However, OPIUM lite finds an approximate solution to the pseudo-inverse, which reduces the performance of the classifier by about 1% compared to the pseudo-inverse solution to the output weights.
We first investigated the effect of using binary values in the input layer. We compared the performance result of the same classifiers using grey-scale images and binary images (see Fig. 7 ).
The top two panels show a histogram of the number of errors (misclassifications) out of 10 000 test patterns obtained for the 100 test runs. Given the skewed nature of the two error distributions, rather than simply reporting p-values to indicate the statistical significance of this difference, we have chosen to display the full distribution here and follow the method by Kruschke [37] to analyse it. The same set of 100 random weight vectors was used for each configuration, so that we can determine the paired difference between the number of errors made in the two configurations using the same weight vectors, shown as a histogram in Fig. 7(c) . We then modelled the distribution of the difference in errors using a non-central T-distribution, which is optimal for modelling distributions that are approximately Gaussian but contain outliers. We followed the Bayesian estimation method using Markov Chain Monte Carlo (MCMC) simulation [37] . Our MCMC generates a statistical distribution of 100 000 fits of the non-central T-distribution to the data, in our case to the paired differences in error. Fig. 7(d) shows the histogram of the mean values of these T-distributions, and the red curves in Fig. 7(c) show 50 examples of the T-distribution with parameters taken at random from the Markov Chain as in [37] .
From the distribution of the mean value for the difference data [see Fig. 7(d) ], we can see that configuration 2 results on average in 59.5 more errors on the 10 000 test digits. If we define a difference of 10 or fewer errors as a region of practical equivalence (ROPE), or, in other words, we consider as insignificant a change of 10 or fewer errors out of 10 000 tests, i.e., a change of less than 0.1%, we note that the 95% highest density interval (HDI) of the distribution of the mean of the difference of errors is outside the ROPE, and therefore we conclude that changing the input images from grey-scale to binary values results in a small but significant increase in error of around 0.6% for the MNIST database using this classifier.
Next, we investigated the effect of using the rate neurons in the hidden layer. The distribution of errors for this configuration (configuration 3) is shown in Fig. 8(a). This should be compared with configuration 2 [see Fig. 7(b)]; their paired difference is shown in Fig. 8(b). Fig. 8(c) shows the distribution of the mean of the difference in errors between configuration 3 and configuration 2. It shows that changing from tanh neurons to rate neurons increases the error by approximately 0.19%. However, this difference is not strongly significant, as the 95% HDI is not entirely outside the ROPE, indicating that a difference within the region of practical equivalence is amongst the possible mean values.

Finally, we investigated the effect of using limited-resolution decoding weights. Fig. 9(a) shows the distribution of errors for this configuration and the difference between configuration 3 and configuration 4 is close to zero [see Fig. 9(b)]. In fact, the distribution of the mean of the error difference is entirely within the ROPE, indicating that, somewhat surprisingly, there is no significant loss in performance when using 6-bit fixed-point output weights instead of floating-point weights.
The performance difference between configuration 1 and configuration 4 was merely 0.8%. We can therefore conclude that, in this digit recognition system, the modifications that we made achieved significant reductions in terms of hardware cost with a minimal drop in performance. 
B. Size of the Hidden Layer
Next, we used configuration 4 from the previous section and changed the hidden layer size in the range from 1 k to 16 k neurons. For each size, ten test runs (each with a different random seed) were conducted. Again, to reduce the testing time, we used OPIUM lite to calculate the decoding weights and then calculate the test error.
The median error over 10 runs (see Fig. 10 ) for the hidden layer with 1 k, 2 k, 4 k, 8 k, 12 k and 16 k neurons was 14.5%, 10.4%, 6.96%, 5.01%, 4.47% and 4.33% respectively. It is clear that the error decreases with the number of hidden layer neurons, although with a diminishing return. This is a common observation in single hidden layer neural networks, where performance saturates when the hidden layer is larger than a certain size [38] . The hardware cost of a hidden neuron is almost negligible in our system due to the use of time-multiplexing. However, the memory required by the decoding weights is linearly proportional to the size of the hidden layer and this becomes a bottleneck in the system. To achieve a good balance between the desired accuracy and memory, we chose to implement the hidden layer with 8 k rather than 16 k neurons.
C. System Performance
To explore the best performance that the proposed system can achieve, 1000 runs were carried out for configuration 4 using different random seeds. The lowest error rates achieved with the lite and full versions of OPIUM were 4.52% and 3.45%, respectively. The decoding weights, obtained by calculating the pseudo-inverse (using OPIUM) and converting them to 6-bit numbers, were loaded into the FPGA board for real-time digit recognition. The pixels of input digits were converted to binary values in software and a Python-based front-end client software sent the selected test digit to the FPGA via the JTAG interface. We have successfully performed a live demo of this system at the Telluride Cognitive Neuromorphic Engineering workshop 2014.
Since the system runs at 266 MHz and the hidden layer contains 8 k neurons, each of which has a time slot of four clock cycles, the processing time for one input digit is 8 k × 4/266 MHz ≈ 120 µs, yielding 1 s/120 µs ≈ 8 k digit recognitions per second. Because our system used only 8 k of the 64 k neurons in a single TM neuron layer, the maximum number of digit recognitions that can be processed by one TM neuron layer is ∼64 k per second. The system used less than 6% of the hardware resources (with the exception of the RAMs), so multiple TM neuron layers can be instantiated to run in parallel. It is therefore practical to scale this system to perform millions of digit recognitions per second.
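The arithmetic above can be reproduced with a few lines of code. This is a minimal sketch that assumes "k" denotes 1024 and, as an interpretation of the text, treats the unused capacity of a TM neuron layer as additional parallel copies of the 8 k network; the function names are hypothetical.

```python
# Sketch of the latency and throughput arithmetic for one TM neuron layer.
# Assumptions: "k" = 1024, and a layer's throughput scales with the number of
# 8 k networks that fit into its 64 k neurons (an interpretation of the text).
K = 1024
CLOCK_HZ = 266e6          # system clock, from the text
CYCLES_PER_NEURON = 4     # time slot per TM neuron, from the text


def digit_latency_s(hidden_neurons, clock_hz=CLOCK_HZ, cycles=CYCLES_PER_NEURON):
    """Time to step through every hidden neuron once for one input digit."""
    return hidden_neurons * cycles / clock_hz


def digits_per_second(hidden_neurons, layer_capacity):
    """Throughput of one TM layer holding as many networks as will fit."""
    parallel_networks = layer_capacity // hidden_neurons
    return parallel_networks / digit_latency_s(hidden_neurons)


if __name__ == "__main__":
    latency = digit_latency_s(8 * K)
    print(f"latency per digit: {latency * 1e6:.1f} us")
    print(f"8 k network alone: {digits_per_second(8 * K, 8 * K):.0f} digits/s")
    print(f"full 64 k layer:   {digits_per_second(8 * K, 64 * K):.0f} digits/s")
```

With k = 1024 the latency comes out slightly above 120 µs (about 123 µs); taking k = 1000 instead gives almost exactly 120 µs, so the rounded figures in the text hold either way.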
IV. CONCLUSION
Our system embodies the biologically-inspired principles of neuromorphic computing but, unlike the work in [39]-[42], does not make use of the conventional asynchronous protocols. The focus of those systems is on scalable, parallelised and power-efficient hardware, rather than on achieving cutting-edge accuracy on existing datasets. Similarly, we are primarily interested in the trade-off between scale, performance, and hardware cost, and we present below a comparison of our work with some of the existing hardware solutions.
Although the classification system presented in this work makes use of binary input images, these images are generated from the original MNIST dataset. As the conversion discards the grey-scale values, the results achieved using the full MNIST dataset form an upper bound on the accuracy expected of the binary system. It is therefore possible to compare the accuracy of this work with that of existing hardware implementations that use the MNIST dataset.
The state-of-the-art artificial neural networks for pattern recognition are deep neural networks with multiple hidden layers, each of which is trained using a back-propagation approach [25]. This approach is heavily optimised for the single goal of achieving the lowest error rate, such that "near-human performance" on the MNIST database has been achieved in [43]. The work reported in [44] even outperformed human accuracy. Deep neural networks generally outperform our approach in terms of recognition accuracy. The reason for this is that our approach has far fewer variables, even with a large hidden layer. Our approach uses random weights, rather than learned weights, in the input layer. These are generated on the fly and thus require very little memory for storage, whereas deep neural networks store all the weights (of each layer) as they are trained by the back-propagation approach. Moreover, deep neural networks use feature extraction layers that cluster and extract features from data, whereas our system is feature-less and can thus be easily configured for different input data without needing to adapt feature extraction layers. For real-world problems, the pre-processing of the raw data is vital. With proper pre-processing methods, this type of feature-less network has been successfully used for performing complicated tasks [44]-[48], including biomedical tasks [47]-[50].
Hardware implementations of conventional artificial neural networks are generally DSP-driven due to their high computational demands. A recent work [6] achieved an error rate of 0.54% with a processing time of about 8 µs, which is 15 times faster than our system. However, its hardware cost is also very high: 3599 DSPs, 2.4 Mbit of SRAM and more than half a million logic gates (see Table II). These hardware resources, which are only available on high-end FPGAs, are hundreds to thousands of times more than those (logic gates and DSPs) used by our approach. Another state-of-the-art system was presented in [2] and [3]. The authors implemented a deep neural network on a Xilinx Virtex-4 SX35 FPGA and achieved an error rate of 5% on the MNIST dataset. Regarding its hardware cost, it used all 192 DSPs on that FPGA. Neither the logic gate and memory usage nor the precise processing speed (for the MNIST dataset) was reported. As its throughput is less than 30 fps, we can assume that its processing delay is in the millisecond range, whereas our system needs only 120 µs. The IBM TrueNorth system is a general-purpose system for building large-scale neural networks running in real time [10]. When it was programmed for digit recognition, it achieved an error rate of 8.06% on the 10 000-digit MNIST test set with 13 cores, each of which consisted of 256 TM spiking neurons and needed ∼96 kbit of memory [51]. Hence, our system achieved a much lower error rate using significantly fewer hardware resources, especially memory (see Table II). Regarding the processing speed, their system needs 20 time steps (each of 1 ms) to process one digit, whereas our system needs only 120 µs (an approximately 167-fold speed-up). The TrueNorth system, however, has many more applications besides pattern recognition tasks, such as the simulation of large-scale spiking neural networks.
The Minitaur, which is an event-based neural network accelerator, achieved an error rate of 8% on a deep spiking network with only 1785 neurons [28]. Since the scheme it used is a variant of the time-multiplexing approach, which needs only very few neurons to be physically implemented, the cost of a single neuron is also negligible, but the bottleneck is again the memory. Each of the neurons used by the Minitaur needs 73 bits of memory and each connection weight needs 16 bits. In contrast, our neuron needs only 60 bits of memory for the decoding weights. The processing time of the Minitaur for one digit is 0.152 s (see Table II), which is approximately 1300 times slower than our system. Most of the speed-up in our system comes from reducing the number of fetch operations by using the random weights. Moreover, MNIST digit recognition is not a natural task for the Minitaur, as it is aimed at implementing different spiking neural networks using event-based sensors as input. It is therefore not optimised for this dataset, which results in its slower performance.
Our future work will focus on scaling up the network such that it will be able to process more difficult input patterns. Our design is scalable due to its fully digital implementation. The number of TM hidden neurons implemented by a single physical neuron will increase linearly with the amount of available memory, as long as the multiplexing scale keeps the time resolution acceptable for real-time control. The number of physical neurons will increase linearly with the number of available logic gates. As the bottleneck of the TM approach is the memory bandwidth, the strategy for achieving a large system is to reduce the memory usage, for example by using low-precision or even binary weights.
The programmability of the FPGA, especially of the decoding weights, makes the integration of the system with the desired pattern recognition applications seamless. However, the advantages of running large-scale networks in real time are strongly reduced if computing the decoding weights for such networks takes a long time. Hence, another major improvement is to speed up this computationally intensive task. One promising solution is to implement the OPIUM lite learning algorithm on the FPGA, since this algorithm is an adaptation procedure that does not require hundreds of gigabytes of RAM and is well suited to hardware implementation. In other words, it has no bulk memory requirement, whereas implementing the full OPIUM would be more complex, as it requires more memory and computation. Considering the small difference in performance, we will implement OPIUM lite in our follow-up work. Running OPIUM lite in real time makes it possible to upgrade the system to a true turnkey solution for pattern recognition in the real world. As our system makes use of binary input images, another important part of our future work is to investigate the effect of different binarisation methods.
