Abstract-This paper introduces a spiking hierarchical model for object recognition which utilizes the precise timing information inherently present in the output of biologically inspired asynchronous Address Event Representation (AER) vision sensors. The asynchronous nature of these systems frees computation and communication from the rigid predetermined timing enforced by system clocks in conventional systems. Freedom from rigid timing constraints opens the possibility of using true timing to our advantage in computation. We show not only how timing can be used in object recognition, but also how it can in fact simplify computation. Specifically, we rely on a simple temporal winner-take-all rather than the more computationally intensive synchronous operations typically used in biologically inspired neural networks for object recognition. This approach to visual computation represents a major paradigm shift from conventional clocked systems and can find application in other sensory modalities and computational tasks. We showcase the effectiveness of the approach by achieving the highest reported accuracy to date (97.5%±3.5%) for a previously published four-class card pip recognition task and an accuracy of 84.9%±1.9% for a new, more difficult 36-class character recognition task.
I. INTRODUCTION
This paper tackles the problem of object recognition using a hierarchical Spiking Neural Network (SNN) structure. We present a model developed for object recognition, which we have called HFirst. The name arises because the approach extensively relies on the first spike received during computation to implement a non-linear pooling operation, which is typically required by frame-based Convolutional Neural Networks (CNNs).
We rely on the biological observation that strongly activated neurons tend to fire first [1], [2]. In particular, we focus on the relative timing of spikes across neurons, namely the order in which neurons fire. We will argue that such a scheme allows us to derive temporal features that are particularly suited for robust and rapid object recognition at a very low computational cost. Existing work on artificial neural networks tends to assume a predetermined timing which is completely independent of the processing taking place. This prohibits these artificial NNs from using time in their computation. However, the timing of communication (spikes) in biological networks is known to be very important. Much like biological networks, in this paper we exploit spike timing to our advantage in computation. More specifically, we rely on the time at which a spike is received to implement a simple non-linear operation which replaces the more computationally intensive maximum operation typically used in non-spiking neural networks for visual processing.
Artificial Neural Networks (NNs), of which CNNs are a subset, have successfully been used in many applications, including signal and image processing [3] , [4] , and pattern recognition [5] , while hardware acceleration of such models allows real-time operation on megapixel resolution video [6] . Although CNN models are argued to be biologically inspired, their artificial implementations are typically far removed from biological neural networks, most of which consist of spiking neurons.
Spiking Neural Networks (SNNs) have received a lot of attention recently as new, more efficient computing technologies are sought as conventional CMOS technology approaches its fundamental limits. SNNs have the potential to achieve very high power efficiency. This is not a claim that we provide our own evidence for, but is rather based on observations of power consumption in biology (the human brain consumes only 20W) and recent works which present SNNs on chip with impressive power efficiency. Examples include Neurogrid [7] and IBM's TrueNorth [8], which can simulate 1 million spiking neurons while consuming under 100mW. In this paper we address the question of how SNNs can be used for visual object recognition.
Modern reconfigurable custom SNN hardware platforms can implement hundreds of thousands to millions of spiking neurons in parallel. Examples of these hardware implementation projects include the Integrate and Fire Array Transceiver (IFAT) [9], Hierarchical AER-IFAT [10], BrainScaleS [11], Spiking Neural Network Architecture (SpiNNaker) [12], Neurogrid [7], Qualcomm's Zeroth Processor, and IBM's TrueNorth [8] (fabricated with Samsung).
In parallel with these hardware platforms, software platforms for neural computation have emerged, including the Neural Engineering Framework (NEF) [13] , Brian [14] , and PyNN [15] , many of which can be used to configure the hardware platforms previously mentioned. Continued interest and funding from the European Union's Human Brain Project [16] and the USA's Brain Research through Advancing Innovative Neurotechnologies (BRAIN) project [17] will drive development of such systems for years to come.
As neural simulation hardware matures, so must the algorithms and architectures which can take advantage of this hardware. However, it does not necessarily make sense to directly convert existing computer vision models and algorithms (which process traditional frame based data) to SNN implementation. A central concept within SNNs is that spike timing encodes information, but frames do not contain precise timing information. The timing of the arrival of frames is purely a function of the front end sensor and is completely independent of the scene or stimuli present. In order for a SNN to exploit precise timing, it must operate on data which contains precise timing information and not spike timings artificially generated from frame-based outputs. To obtain visual data with precise timing, we turn our attention to asynchronous AER vision sensors, sometimes referred to as "silicon retinae" [18] , [19] . These sensors more closely match the operation of biological retina and do not utilize frames.
Asynchronous AER vision sensors have seen much improvement since their introduction in the early 1990s by Mahowald [20] . Modern change detection AER sensors reliably provide information on changes of illumination at the focal plane over a wide dynamic range and under a variety of lighting conditions. The pixels within such sensors each contain a circuit which continuously performs local analog computation to detect the occurrence and time of changes in intensity for that particular pixel. This computation at the focal plane is a form of redundancy suppression, ensuring that pixels only output data when new information is present (barring some background noise). Furthermore, the time of arrival of data from the sensor accurately represents when the intensity change occurred. Under test conditions sub-microsecond accuracy is achieved, versus accuracy on the order of milliseconds for fast frame-based cameras. This temporal accuracy provides precise spike timing information which can be exploited by a SNN.
Much as SNNs are a more accurate approximation of biological processing hardware, AER vision sensors are a more accurate approximation of the biological retina. The single bit of data provided by a pixel can be likened to a neural spike, and much like a biological retina, the AER sensor performs computation at the focal plane. Notable examples of spiking AER vision sensors include the earliest examples of spiking silicon retinae by Culurciello et al. [18] and Zaghloul et al. [19], as well as the more recent Dynamic Vision Sensor (DVS) from Delbruck [21], the sensitive DVS from Linares-Barranco [22], and the Asynchronous Time-based Image Sensor (ATIS) from Posch [23]. Operation of these sensors will be discussed in Section II. For a review of asynchronous event-based vision sensors see Delbruck et al. [24].
With the emergence of these asynchronous vision sensors, many researchers have taken an interest in processing their data in a manner which takes advantage of the asynchronous, high temporal resolution, and sparse representation of the scene they provide. Models of early visual area V1, including saliency, attention, foveation, and recognition [25] - [27] have been implemented by combining the reconfigurable IFAT system [28] with the Octopus silicon retina [18] . More recent focuses in the field include stereo vision [29] - [31] , motion estimation [32] , [33] , tracking [34] , and more object recognition works [35] - [37] . Further information on neuromorphic sensory systems can be found in Liu and Delbruck [38] .
In this paper we focus on the task of object recognition. The most similar recent works include a VLSI implementation of the HMAX model [39] , [40] for recognition which uses spiking neurons throughout [41] . The VLSI spiking HMAX implementation computes all the functions required by HMAX, but operates on 24×24 pixel images, limited by the number of available neurons, and does not run real-time. Adaptations of frame-based CNN techniques for training SNNs and implementing them in FPGA have also been recently presented [42] , including a recent PAMI paper [35] which presented a high speed card pip recognition task which we also tackle in this paper as a comparison to existing works.
In this paper we present our SNN architecture dubbed "HFirst", which takes advantage of timing information provided by AER sensors. A key aspect is that our architecture uses spike timing to encode the strength of neuron activation, with stronger activated neurons spiking earlier. This enables us to implement a MAX operation using a simple temporal Winner-Take-All (WTA) rather than performing a synchronous MAX operation as is typically done in frame-based algorithms [39] . Unlike the frame-based MAX operation, which outputs a number representing the strength of the strongest input, the temporal WTA can only output a spike, but by responding with low latency to its inputs, the temporal WTA preserves the time encoding of signal strength. It should be noted that other methods of implementing a MAX operation in spikes have been presented previously [27] .
Masquelier et al. [43] also use a temporal WTA, but their approach focuses on static images and spike generation from these images is artificially simulated, whereas we use AER vision sensors [21] - [23] to directly capture data from dynamic scenes for recognition. Additionally, Masquelier et al. require their network to be reset before a second object can be recognised, whereas HFirst operates on streaming "video" and can recognise multiple objects in sequence, or even simultaneously.
The HFirst model described here can be used with many of the available AER change detection sensors, and could be implemented on one of many neural processing platforms. For this particular work we analysed HFirst in simulation using a combination of C and Matlab on a desktop PC. Once simulated, the SNN was implemented in real-time on a Xilinx Spartan 6 XC6SLX150-2 FPGA.
The rest of this paper is organized as follows. In the next section we briefly describe the event-based vision sensors, then we describe the neuron model using spike timing for computation in Section III. The HFirst architecture is described in Section IV, followed by brief analysis of the required computation and real-time implementation. Testing and results are then presented to showcase the model accuracy before wrapping up with discussions and conclusions. 
II. ASYNCHRONOUS CHANGE DETECTION VISION SENSORS
Neuromorphic, event-based vision sensors are a novel type of vision sensor driven by changes within the visual scene, much like the human retina, and differ from conventional image sensors, which use artificial timing to control information acquisition. The sensors used in this paper [21] - [23] consist of autonomous pixels, each asynchronously generating spike events that encode relative changes in illumination. These sensors capture visual information at a much higher temporal resolution than conventional vision sensors, achieving accuracy down to sub-microsecond levels under optimal conditions. Moreover, since the pixels only detect temporal changes, temporally redundant information is not captured or communicated, resulting in a sparse representation of the scene. Captured events are transmitted asynchronously by the sensor in the form of continuous-time digital words containing the address of the activated pixel using the AER protocol [20].
To better understand the operation of these sensors we will briefly provide a formulation to approximate the sensor response to visual stimuli. Let us define $I(u, v, t)$ as the intensity of a pixel located at $[u, v]^T$, where $u$ and $v$ are spatial co-ordinates in units of pixels. Each pixel of the sensor asynchronously generates events at the precise time when the change in the log of the pixel illumination $\Delta \log(I(u, v, t))$ since the last event is larger than a certain threshold $\Delta I$, as shown in Fig. 1(a) and (b). The logarithmic relation means the pixels respond to percentage changes in illumination rather than the absolute magnitude of the change. This allows pixels to operate over a very wide dynamic range (≥120dB).
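The thresholding rule above can be illustrated with a minimal sketch for a single pixel. It assumes a sampled intensity trace (a real sensor operates in continuous time) and signed events for increases and decreases; the function name `generate_events` and its arguments are hypothetical and not taken from any sensor API.

```python
import math

def generate_events(times, intensities, threshold):
    """Emit (time, polarity) events for one pixel from a sampled intensity trace.

    An event is emitted whenever the log intensity has moved by at least
    `threshold` since the last event; the reference level then steps by
    one threshold. Signed polarity is an assumption of this sketch.
    """
    events = []
    ref = math.log(intensities[0])  # log intensity at the last event
    for t, intensity in zip(times[1:], intensities[1:]):
        delta = math.log(intensity) - ref
        while abs(delta) >= threshold:
            polarity = 1 if delta > 0 else -1
            events.append((t, polarity))
            ref += polarity * threshold
            delta = math.log(intensity) - ref
    return events

# The same relative change produces the same events at any absolute brightness.
dim = generate_events([0, 1, 2], [1, 2, 4], threshold=0.5)
bright = generate_events([0, 1, 2], [100, 200, 400], threshold=0.5)
```

Because the comparison is made on log intensity, the dim and bright traces yield identical event lists, which is the percentage-change behaviour described above.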
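Under constant scene illumination the intensity changes seen by the sensor are due to the combination of a spatial image gradient and a component of image motion along that gradient, as described by the equation below, which is a first order approximation of the image constancy constraint:
$$\frac{\partial I(u,v,t)}{\partial t} = -\left(\frac{\partial I(u,v,t)}{\partial u}\frac{du}{dt} + \frac{\partial I(u,v,t)}{\partial v}\frac{dv}{dt}\right) \qquad (1)$$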
where I(u, v,t) is intensity on the image plane, and u and v are horizontal and vertical coordinates measured in units of pixels. The sensor will therefore generate the most events at locations where a large image gradient is present, as will be discussed further in Section III-B.
III. COMPUTING WITH NEURONS

A. Neuron model
The neuron model we use is a simple Integrate-and-Fire neuron (IF neuron) [44] with linear decay and a refractory period, as shown in Fig. 2. We foresee that the model would translate to hardware implementations which model many neurons in parallel, but the neurons in such hardware implementations may have very limited precision. To account for the possible limited precision in implementation, in software we simulate subthreshold membrane potential decay with 1ms time precision and restrict all neuron parameters ($V_{thresh}$, $I_l/C_m$, and $t_{refr}$ in Table I) to be unsigned 8 bit integers with 1 Least Significant Bit (LSB) corresponding to 1 unit shown in Table I. During simulation, membrane potential is stored as an integer value in units of millivolts.
The simple behaviour of IF neurons ensures that an output spike can only be elicited by an excitatory input spike, and not by subthreshold membrane potential dynamics in the absence of excitatory input. When an input to a neuron arrives, the neuron's new state (membrane potential) can be entirely determined by the time since it was last updated, and its state after the previous update. We therefore need only update a neuron when it receives an input spike (rather than at a constant time interval). Neurons are organized into a hierarchical structure consisting of layers. When an input spike arrives from a lower layer, the update procedure for the neuron is:
where $t_i$ is the time at which the $i$th input spike arrives, $t_{lastspike}$ is the time at which the current neuron last generated an output spike, $t_{refr}$ is the refractory period of the neuron, $V_{m_i}$ is the membrane voltage after the $i$th input spike, $I_l$ is the leakage current, $C_m$ is the membrane capacitance, $\omega_i$ is the input weight of the $i$th input spike, and $V_{thresh}$ is the threshold voltage for the current neuron.
Output spikes from a neuron feed similarly into the layer above, but can also affect neurons within the same layer through lateral connections. When an input is received from a lateral connection, it forces the receiving neuron to reset and enter a refractory period. In practice we implement this by tricking the reset neuron into thinking it has recently spiked, using the update:
$t_{lastspike} \leftarrow t$, where $t$ is the current time.
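A minimal sketch of an event-driven neuron consistent with the symbol definitions above is given here. The class name `IFNeuron` and its method names are hypothetical. Several details are assumptions of this sketch rather than statements from the text: inputs arriving during the refractory period are ignored, the membrane potential is clamped at zero, the potential is reset to zero after an output spike, and a lateral reset also clears the potential.

```python
class IFNeuron:
    """Event-driven integrate-and-fire neuron with linear decay (illustrative sketch)."""

    def __init__(self, v_thresh, leak_rate, t_refr):
        self.v_thresh = v_thresh          # V_thresh, spiking threshold (mV)
        self.leak_rate = leak_rate        # I_l / C_m, linear decay rate (mV per ms)
        self.t_refr = t_refr              # t_refr, refractory period (ms)
        self.v_m = 0                      # membrane potential (mV)
        self.t_prev = 0                   # time of the previous update (ms)
        self.t_lastspike = float("-inf")  # time of the last output spike (ms)

    def receive_spike(self, t, weight):
        """Process one input spike at time t; return True if an output spike is emitted."""
        if t - self.t_lastspike < self.t_refr:
            self.t_prev = t
            return False                  # assumption: inputs are ignored while refractory
        # Apply linear decay for the time elapsed since the previous update.
        self.v_m = max(self.v_m - self.leak_rate * (t - self.t_prev), 0)
        self.t_prev = t
        # Integrate the weighted input (clamping at zero is an assumption).
        self.v_m = max(self.v_m + weight, 0)
        if self.v_m >= self.v_thresh:
            self.v_m = 0                  # assumption: reset to zero after firing
            self.t_lastspike = t
            return True
        return False

    def lateral_reset(self, t):
        """Lateral input: make the neuron behave as if it had just spiked at time t."""
        self.t_lastspike = t
        self.v_m = 0                      # assumption: the potential is cleared as well


neuron = IFNeuron(v_thresh=10, leak_rate=1, t_refr=5)
fired = [neuron.receive_spike(t, w) for t, w in [(0, 6), (2, 6), (4, 20), (7, 20)]]
```

The state is only touched when a spike arrives, so the cost of simulation follows the input event rate rather than a fixed time step.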
B. Using Spike Timing to Find the Max
Jarrett et al. [45] showed in a comparison of object recognition architectures that the top performing algorithms are those with a hierarchical structure incorporating a non-linearity, although some more recent works show similar performance with a single layer of neurons, at the expense of increased computational complexity and training difficulty [46]. In the case of the popular HMAX model [39], this non-linearity is a maximum operation in the pooling stages (C1 and C2). Finding this maximum requires comparing the responses of all units within the region to be pooled. The maximum value is then passed through to the next layer, irrespective of how large or small it is.
In the HFirst architecture we observe which neuron responds first, and judge that neuron to have the maximal response to the stimulus. This is based on two main observations. Firstly, sharper edges (larger spatial gradients) result in larger temporal contrast (1), and therefore generate events sooner than less sharp edges. Secondly, the higher the spatial correlation between a neuron's input weights and the spatial pattern of incoming spikes, the stronger it will be activated (see Fig. 3). The strongest activated neuron will cross its spiking threshold before other neurons, thereby providing an indication that its response is strongest. Using this mechanism there is no need to compare neuron responses to each other; we simply observe which neuron generated an output first. The first spike from a pooling region can then be used to reset other orientations through lateral reset connections, thereby ensuring that non-maximal responses are not propagated through to subsequent layers. Fig. 3 shows how neurons tuned to different orientations will respond when an edge is presented.
The "time to first spike" approach simplifies computation of the max. It indicates which neuron has the strongest response, and through the time at which the spike is elicited it conveys how strong the response is. However, if no neuron was activated strongly enough to generate an output spike, no first spike is detected and no output spikes are generated. This is an important property ensuring that no computation is performed when there is insufficient activity in the scene. Much like the front end sensor, which represents lack of stimulus (temporal contrast) through a lack of data, HFirst represents the lack of a strong enough neuron activation through a lack of output spikes.
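The first-spike idea can be sketched as a race between neurons that share a threshold but have different input kernels. This is an illustration only: leak and refractory behaviour are omitted for brevity, ties within one input event go to the lowest index (an arbitrary assumption), and the name `first_to_spike` and the address labels are hypothetical.

```python
def first_to_spike(kernels, events, v_thresh):
    """Return (winner_index, spike_time) for the first neuron to reach threshold.

    kernels: one dict per neuron, mapping an input address to a synaptic weight.
    events:  time-ordered list of (time, address) input spikes.
    Returns None when no neuron reaches threshold, i.e. no output is produced.
    """
    potentials = [0.0] * len(kernels)
    for t, address in events:
        for j, kernel in enumerate(kernels):
            potentials[j] += kernel.get(address, 0.0)
            if potentials[j] >= v_thresh:
                # No comparison between neurons is needed: the first crossing wins.
                return j, t
    return None

# Neuron 0 matches the incoming spike pattern better than neuron 1.
kernels = [{"a": 1, "b": 1, "c": 1}, {"a": 1, "d": 1, "e": 1}]
events = [(1, "a"), (2, "b"), (3, "c")]
winner = first_to_spike(kernels, events, v_thresh=3)
```

When the stimulus is too weak for any neuron to reach threshold the function returns `None`, mirroring the absence of output spikes described above.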
IV. ASYNCHRONOUS HFIRST ARCHITECTURE
HFirst is structured in a similar manner to hierarchical neural models [39], [43], which consist of four layers, named Simple 1 (S1), Complex 1 (C1), Simple 2 (S2), and Complex 2 (C2). In these frame-based architectures, cells in simple layers densely cover the scene and respond linearly to their inputs, while cells in complex layers have a non-linear response and only sparsely cover the scene. The layers and the manner in which computation is performed in HFirst differ considerably from previous implementations of similar computational models of object recognition in cortex [39], [47], [48]. The Simple layers in HFirst are in fact non-linear due to the use of a spike threshold and binary spike output. In the remainder of this section the form and function of each HFirst layer is described. The same neuron model is used for all layers, but with different parameters and connectivity. The network architecture is shown in Fig. 4, and the parameters for each stage are shown in Table I.
Fig. 4. [...] (Table I shows the sizes for the full model). The S1 layer performs orientation extraction at a fine scale, followed by a pooling operation in C1. Note that due to lateral reset in C1, some S1 responses are blocked (for example, the last three orientations on the bottom row). The S2 layer combines responses from different orientations, but maintains spatial information. The C2 layer pools across all S2 spatial locations, providing only a single output neuron for each character.
A. Layer 1: Gabor Filters
The S1 layer densely covers the scene with even Gabor filters at 12 orientations. All filters are 7×7 pixels, resulting in 12 filters at each pixel. These filters are designed to pick up sharp edges. Filter kernels are generated with the same equation as in Serre et al. [39], repeated below for convenience:
$$F_\theta(u,v) = \exp\left(-\frac{u_0^2 + \gamma^2 v_0^2}{2\sigma^2}\right)\cos\left(\frac{2\pi}{\lambda}u_0\right), \qquad u_0 = u\cos\theta + v\sin\theta, \qquad v_0 = -u\sin\theta + v\cos\theta$$
where $u$ and $v$ are horizontal and vertical location in pixels, and $\gamma$ is the aspect ratio of the Gaussian envelope. $u_0$ and $v_0$ are used to effect a rotation which orients the filter. Parameters of λ = 5 and σ = 2.8 were used to generate the synaptic weights. θ varies from 0 to 165 degrees in increments of 15 degrees.
S1 neurons are divided into adjacent non-overlapping 4×4 pixel regions, referred to as S1 units. Each S1 unit feeds into 12 C1 neurons, one for each orientation. C1 neurons have lateral reset connections between orientations to perform the max operation discussed in Section III-B. C1 neurons use a very low threshold voltage to ensure that a single input spike is sufficient to generate an output spike (provided the neuron is not in its refractory period).
The refractory period in C1 saves computation by reducing the number of spikes which need to be routed within the architecture. Limiting the firing rate is also important to ensure that no single C1 neuron can fire rapidly enough to single handedly elicit a spike from an S2 neuron.
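The S1 kernels described in this subsection can be sketched as follows, assuming the standard Gabor form of Serre et al. [39]. The aspect ratio `gamma = 0.3` is an assumption of this sketch (it is not stated in this section), no normalisation or integer quantisation of the weights is applied, and the function name `gabor_kernel` is hypothetical.

```python
import math

def gabor_kernel(theta_deg, size=7, lam=5.0, sigma=2.8, gamma=0.3):
    """Return a size x size even Gabor kernel as a list of rows.

    lam and sigma follow the values quoted in the text; gamma (aspect
    ratio) is an assumed value, not taken from this section.
    """
    theta = math.radians(theta_deg)
    half = size // 2
    kernel = []
    for v in range(-half, half + 1):
        row = []
        for u in range(-half, half + 1):
            # Rotate the coordinates to orient the filter.
            u0 = u * math.cos(theta) + v * math.sin(theta)
            v0 = -u * math.sin(theta) + v * math.cos(theta)
            envelope = math.exp(-(u0 ** 2 + gamma ** 2 * v0 ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * u0 / lam))
        kernel.append(row)
    return kernel

# 12 orientations from 0 to 165 degrees in steps of 15 degrees.
s1_kernels = {theta: gabor_kernel(theta) for theta in range(0, 180, 15)}
```

Each kernel is even-symmetric about its centre, which is what makes the filters respond to edges of a given orientation regardless of contrast direction along the rotated axis.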
B. Layer 2: Template Matching
S2 neurons densely cover C1 neurons, with each receiving inputs from 8×8 C1 neurons of all orientations. S2 receptive fields are created during a training phase as described below.
A simple activity tracker [34] is used to track training objects and compensate for their motion to generate a static 32×32 pixel view of the object. This stabilised view is processed by S1 and C1, and the number of spikes of each orientation originating from each C1 neuron is counted. Note that due to the non-overlapping S1 units, the 32×32 pixel input region feeds into 8×8 C1 neurons, which is the size of an S2 receptive field in HFirst (see Table I ).
The counts generated in this manner constitute the synaptic weights (or input kernel) for the S2 neuron sensitive to this object. A separate neuron is required for each object to be recognized. For each neuron, synaptic weights are normalised to have an $\ell_2$ norm of 100. Finally, since negative spike counts are not possible, all zero valued weights are replaced with inhibitory values (-1) to reduce noise sensitivity. A copy of each trained neuron is then implemented at every location, allowing detection of all trained objects at all locations. Fig. 5 shows an example of a learnt S2 receptive field for recognizing the character 'G'. The figure shows how the highest input synapse weights are assigned to locations where the orientation of the character's edges matches the orientation to which the underlying C1 neurons are tuned.
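The weight construction above can be sketched as a small function operating on a flat list of C1 spike counts (one entry per orientation and location). The function name `s2_kernel_from_counts` is hypothetical, and weights are left as floats here; any rounding to limited-precision integers is outside the scope of this sketch.

```python
import math

def s2_kernel_from_counts(counts, target_norm=100.0, inhibitory_weight=-1.0):
    """Turn C1 spike counts into S2 synaptic weights.

    Counts are scaled so that their l2 norm equals target_norm, and
    every zero count is then replaced with a small inhibitory weight.
    """
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        raise ValueError("no C1 spikes were counted for this object")
    scale = target_norm / norm
    return [c * scale if c > 0 else inhibitory_weight for c in counts]

weights = s2_kernel_from_counts([3, 4, 0])
```

In this toy case the counts have norm 5, so the two active synapses receive weights 60 and 80 and the silent synapse becomes inhibitory.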
S2 neuron spikes reset all other S2 neurons within an 8×8 region sensitive to other classes of objects, thus implementing the max operation discussed in Section III-B. Furthermore, by only resetting neurons sensitive to other object classes, the detected object class is given a "head start" in the race to first spike in the nearby region. This can be seen as using the detection to create a prior expectation of detecting that object again nearby.
An optional C2 layer can be used to pool all responses from all S2 locations for classification. The C2 layer is not always used because it discards information regarding the location of the object, which can be particularly useful when multiple objects of interest are simultaneously present in the scene.
C. Classifier
A basic classifier outputs the soft probabilities for the object belonging to each class. The probability $P(i)$ of an object belonging to class $i$ is calculated as
$$P(i) = \frac{n_i}{\sum_j n_j}$$
where $n_i$ is the number of spikes elicited by S2 neurons sensitive to the $i$th class. When $\sum_j n_j = 0$ we assign $P(i) = 0$ for all classes. If we wish to force the classifier to choose only a single class, we can assign the output class $y$ as
$$y = \arg\max_i P(i)$$
We have no neuron to respond to lack of an object in a scene. Lack of an object results in lack of positive detections. This is a fundamental concept of the computing and sensing paradigm we use. Lack of information is not communicated, but is rather represented by a lack of communicated data.
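The classifier of this subsection can be sketched in a few lines. The function name `classify` is hypothetical, and returning `None` as the class when no S2 spikes occurred is a choice made for this sketch to reflect the absence of a detection.

```python
def classify(spike_counts):
    """Return (probabilities, predicted_class) from per-class S2 spike counts.

    spike_counts[i] is n_i, the number of spikes from S2 neurons tuned to
    class i. With no spikes at all, every probability is zero and no class
    is reported.
    """
    total = sum(spike_counts)
    if total == 0:
        return [0.0] * len(spike_counts), None
    probabilities = [n / total for n in spike_counts]
    predicted = max(range(len(spike_counts)), key=lambda i: spike_counts[i])
    return probabilities, predicted

probs, label = classify([1, 3, 0, 0])
```

Because the probabilities are a monotonic function of the counts, choosing the class with the most spikes is equivalent to choosing the class with the highest probability.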
V. IMPLEMENTATION
In this section we briefly analyse computational requirements. The number of input spikes generated by the front end sensor varies with scene activity and dictates the required computation since neuron updates are only performed when spikes are received. We analyse computation as a function of the number of input and output spikes for each layer. A worst case scenario is used which assumes that a neuron is updated every time it receives a spike (ignoring the refractory period).
A parallelised and pipelined FPGA implementation was programmed to run in real-time on the Opal Kelly XEM6010-LX150 board, which includes a Xilinx Spartan 6 XC6SLX150-2 FPGA. The model operates on a 128×128 pixel input. The implementation runs at a clock frequency of 100MHz and uses internal block RAM without relying on external RAM. The final output of the system consists of S2 output spikes, although access is also provided to spikes from intermediate layers for characterization.
A. S1 and C1: Gabor Filters
Each input spike in S1 routes to all S1 neurons within a 7×7 pixel region. There are 12 S1 neurons at each pixel location (one per orientation), resulting in 12×7×7 = 588 synapse activations per input spike.
For FPGA implementation, 84 synapses update in parallel, requiring 7 clock cycles to update all 588 synapses, allowing the S1 stage to sustain throughput of 14M events per second.
Each S1 output spike excites a single C1 neuron, and resets the 11 C1 neurons sensitive to other orientations, resulting in 12 C1 synapse activations per S1 output spike. C1 updates all 12 synapses in parallel and can process 25M input events per second.
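The S1 and C1 figures quoted above follow from simple arithmetic, reproduced here as a sketch. The constant names are hypothetical; the values are those given in the text, and the throughput assumes one input event is processed at a time.

```python
ORIENTATIONS = 12             # S1 neurons per pixel location
KERNEL_SIZE = 7               # S1 receptive field is 7x7 pixels
CLOCK_HZ = 100_000_000        # FPGA clock frequency
PARALLEL_S1_SYNAPSES = 84     # synapses updated per clock cycle in S1

# Every input spike reaches all orientations within a 7x7 neighbourhood.
s1_synapses_per_event = ORIENTATIONS * KERNEL_SIZE * KERNEL_SIZE

# Ceiling division: clock cycles needed to update all synapses for one event.
s1_cycles_per_event = -(-s1_synapses_per_event // PARALLEL_S1_SYNAPSES)

# Sustained S1 throughput in events per second.
s1_events_per_second = CLOCK_HZ / s1_cycles_per_event

# One excitatory synapse plus 11 lateral resets per S1 output spike.
c1_synapses_per_s1_spike = 1 + (ORIENTATIONS - 1)
```

Dividing the 100MHz clock by 7 cycles gives roughly 14.3 million events per second, consistent with the 14M figure stated for the S1 stage.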
B. S2 and C2: Template Matching
Each input spike to S2 routes to all S2 neurons within an 8×8 region. If $N_y$ denotes the number of classes to be classified, then there will be $N_y$ neurons at each location, and each input spike will activate $N_y \times 8 \times 8 = 64N_y$ input synapses. Each S2 output spike resets all S2 neurons lying within an 8×8 region around where the spike originated. So, for every S2 output spike, $N_y \times 8 \times 8 = 64N_y$ S2 lateral reset synapses are activated.
The FPGA implementation of S2 can update $N_y$ neurons in parallel, requiring 64 clock cycles to process each input or output spike. The number of C2 input synapses activated is equal to the number of S2 neuron output spikes. The C2 stage is optional and not implemented in FPGA.
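The S2 cost per spike can be tallied in the same way. In this sketch the function name `s2_cost` is hypothetical, and the derived spike rate is an upper bound that assumes spikes are processed strictly one after another; that rate is not a figure reported in the text.

```python
def s2_cost(num_classes, region=8, clock_hz=100_000_000):
    """Return (synapses_per_spike, cycles_per_spike, max_spikes_per_second) for S2."""
    # Each spike touches every class neuron at every location in the region.
    synapses_per_spike = num_classes * region * region
    # All class neurons at one location update in parallel, one location per cycle.
    cycles_per_spike = region * region
    # Derived bound, assuming no overlap between consecutive spikes.
    max_spikes_per_second = clock_hz / cycles_per_spike
    return synapses_per_spike, cycles_per_spike, max_spikes_per_second

synapses, cycles, rate = s2_cost(num_classes=36)
```

For the 36-class character task this gives 2304 synapse activations per spike, while the cycle count stays at 64 regardless of the number of classes because the class neurons are updated in parallel.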
In HFirst there are no zero valued synapses in S2. Synapses which are not activated during training are assigned an inhibitory synaptic weight of -1 (see Section IV-B). The number of synapses in S2 could be significantly reduced by instead assigning a weight of 0 to these synapses and optimizing them out of the model. However, such an optimization would introduce significant additional complexity in pipelining for the FPGA implementation. The FPGA implementation benefits far more from the simplified pipelining which results from having a dense regular connection structure where all synapses are implemented. This connection structure is also more general, allowing the synaptic weights to be easily reprogrammed.
The regular connection structure also saves memory by ensuring that when a neuron is updated, all co-located neurons will also be updated. Updating all co-located neurons simultaneously allows us to store only a single time value to indicate when all neurons at that location were updated, rather than storing a separate time value to indicate when each individual neuron was last updated (Section III-A shows how the time value is used in the neuron update). This memory saving is important because memory availability is the limiting factor in scaling the model to higher resolution, as shown in the next section.
C. Scaling to higher resolution
When scaling to higher resolutions two main factors need to be considered: memory requirements, and computational requirements. Required memory scales linearly with the number of neurons in the model, which in turn scales linearly with the number of input pixels. Computational requirements scale linearly with the input event rate.
With 36 classes ($N_y = 36$), 167 Block RAMs are used for HFirst (see Table II), plus an additional 10 for pipeline FIFOs and USB IO, resulting in a total of 177 of the available 268 Block RAMs being used for 128×128 pixel input resolution.
Digital Signal Processing (DSP) blocks are blocks within the FPGA containing dedicated hardware for performing multiplication and addition. The number of multiplications which can be performed per second is a limiting factor in many algorithms, particularly for visual processing algorithms which compute kernel responses using convolution. Optimization of these algorithms typically involves optimizing memory access and pipelining to maximise utilization of hardware multipliers (see [49] for an example). High end GPUs and FPGAs contain thousands of hardware multipliers.
In HFirst only 17 of our FPGA's 180 DSP blocks are used, and these 17 DSP blocks are only utilized a small percentage of the time due to the temporal sparsity of the AER data. For HFirst, internal FPGA memory is the limiting resource when increasing resolution. Internal memory requirements scale with the input sensor resolution, while the number of DSP blocks required will scale with the maximum sustained input event rate the model is required to handle.
The current FPGA implementation can handle a sustained 14Meps (events per second) input event rate, while bursts of up to 100Meps (limited by FPGA clock speed of 100MHz) can be handled for durations up to 5µs (limited by FIFO buffer depth). Larger FIFO buffers can be used, but are unnecessary. At 128×128 resolution, event rates for typical scenes are around 1Meps. The latest ATIS can generate events at a peak rate of 25Meps, and sustain a maximum rate of 15Meps at 304×240 pixel resolution. 14Meps is therefore a very high rate for 128×128 pixel resolution. Using additional DSP blocks, the maximum sustainable event rate can be increased by 1Meps per block used.
D. Power Consumption
The FPGA board on which HFirst was implemented also performs other tasks in parallel as part of normal operation of the ATIS sensor (powering and controlling the ATIS, as well as interfacing to a host PC). Implementing HFirst in addition to the other tasks on the FPGA increases power consumption by 150mW for static scenes (little to no processing happening), and by a further 100mW for the highest activity scene we could generate. We therefore estimate HFirst power consumption to be between 150mW and 250mW depending on scene activity. These measurements are taken at the board's power supply and include losses due to inefficiencies in the onboard switching regulators.
VI. TESTING
HFirst was tested on two tasks. The first consists of recognizing pips on poker cards as they are shuffled in front of the sensor. The poker card task has been previously tackled [35] and was chosen to provide a direct comparison with previously published works. The second task is a simulated reading task in which characters are recognized as they move across the field of view using the test setup shown in Fig. 6. Examples of recordings used for each task are shown in Fig. 7. For both tasks, HFirst was implemented in Matlab simulation, coupled with a reconfigurable C++ function for increased speed.
A. Poker cards
For the poker card task, data was provided by Linares-Barranco [35], who captured the data using the sensitive DVS sensor [22]. The dataset consists of 10 examples for each of the 4 card types (spades, hearts, diamonds, and clubs). For each of 10 different trials, non-overlapping test and training sets were chosen such that each contained 5 examples of each pip. For each pip in the training set, all 5 examples were concatenated into a single sequence from which the S2 layer kernel was generated. To provide a close comparison with the previously published task, we also tested on the stabilised and extracted pips.
Additional tests were performed in which lateral reset connections were removed from the model to investigate the value of the timing approach to computing the max. Finally, the advantage of having orientation extraction and pooling in S1 and C1 were investigated by bypassing these stages.
B. Character Recognition
36 characters (0-9 and A-Z) were printed on the surface of a barrel which was rotated at 40rpm while viewed by the DVS [21] as shown in Fig. 6. Data was recorded over two full rotations of the barrel, thereby providing two recordings for each character. For each of 10 trials, non-overlapping test and training sets were randomly chosen such that every character appears once in each set. Training and testing were then performed using an automated script.
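The splitting protocol used for both tasks can be sketched as follows. This is a minimal illustration and not the authors' script; the function name, the seed, and the use of integer indices to stand for recordings are all assumptions made for the example.

```python
import random


def make_trial_splits(examples_per_class, n_train, n_trials, seed=0):
    """Build non-overlapping train/test splits for each trial.

    examples_per_class maps a class label to the number of recordings
    available for it. Returns a list of (train, test) pairs, each a dict
    mapping class label -> sorted list of recording indices.
    """
    rng = random.Random(seed)  # hypothetical fixed seed for repeatability
    trials = []
    for _ in range(n_trials):
        train, test = {}, {}
        for label, n_examples in examples_per_class.items():
            indices = list(range(n_examples))
            rng.shuffle(indices)
            train[label] = sorted(indices[:n_train])
            test[label] = sorted(indices[n_train:])
        trials.append((train, test))
    return trials


# Character task: 36 classes, two recordings each, one per set.
characters = [str(d) for d in range(10)] + [chr(ord("A") + i) for i in range(26)]
char_trials = make_trial_splits({c: 2 for c in characters}, n_train=1, n_trials=10)

# Card task: 4 pips, 10 examples each, 5 per set.
pips = ["spades", "hearts", "diamonds", "clubs"]
card_trials = make_trial_splits({p: 10 for p in pips}, n_train=5, n_trials=10)
```

Accuracy for a task would then be summarised over the 10 trials, as in the figures reported in the results.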
Fig. 6. The test setup used to acquire the character dataset, consisting of a motorised rotating barrel covered with printed letters viewed by a DVS [21].
Training of the second layer of HFirst is performed on a stabilised view of a moving object, and therefore requires knowledge of the object location, which is acquired through tracking. However, for testing we use moving sequences instead of stabilised views, removing the need for tracking.
As with the card task, the character recognition task was also used to investigate the advantages of using reset connections for max computation, and of performing orientation extraction and pooling in S1 and C1 respectively.
Further testing was performed on the characters to show that HFirst can detect multiple objects simultaneously present in the scene, and to investigate the impact of timing jitter introduced during training and testing.
Finally the importance of precise timing was investigated by artificially altering spike times in the recordings and observing the effect on HFirst accuracy.
VII. RESULTS
Results from testing are summarised in Table III , and discussed in the sections below. The S1 and C1 columns show the total number of activated synapses in each of these layers. For S2, the S2 and S2 rst columns show the number of activated feedforward (from C1) and lateral reset synapses respectively.
A. Cards
HFirst classified the stabilised and extracted card pips with an accuracy of 97.5%±3.5% using an S2 threshold of 150mV. Chance for this task is 25%. The average test example lasted 23ms and consisted of 4.3k input spikes, which elicited 73 C1 and 2.8 S2 spikes. The S1/C1 and S2/C2 layers took on average 102ms and 0.7ms respectively per example to simulate in Matlab using a single thread on an Intel Xeon X5675 processor running at 3.07GHz. The FPGA implementation simulates the network in real-time, with latency ≤ 2µs in response to incoming events. Removing lateral reset in the first layer decreases recognition accuracy to 51.6%±4.4%, while removing lateral reset connections in the second layer decreases it to 72.3%±3.8%. Removing lateral reset connections in both layers reduces recognition accuracy to chance levels, while increasing the average number of spikes elicited to 309 and 66 for C1 and S2 respectively. These results suggest that using the first spike mechanism improves performance, both in terms of computational efficiency and in terms of recognition accuracy. For the card classification task, which has only four output classes, bypassing the first layers reduces the required computation at the cost of recognition accuracy.
B. Characters
HFirst classified the moving letters with an accuracy of 84.9%±1.9% using an S2 threshold of 200mV. Chance for this task is 2.8%. The average test example lasted 112ms and consisted of 14k input spikes, which elicited 313 C1 and 27 S2 spikes on average. The S1/C1 and S2/C2 layers took on average 365ms and 28ms respectively per example to simulate in Matlab using a single thread on an Intel Xeon X5675 processor running at 3.07GHz. As with the card pip task, the FPGA implementation easily runs in real-time with latency ≤ 2µs.
Next we investigated the effects of bypassing the first layers of HFirst and performing template matching directly on the input events. This modification resulted in an accuracy of 81.4%±3.8%, which is 3.5 percentage points below the full model. However, bypassing the S1 and C1 layers also increases the required computation significantly, suggesting that performing orientation extraction and pooling in S1 and C1 is actually more computationally efficient. The same is not true for the cards task where only 4 classes are present, but is true whenever 10 or more output classes are required. This increased computational requirement is also evident in the time taken for simulation, which increased 50-fold to an average of 19.7 seconds per example in Matlab.
C. Detecting Multiple Objects Simultaneously
After testing the model performance on individual characters, we verified that it can detect multiple characters simultaneously present in the scene. Fig. 8 shows 150ms worth of S2 outputs with multiple characters simultaneously visible in the scene. The S2 responses indicate both the object class and location. In this example the letters 'X', 'F', 'Y', and 'G' are all accurately detected as they pass across the scene. Later, the letters 'Z' and 'H' enter the scene. The 'Z' is accurately detected, but the 'H' is erroneously detected as an 'F' and 'I' at different points in time. The 1 in 6 error for these characters is in agreement with the 84.9%±1.9% accuracy reported overall. Fig. 9 shows output detections for a single full rotation of the barrel, comparing the times at which letters were detected (or missed) to the ground truth of when they were present in the scene.
D. Effect of Timing Jitter
In the front-end AER sensor, the latency of pixel responses and of the AER readout can vary, resulting in timing jitter in the spikes feeding into S1. All of our tests are performed on real recordings and therefore include some jitter. To investigate the effect of increased timing jitter on the model, we artificially added jitter to the recordings used for training and testing. Jitter times for each spike were randomly chosen from a Gaussian distribution, and the effect of varying the standard deviation of the distribution is shown in Fig. 10. Changing the mean of the Gaussian distribution adds a constant time offset to all spikes and has no effect on accuracy. The accuracy for each standard deviation value is again obtained as the mean of 10 random test and training splits performed on the character database. Two tests were run. In the first test (Fig. 10a), additional jitter was introduced in the training data and the test data was left unaltered. In the second test (Fig. 10b), the training data was left unaltered and additional jitter was introduced only in the test data. Training is performed on tracked and stabilized views of the characters, so for the purposes of training the characters appear static. HFirst can therefore tolerate high timing jitter in the training data, because even when a spike's time is changed, it will still occur in the correct location relative to the center of the character. Accuracy drops off significantly only when the standard deviation of the jitter exceeds 100ms, which is comparable to the length of the recording itself (112ms).
Recognition is performed on moving views of the characters which are crossing the field of view at roughly 1 pixel/ms. Delaying a spike by even a few milliseconds (Fig. 10b) will cause the spike to occur in the wrong location relative to the center of the character (because the character center will have moved during the delay period). Therefore, even a few milliseconds of timing jitter will cause a significant decrease in recognition accuracy.
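The jitter experiment and the displacement argument can be sketched as follows. This is an illustrative reconstruction and not the code used for the reported results; the event tuple layout `(t_ms, x, y)`, the function names, and the re-sorting step are assumptions.

```python
import random


def add_jitter(events, sigma_ms, mean_ms=0.0, seed=0):
    """Add Gaussian timing jitter to a list of (t_ms, x, y) events.

    Pixel addresses are left untouched; only timestamps change. The result
    is re-sorted by time (an assumption: downstream processing is taken to
    expect a time-ordered event stream).
    """
    rng = random.Random(seed)
    jittered = [(t + rng.gauss(mean_ms, sigma_ms), x, y) for (t, x, y) in events]
    jittered.sort(key=lambda event: event[0])
    return jittered


def displacement_pixels(delay_ms, speed_px_per_ms=1.0):
    """Distance the character center moves while a spike is delayed."""
    return delay_ms * speed_px_per_ms


example_events = [(0.0, 10, 5), (1.0, 11, 5), (2.0, 12, 5)]
noisy_events = add_jitter(example_events, sigma_ms=3.0)

# At roughly 1 pixel/ms, a 3 ms delay misplaces a spike by about 3 pixels
# relative to the moving character center.
shift = displacement_pixels(3.0)
```

With a non-zero mean and zero standard deviation every timestamp shifts by the same amount, so relative spike order is preserved, which is consistent with the observation that the mean has no effect on accuracy.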
Fig. 10. The effect of timing noise on recognition accuracy for the character recognition task. Adding Gaussian noise to the stabilized training data (a) has little effect on accuracy because even when delayed, spikes occur in the correct location relative to the character center. Accuracy drops off significantly only when the timing jitter is large enough to cause the training data spikes to be too spread in time. Adding even a small degree of Gaussian noise to the moving characters used for testing (b) causes accuracy to drop off significantly because by the time the delayed (jittered) spikes arrive at the S1 inputs, the character has already moved on to a new location.
VIII. DISCUSSION
In this paper we have described a spiking neural network for visual recognition dubbed "HFirst". HFirst exploits timing information in the incoming visual events to implement a time-to-first spike operation as a temporal Winner-Take-All (WTA) operation with lateral reset to block responses from other neurons in the same pooling area. Computationally, this temporal WTA is significantly simpler than the MAX operation typically used in hierarchical models.
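To illustrate why a temporal WTA is cheap, the sketch below passes on only the first spike in each pooling area and suppresses later spikes from the same area. It is a simplified stand-in for lateral reset and not the HFirst implementation: the fixed suppression window `block_ms`, the function name, and the data layout are hypothetical choices for the example.

```python
def temporal_wta(spikes, pool_of, block_ms):
    """Keep only the first spike per pooling area within a suppression window.

    spikes   : iterable of (time_ms, neuron_id)
    pool_of  : dict mapping neuron_id -> pooling area id
    block_ms : hypothetical window during which the winner blocks the
               other neurons in its pool (a stand-in for lateral reset)
    """
    last_winner_time = {}
    winners = []
    for t, neuron in sorted(spikes):
        pool = pool_of[neuron]
        t_prev = last_winner_time.get(pool)
        # Each event needs one lookup and one comparison; activation
        # magnitudes are never compared across the pool.
        if t_prev is None or t - t_prev >= block_ms:
            winners.append((t, neuron))
            last_winner_time[pool] = t
    return winners


pool_of = {"a": 0, "b": 0, "c": 0, "d": 1}
spikes = [(1.2, "a"), (0.4, "b"), (0.9, "c"), (0.7, "d")]
winners = temporal_wta(spikes, pool_of, block_ms=5.0)
# winners == [(0.4, "b"), (0.7, "d")]
```

Because strongly activated neurons tend to fire first, the earliest spike in a pool approximates the maximally activated unit, whereas an explicit MAX would require evaluating and comparing every response in the pool.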
HFirst operates on change detection data from AER sensors. Each pixel in these sensors adapts individually to ambient lighting conditions, which to a large extent removes dependence on lighting conditions. This removes the need for normalization of oriented Gabor responses in HFirst, which is another computationally intensive task (division) required by the standard HMAX model and other CNN implementations.
Thus far HFirst has been tested on simple objects, and neurons in the second layer of HFirst directly detect the presence of these objects, allowing HFirst to simultaneously detect multiple objects in the scene, which is not typically possible with CNNs.
Masquelier et al. [43] used STDP to learn more complex features, and a powerful Radial Basis Function (RBF) classifier which allows recognition of more complex objects (motorcycles and faces from Caltech 101). Their approach used STDP to extract features with high correlation between training examples, even though these features appear at different locations. This removes the need to precisely track and stabilise a view of an object for training. However, the model only operates on static images, removing the problem of moving stimuli, and objects are already centered in the Caltech 101 database (although features do not always appear at the same location). A second major difference is that HFirst operates continuously, whereas Masquelier et al. present images to their model sequentially, requiring the system to be reset before each image presentation.
In a recent PAMI paper, Perez-Carrasco et al. [35] reported an accuracy ranging from 90.1% to 91.6% for the card pip task using a five layer spiking CNN. They kindly provided us with their data and for the same task we report an accuracy of 97.5%±3.5%. However, we compute accuracy differently from Perez-Carrasco et al. Their CNN implementation includes separate "positive" and "negative" responses to represent the presence or absence of each object, and both these responses are used in their calculation of accuracy. HFirst has no "negative" responses, which prevents us from using the same equation. Instead, HFirst provides only positive responses, and does not respond when no objects of interest are present in the scene. Nevertheless, if we consider a lack of response from a neuron to be a "negative" response, then we can use the same equation. Doing so marginally increases our accuracy to 98.8%±1.9% because correct "negative" responses are rewarded, even when "positive" responses are incorrect. The card pip task was also used to investigate the benefits of including lateral reset, by showing that removal of lateral reset connections in the first, second, or both layers consistently reduces recognition accuracy, while simultaneously increasing computational requirements.
Given the high accuracy of the full HFirst model on the card pip recognition task, a second more difficult character recognition task was constructed and was also used to investigate the benefits of a multi-layer model. Bypassing the first layer decreased accuracy from 84.9%±1.9% to 81.4%±3.8%, suggesting that the first layer increases recognition accuracy. Perhaps more importantly, the first layer significantly reduces computational requirements for the character recognition task. The same was not true for the card recognition task because it consists of very few classes (4), but as the number of classes increases, so does the number of neurons in S2, therefore making it more important to have the S1 and C1 layer to reduce the number of spikes reaching S2.
The leaky integrate and fire neurons used in HFirst essentially perform coincidence detection on input spikes arriving in a specific spatial pattern. A neuron will only generate an output spike if enough input spikes matching this pattern are received within a sufficiently short time period. Under ideal circumstances (no noise), the projection of an object moving between two points on the focal plane will generate the same number of spikes from the AER sensor, regardless of the speed of the object. However, the speed of the object determines the time period over which these spikes are generated. Sufficiently slow-moving objects do not generate spikes at a high enough rate to elicit a response from HFirst layer 1 neurons. This can be overcome through active sensing, by using a small motion or vibration of the sensor to produce an egomotion-induced velocity on the image plane.
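The speed dependence described above can be illustrated with a minimal leaky integrate and fire sketch. The linear leak, the parameter values, and the function name are assumptions chosen for illustration; they are not the HFirst neuron parameters.

```python
def lif_spike_count(input_times_ms, weight_mv, threshold_mv, leak_mv_per_ms):
    """Count output spikes of one leaky integrate and fire neuron.

    Every input spike adds weight_mv to the membrane potential, which
    decays linearly toward zero between inputs (an assumed leak model).
    The potential resets to zero after each output spike.
    """
    v = 0.0
    last_t = None
    n_out = 0
    for t in sorted(input_times_ms):
        if last_t is not None:
            v = max(0.0, v - leak_mv_per_ms * (t - last_t))
        v += weight_mv
        if v >= threshold_mv:
            n_out += 1
            v = 0.0
        last_t = t
    return n_out


# The same 20 input spikes, generated by a fast and by a slow object.
n_inputs = 20
fast_times = [i * 0.5 for i in range(n_inputs)]   # one spike every 0.5 ms
slow_times = [i * 10.0 for i in range(n_inputs)]  # one spike every 10 ms

# Hypothetical parameters: 10 mV per input, 95 mV threshold, 5 mV/ms leak.
fast_out = lif_spike_count(fast_times, 10.0, 95.0, 5.0)
slow_out = lif_spike_count(slow_times, 10.0, 95.0, 5.0)
```

With these illustrative values the fast sequence drives the neuron over threshold, while the slow sequence leaks away between inputs and never does, even though both contain the same number of spikes.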
IX. CONCLUSION
We have presented an HMAX inspired hierarchical SNN architecture for visual object recognition dubbed 'HFirst'. The architecture uses an SNN to exploit the precise spike timing provided by asynchronous change detection vision sensors to simplify implementation of a non-linear pooling operation commonly used in bio-inspired recognition models.
HFirst obtains the best reported accuracy on a card pip recognition test and results for a second, far more difficult character recognition task have also been presented. The low computational requirements of the HFirst model allow for real time implementation on an Opal Kelly XEM6010 FPGA board which interfaces directly with the vision sensor, and is both narrower and shorter than a credit card in size. 
Garrick Orchard
