One of the most exciting advancements in Artificial Intelligence (AI) over the last decade is the wide adoption of Artificial Neural Networks (ANNs), such as the Deep Neural Network (DNN) and the Convolutional Neural Network (CNN), in real-world applications. However, their massive computation and storage requirements greatly challenge their applicability on resource-limited platforms such as drones, mobile phones and IoT devices. The third generation of neural network models, the Spiking Neural Network (SNN), inspired by the working mechanism and efficiency of the human brain, has emerged as a promising solution for achieving better computing and power efficiency within lightweight devices (e.g. a single chip). However, the relevant research has been narrowly focused on conventional rate-based spiking system designs for practical cognitive tasks, underestimating SNN's energy efficiency, throughput and system flexibility. Although the time-based SNN is conceptually more attractive, its potential has not been unleashed in realistic applications due to the lack of efficient coding and practical learning schemes. In this work, a Precise-Time-Dependent Single Spike Neuromorphic Architecture, namely "PT-Spike", is developed to bridge this gap. Three constituent hardware-favorable techniques, namely precise single-spike temporal encoding, efficient supervised temporal learning and fast asymmetric decoding, are proposed to boost the energy efficiency and data processing capability of the time-based SNN at a more compact neural network model size when executing real cognitive tasks. Simulation results show that "PT-Spike" demonstrates significant improvements in network size, processing efficiency and power consumption with marginal classification accuracy degradation, when compared with the rate-based SNN and ANN under a similar network configuration.
more biologically plausible time-based SNN may offer better energy efficiency and system throughput [10], since theoretically the information can be flexibly embedded in the time (temporal) domain of short and sparse spikes instead of the spike count represented by a group of dense spikes in rate coding, where, for example, the spike occurrence frequency is proportional to the intensity of the input, such as the density of each image pixel [7], [11]. As a result, the rate-based SNN is naturally more power-hungry than the time-based SNN due to the increased number of spikes and the associated spike operations, such as synaptic weighting and Integrate-and-Fire (IFC). Meanwhile, the processing efficiency of the time-based SNN can be further enhanced by making an early decision based on the temporal information extracted from early fired spikes, while in rate coding the classification cannot be initiated until the last moment, e.g. the winner-takes-all rule sorts the number of spikes fired by each output neuron during the entire decoding period [12].
However, the potential of such an emerging architecture is significantly underestimated due to the lack of efficient hardware-favorable solutions for time-based information representation and for the complex spike-timing-dependent (temporal) training of biological synapses towards practical cognitive applications [13]. On one hand, translating the input stimulus (i.e. image pixels) into the delay of the spikes, namely time-based encoding, is non-trivial because the coding efficiency can be easily degraded by biased spike delays distributed in the limited coding intervals. Also, the hardware realization of time coding is usually expensive, as the time-based spike kernel needs to be carefully designed to provide accurate time information (e.g. pre-synaptic/post-synaptic time [10]) for time-based training. On the other hand, realizing the more biologically plausible spiking-time-based training, i.e. unsupervised spike-timing-dependent plasticity (STDP), is very complex and costly due to the exponential time dependence of the weight change and the difficult convergence of learning [14]. In real-world applications, training of the rate-based SNN can usually be performed off-line by directly borrowing the standard back-propagation algorithm from the artificial neural network (ANN) [11]. However, this time-independent learning rule does not fit the time-dependent SNN because of a fundamentally different learning mechanism.
In this work, we investigate the possibility of unleashing the potential of the time-based single-spike SNN architecture in realistic applications by orchestrating efficient time-based coding/decoding and learning algorithms. A Precise-Time-Dependent Single Spike Neuromorphic Architecture, namely "PT-Spike", is proposed to facilitate cognitive tasks such as MNIST digit recognition. Our "PT-Spike" incorporates three integrated techniques: precise single-spike temporal encoding, efficient supervised temporal learning, and fast asymmetric decoding. Our major contributions are:
1) We develop a precise-temporal encoding approach to efficiently translate the information into the temporal domain of a single spike. The single-spike solution dramatically reduces the energy, while offering efficient model size reduction;
2) We propose a supervised temporal learning algorithm to facilitate synaptic plasticity on this single-spike system;
3) We develop an asymmetric decoding scheme to address the serious weight competition issue existing in this single-spike system, and to significantly improve the efficacy and efficiency of synaptic weight updating.
II. Backgrounds and Motivations

A. Neural Coding in SNNs
The neural coding in SNNs can be generally categorized as rate coding, time coding, rank coding, population coding, etc. [15]. In particular, the first two codings are the most attractive, since each piece of coded information is associated only with the spikes generated by a single input neuron, which simplifies the encoding/decoding procedures and reduces design complexity. Fig. 1 gives a conceptual comparison between rate coding and time coding in SNNs. $T_e$ and $T_i$ ($R_e$ and $R_i$) denote two types of input neurons: the time-coded (rate-coded) excitatory and inhibitory neurons, respectively. The excitatory neuron exhibits an active response to the stimulus while the inhibitory neuron tends to keep silent. $T_1$ and $T_2$ ($R_1$ and $R_2$) denote two time-coded (rate-coded) output neurons for the classification. The rate-based SNN generates far more spikes than the time-based SNN in both types of input neurons. After the input spikes are processed by the two different SNNs, a single spike firing at a specific time interval can perform an inference task in the output layer of the time-based SNN. However, a considerable number of spikes are needed to fulfill a classification in the rate-based SNN, indicating a much higher power consumption. Moreover, the rate-based SNN may exhibit a slower processing speed than the time-based SNN, since the output neuron of the former needs to count the spikes (i.e. through Integrate-and-Fire [16]) over the whole predefined time window, while that of the latter may quickly suspend its computations once a spike is detected.
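To make the spike-count contrast concrete, the following minimal sketch encodes one normalized input intensity under both schemes. The function names, the 100-slot time window, the evenly spaced rate-coded spike train and the linear intensity-to-delay mapping are illustrative assumptions, not the encoders used in this work.

```python
def rate_encode(intensity, window=100):
    """Rate coding (assumed form): the spike count inside the window is
    proportional to the intensity; spikes are spread evenly over the window."""
    n_spikes = round(intensity * window)
    if n_spikes == 0:
        return []
    step = window / n_spikes
    return [int(k * step) for k in range(n_spikes)]


def time_encode(intensity, window=100):
    """Time coding (assumed form): one spike whose delay carries the
    intensity; here the delay grows linearly with the intensity."""
    return [round(intensity * (window - 1))]


rate_spikes = rate_encode(0.8)
time_spikes = time_encode(0.8)
print(len(rate_spikes), "spikes under rate coding")
print(len(time_spikes), "spike under time coding, at t =", time_spikes[0])
```

Under these assumptions the same stimulus costs 80 spike events in the rate-coded train and a single event in the time-coded one, which is the source of the power gap discussed above.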
B. Limitation of Existing Spiking Neuromorphic Computing Research
Neuromorphic Designs: Many studies have been conducted to facilitate spiking-based Neuromorphic Computing System (NCS) designs in real hardware implementations, including CMOS VLSI circuits [7], [17], [18], [19], reconfigurable FPGAs [8], and emerging memristor crossbars [20], [11]. However, these works mainly focus on rate- or time-based SNN model mapping and hardware implementations, rather than SNN architecture optimization, i.e. coding, decoding and learning approaches.
Temporal Coding: The concept of temporal coding, which relies on the arrival time or delay of a spike train for information representation, has been widely explored and proved in the development of the time-based SNN [21], [22]. These theoretical studies, however, mainly emphasize the biological explanations of time-based SNN models based on simple cognitive benchmarks (i.e. a two-input XOR gate), which are far from complicated real-world problems such as image recognition. Recently, Zhao et al. [23] proposed an encoding circuit to handle the temporal coding. However, this type of work still concentrates on component-level hardware implementations with simple case studies, and hence lacks a holistic architecture-level solution set capable of handling realistic tasks. In [24], a complete time-based SNN design is proposed. However, this solution suffers from limited accuracy, fundamentally constrained by the existing coding and temporal learning rule, and is not optimized towards hardware-based neuromorphic system designs.
Temporal Learning: Since the popular learning approaches widely used in ANNs or rate-based SNNs, such as back-propagation [25], are unable to handle precise-time-dependent information due to a fundamentally different neural processing, many proposals dedicated to time-based learning have been developed [14], [26], [27]. However, these learning algorithms are neither hardware-favorable nor applicable to realistic tasks due to their expensive convergence and theoretical limitations. For example, in the unsupervised spike-timing-dependent plasticity (STDP) learning rule, the neural network structure and synaptic computation will increase exponentially due to the expensive convergence and clustering. The proposed "Tempotron" and "Remote Supervised Method (ReSuMe)" can use the teaching spike to adjust the desired spiking time for temporal learning; however, they are not applicable to complicated patterns.
Our proposed "PT-Spike" is substantially different from previous studies: we explore how the time-based single-spike SNN architecture can be designed to perform realistic tasks through a holistic set of efficient techniques spanning time-based coding, learning and decoding. A low-cost and efficient temporal learning rule named "PT-Learning" is augmented from the "Tempotron" learning by considering a synthesized contribution of the cost function and the hardware-favorable time-dependent kernel for weight updating. By integrating it with the proposed "Precise Temporal Encoding" and "Asymmetric Decoding", "PT-Spike" can significantly improve the accuracy, power, learning efficiency and model size reduction through the spatial-temporal information conversion.
III. Design Details
A. System Architecture
Fig. 2 shows the comprehensive data processing flow of the proposed "PT-Spike". First, the stimulus is captured by the temporal perceptors to generate a sparse spike train (i.e. a single spike) through "Precise Temporal Encoding". Each spike train is further modulated in the temporal domain by a linear-decayed spiking kernel to form a time-dependent voltage pulse. Second, those voltage pulses are sent to the synaptic network for a weighting process, e.g. a memristor crossbar with an IFC design can be employed for parallel processing. The output neurons exhibit time-varying weighting responses due to the time-dependent input information. After that, an output neuron fires a spike if its weighted post-synaptic voltage crosses a threshold voltage. Then the spike trains from the output layer are transmitted to the "Asymmetric Decoding". Finally, the target pattern is classified by analyzing the synchronized output spikes with a predefined asymmetric rule. During the learning procedure, the desired spike patterns are coded by following the same asymmetric rule used during decoding. The detected errors are sent back for synaptic plasticity through "PT-Learning", a supervised temporal learning algorithm.
B. Precise Temporal Encoding
As discussed in Section II, in traditional rate coding, a large number of spikes within a proper time window is needed to precisely indicate the amplitude of an input signal, i.e. the pixel density of a visual stimulus. To maximize the power efficiency with a minimized number of spikes, the aforementioned time coding approach represents the input information as an extremely sparse spike train: a single spike and its occurring delay. However, such a "one-to-one" mapping between each stimulus and the spike train of each input neuron can lead to a significant energy overhead. Meanwhile, the temporal information of those spike trains is not fully leveraged by each neuron, resulting in limited coding efficiency and thus a dramatic accuracy reduction. As we shall present later, our results on the "MNIST" benchmark show that the "one-to-one" mapping achieves an unacceptable training accuracy (~20%) even under a large model size, that is, 784 input neurons for a 28 × 28 image.
[Figure: panels labeled "Input Stimulus", "Input Layer" and "Asymmetric Decoding"]
In "PT-Spike", we further propose the "Precise Temporal Encoding". As shown in Fig. 2, the "Precise Temporal Encoding" is inspired by the human visual cortex and the Convolutional Neural Network (CNN): a Temporal Kernel (i.e. a unit square matrix) is applied on the full image to capture the spatial information, which is then translated into a single spike delay in the temporal domain as a neuron input by perceiving the localized information from multiple pixels of interest, i.e. the spiking delay is equal to the average density among several selected pixels. In practice, by selecting a proper stride with which we slide the Temporal Kernel, e.g. smaller than the dimensionality of the Temporal Kernel, a portion of the localized spatial information will be shared by adjacent kernel positions. Consequently, the spatial localities can be further transformed into temporal localities, so that the spiking delays assigned to the input neurons are distributed more uniformly in the time domain, translating into improved coding efficiency and classification accuracy.
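The sliding-kernel idea can be sketched as follows. This is a minimal illustration, assuming a square kernel, a fixed stride, and a delay that is simply the mean pixel value of each window; the function name `precise_temporal_encode` and the toy 6 × 6 input are hypothetical and are not taken from the implementation evaluated in this paper.

```python
def precise_temporal_encode(image, kernel=4, stride=2):
    """Slide a kernel x kernel window over a 2-D image (list of rows) and
    return one spike delay per window position.

    Assumption: delay = mean pixel value inside the window, following the
    statement that the spiking delay equals the average density of the
    selected pixels. No further scaling to a time unit is applied.
    """
    height, width = len(image), len(image[0])
    delays = []
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            window = [image[r][c]
                      for r in range(top, top + kernel)
                      for c in range(left, left + kernel)]
            delays.append(sum(window) / len(window))
    return delays


# Toy 6x6 "image" with pixel values 0..35 in row-major order.
toy = [[r * 6 + c for c in range(6)] for r in range(6)]
delays = precise_temporal_encode(toy, kernel=4, stride=2)
print(delays)  # one delay per input neuron
```

Because the stride (2) is smaller than the kernel side (4), neighbouring windows overlap, so adjacent input neurons receive correlated delays, which is the spatial-to-temporal locality transfer described above.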
Another unique advantage of the proposed "Precise Temporal Encoding" is that it offers a flexible model size reduction. Different from the traditional "one-to-one" mapping, various degrees of model size reduction can be easily achieved by reconfiguring the size of the Temporal Kernel. Fig. 3 illustrates this concept. Increasing the Temporal Kernel size enriches the temporal information (see the encoding time frame from T = 16 ms to T = 256 ms in Fig. 3), and hence reduces the needed spatial information, i.e. the number of input neurons, e.g. 169 input neurons for "PT-Spike(16)" vs. 49 input neurons for "PT-Spike(256)". The training and inference accuracies change slightly according to the selected Temporal Kernel size (see Section IV).
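The input-neuron counts quoted above can be reproduced with a short calculation. The sketch below assumes that the number in "PT-Spike(R)" is the pixel count of a square Temporal Kernel, that the kernel slides over a 28 × 28 image with a stride of 2, and that partial windows are dropped; these assumptions are inferred from the quoted figures and are not stated explicitly in the text.

```python
import math


def input_neurons(kernel_pixels, image_side=28, stride=2):
    """Number of kernel positions (= input neurons) for a square kernel.

    Assumptions: kernel_pixels is a perfect square, stride is 2, and
    windows that would run past the image border are discarded.
    """
    side = math.isqrt(kernel_pixels)
    per_axis = (image_side - side) // stride + 1
    return per_axis * per_axis


for r in (4, 16, 25, 256):
    n = input_neurons(r)
    # Assuming 10 output neurons (one per MNIST digit), a single-layer
    # network needs n * 10 synaptic weights.
    print(f"PT-Spike({r}): {n} input neurons, {n * 10} weights")
```

Under these assumptions the calculation gives 169 and 49 input neurons for kernel sizes 16 and 256, matching the figures above, and 1440 weights for kernel size 25, matching the count reported in the evaluation.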
C. Synaptic Processing and Linearized Spiking Kernel
Once the delay of the single spike is determined, as shown in Fig. 2, a spiking kernel $K$ is applied to shape the associated spikes of the input neurons. The kernel plays an important role in the subsequent synaptic weighting that produces the output voltage $V_n(t)$, as shown in Eq. (1):
$$V_n(t) = \sum_{m} w_{mn} \, K(t - t_s) \tag{1}$$
where $V_n(t)$ represents the voltage of output neuron $N_n$, $w_{mn}$ denotes the synaptic efficacy between input neuron $X_m$ and output neuron $N_n$, and $t_s$ is the decoded spiking delay of $X_m$. To provide sufficient and accurate temporal information for the classification, the exponentially decaying post-synaptic potential of the biological spike response neural model [28] can be expressed as:
$$K_1(t - t_s) = p \left( \exp[-(t - t_s)/\tau_1] - \exp[-(t - t_s)/\tau_2] \right) \tag{2}$$
where $\tau$ ($\tau_1$ and $\tau_2$) denotes the decay time constants, and $p$ is the normalizing constant. However, such an exponential decaying function requires expensive computation and hardware resources. In "PT-Spike", we employ a more hardware-favorable kernel function $K_2$, a linear decaying function (see the comparison of $K_1$ and $K_2$ in Fig. 2), to simplify the costly dual-exponential function $K_1$.
As we shall show in Section IV, such a linear approximation causes only marginal classification accuracy degradation. Besides, this linear kernel function is also used to detect the input voltage contributions to the output spike in our proposed "PT-Learning".
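The two kernels and the weighted sum of Eq. (1) can be sketched as below. The time constants, the choice of normalizing the dual-exponential kernel to a unit peak, and in particular the shape of the linear kernel (an instantaneous rise to 1 followed by a linear decay to 0 over `width` time units) are assumptions made for illustration; the exact form of the linear kernel is not reproduced here.

```python
import math


def dual_exp_kernel(dt, tau1=16.0, tau2=4.0):
    """K1: dual-exponential kernel, with p chosen so that the peak is 1."""
    if dt < 0:
        return 0.0
    t_peak = (tau1 * tau2 / (tau1 - tau2)) * math.log(tau1 / tau2)
    p = 1.0 / (math.exp(-t_peak / tau1) - math.exp(-t_peak / tau2))
    return p * (math.exp(-dt / tau1) - math.exp(-dt / tau2))


def linear_kernel(dt, width=32.0):
    """K2 (assumed shape): 1 at the spike time, decaying linearly to 0."""
    if dt < 0 or dt > width:
        return 0.0
    return 1.0 - dt / width


def output_voltage(t, weights, spike_times, kernel):
    """Eq. (1): weighted sum of kernel responses of all input neurons."""
    return sum(w * kernel(t - ts) for w, ts in zip(weights, spike_times))


# Two input neurons spiking at t = 2 and t = 6, read out at t = 10.
v_lin = output_voltage(10, [0.5, -0.25], [2, 6], linear_kernel)
v_exp = output_voltage(10, [0.5, -0.25], [2, 6], dual_exp_kernel)
print(v_lin, v_exp)
```

The linear kernel needs one subtraction and one multiplication per evaluation, whereas the dual-exponential kernel needs two exponentials, which is the hardware cost argument made above.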
D. Asymmetric Decoding
In "PT-Spike", a novel asymmetric decoding scheme, namely "A-Decoding", is proposed for the classification. As the error signal critical for the proposed supervised temporal learning is also generated through asymmetric decoding, we discuss the "A-Decoding" technique first.
In the rate-based SNN, the target pattern can be determined by the output neuron with the highest spike count. The costly weight updating is performed in all synapses at each iteration of learning. The subsequent neural competition (weight conflict) among different patterns can be rectified by the sufficient information provided by the large number of input spikes. Hence a good classification accuracy may be achieved for all different patterns. However, this is not the case in our proposed "PT-Spike", since its weight updating relies solely on a very limited number of sparse spikes (e.g. a single spike) in the temporal domain. In "PT-Spike", we therefore propose the "A-Decoding" to alleviate the neural competition for accuracy improvement. Fig. 4 illustrates the key idea of the proposed "A-Decoding", including pattern readout and error detection. Pattern $\{P_i\}$ can be decoded based on the firing status of output neurons $\{N_i\}$. In our asymmetric decoding, an output neuron can be in one of three statuses: "firing", "not firing" and "independent", as shown in Fig. 4. Note that "independent" means that the associated neuron does not participate in the learning process of a certain pattern, and it only occurs in learning mode.
In testing mode, an output neuron can only be in one of the following two statuses: {1: firing / 0: not firing}. The target pattern is identified according to the order of the first firing neuron. Assume a binary code $N_1 N_2 N_3 \cdots N_i$ is generated by output neurons $\{N_i\}$; a Huffman-style decoding procedure can then be performed (see Fig. 4, left part). For example, if the first firing neuron is $N_3$, the corresponding code will be 001. Thus, the target pattern is $P_3$. In "PT-Spike", the early detection in testing, namely "Fire&Cut", can be realized based on the temporal "winner-take-all" rule: once the IFC of neuron $N_i$ triggers a spike, all the remaining IFCs of the other neurons are shut down by following the "Fire&Cut Order", which saves the additional power that those IFCs would consume.
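The testing-mode readout with "Fire&Cut" can be illustrated by the following sketch. It assumes discrete time steps, pre-computed voltage traces for each output neuron, and a lowest-index tie-break when two neurons cross the threshold in the same step; the names `fire_and_cut` and `readout_code` and the example traces are hypothetical.

```python
def fire_and_cut(voltage_traces, v_th):
    """Return (winner, fire_time, evaluations).

    voltage_traces[n][t] is the voltage of output neuron n at time step t.
    Scanning stops at the first threshold crossing, so later time steps
    are never evaluated (the "Fire&Cut" early termination).
    """
    n_steps = len(voltage_traces[0])
    evaluations = 0
    for t in range(n_steps):
        for n, trace in enumerate(voltage_traces):
            evaluations += 1
            if trace[t] >= v_th:
                return n, t, evaluations
    return None, None, evaluations


def readout_code(winner):
    """Binary code up to and including the first firing neuron."""
    return "0" * winner + "1"


traces = [
    [0.1, 0.2, 0.3, 0.4, 0.5],  # N_1 never reaches the threshold
    [0.0, 0.3, 0.6, 0.9, 1.2],  # N_2 would fire at t = 4
    [0.2, 0.7, 1.1, 0.8, 0.4],  # N_3 fires first, at t = 2
]
winner, fire_time, evaluations = fire_and_cut(traces, v_th=1.0)
print("pattern P_%d, code %s" % (winner + 1, readout_code(winner)))
print("evaluations:", evaluations, "of", len(traces) * len(traces[0]))
```

In this toy case $N_3$ fires first, the code is 001 and the pattern is $P_3$, as in the example above, while only 9 of the 15 possible neuron evaluations are carried out.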
In learning mode, a desired spike pattern is reversely generated according to the Huffman-style decoding of pattern $\{P_i\}$ (see Fig. 4, right part). Once a participating neuron $N_i$ triggers an unexpected firing or misses an expected firing, an error is detected and only the synaptic weights of $N_i$ are modified according to our proposed "PT-Learning". Note that only a part of the output neurons (those NOT in "independent" status) is involved in the learning of pattern $\{P_i\}$, which we call "Partial Learning". Such a mechanism significantly accelerates the learning procedure and saves the power consumed by unnecessary neural processing. Meanwhile, $\{N_i\}$ is "asymmetrically" correlated with $\{P_i\}$, which eases the neural competition. For example, neuron $N_i$ only engages in the synaptic plasticity of pattern $P_i$ and will be ignored during the learning of all other patterns. As we shall show later, by taking advantage of "Fire&Cut", "Partial Learning" and "Ease Competition", our proposed "A-Decoding" can significantly enhance the weighting efficiency and learning accuracy.
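The learning-mode error detection can be sketched as follows. The layout of the desired pattern is an assumption derived from the testing-mode code (001 for $P_3$): neurons before the target must not fire, the target neuron must fire, and all later neurons are "independent". The helper names and the example firing vector are hypothetical.

```python
FIRE, NOT_FIRE, INDEPENDENT = "1", "0", "x"


def desired_pattern(k, n_neurons):
    """Desired statuses for pattern P_k (1-based), assumed as 0..0 1 x..x."""
    return [NOT_FIRE] * (k - 1) + [FIRE] + [INDEPENDENT] * (n_neurons - k)


def detect_errors(desired, fired):
    """Map neuron index -> error type; independent neurons are skipped,
    so their weights are never touched for this pattern."""
    errors = {}
    for n, (want, did_fire) in enumerate(zip(desired, fired)):
        if want == INDEPENDENT:
            continue
        if want == FIRE and not did_fire:
            errors[n] = "false missing"
        elif want == NOT_FIRE and did_fire:
            errors[n] = "false fire"
    return errors


desired = desired_pattern(3, 5)
fired = [False, True, False, True, True]
errors = detect_errors(desired, fired)
print(desired)
print(errors)
```

Here only $N_2$ (false fire) and $N_3$ (false missing) are flagged; the spikes of $N_4$ and $N_5$ are ignored, so only two of the five neurons have their weights updated for this sample.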
E. PT-Learning
Our proposed "PT-Learning" coordinates with the aforementioned "A-Decoding" to capture the errors needed for updating the synaptic weights. An error detected by the "A-Decoding" is processed by "PT-Learning" to generate the corresponding weight changes, which are sent back for synapse updating. As shown in Fig. 2, based on the actual and expected spiking patterns, two types of errors may occur in an output neuron: "false missing" and "false fire". Here "false missing" means that the integrated voltage cannot reach the threshold of the output neuron to trigger the expected output spike, while "false fire" is defined as an undesired spike firing.
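As shown in Algorithm 1, once an error is detected, the error spiking time ($T_{fal}$) and the cost function ($Err$) are extracted from $T_{max}$ and $V_{th} - V_{max}$, respectively. Here $V_{max}$ and $T_{max}$ are the maximum voltage amplitude and its occurrence time, respectively. A negative (positive) $Err$ means a false fire (false missing). Hence, the gradient of $Err$ with respect to each weight $w_c$ at pre-synaptic spiking time $T_c$ can be calculated as:
$$-\frac{\partial Err}{\partial w_c} = K_2(T_{fal} - T_c) + \frac{\partial V(T_{max})}{\partial T_{max}} \frac{\partial T_{max}}{\partial w_c} \tag{4}$$
Here $K_2$ is the linear decayed spike kernel defined in Eq. (3). As pre-synaptic spikes are weighted through synaptic efficacy $w_c$ before $T_{max}$, $\partial V(T_{max}) / \partial T_{max} = 0$. By further considering $Err$ in the change of $w_c$, $\Delta w_c$ can be expressed as:
$$\Delta w_c = \lambda \, Err \, K_2(T_{fal} - T_c) \tag{5}$$
where $\lambda$ denotes the learning rate, and the spike kernel $K_2$ is used again to calculate the contribution from the input neuron $X_c$ at time $T_c$. Algorithm 1 applies this update as $\Delta w \leftarrow \lambda \, Err \, K_2(T_{fal} - T_c)$, followed by $w_{ci} \leftarrow \Delta w + w_{ci}$.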
As discussed in "A-Decoding", only a part of the output neurons is involved in the learning of a certain pattern, meaning that only a part of the synaptic weights is updated. The dual-level acceleration, contributed by both "A-Decoding" and "PT-Learning", can improve the learning efficiency significantly. As we shall show later, the synaptic computation can be reduced by more than 200% when compared with the standard learning approach without accelerations. Moreover, "PT-Learning" together with "A-Decoding" can significantly boost the accuracy for realistic recognition tasks.
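A single weight update of one output neuron can be sketched as below. This is an illustration under stated assumptions rather than the authors' Algorithm 1: the linear kernel shape, the discrete scan used to locate $V_{max}$ and $T_{max}$, the default threshold and learning rate, and all function names are hypothetical; the update itself follows $\Delta w_c = \lambda \cdot Err \cdot K_2(T_{max} - T_c)$ with $Err = V_{th} - V_{max}$.

```python
def linear_kernel(dt, width=32.0):
    """K2 (assumed shape): 1 at the spike time, decaying linearly to 0."""
    if dt < 0 or dt > width:
        return 0.0
    return 1.0 - dt / width


def peak_voltage(weights, spike_times, t_end=64):
    """Scan discrete time steps and return (V_max, T_max)."""
    best_v, best_t = float("-inf"), 0
    for t in range(t_end + 1):
        v = sum(w * linear_kernel(t - tc)
                for w, tc in zip(weights, spike_times))
        if v > best_v:
            best_v, best_t = v, t
    return best_v, best_t


def pt_learning_step(weights, spike_times, should_fire, v_th=1.0, lr=0.1):
    """Return (new_weights, err); weights are unchanged if no error occurs."""
    v_max, t_max = peak_voltage(weights, spike_times)
    fired = v_max >= v_th
    if fired == should_fire:
        return list(weights), 0.0
    err = v_th - v_max  # > 0: false missing, < 0: false fire
    updated = [w + lr * err * linear_kernel(t_max - tc)
               for w, tc in zip(weights, spike_times)]
    return updated, err


# False missing: the neuron should fire but peaks below the threshold.
w1, err1 = pt_learning_step([0.3, 0.2], [2, 6], should_fire=True)
w2, err2 = pt_learning_step(w1, [2, 6], should_fire=True)
print(w1, err1, err2)
```

In this example the error shrinks from one step to the next, and each input neuron is credited in proportion to its kernel value at $T_{max}$, so a spike arriving just before the voltage peak receives the largest correction.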
IV. Evaluations
To evaluate the accuracy, processing efficiency and power consumption of our proposed "PT-Spike" neuromorphic architecture, extensive experiments are conducted on platforms such as MATLAB and a heavily modified open-source simulator, Brian [29].
A. Simulation Setup
In our evaluation, the full MNIST database is adopted as the benchmark [30]. A set of "PT-Spike" designs, denoted "PT-Spike(R)" (e.g. "PT-Spike(16)" and "PT-Spike(25)"), is implemented to demonstrate the leveraged temporal encoding. "Diehl-15" and "Lecun-98" [32] are also implemented for the energy and performance comparisons with the proposed "PT-Spike". Table I presents the detailed structural parameters of the selected candidates. Compared with "Diehl-15" and "Lecun-98", our proposed temporal encoding achieves a significant model size reduction for all "PT-Spike" designs, i.e. ~40× ("PT-Spike(4)" vs. "Diehl-15") and ~4× ("PT-Spike(4)" vs. "Lecun-98").
B. Accuracy
Fig. 5a shows the accuracy comparison among different "PT-Spike(R)" designs, "Lecun-98" and "Diehl-15". "PT-Spike(25)" achieves a comparable accuracy at a much lower cost (~86%, 1440 synaptic weights) when compared with "Diehl-15" (~83%, 78400 synaptic weights) and "Lecun-98" (~88%, 7840 synaptic weights). Meanwhile, "PT-Spike(16)" and "PT-Spike(25)" show very close accuracies (~87% and ~86%), which are much better than those of "PT-Spike(4)" and "PT-Spike(100)" (~63% and ~70%).
We also evaluated the individual training accuracy improvement contributed by each of the proposed techniques, i.e. "linearized spiking kernel", "Precise Temporal Encoding", "A-Decoding" and "PT-Learning", respectively. Here, we choose "PT-Spike(16)" as the baseline design that employs all the aforementioned techniques. "Exponential Kernel", "one-to-one mapping", "non A-Decoding" and "Tempotron" denote the designs that substitute only one of the four techniques. As shown in Fig. 5b, "PT-Spike(16)" shows a very marginal accuracy degradation (0.2%) because of the "linearized spiking kernel" ($K_2$ in Eq. (3)) when compared with the original costly "Exponential Kernel" design (86.9%, $K_1$ in Eq. (2)). Furthermore, "PT-Spike(16)" reaches roughly 4× the accuracy of the "one-to-one mapping" design (~21%), and improves the accuracy by ~19 and ~38 percentage points over the "non A-Decoding" design (~68%) and the theoretical "Tempotron" learning rule (~49%), respectively, which clearly demonstrates the effectiveness of the proposed "Precise Temporal Encoding", "A-Decoding" and "PT-Learning".
C. Processing Efficiency
The occurrence frequency of synaptic events, including both weighting and weight updating, is calculated to evaluate the system processing efficiency. Fig. 6a compares the number of weighting operations among three designs in the feed-forward pass. Unlike the other candidates, the number of weighting operations of "PT-Spike(16)" differs between training and testing due to the "Fire&Cut" mechanism in "A-Decoding". Hence, the weighting of the first testing iteration is also included for "PT-Spike(16)". Even the "non A-Decoding" design, i.e. "PT-Spike(16)" without the "A-Decoding" technique, gains a ~185× reduction in weighting operations as compared with "Diehl-15", since the rate-coded SNN requires a long time window to process the spikes with an enlarged neuron model size, causing a tremendous number of weighting operations in each time slot. Compared with "non A-Decoding", the weighting operations of "PT-Spike(16)" can be further reduced by ~28% and ~69% in the first training iteration and the testing iteration, respectively. As expected, the "early-detection" working mechanism in "A-Decoding" removes many unnecessary weighting operations on both "initialized" weights and "well-trained" weights.
We also characterize the occurrence frequency of weight updating during the first training iteration to evaluate the processing efficiency in the feed-back pass. As Fig. 6b shows, even "Worst Case" (i.e. "PT-Spike(16)
Spike(16)", respectively, demonstrating the effectiveness of "dual level acceleration" from decoding and learning.
D. Power Consumption
To roughly evaluate the power efficiency contributed by the proposed architecture, we adopted a methodology similar to that used in [7], [18]. A new candidate, "Minitaur" [8], is introduced for a fair comparison since it is a more hardware-oriented rate-coded SNN. As Fig. 6c shows, "PT-Spike(16)" saves ~8× and ~64× power for each input neuron and each input image over "Diehl-15", respectively, indicating the efficiency of our proposed single-spike coding technique. Compared with the hardware-oriented rate-coded SNN design "Minitaur", "PT-Spike(16)" can still achieve ~1.4× (~6.6×) power reduction for each input neuron (input image).
E. Discussions
The research on the time-based SNN represented by extremely sparse spikes, i.e. the single-spike design, is still in its infancy, and to the best of our knowledge, no exemplar large networks have been successfully demonstrated for realistic cognitive tasks. Due to the unique time-based learning and information representation, research in this area is quite challenging. In this work, we adopt a simple proof-of-concept design, i.e. a Single-Layer Perceptron, to illustrate the design optimizations of the time-based SNN and demonstrate its potential for realistic applications, though the classification accuracy is still lower than that of state-of-the-art DNNs and CNNs.
Extending our design to a multi-layered network will enhance its capability to handle more complicated cognitive tasks. However, this is non-trivial, as a multi-layer learning rule needs to be developed to facilitate the spatial information transfer among different layers. While our proposed approach cannot be directly applied to the multi-layered network in its current form, the novel techniques proposed in this paper, i.e. "Temporal Kernel Coding", "PT-Learning" and "A-Decoding", form the basis for the time-based multi-layer network. We believe the initial architecture developed in this paper will serve as a basic framework for multi-layer network design, and may encourage more research in this domain.
V. Conclusion
As the rate-based spiking neural network (SNN) is subject to power and speed challenges due to processing a large number of spikes, in this work we systematically studied the possibility of utilizing the more power-efficient time-based SNN in real-world cognitive tasks. Three integrated techniques, namely precise temporal encoding, efficient supervised temporal learning and fast asymmetric decoding, were proposed to construct the Precise-Time-Dependent Single Spike Neuromorphic Architecture, namely "PT-Spike". The single-spike temporal encoding offers an energy-efficient information representation solution with the potential for model size reduction. The supervised learning and asymmetric decoding work cooperatively to deliver more effective and efficient synaptic weight updating and classification. Our evaluations on the MNIST database demonstrate the advantages of "PT-Spike" over the rate-based SNN in terms of network size, speed and power, with a comparable accuracy.
