Abstract-Deep neural networks (DNNs) have been shown to be very effective at solving challenging problems in several areas of computing, including vision, speech, and natural language processing. However, traditional platforms for implementing these DNNs are often very power hungry, which has led to significant efforts in the development of configurable platforms capable of implementing these DNNs efficiently. One of these platforms, the IBM TrueNorth processor, has demonstrated very low operating power in performing visual computing and neural network classification tasks in real-time. The neuron computation, synaptic memory, and communication fabrics are all configurable, so that a wide range of network types and topologies can be mapped to TrueNorth. This reconfigurability translates into the capability to support a wide range of low-power functions in addition to feed-forward DNN classifiers, including, for example, the audio processing functions presented here. In this work, we propose an end-to-end audio processing pipeline that is implemented entirely on a TrueNorth processor and designed to specifically leverage the highly-parallel, low-precision computing primitives TrueNorth offers. As part of this pipeline, we develop an audio feature extractor (LATTE) designed for implementation on TrueNorth, and explore the tradeoffs among several design variants in terms of accuracy, power, and performance. We customize the energy-efficient deep neuromorphic network structures that our design utilizes as the classifier and show how classifier parameters can trade between power and accuracy. In addition to enabling a wide range of diverse functions, the reconfigurability of TrueNorth enables re-training and re-programming the system to satisfy varying energy, speed, area, and accuracy requirements.
The resulting system's end-to-end power consumption can be as low as 14.43 mW, which would give up to 100 hours of continuous usage with button cell batteries (CR3023, 1.5 Whr) or 450 hours with cellphone batteries (iPhone 6s, 6.55 Whr).
Due to the effectiveness of employing deep neural network approaches for solving a rapidly growing segment of data-intensive applications [1], [2], [3], [4], a new class of reconfigurable computing platforms is emerging to efficiently support deep neural networks. While there has been a proliferation of deep neural network implementations on GPUs and multicore CPUs, these implementations require a significant power budget. In contrast, using vastly different computational primitives at a very different scale and efficiency, human neural networks simultaneously solve a wide range of problems, including classification, association, recognition, perception, sensory fusion, planning, abstract cognition, etc. Consequently, there are several efforts designing efficient neural network platforms [5], [6], [7]. The IBM TrueNorth processor [8] is a reconfigurable platform capable of realizing some of the most massive deep neural networks on a single chip with extremely low power consumption [9]. The TrueNorth chip has demonstrated performance of greater than 1,200 FPS at ~200 mW on state-of-the-art image classification tasks [10].
Recognition of voiced digits is one of the most commonly used applications in automated voice recognition. It is commonly implemented using the Mel Frequency Cepstral Coefficient (MFCC) [11] feature extraction followed by a classifier. In this work, we have developed a baseline MFCC feature extractor using a custom FPGA accelerator as well as implementing MFCCs on a general purpose embedded ARM processor. Fig. 1(right) shows the default flow for the baseline MFCC approach using TrueNorth only as a classifier.
The MFCC feature extractor cannot be directly ported into the TrueNorth fabric due to fundamental differences in how computing is performed by the fabric. TrueNorth is designed to support massively parallel, low-precision spiking neural networks and has special constraints on connectivity and configuration structures to achieve high energy efficiency. This paper presents insights on how a new audio feature extractor can be designed to conform to the computational structure of the TrueNorth fabric, enabling the single-chip solution shown in Fig. 1(left) . Further, we show how this mapping can be tuned to trade off among area, power and accuracy of the application. We also integrate the audio feature extractor with a front end microphone and a back-end classifier to show the overall application level performance. Our results show that the IBM TrueNorth fabric is an effective platform for running ultra-low power analytics applications.
The major contributions of this paper include:
We propose an end-to-end audio processing pipeline that is implemented purely on TrueNorth to save power and interface overhead. We designed an audio feature extractor, called Low-power Audio Transform with TrueNorth Ecosystem (LATTE) [12], to use the low-precision synaptic weights that TrueNorth supports directly and efficiently. We extend that work by implementing several designs of LATTE using different precision and function options, and show the experimental results for those designs along with a discussion of the performance-power trade-offs. We customize the energy-efficient deep neuromorphic network (Eedn) structures that our design utilizes as the classifier. We parameterize the designs to be small for a limited power budget, or large to obtain a high accuracy, and support both choices with experimental results.
The design of the pipeline is configurable. For different target applications, we can reconfigure the TrueNorth chip to run different LATTE and Eedn designs and have different bit precisions corresponding to different power, area, and accuracy solutions. The power consumption of the proposed always-on system can be as low as 14.43 mW, which corresponds to up to 100 hours of operation with a button cell battery (CR3023, 1.5 Whr [13]) or 450 hours with a cellphone battery (iPhone 6s, 6.55 Whr [14]).
The rest of this paper is organized as follows. Section 2 provides an overview of the TrueNorth Fabric. Section 3 presents the audio feature extraction algorithm and the details of how the operations are mapped onto TrueNorth. Section 4 presents the configurable features of the audio feature extraction pipeline. Sections 5 and 6 evaluate the performance and power consumption of the end-to-end system. After related work, we present our conclusions in Section 8.
TRUENORTH
This section describes the architectural features of the TrueNorth processor and the corresponding programming model which developers use to design and program applications targeted for TrueNorth. This section only provides a brief introduction; more details are found in [15] , [16] , [17] .
TrueNorth Chip
The TrueNorth chip is the first silicon realization of the parallel, event-driven, and scalable TrueNorth Architecture [8], [15]. Each chip is composed of 4,096 neurosynaptic cores (Fig. 2), each of which is composed of arithmetic logic to compute the neuron state, memory to store the neuron state, connectivity, and parameters, as well as a router to communicate data between cores. The neurosynaptic core implements 256 neurons and 256 axons, all connected via a synaptic crossbar which allows for full connectivity, resulting in a total of 64 k programmable synapses per core (1 M neurons, 256 M synapses per chip). Dendrites are represented as the vertical direction of the crossbar, where a particular neuron may connect to one or more axons via the programmable synapse. Spiking events are generated when a neuron's membrane potential reaches its specified threshold, and then they are injected into the inter-core network. Target axons receive the spikes, in turn causing the connected neurons to update their membrane potentials, and so on.
The neurosynaptic core is capable of implementing several methods for computing neuron behavior based on a set of programmable weights and parameters. This work uses the following simplified equation to model the neural synaptic integration [17]
$$V_j(t) = V_j(t-1) + \sum_{i=0}^{255} A_i(t)\, w_{i,j}\, s_j^{G_i} + \lambda_j \qquad (1)$$
where $V_j(t)$ is the membrane potential of the $j$th neuron at time $t$, $A_i(t)$ represents the $i$th axon's reception of a spike (1 if a spike is received; 0 otherwise), and $w_{i,j}$ is the entry in the core's synaptic weight matrix (1 when the $i$th axon is connected to the $j$th dendrite). $G_i \in \{0, \ldots, 3\}$ is the integer axon type, for which a synaptic weight is defined per neuron, and $s_j^{G_i}$ is the synaptic weight between the $i$th axon and the $j$th dendrite. $\lambda_j$ is a leak applied to the membrane potential at each timestep. The leak can be a positive or negative number. For example, in this work we use the positive leak to generate a periodic spike train with a configurable period. Finally, the neuron firing behavior can include a stochastic process by enabling the pseudorandom number generator (PRNG) for the neuron. With this feature, a neuron fires when its membrane potential exceeds the threshold plus a random value generated by the PRNG.
As a concrete example of TrueNorth's operation, we next describe how a spike may cause a downstream neuron to fire. Assume a core with synaptic weights and axon types, as shown in Fig. 2, has just received spikes on $A_0$ and $A_2$, denoted spike 0 and spike 1. First, a trigger signal is received which increments the timestep and moves spike 0 and spike 1 from their respective axons into the synaptic crossbar. Spike 0 is sent to neuron 0 while spike 1 is duplicated to both neuron 0 and neuron 2, as both contain an active synapse. In neuron 0, $V_0$ is updated by −2 followed by +2 (as $A_0$ has type $S_0$, having a synaptic weight of −2, and $A_2$ has type $S_3$, which has a weight of +2). Neuron 2, on the other hand, is updated by −3 (as $A_2$ has type $S_3$, which has a weight of −3 for that neuron). After the synaptic and leak updates to the membrane potential are complete, random numbers are drawn when in the stochastic-spiking mode (a separate random number per neuron). If the current membrane potential exceeds its firing threshold plus the random number, even if it did not just receive spikes, the neuron will fire, resulting in a spike being produced.
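The update in this example can be traced with a small deterministic simulation. The following is only a minimal sketch under stated assumptions: the three-axon, three-neuron core, the `core_tick` helper, the weights not named in the text, the zero leak, the threshold of 4, and the reset-to-zero behavior are all hypothetical choices for illustration, and the stochastic mode is omitted.

```python
def core_tick(potentials, spiking_axons, crossbar, axon_type,
              type_weight, leak, threshold):
    """Apply one timestep of synaptic integration to a toy core.

    crossbar[i][j] is 1 when axon i connects to neuron j.
    type_weight[j][g] is neuron j's weight for axon type g.
    Returns (updated potentials, indices of neurons that fired).
    """
    updated, fired = [], []
    for j, v in enumerate(potentials):
        for i in spiking_axons:
            if crossbar[i][j]:                     # active synapse
                v += type_weight[j][axon_type[i]]  # per-neuron weight of the axon's type
        v += leak[j]                               # leak applied every timestep
        if v >= threshold[j]:                      # assumption: fire at or above threshold
            fired.append(j)
            v = 0                                  # assumption: reset to zero after firing
        updated.append(v)
    return updated, fired


# Toy core: A0 drives neuron 0; A2 drives neurons 0 and 2 (A1 is a filler row).
crossbar = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
]
axon_type = [0, 1, 3]       # A0 has type 0, A2 has type 3
type_weight = [
    [-2, 1, 1, 2],          # neuron 0: type 0 -> -2, type 3 -> +2
    [1, 1, 1, 1],           # neuron 1: filler values
    [1, 1, 1, -3],          # neuron 2: type 3 -> -3
]

new_potentials, fired_neurons = core_tick(
    potentials=[0, 0, 0],
    spiking_axons=[0, 2],   # spikes arrive on A0 and A2
    crossbar=crossbar,
    axon_type=axon_type,
    type_weight=type_weight,
    leak=[0, 0, 0],
    threshold=[4, 4, 4],
)
print(new_potentials, fired_neurons)
```

With these illustrative values, neuron 0 receives −2 and +2 and neuron 2 receives −3, matching the walk-through in the text.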
Programming Model
Programming TrueNorth is primarily accomplished using the Corelet Language via the Corelet Programming Environment (CPE) [16] . The CPE is a composable, hierarchical programming framework that specifies a network of neurosynaptic cores that only exposes the external input axons and external output neurons. All other details of the network, such as number of cores and internal connectivity, are encapsulated into a primitive known as a corelet (Fig. 3 ). TrueNorth programs are described as the composition of one or more corelets which are eventually mapped onto the physical TrueNorth processor.
In addition to CPE, a TrueNorth compatible model can also be generated using Caffe [18] or MatConvNet [19] with customized TrueNorth compatible layers [10], [20]. These layers are able to automatically generate and configure communication in terms of corelets, requiring developers to only specify standard MatConvNet inputs, such as the dataset, network parameters (number of layers, connectivity, loss functions, etc.) and learning parameters (learning rates, batch size, number of iterations, etc.), while using MatConvNet's constraint solvers. The result of this flow is a TrueNorth compatible network whose accuracy converges probabilistically to that of the standard MatConvNet network. With this tool, TrueNorth based classifiers can be easily trained on datasets, without the need to explicitly develop customized corelets. In this work, MatConvNet is used to train the DNN classifiers used to recognize spoken digits, using both MFCC and LATTE features.
TrueNorth Eedn
TrueNorth is a constrained neuromorphic processor, using low-precision synaptic weights, spiking communication, and blockwise inter-core connectivity. In conventional convolutional networks, weights are high-precision floating-point or fixed-point values. Also, conventional convolutional networks place no constraints on filter size or connectivity. Consequently, directly mapping these networks onto TrueNorth can significantly reduce the accuracy. Instead, IBM has developed the energy-efficient deep neuromorphic network (Eedn) [10], a specialized framework for training convolutional networks that directly incorporates the constraints of TrueNorth into the learning algorithm. By incorporating the constraints into the learning algorithm, the resulting networks achieve both high accuracy and high energy efficiency when run on TrueNorth.
As shown in Fig. 4, the Eedn structure is similar to conventional convolutional networks. In each layer of the network, there is a rows × columns × features data structure, and a shared filter also in a 3-D rectangular shape. At run time, the filter is applied to the input data, sliding through the entire data structure, and the convolutional output propagates to the (n + 1)th layer.
There are three key differences between Eedn and traditional convolutional networks. First, Eedn uses trinary weights (−1, 0, 1) during network operation (forward and backward passes), but maintains a high-precision hidden weight during training. For operation, the high-precision hidden weight is mapped to one of the trinary values by rounding with hysteresis. The trinary weights are implemented with two duplicated inputs, each weighted 1 or −1, chosen by the synaptic connectivity. Second, Eedn uses spiking neurons that have a threshold activation function. The derivative of this function is approximated for training. Third, Eedn partitions layers and the corresponding filters into multiple groups to ensure that filters are sized such that they can be implemented using single TrueNorth cores.
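The first point, realizing a trinary weight with two duplicated inputs, can be illustrated as follows. This is a sketch only: the helper names `trinary_synapse` and `neuron_input` are hypothetical, and the code models just the selection of a +1 or −1 copy by the connectivity bits, not the training-time rounding with hysteresis.

```python
def trinary_synapse(weight):
    """Map a trinary weight to connectivity bits for the two input copies.

    Returns (connect_plus, connect_minus): whether the neuron connects to
    the copy carried on the +1 axon and to the copy on the -1 axon.
    """
    if weight not in (-1, 0, 1):
        raise ValueError("weight must be -1, 0, or 1")
    return weight == 1, weight == -1


def neuron_input(spikes, weights):
    """Sum the input seen by one neuron when every input is duplicated."""
    total = 0
    for spike, weight in zip(spikes, weights):
        connect_plus, connect_minus = trinary_synapse(weight)
        if connect_plus:
            total += spike       # copy on the axon weighted +1
        if connect_minus:
            total -= spike       # copy on the axon weighted -1
    return total


spikes = [1, 1, 0, 1]
weights = [1, -1, 1, 0]
print(neuron_input(spikes, weights))
```

A weight of 0 connects to neither copy, so the result equals the ordinary dot product of the spike vector with the trinary weights.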
AUDIO FEATURE EXTRACTION

Traditional Audio Feature Extraction with MFCC
Raw audio signals (sound waves) are typically transformed from a time-domain waveform to a frequency domain representation in order to extract features. The Mel-Frequency Cepstrum (MFC) [11] is widely used to represent the frequency composition as coefficients (MFCCs) for audio analysis or recognition. Fig. 5 shows the MFCC pipeline which transforms a sound wave to MFCCs. To obtain the compressed power spectrum of a digital sound wave, the MFCC pipeline contains the following steps: Hamming window, Fast Fourier Transform (FFT), Mel-scale mapping, logarithm, and then Discrete Cosine Transform (DCT). Assuming the audio signal is framed, the Hamming window operation smooths the beginning and the ending of the frames to improve the FFT response (based on periodic signals). The FFT computes the frequency spectrum from the windowed frames. The Mel-scale mapping compresses the linear frequency spectrum, using triangular overlapping windows spaced according to the Mel-scale, a perceptually defined distribution. The logarithm non-linearly compresses the spectrum magnitude, again based on the characteristics of human auditory perception. Lastly, the DCT block decorrelates the coefficients.
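The five steps above can be sketched in plain Python for a single frame. This is an illustrative floating-point reference, not the FPGA or ARM baseline used in this work: the naive DFT stands in for the FFT, and the mel formula, the 20 filters, the 13 coefficients, and the small constant inside the logarithm are common conventions assumed here rather than values taken from the paper.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    """Power of the first n//2 + 1 DFT bins (naive O(n^2) stand-in for an FFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)   # assumed mel convention


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular overlapping filters with edges equally spaced in mel."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * m / (num_filters + 1)) for m in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(n_fft // 2 + 1):
            f = b * sample_rate / n_fft
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values)) for k in range(num_coeffs)]


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]
    spectrum = power_spectrum(windowed)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # avoid log(0)
    return dct2(log_energies, num_coeffs)


# A 256-sample frame of a 1 kHz tone sampled at 20 kHz.
tone = [math.sin(2 * math.pi * 1000 * i / 20000) for i in range(256)]
coefficients = mfcc(tone, sample_rate=20000)
print(len(coefficients))
```

Every stage here relies on floating-point arithmetic, which is the property that prevents a direct port to a low-precision spiking fabric.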
Proposed Feature Extraction-LATTE
We developed LATTE in order to lower the power required for audio preprocessing, as well as to aggregate the full auditory processing pipeline on TrueNorth. LATTE is inspired by the DCT, which finds the degree to which a windowed time-domain input signal matches a set of fixed-frequency cosine waves. As shown in Fig. 6, LATTE feature extraction computes the input response to four different phases ([cos, sin, −cos, −sin]) for 64 frequencies, followed by a max-pooling operation over the four phases for each frequency. This results in a 64-dimensional frequency-domain output feature vector.
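A floating-point view of this computation may help fix ideas before the hardware mapping. The sketch below is an assumption-laden illustration: the function name `latte_features` is hypothetical, and the 64 frequencies are taken to be 1 to 64 cycles per frame purely for simplicity.

```python
import math


def latte_features(frame, freqs):
    """Max over the [cos, sin, -cos, -sin] responses for each frequency.

    freqs are expressed in cycles per frame.
    """
    n = len(frame)
    features = []
    for f in freqs:
        cos_resp = sum(x * math.cos(2 * math.pi * f * i / n)
                       for i, x in enumerate(frame))
        sin_resp = sum(x * math.sin(2 * math.pi * f * i / n)
                       for i, x in enumerate(frame))
        features.append(max(cos_resp, sin_resp, -cos_resp, -sin_resp))
    return features


# 256-sample frame: 5 cycles per frame with an arbitrary phase offset.
frame = [math.cos(2 * math.pi * 5 * i / 256 + 0.7) for i in range(256)]
features = latte_features(frame, freqs=range(1, 65))
strongest = max(range(len(features)), key=features.__getitem__)
print(strongest + 1)   # frequency (cycles per frame) with the largest response
```

Taking the maximum over four phases keeps a strong response at the matching frequency for any input phase, although the magnitude still varies somewhat with phase.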
A 1D DCT, with input dimension $n$ and output dimension $k$, can be represented by the matrix transform
$$y_j = \sum_{i=0}^{n-1} x_i\, C_{ij}, \qquad j = 0, \ldots, k-1 \qquad (2)$$
where $x_i$ are the input values and $C_{ij}$ are the constants based on $\mathrm{freq}_j$, the frequency components of the DCT, where $C_{ij} = \cos\left(\frac{i}{n} \times \mathrm{freq}_j \times 2\pi\right)$. Following this concept, LATTE can be mathematically described by Equations (3), (4), and (5). The weighted sum stage is defined as
$$Y^{\cos}_j = \sum_{i=0}^{n-1} x_i \cos\left(\frac{i}{n} \times \mathrm{freq}_j \times 2\pi\right) \qquad (3)$$
$$Y^{\sin}_j = \sum_{i=0}^{n-1} x_i \sin\left(\frac{i}{n} \times \mathrm{freq}_j \times 2\pi\right) \qquad (4)$$
and the max-pooling stage as
$$\mathrm{LATTE}_j = \max\left(Y^{\cos}_j,\; Y^{\sin}_j,\; -Y^{\cos}_j,\; -Y^{\sin}_j\right) \qquad (5)$$
Thus it extracts an invariant feature (frequency), while reducing the effects of a variant feature (phase).
The TrueNorth architecture has several unique constraints that contribute to its energy-efficient operation. These constraints pose challenges to programming the architecture, and appear to challenge the fidelity of computation. However, our results show that these constraints and the use of approximate features lead to only a modest amount of degradation in classification accuracy, while improving energy efficiency by over an order of magnitude. First, TrueNorth has a limited number of non-zero synaptic weights per neuron (four), restricting the number of quantization levels used to represent sine or cosine coefficients. Second, TrueNorth uses spike codes of limited precision (32 levels in this work). Third, spike codes can only represent positive values. Fourth, neurons have limited fan-out, only incident on a single destination core. Finally, spiking neurons compute a limited set of functions. Thus, instead of computing MFCCs or DCT coefficients directly, we map LATTE features to the TrueNorth architecture as follows.
Implementation
The LATTE corelet implementation contains three subcorelets: splitter, weighted sum, and max-pooling, as depicted in Fig. 6. The splitter subcorelet duplicates the input signal (256 continuous audio samples), generating the copies of the signal required as input to each of the individual weighted sum cores. The weighted sum subcorelet of LATTE next performs the parallel computation of $k$ weighted sums over the input vector $X$ with $n$ elements. Fig. 7 shows the detailed design of the LATTE weighted sum subcorelet, which is the main component of the transform. The cosine (and sine) constants are first scaled and rounded to integers in $\{-2, -1, 0, 1, 2\}$, because each axon can be assigned one of four types. According to those integer values, the inputs weighted −2 are assigned to axon type 0, −1 to type 1, 1 to type 2, and 2 to type 3. The axon types of the inputs weighted 0 are don't cares (X). Then we program the synaptic connectivity matrix $w_{i,j}$ to all 1's, except for axons which have a weight of zero. Since spikes only represent positive values, we adopt a dual-rail encoding, using a pair of neurons to represent the positive and negative response to a given input. Both neurons receive identical input and have identical parameters, except for opposite signed synaptic weights. The positive neurons are programmed to have the synaptic weight vector $s_j^{G_i} = [-2, -1, 1, 2]$, in order to compute +cos and +sin, while the negative neurons have the synaptic weight vector $s_j^{G_i} = [2, 1, -1, -2]$, to compute −cos and −sin. Also, we fine-tuned the firing thresholds for each of the neurons to have a proper spiking probability. In particular, the high frequency components have lower energy, so the neurons should be more sensitive, which means a lower threshold than those of the low frequencies. The neurons in the weighted sum are not retentive; they reset in every cycle, regardless of whether they spiked or not.
With the programming above, only one of the pair of positive and negative neurons will spike, according to the matching degree of the input spike distribution and the specific cosine (or sine) wave. The entire weighted sum subcorelet uses 2 × k cores to generate k sets of frequency spectra, for a total of 4 × k weighted sum output signals.
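The quantization and dual-rail scheme can be sketched as follows. This is a simplified illustration under assumptions: the scale factor of 2 before rounding, the threshold of 1, the 16-sample frame, and the helper names are hypothetical, and each call models a single tick with binary input spikes.

```python
import math

# Axon type assigned to each non-zero quantized coefficient, as in the text.
AXON_TYPE_FOR_LEVEL = {-2: 0, -1: 1, 1: 2, 2: 3}
POSITIVE_WEIGHTS = [-2, -1, 1, 2]   # per-type weights of the positive neuron
NEGATIVE_WEIGHTS = [2, 1, -1, -2]   # per-type weights of the negative neuron


def quantize_cosine(n, freq, phase=0.0):
    """Scale a cosine to [-2, 2] and round it to integer levels."""
    return [round(2 * math.cos(2 * math.pi * freq * i / n + phase))
            for i in range(n)]


def dual_rail_tick(spikes, levels, threshold):
    """Return (positive_fires, negative_fires) for one tick.

    Both neurons start from zero because they reset every cycle.
    """
    positive = negative = 0
    for spike, level in zip(spikes, levels):
        if level == 0 or not spike:
            continue                       # synapse disconnected, or no spike
        axon_type = AXON_TYPE_FOR_LEVEL[level]
        positive += POSITIVE_WEIGHTS[axon_type]
        negative += NEGATIVE_WEIGHTS[axon_type]
    return positive >= threshold, negative >= threshold


levels = quantize_cosine(n=16, freq=1)
in_phase = [1 if level > 0 else 0 for level in levels]     # spikes on positive lobe
anti_phase = [1 if level < 0 else 0 for level in levels]   # spikes on negative lobe
print(dual_rail_tick(in_phase, levels, threshold=1))
print(dual_rail_tick(anti_phase, levels, threshold=1))
```

Because the two neurons accumulate sums of opposite sign, at most one of them can reach a positive threshold in a given tick.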
The LATTE max pooling stage finds the maximum response of the phase channels [cos, sin, −cos, −sin] output from the weighted sum block. (These phase channels also correspond to the [0, π/2, π, 3π/2] cosine phases.) Because the spikes from the four channels might occur at different times, the values they represent should be compared based on the accumulation over a period of time. Therefore, the max pooling subcorelet can further be divided into three subcorelets as shown in Fig. 8: time divider, stochastic-to-deterministic conversion (sto-to-det), and deterministic-to-stochastic conversion (det-to-sto). TrueNorth compatible DNNs use a stochastic spike code, where a spike is generated at any given timestep with probability proportional to the value of the data to be propagated through the network. For a given period of length p (32 in our experiments), the time divider propagates spikes to the first output channel on the odd periods and to the second channel on the even periods. Then the sto-to-det subcorelet accumulates on the receiving period and spikes deterministically on the non-receiving period. Since the signals are deterministic, the four phase channels can be directly combined (as a logical OR operation) to get the maximum value [21]. Similarly, the det-to-sto subcorelet receives the combined spikes on the receiving period and spikes stochastically on the non-receiving period. At the end of the max pooling pipeline, the final output is formed by merging the two time-divided signals with another OR operation. All three subcorelets are clocked by spontaneous neurons that spike periodically and deterministically, to delineate the computation periods. In TrueNorth, the OR operation is achieved by simply targeting spikes to the same destination axon. As a result, no additional cores are required to implement this function. The time divider and the sto-to-det blocks both use ceil(
The max pooling corelet finds the maximum of four values, each represented by the number of spikes within a period of fixed length. If we remove the max pooling and directly wire the signals together, the pipeline essentially performs the max pooling in each tick. We can also expand the max pooling corelet by increasing the number of phases processed to more than four.
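The reason an OR over deterministic codes yields the maximum can be seen in a short simulation. This sketch assumes that the deterministic code is a burst of consecutive spikes aligned to the start of the period; the helper names and the example channel values are hypothetical.

```python
import random


def stochastic_code(value, period, rng):
    """Spike in each tick with probability value / period."""
    return [1 if rng.random() < value / period else 0 for _ in range(period)]


def to_burst(spikes):
    """Count the spikes of one period and replay them as a leading burst."""
    count = sum(spikes)
    return [1] * count + [0] * (len(spikes) - count)


def spike_or(*trains):
    """Per-tick OR, as obtained by targeting several spikes at one axon."""
    return [1 if any(ticks) else 0 for ticks in zip(*trains)]


def max_pool(trains):
    """Spike count after OR-ing the burst-coded channels."""
    return sum(spike_or(*[to_burst(train) for train in trains]))


rng = random.Random(0)
period = 32
channels = [stochastic_code(v, period, rng) for v in (6, 20, 3, 11)]
counts = [sum(train) for train in channels]

pooled = max_pool(channels)            # equals the largest channel count
per_tick_or = sum(spike_or(*channels)) # OR applied directly to stochastic codes
print(counts, pooled, per_tick_or)
```

Aligned bursts overlap completely, so their OR has the length of the longest burst. A per-tick OR of the stochastic codes can only be greater than or equal to that maximum, which is why it tends to overestimate.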
CONFIGURABLE PREPROCESSING PIPELINE
The transduction, LATTE, and folding corelets preprocess the audio samples and feed the audio features into the Eedn classifier. In Fig. 1, the preprocessing pipeline is built together with the classifier on TrueNorth, while traditional preprocessing operations are implemented in software and performed by the ARM core. According to the requirements of the application, all three stages of the preprocessing pipeline can be configured to trade off area, energy, delay, and accuracy.
Transduction
Audio data captured by a digital electrical recording device (microphone + ADC) is represented using binary numbers, but a spiking neural network operates with signals in spiking representations. Therefore, a transducer converts the signal from binary representation, to a spiking signal that the neurons can process. In the transduction corelet (Fig. 9) , each of the digital bit-lines is connected to the respective axon, which is assigned the power-of-two weight according to the significance. The neurons in the transduction corelet accumulate the weights, and spike as many times as the accumulation.
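The conversion from a binary sample to a spike count can be sketched as below. This is an illustration only: the helper names, the default of 5 bits, the 32-tick window, and the choice to emit the spikes consecutively are assumptions made for the example.

```python
def transduce(sample, bits=5):
    """Accumulate power-of-two axon weights for the bit-lines that are set."""
    if not 0 <= sample < (1 << bits):
        raise ValueError("sample does not fit in the given number of bits")
    potential = 0
    for bit in range(bits):
        if (sample >> bit) & 1:      # bit-line carries a spike
            potential += 1 << bit    # axon weight 2**bit for that bit-line
    return potential


def spike_train(sample, bits=5, window=32):
    """Emit as many spikes as the accumulated value within one window."""
    count = transduce(sample, bits)
    if count > window:
        raise ValueError("window too short for this value")
    return [1] * count + [0] * (window - count)


train = spike_train(0b10110)
print(sum(train), len(train))
```

A 5-bit sample takes values from 0 to 31, so its spike count always fits in a 32-tick window.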
The bit-resolution of transduction for the design shown in Fig. 9 is flexible in the range from 1 to 8 bits. In this paper, we use TIDIGITS [22] as the application dataset. The TIDIGITS data are sampled at 20 kHz, which means 20 new samples arrive in each TrueNorth tick (~1 ms). In order to measure (or detect) features that correspond to low-frequency spectra (human voice range: 80 to 255 Hz), a frame should be longer than 1/80 s. In addition, we want to encode the multi-bit signal into a sequence of spikes that have sufficient resolution. Therefore, we use 5-bit resolution for the input signals by sending the 5-bit data to the five least-significant bits of the inputs to the transduction corelet, and propagating the spiking signals in a length of 32 ticks to the LATTE corelet. Accordingly, we frame the samples to the size of 32 ticks (32 ms), using 512 (or 256, depending on the LATTE design) of the 640 available samples in every period.
Fig. 8. Max pooling subcorelet. The time divider takes the input (1) and divides the signals into two periods (2) and (3). The sto-to-det subcorelet converts the signals (2) and (3) from a stochastic spike code into a deterministic burst spike code, (4) and (5) respectively. The OR operation on the burst code computes the max across 4 phases. The det-to-sto subcorelet converts the signal back to stochastic code (6) and (7). Finally, the two period streams are recombined to a single output stream (8), per feature.
LATTE Options
The LATTE corelet can be configured to trade off among accuracy, area, and power budgets. Table 1 shows the five different configurations explored in detail in this work by changing parameters such as frame size, max-pooling type, and weight representations. We start with a baseline version, LATTE-III, and perturb various parameters in our analysis. The LATTE input frame size determines the number of available samples in each sampling period for a given number of cores. The LATTE-I configuration reduces the frame size by half as a means of reducing power and area consumption, but this configuration results in accuracy loss due to the reduction in samples.
The max pooling corelet counts the input spikes of each port for a period of time, and spikes according to the largest number counted. In LATTE-II, we replace the max pooling corelet with an OR operation of the spikes at each tick. While the OR operation is simpler, the output spiking rate may be high even when each of the inputs spikes sparsely. In that case, we would have to reduce the sensitivity of the weighted sum corelet and lose some detailed features to save power. In the LATTE-IV configuration, we expand the max pooling corelet to take more than the four phases used in the baseline in order to realize a more precise feature extractor. Finally, the LATTE-V configuration uses a wider range of encoded weights to enhance accuracy.
Folding for frame collection
While each output frame from LATTE contains the features from a (32 ms) frame of samples, the audio segment to be classified usually has a length longer than one frame. Consequently, the features from multiple frames are required by the classifiers. In the traditional approach, the features of the frames are collected in a buffer by software. In TrueNorth, we designed a folding corelet to perform the same frame collection operation.
In the TrueNorth ecosystem, the signals are streaming spikes which cannot be stored indefinitely in a local buffer. Therefore, we virtually create a long traveling path for the output spikes from the LATTE corelets, so that the spikes remain present in the system for a longer time. As shown in Fig. 10, the signals are duplicated at every fixed distance (k) along the virtual path, and one of the duplicates is sent to the output port while the other keeps traveling. In that case, the traveling path is essentially folded, and we obtain the signals on the side of the folded path.
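The folded path behaves like a tapped delay line, which can be sketched as follows. This is only an illustration under assumptions: the class name `FoldingPath` is hypothetical, a single feature channel is modeled, and the taps are placed at delays of k, 2k, and so on up to n × k ticks.

```python
from collections import deque


class FoldingPath:
    """Tapped delay line with n taps spaced k ticks apart."""

    def __init__(self, n, k):
        self.n = n
        self.k = k
        self.span = n * k                        # total time covered, in ticks
        self.path = deque([0] * self.span, maxlen=self.span)

    def tick(self, value):
        """Read the n taps, then push the new value onto the path."""
        # Before the push, path[d - 1] holds the value from d ticks ago.
        taps = [self.path[fold * self.k - 1] for fold in range(1, self.n + 1)]
        self.path.appendleft(value)              # oldest value falls off the end
        return taps


small = FoldingPath(n=3, k=2)
outputs = [small.tick(value) for value in range(1, 8)]   # feed 1, 2, ..., 7
print(outputs[-1])   # taps seen when the value 7 arrives

full_size = FoldingPath(n=23, k=48)
print(full_size.span)
```

With 23 folds of 48 ticks each, the span is 23 × 48 = 1,104 ticks, or about 1,104 ms at roughly 1 ms per tick.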
To build the virtual path, we use the n delay elements in TrueNorth's axon buffers (k for each) to perform a frame collection over the time length n × k. The n and k are configurable parameters. In our implementation, we choose n = 23 and k = 48 to cover a period of 1,104 ms, which is generally long enough to cover the spoken digits of the TIDIGITS dataset. Fig. 11 shows the features produced by both MFCC (top) and LATTE (bottom). LATTE features (Fig. 11 bottom) are approximated features using the DCT-like processing on TrueNorth, while the MFCC features (Fig. 11 top) are high-precision and are used as a baseline.
Fig. 10. The folding corelet is a virtually folded long data path. At every distance of several ticks (k), the data are duplicated and sent to one of the outgoing ports.
Fig. 11. Example MFCC (top) and LATTE (bottom) extracted features for the spoken digits (1-9, zero and "oh"). Differences between the digits are seen in the differences in feature response across time and frequency in both representations. Although the MFCC and LATTE feature responses are visually very different from each other, they both carry sufficient information for speech recognition.
As described in Section 3.1, MFCC feature computation consists of Hamming window, FFT, mel-scale mapping, logarithm, and DCT. All those operations rely on a high-precision signal representation and cannot be mapped directly onto the TrueNorth ecosystem. The MFCC pipeline without the DCT step obtains the mel-scale frequency amplitude, in log scale. Following the same concept, LATTE finds the matching degrees of the input signal to a set of predetermined frequencies which are approximately distributed in mel-scale. The thresholds of the neurons for different frequencies are set according to the observation that signals at higher frequency often have lower amplitude. Therefore, the sensitivity of neurons at high frequencies is higher, and the logarithm is no longer required. In summary, we design LATTE to save power by removing or replacing the high-precision processing steps in the MFCC pipeline.
END-TO-END SYSTEM PERFORMANCE
In this section, we show the performance of various combinations of the preprocessing pipelines using five different LATTE designs and four different sized Eedn networks. All the experiments follow the same flow: 1) We feed the raw audio data of the TIDIGITS dataset to the preprocessing corelet (or the ARM preprocess function with MFCC). 2) We collect the extracted features from the preprocessing stage in spiking format, and save them with the time and input pin information. 3) We cut the spike streams into fixed time windows, and store them in the lightning memory-mapped database (LMDB) [23] format. 4) We train the Eedn classifier with the LMDB on a GPU, and generate the network corelet for testing on TrueNorth. 5) We test the trained classifier network corelet with the saved spike streams, which are the exact output spike streams from the preprocessing corelet.
LATTE Designs versus Accuracy
The LATTE feature extractor comprises the largest portion of the preprocessing corelet. We implemented the five preprocessing pipelines with different LATTE options shown in Table 1. As shown in Fig. 12, the accuracy of the pipeline using LATTE-III is 94.6 percent. The design using LATTE-I saves about three quarters of the preprocessing overhead while reducing the accuracy to 90.7 percent. In LATTE-II, removing the max pooling corelet, which is a relatively small part of the pipeline, drops the accuracy to 91.2 percent. In contrast, adopting an extended max pooling corelet that processes more phases in LATTE-IV increases the accuracy to 95.1 percent. Finally, the use of high-precision weights in LATTE-V raises the accuracy to 95.5 percent, but increases the size of the preprocessing corelet almost by half. We choose the preprocessing pipelines with LATTE-I and -IV to be the preferred designs for low-power and high-accuracy oriented designs, respectively. In addition to the LATTE corelet designs in TrueNorth, we implement a floating-point version of LATTE that runs on an embedded ARM core with an accuracy score of 97.8 percent. Compared to the corelet versions of LATTE, the floating-point LATTE uses input data in floating-point representation, and floating-point weights. The floating-point LATTE algorithm shows the upper bound of the accuracy that the LATTE algorithm can achieve. On the other hand, the classifier trained and tested with the software implementation of the preprocessing pipeline using MFCC achieves 99.5 percent accuracy.
Eedn Size versus Accuracy
We implement multiple designs of the Eedn classifier network. Four specialized Eedn designs (Eedn-i, Eedn-ii, Eedn-iii, and Eedn-iv), shown in Table 2, were explored, and these configurations used 170, 300, 552, and 1,071 cores, respectively. The data fed into the first layer are 46 frames (23 folds, 2 channels) by 64 features. The size of the column dimension decreases as the data moves to the upper layers, meaning each upper column contains the information of multiple columns of lower layers. In the last layer, the 220 output features are correlated with the 11 classes (20 each).
The number of features generated on the lower levels dominates not only the size but also the accuracy. The basic design (Eedn-i) mentioned above uses 170 TrueNorth processing cores. We can expand the network by increasing the number of lower layer features, which is more important to the accuracy. As shown in Fig. 13, the LATTE accuracy curve is nearly a straight line, which means that the increase in accuracy is approximately proportional to the number of Eedn cores in log scale. On the other hand, the implementations using floating-point LATTE and MFCC have the highest accuracy with Eedn-iii, and slightly lower accuracy with Eedn-iv because of overfitting.
POWER
For power measurement, we use the single-board design platform called NS1e released by IBM. The board includes an ARM core, a TrueNorth chip with 1 million neuron cells, an FPGA for configuring the TrueNorth chip, and other peripheral circuits such as Ethernet, memory, and interfaces.
On the ARM core, we run Linux as the operating system so that we can compile and run the MFCC C++ code. Fig. 14a shows the active power comparison of LATTE on TrueNorth, MFCC on ARM, and MFCC on FPGA [12]. The power consumed by the operation of the preprocessing corelets is 3.774 mW (with LATTE-I) and 5.92 mW (with LATTE-IV), while the preprocessing approaches implemented on ARM and FPGA consume as much as 62.3 mW and 122 mW, respectively. The software approach on ARM consumes lower active power than the FPGA approach; however, the power consumed by Linux, which the former requires, is likely to be larger than the FPGA static power of 12 mW reported in [12].
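To put the active-power figures above side by side, the following sketch computes the ratio of each conventional implementation to each LATTE corelet. The numbers are those quoted in the text; the dictionary keys and the helper `power_ratio` are illustrative names only. The comparison covers active power alone and ignores the Linux and FPGA static overheads discussed above.

```python
# Active preprocessing power in mW, as quoted in the text.
ACTIVE_POWER_MW = {
    "LATTE-I on TrueNorth": 3.774,
    "LATTE-IV on TrueNorth": 5.92,
    "MFCC on ARM": 62.3,
    "MFCC on FPGA": 122.0,
}


def power_ratio(baseline, candidate):
    """Return how many times more active power `baseline` draws than `candidate`."""
    return ACTIVE_POWER_MW[baseline] / ACTIVE_POWER_MW[candidate]


for baseline in ("MFCC on ARM", "MFCC on FPGA"):
    for candidate in ("LATTE-I on TrueNorth", "LATTE-IV on TrueNorth"):
        ratio = power_ratio(baseline, candidate)
        print(f"{baseline} vs. {candidate}: {ratio:.1f}x")
```

With these figures, LATTE-I draws roughly 16.5 times less active power than MFCC on ARM and roughly 32 times less than MFCC on FPGA.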
The NS1e board is a general-purpose design platform. To support a variety of application requirements, it spends most of its power on the supplemental elements and interfaces, as shown in Fig. 14b. We measured the TrueNorth static power and scaled it down by the factor (number of cores used)/(total number of cores) to give a more precise estimate of the real case. We also power TrueNorth with a reduced supply voltage (0.8 V), which is sufficient for TrueNorth to perform the real-time task. Fig. 15 shows the measured power of the TrueNorth end-to-end systems with scaled static power. By default, the NS1e is powered with a 1.0 V supply voltage. As shown in Table 3, the two systems using LATTE-IV with Eedn-i and Eedn-iv (170 and 1,071 cores) consume 28.264 and 72.15 mW, respectively, on the NS1e. When we power the chip with the reduced voltage, the power consumption is reduced by almost half. The high-accuracy system (95.092 percent) consumes 38.59 mW, and the low-power system with a reasonable accuracy (91.8572 percent) consumes 14.429 mW.
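The static-power scaling and the battery-life figures quoted in the abstract can be reproduced with a short sketch. The total of 4,096 cores is taken from public descriptions of TrueNorth rather than from this section, the measured static power passed in the example is a hypothetical placeholder, and the battery model is idealized (no conversion loss or self-discharge). The function names are illustrative.

```python
TOTAL_CORES = 4096  # assumption: TrueNorth core count, not stated in this section


def scaled_static_power_mw(measured_static_mw, cores_used, total_cores=TOTAL_CORES):
    """Scale whole-chip static power by the fraction of cores in use."""
    return measured_static_mw * cores_used / total_cores


def battery_life_hours(capacity_wh, power_mw):
    """Ideal run time in hours for a battery of `capacity_wh` at a constant `power_mw`."""
    return capacity_wh * 1000.0 / power_mw


# Hypothetical whole-chip static power (mW); the measured value is not given here.
example_static_mw = scaled_static_power_mw(100.0, cores_used=170)

# Low-power system from the text: 14.429 mW end to end.
low_power_mw = 14.429
hours_button_cell = battery_life_hours(1.5, low_power_mw)   # 1.5 Wh button cell
hours_phone = battery_life_hours(6.55, low_power_mw)        # 6.55 Wh phone battery

print(f"button cell: {hours_button_cell:.0f} h, phone battery: {hours_phone:.0f} h")
```

Under these idealized assumptions the low-power system runs for about 104 hours on the button cell and about 454 hours on the phone battery, consistent with the rounded figures of 100 and 450 hours given in the abstract.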
RELATED WORK
The search for robust audio features is likely as old as the task of speech recognition itself. Many different features have been used, including LPC [24], MFCC [11], PLP [25], and RASTA [26] features. Others have investigated biological and neural computational auditory features, exploring approaches such as efficient auditory codes [27] and spatiotemporal receptive fields (STRFs) [28], [29]. Others have sought to replicate biology, using neuromorphic approaches to compute audio features [30], as well as low-power sensors [31] that transduce and directly output features instead of raw audio. Finally, the field of approximate computing [32] examines the use of reduced-precision computing in order to improve energy efficiency. The LATTE approach may apply to other neuromorphic architectures [33] as well. However, TrueNorth performs exceptionally well due to its low-power and deterministic digital synapses and computation.
There are several publications using the TIDIGITS dataset. [34] proposed a method that trains on the data of adults and tests with the data of children, achieving a 90.4 percent score, and then improves the accuracy to 95.88 percent by adapting the children's data in training. [35] reported accuracies from 98.73 percent (children) to 98.89 percent (adults) using MFCC plus the stranded Gaussian Mixture acoustic Model (SGMM). The highest accuracy on the TIDIGITS dataset, to the best of the authors' knowledge, is 99.81 percent, reported in [36]. That work uses Power Normalized Cepstral Coefficients (PNCCs), an improved preprocessing method derived from MFCC, along with Gaussian mean supervectors (GMS) and support vector machines (SVMs). All of the above are computationally intensive software-based solutions.
In this work, we use GPUs to train the Eedn classifier, and run the classifier on the TrueNorth architecture, which implements integrate-and-fire neural computation digitally for robustness. There are several other neuromorphic hardware accelerators/simulators (see [33] for a comparison). Targeting high performance and general-purpose programmability, the SpiNNaker chip [5] is constructed from an array of 18 ARM cores. The FACETS [6] and BrainScaleS [7] projects use mixed-signal electronics, emulating the biological integrate-and-fire behavior using analog computation cores and performing high-level communication digitally. GPUs [37] are also popular for accelerating general deep-learning tasks, but consume significant amounts of power. There are also many variations of CNN-specific accelerators, such as DianNao [38], DaDianNao [39], and EIE [40], including some that specifically aim for low-power and embedded implementations [41]. Recent works [40], [42] have examined the relative efficiency of inference for general spiking neural network platforms, such as TrueNorth, compared with these CNN-specific accelerators, and imply that non-spiking designs can also achieve high efficiency on CNN inference. However, as demonstrated in this work, the reconfigurable TrueNorth architecture provides the flexibility to also perform more general-purpose computations beyond inference, thereby maintaining high end-to-end power efficiency for entire tasks mapped to the fabric.
CONCLUSION
This paper presents insights on how a novel task, in this case a new audio feature extractor, can be designed to conform to the computational structure of the TrueNorth fabric, enabling single-chip solutions with substantially improved energy efficiency. While TrueNorth offers massive parallelism and efficiency for its supported primitives, the most effective way to map to these primitives is not always obvious, and we show how this mapping can be tuned to trade among area, power, and accuracy at the application level. Our results show that the IBM TrueNorth fabric is an effective platform that can be configured for single-chip, ultra-low-power analytics applications. Customizing the whole pipeline, from the feature extraction front-end (LATTE) to the back-end classification network (Eedn), to best match the TrueNorth architecture has allowed us to implement an always-on audio-recognition system that is capable of lasting for hundreds of hours of continuous use on button cell or phone batteries, fundamentally extending deployment scenarios.
ACKNOWLEDGMENTS
This work is supported in part by US National Science Foun-

Dharmendra S. Modha (M'92-F'11) received the BTech degree in computer science and engineering from IIT Bombay, Maharashtra, India and the PhD degree in electrical and computer engineering from the University of California at San Diego, San Diego, California. He is an IBM chief scientist for brain-inspired computing. He has made significant contributions to IBM businesses via innovations in caching algorithms for storage controllers, clustering algorithms for services, and coding theory for disk drives. He is an IBM master inventor. He is a cognitive computing pioneer and leads a highly successful effort to develop brain-inspired computers. His work has been featured in The Economist, Science, New York Times, Wall Street Journal, The Washington Post, BBC, CNN, PBS, Discover, MIT Technology Review, Associated Press, Communications of the ACM, IEEE Spectrum, Forbes, Fortune, and Time, amongst others. His work has been featured on covers of Science (twice), Communications of the ACM, and Scientific American. He has authored more than 60 papers and holds more than 100 patents.
