Abstract-The control of upper limb neuroprostheses through the peripheral nervous system (PNS) can allow restoring motor functions in amputees. At present, the important aspect of the realtime implementation of neural decoding algorithms on embedded systems has been often overlooked, notwithstanding the impact that limited hardware resources have on the efficiency/effectiveness of any given algorithm. Present study is addressing the optimization of a template matching based algorithm for PNS signals decoding that is a milestone for its real-time, full implementation onto a floating-point digital signal processor (DSP). The proposed optimized real-time algorithm achieves up to 96% of correct classification on real PNS signals acquired through LIFE electrodes on animals, and can correctly sort spikes of a synthetic cortical dataset with sufficiently uncorrelated spike morphologies (93% average correct classification) comparably to the results obtained with top spike sorter (94% on average on the same dataset). The power consumption enables more than 24 h processing at the maximum load, and latency model has been derived to enable a fair performance assessment. The final embodiment demonstrates the real-time performance onto a low-power off-the-shelf DSP, opening to experiments exploiting the efferent signals to control a motor neuroprosthesis.
ical signal as electroencephalogram (EEG), electromyogram (EMG), and electroneurogram (ENG). Despite EMG-controlled prostheses already entered the clinical practice, ENG ones, which are necessarily invasive, requiring to access the peripheral nervous system (PNS) [1] , are still experimental. ENG is potentially able to provide more information than EMG to control more degrees of freedom (unless approaches such as targeted reinnervation [2] are used), at the expenses of a significantly higher computational complexity [3] . In order to control a motor neuroprosthesis, the decoding algorithm must be implemented onto an embedded signal processing platform, characterized by limited resources, low operating frequency and tight real-time bounds to fulfil. This aspect is often overlooked, assuming it as a second-order issue, with obvious consequences on the validity of the achieved results in a real scenario.
The aim of this paper is to present the real-time, full implementation onto a floating-point digital signal processor (DSP), of a PNS signal decoding algorithm based on spike sorting and classification. To the best of our knowledge, this is the first time a completely embedded solution for PNS signals decoding, with a power profile compatible with a wearable implementation, is presented and evaluated. Different optimization aspects are discussed, from the simplest steps needed to switch from the off-line to an on-line implementation, to the algorithm modifications aimed at achieving the smallest memory footprint and improved code efficiency. DSPs are presented as valuable targets for on-line PNS signal decoding, being able to effectively perform the required functions with all the flexibility typical of microcoded implementations. Compared to the very preliminary porting exploration [4] , the current solution enables the development of a completely stand-alone neuroprosthesis, re-trainable without external computers.
The remainder of this paper is organized as follows. In Section II, the background information on both the aim of this work and the chosen original algorithm is provided. Section III deals with the algorithm optimization and changes applied to achieve real-time performance onto an off-the-shelf DSPs. Porting details are provided in Section IV. The datasets used for testing are illustrated in Section V, whereas the achieved results are presented in Section VI. Conclusions are given in Section VII.
II. MOTIVATION AND BACKGROUND
Neural signal decoding is a critical signal processing stage required by CNS-and PNS-based neuroprostheses, mainly based on spike sorting algorithms to analyse the single neuron activity from extra-cellular recordings [5] . Thanks to the evolution of the neuronal interface at the PNS level with intrafascicular electrodes characterized by a good selectivity [6] , in principle it is possible to perform decoding on the PNS with techniques similar to those exploited for the CNS. However, at cortical level, signal-to-noise ratio (SNR) is usually good enough to clearly recognize the action potentials, whereas on the peripheral nerves their amplitude is typically lower and so the SNR [7] .
There is a huge number of techniques for CNS signals decoding [8] . Some of them undoubtedly have clear on-line vocation, as for those based on Kalman filtering [9] , and have been successfully exploited on human subjects [10] . Other techniques, more specific for PNS signals and based on spike sorting and classification, are more challenging form a computational perspective [7] . Taking into account the final goal of this work, in this paper one of these algorithms has been selected [11] . Originally tested on the animal model to decode afferences, i.e., the tactile/proprioceptive stimuli provided on the leg where the neural interface is implanted, is has been successfully exploited for off-line decoding of efferences on a human amputee [12] .
Despite the chosen algorithm had potentialities to be adopted in an on-line scenario, this is not true for other better algorithms such as one of the top state-of-the-art ones (SPC, [13] ). This algorithm consists of the following processing steps:
• a fourth-order noncausal bidirectional elliptic IIR bandpass filtering between 300 Hz and 3 kHz; • SD using a fixed amplitude threshold based on the signal variance; • SS based on superparamagnetic clustering (SPC). The algorithm assumes a refractory period of about 2 ms, not including any strategy for the overlapping spikes. For each detected spike, limited to 64 samples at a sampling frequency of 24 kHz by a time windowing, the algorithm uses as features for the classifier the wavelet transform coefficients that meet the criterion of Kolmogorov-Smirnof normality. As last stage, the algorithm includes the unsupervised SPC, automatically selecting through a Monte Carlo simulation the temperature parameter value, exploited to influence the cluster size and modifiable by the user in order to really achieve the best performance. This algorithm has been considered here as golden standard for performance comparisons.
A. Original Formulation of the Chosen Algorithm
The chosen algorithm [11] consists of four stages: wavelet denoising (WD), spike detection (SD), spike sorting by template matching (SS) for feature extraction, and classification (CL). The signal is assumed to be sampled at 12 kHz. This is in line with the current literature, stating that 10 kHz is the optimal sampling frequency for similar recordings [14] .
WD is performed exploiting a translation-invariant wavelet transform with Symmlet 7 mother wavelet, three decomposition levels, hard thresholding of the details and complete removal of the approximation. The latter choice limits the signal bandwidth from 750 Hz to 6 kHz. The minimax threshold method is used, providing a scaling factor for the standard deviation of the noise , computed through the median absolute deviation, over 45 s of quiescent signal.
SD is based on a fixed threshold, computed as three times the standard deviation of the samples in such time interval.
SS is performed by template matching based on the Pearson's correlation coefficient. Every new detected spike is extracted (variable support length) and compared by cross-correlation with the existing templates. After the spike and a template have been aligned on the highest correlation value, when: 1) the correlation value is higher than a fixed threshold , 2) the ratio between the mean square difference of the spike and template, and the power of the template, is less than 0.5, then the spike matches the template. The best matching template, , (exhibiting the highest correlation, ) is selected. During the tuning phase (TPh), the templates are created when at least one of the above conditions cannot be satisfied, and the best matching template is updated by synchronized averaging every time all the conditions are met. At the end of TPh, the templates with a reciprocal correlation higher than undergo a weighted synchronized averaging. Then, only the most used templates are retained for the SPh. After training, during steady phase (SPh), i.e., the same procedure is performed without any template update.
A pattern to be recognized by the classifier is a 1 array of features where every element represents the percentage of spikes (in a -sample long window of the input signal, lasting s, where ranges from 3-6 s) with a waveform highly similar to one of the templates available from the SS stage. Classification is performed by a -support vector machine (SVM) with radial basis function (RBF). To solve the multi-class problem (each class represent a decoded stimulus) one-against-one approach was used.
III. OPTIMIZATIONS FOR REAL-TIME PERFORMANCE
Before making any change to the algorithm, the architectural target needs to be identified. Application specific integrated circuits (ASIC) and field programmable gate arrays (FPGA) represent an efficient solution but require a complex design and exhibit limited flexibility. In fact, ASICs cannot be modified after the production, whereas even though FPGAs can be reconfigured, this requires more effort than a simple program revision. Conversely, microprogrammed implementations are more flexible but less efficient, due to the general-purpose processor architecture. In this case, DSPs represent a valuable target, being embedded processors expressly designed for advanced signal processing. As any embedded processors, DSPs presents limited resources and programming issue to be considered. Nevertheless, when appropriately programmed, they could easily outperform mainstream desktop processors on signal processing tasks [15] or achieve comparable results with dramatically better power consumption performance [16] .
Once identified the architectural target, as a broad family of devices such as DSPs, the algorithm needs to be adapted for the on-line execution, and the code ported on the selected specific device. Despite porting is a rather specific task depending on the chosen processor, also algorithms modifications to enable on-line effective and efficient processing depend on the chosen device.
1) Block-on-Line Processing: In order to cope with the conflicting requirements of processing and classification, a blockon-line approach has been chosen for the on-line implementation. The input is a stream at kHz (signal samples) and the output is a class assigned to a classifier sample (hereafter called pattern) at Hz, which is similar to other works in the literature [17] . Furthermore, such a value, which can be modified, entails a fair delay and a reasonable refresh rate for the prosthesis commands. An additional trigger signal is processed to mark the underlying neural activity for training purposes.
To achieve a good spikeness for the pattern, intended as the number of spikes per window, the observation window for features computing should be large enough to contain several spikes and should "slide" on the input signal with a frame rate adequate to follow local variations in the signal in real time. Given , the frame rate defines the time instants in which the patterns will be extracted to be processed by the classifier, i.e., with . It is possible to choose , so that is the number of new signal samples, forming a , needed to push on the window (the window slides with an overlap of , e.g., 75% for ). Since only the classifier requires to operate on the whole window, a "virtual" sliding window is implemented in the code for the first three stages (WD, SD, SS), the last one being in charge of preserving a memory of the spike sorting results for the latest blocks (updated block-wise) to compute the features for a single pattern. This choice reduces the memory requirements and the computational redundancies. Noticeably, is a user-defined parameter.
To provide an overview of the proposed algorithm in its on-line version, Fig. 1 describes with a pseudocode the different processing parts executed every time a new block of input samples is available. The different conditions responsible for the transitions between the states of the algorithm can be automatically set by the code at run-time (e.g., whenever an established number of spikes has been processed, after a specific amount of time, in response to an external trigger).
The sequence of the macro-states of the algorithm changes over time in order to achieve the automatic tuning of the various stages before the normal operation: 1) WD with threshold tuning in loop with SD; 2) WD and SD with established thresholds, followed by SS in TPh; 3) templates reduction and definition of the final set of templates; 4) WD and SD with established thresholds, followed by SS in SPh; 5) CL creation and training; 6) WD and SD with established thresholds, followed by SS in SPh and then by CL (pattern recognition).
A. Wavelet Denoising Optimization
For orthogonal wavelets, exploiting a dyadic discretization, WD can be implemented using a trellis of quadrature mirror FIR filters, a low-pass ( )/high-pass ( ) pair for analysis and their mirrored versions for synthesis. The analysis produces the approximation signal ( , low pass), passed to the next level, and the detail one ( , high pass), retained for thresholding. Since every low-pass filter in the analysis trellis halves the Nyquist frequency of the approximation signal, it is usually downsampled for both computational and memory efficiency (the same filters can used in all the stages). Such a scheme is not translation invariant so in [11] it is combined with cycle spinning to overcome this limit. This approach is not suited for real-time processing because of the computational cost and the intrinsic off-line nature of the procedure. An à trous approach, which oversamples at each scale the analysis filters rather than downsampling the associated streams, has been rather adopted.
Furthermore, the Symlet 7 mother wavelet, chosen to be similar to a typical action potential [14] to exploit matched filter properties in the processing, presents a large time support. This has reflections on both computational complexity and delay, compared to shorter wavelets such as the Haar one which will be alternatively considered.
Moreover, similarly to other works like [13] , the WD-filtered signal bandwidth could be narrowed to 375 Hz-3 kHz by adding one level of WD and cleaning the first detail. Remarkably, the upper cut-off frequency is still larger than 2 kHz, which is known to contain the most of the physiological information [14] . With four levels, a latency of 196 samples can be experienced with the Symlet 7 whereas only 16 samples are required by the Haar. This affects the memory footprint too, since between analysis and synthesis, the detail signals need to be delayed to obtain the correct alignment in time. This is achieved writing with the correct offset in the recomposition filters buffer, taking into account the time lag as , where (1) being the length of the filter and % the reminder of the integer division.
The minimax method for threshold computing presents a dependency from the length of the signal, which is misleading in on-line processing where the window length is extremely short. A fixed precomputed scaling factor equal to 3.9 has been used, corresponding to the value computed in [11] . Moreover, the sample standard deviation can be more efficiently used in place of the median absolute deviation since, in an on-line integrated approach, can be updated only when no spikes are detected in the signal. In order to evaluate over a frame of samples larger than , i.e., , optimizing the latency, it is possible to compute for every incoming block of new samples (at each scale ) their squared sum (all the detail coefficients are zero-mean)
, so that for every new block of samples the last partial squared sums can be summed up, multiplied by and then the squared root computed (2)
B. Spike Detection Optimization
Rather than a simple amplitude threshold, another one based on a nonlinear energy operator (NEO) [18] can be adopted to improve the performance, obtaining a pre-emphasis of the spikes
The threshold is then obtained as a scaled version of the mean value of the NEO in a predefined time interval (4) where is empirically chosen. Tests performed on synthetic datasets [19] showed that the NEO is generally less sensitive to the choice of the amplitude threshold of the SD stage, and SD is more accurate and with limited computational complexity.
Only the detected spikes whose support is shorter than a parameter (expressing a time value in terms of number of samples) are analysed. During the TPh, larger spikes are discarded whereas during SPh a smaller chunk of samples is possibly extracted looking at the threshold crossing only. This allows recognizing spikes embedded into bursts but does not address the problem of the overlapping spikes.
C. Spike Sorting Optimization
The SS is a correlation-based algorithm (Fig. 2) , which must be able to work in real time both when the templates are being created (TPh) and when they are stable (SPh). When the WD thresholds are being computed, SS is skipped. During the TPh, the algorithm can store up to waveforms in a templates matrix. Such a structure is stored in a linear array, words long, row major, for faster access on a DSP. An infinite template matrix is emulated during TPh by overwriting the template with the lowest occurrence when a new one needs to be stored and the matrix is full. To speed up the computation of the cross-correlations with the Pearson's coefficient, the templates are standardized to have zero mean and unitary variance.
During SPh, the SS acts as a feature extractor for the classifier, which operates downstream. A periodic template update can be allowed. Despite a new pattern is generated every 0.25 s, it takes into account the ENG activity over samples. A matrix has been used to store row by row the occurrences of each template in each of the blocks composing an -wide sliding window. Such a matrix is updated column-wise at every new block and a pattern for the classifier is a feature vector obtained summing up by columns the occurrences of each template and dividing all the elements by the sum of the spikes occurrence in that window (the spikeness index). Such an approach reduces the processing time at the expenses of a slightly larger memory footprint. Before training, the patterns undergo a normalization process.
D. Classification Optimization
Compared to the work presented in [4] , the whole classifier has been recoded, and the training has been also embedded into the DSP firmware. A binary trigger has been exploited for on-line labelling during training. In an interactive training scenario, labelling can be performed exploiting GPIO pins of the processor, by manually selecting the performed (or desired) movement among a given set, or automatically (in case of rigidly structured and synchronized training procedure). To avoid calling the classifier in case of low ENG activity, when the trigger flag is active the pattern is passed to the classifier only if (5) where and represent respectively the sample mean and standard deviation of the spikeness index computed over the whole signal which will be used for training.
The soft-margin version of SVM classifier (C-SVC) with RBF kernel has been preferred to the original -SVM, because of its easier optimization. The optimal values to be assigned to the various parameters to maximize the results in terms of classification accuracy and processing latency have been experimentally selected ( for the RBF, cost for the C-SVC, tolerance of the termination criterion equal to 0.1). The multi-class problem can be better solved by means of a "One-vs-The rest" approach rather than a "One-vs-One", since the number of classifier models to train is reduced to the number of classes.
Algorithm Versions
Despite some of the modifications do not introduce differences in terms of algorithm effectiveness, some others could. To this aim, different versions of the algorithms have been developed and tested, changing some aspects related to the first three stages (regardless the chosen mother wavelet for WD).
• V1, similar to [4] and then to the original algorithm (WD with three levels, removing , bandwidth 750 Hz-6 kHz, SS on the output of the WD stage) but the NEO has been introduced in the SD; • V2, a modified version of V1 (WD with three levels, removing , bandwidth 750 Hz-6 kHz, NEO for SD) but the SS is performed on the output of a 300 Hz-3 kHz band-pass filter as in [13] , using WD only for the SD; • V3, a modified version of V2 (NEO for SD, WD only for the SD, SS on the band-pass filtered signal and not on WD output) but exploiting a WD with four levels, removing and also leading to a bandwidth of 375 Hz-3 kHz which is close to [13] ; • V4, similar to V1 (NEO for SD, SS on the output of the WD stage) but, as in V3, the WD has four levels, removing and also leading to a bandwidth of 375 Hz-3 kHz.
ALGORITHM PORTING ON A DSP PLATFORM
Despite the presented optimizations could be applicable to any architectural target, they have been studied to achieve the highest performance on DSP platforms. For instance, the choice of block-on-line processing rather than sample-by-sample processing, is beneficial for DSPs because of the presence of specialized hardware/software modules tacking advantage from block processing (direct memory access (DMA), address generators, specialized software libraries, etc.).
As target for this study, we chose a TMS320C6748 floatingpoint DSP, hosted on the OMAP-L138 low-power applications processor by Texas Instruments, also including an ARM core and several shared resources. The ARM core can be exploited to perform some processing tasks or as host processor, easily managing the communication with external acquisition units, I/O peripherals such as a small detachable LCD display or a storage unit. The DSP can run up to 465 MHz and presents a Very Long Instruction Word (VLIW) architectural model enabling the execution of up to eight arithmetic operations (mixed fixed-and floating-point) in parallel. The efficiency of the implementation on this kind of processor strongly depends on the VLIW code density: the best results can be obtained only by means of optimized libraries and an adequate coding technique, also respectful of the processor memory organization.
The chosen DSP presents a memory hierarchy organized in four levels. Level 1 (L1D and L1P, 32 kB each) includes two memories configurable either as RAM or cache (data and instructions). Level 2 (L2, 256 kB) is a RAM that can be configured partly or totally as second level cache. L1 and L2 are internal to the DSP core. Level 3 (L3, 128 kB) is an on-chip shared memory, external to the DSP core, on the same clock domain of L1 and L2 and theoretically with the same bandwidth. Nevertheless L3 pays additional latency due to competing traffic, prioritization, and use of shared resources. Level 4 (L4) is the off-chip (external) memory. Potentially bigger, it introduces a penalty of several clock cycles compared to the others for the access. In our implementation, all the code sections working on the data at 12 kHz, and the related variables, have been placed in L2 (configured as all-RAM) along with the on-line CL code at 4 Hz, whereas both the code and the variables used for the CL training have been placed in L4.
The application code has been written in C. All the parameters requiring an on-line tuning are automatically set up by the algorithm at run-time without any interaction with the user. The minimal set of functions of the C++ LIBSVM library [20] , required by the implemented classifier, have been translated in C with limited memory footprint. All the advanced coding practices for this kind of platform, along with a careful memory allocation, have been exploited, and the code has been compiled with the highest optimization (-o3). Some optimizations, which can be enabled or not, limit the portability of the C code to the processors of the same family. For instance, some data movements are managed by the Enhanced Direct Memory Access (EDMA3) peripheral in order to execute them in parallel with the CPU processing. Advanced single-precision DSP library functions can be enabled at compile time to improve the VLIW code density, thus leading to a better parallelism exploitation, which in turn means that more instructions can be performed in parallel on the different issue slots of the processor. Some of these functions, e.g., the DSPF_sp_fir_gen(), used in the WD stage for the FIR filtering, take advantage of the chosen block-on-line approach, allowing to reach very high performance compared to a sample-by-sample approach. Another library, MATHLIB, has been used to speed up the processing of some scalar math operations.
IV. DATASETS EXPLOITED FOR TESTING
As stated before, different versions of the on-line algorithm have been proposed, along with a different mother wavelet se-lection (Haar rather than Symlet 7. In order to evaluate their effectiveness relying on a ground truth, limitedly to the first three stages of the algorithm, we exploited a synthetic dataset [13] . It includes artificial mixtures of real spikes (from three neurons), extracted from neocortex and basal ganglia, and additive noise representing the contribution from other far neurons. It can be also used also to test SS algorithms for PNS signals decoding, when the action potentials can still be identified because of a good SNR.
The synthetic dataset [13] includes different signals (Easy1, Easy2, Difficult1, Difficult2) with increasing levels of complexity. The complexity is related to the similarity in the morphology of the action potentials of the three neurons, revealed by the Pearson's correlation coefficient (Easy1 containing spikes correlated up to 0.8, the other datasets having higher correlation). Since the proposed algorithm is based on template matching, the results achieved on the Easy1 dataset shall be considered with the greatest attention. The background noise level is defined in terms of its standard deviation and it ranges from 0.05 to 0.40. The average firing rate is 20 Hz, and the number of spikes per neuron is about 3000 in each one-minute signal.
In order to test the CL stage too, we exploited real afferent signals from the PNS, kindly provided by Prof. Xavier Navarro and his team (Universitat Autònoma de Barcelona), acquired through Longitudinal Intrafascicular Electrodes (LIFE) in the sciatic nerve of rats [21] and recorded according to the protocol used in [22] . Five classes of sensory events can be identified on segments of the available signal, i.e., touch sensation elicited over four different areas of the rat limb (A to D) stimulated with Von Frey filaments, and flexion movement performed with animal hind limb.
V. RESULTS
The evaluation of the proposed algorithm and its implementation is presented in this section in terms of effectiveness (quality of the different versions) and efficiency (in the light of a realtime implementation). Effectiveness analysis will lead to the selection of the best version to be analysed in terms of efficiency, measured in terms of processing capabilities and power consumption.
A. Effectiveness Analysis
PNS neural signals are usually affected by a low SNR [7] . From this perspective, WD represents an important pre-processing stage whose influence on the spike detection can be demonstrated by using the synthetic dataset. V1-V4 version, with both Symlet 7 and Haar wavelets, are compared with the SPC only in terms of spike detection figures of merit, i.e., true positives (TP) rate (the percentage of true detected spikes over those present in the signal, Fig. 3 ) and false positives (FP) per minute (Fig. 4) .
In both cases, the proposed SD approach, combining WD and NEO, performs better than the one of the SPC for the largest part of the signals, with isolated exceptions on the Easy1 database. The Haar wavelet reveals a higher TP rate than the Symlet 7 but also a higher number of FP/min. SS performance has been evaluated in terms of percentage of matching. From Table I we can see that the results seem to be strongly influenced by the noise level and are largely worse than those achievable off-line with the SPC for all the dataset but the Easy1 one, as expected. The best performance can be achieved with the V4 version of the algorithm and the Haar wavelet. Results are definitely worse for the versions of the on-line algorithm performing the SS on a band-pass filtered version of the signal rather than on the wavelet denoised one. This seems to be correlated to the limited effect of band-pass linear filtering on such signals when performing spike sorting by correlation methods. Again the Haar wavelet seems to be more effective than the Symlet 7. Remarkably, the SPC works with features obtained from wavelet analysis, so that the effect of band-pass filtering is superimposed to that of wavelet sub-band analysis.
Since all the V1-V4 versions of the on-line algorithm present the NEO as SD stage, it is worth to evaluate how much this stage influences both detection and sorting, compared to a solution based on a fixed threshold. Limiting our analysis to V4 with the Haar wavelet, the behaviour of the two solutions is similar, Fig. 5 . SVM classifier accuracy (mean and standard deviation over 3000 random training/test pairs on the same dataset) with and varying , comparing the algorithm implemented in [4] with the current V4 version using the Haar wavelet.
the NEO allowing a considerable reduction of the FP/min, up to 75%. All the more so, in terms of matching, as can be seen comparing SPC, V4-H and V4-H abs rows in Table I , the results exploiting the NEO in the spike detection are better than those achievable with a fixed threshold.
It is then worth to see whether such results reflect what happens on the real PNS signals. Tests have been performed evaluating the classification accuracy while varying the time frame for a pattern ( ), comparing the algorithm implemented in [4] with the current V4 version (Fig. 5) . Accuracy has been computed as average over 3000 different training/test partitions of the same dataset, randomly divided every time using 80% of samples for the training set and the rest for the test set. V4 is more robust to the variation of , allowing to obtain a more efficient hardware implementation for the same classifier performance. Moreover, its accuracy is always higher, particularly for the most interesting low values of , from a minimum of about 74% to a maximum of 96% with a standard deviation that decreases as increases.
B. Efficiency Analysis
The previous analysis reveals how, compared to the original algorithm and to its first tentative porting, the proposed modifications lead to improved effectiveness performance. In particular, the best version is V4, using the Haar wavelet. This solution has been then profiled in order to derive a latency model. This is a mathematical model, descending from cycle-accurate profiling, able to provide (with some approximations) the expected latency (CPU clock cycles) for a given code under different test conditions. Since the code optimizations, and consequently its performance, can be different if only a part of the code is compiled, in order to create a fair latency model the code was not modified (except for the parameters to tune) for the analysis of the different sections but the input data changed in order to be able to trigger specific behaviours. Compared to other techniques, this approach is very time consuming but allows pursuing accuracy.
The main issue in the creation of the latency model is the high number of branches that the algorithm can take during the different phases of its execution. Referring to Fig. 2 , it is clear that different latencies will be experienced when only the SD is performed or when the system is creating the templates (TPh) or during normal operation (SPh). Even if we created a model for each possible combination, varying the proper parameters, only the two worst-case working conditions will be analysed hereafter. The worst case in the TPh is identified in Fig. 2 with the darkest shading, whereas the branches including the lighter shading are those related to the SPh. White blocks can be considered not part of any worst case scenario. Only the parts of the code that must be executed in real time have been evaluated for the latency model: the CL training and the templates reduction have been profiled apart.
From the code profiling, the WD for the V4 version of the algorithm, with the chosen window length, requires about 320 kcycles with the Haar wavelet. Such a result cannot be compared with that reported in [4] because the current memory requirements are higher and then a larger use of the external memory is required, with the consequent latency penalties. Within the current framework, the implementation of the original 3-level WD stage with the Symlet 7 wavelet would require about 640 kcycles. Taking also into account the demonstrated superiority in terms of performance of the Haar wavelet compared to the Symlet 7 one, the adoption of the former should be preferred. The final model during the TPh is described by the following relation: where is the cycle count, is the duration in samples of the spikes, is the number of elements of the templates matrix, is the template length in samples, is the number of spikes in a -sample window ( in our case). It is worth to note that the dependence from is very small compared to the other parameters. For this reason, considering an average case of , as from the available signals, the final model during the TPh is (7) Using this model, it is possible either to:
• know the latency for a given when the maximum number of templates in this phase is fixed to and every template length is or • invert the model considering the maximum number of available cycles (Real-Time Bound , which is 116.25 Mcycles when clocking the DSP at 465 MHz) so that, fixing the structural parameters, it is possible to know how many spikes can be analysed in real time. The second choice leads to the following model:
Some of the implemented optimizations forbid the adoption of this model at a very fine granularity because, for instance, the parameter must be a multiple of four. The set of curves in Fig. 6 (top chart) allows evaluating the maximum number of processable spikes per second, during TPh, as a function of , for different values of . Following a similar reasoning during the SPh, the final latency model is the following, where is the number of templates after their fusion (i.e., in the SPh) and is the number of support vectors (9) The number of spikes per second in this case is depicted in Fig. 6 (bottom chart) , where has been fixed to a typical value of 700. With 40 templates in TPh and 10 in the SPh, and exploiting the same of the SPC algorithm for the synthetic database (32 samples at 12 kHz), the on-line algorithm is able to process more than 890 spikes/s in TPh and more than 3500 in SPh, which represents an important result in terms of possible exploitation even in case of extension to multichannel recordings. In this case, the computational power can be distributed across the different channels with acceptable performance levels.
In order to be able to appreciate the distribution of the computational load across the different stages of the algorithm, a point in the parameters domain has been taken and the amount of clock cycles required to the processor to carry out the different stages of the algorithm is presented in Table II .
In order to investigate the accuracy of the latency model in the steady state, actual profiling data was compared to the results obtained with the model for a different number of incoming spikes. The same parameters used for the algorithm evaluation on the real dataset ( , ) was employed. The model prediction error was of 5% on average (standard deviation 0.58%). Remarkably, the model always overestimates the actual cycle count.
In terms of memory footprint, the real-time code requires only 39 kB of memory and the data in L2 expressly instantiated in the code require 177 kB of memory. Taking into account also the other code sections (used by the DSP/BIOS operating system of the DSP and its variables), the code on the internal RAM reaches 252 kB of the available 256 kB. The L4 memory usage reaches 580 kB, including 259 kB for the external data section of the CL and 26 kB for the related code.
C. Power Consumption Analysis
An accurate estimate of the power consumption of the more demanding component in the system, the OMAP processor, can be achieved by using the spreadsheet available in [23] . Accuracy depends on the realism of the operating parameters used to fill in the spreadsheet. Compared to performing actual measurements on an EVM, this methods allows overriding the power consumption of the various hardware on-board components sinking current from the power source although they are not used. For this application, we chose a configuration where both cores (DSP and ARM) are enabled and the mDDR SDRAM, the EDMA3, PLL0 module, the SPI0 port and the USB1.1 port (useful as interfaces towards external devices, e.g., an acquisition unit) are also active. Percentage of utilization of the SPI0 port has been obtained by comparing the maximum bit rate for the port (50 Mb/s) with the expected one (12 kHz, eight channels, 16 bit per sample yield about 1 Mb/s). The utilization of the mDDR has been derived directly from the cycle accurate profiling of the application by counting the number of cycles spent to perform data movement from/to the L4 memory. A junction temperature of 50 has been used. The ARM core has been supposed to be under a typical workload. Different device frequencies and DSP loads have been tested in order to obtain the power consumption versus the number of processed spikes, as shown in Fig. 7 .
From this analysis it is possible to estimate which power source could be used in practice. For example, with a standard Li-Ion single cell battery (4.2 V, 3000 mAh capacity), supposing the system must process 400 spikes/s (which is compatible with the duration of a spike), the projected duration of the battery would be 34 h when clocking the DSP at 456 MHz or 45 h at 300 MHz. These numbers are compatible with a real scenario.
VI. CONCLUSION
This paper presents a complete study of the optimization of a state-of-the-art algorithm for PNS signals decoding, already exploited off-line on animals and humans, in order to obtain the best performance on a limited-resources embedded platform such as an off-the-shelf DSP. Compared to custom VLSI or FPGA implementations, the adoption of these highly efficient micro-programmed architectures leads to a greater flexibility. For the time being, this is particularly useful for closed-loop experiments when the signal processing algorithms need to be quickly adapted to previous experimental evidences. The proposed work also identifies improvements to the original algorithm able to guarantee the real-time performance with higher quality in the results compared to the original version.
Several tests have been performed both on synthetic neural datasets and on real afferent signals recorded in vivo from rodents. The optimal version of the algorithm allows achieving an accuracy up to 96% in classification. Spikes sorting performance on a synthetic dataset is on average of 93% correct classification, which is comparable to the results obtained with a top spike sorter (94%) on the same dataset. Due to the template matching nature of the implemented algorithm, such results are limited to the Easy1 part of the dataset, taking into account that the others presents spikes with a correlation higher than 0.8.
The derived latency model allows the identification of the working point under different parameters setting. In a singlechannel implementation, the algorithm is able to process 890 spikes per second when the unsupervised templates creation procedure is running (working on 40 templates), and up to 3500 after training (with 10 templates), in both cases with 2.7 ms templates. Such numbers prompts the possible extension to a multichannel scenario involving closed-loop real-time experiments including the complex phase of classifier training, now included in the same embedded framework so that also the training phase could be carried out without any external tool. Power consumption reveals performance compatible with a usage period longer than 24 h, with a fanless device requiring about 600 mW in the worst-case condition.
