Abstract-Principal component analysis (PCA) spike sorting hardware in an integrated neural recording system is highly desired for wireless neuroprosthetic devices. However, a large memory is required to store thousands of spike events during the PCA training procedure, which impedes the on-chip implementation for the PCA training engine. In this paper, a mean pre-estimation method is proposed to save 99.01% memory requirement by breaking the algorithm dependency. According to the simulation result, 100 dB signal-to-error power ratio can be preserved for the resulting principal components. According to the implementation result, 6.07 mm 2 silicon area is required after a 283.16 mm 2 area saving for the proposed PCA training hardware.
I. INTRODUCTION
On-chip implementation of neural signal processing along with the recording circuitry can significantly reduce the data bandwidth and is a key to enable the wireless neural recording system with a large amount of electrodes [1] . Without such data processing, large amount of data need to be transferred to a host computer, and typically a cable is required. In this case, patients and test subjects are restrained from free movement, which disturbs the measurement and observations and impedes the advance in fundamental neuroscience research and wireless neural prosthetic devices for patients.
A promising approach to achieve the bandwidth reduction is to extract spike features immediately after spike detection on the implant site [1] [2] [3] [4] [5] . Only the event times and some additional features about classification are transmitted after the signal processing. This approach achieves more than 100-fold data reduction while preserving the neuron signature for discrimination of individual neuron signal sources. The principal component analysis (PCA) [1] [2] [3] and wavelet transformation [4, 5] are the most widely used tools for this approach. In this paper, we propose to achieve a dedicated co-processor for PCA-based spike sorting algorithm.
A complete PCA-based spike sorting hardware has two major parts-the parameter training and the on-the-fly feature extraction. After spike detection, the training collects the detected spike events and extracts the major characteristic vectors, the principal components (PCs). In general, these vectors can optimally differentiate neurons in least square terms. The operations of the covariance matrix and the leading eigenvectors are two major mathematic calculations in this part. In the on-the-fly feature extraction, the feature scores are extracted by projecting the detected spike events on the PCs. The operation of inner product is required here. Note that periodic re-training is frequently required because of the movement of the electrodes and the change of the environment [6] .
In realizing a complete on-chip PCA-based spike sorting coprocessor, the most challenging problem is to design the PCA training engine to calculate PCs from the detected spike events. There are two main difficulties. First, eigenvalue decomposition (EVD) of the covariance matrix is typically required to generate the eigenvector. The EVD procedure requires matrix diagonalization that is very complicated for hardware implementation. Second, thousands of detected spike events are recommended to used for PCA training. It is necessary to store all these spike events in a large on-chip memory because of the algorithm dependency. The large on-chip memory requirement impedes the low-area implementation. In [7] , the first eigenvector generation hardware is proposed based on the iterative eigenvector distilling algorithm [8] . In this paper, we will focus on the second problem to complete a selfcontained PCA spike sorting engine. The rest of this paper is organized as follows. In Section II, a big map of the neural signal processing platform is proposed along with the PCA spike sorting co-processor. Section III describes the memory problem and the proposed solution for on-chip PCA training engine. The corresponding architecture design is described in Section IV. Finally, Section V presents the implementation and simulation results, and Section VI concludes this work.
II. NEURAL SIGNAL PROCESSING PLATFORM AND
PCA SPIKE SORTING CO-PROCESSOR Figure 1 shows the proposed neural signal processing platform supporting the PCA spike sorting algorithm. The platform architecture is a versatile hardware structure that can provide the interoperability for different functional modules, the scalability for different application requirements, and the adaptability for future extensions without substantial modifications. Further, a bio-signal processing unit (BPU) is embedded to provide the flexibility in algorithm development while a dedicated co-processor for PCA spike sorting is designed to efficiently accelerate the computationally intensive algorithm. Under the real-time, low-power, and small-area constraints, this platform structure along with the general-purpose BPU and the dedicated spike sorting coprocessor can provide both flexibility and efficiency for signal processing in neural prosthetic devices. For the PCA spike sorting co-processor, the neural data are amplified, filtered, and digitized by the analog frontend interface circuitry (AFIC) and then fed to the co-processor. A band-pass digital filter is first performed to reject the low-frequency field potential and high-frequency thermal noise. Then, the nonlinear energy operator (NEO) spike detector [9, 10] calculates the energy function for the neural waveform. Once the result reaches the threshold at the peak of the convex curve, a find-spike-event signal is sent. If a spike event is detected, a section of neural waveform is sent to both the PCA training engine and feature extraction engine. The training engine calculates the covariance matrix and eigenvectors of the detected spike events. The results of leading eigenvectors, or the PCs, are stored in the PC memory. After training, the feature extraction engine performs inner product between the trained PCs and the subsequent detected spike events to produce the feature scores.
As for signal processing for multiple channels, the traditional parallel hardware that duplicates the processing unit multiple times for multiple channels results in a large area overhead. In order to save the hardware cost, we start with the optimized AFIC [11] that shares an ADC and part of amplifier by multiple channels in a multiplexed fashion. Then, a memory hierarchy of systolic cache buffer is proposed in [12] to efficiently share one processing unit for multiple channels and do sequential scheduling on cyclic basis without a bulky memory and processing latency. Note that this cycle-basis scheduling among different channels is automatically controlled by a local finite state machine (FSM) for the filtering, spike detection, and feature extraction hardware units. For the PCA training engine, the training schedule and period among channels are controlled by BPU and can be programmed according to the application requirements.
III. PROBLEM STATEMENT AND PROPOSED METHOD

A. Memory Problem for PCA Training Hardware
During the parameter training, thousands of detected spike events are recommended to be used in PCA training. It is necessary to store all these spike events in a large on-chip memory because of the algorithm dependency. The large memory requirement makes the small-area implementation impossible for the PCA training engine.
In order to have a clear idea between the hardware requirement and algorithm dependency, Fig. 2 ing mathematic equations are listed as follows:
The PCA algorithm has three main operations. It calculates the mean, μ x , and covariance matrix, Cov x , of the detected spike events, X 1 ∼X n , and then calculates the leading eigenvectors, PC 1−3 , of the covariance matrix. The spike events, X 1 ∼X n , are the input for both mean and covariance matrix operations. Next the covariance matrix operation must be performed right after the mean operation since the output of mean operation, μ x , is the input of covariance matrix operation. Therefore, when a spike event is detected and input from the spike detector to calculate the mean vector, this spike event is required to be stored on chip for the subsequent covariance matrix operation. Suppose that one spike sample has 9-bit precision, 32 samples are used to represent one spike event, and up to 4096 spike events can be used for the PCA training procedure, a memory size of more than 1 M bits is required.
B. Proposed Mean Pre-estimation Method
In order to save the large memory for training spike events, we need to break the algorithm dependency between mean and covariance matrix operations without affecting the accuracy of PCA results. A mean pre-estimation method is proposed as shown in Fig. 3 . The corresponding mathematic equations are listed as follows:
In the proposed algorithm, after the mean operation, we use the subsequent thousands of spike events, X n+1 ∼X 2n , for the covariance matrix operations. That is to say, the mean vector here is estimated from the previous thousands of spike events, X 1 ∼X n . In this way, during the operations of both mean and covariance matrix, we can on-the-fly process the input spike event, accumulate the results, and then immediately throw away this input spike event. No large memory is required. The proposed mean pre-estimation method should not affect the accuracy of the final PCA results. The characteristic shapes of the recorded spike events are determined by the morphology of neurons' dendritic trees and the distance and orientation relatives between neurons' and the recording electrode. These physical factors do not change within a period of time. Therefore, the mean vector of the previous n spike events should be very similar to the mean vector of the subsequent n spike events if n is large enough. 
IV. ARCHITECTURE DESIGN
The block diagram of the PCA training engine is shown in Fig. 4 . Before the training procedure, the BPU will program the essential training parameters such as the selected channel and the number of training spike events into the coefficient register file. Afterwards, the BPU will sent a start signal to PCA FSM to trigger the PCA training procedure. During the training procedure, the reconfigurable processing core calculates the mean vector, the covariance matrix, and the corresponding eigenvectors in turn. All intermediary and final PC results are stored in the memory units.
The algorithms to calculate the mean, the covariance matrix, and the eigenvector (with the modified algorithm proposed in [7] ) have very similar mathematic operations. Therefore, a reconfigurable PCA processing core is designed to support all the essential operations with the same processing elements along with a reconfigurable router. In the rest of this section, we will show how this reconfigurable engine can support the mean and covariance matrix operations. As for the hardware configuration for the eigenvector operation, please refer to [7] .
A. Mean Configuration
The processing flow to calculate the mean vector is shown in Fig. 5 . When the start-mean signal is sent from PCA FSM, the hardware enters the listening mode waiting for new spike events. When a new spike event is detected, the hardware enters the accumulation mode. The detected spike event is accumulated in the register files as the intermediary result of the mean vector. The input spike event is threw away right after the accumulation. After the 2 m spike events are input and processed, the hardware enters the division mode. The right shift operation is performed m times for each sample of the mean vector in the register files. Finally, the result of mean vector is stored in the register files for the subsequent covariance matrix operation. Note that "m" can be programmed as 2 to 14 by the BPU during the hardware initialization.
B. Covariance Matrix Configuration
The processing schedule to calculate the covariance matrix is shown in Fig. 6 . Before the processing, the register file should have a mean vector, μ x , averaging from the previous 2 m spike events. When the start-covariance-matrix signal is sent by PCA FSM, the hardware enters the listening mode waiting for new spike events. When a new spike event is detected, the detected spike event is first subtracted by the mean vector. The resultant column vector, Y i , is stored back to the register file. The input spike event can be threw away after the subtraction. Then, the column vector, Y i , is multiplied by its transposed vector, Y T i . The resultant square matrix is accumulated in the covariance matrix memory. Y i can be erased from the register file after the accumulation. After the 2 m spike events are input and processed, the hardware enters the division mode. The right shift operation is performed m times for each entry of the covariance matrix. Finally, the resultant covariance matrix is stored in the covariance matrix memory for the subsequent eigenvector generation.
V. SIMULATION AND IMPLEMENTATION RESULTS
In the beginning of this section, it is necessary to summarize specifications of the proposed hardware. In our PCA training engine, the precision of the input spike sample and the output PC sample is nine bits. 32 samples are used to represent one spike event and PC. Up to 16384, 2 14 , spike events can be used for one PCA training procedure. The hardware has the memory capability to store 48 trained PCs for 16 channels.
A. SNR Simulation of Mean Pre-Estimation Method
The proposed mean pre-estimation algorithm is verified with the neural data download from [13] . The simulation is done with the hardware C model. There are more than 3096 spike events, named X 1 to X 3096 , in each neural sequence. We set "m" to 10 and perform PCA training algorithm for the spike events of X 2049 ∼X 3096 . The resultant mean and the first three PCs are used as the benchmark. Two experimental cases are compared. In the first case, we use the spike events of X 1025 ∼X 2048 for mean estimation. In the second case, the spike events of X 1 ∼X 1024 are used for mean estimation. Note that the NEO spike detection algorithm [9] is used in this experiment. After spike detection, the detected spike events are aligned horizontally and vertically according to their peaks. 16 samples are used before and after the spike peak to represent each spike event. Table I summarizes the signal-to-noise ratio (SNR), or signal-to-error power ratio, of the mean vector and final PCs between the benchmark and two experimental cases. It clearly shows that the proposed mean preestimation method preserves more than 100 dB SNR and does not affect the accuracy of the PCA training results.
B. Memory Requirement Comparison
The memory requirement for the original PCA algorithm and the proposed method are summarized in Table II . For the original algorithm, to store all training spike events requires 2 14 × 32 × 9 bits memory. To store all PC results for 16 channel requires 3 × 16 × 32 × 9 bits memory. The maximum precisions for intermediary values of mean vector and covariance matrix are 23 and 32 bits respectively during the accumulation. Therefore, the required memory sizes are 32 × 23 and 32 × 32 × 32 bits. With the proposed algorithm, the memory to store the training spike events is saved. The unit area for each memory bit is about 60 μm 2 in .35 μm 2P4M CMOS process. As the result, the proposed mean pre-estimation method can save 283.16 mm 2 (99.01%) memory area and enable the low cost implementation for on-chip PCA training engine.
C. Implementation Result
We implement the proposed on-chip PCA training engine in verilog language, verify it with the test patterns generated by hardware C model, and synthesize the hardware in .35 μm 2P4M CMOS process. The chip is operated with 640 kHz operation frequency and 5 V supply voltage. According to the synthesized results, totally 6.07 mm 2 core area is required. Table III summarizes the area profile of this on-chip PCA training engine.
VI. CONCLUSION
In this paper, a mean pre-estimation method is proposed to ease the large memory requirement of the on-chip PCA training engine. 99.01% memory to store training spike events is saved by breaking the algorithm dependency. According to the simulation result, 100 dB signal-to-error power ratio can be preserved for the final PCA results. According to the implementation result, 6.07 mm 2 silicon area is required after a 283.16 mm 2 area saving with the proposed method. 
