ABSTRACT Closed-loop stimulation of many neurological disorders, such as epilepsy, is an emerging technology and regarded as a promising alternative for surgical and drug treatment. In this paper, a real-time seizure detection algorithm based on STFT and support vector machine (SVM) and its field-programmable gate array (FPGA) implementation are proposed. With a two-stage patient-specific channel selection and feature selection mechanism, those redundant and uncorrelated spectral features are removed from the entire feature set. The evaluation results on CHB-MIT epilepsy database show that the mean detection latency of the proposed algorithm is 6 s, the sensitivity is 98.4%, and the false detection rate is 0.356/h. The performance of our proposed algorithm is comparable to other existing seizure detection algorithms. Moreover, we implement the proposed seizure detection algorithm on Xilinx Zynq-7000 XC7Z020 with high level synthesis. Each classification of the input electroencephalography signal can be finished within 313 µs, and the power consumption of the programmable logic is only 380 mW at 100 MHz. In hardware implementation, an optimization strategy for the nested-loop structure within nonlinear SVM is proposed to improve pipeline efficiency. Compared with existing method, the experimental result shows that our method can speed up the nonlinear SVM by 1.70×, 1.53×, 1.37×, and 1.26× with the unroll factor equal to 1-4 at the same DSP utilization rate. The evaluation results affirm the possibility of integrating the proposed algorithm and FPGA implementation into a wearable seizure control device.
I. INTRODUCTION
Automated real time seizure detection and its hardware implementation are an interdisciplinary field requiring effort and domain knowledge from neurologists, electrical engineers and biomedical researchers. Dealing with this problem is of great importance, since there are about one percent people in the world suffering from epileptic seizure and about 20% to 40% of these patients will finally become refractory to antiepileptic drugs [1] . For the patients with medically refractory seizure, surgical resection of the epileptogenic foci could be the next step therapy after further evaluation. However, epilepsy surgery cannot always ensure seizure freedom for some patients since localization of seizure focus is not always accurate and potential damage to normal brain function exists [2] . Alternatively, vagus nerve stimulation (VNS) has shown effectiveness in reducing the seizure frequency in refractory epilepsy patients and could be a replacement to surgery [3] , [4] . Apart from being used as scheduled stimulation, closed loop VNS has also shown efficacy for controlling seizure activity and even interrupting seizure [5] . However, this study [5] also reports that additional caregivers are necessary to trigger the VNS device when seizure occurs. Thus, automated real time seizure detection is vital for building a robust and reliable closed loop VNS system to help epilepsy patients get rid of physical injuries in everyday life.
Electroencephalography (EEG) has been widely adopted by neurologists for epilepsy diagnosis. Substantial differences can be identified from EEG signal during ictal, inter-ictal and post-ictal periods, therefore EEG can act as a biomarker for seizure. Previous studies have been focused on developing advanced signal processing techniques to separate normal and abnormal EEG signal as much as possible. In [6] , discrete wavelet transform (DWT) and chaos analysis on EEG are used jointly to detect chaotic differences within specific frequency bands after multi-level decomposition. Srinivasan et al. [7] create a machine learning system with approximate entropy (ApEn) and artificial neural network (ANN) to classify the EEG signal. Esteller et al. [8] use a simple feature called line length and a hard threshold for seizure onset detection. In [9] , Hilbert-Huang transform (HHT) is used to extract the local amplitude and frequency information for seizure and seizure-free EEG. In summary, the previous seizure detection systems normally consist of two parts, information extraction and discrimination. The commonly extracted information includes spectrum amplitude/energy [10] , [11] , chaotic dynamics [6] , [12] and statistical measures [13] , [14] . The discriminator normally includes machine learning based classifier [15] , [16] and hard threshold [17] . Although plenty of work has been conducted on this topic, few of them consider the algorithm adaptability on hardware and in most work only off-line studies based on Matlab simulation have been performed. These disadvantages have limited the use of seizure detection algorithms in real clinical environment.
Hardware implementation of seizure detection algorithms have become feasible recently due to the technology advancement of semiconductor industry, many researchers consider application specific integrated circuit (ASIC) as target platform to realize the algorithms. Yoo and Altaf et al. [18] , [19] utilize filter banks to extract spectral energy and support vector machine (SVM) as the online classifier. In [20] , a processor with embedded machine learning classifier is developed to support different kernel functions for various biomedical applications. Although ASIC implementation of the seizure detection algorithm can achieve high energy efficiency and small chip area, its disadvantage is the inadaptability to handle patient-to-patient variations due to lack of reconfigurability. An alternative way to ASIC implementation is fieldprogrammable gate array (FPGA), which could be configured with different algorithms and parameters flexibly. Compared with ASIC, FPGA also has the advantage of high productivity, especially with the help of high level synthesis (HLS) tools which allow hardware engineers to directly transform software languages like C/C++ to hardware description language (HDL) and simplify the verification process [21] . Thus, the time-to-market of the product can be greatly reduced. Above all, FPGA is a good candidate for realizing seizure detection algorithms.
In this work, we divide the real time seizure detection problem into two parts, 1) designing hardware-friendly and robust seizure detection algorithm with high responsiveness and low false detections, 2) mapping the algorithm on FPGA devices with HLS. For seizure detection algorithm, we employ fast Fourier transform (FFT) to extract spectral energy and nonlinear SVM to do the binary classification. SVM is adopted for its excellent classification performance and moderate computation load. To further improve detection performance, patient specific channel selection and feature selection are integrated in our system. As for the FPGA implementation, we propose optimization strategies for both the FFT module and SVM module. The main contributions of this work can be summarized as follows.
Firstly, we propose an efficient optimization strategy for the RBF-SVM decision function. Through changing the execution order of the nested loops, pipeline with an initiation interval of one clock cycle can be achieved. The total latency can be greatly reduced under different unroll factors compared with existing method.
Secondly, a complete design flow for HLS based FFT is proposed. With this flow, trade-off between area and latency in FFT design can be easily achieved to meet various throughput requirements in different applications.
Thirdly, a patient specific seizure detection system with an average detection latency of 6 seconds is realized. Algorithm and hardware co-design makes our system feasible to be used as wearable monitoring devices for epilepsy patients in daily life.
The rest of the work is organized as follows. Section II describes the details of the seizure detection algorithm and the test results on CHB-MIT scalp EEG database. Section III describes the complete design flow of using HLS to build the FFT module. Section IV discusses the implementation of RBF-SVM with multiple optimization techniques. Section V integrates FFT module and RBF-SVM module in a complete system and on-board test is performed. Section VI is the conclusion and future improvement.
II. ALGORITHM DESIGN
The system diagram of the proposed real time seizure detection algorithm is shown in Fig. 1 . There are two processes involved in the system, the online process and off-line process, which are targeted for different purposes. For the online process, after the multi-channel EEG signal collected from analog front-end (AFE) and digitized by analog-to-digital converter (ADC), they will be first segmented into half overlapped epochs. These epochs will then move to the FFT module to be transformed to frequency domain and the extracted spectral energy of different bands will form a feature vector. The feature vector will be classified by SVM module to make a decision about the patient's state. For the off-line process, the mutual information (MI) based channel selection module ranks the channels in a greedy manner according to the contribution to joint MI between the total features among selected channels and the class label. The feature selection module determines the optimum feature subset among the selected channels to further eliminate the uncorrelated features. The off-line process help reduce the redundancies in the online process and make the final model more compact.
The length of the half-overlapped window in Fig. 1 is  4 seconds, corresponding to 1024 points in time domain. FIGURE 1. System diagram of the proposed seizure detection algorithm. The online process performs short-time Fourier transform on multi-channel EEG signals and extracts spectral energy as feature vector, the SVM performs classification for each input feature vector and decides the patient's state. The off-line process reduces the number of input EEG channels by channel selection, the spectral features are also screened to reduce feature vector size by feature selection.
Considering the sampling frequency of 256 Hz in CHB-MIT database, after the 1024-point FFT the resolution of the frequency domain signal is 0.25 Hz. The spectral feature extraction of the multi-channel EEG signal can be expressed as equation (1), assuming the number of channels is N . [23] , we have adopted a wider frequency range from 0.25 Hz to 48 Hz, which also includes the low gamma band (<50 Hz). The wider range could possibly provide more useful information about seizure for some patients according to our observation of spectrum distribution of EEG signal, while the frequency above 50 Hz is discarded because the signal strength is weak and contaminations like muscle artefact can be serious.
For most patients the number of EEG channels is 23 [23] , therefore the feature vector dimension is 276. For such a big feature vector, there must exist plenty of redundant and even noisy features which could not contribute to the seizure detection algorithm. These features will also increase computation burden to the online process. Duun-Henriksen et al. [22] state that the features from extra-focal channels would provide no more useful information than normal EEG in seizure detection. Thus, channel selection is necessary for constructing a robust algorithm which only focuses on the channels mostly correlated with seizure activity. Compared with directly applying feature selection algorithm on the entire feature set, channel selection could largely alleviate the computation burden. With channel selection as a pre-screening, the feature selection process on the selected channels can be much faster and more efficient. The reduced number of channels and compact feature set will also lessen the computation load of online classification process.
Three key performance metrics for evaluating the seizure detection algorithm are sensitivity, seizure onset detection latency and false detection rate. Sensitivity represents the ratio of correctly identified seizure events to all the seizure events marked by neurologists. Seizure onset detection latency indicates the average time needed before the algorithm can acknowledge the ''onset'' of a seizure, low detection latency means high responsiveness of the algorithm. False detection rate is the number of false detections given by the algorithm in a unit time interval when the epilepsy patients are in normal state, an algorithm with high false detection rate may unnecessarily intervene the patients' everyday life and sometimes the patients may ignore the true detections. The difficulty of designing a robust and reliable seizure detection algorithm lies in the trade-off among the three performance measures. Pursuing high sensitivity and low onset detection latency would inevitably introduce more false detections, since some rhythmic activities in normal EEG could resemble seizure patterns [23] . Apart from the above three metrics, we also include by sample sensitivity as a performance metric. By sample sensitivity stands for the portion of correctly identified samples among the complete seizure duration, this metric could assist doctors to evaluate the progress of seizure activity like average seizure duration for daily monitoring.
A. MUTUAL INFORMATION BASED CHANNEL SELECTION
Considering that modern EEG monitoring devices could have as many as 64 channels [24] , analyzing all the channels could be a computation intensive task. Meanwhile, some channels may include redundant information or even noisy data which could not help improve the detection performance. Under these circumstances, channel selection is a necessary process for building a seizure detection model.
In this work, we adopt a mutual information based channel selection method [26] , [28] . Mutual information quantifies the dependence information between two random variables, which can reveal both the linear and nonlinear correlation. In our case, since we do not know the exact relationship between the channels and the patients' normal or ictal state, using mutual information is more suitable than correlationbased channel selection [25] . Assuming the size of feature vector matrix X in one channel is R×L, where R is the number of features in one channel and L is number of samples. X = {X c1 , X c2 }, X c1 stands for the normal samples and X c2 stands for the seizure samples, c1 and c2 indicate the normal and seizure class labels. The mutual information between feature vector matrix X and class label c is defined as follows.
Here X is a continuous random variable and c is discrete class label, thus I (X ; c) is the mutual information between continuous and discrete variables. If an invertible square matrix W transforms X to another random variable Y by matrix multiplication Y = W T X , we know that I (X ; c) = I (Y ; c) [27] - [29] . Considering if the row vectors Y r of Y are all independent, we can expand I (Y ; c) to the summations of marginal mutual information I (Y r ; c). I (Y r ; c) can also be expressed as equation (2), thus we have:
Hence, the problem of finding I (X ; c) is equivalent to finding the linear transform matrix W and the entropy of the independent row vector Y r .
To extract Y r from the original random variable X , independent component analysis (ICA) should be performed. FastICA [30] is a common method for decomposing the original random variable to independent components, it tries to maximize the non-Gaussianity of the extracted components in an iterative manner. We choose FastICA for its accuracy, fast convergence and high computational efficiency. In our case the size of the unmixing matrix W is R × R since Y and X have the same number of components. The pseudocode of FastICA is in algorithm 1.
After finding the linear transform matrix W and random variable Y with independent row vectors, the entropy H (Y r ) and conditional entropy H Y c r | c need to be calculated. Since Y r is a one-dimensional continuous random variable, we adopt the Kozachenko-Leonenko (KL) estimator [27] , [42] to find its entropy, as expressed in equation (5) . ψ (·) is the digamma function and N is the number of samples, in this case N = L. c d is the unit ball volume and 
d is the dimension, ε(i) is the distance between a sample and its kth nearest neighbor. The calculation of H Y c r | c is similar to H (Y r ), which only accounts for the samples in normal or ictal state.
With the help of FastICA and KL estimator, I (X ; c) now can be calculated as in equation (4). After evaluating the mutual information between every channel and class label, the channel with maximum MI will be picked. Ranking of the rest of the channels will be in a greedy manner [28] . If all the channels are in the set , first we define a set which contains all the ranked channels as S , S = X r 1 , X r 2 , . . . , X r n . We also define a set for all the unranked channels U , obviously = S + U . The joint mutual information between the n ranked channels and the class label is I X r 1 , X r 2 , . . . , X r n ; c , and the n + 1th channel is ranked according to (6) . The channel X n+1 in U which has the maximum joint mutual information will be ranked as the n+1th channel in S . This process will continue until every channel in U is ranked. Shih et al. [31] suggest an average of 4.6 channels after channel selection. In [22] , only 2 to 3 channels are extracted. In our application we choose 3 channels for each patient considering the trade-off between hardware complexity and detection performance.
B. RANDOM FORESTS BASED FEATURE SELECTION
After channel selection, the total number of features has been greatly reduced since only three channels are kept for the online process. For the features in the selected channels, we screen them by performing feature selection to further eliminate the uncorrelated features. Random Forests [32] , [33] is an ensemble based machine learning method which uses bootstrap aggregating (bagging) of multiple decision trees to form a strong classifier or regressor. Each decision tree is trained from a subset of the entire training set and only part of the features are involved to grow the nodes of the decision tree. The randomization in bootstrap sampling and nodes splitting help Random Forests overcome the limitation of overfitting in common decision trees. Apart from being used in classification and regression tasks, Random Forests can also provide a useful characteristic to estimate the feature importance. By evaluating the decrease of correct votes in out-of-bag observations after feature p is permuted in the training set, feature importance can be easily calculated.
Assuming the total number of decision trees in the forests is N tree , C p i and C p,permuting i are the correct votes before and after the permutation of feature p, the decrease of correct votes in tree i is:
Average the decrease over all trees and normalize the result by its standard deviation, feature importance of the pth feature can be calculated as:
We have plotted a feature importance graph of 36 features from 3 selected channels in Fig. 2 . From the graph we can observe that the three features with highest importance are all from the low frequency bands, ranging from 0 to 12 Hz. This corresponds to conventional opinion about spectrum distribution of seizure activity [34] . But we also note that the next 4 features are originated from 28 to 44 Hz within low gamma band, which have exceeded the upper bound 30 Hz in many researches [19] , [23] . The result proves that the frequency bands ranging from 30 to 50 Hz may be useful for some epilepsy patients and could provide clinical value in seizure detection applications. Furthermore, we find that the top 3 features are all from channel 2, which may indicate this channel is within the seizure focus, therefore the seizure activity is mostly reflected in the features extracted from this channel.
We have plotted the relationship between the number of selected features and the classification result for patient 1 and patient 22 in Fig. 3(a) and Fig. 3(b) . For patient 1, the best sensitivity of 0.8 is achieved with only 8 features, exhibiting an improvement of 21.3% compared with using all 36 features from the 3 selected channels. Meanwhile, the false detections decrease by 17.8%. For patient 22, the best sensitivity of 0.867 is achieved with 10 features, and the relative improvement over using all features is 13.8%. While the false detections have been reduced by 12.9%. The test results of these two patients demonstrate that the best performance does not always come from more features. With an optimum feature subset selected, Random Forests can both greatly decrease the number of features needed and improve the classification result through eliminating those less influential features.
C. EXPERIMENTAL RESULTS DISCUSSION
We build the entire system with online feature extraction and classification, and off-line channel selection and feature selection for each patient in CHB-MIT database [23] . Leave one record out cross validation (LOOCV) is used to avoid overfitting and test the generalization capability of the model. Every time one seizure record is held for test and other seizure records are used as the training set to fit the SVM model. The experiment is performed on a Windows 7 PC with an Intel I5 CPU and 24GB RAM.
In Fig. 4 , a seizure record which starts at 5299s and ends at 5361s is exhibited. During the ictal period, rhythmic activities and severe fluctuations can be identified from all the three selected channels. With the proposed method, this seizure event is successfully detected within 2 seconds from its clinical onset, and the seizure alarm duration corresponds well with the clinical duration of this seizure record. The overall performance of the proposed method over the CHB-MIT database is depicted from Fig. 5(a) to Fig. 5(d) . Regarding the sensitivity metric, our method successfully detects 124 out of 126 seizure records, leading to a 98.4% sensitivity, which is superior to other seizure detection methods in the literatures as observed in Table 1 . The average false detection rate of our method is 0.356/h, which is equivalent to about 8.5 false detections per day. The mean detection latency of the seizure onset is 6 seconds, with 75% patients VOLUME 6, 2018 under 8 seconds and only 2 patients above 10 seconds. The overall performance of our proposed method is comparable to other seizure detection algorithms as observed in Table 1 , which could fully fulfill the requirements for real time detection and closed loop intervention. Meanwhile, the by sample sensitivity of our method is 74.2%, which means 3 quarters of the seizure samples are successfully recognized, this can also help the doctors diagnose the seizure durations. This metric is not included by most studies, but it can provide more detailed information for every patient and enable personalized treatment including drug delivery or electrical stimulation.
The proposed real time seizure detection algorithm is realized and verified on CHB-MIT database. The overall performance is on a par with previous researches with high sensitivity and moderate detection latency and false detection rate. Benefited from the off-line two stage channel selection and feature selection, the online process which needs to be implemented in hardware is simplified during feature extraction and classification. In next section, we will discuss the hardware implementation of fast Fourier transform, which is the core part of feature extraction. 
III. HLS BASED FAST FOURIER TRANSFORM DESIGN
High level synthesis (HLS) is a design approach which uses high level description language (i.e. C/C++ or SystemC), instead of hardware description language (Verilog or VHDL) to conduct hardware design [21] . HLS is becoming more and more popular in FPGA and ASIC communities due to its high productivity and increasing reliability, designers can concentrate more on the algorithm design and the HLS tool will schedule the algorithm into state machine and map with micro-architectures like DSPs and registers. With HLS, the gap between algorithm development and hardware design has been reduced, resulting in shorter time to market, which is very critical in many data-driven areas like machine learning and artificial intelligence.
FFT is the core part of feature extraction in the seizure detection algorithm. Compared with DFT, FFT has deceased the computation complexity from O(N 2 ) to O N log 2 N . However, the fully parallel structure of FFT still consumes too many DSP resources despite its low latency. Previous studies explored to utilize HLS for FFT design can be found in [40] , [41] , and [45] . Considering the limited resources on FPGA, we design a radix-2 decimation in frequency (DIF) FFT with an iterative data path. With the proposed structure, we aim to use limited DSP resources and explore low latency with an efficient pipeline strategy. Apart from the 1024-point FFT in the seizure detection algorithm, we have proposed a complete design flow for arbitrary length (2 n ) FFT which could be used in different applications. To validate the effectiveness of the proposed structure, different lengths (128, 256, 512, 1024 and 2048) of FFT have been realized, their resource utilization and latency have been measured. Next we will introduce the detail of our FFT design.
In Fig. 6 , an example of 8-point radix-2 DIF FFT is presented. The basic computation block for this structure is radix-2 butterfly and twiddle multiplication unit, as shown in Fig. 7 . There are three stages in the 8-point FFT, with each stage composing four basic computation blocks. Since these blocks in each stage have the same functionality, we consider a double-folded structure which could be exploited for resource sharing. The pseudocode for 8-point FFT with double-folded structure is shown in algorithm 2, note that only the folded structure is presented to explain the idea. The outer loop corresponds to the three stages in Fig. 6 , the inner loop represents the four radix-2 butterfly and twiddle units in each stage. Data preprocessing and postprocessing are needed before and after the computation block to adjust the data sequence. The implementation detail of the pseudocode will be discussed in next paragraph.
In Fig. 6 , we assume the data before stage 1, 2 and 3 are x (0) to x (7), x (0) to x (7) and x (0) to x (7), respectively (The data before stage 2 and 3 are omitted in the Algorithm 2 Double-folded structure for 8-point FFT .
. . for i = 1 to 3 do for j = 1 to 4 do data preprocessing; basic computation block; data postprocessing; end for end for . . .
FIGURE 9.
The two-step permutation between neighboring stages, step 1 only performs element exchange, step 2 performs both element exchange and reordering.
figure for clarity). At stage 1, x (0) to x (7) form four pairs (x (0) , x (4)), (x (1) , x (5)), (x (2) , x (6)) and (x (3) , x (7)), and each pair of data is fed to the basic computation block, while the result is assigned to x (0) to x (7). At stage 2, the four pairs are x (0) , x (2) , x (1) , x (3) , x (4) , x (6) and x (5) , x (7) . At stage 3, the data are paired as x (0) , x (1) , x (2) , x (3) , x (4) , x (5) and x (6) , x (7) . Obviously, between neighboring stages, permutations are needed to change the data sequence, the process is described in Fig. 8 . Since we only use one computation block, the output of each stage is generated sequentially and cannot be directly mapped to the input of next stage due to RAM port limitation. Thus, we take a two-step permutation and use a temporary buffer to store the intermediate data, as in Fig. 9 . We take stage 1-2 permutation as an example. Since the outputs are generated sequentially in the natural order (0, 1, 2, 3) and (4, 5, 6, 7) , the first permutation step should not involve any reordering operation and therefore only element exchange is performed. The elements in the temporary buffer are (0, 1, 6, 7) and (4, 5, 2, 3). At the second step, both element exchange and data reordering are performed to generate the input data for the second stage computation sequentially. The same process is done between stage 2 and stage 3. With this two-step permutation, the inner loop in algorithm 2 can be pipelined efficiently since no memory access conflict exists. The control signal for element VOLUME 6, 2018 FIGURE 10. The folded data path of the proposed radix-2 DIF FFT, only one raidx-2 butterfly and twiddle unit is consumed in this structure.
FIGURE 11.
Measured total execution latency versus FFT length from Vivado HLS. When 1024 points FFT is performed with the proposed structure, the estimated latency is 6165 cycles.
exchange and address signal for reordering are hard-coded externally in Matlab. We have written a Matlab script to automatically generate the element exchange matrix, permutation matrix and twiddle factors for arbitrary FFT length (2 n ) and save them as header files to be later synthesized in Vivado HLS.
The complete pseudocode for arbitrary length FFT is shown in algorithm 3. Pma and pmb are permutation address matrices for bram a and b in step 2, em1 and em2 are the element exchange matrices in step 1 and step 2, respectively. Twiddle_rom is the look-up table for twiddle factor matrix. The pseudocode exhibits a nested-loop structure. Since there is no data dependency and memory port conflict in the inner loop, L2 can be pipelined with an initiation interval of one clock cycle. Only one computation block is consumed and used repeatedly for N 2 log 2 N times. Specifically, the inter-dependence should be set false to enable efficient loop pipeline. The final step is bit-reverse permutation which permutes the data to natural order at the output side. The complete folded data path is shown in Fig. 10 . Theoretically, the estimation of the total latency is N 2 log 2 N + N clock cycles, while the first part corresponds to the nestedloop iterations and second part corresponds to bit-reverse permutation.
We have realized different lengths of FFTs with the proposed structure at 100 MHz, the measured latency is shown in Fig. 11 . The results are in accordance with our theoretical analysis, which proves that our proposed structure is valid for different FFT lengths. The resource utilization result is collected after placement and routing in Vivado with necessary external components for a minimum system, the result is given in Table 2 . We observe that LUT and FF resources do not change a lot when the FFT length increases, this indicates that the control complexity and pipeline cost for different FFT lengths are almost the same. For BRAM, since the storage needed for twiddle factor matrix and permutation matrix keeps increasing rapidly, and the buffer size for intermediate signal grows proportionally to the FFT length, therefore BRAM size increases fast and is the major cost for implementing the proposed FFT structure. Memory requirement for different FFT lengths is shown in Table 3 . The twiddle_rom and permutation matrix account for about 2/3 and 1/3 of the total memory consumption, respectively. All the FFTs use the same amount of DSPs because they only need one set of basic building block in an iterative manner.
This design strategy can also be extended to higher radix FFTs, especially in situations where extremely low latency and high throughput are required, i.e. massive MIMO systems [43] . For example, when we use the strategy to design a radix-4 FFT, a total latency of N 4 log 4 N + N could be expected. Compared with radix-2 implementation, radix-4 FFT could further decrease the latency by about 75%. More DSP and BRAM resources will be the main overhead for higher radix FFTs, while optimizations could be made to reduce storage requirement for twiddle factors and other permutation matrix since lots of redundancies exist inside these matrices.
IV. OPTIMIZATION AND IMPLEMENTATION OF RBF-SVM
After the pre-determined spectral features are extracted from the FFT module, they will be fed to the nonlinear SVM module for classification.
To account for the condition with maximum computation load during nonlinear SVM classification, we set the number of features to 36 during the SVM training process. The resulted number of support vectors is 319. The SVM decision function is shown in equation (9) . lm i is Lagrange multiplier and l i is the corresponding class label, in this binary classification problem l i ∈ {−1, +1}. k(·, ·) is the kernel function, in this case we choose radial basis function (RBF) kernel and k (x,SV i ) = e (−γ x−SV i 2 ) , while γ is determined in the training process by grid search and normally a power of two which can be realized by a shift operation. N s is the number of support vectors and N s = 319. The Euclidean distance in the RBF kernel can be expanded as a summation as shown in equation (10), while M is the input feature vector dimension and M = 36. From equation (10) we can observe that inner-summation carries the most computation burden since the accumulation of square-of-difference inside the kernel function will repeat MN s times, while outersummation will only accumulate N s times. The optimization of the inner-summation will dominate the performance of the 
Algorithm 4 Original RBF-SVM Decision Function
The data path of equation (10) is shown in Fig. 12 . There are two multiplication-accumulations (MACs) involved, corresponding to two summations in equation (10) . p_sum i is used to denote the result of the first MAC, which is the Euclidean distance between each support vector and the input feature vector. The second MAC is to accumulate the exponential term multiplied by the corresponding Lagrange multiplier, the result is denoted as final sum. Most computation load lies in the first MAC, which consists of N s M multiply and add operations. The second MAC consists of N s multiply and add operations.
The pseudocode of the above data path is listed in algorithm 4. The two MACs have led to imperfectly nested loops L1 and L2 (imperfectly nested loops mean there are statements between the nested loops, while perfectly nested loops only have statements in the innermost loop). The support vectors are represented as 2D array and accessed sequentially in a row-major order. After analyzing the body of L2, dependence is observed between the inner loop iterations. Due to the nature of self-accumulation of p_sum [i] , the nested loops cannot pipeline efficiently with an initiation interval VOLUME 6, 2018 of one clock cycle. Thus, this original solution cannot fully utilize the DSP resources.
A. ARITHMETIC TRANSFORMATION OF RBF-SVM DECISION FUNCTION
As proposed in [19] , the Lagrange multiplier lm i in the decision function can be transformed to e ln lm i , therefore the multiplication can be changed to addition inside the kernel function as in equation (11). Furthermore, by scaling ln lm i with a factor of 1/γ , the addition can be grouped into the summation as in equation (12). (ln lm i ) /γ is denoted by optimized Lagrange multiplier.
The reformulation of the SVM decision function leads to a simplified data path as shown in Fig. 13 . Only one MAC is needed and the second MAC is replaced by an accumulator. Optimized Lagrange multiplier (ln lm i ) /γ is pre-calculated and used to initialize p_sum i . The final accumulator takes the class label l i as the control signal to determine whether to add or subtract the exponential term. This simplified data path alleviates the overall computation burden in the SVM decision function and the increased work load in MAC 1 is trivial. Besides, since the accumulator for f_sum only accounts for a small part in total computations, we can put more effort on the optimization of the partial sum p_sum i .
B. OPTIMIZATION OF THE NESTED LOOP STRUCTURE
According to the data path of the reformulated SVM decision function and the analysis of algorithm 4, we decide to take a two-step optimization to remove the loop-carried dependence in the partial sum calculation.
• Loop distribution: From algorithm 4 we know that the original L1 and L2 loops are imperfectly nested loops, while imperfectly nested loops make the execution order of L1 and L2 fixed, preventing efficient pipeline strategy to be applied. Since L2 loop body does not depend on the statements between L1 and L2, we break the outer loop L1 and form two new independent loops L3 and L4. After loop distribution, L1 and L2 become perfectly nested loops.
• Nested loop interchange: 
The pseudocode after the two-step optimization is shown in algorithm 5. The loop iterations in L2 are now executed at every clock cycle consecutively. Due to the loop interchange, the 2D array SV [N s ] [M ] should be transposed to SV [M ] [N s ] to keep the row-major access order. Now the perfectly nested loops L1 and L2 are only responsible for calculating the partial sum as in Fig. 13, L3 and L4 are targeted for the final sum after the partial sum is fully calculated. Compared with algorithm 4, the loop dependencies have been removed through code restructuring, enabling further optimizations like automatic loop unroll to be applied more easily. The total latency of algorithm 5 can be estimated as given in equation (13) . N s M is the latency for the nested loops L1 and L2, N s is the latency for L3. At 100 MHz, the pipeline initiation interval of L4 is 5 clock cycles, therefore the latency for L4 is 5N s . In this case when N s = 319 and M = 36, the estimated latency is about 13398 cycles for one classification. When the feature dimension M 6, the latency will be approximately equal to N s M , which is linearly dependent on both the number of support vectors and the input feature dimension.
• Loop unroll: After observing the nested-loop structure in algorithm 5, we find that parallelism of the inner loop can be easily achieved. In algorithm 6, the inner loop has been unrolled by 4, which corresponds to 4 times of hardware resources being used. Since p_sum is mapped with BRAM inside the FPGA, array partition must be performed on p_sum to increase the memory bandwidth to enable multiple read and write simultaneously. Hardware implementation of p_sum in algorithm 6 is shown in Fig. 14 . Cyclic partitioning of the 2D support vector matrix and p_sum is performed. Four identical arithmetic blocks form parallel processing units, leading to 75% decrease in latency. With this structure, trade-off between area and latency can be flexibly realized, therefore it could fulfill various requirements in different applications. The latency after loop unroll can be estimated using equation (14) , where f is the unroll factor. When f is equal to 4 as in algorithm 6, the estimated latency is about 4785 cycles. Further increasing the unroll factor will weaken the effect on reducing the total latency. When M /f 6, Latency ≈ N s M /f , which is proportional to N s M and inversely proportional to f . This relationship holds when M is a big number and f is a small ratio of M. 
; end for L4:
C. HLS SYNTHESIS RESULTS AND DISCUSSION
We realize algorithm 5 with different unroll factor 1, 2, 3 and 4 in hardware. All the input signal and intermediate signals use single-precision floating point number, and HLS math library is used to synthesize the floating point arithmetic logic units (add, subtract and multiply) and the exponential function. The support vector matrix is stored in external DDR memory, considering the limited on-chip memory and patient specific nature of the seizure detection algorithm. The hardware utilization is measured after placement and routing in Vivado, which also includes necessary components like direct memory access (DMA), AXI interconnect and ZYNQ7 processor to form a minimum system. The latency in clock cycles is measured directly from Vivado HLS synthesis report. The target FPGA platform is Zedboard, and the clock frequency is set to 100 MHz.
We also realize the method in [44] as a comparison. Tsoutsouras et al. [44] pipeline both loops in the original RBF-SVM decision function simultaneously without loop interchange. Because of the iteration dependence, manual adjustment of the unrolled inner loop is necessary to optimize the adder tree in [44] , therefore the loop cannot be 
FIGURE 15.
The relationship between unroll factor and latency for our proposed RBF-SVM and the method in [44] .
automatically unrolled. The measurement result of our implementation and [44] is shown in Table 4 . We can see that for the same unroll factor, our proposed method uses the same amount of DSP resources as in [44] , and the DSP consumption scales linearly with unroll factor in both methods. But our method uses less LUT and FF, this is because double-pipeline scheme in [44] requires complex FSM to schedule the arithmetic operations and more pipeline registers will be inserted to achieve a low initiation interval. In our method, only the inner loop needs to be pipelined, therefore the pipeline configurations are simpler and less resources are used. We also observe that our method costs more BRAMs compared with [44] , the reason is our method requires to partition the memory for partial sum to increase bandwidth, while [44] does not need to partition the memory. More BRAM consumption is the main overhead of our proposed method.
The latency comparison of the two methods is shown in Fig. 15 . Our method can speed up the total time consumption by 1.70x, 1.53x, 1.37x and 1.26x when the unroll factor is 1, 2, 3 and 4. The latency of [44] is almost inversely proportional to the unroll factor, since it only contains a big nested loop. While in our method, only the latency for partial sum calculation is inversely proportional to the unroll factor, the latency for L3 and L4 is a constant. Therefore, it can be observed that the benefit of unrolling will be smaller when the unroll factor grows larger. The latency measurement results are in accordance with the previous theoretical analysis. The proposed optimization method for RBF-SVM greatly reduces the total latency through pipelining the nested loop efficiently. The simple pipeline strategy also reduces the LUT and FF consumption compared with other complex implementation. When the ratio between feature dimension M and unroll factor f is high, the latency is approximately inversely proportional to the unroll factor, therefore a notable speedup can be achieved. The main cost for the proposed optimization method is more BRAM consumption due to memory partitioning.
V. SYSTEM INTEGRATION
After the 1024-point FFT module and RBF-SVM module are realized separately, we integrate them in a top-level system. The clock frequency for the programmable logic is set to 100 MHz. Vivado HLS first synthesizes the C++ code to Verilog HDL and packs the synthesized code as an IP block. After that the IP is imported to Vivado to build a basic system with ZYNQ7 processor and other interconnects. Synthesis, placement and routing are also performed before generating the bitstream file. Finally, the bitstream file is downloaded to the FPGA board and we perform the on-board test to verify the functionality and collect the execution latency in Xilinx SDK. To evaluate the impact of different unroll factors in RBF-SVM, we realize two implementations with unroll factor equal to 1 and 4, respectively. The measured latency and hardware utilization rate are listed in Table 5 . Both implementations have achieved very low excution latency (< 1 ms) and fulfill the requirement of real-time monitoring. Compared with implementation 1, implementation 2 further decreases the total latency by about 100 µs due to the parallelism involved in the RBF-SVM. The hardware utilization rate of both implementations is not high, which means even low-end FPGA platforms like Zybo can sustain such implementations substantially. The power consumption of the FPGA board with implementation 2 is about 4.56W as shown in Fig. 16 . The total on-chip power is 1.922W, while the ZYNQ7 processor consumes 1.542W and the programable logic consumes 0.38W.
VI. CONCLUSION
In this paper, we have proposed a complete design flow from algorithm development to hardware implementation for real time seizure detection. STFT is employed to extract spectral features and nonlinear SVM is used to classify each feature vector as normal or ictal state. The employed channel selection and feature selection in the algorithm not only help reduce the feature vector size, but also improve the detection sensitivity and reduce false detections. During the hardware implementation, a double-folded structure for FFT with arbitrary length is proposed and area-delay trade-off can be easily realized. The proposed optimization strategy for nonlinear SVM exploits pipeline and parallelism techniques at the same time to reduce the latency.
The potential use of the proposed nested-loop optimization method is not restricted to nonlinear SVM, any nested-loop structure with similar MAC operations may also benefit from this method. Our future work will concentrate on using fixed point arithmetic to realize the entire algorithm to further reduce the hardware utilization and power consumption.
