Abstract. Brain Computer Interface (BCI) systems, which are based on motor imagery, enable humans to command arti cial peripherals by merely thinking about the task. There is a tremendous interest in implementing BCIs on portable platforms, such as Field Programmable Gate Arrays (FPGAs) due to their low-cost, low-power and portability characteristics. This article presents the design and implementation of a Brain Computer Interface (BCI) system based on motor imagery on a Virtex-6 FPGA. In order to design an accurate algorithm, the proposed method avails statistical learning methods such as Mutual Information (MI), Linear Discriminant Analysis (LDA), and Support Vector Machine (SVM). It also uses Separable Common Spatio Spectral Pattern (SCSSP) method in order to extract features. Simulation results prove achieved performances of 73.54% for BCI competition III-dataset V, 67.2% for BCI competition IV-dataset 2a with all four classes, 80.55% for BCI competition IV-dataset 2a with the rst two classes, and 81.9% for captured signals. Moreover, the nal reported hardware resources determine its e ciency as a result of using retiming and folding techniques from the VLSI architecture' perspective. The complete proposed BCI system achieves not only excellent recognition accuracy, but also remarkable implementation e ciency in terms of portability, power, time, and cost.
Introduction
Nowadays, humans attempt to communicate with the outside world in a better and more convenient way. Although, at rst glance, physical transactions seem essential for this goal, science is trying to make this relationship independent of physical actions.
Brain Computer Interface (BCI) is a system, enabling humans to control objects by merely mak-ing decisions and thinking about them, i.e., without any sort of intervention of physical movement. In other words, a BCI system is an interface between the brain and computer to translate, process, and classify the electroencephalograph (EEG) signals [1] . After classi cation, this system sends commands to external devices, such as a wheelchair, to control it ( Figure 1) . A motor imagery system controls arti cial devices by solely thinking about them. Movements of an arti cial hand by thinking about its movement is one of the various examples of the applications of such systems, which is a predominant strategy for motor rehabilitation in stroke patients. Note that BCI systems are meant to be wearable, yet not sensible, in future applications; hence, it will be convenient and transparent to others. Moreover, BCI technology as a classi cation system is well established and has various applications, for instance, for patients with paraplegic devices. Therefore, any e ort to improve its performance and/or e cient implementation is needed and valuable.
Although there exist several other controlling approaches, especially for patient or elderly care, motor imagery BCI system has some advantages over them. For instance, eye tracking [2] is not applicable in case the patients have visual or eye muscle problems, such as Amyotrophic Lateral Sclerosis (ALS) and Locked-In Syndrome (LIS), and, therefore, cannot control their eye muscles to move their eye in a certain direction [3] [4] [5] . Other problems of eye tracking can be found in dim light conditions or where the patient needs to notice his/her surrounding which would be dangerous for them to move their eye in other directions (e.g., when a patient in a wheelchair is passing on the street). Moreover, people with glasses can use BCI system much more conveniently than those with eye tracker can. Apart from their di erences, both eye tracking and motor imagery BCI require a wearable device; eye tracker requires a pair of glasses to hold a camera to track eye, and BCI system requires a hat with electrodes. Moreover, note that despite the advantages of BCI system over eye tracking, eye tracker has the advantage of higher accuracy in comparison to motor imagery BCI.
In particular, in motor imagery BCI system, when a subject imagines moving, a typical desynchronization of upper alpha and beta rhythms is observed in the sensorimotor cortex [6] , followed by a re-synchronization. This pattern of activation can be easily detected by EEG and used to feed a BCI system for di erent purposes. In fact, imagination movement tasks evoke EEG signal changes in a way that these changes can be detected in temporal and spatial traces of EEG rhythmic components, at speci c electrodes (channels) located on subjects' scalps. Therefore, we can use electrophysiological properties of motor imagery as temporal, spatial and frequency characteristics to detect the mental task. This paper focuses on BCI systems that are based on translating EEG signals recorded through motor imagery.
A BCI system consists of units of signal acquisition, preprocessing, feature extraction, feature selection/reduction and classi cation. Signal acquisition unit records the brain activates such as EEG signals, while the pre-processing unit increases the signal-tonoise ratio of the signals. On the other side, feature extraction unit prepares features that are meaningful to the classi cation stage and omits outlier and artifact features. Then, the feature selection/reduction unit reduces the number of features and/or acquisition channels to decrease the dimension of feature vectors. Finally, classi cation unit classi es the features into logical control signals.
Solving a BCI problem includes two phases, i.e., the train and test phases. In the train phase, the parameters of the system are adjusted through learning from previously recorded samples. In the test phase, the new data are recorded and its class is estimated using the parameters adjusted in the train stage. Unfortunately, such algorithms require high-speed computers to process the steps of BCI, limiting their use and portability. Therefore, an e cient algorithm to meet these speci cations is needed to be designed. In order to overcome this shortcoming, using hardware platforms such as Field Programmable Gate Arrays (FPGAs), which are more portable, less expensive, and power e cient, seems reasonable. Moreover, there are other reasons to use FPGA for implementing a BCI system. First, the parallelism of FPGAs can be used for high computational throughput, especially at low clock rates. Second, FPGAs are more exible and more power-e cient than processors are. Third, FPGAs are more suitable for real-time embedded solutions [7] . This paper proposes an e cient high-accurate algorithm for classi cation of EEG signals for a BCI system. Moreover, the paper presents an e cient hardware implementation of the proposed BCI on a FPGA platform. The proposed algorithm is based on statistical learning methods, and the hardware implementation considers the optimization of power, area, and frequency to meet the targeted speci cations of end applications. The achieved results show that the proposed BCI can classify EEG signals from BCI competition IV (4 classes), BCI competition III (3 classes), and our in-house signals (2 classes) with accuracy rates of 67.2%, 73.54%, and 81.9%, respectively. The hardware implementation results validate the successful implementation of the proposed method on hardware at the clock frequency of 560 kHz.
The remainder of this paper is organized as follows: In Section 2, the current literature is presented and discussed; in Section 3, all datasets used in this paper are presented; Section 4 describes the proposed methodology for the BCI system. Section 5 shows the simulation results of software, and Section 6 discusses signal processing concerns for hardware implementation. The detailed hardware implementation is presented in Section 7. Section 8 discusses some trade-o s between accuracy and hardware resources. In Section 9, the obtained hardware experimental results are reported. Finally, Section 10 concludes the paper.
Previous work
There is a rich literature addressing the research in BCI systems, where each paper may consider a speci c part of the system. The related literature can be divided into two main categories: algorithms and hardware implementation. In the following, state-of-the-art work from each category is brie y described.
Previous work on algorithms
There exists a large amount of literature on BCI algorithms and methods. Two main sub-categories are methods of feature extraction and those of classi cation:
1. Feature Extraction: Adaptive Auto-Regressive (AAR) is one of the suitable algorithms to extract features from EEG signals [8] [9] [10] . A classical approach to estimating time-varying AR-parameter is the segmentation-based approach. In this case, the data are divided into short segments and the AR parameters are estimated from each segment. The shorter the segment length, the higher the time resolution; however, it has the disadvantage of an increasing error in AR estimations. The AAR model appropriately describes the random behavior of EEG signals and provides parameters with a high time resolution. Moreover, by using AAR model, it is not necessary to use frequency band [11] . In [12] , an AR model is utilized because of achieving better estimations of short time series compared to Fast Fourier Transform (FFT). AAR can also be used to remove artifacts, such as eye blinking and muscle movements, from EEG data, based on blind source separation [13] . However, the AR model cannot capture the transient features in EEG data, and the timefrequency information is not easily seen in the AR parameters [14] . Therefore, several researchers have used wavelet coe cients that provide localization of signal components with spectro-temporal characteristics [15] [16] [17] [18] [19] . The main bene t of wavelets is the time-frequency localization. The advantage of timefrequency localization is that the wavelet analysis changes the time-frequency aspect ratio, providing a good frequency localization at low frequencies for long time windows and good time localization at high frequencies for short time windows [20] . On the other hand, the Power Spectral Density (PSD) is another approach that indicates the distribution of power of a signal between di erent frequencies [21] .
Various other algorithms have also been proposed that extract features in spectral or spatial domains such as the Common Spatial Patterns (CSP), classifying brain tasks. In fact, CSP, using the variance as a new feature, tries to extract features that are able to maximize the variance of a particular task and minimize the variance of the other [22] . A signi cant drawback of CSP is that it does not consider the spectral characteristics of the EEG signal [23] . To overcome this problem, many researchers have proposed several variants of CSP [24] [25] [26] [27] . One of the promising algorithms for the feature extraction, which is an advanced version of CSP, is called Filter Bank Common Spatial Patterns (FBCSP) whose advantage is that it considers the spectral characteristics of the EEG signals [28, 29] . Moreover, an advanced version of FBCSP is called Separable Common Spatio Spectral Pattern (SCSSP) [30] , which overcomes all shortcomings of FBCSP, i.e., the high computational cost of training, lack of analysis of both spatial and spectral characteristics of signals, and lack of having a metric to rank the discriminatory power of extracted spatio-spectral features. In addition, CSP methods su er from sensitivity to noise and, also, over tting. In order to overcome these problems, di erent methods are proposed in the literature for regularizing CSP, namely Regularized CSP (RCSP). In [31] , various RCSP algorithms are reviewed and summarized, as well as the standard CSP. Thereafter, four novel RCSP methods are proposed and evaluated which outperform previous RCSP methods.
Classi cation: Linear classi ers, such as Linear
Discriminant Analysis (LDA) and Linear Support Vector Machine (SVM), can be applied to distinguish classes using linear functions [32] . Methods such as [33] [34] [35] [36] have used these classi ers in BCI systems. LDA is suitable for online BCI systems because of a low computational cost, which persuades researchers to use it in motor imagery-based BCI systems. However, the main shortcoming of LDA is its linearity that can provide poor results on complicated nonlinear EEG data [37] . LDA can also be used for feature reduction in EEG signals [38] , as is used in the proposed method, too. On the other hand, Linear-SVM uses a discriminant hyperplane to classify EEG features. This method is very popular because of good generalization properties [39] , simplicity in implementation, and robustness [40] . Nonlinear classi cation, such as Bayesian and K-Nearest Neighbors (KNN), are used in [41] [42] [43] . Nonlinear methods show better classi cation performance than the linear methods; however, they are more complicated, especially from the implementation point of view. In spite of all these e orts, designing an e cient algorithm for classi cation in BCI systems, while considering the hardware constraints, is still a major challenge.
Previous work on hardware
Most of the literature in which a full BCI system is implemented are concentrated on the steady-state visual Evoked Potential, which is an EEG signal response to the ickering visual stimulus [44] , [45] , [46] . However, in [47] , a BCI system is implemented based on motor imagery, which includes Finite Impulse Response lter as a preprocessing, CSP as a feature extraction, and Mahalanobis distance as a classi cation, on Stratix IV FPGA Board with operational frequency of 200 MHz. This paper proposes a more e cient hardware implementation with less hardware resources compared to the design in [47] . A more complete work of [47] is reported in [48] . A novel approach is introduced in [48] , de ning the tasks and working with devices by a state machine. The user can do the task or switch to other tasks by thinking about right and left hand movements.
There exist several other works on BCI systems with a hardware implementation perspective. For example, Wearable Mobile EEG-based Brain Computer Interface System (WMEBCIS) [49] is proposed for detecting drowsiness. A low-cost FPGA-based BCI multimedia control system is also proposed in [50] . Their proposed framework is used to control multimedia devices. In their work, they use Steady-State Visual Evoked Potential (SSVEP), which is light-emitting diode stimulation panel (several command symbols). SSVEP is also utilized in [51] to control environmental devices such as television. Controlling hospital bed nursing system in a FPGA-based BCI system is also addressed in [52] .
Best-performing methods on BCI competition datasets
There are reported accuracies on BCI competitions available in BCI competition web site [53] . The best reported method (in [54]) on BCI competition IIIdataset V, which has classi ed all three classes of this dataset, has used LDA to extract features from PSD values of the EEG signals and distance-based discriminator for classi cation. The best performing method (as reported in [55]) on BCI competition IVdataset 2a, which has classi ed all four classes of this dataset, has employed FBCSP and Naive Bayes Parzen Window as feature extraction and classi cation, respectively. However, the main shortcoming of these methods is that they provide low accuracies and are not suitable for hardware implementation as they have high computational complexity. However, the methods proposed in [31] and [48] , mentioned previously, outperform the best methods reported in [54] and [55] , and can be considered as two of the best performing methods on BCI competition IV-dataset 2a. Nonetheless, it should be noticed that these two methods only classify the rst two classes (movements of right and left hands).
Description of datasets
There are two available methods for recording the EEG signals, i.e., invasive and non-invasive methods. The former implants the electrodes in the brain through an operation; thus, it can be dangerous and harmful to the brain. However, the latter records the signals by electrodes on the head skin. In order to standardize the placement of the electrodes on the head, the 10-20 standard has been introduced. In this standard, the distance of adjacent electrodes from each other is 10% or 20% of the whole distance between back (inion) and front (nasion) of the skull. This standard has improved over the years and has multiple versions. Figure 2 depicts the locations of almost all sensors in these standards.
Recording brain signals in the non-invasive approach is commonly called Electroencephalograph (EEG) [56] . The EEG method has some advantages and disadvantages. In fact, a lower price, high time resolution, and robustness of movement could be mentioned as its advantages, and low SNR, distilled water requirement and low spatial resolution are its disadvantages.
Capturing EEG signals, in the lab, consists of multiple steps. After the appearance of the xation cross, a cue is performed to warn the experienced person to start thinking about a particular task, which is the motor imagery step. The last step is resting, which happens before starting the next experiment. Figure 3 illustrates the steps of the EEG signal capturing.
In this paper, three datasets are used to verify the proposed algorithm, which are detailed in the sequel.
BCI competition III -dataset V
One of the datasets used to evaluate the proposed algorithm is dataset V of BCI competition III [53] . This dataset was created by IDIAP Research Institute. Recorded signals are brain signals from three di erent persons, each of which is involved in the experiment for four times. All these four experiments were performed in one day with 5-10 minutes break between every session. The sampling rate is 512 Hz and 32 channels have been used to record the signals. There are both raw signals and power spectral densities available in this dataset; however, only raw signals are used in this paper. In this dataset, each person is asked to think about the following three tasks in di erent time slots:
Imagination of repetitive self-paced left-hand movements;
Imagination of repetitive self-paced right-hand movements;
Generation of words beginning with the same random letter. 
BCI competition IV -dataset 2a
The second dataset used in this paper is dataset 2a of BCI competition IV [53] , consisting of signals from nine people. Signals from each person are captured in two sessions on di erent days. Each session consists of 6 parts and each part includes 48 experiments. In each experiment, every person is asked to do one of the following four motor imagery tasks:
Imagination of left-hand movements; Imagination of right-hand movements; Imagination of both legs movements; Imagination of tongue movements.
The sampling rate is 50 Hz and data are ltered by a band-pass Butterworth lter.
Captured signals
In order to devise a local platform to evaluate the algorithms, we managed to record EEG signals by an Emotiv ® system, including 14 channels with the sampling rate of 128 Hz. The EEG electrodes have been placed in locations AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4 based on International 10-20 system (see Figure 2 ). Eight persons were asked to imagine performing one of the following two tasks in each experiment:
Imagination of left-hand movements; Imagination of right-hand movements.
Four experimental sessions were held in one day, each of them including 30 experiments. All subjects were sitting on a comfortable armchair in front of a computer screen during the recording session. A session takes eight seconds, including two seconds to show a xation cross on the black screen at the beginning of session, one second for a cue pointing either to the left or right on the display, three seconds for performing the corresponding task in front of the black screen to avoid visual feedback, and two seconds in order to relax their minds (see Figure 3 ). The rst three experimental sessions are used as training data, and the last one is used as test data. Before starting to record, all subjects were asked to practice with the experimental conditions for ve minutes in order to reduce their stress and increase their concentration by becoming familiar with conditions. Subjects were asked not to move or blink during these three seconds of motor tasks in order to reduce the e ects of artifacts such as electromyogram (EMG) and electrooculography (EOG). Signals were ltered by a highpass lter with 0.1 Hz cut-o frequency and, then, ltered by a notch lter with the frequency of 50 Hz to remove the power line e ect. Moreover, note that, in the captured signals, subject 7 is about 60 years old, and the other subjects are young, around 23. The signi cant di erence between the accuracy rates for subject 7 compared to the other subjects might be due to the older age of this subject.
Methodology
The method in this paper consists of several computational blocks, utilized in both train and test phases. Figure 4 shows di erent parts of this method. Each part will be discussed in detail in the following.
Pre-processing
Preprocessing, which tries to remove noises and artifacts, is crucial to obtaining high classi cation accuracy in BCI systems. In the proposed method, two techniques, i.e., DC-block and Laplacian lter, are used for this purpose. 1. DC-block: Filtering is an important step to remove unnecessary information from raw signals. In this paper, the xed point DC blocker [57] is applied to remove the DC shifts. The DC component in EEG signals varies between di erent recording runs, even for a speci c subject. This component does not include any information regarding the motor imagery task and may degrade the accuracy of the algorithm. Based on this fact and for the sake of achieving stability during all recording runs, the DC component should be omitted by means of a DC lter. An e ective solution from the hardware implementation point of view is the DC-blocker, which performs the DC ltering by means of minimum hardware resources based on the following equation [57] :
where = 0:996: 2. Laplacian Filter: One of the most important limitations of EEG signals is their poor spatial resolution. One common technique in order to alleviate this problem is the Surface Laplacian (SL) ltering [58] . It is also used to remove the noises and artifacts whose origin may be outside of the skull [59] , which eventually improves the Signal-toNoise Ratio (SNR). There are four di erent spatial lters, namely standard ear reference, Common Average Reference (CAR), small Laplacian and the large Laplacian. We employed the CAR method as it has been proven that the CAR is suitable for a communication system related to and rhythms, which are the main frequency bands (8) (9) (10) (11) (12) Hz and 12-32 Hz, respectively) (some works, such as [60, 61] , cite the range of rhythm as 18-30 Hz) for motor imagery EEG signals [62] . In fact, in the CAR method, the potential of each channel (electrode) is subtracted by a weighted average of the next-nearest neighbor channels, according to their distance. If the distances are equal, it can be easily formulated as follows:
where V ER i is the potential between the ith electrode and the reference, and n is the number of electrodes in the montage.
Feature extraction
Feature extraction is the process of distinguishing pertinent signal characteristic (i.e., features related to the user's intent) from unnecessary contents. The most popular methods in this eld are AAR, wavelet-based, PSD and SCSSP, described in the sequel. 1. AAR: Autoregressive (AR) method is a simple, yet e cient, method for describing probabilistic behavior of a time sequence. In this paper, AAR is used for the adaptive estimation of the AR parameters [9] . The mathematical expression of this method can be written as follows:
where x k is a Gaussian noise with zero mean and variance 2 x , parameter p is the degree of AR model, and a i 's are AR coe cients. Moreover, k is a positive number, denoting the number of samples and is related to the signal duration as k = t f 0 , where f 0 is the sampling frequency. 2. Wavelet: This method is able to produce good frequency localization in the time window, which is, in particular, appropriate for signals with transient nature such as EEG. The Wavelet coe cients are obtained through decomposition of the signal into di erent frequency bands. This decomposition is performed in multiple stages by means of consecutive low-pass and high-pass signal ltering. In this paper, this method is applied to nd the coe cients of the frequency bands of and rhythms as features of EEG signals. 3. PSD: In this method, the power of the signal between di erent frequencies is considered as features. Herein, the value of PSD is computed in each 2 Hz frequency band within the 8-30 Hz to cover and rhythms [21] . 4. SCSSP: Separable Common Spatio Spectral Pattern (SCSSP), proposed in [30] , improves the Common Special Patterns (CSP) method [22] . This method selects proper features of the signals as illustrated in Figure 5 . Let us consider an EEG experiment with N ch channels (electrodes) and N i samples. After passing these signals through a set of N f band-pass lters, N i matrices with the size of N f N ch are obtained. Let X denote a N f N ch matrix, and then its spectral ( i ) and special ( i ) covariance matrices are computed as follows:
where N i is the number of training samples X. Then, by solving the generalized eigenvalue problems in Eq. (5):
desired eigenvectors (W L and W R ) and eigenvalues ( L and R ) are calculated.
Following this calculation, components k are computed as follows: R , and y k can also be found as follows:
where denotes Kronecker product.
It is suggested to use the normalized log-power features instead, which are calculated as follows:
where z k is the kth feature. The feature vector is then constructed as Z = [z 1 ; z N f N ch ; :::] T .
Notice that the SCSSP method is a binaryclassi cation algorithm. To generalize it for multi-class purposes, several auxiliary approaches, such as OneVersus-Rest (OVR), can be used.
Finally, the band-pass ltering is performed using Chebyshev type II lters of order 6 and bandwidth of 4 Hz. A total of N f = 4 lters are used to cover and 
Normalization
Normalization of features limits their dynamic range and improves the accuracy of the algorithm. It can be performed in both linear and non-linear forms. Assume that m k and k are mean and standard deviation of the kth dimension of feature vector (Z), respectively:
(Z ik m k ) 2 ; k = 1; 2; :::; l; (11) where N is the number of training samples. Then, the linear normalization of the train and test Z k is calculated as follows:
Feature election
To reduce the number of obtained features e ciently, the following two methods can be used: 1. FDR: Fisher Discriminant Ratio (FDR) is a ratio that considers the best features by maximizing the distance between the means of the classes and minimizing the variance within each class. Features that satisfy longer FDR are better ones to be selected. If i and i denote the mean and standard deviation of feature Z k in the ith class, respectively, the FDR for this feature is calculated as follows:
where C is the number of classes. A higher FDR value implies that the feature has a better contrast to classify classes. 2. MI: The mutual information I(Z; !) between variable Z from the feature space and class labels ! = f! 1 ; ! 2 ; :::; ! c g is de ned as follows:
; (14) where p(:) denotes the probability function. The more independent variables Z and classes are, the less mutual information the feature will have. It should be noted that the output value of I is always greater than zero. Features satisfying higher quantity of MI with regard to the classes are selected as better features [63] [64] [65] .
Feature reduction
After selecting proper features, still many of them may have dummy information. To reduce the dimension of features, several methods have been proposed, two of which are described in the following: 1. LDA: Linear Discriminant Analysis (LDA), also known as Fisher LDA, is a popular method for dimension reduction and classi cation [66] . The important advantage of LDA is that it amends and improves the total achieved accuracy while providing a very simple methodology. As LDA requires very low computational complexity, it is a worthy selection for BCI systems [67] . LDA projects the data using a transformation matrix W onto a new space with lower dimensions. If the number of classes is C, the new feature dimension (d) will be C 1. LDA tries to minimize within-class scatter S w and maximize between-class scatter S b formulated as follows:
where z j i is the ith sample of class j, j is the mean of class j, n i is the number of samples in class j, and is the mean of means of all classes. because when uniform probability distribution is assumed for classes, p(class : j) can be discarded. 3. SVM: Support Vector Machine (SVM) is one of the well-known classi cation algorithms and has been widely used in BCI due to its simplicity. SVM constructs hyperplanes to separate the feature vectors into several classes. These hyperplanes maximize the margins, that is, the distance between the nearest training samples and the hyperplanes [68] . The goal during the training process is to nd the separating hyperplane with the possible largest margin from the edge of classes [66] .
Performance results
In this paper, for each part of the system, several methods were investigated and tested to nd the optimum approach. Given the fact that the algorithm was going to be implemented on hardware, all values were considered xed point in experiments performed in MATLAB. Twenty bits were found to be su cient for each value of each channel, in which 11 bits were considered for the decimal part, 8 bits for the integer section, and one sign bit. The number of bits is carefully determined using the bit-loading process. This means that through extensive simulations, the dynamic ranges of the variables were monitored and logged. As shown in Figure 6 (a), the number of bits associated to fractional part su ces to be 11 to have stable and good total accuracy in the experiments. Moreover, Figure 6 (b) veri es that 20 bits are su cient for representing the whole number in this work. Among feature extraction methods, SCSSP performs better than others do, for all di erent classication methods (see Table 1 ). Therefore, SCSSP is the best choice for the feature extraction. As can be seen in this table, the normalization of features improves the accuracy rate by 5.4%. MI and LDA methods perform the best among feature selection and reduction methods, respectively. The results show that Bayesian approach outperforms other classi cation methods with a slight improvement compared to the linear SVM. However, because of its low implementation cost on hardware, linear SVM is selected for classi cation in this work (in the next sections, the numeric comparisons of linear SVM and Bayesian classi ers are reported for both software and hardware implementations). In conclusion, the nal structure of the proposed method is depicted in Figure 7 .
Overall results of the algorithm (for complete number of classes) are listed in Table 1 , showing better performance than the best reported results of BCI competitions [54, 55] and also recent state-of-the-art papers [31, 48] for both datasets (67.2% for BCI competition IV, dataset 2a; 73.54% for BCI competition III, dataset V). Moreover, the results show a high performance of 81.9% for captured signals, re ecting the excellent performance of the proposed algorithm. Notice that, in the captured dataset, as mentioned in Section 3.3, subject 7 is much older than the other subjects are. This may explain the signi cant di erence between the accuracy rates of this subject and the other subjects.
Several recent state-of-the-art papers working on motor imagery (such as [31, 48] ) have considered merely two classes of right-and left-hand movements for experiments. To compare the proposed method with similar state-of-the-art methods, we evaluate this work with two mentioned classes, too. The experiment is performed on BCI competition IV, dataset 2a. According to Table 1 , this work outperforms [31] with a slight improvement; however, it does not reach the performance of [48] . One of the reasons for not outperforming [48] is that our proposed system tries to make a balance between accuracy and hardware e ciency by considering hardware issues, too, while accuracy is a matter of much concern in [48] resulting in lower hardware e ciency. The other reasons for the lower accuracy of our proposed method, compared to [48] and some details of [31] and [48] , are explained in the following.
This work di ers from [31] in some important items. First, in [31] , several additional subjects are used as a prior knowledge (for regularization) in the training phase. Obviously, the use of additional subjects in training improves the performance. Additional subjects might not be always available. Secondly, they have not proposed the hardware implementation of their method; thus, their simulations are in the oating point while the reported results of this paper are in xed-point. In fact, the software experiments of the proposed method in this paper are performed in xed point in order to be realistic for hardware implementation. Third, [31] uses CSP and RCSP, while this work uses SCSSP which has signi cant advantages over CSP method. Fourth, they have used three pairs (N f = 3 pairs) Butterworth lters with order 5, while four (N f = 4) Chebyshev lters are used in this work with order 6.
In [48] , a motor imagery embedded system is proposed in both hardware and software frameworks. Their method consists of adjusting lters in preprocessing, using the CSP method for feature extraction and utilizing Mahalanobis distance as a classi er. Several di erences between [48] and this work are as follows. First, in [48] , the lter is optimized for each subject according to intrinsic characteristics of the subject. However, identical lters are used for all subjects in this paper. For instance, for the best total accuracy in [48] , the degrees of optimum FIR lters for subjects S2, S4, and S6 are respectively 221, 146, and 442. However, the orders of Chebyshev lters in this paper are all six, which are much easier to implement and consume much less area and power in hardware. Second, because of this optimization of every subject, [48] is subject-dependent in contrast to the proposed method in this paper. Being subjectdependent has serious disadvantages, such as the need 81.9% 1 In [31] , several additional subjects are used for regularization in training phase. In addition, in contrast to evaluations of this work, experiments in [31] are in oating point simulation. 2 In [48] , in contrast to this work, the lters are not identical for all subjects, and lter of each subject is optimized according to the subject's characteristics. Therefore, [48] is subject-dependent. Moreover, the orders of lters are signi cantly high in comparison to orders of lters in this work. In addition, as our best understanding from [48] , the training and testing sets di er from the standard sets in the dataset.
to train the whole system again by introducing new subjects, having slower training phase, di culty in the expansion and mass production, etc. Third, in [48] , the ratio of training to test samples is 60%/40%. However, the ratio of the training to test in the standard BCI competition IV, dataset 2a, is 50%/50%, which is used in our work for evaluation. In addition, as is reported in the following sections, this work in this paper highly outperforms [48] in the total consumed power. si ed tasks by the proposed method. This gure illustrates the scatter of feature points output from LDA block. Figure 8 (a) belongs to subject S7 of BCI competition IV. Because this dataset contains four classes, the dimension of feature vector from LDA output is of three. Figure 8(b) shows the rst subject of BCI competition III, which has three classes. Thus, as it can be seen, features are two-dimensional. Figure 8(c) shows captured signals. In this dataset, there are two classes, and the dimension of feature vector of LDA output is one. However, for better visualization, the horizontal axis is added as a dummy dimension, which represents the index of samples. As is obvious in this gure, notably high separability between classes has been achieved as a result of optimum blocks used in this method.
Signal processing concerns for hardware implementation
In this section, issues relevant to the BCI system performance will be studied. Some parameters, such as the number of lter banks, the number of channels, the number of classes, and the number of features, a ect the resulting hardware area. Thus, their values must be known before the hardware design. The dataset used for implementation is the third one (captured signals).
Number of lter banks
As previously mentioned, the EEG signals must pass the lter banks before applying the SCSSP algorithm. The number of banks should be e cient with regard to both the achieved accuracy and the hardware implementation cost. In this case, the number of lters is chosen to be four, so that these lters can cover alpha (8-12 Hz) and beta (12-16 Hz, 16-20 Hz, 20-24 Hz) frequency bands. The Chebyshev lter with order 6 is used for each lter path.
Number of classes
Another critical option is to choose the number of classes, in uencing some parameters of LDA and SC-SSP algorithms. In addition, it a ects the classi cation design directly, because the basic SVM is only applicable to binary-class tasks. In order to use the SVM method for classi cation tasks with more than two classes, one-versus-all or one-versus-one approach can be used. In this study, since the hardware is tested for the third dataset, which contains only two classes, the number of classes of dataset is set to two.
Number of channels
Another parameter that a ects the implementation is the number of channels. This parameter a ects the clock rate of the DC-block, Laplacian lter, Chebyshev lter, and SCSSP module. The number of channels is set to 14 in the third dataset.
Number of features
The number of features, set to 40, in uences the transfer matrix dimension of LDA used to reduce the number of features and SCSSP units, which generate the best features before applying LDA.
Hardware implementation
Herein, a two-class case is implemented as an instance of implementation. This implementation can be easily generalized to the case of more than two classes. Appli-cations of two-class case are very wide, such as moving the head of hospital bed up/down by thinking about movements of right/left hand. This can especially help patients with disabilities in moving body. There are also various applications in smart homes, where turning on and o the electrical devices can be performed by merely thinking about right and left hand movements. In addition, as mentioned in Section 2, recently in the literature, it has been proposed to code several tasks in a state machine using only the right-and left-hand signals [48] .
In this work, solely, the testing phase of classication is implemented in hardware, as training can be performed beforehand and there is no need to implement it. The training phase can be done in software such as MATLAB or C language (MATLAB is used in this work), and the resulted coe cients can be saved and used for the test phase, which is implemented in hardware.
Implementation of DC-block
The rst block in the proposed BCI system is the DCblock, with the structure, illustrated in Figure 9 . As can be seen in this gure, it consists of one adder, one constant multiplier, and one subtractor. Because there are 14 parallel data channels that need to be ltered independently, there is a need for 14 parallel DC-block cores.
In such a situation, by means of applying the folding technique on the multiplier module (i.e., to design the DC-block with only one multiplier for all input channels), the silicon area minimization is enhanced. This achievement is obtained by sacri cing the throughput. However, because of the large sampling frequency of the recording device that is equal to 128 Hz for Emotiv ® system, such a decrease in the throughput can be overlooked. As marked in Figure 9 , the critical path of this core consists of only one multiplier and one adder, resulting in the operational frequency of up to 132 MHz, which is fast enough to handle such a low-frequency EEG signal.
Implementation of Laplacian lter
As stated in the previous sections, the Laplacian lter is based on the CAR algorithm, in which the mean of all channels is subtracted from all of them. The structure of this core is shown in Figure 10 . There is one adder to compute the sum of all channels. The division of the output by the number of channels is realized through simple shift operations. In this way, a signi cant area and power can be saved, i.e., 1 14 = 1 16 + 1 128 + 1 1024 = 0:0713. As shown in Figure 10 , the critical path of this module consists of two adders. Therefore, the operation frequency of this module can be up to 234 MHz. The minimum speed of this module with respect to the sampling frequency of 128 Hz can be found based on the following equation: 14 x + 1 x + 14 x in which x is the minimum limit of the operation frequency, which is 3718 Hz.
Implementation of Chebyshev lter
Separable Common Spatio-Spectral Patterns (SC-SSP) [30] are used to extract features of motor-imagery tasks. In this algorithm, input data goes through a bank of band-pass lters in the frequency ranges of 8-12 Hz ( band) and 12-30 Hz ( band). We have used Chebyshev II band-pass lter of order 6. The block diagram of this lter is illustrated in Figure 11 . In this gure, b i and a i parameters are the nominator and denominator coe cients of this lter, respectively. It is important to note that this lter can be implemented by means of 12 multipliers (there are two constants and two zero coe cients). Each lter bank should be applied to all input channels simultaneously, and it ends up in a number of 168 (12 14) multipliers per lter bank, leading to 672 (4 168) multipliers with a bank of four lters, which is not feasible for implementation. In order to overcome this limitation, the folding technique is used between 14 input channels at each lter to reduce the total number of multipliers from 672 to 48 (4 12), resulting in a signi cant reduction in area and power consumption in this module. The critical path of this module consists of two multipliers and seven adders; however, it is improved by means of retiming technique to reduce this path to only one multiplier and two adders, resulting in maximum frequency of about 106 MHz. The latency of this module is 99 (1 + 14 (6 + 1)) clock cycles. Therefore, the minimum limit of the operational frequency in this module should be 12.3 kHz (99 128 Hz).
Implementation of SCSSP
After applying the proposed band pass lters, the SCSSP [30] core should be applied to the ltered data, in order to extract corresponding features of motor-imagery tasks. The functional structure of this algorithm is summarized in Figure 12 . As is shown in this gure, ltered vector X consists of 4 14 elements L , explained in SCSSP section. The result is named \Tmpt" vector, whose elements are squared to construct \Tmpt2" vector. This procedure is repeated after receiving the next sample whose \Tmpt2" vector is added to the previous \Tmpt2". This job is repeated for 3 f s (128 Hz) times. After that, the new \Tmpt2" vector is decomposed to 2 parts due to the number of classes. Each part of the vector consists of 20 elements. The next step is to sum up these elements so that each part of these vectors is divided by its corresponding sum result. The implementation of the logarithmic function is also implemented by the following equivalency:
An overview of di erent stages in SCSSP algorithm is marked by means of black hexagons in Figure 12 . Each stage is a sub-module and designed to dictate a total critical path of one adder and one multiplier for the whole design, which was made possible using an e ective pipelining technique speci cally in the divider and Arctanh operations, in addition to the folding and retiming techniques used in the design. These e orts resulted in an e ective high-speed feature extraction operating at 102 MHz clock speed. The minimum speed of the operational frequency in this module is 560 kHz.
Implementation of normalization
After feature extraction, the features should be normalized. To reach this goal, test features should be subtracted from the average of train features and, then, divided into standard deviation of train features. The block diagram of normalization is illustrated in Figure 13 . The latency of this module is 40 clock periods. The number of clocks taken to generate output for each feature is 27. Since there are 40 features, it can be concluded that the number of total clocks for this part is 40 27 + 40 = 1120. Therefore, the minimum operational frequency of this module is 1120 128 = 143360 Hz. The maximum speed of this module can be up to 267 MHz.
Implementation of MI
After normalization, appropriate features should be chosen. Fortunately, this part does not require any hardware. In a training stage, the indices of features are determined. Therefore, according to the index, the corresponding elements of the feature vectors are selected in a test stage. To explain better, the maximum number of features is set to 40, which is found to work for all subjects. In the training phase, mutual information module nds the best indices and the number of features for every subject. This training is performed in software, as explained previously. Then, those elements of transformation matrix (LDA module), corresponding with those numbers of features which are not chosen by mutual information, will be replaced with zeros (masking). Therefore, those numbers of acceptable features are only calculated in LDA module in the testing phase. In the testing phase, a masking process is performed merely in LDA module in order to select the best-found features out of 40 features. Note that, consequently, altering the number of features does not a ect the hardware at all. In conclusion, no hardware resources are required to implement mutual information module since the features are selected in the training phase.
Implementation of LDA
After choosing the best features, their dimension should be reduced. This reduction can be performed by means of the LDA algorithm, which was discussed previously. In this algorithm, at the train stage, a key matrix is generated. If eigenvectors of this matrix consist of d elements (here, d = 40), then the dimension of this matrix must be (C 1) d, where C is the number of classes. Hence, after multiplying the matrix of selected features by the eigenvectors, the dimension of output vector will be C 1. Herein, due to the classi cation of left-and right-hand movement motorimagery tasks, C is equal to 2 and the output is a scalar number. The block diagram of this multiplication is depicted in Figure 14 . In this module, the critical path consists of one multiplier and one adder, and the maximum speed of this module can be up to 203 MHz. The latency of this module is 40 clock periods; therefore, the minimum operational frequency of this module should be 5128 (40 128) Hz.
Implementation of SVM
The last core in the data ow is the classi er. This core is designed based on the Linear-SVM algorithm. The main reason for choosing the linear classi er is its simplicity and e ciency in hardware implementation. The block diagram of this core is shown in Figure 15 . Based on this gure, the critical path consists of just one multiplier. Therefore, the maximum operational frequency of this module is around 211 MHz. The latency of this module is 2 clock periods; consequently, the minimum limit of the speed of this core is 256 (2 128) Hz.
It is also worth mentioning that implementing non-linear SVM requires adding a non-linear function block to obtain the non-linear function of the dot product between the support vectors and the test vector.
Trade-o of accuracy and hardware resources
As mentioned before, linear SVM is preferred to Bayesian classi er in this work. The detailed comparison of these two classi ers in terms of performance results and hardware resources is reported in Table 2 . [68] , where w's are the weights of SVM, b is the bias, and here x is the output of the previous stage, which is LDA. As can be seen in this table, the Bayesian classi er slightly outperforms linear SVM in accuracy, while its hardware resource usage is signi cantly higher and is also much slower than SVM, making it not suitable for motor imagery systems in which fast decision-making is crucial. The very small improvement in accuracy while causing a signi cant increase in hardware usage and latency is not encouraging. Moreover, note that power is one of the main concerns in BCI systems ful lled by choosing linear SVM over Bayesian classi er. Nevertheless, one can prefer better accuracy if hardware usage is not an important issue.
For showing the trade-o between accuracy and hardware resources better, di erent permutations of settings are experimented by excluding/including the parts from/in the proposed system. The classi cation accuracy as well as the consumed power of the experiments are reported by a Pareto optimal graph illustrated in Figure 16 . The settings of di erent experiments (points) in this gure are listed in Table 3 . Experiments 1 and 9 in this Pareto optimal graph are the experiments detailed in Table 2 . The Pareto optimal graph clearly shows that there exists a trade-o between classi cation accuracy and hardware e ciency, and some settings are the best ones among the possible settings. As can be seen in Figure 16 , experiments 1, 3, and 4 are the Pareto optimal points from which the setting of experiment 1 is chosen as the nal setting in this paper, because accuracy is important in our goal.
Hardware results
In order to capture, analyze, and process the captured data, the following interfaces and tools are used. The Table 3 . The red points connected with the green lines are the Pareto optimal points. utilized evaluation board is Virtex-6 FPGA ML605 Evaluation Platform to which the input data is sent via an Ethernet port. The output signal is tracked and analyzed by Xilinx ChipScope software. The synthesis and analysis of HDL designs are performed using ISE Xilinx 14.5 software. The results of xed-point numbers show that 20 bits are enough for each value of every channel in a way that 11 bits are considered for the decimal part, 8 bits for the integer part, and one last bit for the sign bit. Thus, the number of bits for one test vector is 280 (14 channels 20) . A new test vector is processed every 0.0078s (f s = 128 Hz). Since each person requires three seconds to perform a given task (see Figure 3) , then the number of test vectors for each experiment is equal to 384 (3 128). Each person performs 30 experiments including 15 experiments for the rst class and 15 experiments for the second class (in our captured dataset).
Retiming and folding techniques were used by sacri cing the speed, which is justi able because of low sampling frequency of signal recording. Final structure satis es maximum frequency of 102 MHz and minimum frequency of 560 kHz, according to Table 4 . Table 4 shows the number of Adders, Subtractors, Multipliers, Dividers, and Arctan modules used in the proposed BCI, showing the importance of using folding and retiming techniques. Note that, for multiplications, Xilinx multiplication cores (LogiCORE IP Multiplier v11.2), which are optimized according to inputs of the core, are utilized. The hardware latency and power of each section and the total framework are also reported in Table 4 . Moreover, as seen in Table 4 , a wide range of allowed frequencies is provided in this proposed system, helping to increase frequency and have a satisfying and acceptable total latency even in a large number of channels.
The hardware resource usage is reported in Table 5. It can be seen that the total power of this work, i.e., 83.90 mW, is much less than the total power in [48] , which is 1.067 W. As reported in this table, the proposed system outperforms [47] and [48] signi cantly in terms of the number of LUTs, the number of block RAM/FIFO, the total latency, and the utilized power. Table 4 . Features of sections of the framework. The rst two columns report the valid operational frequency of the proposed BCI. The next four columns represent the detailed resources utilized in di erent sections. The last two columns display latency and power, respectively. Note that the total latency is the spent time started at the end of motor imagery (which is the 6th second according to Figure 3 ) and ended when prediction becomes ready. In this Notice that identical clock frequency (560 kHz f 102 MHz) is used in the sections of nal setup. The latency can be obtained in milliseconds according to selected clock frequency (10 milli-seconds in the worst case and 55.5 micro-seconds in the best case). 1 The numbers of channels in the hardware implementation are 14, 22, and 22 in this work, [47] , and [48] , respectively, and the reported latency and power in this table are set according to these settings. However, note that, according to Table 4 , latency of the proposed system would be 6534 (1=560 kHz) 11:66 ms for N = 22 channels much better than in [47] and [48] . In addition, as explained in paragraph 3 of Section 9, the hardware is not altered signi cantly by changing number of channels; therefore, power does not signi cantly change, too.
It outperforms [48] in the number of DSP blocks, too. The reason of outperforming [48] in the number of DSP blocks and multipliers is having lters with order six, while the order of lters increases signi cantly (to, e.g., 221 and 442) to achieve higher accuracy. On the other hand, the proposed system outperforms [48] in terms of block RAM/FIFO which is because most of the tuned and trained parameters saved in our system are scalars (e.g., in SVM) or vectors (e.g., in LDA) and merely one matrix saved for SCSSP. However, more matrices are required to be saved in [48] , which are the matrices needed for CSP section and covariance matrix in Mahalanobis distance. Moreover, folding and retiming techniques help our system have better power consumption and area. Paper [47] , although has the lower performance rate (72%) on BCI Competition IV dataset (2 classes), its latency is not completely proper for practical applications. Note that, in the proposed system, changing the number of channels does not alter the utilized hardware resources too much because normalization, SVM, and LDA structures are independent of the number of the channels, and merely some wires and bits in registers are added in DC-block, Laplacian lter, Chebyshev lter, and SCSSP. Consequently, the number of RAM/FIFO and DSP blocks do not change, while the number of LUTs blocks increases a little and not signi cantly.
Conclusion
This paper proposed an e cient hardware implementation for BCI systems. The proposed system consists of preprocessing, feature extraction, feature selection, feature reduction, and classi cation stages. Software simulations and hardware limitations guided the method to choose DC-block and surface Laplacian as preprocessing, SCSSP as feature extractor, MI as feature selector, LDA as feature reduction method, and SVM as classi er. Results determined the proper performance of the proposed system, compared to other reported results on the used datasets (67.2% for BCI competition IV, 73.54% for BCI competition III, and 81.9% for captured signals). The system was also eciently implemented and tested on a FPGA hardware platform. The minimum frequency of BCI system was 560 kHz. Moreover, because of using Folding and Retiming techniques, the number of resources (LUTs and block RAM/FIFO) decreased dramatically, on cost of a few additional DSP blocks.
Blinking can have destructive impact on the obtained accuracy. By using an appropriate EOG detection during the mental task, it is possible to increase the signal-to-noise ratio and, thus, accuracy. Moreover, in all these experiments, it was known when the subjects were going to do the mental task.
Predicting when subjects are going to think about the task by extracting ERD/ERS features can be added to the proposed framework (see Section 3.3 of [69] ). 
Biographies

