This paper presents an accurate and robust embedded motor-imagery brain-computer interface (MI-BCI). The proposed novel model, based on EEGNet [1], matches the memory footprint and computational resource requirements of low-power microcontroller units (MCUs), such as the ARM Cortex-M family. Furthermore, the paper presents a set of methods, including temporal downsampling, channel selection, and narrowing of the classification window, to further scale down the model and relax memory requirements with negligible accuracy degradation. Experimental results on the Physionet EEG Motor Movement/Imagery Dataset show that standard EEGNet achieves 82.43%, 75.07%, and 65.07% classification accuracy on 2-, 3-, and 4-class MI tasks in global validation, outperforming the state-of-the-art (SoA) convolutional neural network (CNN) by 2.05%, 5.25%, and 5.48%. Our novel method further scales down the standard EEGNet at a negligible accuracy loss of 0.31% with 7.6× memory footprint reduction and a small accuracy loss of 2.51% with 15× reduction. The scaled models are deployed on a commercial Cortex-M4F MCU, taking 101 ms and consuming 1.9 mJ per inference for the smallest model, and on a Cortex-M7, taking 44 ms and consuming 6.2 mJ per inference for the medium-sized model, enabling a fully autonomous, wearable, and accurate low-power BCI.
I. INTRODUCTION
Brain-computer interfaces (BCIs) aim to provide a communication and control channel based on the recognition of the subject's intentions, e.g., when performing motor imagery (MI), from neural activity typically recorded by noninvasive electroencephalogram (EEG) electrodes [2]. MI-BCI systems are designed to find patterns in the EEG signals and match the signal to the motor motion that was imagined by the subject. Such information could enable communication for severely paralyzed users, control of a wheelchair [3], or assistance in stroke rehabilitation [4].
MI-BCIs are still susceptible to errors, mostly due to high inter- and intra-subject variance in EEG data [5], [6], resulting in low classification accuracy. Traditional methods approach this challenge with robust feature extractors, typically filter bank common spatial pattern (FBCSP) [7] or Riemannian covariances [8], and classify these features with linear discriminant analysis (LDA) or support vector machines (SVMs) [6]. Convolutional neural networks (CNNs) have been proposed as a competitive solution in EEG classification, while requiring fewer parameters to learn and being computationally cheaper in inference than traditional BCI methods [9], [1]. However, today's CNN models are designed to be executed on a CPU or GPU, requiring EEG data to be transmitted from the sensor node to an external compute engine through wired or wireless communication. Due to their computational complexity and resource requirements, those models have been predominantly confined to cloud computing with high-performance computers rather than used in real-world BCI applications, where latency, privacy, and wearability are crucial requirements in addition to accuracy [10], [11]. Recently, a new generation of wearable BCIs has been attracting academic and industrial researchers. An increasing number of battery-operated wearable solutions using microcontroller units (MCUs) have been proposed to bring computing capabilities towards the "edge" and perform real-time near-sensor computation [10], [12], [13], [14]. Edge computing and near-sensor computation offer the following advantages: 1) lower energy consumption for the data transmission between sensors and remote processing; 2) longer battery lifetime; 3) significantly shorter latency compared to remote computation; 4) user comfort; 5) security and privacy improvements, as the data are processed locally and only little information is transmitted wirelessly if necessary.
On the other hand, edge computing poses several challenges when it has to meet the requirement of long-term battery operation, mandatory in wearable devices, while continuously executing complex BCI models (e.g., CNNs) on a low-power processor. For instance, the ARM Cortex-M series is the most popular family of low-power processors used in embedded wearable devices [15]. Those MCUs allow a lifetime of several hours with a small-scale battery, but they have a resource-constrained architecture. For example, an ARM Cortex-M4F processor offers only kilobytes of RAM and millions of operations per second (MOPS) within a power envelope of a few mW [16]. The more recent ARM Cortex-M7 provides even higher performance, up to 300-400 MOPS, at a higher power consumption of a few hundred mW. To achieve the goal of deploying complex and accurate CNN models on these tiny microprocessors, the models need to be re-thought and redesigned with the above-mentioned constraints in mind. Moreover, many researchers have demonstrated for computer vision that reducing the model size with clever network optimization techniques does not always cause a performance degradation [15], [17].
This paper proposes a novel embedded model for MI-BCI that focuses on bringing the next generation of edge BCI on autonomous wearable systems. The main contributions of the paper are as follows:
• We propose a novel embedded MI-BCI model which outperforms the state-of-the-art (SoA) model on the Physionet EEG Motor Movement/Imagery Dataset [18]. The model is based on the EEGNet architecture [1] and achieves a global validation accuracy of 82.43%, 75.07%, and 65.07% on 2-, 3-, and 4-class MI tasks, which is 2.05%, 5.25%, and 5.48% higher than the SoA CNNs [19].
• We further propose methods to reduce the memory footprint for the execution of the model by temporal downsampling, channel reduction, and narrowing down the time window considered for performing one classification, without significant loss in accuracy. This allows us to target low-power embedded devices with very tight constraints.
• We experimentally evaluate the benefits of our model in terms of energy consumption, latency, and accuracy on two different platforms: ARM Cortex-M4F and Cortex-M7. We compare the two platforms executing the inference of 4-class MI with accurate measurements. To the best of our knowledge, no previous work has evaluated MI-BCI on these low-power MCUs using CNNs by considering both runtime and power measurements besides the classification accuracy. Finally, we release the open-source code developed in this work (see footnote 1).
II. RELATED WORK
The recent literature on MI-BCIs is very rich, mostly considering feature extraction and classifiers separately. EEG signals are typically pre-processed using spectral and spatial filters followed by log-energy feature calculation, better known as FBCSP [7], [20]. The multi-spectral features are classified using either LDA, regularized LDA, or SVMs [6].
Alternatively, the feature extractor and classifier can be combined and trained simultaneously with a CNN. Today, CNNs are among the most accurate BCI architectures and have demonstrated impressive performance [19], [9], [1]. Schirrmeister et al. [9] provide an elaborate study on CNN architectures for MI-BCI, where the small Shallow ConvNet achieves an accuracy of 73.59% on the 4-class MI-BCI competition IV-2a [5]. With its temporal and spatial filters followed by square-log activation, Shallow ConvNet can be interpreted as a tunable variant of FBCSP.
Due to the limited amount of data provided in the MI-BCI competition IV-2a dataset, which contains recordings of only 9 subjects with 144 MI-trials per class, a variant of Shallow ConvNet has been trained and validated in [19] on the much larger Physionet EEG Motor Movement/Imagery Dataset [18], with recordings of 109 subjects with 21 MI-trials per class, which is overall ≈2× the amount of MI-trials. The model is trained and validated globally in 5-fold cross-validation (CV) across subjects. It has achieved SoA accuracy of 80.38%, 69.82%, and 58.59% on 2-, 3-, and 4-class MI on that dataset. Additionally, the global models are adjusted for every subject using subject-specific transfer learning (SS-TL), which further improved the accuracy by 6.11%, 9.42%, and 9.92%. The main differences to the original Shallow ConvNet are the use of ReLU instead of square-log activation and the splitting of the final classification layer into two fully connected layers, which increases the number of trainable parameters.
1 https://github.com/MHersche/eegnet-based-embedded-bci
Another smaller, yet robust, CNN architecture is EEGNet [1], which achieved the same accuracy as the winner of the BCI competition IV-2a on 4-class MI [5]. The main difference to Shallow ConvNet is that EEGNet uses fewer feature maps, spatial separable convolutions, and more pooling layers, which reduces the number of weights as well as the feature map sizes. Its flexibility and small size, however, come at the cost of significantly lower accuracy, e.g., 67% for 4-class MI. Efforts have been made to modify EEGNet by changing the pooling layers and expanding the network, achieving 72% accuracy with subject-specific models [21].
Most of these models are evaluated remotely offline, without considering the possibility of bringing the computation closer to the sensors, where the data is acquired. A few studies have shown embedded implementations using traditional MI-BCIs with separate feature extractors [22], [23], [24], but, to the best of our knowledge, no previous work has demonstrated accurate embedded MI-BCI on low-power MCUs using CNNs, which offer better accuracy at lower latency. In this paper, we propose a novel CNN model based on EEGNet to perform MI classification on the Physionet dataset [18]. Our proposed model improves the 4-class MI accuracy by 5.48% on average while reducing the memory footprint by a factor of 4.6× compared to the current SoA on this dataset. In order to target resource-constrained low-power embedded systems, we further study model reduction methods to test the limitations of the proposed model architecture in terms of the sampling frequency, the number of EEG channels, and the length of the input signals. We implement two reduced models that are within the resource constraints of two popular low-power MCUs with ARM Cortex-M4F and M7 and measure the runtime and energy consumption.
III. DATASET DESCRIPTION
We use the publicly available Physionet EEG Motor Movement/Imagery Dataset [18] containing EEG recordings of 109 subjects. Four subjects are discarded due to variability in the number of trials, leaving 105 subjects for the final evaluation. The EEG signals were recorded with the BCI2000 system [25] using 64 channels sampled at 160 Hz. The subjects performed motor movement and MI tasks; in this study, however, we focus solely on the classification of MI. Every subject participated in three runs for MI of left fist (L) against right fist (R), and three runs for MI of both fists (B) against both feet (F). One run lasts 120 s and consists of 14 MI trials according to the timing scheme shown in Fig. 1. This results in 21 trials per class per subject. A baseline run provides resting data (0), where the subjects did not receive any cues for 60 s while having their eyes open. In order to get trials with resting data, we extract windows of 3 s from the baseline run. As done in [19], we distinguish between 2-, 3-, and 4-class MI using L/R, L/R/0, and L/R/0/F MI tasks, respectively.
A. Validation methodology
Fig. 2 illustrates the validation methodology inspired by [19], which distinguishes global from subject-specific validation. The global validation accuracy is determined by 5-fold CV across the subjects, i.e., training on 4/5 of the subjects and validating on the remaining, unseen 1/5 of the subjects. In SS-TL, the global model is further adjusted by doing transfer learning on part of the subject's data and validated on the remainder. This validation is done with 4-fold CV on every subject.
In the example of Fig. 2 , a global model is first trained on subjects S1-S84 and validated on S85-S105, yielding the first fold accuracy of the global validation. SS-TL is then applied for S85-S105 on every subject separately in 4-fold CV, always starting with the global model from S1-S84.
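The subject-wise split of the global validation can be sketched as follows. This is a minimal illustration, not the released code: the function name `global_cv_folds` is hypothetical, and only the first fold (training on S1-S84, validating on S85-S105) is taken from the example above; the ordering of the remaining folds is an assumption.

```python
def global_cv_folds(n_subjects=105, n_folds=5):
    """Return (train, validation) subject-ID lists for subject-wise k-fold CV."""
    assert n_subjects % n_folds == 0, "sketch assumes equally sized folds"
    block = n_subjects // n_folds          # 21 subjects per validation block
    subjects = list(range(1, n_subjects + 1))
    folds = []
    for k in range(n_folds):
        # Fold 0 validates on the last block (S85-S105), as in the Fig. 2 example.
        # Walking backwards through the blocks for later folds is an assumption.
        stop = n_subjects - k * block
        start = stop - block
        validation = subjects[start:stop]
        train = subjects[:start] + subjects[stop:]
        folds.append((train, validation))
    return folds


folds = global_cv_folds()
train, validation = folds[0]
print(f"fold 1: train S{train[0]}-S{train[-1]}, "
      f"validate S{validation[0]}-S{validation[-1]}")
```

Because the split is done over subject IDs rather than over trials, no data of a validation subject is seen during global training; SS-TL would then start from the global model of the fold in which that subject was held out.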
IV. METHODS
This section introduces the novel embedded MI-BCI model proposed in this paper that matches the memory and complexity constraints of low-power MCUs with high accuracy.
We first describe how the compact EEGNet [1] is applied and evaluated on the Physionet Dataset [18], and then propose methods to further reduce the memory requirements of EEGNet.
A. EEGNet on Physionet Dataset
Table I gives further insights into the architecture of EEGNet. Here, the number of input samples $N_s$, the number of EEG channels $N_{ch}$, and the kernel size of the first average pooling are kept variable; they have a direct influence on the number of parameters as well as on the feature map sizes. The last two columns show that EEGNet is indeed very compact: it requires learning only 3,204 weights. However, large feature maps need to be stored during operation. Assuming we need to be able to store at least two consecutive feature maps at any time, the maximum number of stored features is the sum of the input and first layer, i.e., $N_s \times N_{ch} + N_s \times N_{ch} \times 8 = 276,480$ features.
Training and evaluation of EEGNet are implemented using Keras with the TensorFlow (version 1.11) backend. The model is trained with the Adam optimizer for 100 epochs with a batch size of 16 and a fixed learning rate schedule, setting the learning rate to 0.01, 0.001, and 0.0001 at epochs 0, 20, and 50, respectively. In SS-TL, the model is trained for 5 more epochs. All training hyperparameters were determined via 5-fold CV on the training set of the first fold of the global validation set (i.e., S1-S84) for 4-class MI, and kept the same for 2- and 3-class MI as well as all reduced EEGNet configurations.
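The fixed learning rate schedule described above can be written as a small step function. This is a sketch under the stated epoch boundaries; the function name `learning_rate` is our own and is not taken from the released code.

```python
def learning_rate(epoch):
    """Step schedule: 0.01 from epoch 0, 0.001 from epoch 20, 0.0001 from epoch 50."""
    if epoch < 20:
        return 0.01
    if epoch < 50:
        return 0.001
    return 0.0001


# Learning rate at the first epoch and at each boundary of the 100-epoch training.
for epoch in (0, 19, 20, 49, 50, 99):
    print(epoch, learning_rate(epoch))
```

In a Keras training loop, a function of this form would typically be handed to the `LearningRateScheduler` callback, which calls it with the zero-based epoch index at the start of every epoch.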
B. Embedded implementation
For the evaluation on embedded processors, we choose two MCUs from STMicroelectronics: B-L475-IOT01A2 with an ARM Cortex-M4F processor at 80 MHz with 128 KB of SRAM and 1 MB of Flash memory and STM32F756 Nucleo-144 with an ARM Cortex-M7 processor at 216 MHz with 320 KB SRAM and 1 MB of Flash. Both MCUs utilize digital signal processors and floating-point units. We then use the X-CUBE-AI expansion package of STM32CubeMX [26] to deploy the trained models on the selected MCUs.
Based on Table I and considering 32-bit floating-point numbers, the estimated Flash memory needed to store the parameters of the model is around 13 KB, whereas the RAM required for the two largest consecutive feature maps is roughly 1.05 MB. With this configuration, the model cannot be executed on the selected low-power MCUs. As mentioned in the previous subsection, the output of the first layer requires most of the memory. To overcome this limitation, we reduce the input data size by:
1) downsampling the EEG data in the time domain. MI activities cause brain oscillations mainly within the µ (8-14 Hz) and β (14-30 Hz) bands [9]. Some CNNs have been shown to learn temporal filters that cover the gamma (71-91 Hz) band; however, the main information was still extracted from µ and β oscillations [19]. Therefore, we downsample the signals by a factor of ds=2 or ds=3, which restricts us to maximal oscillations of 40 Hz or 26 Hz, respectively. The temporal filter and the pooling kernel sizes are scaled to $N_f = 128/ds$ and $N_p = 8/ds$. This way, the network is expected to learn similarly to the original model after the depthwise convolution, independent of the downsampling factor.
2) using a subset of electrode channels. The BCI2000 system conforms to the 10-10 international system electrode placement with $N_{ch}=64$ electrodes. We reduce the number of electrodes to $N_{ch}=19$ by taking the widely used 10-20 international system electrode placement, from which we exclude A1 and A2. As an intermediate configuration, we add channels so that the whole region of the brain is covered equally, reaching $N_{ch}=38$ electrodes. We also investigate the case with only 8 electrodes based on the EEG headset by Bitbrain. Fig. 4 shows the 64-, 38-, 19-, and 8-electrode configurations.
3) decreasing the time window of the signal used for each classification. We reduce the input signal from 3 s to 2 s or 1 s after the start of the MI cue (see Fig. 1). Notably, this approach reduces the delay of the system in addition to the model size.
We study the impact of each reduction approach on the classification accuracy, testing different configurations to choose the best combination in terms of accuracy and memory footprint for further deployment on both selected MCUs.
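The memory estimate and the effect of the three reduction methods can be reproduced with a few lines of arithmetic. The sketch below assumes the feature count given earlier (input plus 8 first-layer feature maps, 32-bit floats) and a 160 Hz sampling rate; the helper names and the floor rounding of the number of samples after downsampling are our assumptions.

```python
FS_HZ = 160            # sampling rate of the dataset
FIRST_LAYER_MAPS = 8   # feature maps produced by the first layer
BYTES_PER_FLOAT32 = 4


def input_samples(window_s, ds):
    """Samples per channel after downsampling (floor rounding is an assumption)."""
    return (window_s * FS_HZ) // ds


def feature_memory_bytes(n_ch=64, window_s=3, ds=1):
    """RAM for the input plus the first-layer output, stored as float32."""
    n_s = input_samples(window_s, ds)
    features = n_s * n_ch + n_s * n_ch * FIRST_LAYER_MAPS
    return features * BYTES_PER_FLOAT32


def nyquist_hz(ds):
    """Highest representable frequency after downsampling by ds."""
    return FS_HZ / ds / 2


standard = feature_memory_bytes()            # 64 channels, 3 s, ds=1
print(standard, "bytes =", round(standard / 2**20, 2), "MiB")
print("parameters:", 3204 * BYTES_PER_FLOAT32, "bytes of Flash")
print("Nyquist at ds=2 and ds=3:", nyquist_hz(2), round(nyquist_hz(3), 1))

# An arbitrary combination of the studied settings, for illustration only.
reduced = feature_memory_bytes(n_ch=38, window_s=2, ds=3)
print("reduction factor:", round(standard / reduced, 1))
```

Since the footprint is linear in each of the three quantities, the reductions multiply: halving the window, the sampling rate, and the channel count together would shrink the feature memory by roughly 8×.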
V. EXPERIMENTAL RESULTS
This section assesses the proposed methods on the Physionet Motor Movement/Imagery Dataset. We measure the classification accuracy as the ratio of correctly classified trials to the total number of trials.
A. Global vs. Subject-specific MI Classification
Table II compares the average classification accuracy of the global and subject-specific models of EEGNet with the baseline CNN proposed in [19]. In global validation, EEGNet outperforms the baseline CNN by 2.05%, 5.25%, and 5.48% on 2-, 3-, and 4-class MI, respectively. EEGNet does not improve as significantly as the baseline CNN when applying SS-TL: the accuracy increases by 1.89%, 5.00%, and 5.76% on EEGNet and by 6.11%, 9.43%, and 9.92% on the baseline CNN. Due to the already high accuracy of EEGNet in global validation, however, the accuracy in subject-specific validation is still 0.82% and 2.32% higher than the baseline CNN in 3- and 4-class MI, and only 2.17% lower in 2-class MI.
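The subject-specific comparison in this subsection follows from adding the reported SS-TL gains to the global accuracies. The short check below uses only numbers quoted in the text (global accuracies of EEGNet and of the baseline CNN, and the respective SS-TL improvements); the variable names are illustrative.

```python
# Global validation accuracy (%) for 2-, 3-, and 4-class MI, as quoted in the text.
global_acc = {
    "EEGNet": [82.43, 75.07, 65.07],
    "baseline": [80.38, 69.82, 58.59],
}
# Accuracy gain (%) from subject-specific transfer learning (SS-TL).
ss_gain = {
    "EEGNet": [1.89, 5.00, 5.76],
    "baseline": [6.11, 9.43, 9.92],
}

# Subject-specific accuracy = global accuracy + SS-TL gain.
ss_acc = {
    model: [round(g + d, 2) for g, d in zip(global_acc[model], ss_gain[model])]
    for model in global_acc
}

# Positive values mean EEGNet is ahead of the baseline after SS-TL.
ss_margin = [round(e - b, 2) for e, b in zip(ss_acc["EEGNet"], ss_acc["baseline"])]

for n_classes, margin in zip((2, 3, 4), ss_margin):
    print(f"{n_classes}-class: EEGNet minus baseline = {margin:+.2f}%")
```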

B. EEGNet Model Reduction
Table III studies the impact on the classification accuracy in 2-, 3-, and 4-class MI on global validation when reducing EEGNet by temporal downsampling, channel reduction, and narrowing the classified time window. Only one reduction approach is applied at a time; the remaining configurations are kept according to the standard EEGNet.
As expected, downsampling has a negligible effect on the accuracy, with a maximum decrease among all MI tasks of 0.32% and 1.25% at downsampling factors ds=2 and ds=3, respectively. Even though part of the β band is ignored at ds=3 due to the cut-off frequency of the anti-aliasing filter at ≈26 Hz, the accuracy does not drop significantly. This result confirms that the significant information in this dataset is contained in the α and lower β bands for the MI task. When reducing the number of EEG channels to $N_{ch}=38$, the accuracy decreases only marginally, by a maximum of 0.95%. However, further reduction to $N_{ch}=19$ and $N_{ch}=8$ significantly affects the performance, with a maximum accuracy decrease of 2.66% and 6.52%, respectively. Similar trends can be seen when narrowing down the time window used for one classification: the accuracy decreases by a maximum of 1.62% with a temporal window of T=2 s, and by 3.6% with T=1 s. As already mentioned in the previous section, narrowing the time window brings the additional advantage of shorter classification delays and thus provides a trade-off between accuracy and delay to be chosen by the user.
Next, we test all combinations of reduction methods in order to find the best configuration in terms of accuracy vs. memory footprint. Fig. 5 shows the global 4-class accuracy of all reduction combinations, excluding the $N_{ch}=8$ configuration due to the large drop in accuracy. We consider only the memory footprint required to store the input and first-layer features with 32-bit floating-point representation, since the number of features is two orders of magnitude higher than the number of parameters, as pointed out in Table I. As delay might be an additional constraint for model selection, the configurations are marked according to the time window used for classification. EEGNet outperforms the baseline CNN in most configurations: it has at least 4.6× lower memory footprint (i.e., <1.05 MB vs. 4.80 MB), while achieving higher classification accuracy in most cases. We select two EEGNet configurations on the Pareto-optimal curve, which satisfy the
C. MCU Implementation
We deploy the selected models using STM32CubeMX v5.3 with X-CUBE-AI 5.0.0 package extension on B-L475-IOT01A2 with an ARM Cortex-M4F processor and STM32F756 Nucleo-144 with an ARM Cortex-M7 processor and measure the power consumption with a Keysight N6705C power analyzer. In order to have optimal performance, we enable the core instruction and data caches and ART accelerator sub-system to speed up instruction fetch accesses. We deploy Model 1 on both Cortex-M4F and M7 for comparison, and Model 2 only on Cortex-M7 due to memory constraints.
The Cortex-M7 offers the highest performance in the ARM Cortex-M processor family. In fact, as can be seen in Table IV, it takes around 6 cycles per multiply-and-accumulate (MACC) operation, which is around 1.8× faster than the Cortex-M4F. However, its power consumption is 2.4× higher than that of the Cortex-M4F at the same frequency of 80 MHz. The Cortex-M7 can run at up to 216 MHz, which reduces the latency of Model 1 by a factor of 4.9× compared to the Cortex-M4F, at the price of 1.4× higher energy consumption. Model 2 has the lowest accuracy loss (i.e., 0.31%) after model reduction, but it fits only on the Cortex-M7 processor. Running at the highest frequency, it takes around 44 ms and 6.2 mJ per inference.
TABLE III: Classification accuracy (%) on 4-class MI using global validation. The standard EEGNet (Fig. 3) is reduced either by downsampling in time domain, reducing the number of channels, or narrowing the time window for a single classification.
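The per-inference figures reported for the two MCUs can be related to average power and cycle counts with simple unit conversions. The sketch below uses the latency, energy, and clock values quoted in the text (101 ms and 1.9 mJ on the Cortex-M4F at 80 MHz; 44 ms and 6.2 mJ on the Cortex-M7 at 216 MHz); the helper names are ours, and the derived values are back-of-the-envelope estimates rather than measurements.

```python
def average_power_mw(energy_mj, latency_ms):
    """Average power during one inference: mJ / ms = W, scaled to mW."""
    return energy_mj / latency_ms * 1000.0


def cycles_per_inference(latency_ms, clock_mhz):
    """Clock cycles spent on one inference: ms * MHz = 1000 cycles."""
    return latency_ms * clock_mhz * 1000


# (latency in ms, energy in mJ, clock in MHz) as quoted in the text.
measurements = {
    "Cortex-M4F, smallest model": (101, 1.9, 80),
    "Cortex-M7, medium-sized model": (44, 6.2, 216),
}

for name, (latency_ms, energy_mj, clock_mhz) in measurements.items():
    power = average_power_mw(energy_mj, latency_ms)
    cycles = cycles_per_inference(latency_ms, clock_mhz)
    print(f"{name}: {power:.1f} mW average, {cycles / 1e6:.2f} M cycles")
```

The two rows refer to different models, so they illustrate the latency versus energy trade-off between the platforms rather than a like-for-like comparison.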
VI. CONCLUSION
We propose an embedded model based on EEGNet for low-power MI-BCIs. The proposed model achieves 2.05%, 5.25%, and 5.48% higher classification accuracy than the SoA CNN on 2-, 3-, and 4-class MI, while requiring 4.6× less memory for storing the features during the execution of the model. We reduce the input feature map by downsampling in the temporal and spatial domains as well as by narrowing down the time window, which relaxes the memory requirements by 7.6× at 0.31% accuracy loss, and by 15× at 2.51% loss. We demonstrate the performance of the proposed models on two commercial MCUs. In particular, the implemented models execute in 44 ms consuming 6.2 mJ per inference on an ARM Cortex-M7 and in 101 ms using 1.9 mJ on an ARM Cortex-M4F processor, making them suitable for a battery-operated real-time wearable system that continuously performs online MI classification.
