Abstract-Real-time biosignal classification in power-constrained embedded applications is a key step in designing portable e-health devices, which require hardware integration along with concurrent signal processing. This paper presents an application based on a novel biomedical System-on-Chip (SoC) for signal acquisition and processing that combines a homogeneous multi-core cluster with a versatile bio-potential front-end. The presented implementation acquires raw EMG signals from 3 passive gel electrodes and classifies 3 hand gestures using a Support Vector Machine (SVM) pattern recognition algorithm. Performance matches state-of-the-art high-end systems in terms of both recognition accuracy (> 85%) and real-time execution (gesture recognition time ≪ 300 ms). The power consumption of the employed biomedical SoC is below 10 mW, outperforming implementations on commercial MCUs by a factor of 10 and ensuring a battery life of up to 160 hours with a common 1600 mAh Li-ion battery.
I. INTRODUCTION
The fast growth of miniaturized and efficient electronics enables the development of unobtrusive and portable personal health care systems. Electrode-based biosignals such as ECG, EEG and EMG (ExG), which result from the underlying physiological body activity, make it possible to infer biomedical parameters linked to the subject's health status. ExG applications range from consumer electronics for fitness and wearable applications to medical-grade devices enabling patient monitoring and rehabilitation. For instance, ECG monitoring is employed in smartwatches for fitness heartbeat monitoring [1], pacemakers [2] and Holter devices [3]. EEG-based systems are used in medical applications for the treatment of neurological disorders such as Parkinson's disease [4], epilepsy [5] and spinal cord injuries [6], [7], [8], as well as for the detection of attention loss, drowsiness [9] and autism [10]. Implantable intramuscular EMG recording devices have been explored in research [11], but a significant part of EMG applications is devoted to wearable surface EMG devices thanks to their unobtrusiveness, e.g., controllers of upper limb prostheses [12] and hand gesture recognition systems intended for consumer human-machine interaction [13]. Surface EMG signals are the superposition of the electrical activity of the underlying muscles when contractions occur [14]. Typical amplitudes range from ±10 µV to ±10 mV with a maximum bandwidth of 2 kHz, depending on the size of the contracting muscle, on the distance between the fibers and the electrodes, and on the properties of the latter.
978-1-5090-6707-7/17/$31.00 ©2017 IEEE
EMG signals can be processed to recognize the user's intended gestures by analyzing muscular activation patterns. Due to the intrinsic variability of biosignals, a machine learning approach is needed to reach high recognition accuracy. However, this requires substantial computational capabilities that have only become available on wearable platforms in recent years. In fact, early studies on EMG gesture recognition were performed offline on benchtop platforms [15], while the remarkable advances in digital low-power design and in efficient computational architectures [16], [17] have made it possible to implement real-time systems that execute algorithms such as advanced filtering [18], dimensionality reduction [17] and pattern recognition [19]. Nevertheless, the design of an efficient real-time system for low-power EMG acquisition and processing is still an open challenge that requires a multilevel approach, ranging from the design of the Analog Front-End (AFE) and of an efficient digital processor to the system-level architecture and the algorithms. Various single-purpose and general-purpose biomedical ICs have been reported in the literature; however, none of them covers the requirements of the presented application in terms of number of channels or processing power.
In this paper, we propose the implementation of real-time acquisition, processing and classification of a 3-channel EMG signal based on a biomedical SoC [20] platform. It supports up to 9 passive electrode-based medical-grade ExG channels, each configurable to trade off unnecessary precision and bandwidth for power. Furthermore, it integrates a homogeneous multi-core digital processor in which energy efficiency is increased by voltage and frequency scaling, while computation power can be recovered by parallel computation. This solution provides all the hardware required for hand gesture recognition on a single chip, while maintaining the versatility of a general-purpose programmable multi-core platform. Our SoC-based platform executes the EMG classification in 560 µs, far below the real-time requirement of 300 ms for upper limb prosthetic controllers [21], and reduces power consumption by more than 10x with respect to current commercial solutions based on the STMicroelectronics Cortex-M4 [21].
II. HARDWARE IMPLEMENTATION

A. The Biomedical SoC Architecture
The biomedical SoC used for this experiment is based on the design developed at ETH Zurich [20]. The chip includes a quad-core microcontroller (MCU) equipped with 9 electrode-based analog channels for ExG signal amplification with Electrode Impedance Measurement (EIM) for active lead-off detection. The parallel MCU thus provides an efficient means to process sparse biomedical signals on-chip in real time and in a concurrent fashion. The time-multiplexed ADC maximizes hardware reuse of the constantly biased analog circuits, while its Successive Approximation Register (SAR) architecture allows faster analog channel switching and guarantees precise synchronization among the channels. In addition, 1.5 V and 3.3 V capacitor-free Low-DropOut regulators (LDOs) [22] decouple the sensitive analog supplies and allow unused analog circuits (not shown in Fig. 2) to be power-gated. A flexible clock-division and clock-distribution scheme allows the clock signals to be tuned independently to lower the dynamic power due to gate switching.
1) MCU:
The employed MCU is an implementation derived from the PULP (Parallel Ultra-Low-Power) platform [23], [24]. It comes with 4 general-purpose OpenRISC cores sharing the level 1 (L1) memory, a tightly coupled data memory (TCDM). L1 data can be accessed by all the cores within a single cycle through a logarithmic interconnect, so that no explicit copies are required to exchange data between cores. Unused cores are set to a clock-gated idle mode to save dynamic power, and if all the cores are in idle mode the whole cluster region is clock-gated as well. Leakage power (25 µW) is eliminated by power-gating the cluster region when it has been idle for more than 40 ms. Besides standard communication interfaces (JTAG, UART, SPI, I2C, GPIO), the MCU peripheral region is equipped with an ADC readout block as the point of entry for incoming digitized signals. The raw data stream is usually buffered directly in the L2 memory, routed via a Peripheral Direct Memory Access (PDMA) without requiring core intervention. Once the buffers are filled, the cluster and/or the cores are woken up through interrupt requests handled by the event unit. To optimize power savings, the MCU cluster and the MCU peripherals belong to different domains operating at different voltages and clock frequencies. Reliable signal domain crossing is guaranteed by the insertion of level shifters and FIFO synchronizers. At startup, a boot ROM loads the program code from external flash memory to the L2 memory via SPI, while applications have memory-mapped access to all the MCU peripherals and the analog configuration registers.
2) Analog Readout:
The low-power (150 µW) Analog Readout (AR) consists of a low-noise (IR = 1.09 µV, BW = 150 Hz) Instrumentation Amplifier (IA) followed by a Low Pass Filter (LPF) that reduces out-of-band components and prevents aliasing due to the subsequent time sampling. Both the AR gain and the cut-off frequencies are selectable over a wide range to cover many applications, while chopping modulation of the analog signal is applied to avoid in-band flicker noise (1/f). The EIM for lead-off detection can be activated by injecting high-frequency currents into the electrodes and measuring the resulting electrode voltage drop, which is inversely proportional to the quality of the electrode-skin interface. Field measurements with electrodes connected to a human subject often experience a large (±300 mV) wandering baseline at the AR input, causing the IA to saturate. AC coupling is not a viable solution because of the large capacitors it requires. Our approach uses a current-steering DAC to compensate for the large offsets, with a digitally controlled servo loop that is activated once IA analog saturation is detected.
3) SAR-ADC:
In this fully differential SAR-ADC design [25], the sampling/settling time and the bias current of the analog buffers are programmed for each of the successive-approximation steps. A majority voting mechanism after the SAR-ADC dynamic comparator is employed to reduce the thermal noise floor, while Cascaded Integrator-Comb (CIC) filters lower it further while performing downsampling. A calibration algorithm compensates for the capacitor-array mismatches, boosting the overall ADC resolution by 1.8 Effective Number of Bits (ENOB) with a minor power penalty. The high configurability of the SAR-ADC allows power to be saved by trading off speed and precision so that the application requirements are just met. The ADC sampling frequency (fs up to 286 kHz) and precision (ENOB up to 13.5 bits) can thus be finely controlled on the fly by the MCU over an asynchronous low-latency (3 clock cycles) 32-bit configuration bus.
B. System Platform
Fig. 3 shows an image of the system platform used. The board is based on an 8-layer PCB and includes the biomedical SoC with the integrated 9-channel AFE and a Bluetooth (BT/BTLE) link, consisting of an MSP430 running the BT protocol stack and an RF transceiver for communication with a host device. Furthermore, the board carries a non-volatile flash memory containing the application of the biomedical SoC and an SD card for dumping data for debugging and/or further processing. The board can be powered either by a battery with a supply between 3.7 and 4.2 V or by an external 5 V power supply. The on-board power management provides stable output voltages for the biomedical SoC (0.9 V, 1.5 V and 3.3 V). The board configuration specific to this application allows the acquisition of 3 differential EMG channels.
III. BIOMEDICAL SoC FIRMWARE
The presented platform is designed to cover a variety of use cases with different requirements and applications. Nevertheless, a considerable part of the firmware, e.g. the code interfacing the internal SoC peripherals, is very similar across all applications and can therefore be shared. This leads to the firmware framework shown in Fig. 4 a), composed of hardware-related code called the runtime, a generic application part shared between applications, and a specific part comprising the application-specific algorithms. The runtime provides simplified access to the various hardware blocks of the biomedical SoC, e.g. for employing the DMA or changing the gain of an AFE channel. It includes the firmware layers that directly interface the biomedical SoC hardware, namely the Architectural layer (ARCHI) and the Hardware Abstraction Layer (HAL), as well as the drivers and their respective Application Programming Interfaces (APIs), which define simplified interfaces to hardware blocks such as the GPIOs or the various serial interfaces. By using the runtime, the user can focus on the application-specific parts of the firmware and access optimized drivers through low-complexity APIs tailored to the user's needs. The generic part of the application is in charge of setting up the SoC according to the application's needs and of configuring and operating the interfaces to the SoC peripherals by using the aforementioned APIs. This part is mainly deployed on one of the cluster's cores and may not differ significantly from one application to another. The specific part represents the core functionality of the application with its bespoke algorithms and may be distributed over multiple cores. This enables parallel as well as concurrent computation and allows the algorithms to be highly optimized with respect to core frequency and therefore power consumption.
The programming model, i.e. how the parts of the firmware are deployed among the cores, is implemented as follows: the generic and the specific part of the application are strictly separated by deploying them on different cores. The presented hand gesture recognition application makes use of 2 out of the 4 cores of the cluster, as shown in Fig. 4 b). The cores not used by the application are clock-gated and available for further computation. Each of the employed cores starts up by initializing the resources needed for its assigned tasks and then stops, i.e. is clock-gated, at a synchronization barrier. Once every core has reached its respective barrier, the clock is enabled and the application execution resumes. The first core of the cluster, Core 0, is assigned the part of the firmware that configures the SoC's processing unit as well as the AFE. Afterwards, the core configures the PDMA to transfer the acquired samples from the AFE to the L2 memory and the DMA to copy them into the L1 memory, where they are distributed to separate buffers for each channel. Concurrently, the second core employed by the application processes the acquired and buffered samples according to the algorithms described in Section IV.
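The two-core programming model can be mimicked on a host machine with threads. The following Python sketch is only an analogy under stated assumptions: threads stand in for cores, a bounded queue stands in for the L1 buffers filled by the DMA, and the function name `run_two_core_model` and the per-block mean-absolute-value computation are hypothetical placeholders, not part of the firmware runtime.

```python
import queue
import threading


def run_two_core_model(blocks):
    """Emulate the acquisition core and the processing core with two threads."""
    startup_barrier = threading.Barrier(2)   # both "cores" wait here after init
    buffers = queue.Queue(maxsize=2)         # stand-in for the L1 sample buffers
    results = []

    def core0_acquire():
        # Initialization of the acquisition path would happen here.
        startup_barrier.wait()
        for block in blocks:
            buffers.put(list(block))         # stand-in for a DMA transfer
        buffers.put(None)                    # end-of-stream marker

    def core1_process():
        # Initialization of the processing kernels would happen here.
        startup_barrier.wait()
        while True:
            block = buffers.get()
            if block is None:
                break
            # Placeholder computation: mean absolute value of the block.
            results.append(sum(abs(s) for s in block) / len(block))

    threads = [threading.Thread(target=core0_acquire),
               threading.Thread(target=core1_process)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The barrier reproduces the start-up synchronization described above, and the bounded queue lets acquisition and processing overlap, as the two cores do on the cluster.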
IV. HAND GESTURE RECOGNITION APPLICATION
The block diagram depicted in Fig. 5 shows the computational steps of the biomedical signal processing chain, consisting of 4 main kernels, i.e. Notch Filter, Offset Compensation, Envelope Extraction and SVM.
1) Notch filter:
The processing chain starts with an IIR notch filter that eliminates the interference caused by the AC frequency of the power lines, which is one of the most significant noise sources in biosignal processing. In this application, a third-order notch filter with a Q-factor of 50 is implemented. The filter coefficients are calculated offline and allocated in the L1 memory as constants.
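As an illustration of this filtering step, the following Python sketch implements a notch filter as a single second-order IIR section. It is not the third-order filter of the application: the biquad structure, the coefficient formulas (the widely used audio-EQ cookbook notch), the 50 Hz mains frequency and the names `notch_coefficients`, `Biquad` and `steady_state_peak` are assumptions made for the example. Only the Q-factor of 50 and the 1 kS/s sampling rate come from the paper.

```python
import math


def notch_coefficients(f0_hz, fs_hz, q):
    """Return normalized (b, a) coefficients of a second-order notch at f0_hz."""
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


class Biquad:
    """Direct-form I second-order section processing one sample per call."""

    def __init__(self, b, a):
        self.b = b
        self.a = a
        self.x1 = self.x2 = 0.0
        self.y1 = self.y2 = 0.0

    def step(self, x):
        b0, b1, b2 = self.b
        _, a1, a2 = self.a
        y = b0 * x + b1 * self.x1 + b2 * self.x2 - a1 * self.y1 - a2 * self.y2
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y


def steady_state_peak(tone_hz, fs_hz=1000.0, seconds=5.0):
    """Filter a unit-amplitude tone and return the peak of the last 500 outputs."""
    b, a = notch_coefficients(50.0, fs_hz, 50.0)   # assumed 50 Hz mains, Q = 50
    filt = Biquad(b, a)
    n = int(seconds * fs_hz)
    out = [filt.step(math.sin(2.0 * math.pi * tone_hz * k / fs_hz))
           for k in range(n)]
    return max(abs(v) for v in out[-500:])
```

In an embedded setting the coefficients would be computed offline, as stated above, and only the `step` recurrence would run per sample.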
2) Offset Compensation:
The input signals can be affected by a voltage offset caused by motion artifacts of the subject. This offset must be periodically evaluated and removed: hence, on each EMG channel a moving average is applied, by implementing a circular buffer to keep track of the last n samples. After an evaluation of the trade-off between performance and memory requirements, the length of the averaging window is fixed to n = 60. The output of the Offset Compensation kernel is the absolute value of the sample minus the mean value of the sliding window.
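A minimal sketch of this kernel in Python is given below, assuming that the averaging window includes the current sample and that the mean is taken over the samples received so far until the buffer is full. Both choices, as well as the class name `OffsetCompensator`, are assumptions made for the example. Floating-point values are used for readability.

```python
class OffsetCompensator:
    """Moving-average offset removal over the last `window` samples."""

    def __init__(self, window=60):
        self.buf = [0.0] * window   # circular buffer of recent samples
        self.idx = 0                # position of the oldest sample
        self.count = 0              # number of valid samples in the buffer
        self.total = 0.0            # running sum of the buffer contents

    def step(self, x):
        # Replace the oldest sample and update the running sum in O(1).
        self.total += x - self.buf[self.idx]
        self.buf[self.idx] = x
        self.idx = (self.idx + 1) % len(self.buf)
        self.count = min(self.count + 1, len(self.buf))
        mean = self.total / self.count
        # Output: absolute value of the sample minus the window mean.
        return abs(x - mean)
```

Keeping a running sum avoids re-adding all n values for every new sample, so the cost per sample stays constant regardless of the window length.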
3) Envelope Extraction:
Once the signal is offset-compensated, we extract its temporal envelope by calculating the Root Mean Square (RMS). The RMS is computed on the last n = 60 values with a 1-sample sliding window. The output of this kernel is thus the feature vector, which is the input of the SVM algorithm.
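The sliding RMS can be sketched in Python as follows, assuming a zero-initialized window and one envelope tracker per EMG channel. The names `RmsEnvelope` and `feature_vector` are hypothetical, and floating-point values are used for readability.

```python
import math


class RmsEnvelope:
    """RMS over the last `window` samples, updated one sample at a time."""

    def __init__(self, window=60):
        self.buf = [0.0] * window   # circular buffer of squared samples
        self.idx = 0
        self.sum_sq = 0.0           # running sum of squares

    def step(self, x):
        sq = x * x
        self.sum_sq += sq - self.buf[self.idx]
        self.buf[self.idx] = sq
        self.idx = (self.idx + 1) % len(self.buf)
        # Guard against a slightly negative sum caused by rounding.
        return math.sqrt(max(self.sum_sq, 0.0) / len(self.buf))


def feature_vector(envelopes, samples):
    """Build one feature vector from one new sample per channel."""
    return [env.step(s) for env, s in zip(envelopes, samples)]
```

With 3 channels, each call to `feature_vector` yields a 3-element vector, one RMS value per channel.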
4) Support Vector Machine:
Among pattern recognition algorithms, SVM has the main advantages of being theoretically robust and efficiently implementable [26]. SVM is a supervised learning algorithm belonging to the framework of statistical learning, and its goal is to find the optimal separating hyperplane between 2 classes of vectors. This decision boundary, defined during the training stage, is obtained from a subset of vectors of the input space, named Support Vectors (SVs). The size of the model is variable and depends on multiple factors such as the nature of the data, the quality of the training set and the pre-processing capabilities. The data used to test the performance of the application led to a model composed of 31 SVs. In this application, all the SVs fit in the fast-access L1 memory of the chip, but for larger models it is possible to load the vectors from the larger L2 memory using the DMA and double buffering.
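The classification step can be sketched in Python as a kernel expansion over the support vectors followed by pairwise voting among the 3 classes. Several points are assumptions for illustration only: the RBF kernel (the paper does not state which kernel is used), the one-vs-one voting scheme, the dictionary layout of the models, and the toy support vectors in `PROTOTYPES`, which are not the 31 SVs of the trained model. Floating-point arithmetic is used here rather than the fixed-point arithmetic of the embedded implementation.

```python
import math
from itertools import combinations


def rbf_kernel(u, v, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(model, x, gamma):
    """Binary SVM decision function: sum_i coef_i * K(sv_i, x) + bias."""
    total = sum(coef * rbf_kernel(sv, x, gamma) for coef, sv in model["svs"])
    return total + model["bias"]


def classify(models, x, gamma, n_classes=3):
    """One-vs-one voting over all class pairs; returns the winning class index."""
    votes = [0] * n_classes
    for i, j in combinations(range(n_classes), 2):
        winner = i if decision_value(models[(i, j)], x, gamma) > 0.0 else j
        votes[winner] += 1
    return votes.index(max(votes))


# Hypothetical toy model: one support vector per class and per pair.
# Classes: 0 = rest, 1 = open hand, 2 = power grasp (3 RMS features each).
PROTOTYPES = {0: (0.1, 0.1, 0.1), 1: (0.2, 0.2, 0.9), 2: (0.9, 0.9, 0.2)}
TOY_MODELS = {
    (i, j): {"svs": [(1.0, PROTOTYPES[i]), (-1.0, PROTOTYPES[j])], "bias": 0.0}
    for i, j in combinations(range(3), 2)
}
```

For example, `classify(TOY_MODELS, (0.85, 0.9, 0.25), gamma=1.0)` selects class 2 of the toy model. The cost per classification grows linearly with the number of support vectors, which is why the model size determines whether the SVs fit in the L1 memory.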
V. APPLICATION RESULTS
The performance of the proposed system was compared against the previously presented solution [12] in terms of real-time algorithm execution and of power consumption. One right-handed healthy subject with no history of neurological or psychiatric disorders participated in the experiment. The 3 fully differential EMG channels are placed on the forearm over the flexor carpi radialis, the flexor carpi ulnaris and the extensor digitorum communis. Each EMG channel is sampled at 1 kS/s, which has been shown to be suitable for gesture recognition applications [27]. During the training phase, the subject executed the gestures holding each contraction for 3 seconds and separating consecutive gestures with 3 seconds of muscular relaxation. Each gesture is repeated 5 times, and the collected data are streamed to a PC, where they are acquired and segmented to calculate the Support Vectors (SVs) of the SVM model. The set of gestures includes the power grasp (closed fist), the open hand and the rest position. They correspond to the 3 classes recognized by the SVM algorithm. On the PC side, the classification algorithm is libSVM [28], a multi-platform implementation of the SVM, while on the embedded platform we adapted the code to enable classification with fixed-point arithmetic. By virtue of the multi-platform SVM implementation, the accuracy of the gesture recognition is evaluated offline, reaching 88% correct gesture classifications and showing results comparable with our previous work [27]. We also compared the fixed-point and floating-point implementations, which show the same performance in terms of accuracy. The real-time execution of the application on a timeline is shown in Fig. 6, whereby overall core activity a) and respective activity of Table I.
Considering 300 ms [12] as the threshold for real-time gesture recognition in a hand controller, the proposed system is well within this requirement. Power consumption is another key point in designing an efficient wearable system. Fig. 6 (c) shows the power breakdown of the analog and digital circuitry of the SoC. The system runs with an external clock of 48 MHz provided by the control MCU of the board. The power consumption of the digital part of the biomedical SoC results from Pperipherals + PIO + Pcluster. The peripherals and the cluster are supplied with 0.9 V and draw 2.47 mA and 3.41 mA respectively, for a total power consumption of 5.29 mW. The IOs are supplied with 3.3 V and draw a current of 486 µA, leading to a power consumption of 1.6 mW. The analog circuitry (Panalog) is supplied with 1.5 V and 3.3 V, with a total power consumption of 2.54 mW. The measured power of the biomedical SoC is thus 9.43 mW. The proposed system gains more than 10x in terms of energy efficiency with respect to the previous work [21], where the current consumption of the Cortex-M4 is 40 mA (238 µA/MHz) at 2.5 V for the processing part, with the microcontroller clocked at 168 MHz. In addition, the external AFE draws 15 mA at 1.2 V, resulting in an overall power consumption of 118 mW.
Table I footnotes: a) considering execution of pre-processing kernels on 3 acquisition channels; b) shared by multiple kernels.
VI. CONCLUSION
A real-time implementation of the acquisition, processing and classification of EMG signals on a wearable system is presented. A novel biomedical SoC combining a versatile 14-bit ADC and a multi-core digital processor based on the OpenRISC architecture is introduced as the main component of this system, along with the signal pre-processing and a fixed-point implementation of the libSVM algorithm for hand gesture recognition based on 3 EMG channels sampled at 1 kHz.
The proposed implementation is shown to be appropriate for a biomedical application and to comply with both the power and the timing constraints implied by such a system. The time needed to process one sample through the entire processing chain (< 560 µs) fully satisfies the real-time constraint, and the overall power consumption of 9.43 mW, including cluster, peripherals and IOs, outperforms the real-time implementation on a current commercial state-of-the-art Cortex-M4 based platform by a factor of 10, leading to a significant advantage in battery lifetime. Given the versatility and proven energy efficiency of the system, further biomedical applications can be developed quickly using the same platform and firmware runtime.
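The power figures reported in Section V can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the supply voltages, currents and analog power quoted in the text; the helper `power_mw` and the variable names are introduced here for illustration.

```python
def power_mw(voltage_v, current_ma):
    """Power in mW from a supply voltage in V and a current in mA."""
    return voltage_v * current_ma


# Biomedical SoC, values as reported in Section V.
p_digital = power_mw(0.9, 2.47 + 3.41)   # peripherals + cluster at 0.9 V
p_io = power_mw(3.3, 0.486)              # IOs at 3.3 V, 486 uA
p_analog = 2.54                          # analog circuitry, reported in mW
soc_total = p_digital + p_io + p_analog

# Reference platform: Cortex-M4 processing part plus external AFE.
reference_total = power_mw(2.5, 40.0) + power_mw(1.2, 15.0)

improvement = reference_total / soc_total
```

With these inputs, the SoC total comes to about 9.4 mW and the reference platform to 118 mW, a ratio of roughly 12.5, which agrees with the stated factor of more than 10.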
