A scalp-recording electroencephalography (EEG)-based brain-computer interface (BCI) system can greatly improve the quality of life for people who suffer from motor disabilities. Deep neural networks consisting of multiple convolutional, LSTM and fully-connected layers are created to decode EEG signals and maximize human intention recognition accuracy. However, prior FPGA, ASIC, ReRAM and photonic accelerators cannot maintain sufficient battery lifetime when processing real-time intention recognition. In this paper, we propose an ultra-low-power photonic accelerator, MindReading, that performs human intention recognition using only low bit-width addition and shift operations. Compared to prior neural network accelerators, while maintaining the real-time processing throughput, MindReading reduces the power consumption by 62.7% and improves the throughput per Watt by 168%.
I. INTRODUCTION
A brain-computer interface (BCI) [1] enables direct communication and control using brain intentions alone, and thus offers a practical way to help people suffering from motor disabilities. In particular, scalp-recording electroencephalography (EEG) [2], [3] is one of the most promising solutions for implementing BCIs, due to its low-cost and portable acquisition system. When a person intends to move different parts of the body, the EEG signals from the scalp fluctuate in different modes. In this way, human intentions can be recognized by decoding EEG signals. EEG-based BCI has been widely adopted in controlling wheelchairs, prosthetics and exoskeletons [4].
However, recognizing human intentions by decoding EEG signals is challenging. EEG-based BCI systems suffer from inevitable noise [3] caused by human physiological activities, e.g., eye blinks and heartbeats. Moreover, the correlations [3] between EEG signals and their corresponding brain intentions are not straightforward. To denoise EEG signals and detect human intentions, prior works [5], [6] create neural networks consisting of multiple LSTM and convolutional layers that obtain high recognition accuracy (e.g., 98.3% [5]). Because the raw EEG signal is sampled at 128Hz [5], a BCI system that recognizes intentions in real time must run the inference of a typical EEG neural network [5] 128 times per second. For 64-channel EEG signals, the BCI system has to support a ∼100M-FLOPS throughput, which is difficult for mobile CPUs and GPUs [7] to deliver under the tight power constraint and the temperature budget of a 2 °C increase [8] for most bio-embedding applications.

* Qian Lou and Wenyang Liu contributed equally. This work was supported in part by NSF CCF-1908992 and CCF-1909509. Wenyang Liu and Weichen Liu were supported by NAP M4082282 and SUG M4082087.
The heavy computing effort of EEG-based intention recognition makes it hard for mobile CPUs and GPUs [7] to meet the real-time processing requirement under the power and temperature constraints.
Although FPGA [6], ASIC [7], ReRAM [9], and even photonic [10] neural network accelerators have been proposed to process neural network inferences in an energy-efficient way, it is still difficult for the BCI system to adopt these solutions because of its tight power budget and real-time requirement. The CMOS-based FPGA [6] and ASIC [7] designs cannot maintain a reasonable battery lifetime when processing neural network inferences. For instance, the battery of Google Glass using an ASIC accelerator lasts only 45 minutes [11] when tracking consecutive object actions. The power-hungry CMOS analog-to-digital converters account for > 80% of the total power consumption of the ReRAM-based accelerator [9] and hence become the obstacle to this accelerator's fast adoption in wearable BCI systems. Inspired by the low-power photonic network-on-chip [12], a recent work [10] creates a photonic accelerator that significantly improves the inference throughput per Watt of convolutional neural networks by using compact optical micro-disks. But the eDRAM and optical adders in the photonic accelerator consume 79.1% of its total power and prevent it from achieving higher power efficiency.
To process the real-time EEG-based human intention recognition more efficiently under tight power and temperature constraints, in this paper, we propose an ultra-low-power photonic accelerator, MindReading, for the wearable BCI system. Our contributions can be summarized as follows.
• We present universal logarithmic quantization to quantize not only weights but also activations of convolutional, LSTM and fully-connected layers into power-of-2 data representations with trivial accuracy degradation. In this way, expensive floating-point matrix-vector multiplications can be replaced by low bit-width addition and shift operations.
• We build a novel photonic human intention accelerator, MindReading, to process the neural network composed of power-of-2 quantized weights and activations by on-chip photonic low-bit adders and shifters. In particular, we create a photonic activation unit to directly quantize the outputs of various activations, i.e., Tanh, ReLU and Sigmoid, to power-of-2 representations.
• We evaluated and compared MindReading against state-of-the-art CPU, GPU, FPGA, ASIC, ReRAM and photonic neural network accelerators. Our experimental results show that, while maintaining the real-time processing throughput, MindReading reduces the power consumption by 63% and improves the throughput per Watt by 170% over a recent photonic accelerator.

978-1-7281-4123-7/20/$31.00 © 2020 IEEE

[Figure 1: EEG recognition flow. The 1D vector of electrode readings (S1, S2, ...) at each time-step up to t=N is mapped onto a zero-padded 2D matrix.]
II. BACKGROUND

A. Electroencephalography Signal Recognition
The recognition flow of EEG signals is shown in Figure 1. The EEG-based BCI system uses a wearable headset with 64 electrodes to capture EEG signals [5]. The raw data from the 64 electrodes at time-step t is a 1D data vector of size 64. For instance, when t is 0, the 1D raw data is
To model the position information of the electrodes, the 1D raw data vector is converted to a 2D 10 × 11 data matrix according to the 64-electrode placement map shown in Figure 1. Human intentions can then be recognized with high accuracy (98.3%) by decoding the EEG signals with EEG-NET [5], which is composed of convolutional, fully-connected, LSTM and softmax layers. To recognize human intentions in real time, EEG-NET has to process 128 2D data matrices per second, since the EEG sampling rate of the BCI system is 128Hz [5]. To reliably adopt a battery-powered real-time BCI system [1], [2], [3] in real-world applications, a low-power human intention recognition hardware accelerator becomes a must.
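The 1D-to-2D conversion can be illustrated with a minimal sketch. The placement table below is hypothetical (it simply fills the grid row by row) and is not the electrode map of Figure 1; only the 64-electrode input size and the 10 × 11 output size come from the text.

```python
ROWS, COLS = 10, 11  # 2D matrix size stated in the text

# HYPOTHETICAL placement: electrode i goes to cell (i // COLS, i % COLS).
# The real map of Figure 1 would replace this table.
PLACEMENT = [(i // COLS, i % COLS) for i in range(64)]

def to_2d(sample, placement=PLACEMENT):
    """Scatter one 64-value EEG sample into a 10 x 11 matrix.

    Cells that hold no electrode stay at 0, which mirrors the zero padding
    of the 2D representation.
    """
    frame = [[0.0] * COLS for _ in range(ROWS)]
    for value, (row, col) in zip(sample, placement):
        frame[row][col] = value
    return frame

# One time-step of raw data: 64 readings, here simply 1.0 .. 64.0.
sample = [float(i) for i in range(1, 65)]
frame = to_2d(sample)
```

At a 128Hz sampling rate, a real-time system would build 128 such matrices per second and feed them to the network.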
C. Long Short-Term Memory Layer
D. Logarithmic Quantization
To reduce the computing overhead, the Power-of-2 Quantized Neural Network (P2QNN) [10], [13] quantizes the weights of convolutional layers to their power-of-2 representations. In this way, expensive multiplications can be replaced by cheap binary shift and linear accumulation operations. As Figure 2 shows, P2QNN linearly accumulates 16-bit fixed-point inputs to compute a convolutional layer. To further reduce the accumulation overhead, the logarithmically accumulated P2QNN (LogP2QNN) [13] quantizes the inputs, weights and even the activations of convolutional layers to their power-of-2 data representations. In Figure 2, the logarithmic accumulations can be done by lower bit-width (e.g., 4-bit) adders, indicating lower power consumption. Compared to the full-precision model, LogP2QNN decreases the inference accuracy by ∼1% [13]. However, applying LogP2QNN to LSTM layers is not trivial: whereas convolutional layers rely only on ReLU, LSTM layers use more types of activation functions, including Sigmoid and Tanh. In this paper, we propose a universal logarithmic quantization to quantize the activations of LSTM layers with little accuracy degradation.
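The difference between the two schemes can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [10] or [13]: the function names, the sign handling, and the fixed-point width `FRAC_BITS` are hypothetical.

```python
FRAC_BITS = 8  # hypothetical fixed-point fraction width for the accumulator

def p2_mul(x, exp):
    """Multiply the integer x by 2**exp using a shift (exp may be negative)."""
    return x << exp if exp >= 0 else x >> -exp

def p2qnn_dot(inputs, weight_exps, weight_signs):
    """P2QNN style: weights are +/- 2**exp, so each product is a shift of the
    fixed-point input, followed by a wide linear accumulation."""
    acc = 0
    for x, exp, sign in zip(inputs, weight_exps, weight_signs):
        acc += sign * p2_mul(x, exp)
    return acc

def log_dot(input_exps, weight_exps):
    """LogP2QNN style for positive operands: inputs and weights are both
    powers of 2, so each product is bitshift(1, e_in + e_w). Only the small
    exponents are added before the shift.

    Assumes e_in + e_w >= -FRAC_BITS so the shift amount is non-negative.
    """
    acc = 0
    for e_in, e_w in zip(input_exps, weight_exps):
        acc += 1 << (e_in + e_w + FRAC_BITS)
    return acc / (1 << FRAC_BITS)

linear = p2qnn_dot([12, 8], [1, -2], [1, -1])   # 12*2 - 8/4
logarithmic = log_dot([-1, -2], [0, -1])        # 2**-1 + 2**-3
```

In the second function the only arithmetic on operands is the exponent addition, which is why a low bit-width adder suffices.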
E. Photonic P2QNN Accelerator
A recent work [10] proposes a photonic accelerator, HolyLight-A, to process P2QNN quantized inferences by micro-disk-based adders and shifters. It achieves state-of-the-art inference throughput per Watt, since micro-disks have ultra-low power consumption and a high switching frequency.
HolyLight-A adopts a 16-bit ripple-carry adder consisting of 16 1-bit full adders, each of which is shown in Figure 3(a). To perform an N-bit addition of A + B, the carry (C_i) and sum (S_i) bit calculations are summarized as
Because the critical path of an N-bit ripple-carry adder is determined by the sequential carry bit calculation, only the carry bit calculation is implemented by photonic micro-disks, while the other parts, i.e., P_i and G_i, are calculated by CMOS transistors [14] (∼10ps). Two carrier waves (CWs) are injected into a full adder. Only one CW carries the signal C_{i−1}. Both CWs are divided in half by splitters. The electrically computed signals G_i and P_i are applied to micro-disks to modulate the passing light. By tuning the phase and intensity [14], one optical combiner serves as an XOR gate to produce the sum bit, while the other is used as an OR gate to generate the carry bit. The 16-bit adder performance is mainly decided by the modulation speed of the micro-disks on the critical path. When the micro-disks run at 5GHz, a 16-bit adder can be reliably operated at 4.3GHz.
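The carry and sum logic can be modeled in software with the standard generate/propagate formulation of a ripple-carry adder. That this textbook form matches the G_i and P_i signals named above is an assumption; the function name and bit ordering are illustrative only.

```python
def ripple_carry_add(a, b, n_bits=16, carry_in=0):
    """Add two n_bits-wide unsigned integers bit by bit.

    Per bit i (textbook formulation, assumed here):
        G_i = A_i AND B_i            generate
        P_i = A_i XOR B_i            propagate
        S_i = P_i XOR C_{i-1}        sum bit (XOR combiner)
        C_i = G_i OR (P_i AND C_{i-1})   carry bit (OR combiner)
    Returns (sum modulo 2**n_bits, carry out).
    """
    carry = carry_in
    total = 0
    for i in range(n_bits):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        g_i = a_i & b_i
        p_i = a_i ^ b_i
        s_i = p_i ^ carry
        carry = g_i | (p_i & carry)   # sequential dependency: the critical path
        total |= s_i << i
    return total, carry

four_bit = ripple_carry_add(5, 3, n_bits=4)
overflow = ripple_carry_add(15, 1, n_bits=4)
```

The loop makes the sequential dependency visible: G_i and P_i depend only on the operands and can be prepared in parallel, whereas each carry must wait for the previous one.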
For shift operations, HolyLight-A uses a crossbar composed of 16 × 16 micro-disk-based crossing switching elements (CSEs). Figure 3(b) shows a 4-bit crossbar performing a 1-bit logical right shift operation. By configuring the ON or OFF state of a micro-disk, the passing light can be turned by 90 degrees. A 4-bit crossbar can implement any i-bit right/left binary shift operation by configuring the micro-disk states in the crossbar. If no light is detected by a photodetector (PD), the output (e.g., a_1) is 0. The frequency of a 16-bit shifter is decided by the micro-disk switching speed (4.3GHz).
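A behavioral sketch of the crossbar shifter follows. It abstracts the optics away entirely: a switch at (input i, output j) in the ON state is modeled as routing bit i to photodetector j, and an output that receives no light reads 0. The indexing convention and function names are assumptions made for illustration.

```python
def crossbar_config(n_bits, shift, direction="right"):
    """Return the set of (input, output) switching elements set to ON
    for a logical shift by `shift` bits."""
    on = set()
    for i in range(n_bits):
        j = i - shift if direction == "right" else i + shift
        if 0 <= j < n_bits:       # bits routed off the crossbar are dropped
            on.add((i, j))
    return on

def crossbar_shift(value, n_bits, shift, direction="right"):
    """Propagate the bits of `value` through the configured crossbar."""
    out = 0
    for i, j in crossbar_config(n_bits, shift, direction):
        if (value >> i) & 1:      # light present on input i
            out |= 1 << j         # photodetector j detects light
    return out

right_1 = crossbar_shift(0b1011, 4, 1)           # 4-bit, 1-bit logical right shift
left_2 = crossbar_shift(0b1011, 4, 2, "left")
```

For the 4-bit, 1-bit right shift, only three of the sixteen switching elements are ON, one per surviving bit.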
III. MOTIVATION
To achieve the real-time processing throughput, a human intention recognition accelerator needs to perform 128 EEG-NET inferences per second (IPS), since the EEG sampling rate of the BCI system is 128Hz [5]. We customize the original HolyLight-A to the low-power real-time configuration shown in Table I by removing unnecessary computing components and lowering the operating frequency. More details can be found in Section IV-B3. As Figure 4(b) shows, the customized HolyLight-A achieves exactly 128 IPS when processing the P2QNN quantized EEG-NET. However, the power consumption of the customized HolyLight-A is still significant for a battery-powered real-time BCI system, due to its power-hungry eDRAM buffer, bus, and 16-bit photonic adder. As Figure 4(a) shows, in the customized HolyLight-A, the eDRAM, bus and adder consume 71.7%, 12.1% and 7% of the total power, respectively. The adder is used for 16-bit accumulations, while the bus and eDRAM are used to transfer and store the 16-bit accumulated intermediate results.
To further reduce the power consumption but maintain the same real-time processing throughput, from the algorithm perspective, we propose universal logarithmic quantization to quantize both activations and weights for convolutional, LSTM, and fully connected layers in EEG-NET, so that we can replace the 16-bit accumulations by cheaper 4-bit accumulations with little accuracy degradation. From the hardware perspective, we present a photonic accelerator to process the neural network composed of power-of-2 quantized weights and activations by on-chip photonic low-bit adders and shifters. 
IV. MINDREADING
A. Universal Logarithmic Quantization
Since the quantization of LogP2QNN [13] is intended for CNNs that only have ReLU activations, we cannot simply apply it to EEG-NET, which includes other types of activations, e.g., Tanh and Sigmoid. As Figure 5 shows, we propose a universal logarithmic quantization (ULQ) method to quantize Sigmoid, Tanh and ReLU activations to power-of-2 representations. The ULQ adopts the same method as LogP2QNN [13] to quantize weights. Similarly, to quantize a Sigmoid activation, we can use the ULQ described in Equations 4 and 5. The Sigmoid activations fall in the range of (0, 1). The min and max values in the clip() function for Sigmoid activations are β − N and β, respectively. β decides the range of quantized Sigmoid activations. We set the default β value to 1.
$\mathrm{SigmoidLogQuant}(I, N) = 2^{I}$ (4)
To quantize a non-negative ReLU activation, we can adopt the ULQ in Equation 6. Since the range of ReLU(x) is [0, x) and the distribution of ReLU is different from those of Sigmoid and Tanh, its I can be computed by Equation 7. The default θ value is 0. In short, our proposed ULQ can quantize Tanh, ReLU and Sigmoid activations to power-of-2 representations with negligible accuracy loss. Specifically, the 4-bit ULQ-quantized EEG-NET has 97.6% accuracy, degrading the inference accuracy by only 0.7% compared to the full-precision EEG-NET.
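The general idea of mapping an activation to a power of 2 can be sketched as below. This is a generic round-and-clip log quantizer and does not reproduce Equations 4 to 7: the exponent window (`max_exp`, `n_bits`), the sign handling for Tanh, and all function names are assumptions made for illustration.

```python
import math

def log_quant(x, n_bits=4, max_exp=0):
    """Quantize x > 0 to 2**I with I = clip(round(log2(x)), min_exp, max_exp).

    HYPOTHETICAL window: the 2**n_bits exponents ending at max_exp.
    """
    min_exp = max_exp - (1 << n_bits) + 1
    exponent = round(math.log2(x))
    exponent = max(min_exp, min(max_exp, exponent))
    return exponent, 2.0 ** exponent

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def quantized_sigmoid(x, n_bits=4):
    """Sigmoid outputs lie in (0, 1), so the exponent is never positive."""
    return log_quant(sigmoid(x), n_bits)[1]

def quantized_tanh(x, n_bits=4):
    """Assumed sign handling: quantize the magnitude, keep the sign separately."""
    t = math.tanh(x)
    if t == 0.0:
        return 0.0
    return math.copysign(log_quant(abs(t), n_bits)[1], t)

half = quantized_sigmoid(0.0)    # sigmoid(0) = 0.5 is already a power of 2
quarter = log_quant(0.3)         # log2(0.3) is about -1.74, rounds to -2
```

Because every quantized activation is 2**I, only the small integer I has to be stored and moved between layers.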
B. MindReading Photonic Accelerator
1) Architecture: The overall architecture of MindReading is shown in Figure 6. The chip node relies on an eDRAM buffer to store EEG signals and the intermediate results generated by the Photonic Processing Unit (LogAccu unit). The LogAccu unit is responsible for calculating the binary logarithms and logarithmic accumulations of the ULQ-quantized EEG-NET, mainly by using photonic adders and shifters. The chip node adopts electrical nonlinear units for the EEG-NET activations, including ReLU, Tanh and Sigmoid.
2) MindReading LogAccu Unit: As Figure 6(b) shows, the MindReading LogAccu unit is in charge of processing the convolutional, LSTM and fully-connected layers of the ULQ-quantized EEG-NET. The weights are quantized during training and can be fetched into eDRAMs. The EEG input signals and activations are quantized at run-time by ULQ. During EEG-NET inferences, inputs/activations and quantized weights are read from the input buffer and allocated to the LogAccu unit. The inputs/activations are ULQ-quantized by a photonic Log2 unit. Then, two 4-bit photonic adders and a Bshifter in the LogAccu unit collaboratively compute the accumulations in the logarithmic domain. The intermediate results of the LogAccu unit are cached in an output buffer for the next-layer processing.
LogAccu unit Components. We implement each component of the MindReading LogAccu unit as follows:
• Photonic Log2 unit. We build a photonic Log2 unit, shown in Figure 6(c), to accelerate binary logarithm computations. It relies on the decomposition Log2(m) = −k + Log2(2^k × m), where m is an input/activation or weight that is mapped into (1, 2] by multiplying it by 2^k using a photonic shifter, so that −k and Log2(2^k × m) are the integer part and fraction part of Log2(m), respectively. The integer part, −k, is determined by checking the result after each 1-bit shift until m is mapped into (1, 2]. Since the outputs of each layer are normalized into the range of (−1, 1) by the non-linear activation functions, e.g., Sigmoid and Tanh, the integer part −k can be determined in one cycle. The fraction part is returned by searching a tiny look-up table (∼8KB) in eDRAM that stores the log2 values in (1, 2]. Finally, the two parts are summed to obtain Log2(m) using a 4-bit photonic adder.
• eRound and eClip. We use CMOS eRound and eClip units together with a photonic Log2 unit to construct the ULQ-quantization LogQ unit, where the Log2 computation is the most time-consuming step.
• Photonic 4-bit Adder. We adopt the same photonic ripple-carry adder design as HolyLight-A [10].
• Photonic 4-bit Bshifter. To compute bitshift(1, B), we propose a low-cost photonic 4-bit Bshifter, shown in Figure 6(d), built from micro-disk-based parallel switching elements (PSEs). As Figure 23 shows, LogP2QNN only requires the values of bitshift(1, B) during convolutions. Hence, a general photonic 4-bit shifter is not used, which saves power and energy. In addition, both PSEs and CSEs can change the direction of waves, but PSEs have a more compact size and less insertion loss. Our ULQ also shares the same principle to process convolutional, LSTM and fully-connected layers. By configuring the micro-disks (MDs) into ON or OFF states, the Bshifter can shift the input 1 by B bits. Figure 6(d) shows an example of bitshift(1, 2), where the second MD, MD_2, is set to the ON state.

LogAccu Pipeline. To implement the ULQ quantization, shift and accumulation operations, the LogAccu unit requires 9 cycles to derive O_p from weight W_i and input/activation I_i.
As Figure 6(b) describes: (1) W_i and I_i are fetched from the eDRAM buffer in one cycle. (2) Five cycles are required to calculate LogQ(I_i) and LogQ(W_i); these cycles are for the integer part computation, the fraction part computation, the sum of those two parts in the Log2 unit, eClip() and eRound(), respectively. (3) In the 7th cycle, the sum LogQ(I_i) + LogQ(W_i) is calculated. (4) The Bshifter outputs bitshift(1, LogQ(I_i) + LogQ(W_i)) in the 8th cycle; meanwhile, the O_p of the last time-step is loaded from the eDRAM buffer. (5) The 4-bit adder2 sums the last time-step O_p and bitshift(1, LogQ(I_i) + LogQ(W_i)) in the 9th cycle. This 9-cycle accumulation is repeated until one entire convolutional result, O_p, is generated. After that, the generated O_p is passed through an activation function, e.g., ReLU or Tanh, for the next-layer processing. The loop of log-domain accumulation and activation continues until the entire EEG-NET inference is finished.
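The data path of the Log2 unit and one accumulation pass can be modeled functionally as follows. This is a software sketch under stated assumptions, not the hardware design: the look-up table resolution `LUT_BITS`, the clip window `[lo, hi]`, the accumulator width `frac_bits`, and the restriction to positive operands are all hypothetical.

```python
import math

LUT_BITS = 8  # HYPOTHETICAL table resolution; the text only gives its size
# log2 values on a grid over (1, 2]; entry i holds log2(1 + (i + 1) / 2**LUT_BITS)
LOG2_LUT = [math.log2(1.0 + (i + 1) / (1 << LUT_BITS))
            for i in range(1 << LUT_BITS)]

def log2_unit(m):
    """Split log2(m), m > 0, into (integer part -k, fraction part)."""
    k = 0
    while m * 2.0 ** k <= 1.0:    # 1-bit left shifts until m * 2**k is in (1, 2]
        k += 1
    while m * 2.0 ** k > 2.0:     # right shifts, unused when m < 1
        k -= 1
    scaled = m * 2.0 ** k
    index = math.ceil((scaled - 1.0) * (1 << LUT_BITS)) - 1
    return -k, LOG2_LUT[index]

def log_q(m, lo=-8, hi=0):
    """LogQ: sum the two parts, then eClip and eRound (order as in the text)."""
    int_part, frac_part = log2_unit(m)
    clipped = max(lo, min(hi, int_part + frac_part))
    return round(clipped)

def logaccu(inputs, weights, frac_bits=16):
    """Accumulate bitshift(1, LogQ(I_i) + LogQ(W_i)) over all operand pairs."""
    acc = 0
    for i_val, w_val in zip(inputs, weights):
        exponent = log_q(i_val) + log_q(w_val)   # adder 1
        acc += 1 << (exponent + frac_bits)       # Bshifter, then adder 2
    return acc / (1 << frac_bits)

parts = log2_unit(0.75)                       # 0.75 * 2 = 1.5, so k = 1
result = logaccu([0.5, 0.25], [0.5, 1.0])     # exact powers of 2
```

When all operands are exact powers of 2, as in the usage lines, the log-domain result equals the ordinary dot product; otherwise rounding the exponent introduces the quantization error that the accuracy figures account for.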
3) Low Power Real-time Hardware Customization:
The design goal of the human intention recognition accelerator is to minimize the power consumption while maintaining a 128 IPS throughput. To use HolyLight-A to process EEG-NET, we scaled its frequency down and adjusted the number of its hardware resources, e.g., photonic adders and shifters. We found that one 16-bit adder and one shifter operating at 4.3GHz are enough for HolyLight-A to achieve the real-time processing throughput of EEG-NET. We call this configuration the customized HolyLight-A. We construct the baseline of MindReading (MindReading-B) from one 4-bit adder and a shifter operating at 4.3GHz. As Figure 4

To build MindReading, we modeled and adopted optical splitters and combiners, photodetectors and micro-disks from [10]. To estimate the MindReading area, we used a systematic analysis tool, CLAP [16], that provides detailed structures of various optical devices.
V. EXPERIMENT METHODOLOGY
Workload. MindReading recognizes human intentions by accelerating EEG-NET [5] with ultra-low power. We trained EEG-NET with the PhysioNet EEG Dataset [17] using PyTorch-v0.4. EEG-NET consists of 3 convolutional layers, 2 fully-connected layers, 2 LSTM layers with 30 time-steps and 1 softmax layer. More EEG-NET details can be found in Table II. Compared to the full-precision EEG-NET with 98.3% accuracy, the ULQ-quantized EEG-NET loses only 0.7% inference accuracy.
Accelerators. We compared MindReading against 7 counterparts shown in Table III.

Accelerator modeling. A heavily modified deep learning accelerator simulator, FODLAM [20], is used to study the accelerator performance and power. FODLAM has been correlated with and validated against physical accelerator chips such as ShiDianNao. Based on a user-defined accelerator configuration and EEG-NET, it can generate the performance, power and energy details of each accelerator. We implement the microarchitectural pipeline of MindReading in FODLAM.
VI. EVALUATION
Power. The comparison of the power consumption of various accelerators is shown in Figure 7. The ASIC-based ShiDianNao consumes less power than the CPU, GPU and FPGAs when processing 128 EEG-NET inferences per second, since it is highly specialized for network inferences. The emerging ReRAM-based accelerator ISAAC reduces the power consumption by 59% over ShiDianNao, because its ReRAM-based dot-product engines are more efficient. MXBCNN consumes less power than ISAAC when achieving 128 IPS, but has lower inference accuracy due to its 4-bit binarized weights and activations. HolyLight-A significantly decreases the power consumption by 97% over MXBCNN, since its photonic devices are highly power-efficient. However, it still requires 57.71 mW, of which 79.1% is consumed by a 16-bit adder and 256KB eDRAM. In contrast, MindReading requires only a 4-bit adder and 64KB eDRAM, so it reduces the power consumption by 62.7% over HolyLight-A.
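The relation between the quoted power reduction and the throughput-per-Watt gain can be checked with a short calculation. The two input numbers come from the text; the absolute MindReading power is derived here and is not a figure quoted in the text, and the calculation assumes both accelerators run at exactly the same 128 IPS throughput.

```python
holylight_mw = 57.71   # customized HolyLight-A power, from the text
reduction = 0.627      # MindReading power reduction over HolyLight-A, from the text

# Derived estimate (not quoted in the text): remaining power after the reduction.
mindreading_mw = holylight_mw * (1 - reduction)

# At equal throughput, throughput per Watt scales as 1 / power.
tpw_gain = 1 / (1 - reduction) - 1

print(f"{mindreading_mw:.2f} mW, +{tpw_gain:.0%} throughput per Watt")
# prints "21.53 mW, +168% throughput per Watt"
```

Under the equal-throughput assumption, a 62.7% power reduction therefore corresponds to the 168% throughput-per-Watt improvement.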
VII. CONCLUSION
In this paper, we present an ultra-low-power photonic accelerator, MindReading, to accelerate real-time human intention recognition. Compared to prior works, MindReading reduces the power consumption by 62.7%, improves the throughput per Watt by 168%, and meets the same real-time processing requirement.
