ABSTRACT Human Body Communication (HBC), which utilizes the human body as a communication channel, is a promising communication method for wireless body area networks. In this paper, we use a deep-learning-based approach to design and implement a new optimized architecture for an HBC system with scalable data rates. The proposed transceiver is implemented entirely using two deep neural networks: one is the autoencoder that represents the transmitter and receiver, and the other performs frame synchronization. The proposed autoencoder-based HBC improves the block error rate by 2 dB compared to the conventional HBC design. In addition, low-complexity modules for the CRC encoder and decoder, the scrambler and descrambler, and the preamble/SFD generator are proposed. Implemented in 45 nm CMOS technology, the proposed design has a core size of 0.116 mm$^2$ and an estimated power of 1.468 mW at a peak data rate of 5.25 Mbps. The energy efficiency ($E_b$) of the proposed design is 280 pJ/b, which is over 3.5x better than the conventional HBC designs in the literature.
I. INTRODUCTION
Wireless Body Area Networks (WBANs) are an emerging technology with several valuable medical applications. A WBAN consists of low-power devices, called nodes, that monitor the vital physiological signs of the human body. The first WBAN standard, IEEE 802.15.6, supports a single MAC layer and three physical (PHY) layers: Human Body Communication (HBC), Ultra-Wide Band (UWB), and Narrow Band (NB) [1]-[3]. The HBC PHY layer utilizes the human body as a communication channel, so it has many additional features compared to the other PHY layers. In addition, HBC is the most energy-efficient solution for WBAN because it enables sensor nodes to be implemented without RF components and antennas. Electrostatic Field Communication (EFC) technology with capacitive coupling is exploited in the HBC PHY layer [1], [4]-[6]. According to the IEEE 802.15.6 standard, both the transmitter and the receiver have two electrodes: one is placed on the human body (signal electrode) and the other is a floating ground electrode, as shown in Fig. 1. The signal electrode of the transmitter induces an electric field that propagates through the body and is detected by the receiver's electrode. The floating ground electrodes provide the return paths of the electric field. Conventional HBC transceivers that are compatible with the IEEE 802.15.6 standard exploit Frequency Selective Digital Transmission (FSDT) in the transmitter. In the receiver, the signal is detected using a comparator and clock and data recovery (CDR) in the Analog Front End (AFE). The detector computes the Hamming distance between the hard-decision (HD) bit stream from the AFE and 16 pre-stored candidate Walsh codes, and then selects the candidate Walsh code with the minimum Hamming distance [1], [4]. Several studies have addressed the design and implementation of HBC transceivers [7]-[11]; however, most of them focused on the AFE of the transceiver. In addition, many design issues, such as frame synchronization at the receiver side, have not been addressed well in previous designs. Therefore, it is of great significance to develop an energy-efficient hardware architecture for the HBC transceiver.

The associate editor coordinating the review of this article and approving it for publication was Kok Lim Alvin Yau.
On the other hand, recent advances in Machine Learning (ML) for wireless communication systems have attracted much attention due to their improvements in performance and adaptability. ML has been applied to enhance certain functions of the conventional system, such as channel encoding and decoding [12], channel estimation [13], signal demodulation [14], and beamforming design for MIMO systems [15]. In a recent study, the whole conventional system was replaced with an end-to-end reconstruction task using an autoencoder [16]. In this approach, the transmitter, receiver, and channel are represented as a single Deep Neural Network (DNN) that is trained to achieve optimal end-to-end performance. The autoencoder-based communication system shows competitive Block Error Rate (BLER) performance compared with traditional communication systems such as Hamming-coded BPSK and uncoded BPSK. The concept of the autoencoder-based communication system has been applied not only to wireless networks but also to optical communication system design [17], [18]. The autoencoder-based communication system offers the opportunity to fundamentally reconsider the HBC system design. However, the efficient design and implementation of an autoencoder-based transceiver in real-time hardware, especially for energy-constrained wearable devices, is a major challenge because it requires a significant amount of power.
In this paper, the advantages of ML for the PHY layer are exploited to develop a new autoencoder-based HBC transceiver architecture. The proposed transceiver is implemented entirely using two DNNs: one represents the transmitter and receiver, and the other performs frame synchronization. Moreover, it supports scalable data rates ranging from 164 kbps to 5.25 Mbps at a clock rate of 42 MHz. The training process that finds the parameters of the DNNs is performed offline using supervised learning. During training and testing, the HBC channel model CM3 provided in [19] is used. Furthermore, a low-power synthesis technique is exploited to implement the proposed autoencoder-based HBC transceiver in 45 nm CMOS technology. In addition, a detailed performance analysis of the proposed design with a low-power optimized hardware implementation is provided.
The rest of this article is organized as follows. Section II presents the packet structure and the background of deep learning. Section III gives a detailed description of the proposed autoencoder-based HBC transceiver. The system performance results and discussion are presented in Section IV. Finally, Section V provides concluding remarks.
II. BACKGROUND
In this section, we present the background for this article, including the packet structure of the HBC PHY layer based on the IEEE 802.15.6 standard and the basics of deep learning.
B. DEEP LEARNING BASICS
A deep feedforward network, or multilayer perceptron (MLP), is one of the most important types of artificial DNN and is known for its simplicity. In this type of DNN, there are no feedback connections between layers, and the signal moves in one direction from the input layer to the output layer through the hidden layers. It maps the input vector $x_0$ to an output vector $x_L$ using the weights ($W$) and biases ($b$) of each neuron as follows

$$x_l = \sigma_l\left(W_l x_{l-1} + b_l\right), \quad l = 1, \ldots, L,$$

where $x_l$, $W_l$, $b_l$, and $\sigma_l$ are the output, the weight matrix, the bias vector, and the activation function of the $l$-th layer, respectively, while $x_{l-1}$ is the output of the $(l-1)$-th layer. The sigmoid and ReLU functions are commonly used as activation functions to introduce non-linearity between layers. In this work, we use the ReLU function because it has a constant gradient for positive inputs, which leads to quick learning. The ReLU function sets negative values to zero and keeps positive values, i.e., $\mathrm{ReLU}_{\mathrm{output}} = \max(0, \mathrm{ReLU}_{\mathrm{input}})$. To find the optimal weights and biases of each layer, supervised training can be used with a training set of input vectors $x_0$ and the desired output vectors $x_L$. The weights and biases of each layer are then found by minimizing the loss function using the backpropagation algorithm and gradient descent optimization methods.

VOLUME 7, 2019
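As an illustration of the layer-by-layer mapping described above, the following minimal sketch evaluates a small feedforward network in plain Python. The function names (`relu`, `dense`, `mlp_forward`) and the toy weights are assumptions chosen for this example and are not taken from the proposed design.

```python
def relu(v):
    """Element-wise ReLU: negative entries become zero."""
    return [max(0.0, x) for x in v]


def identity(v):
    """Linear (no-op) activation, used here for the toy output layer."""
    return list(v)


def dense(weights, bias, x):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlp_forward(x0, layers):
    """Apply each (W, b, activation) layer in order, starting from x0."""
    x = x0
    for weights, bias, activation in layers:
        x = activation(dense(weights, bias, x))
    return x


# Toy two-layer network (hypothetical weights, for illustration only).
TOY_LAYERS = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -2.0], relu),
    ([[1.0, 1.0]], [0.5], identity),
]
```

With the toy weights, the input `[2.0, 1.0]` gives a hidden vector `[1.0, 0.0]` after the ReLU, and the output layer returns `[1.5]`.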
III. THE PROPOSED HBC-PHY DESIGN
We implement the entire HBC system, including the transmitter, HBC channel, and receiver, as a single end-to-end autoencoder DNN that is trained to reproduce its input at the output. The scope of this paper is the design and implementation of the digital baseband of the autoencoder-based HBC transceiver; the AFE implementation is beyond the scope of the paper. Fig. 3 shows the complete architecture of the transmitter and the receiver, which aims to achieve optimal performance with low power consumption. The encoder on the transmitter side, the HBC channel, and the decoder on the receiver side (green blocks) form the autoencoder structure. Moreover, the proposed architecture contains an additional DNN (red block) that performs frame synchronization. In addition, two controllers are proposed to control all the modules in the transceiver, one for the transmitter and the other for the receiver. The resolution of the Analog-to-Digital Converter (ADC) and Digital-to-Analog Converter (DAC) is assumed to be 8 bits. In the following, we explain the main modules of the proposed system in detail.
A. AUTOENCODER NN
Fig. 4 describes the autoencoder structure for the proposed HBC transceiver, which consists of a transmitter, a channel, and a receiver. Each group of four input bits, representing a message $m \in \{1, 2, \ldots, 16\}$, is encoded into a one-hot vector $t_m \in \mathbb{R}^{16 \times 1}$, i.e., the $m$-th message is represented by a 16-element zero vector whose $m$-th element is equal to 1. One-hot encoding is a very common method for representing categorical values in ML [20].
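The one-hot mapping can be sketched as follows. The bit ordering (most significant bit first) and the helper names `bits_to_message` and `one_hot` are assumptions made for this illustration; the text does not specify how a 4-bit group is mapped to the index $m$.

```python
def bits_to_message(bits):
    """Map four bits (MSB first, an assumed ordering) to a message m in 1..16."""
    if len(bits) != 4 or any(b not in (0, 1) for b in bits):
        raise ValueError("expected exactly four bits, each 0 or 1")
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value + 1  # messages are numbered from 1


def one_hot(m, size=16):
    """Return a zero vector of the given size with element m (1-based) set to 1."""
    if not 1 <= m <= size:
        raise ValueError("message index out of range")
    vec = [0] * size
    vec[m - 1] = 1
    return vec
```

For example, under the assumed ordering the bit group `[0, 0, 1, 0]` maps to message 3, whose one-hot vector has a single 1 in the third position.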
1) TRANSMITTER
The encoder at the transmitter side encodes the 16-element one-hot vector into 1024, 512, 256, 128, 64, or 32 bits, corresponding to the selected raw data rate of 164, 328, 656, 1312, 2624, or 5250 kbps, respectively. For instance, if the proposed transceiver operates at a data rate of 1.312 Mbps, each 4-bit message (16-element one-hot vector) is transmitted as 128 bits because the clock rate is 42 MHz. Thus, the encoder consists of a single dense layer with an adjustable number of neurons $N$: 1024, 512, 256, 128, 64, or 32 neurons, corresponding to raw data rates of 164, 328, 656, 1312, 2624, or 5250 kbps, respectively. This layer has learnable parameters $W_1 \in \mathbb{R}^{N \times 16}$ and $b_1 \in \mathbb{R}^{N}$ with a ReLU activation function, and its output $h_1 \in \mathbb{R}^{N}$ can be expressed as

$$h_1 = \mathrm{ReLU}\left(W_1 t_m + b_1\right).$$
The final layer of the transmitter is a normalization layer that satisfies the power constraint of the output data.
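The relation between the encoder width $N$ and the raw data rate can be checked with a short calculation. This sketch assumes that one encoder output bit is sent per clock cycle, which is how we read the 1.312 Mbps example above; the constant and function names are illustrative.

```python
CLOCK_HZ = 42_000_000    # clock rate stated in the text
BITS_PER_MESSAGE = 4     # each message carries four information bits


def raw_data_rate_bps(n_neurons):
    """Raw data rate when a 4-bit message occupies n_neurons clock cycles."""
    return BITS_PER_MESSAGE * CLOCK_HZ / n_neurons


if __name__ == "__main__":
    for n in (1024, 512, 256, 128, 64, 32):
        print(f"N = {n:4d}: {raw_data_rate_bps(n) / 1e3:.1f} kbps")
```

Under this assumption, $N = 128$ gives 1312.5 kbps and $N = 32$ gives 5250 kbps, which is consistent with the nominal rates quoted above.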
2) CHANNEL
The next part of the autoencoder is the communication channel, which is the HBC channel. The structure of the IEEE 802.15.6 standard for HBC is based on the channel model provided in [19]. The HBC channel is composed of a channel filter, which represents the signal attenuation as the signal propagates through the human body, and Additive White Gaussian Noise (AWGN), which represents the interference signals, as shown in Fig. 5. The impulse response of the channel filter is a function of the size of the ground plates of the transmitter ($G_T$) and receiver ($G_R$), the distance between the transmitter and the receiver through the body ($d_{body}$), and the distance between the transmitter and the receiver through the air ($d_{air}$). The impulse response is represented by the following equations
where $A_v$ is a random variable that represents the signal loss, which differs from user to user because each user has different physical parameters such as fat and muscle. The other parameters $A$, $t_r$, $t_0$, $x_c$, and $\omega$ have constant values, as listed in Table 1. The encoder output signal is filtered using a bandpass filter (BPF) with a frequency range of 5 MHz to 50 MHz. The BPF can be modeled as a linear stage of an NN layer by multiplying the input signal with a chosen matrix. Similar to the fading channel in [21], [22], the BPF output is convolved with the impulse response in (3) and the AWGN is added.
3) RECEIVER
The channel output $y$ is fed to dense layer 2 in the decoder, which has parameters $W_2 \in \mathbb{R}^{16 \times N}$ and $b_2 \in \mathbb{R}^{16}$ with a ReLU activation function. The output of this layer, $h_2 \in \mathbb{R}^{16}$, can be expressed as

$$h_2 = \mathrm{ReLU}\left(W_2 y + b_2\right).$$

The output layer has a softmax activation function, and its output is a probability vector $P \in (0, 1)^{16}$ that has the same dimension as the one-hot vector and is given by

$$P_i = \frac{\exp(d_i)}{\sum_{j=1}^{16} \exp(d_j)}, \quad i = 1, \ldots, 16,$$

where $d$ is defined as $d = W_3 h_2 + b_3 \in \mathbb{R}^{16}$, and $W_3 \in \mathbb{R}^{16 \times 16}$ and $b_3 \in \mathbb{R}^{16}$ are the weights and biases of this layer. The decoded message $\hat{m}$ is the index of the highest-probability element of $P$. At this stage, the BLER $P_e$, as an indicator of the system performance, is calculated as

$$P_e = \frac{1}{16} \sum_{m=1}^{16} \Pr\left(\hat{m} \neq m \mid m\right).$$
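The output stage of the decoder and an empirical BLER estimate can be sketched as follows. This is a generic illustration of softmax decoding rather than the authors' code; the function names are assumptions, and the BLER is estimated simply as the fraction of messages decoded incorrectly.

```python
import math


def softmax(d):
    """Numerically stable softmax of a list of real values."""
    peak = max(d)
    exps = [math.exp(v - peak) for v in d]
    total = sum(exps)
    return [e / total for e in exps]


def decode(d):
    """Return the 1-based index of the most probable message."""
    p = softmax(d)
    return p.index(max(p)) + 1


def block_error_rate(sent, decoded):
    """Fraction of messages whose decoded index differs from the sent index."""
    if len(sent) != len(decoded) or not sent:
        raise ValueError("need two non-empty sequences of equal length")
    errors = sum(1 for s, r in zip(sent, decoded) if s != r)
    return errors / len(sent)
```

Because the softmax is monotonic, taking the index of the largest element of $d$ directly yields the same decision as taking the largest element of $P$.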
4) TRAINING AND IMPLEMENTATION
The aim of the training process is to select the weights and biases of all the layers such that the softmax output matches the one-hot input vector. In this work, the proposed autoencoder is trained offline over the HBC channel at a learning rate of 0.001, as suggested by [18], [23], [24], and at a fixed value of $E_c/N_0 = 1$ dB. The training set and the validation set consist of one million random messages produced over the HBC channel. The weight matrices are initialized using Glorot/Xavier initialization [25], whereas the biases are initialized to zero. The batch size and the number of epochs are 250 and 15, respectively. To eliminate multiplications, which consume a large amount of energy, all the weights are constrained to be powers of two during training, while the performance of the autoencoder is maintained. The autoencoder is trained six times, once for each of the six raw data rates supported by the proposed transceiver. We save each trained model and then load it separately during the testing phase. We do not perform any hyperparameter optimization to select the autoencoder architecture, mini-batch size, activation functions, training $E_c/N_0$, etc., since hyperparameter optimization is beyond our scope. Instead, a fixed set that achieved good results after trying different architectures is chosen. Since all weights are powers of two, we use a multiplierless neuron that replaces all multipliers with Shift and Add (SA) units, as shown in Fig. 6. The multiplierless neuron has $n$ shift (Sh) blocks, which represent the absolute values of the neuron weights and perform shifting operations on the inputs. The weights and biases are loaded from a memory according to the input data rate. Then, the outputs of the Sh blocks are fed sequentially to an accumulator through a multiplexer. A Sign block adds the sign for the elements with negative values by correcting the 2's complement representation.
Furthermore, the hard-max function is used instead of the softmax function during the testing and implementation phases, since in our case we are interested in a sparse output vector.
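The idea behind the multiplierless neuron can be illustrated in software with integer shifts. The sketch below is a behavioral model under our own assumptions (integer fixed-point inputs, each weight stored as a sign and a shift amount); it is not the RTL of Fig. 6, and the function names are hypothetical.

```python
def shift_add_accumulate(inputs, shifts, signs, bias=0):
    """Weighted sum with weights of the form sign * 2**shift, using only shifts and adds.

    inputs: integer (fixed-point) input samples
    shifts: exponent per weight; k >= 0 shifts left, k < 0 shifts right
    signs:  +1 or -1 per weight
    """
    acc = bias
    for x, k, s in zip(inputs, shifts, signs):
        # A right shift floors the result, which models fixed-point truncation.
        shifted = x << k if k >= 0 else x >> -k
        acc += shifted if s > 0 else -shifted
    return acc


def relu_int(value):
    """ReLU applied to the integer accumulator output."""
    return max(0, value)
```

For inputs `[3, 5, 8]` with weights $2$, $-1$, and $1/4$ (shifts `[1, 0, -2]`, signs `[1, -1, 1]`) and a bias of 1, the accumulator returns $6 - 5 + 2 + 1 = 4$, matching the multiply-based result.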
B. FRAME SYNCHRONIZATION NN
To guarantee successful reception, the receiver must detect the incoming packet and precisely determine the start bit index of the received data. In this work, we propose a multi-layer NN that achieves frame synchronization using the even indices of the SFD (256 bits). Table 2 shows the architecture of the proposed frame synchronization NN. The first dense layer has two neurons with 514 learnable parameters and a ReLU activation function, while the second dense layer has two neurons with 6 learnable parameters and a softmax activation function. The outputs of the two dense layers, $h_3 \in \mathbb{R}^{2}$ and $U \in (0, 1)^{2}$, are given by

$$h_3 = \mathrm{ReLU}\left(W_4 g + b_4\right),$$

$$U_i = \frac{\exp(z_i)}{\sum_{j=1}^{2} \exp(z_j)}, \quad i = 1, 2,$$

where $g \in \mathbb{R}^{256}$ is the vector of received even-index SFD samples with a length of 256, $z$ is defined as $z = W_5 h_3 + b_5$, $W_4 \in \mathbb{R}^{2 \times 256}$ and $W_5 \in \mathbb{R}^{2 \times 2}$ are the weights of the first and second layers of the frame synchronization NN, respectively, and $b_4 \in \mathbb{R}^{2}$ and $b_5 \in \mathbb{R}^{2}$ are the corresponding bias vectors. The start bit index of the incoming packet is detected only if the first element of $U$ has the highest probability. The training set is composed of three categories: the SFD, shifted versions of the SFD, and random noisy data. All of them are generated over the HBC channel at a fixed value of $E_c/N_0 = 1$ dB.
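The parameter counts and the detection rule of the frame synchronization NN can be reproduced with a small sketch. The toy weights below are placeholders chosen only to exercise the code, not trained values, and the helper names are assumptions.

```python
import math


def dense_param_count(n_in, n_out):
    """Number of learnable parameters (weights plus biases) of a dense layer."""
    return n_in * n_out + n_out


def dense(weights, bias, x):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def frame_sync_forward(g, w4, b4, w5, b5):
    """Two-layer forward pass: ReLU dense layer followed by a softmax layer."""
    h3 = [max(0.0, v) for v in dense(w4, b4, g)]
    z = dense(w5, b5, h3)
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def start_detected(u):
    """The start of the packet is declared only if the first output is the largest."""
    return u[0] > u[1]
```

The counts follow directly from the layer sizes: $256 \times 2 + 2 = 514$ for the first layer and $2 \times 2 + 2 = 6$ for the second.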
C. PREAMBLE/SFD GENERATOR
The two training sequences, the SFD and the PLCP preamble, are used to aid the receiver in timing synchronization and packet detection. Two different 64-bit Gold codes are used to construct these sequences. Based on the IEEE 802.15.6 standard, the XORed output of the two polynomials $F_1(x)$ and $F_2(x)$ is used to generate both the SFD and the preamble
In the case of the SFD, the initial states of these polynomials are 352 and 34 for $F_1(x)$ and $F_2(x)$, respectively, while in the case of the preamble, the initial states are 145 for $F_1(x)$ and 250 for $F_2(x)$. Two 10-bit LFSRs with a modulo-2 adder circuit are used to generate the preamble and the SFD. Afterward, the output of the LFSR is spread by re-representing each bit 
D. SCRAMBLER AND DESCRAMBLER
The scrambler module eliminates long sequences of 0's and 1's in the PSDU that can cause synchronization issues at the receiver. The scrambler performs an XOR between the input data and the polynomial $Z[n]$, which is given by
We implement a serial scrambler using a 32-bit LFSR, which is easy to implement in hardware, as shown in Fig. 7. The MAC layer sets the Scrambler Seed (SS) value in the PHY header and the corresponding initial values of the LFSR, as listed in Table 3. At the receiver side, the SS value in the PHY header is used to determine the initialization values of the LFSR in the descrambler module.
E. CRC ENCODER AND DECODER
According to IEEE 802.15.6, an 8-bit CRC is computed over the HBC PHY header to ensure the integrity of the PLCP header. The CRC decoder detects errors in the received header by comparing the newly calculated CRC with the received CRC. To perform the logical operations quickly and calculate the CRC in one clock cycle, a parallel CRC based on an XOR tree [26], [27] is used to implement the CRC polynomial $G(x) = x^8 + x^7 + x^3 + x^2 + 1$.
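A bit-serial software reference is a convenient way to cross-check a parallel XOR-tree CRC. The sketch below uses the generator polynomial given above; the zero initial register value, the most-significant-bit-first ordering, and the absence of a final XOR are assumptions made for this illustration and should be checked against the standard.

```python
CRC8_POLY = 0x8D  # x^7 + x^3 + x^2 + 1; the x^8 term is implicit


def crc8(bits, init=0x00):
    """Bit-serial CRC-8 over an iterable of 0/1 values (first bit first)."""
    reg = init
    for bit in bits:
        feedback = ((reg >> 7) & 1) ^ bit
        reg = (reg << 1) & 0xFF
        if feedback:
            reg ^= CRC8_POLY
    return reg


def to_bits(value, width):
    """Expand an integer into a list of bits, most significant bit first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def header_is_valid(header_bits, received_crc):
    """Decoder check: recompute the CRC and compare it with the received one."""
    return crc8(header_bits) == received_crc
```

With these conventions, appending the computed CRC to the header and running the same routine over the extended sequence leaves a zero remainder, which is a useful self-check for a hardware implementation.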
IV. RESULTS AND DISCUSSION
To the best of our knowledge, our implementation is the first proposal for an efficient deep-learning-based WBAN HBC-PHY transceiver. Moreover, the proposed transceiver supports a peak data rate of 5.25 Mbps, which exceeds the 1.312 Mbps peak data rate requirement of the IEEE 802.15.6 standard. The proposed autoencoder-based HBC transceiver has been designed to achieve the desired performance efficiently. The system-level data flow for the floating-point and fixed-point models is modeled using MATLAB. To demonstrate the effectiveness of the proposed autoencoder-based transceiver, the fixed-point model is used to develop the Verilog RTL modules that describe the proposed transceiver. We simulate the proposed RTL modules using ModelSim. Finally, we implement the proposed design as an ASIC in 45-nm CMOS technology following the conventional design flow, which includes logic synthesis, floor planning, clock-tree synthesis (CTS), and place and route. We perform several MATLAB simulations to verify the performance of the proposed autoencoder-based HBC transceiver. Fig. 8 shows the block diagram of the communication system in MATLAB that is used to validate the functionality and performance of the proposed autoencoder-based HBC transceiver. The proposed transmitter encodes the generated data, and the encoded message is then transmitted through the HBC channel. Afterward, the analog part of the receiver (ADC) processes the received noisy data, and its output is fed into the digital part of the receiver to recover the data. Fig. 9 compares the BLER of the proposed autoencoder-based HBC against the BLER achieved by conventional HBC employing the Frequency Selective Digital Transmission (FSDT) technique with hard-decision decoding [1], [28]. Both the proposed design and the conventional HBC system are simulated under similar conditions. We observe that the proposed design improves the BLER by around 2 dB compared to the traditional HBC design.
This is because the proposed autoencoder-based design optimizes the entire communication system jointly, while the conventional designs optimize each block independently, which does not guarantee optimal performance of the entire system. To select the appropriate ADC resolution, we obtain the BLER simulation results for the proposed autoencoder NN using three different DAC/ADC resolutions, as shown in Fig. 11. The performance with 8-bit resolution is slightly degraded, but it has an advantage in terms of power consumption. The overall performance in terms of the Packet Error Rate (PER) for some of the supported data rates is shown in Fig. 12. To select the appropriate learning rate, we simulated the BLER at different learning rates, as shown in Fig. 13. The results show that a learning rate of 0.001 achieves the best performance. In terms of hardware implementation, the proposed autoencoder-based HBC transceiver has a core area of 0.116 mm$^2$ and consists of 86.38 kGates, as shown in Fig. 14. To further optimize the power consumption of the proposed design, two methods are used. First, the proposed system is deeply pipelined, which saves hardware cost. Moreover, pipelining shortens the critical path and thus increases the maximum clock frequency. The other method is Clock Gating (CG), which reduces the dynamic power consumption by cutting off the clock when the transmitter and the receiver are in the idle state. The power analysis of the proposed design has been performed at 42 MHz using Cadence Joules RTL Power Solution. The implementation results are reported in Table 4 and Table 5. In Table 6, the proposed HBC transceiver is compared with existing traditional HBC transceivers from the literature that use the FSDT technique. We compare our proposed autoencoder-based technique with the FSDT technique because neither technique requires modulation.
An implementation of the HBC transceiver in 65-nm technology with a high clock rate of 210 MHz is presented in [8]; that work focuses on the design of the analog part of the transceiver. The transceiver given in [9] does not include a timing synchronization module. The design proposed in [10] provides an HBC transceiver with a 130-nm implementation, and the power consumption of its TX/RX is 2.3 mW. However, it uses an ineffective method to detect the packet, comparing the received signal with a stored value without using a threshold detection technique. The design described by Cho et al. [11] includes the analog part, so its power consumption is relatively high. The work in [30] is dedicated to the design of the digital part of an HBC transceiver based on the FSDT technique with a hard-decision decoder, which worsens the BLER by 2 dB compared to the proposed design. In addition, the maximum data rate of the HBC transceiver in [30] is 1.312 Mbps. The energy efficiency ($E_b$) of the proposed design is 280 pJ/b, which is over 3.5x better than the traditional HBC designs in the literature. Because these designs use different process technologies, their energy efficiency results were scaled to the same process technology (45 nm) using the scaling equations from [31] to achieve a fair comparison. Our autoencoder-based HBC transceiver implementation achieves the best energy efficiency compared with the published FSDT-based designs.
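The quoted energy efficiency follows from dividing the estimated power by the peak data rate. The short check below uses only figures stated in this article; the function name is illustrative.

```python
def energy_per_bit_pj(power_mw, data_rate_mbps):
    """Energy per bit in pJ/b: mW divided by Mbps gives nJ/b, times 1000 gives pJ/b."""
    return power_mw / data_rate_mbps * 1000.0


TOTAL_POWER_MW = 1.468    # estimated power from the abstract
PEAK_RATE_MBPS = 5.25     # peak data rate

E_B_PJ = energy_per_bit_pj(TOTAL_POWER_MW, PEAK_RATE_MBPS)
```

The result is approximately 279.6 pJ/b, which rounds to the reported 280 pJ/b.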
Our future work will include re-training (fine-tuning) of the receiver's decoder when deployed on real hardware to account for the mismatch between the approximated HBC channel model [19] and the real HBC channel, which includes the hardware effects (similar to the work in [23]). To clarify this approach, we performed a simple experiment that uses the human body as a communication channel to prove the concept and validate the proposed design at low data rates. The digital RTL and the AFE of the transceiver are implemented on a PC, as shown in Fig. 15. The output of the proposed digital RTL of the transmitter (ModelSim) is fed to the AFE, which is implemented in MATLAB [32]. Then, the signal is transmitted through the human body after it is converted to the analog domain using a DAC (MCP4725 I2C DAC). At the receiver, the received signal is converted to the digital domain using an ADC and fed to the AFE (MATLAB) of the receiver. Finally, the proposed digital RTL of the receiver (ModelSim) decodes and recovers the data. UART and I2C are used to connect the PCs and the DAC/ADC. The training process of the proposed HBC transceiver is divided into two phases. In the first phase, the proposed HBC transceiver is designed and trained using the HBC channel model CM3 [19]. The second phase is the re-training (fine-tuning) of the receiver part. This is done by transmitting a large number of messages ($10^6$ messages) over the real HBC channel. The received signals and the corresponding message indices are used as the data set for re-training (fine-tuning) the receiver part. Using this setup, we are able to transmit text, image, and sound files correctly between the two devices (PCs).
V. CONCLUSION
This paper has introduced an autoencoder concept to design an efficient HBC transceiver for WBAN with scalable data rates. The proposed transceiver is implemented entirely using two DNNs: one represents the autoencoder for the transmitter and receiver, and the other performs frame synchronization. We show that by designing the HBC transceiver as an end-to-end autoencoder, the BLER is improved by 2 dB compared to conventional HBC systems. To demonstrate the effectiveness of the proposed autoencoder-based transceiver, Verilog RTL modules that describe the design are developed and then synthesized, placed, and routed in 45 nm CMOS technology. The design core area is 0.116 mm$^2$. When clocked at 42 MHz, the proposed design achieves a peak data rate of 5.25 Mbps and consumes only 967 µW on the RX side and 501 µW on the TX side. The performance analysis and the implementation results verify that the proposed HBC transceiver has satisfactory performance in comparison with conventional HBC architectures in the literature.
