Nowadays, a diverse range of physiological data can be captured continuously for various applications, in particular wellbeing and healthcare. Such data require efficient methods for classification and analysis. Deep learning algorithms have shown remarkable potential for such analyses; however, their use on low-power wearable devices is challenged by resource constraints such as area and power consumption. Most available on-chip deep learning processors contain complex and dense hardware architectures in order to achieve the highest possible throughput. Such a trend in hardware design may not be efficient in applications where on-node computation is required and the focus is on area and power efficiency, as in the case of portable and embedded biomedical devices. This paper presents an efficient time-series classifier capable of automatically detecting effective features and classifying the input signals in real time. In the proposed classifier, throughput is traded off against hardware complexity and cost using resource-sharing techniques. A Convolutional Neural Network (CNN) is employed to extract input features, and a Long Short-Term Memory (LSTM) architecture with ternary weight precision then classifies the input signals according to the extracted features. Hardware implementation on a Xilinx FPGA confirms that the proposed hardware can accurately classify multiple complex biomedical time-series data with low area and power consumption and outperforms all previously presented state-of-the-art records. Most notably, our classifier reaches 1.3× higher GOPs/Slice than similar state-of-the-art FPGA-based accelerators.
Introduction
The advent of technologies such as wearable sensor systems could be an answer to rising issues that healthcare systems struggle with, such as the increasing number of individuals with critical medical conditions, the provision of quality care for remote areas, and the need to maximize the participation of disabled patients [1]. Chronicled electronic health data, which can be reformed into time series for machine learning tasks, are prominent information that should be sensed and analyzed from human biological activities [2]. The interest in wearable systems originates from the need to monitor patients over extensive periods of time [3]. Wearable activity systems mainly include sensors such as accelerometers, gyroscopes, or magnetic-field and chemical sensors [4], together with communication systems and processing systems for analyzing the generated signals. Smart wearable sensors are effective and reliable for preventative methods in many different facets of medicine, such as cardiopulmonary and vascular medicine [?]. Further, the use of wearable sensors has made it possible for patients to receive the necessary treatment at home after heart attacks and has supported the diagnosis of some cardiac diseases [3].
Despite these achievements, most contemporary commercial products can only measure simple metrics such as heartbeats or steps. In addition, the high computational requirements of classifying high-dimensional, ordered-attribute time series of interest make real-time classification practically impossible. Compared to traditional time-series classifiers, deep learning algorithms, armed with multiple layers of hierarchical features, are capable of extracting temporal dependencies in time series; together with the more powerful processing capacities of wearable systems, they pave the way for performing more data analysis on-node and in real time. This capability to perform more complex data analysis on the wearable device/node provides the opportunity to decrease data transmission from device to host or, in other words, to save link bandwidth. The bandwidth saving matters most for heart disease patients who must be continuously monitored and classified using ultrasound machines, and for patients, such as those with cardiovascular disease, who do not have access to health care services, including when the doctor or relatives are not near the patient or the cellular network is unavailable [5]. However, full hardware implementation of deep neural networks on wearable sensors and embedded platforms remains challenging for designers due to the memory bandwidth and energy inefficiency of highly computational units.
Recent studies on the development of deep learning hardware accelerators have mainly tried to achieve the highest throughput in order to keep up with the real-time demands of complicated embedded machine learning algorithms, which has led to intricate systems with a large number of Multiply-Accumulate (MAC) processors. As a result, when realized in hardware, such systems consume large silicon area (∼600 mm²) and power (∼500 W) [6] [7] [8]. Based on the observation that most biological time-series signals have low frequency content (0-500 Hz), an alternative approach is proposed that trades off throughput against hardware complexity using resource sharing. In this paper we propose a generalized time-series classifier that implements feature extraction and sequence learning through a CNN and an LSTM network, respectively. The architecture is applied to multiple biomedical disease databases and various trade-offs are investigated. It is also shown that the proposed system is compact and portable as well as accurate. In contrast to muscle signals, which can be classified using features such as signal amplitude, heart signal classification needs to highlight more subtle features. These features can be extracted by training a CNN. The advantage of doing so compared to classical feature extraction methods is the ability to learn new features. The proposed generalized architecture is reconfigurable and can be trained for different applications.
Hardware-oriented Time-series Classifier
The overall architecture of the proposed time-series classifier is shown in Fig. 1. In this approach, feature extraction is carried out using CNN blocks, and the data is then passed to the RNN blocks for sequence learning. The sequential target replication technique inspired by [9] is used during the training phase; however, our approach is slightly different. In the proposed architecture, during the forward pass and for all steps, the same output label is used to calculate the output error. The error is stored q times in memory in order to calculate the gradients during the backward pass. In the backward pass, the RNN is unrolled back in time and the weights are updated. This technique forces the network to better memorize the previous sequences of the input windows [10]. It should also be noted that during the training phase the loss values are averaged over all steps, while at inference (test) time the output at the final step is chosen as the actual classification value. The two main blocks in the system are explained in detail below:
CNN Blocks
CNNs have shown remarkable performance in image processing tasks such as object detection [11] and face recognition [12]. They are normally composed of two types of layers, pooling layers and convolutional layers; in this work only the latter is used in the proposed architecture. Each convolutional layer is responsible for the three-dimensional calculation of the inner product of the input window and the weights, which are referred to as kernels. In contrast to regular convolution, which determines the output using the whole input, in machine learning applications this is done through regional products of the input with a single filter. Each filter is responsible for extracting one feature from the input signal. In our case, the input is a 1D time-series array; therefore the CNN filters are also 1D. The resulting output is called a feature map. The 1D convolution operation can be represented as follows:
where W_cnn and b are the weight and bias of each channel, l is the layer index, m is the length of each 1D filter, and x and z are respectively the input and output of the network.
The output of each filter in a CNN layer is rectified using an activation function called ReLU, which is mathematically described as follows:
The output of the activation function in the last layer is fed to a fully connected network as given here:
where W_f is the weight of the fully connected layer, f is the number of filters per layer and n is the length of the output feature map. Current state-of-the-art CNNs such as ResNets [13] or GoogLeNet [14] benefit from using deep stacks of layers to achieve higher accuracy in image-related tasks. As the number of layers increases, convergence becomes harder due to exploding/vanishing gradients. Techniques such as normalization layers [15] [16] have enabled the design of deep networks. In addition, these networks are vulnerable to accuracy saturation, in which accuracy does not improve as the energy of the system decreases [13]. To deal with this issue, GoogLeNet and ResNet exploit structures such as the inception module and residual learning. In our simulations we faced both problems, hampered convergence and accuracy saturation. To address them we used the residual learning technique, which is more hardware friendly and straightforward than the other structures. Fig. 1 shows the typical structure of a residual module, where the CNN is chosen to have two or three layers.
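The convolution, ReLU and residual steps described in this section can be illustrated with a minimal Python sketch. The equations themselves are not reproduced in this copy of the text, so the sketch assumes the standard forms: a "valid" sliding inner product (giving an output of length ω_s − m + 1 for an input of length ω_s), ReLU as max(0, x), and a two-layer residual module with zero padding so that the skip connection can be added element-wise. The function names and the padding choice are illustrative assumptions, not details taken from the proposed architecture.

```python
from typing import List, Sequence


def conv1d_valid(x: Sequence[float], w: Sequence[float], b: float = 0.0) -> List[float]:
    """Slide a length-m filter over x; the output has len(x) - m + 1 samples."""
    m = len(w)
    return [sum(w[j] * x[i + j] for j in range(m)) + b
            for i in range(len(x) - m + 1)]


def relu(z: Sequence[float]) -> List[float]:
    """Element-wise max(0, z)."""
    return [max(0.0, v) for v in z]


def conv1d_same(x: Sequence[float], w: Sequence[float], b: float = 0.0) -> List[float]:
    """Zero-pad x so the output length equals the input length (assumed padding)."""
    m = len(w)
    left = (m - 1) // 2
    right = m - 1 - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return conv1d_valid(padded, w, b)


def residual_block(x: Sequence[float], w1: Sequence[float], w2: Sequence[float]) -> List[float]:
    """Two convolution layers plus a skip connection: relu(x + conv(relu(conv(x))))."""
    h = relu(conv1d_same(x, w1))
    h = conv1d_same(h, w2)
    return relu([a + c for a, c in zip(x, h)])
```

With all-zero filters the residual block reduces to ReLU applied to its input, which is the property that makes the identity mapping easy to learn.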
RNN Blocks
LSTM networks are very powerful Recurrent Neural Networks (RNNs) that explicitly add memory gates [17]. This makes the training procedure more stable and allows the model to conveniently learn both long- and short-term dependencies. There are several variations of the LSTM architecture; in this paper we use the following model [18]:
where xx_n = [h_n, (x_n + P_n)], and h_n and c_n are the output and cell state vectors respectively at discrete time index n. The operator • denotes the Hadamard (element-by-element) product. The variables h
One of the main bottlenecks for the hardware realization of RNNs and CNNs is the large memory size and bandwidth required to fetch weights in each operation. To alleviate the need for such high-bandwidth memory access, we investigate two quantization methods (binary and ternary) introduced in [19] [20] to quantize the weights embedded in the network architecture. As the changes during gradient descent are small, it is important to maintain sufficient resolution, otherwise no change is seen during the training process; therefore the real-valued gradients of the weights are accumulated in real-valued variables. We also set the bias values to zero to achieve further efficiency in hardware realization while delivering an acceptable classification accuracy for the biomedical case studies considered. Such quantization methods can be considered a form of regularization that can help the network to generalize. In particular, binary and ternary quantization are a variant of Dropout in which weights are binarized/ternarized instead of randomly setting part of the activations to zero when computing the parameter gradients [19].
The quantization of weights in the forward path must be also reflected in the calculation of the gradient descent. Here, we use the version of the straight-through estimator introduced in [19] that takes into account the saturation effect. Consider the sign and round functions for binary and ternary quantization respectively as follows:
and assume that estimators g q b and g qt of the gradients 
This implies that the gradient is applied to a weight only if its real value r lies between -1 and 1; otherwise, the gradient is cancelled.
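The quantization and gradient rule described above can be sketched in Python as follows. The exact equations are missing from this copy, so the sketch makes several assumptions that should be read as such: sign(0) is mapped to +1, ternary rounding is applied after clipping to [-1, 1] with ties at ±0.5 rounded away from zero, and the real-valued "shadow" weight is clipped to [-1, 1] after each update. All function names are hypothetical.

```python
def quantize_binary(r: float) -> int:
    """sign(r); r == 0 is mapped to +1 so the result is always -1 or +1 (assumed)."""
    return 1 if r >= 0 else -1


def quantize_ternary(r: float) -> int:
    """Round the clipped real-valued weight to the nearest of {-1, 0, +1}."""
    clipped = max(-1.0, min(1.0, r))
    if clipped >= 0.5:
        return 1
    if clipped <= -0.5:
        return -1
    return 0


def ste_gradient(g_q: float, r: float) -> float:
    """Straight-through estimator with saturation.

    The gradient computed with respect to the quantized weight (g_q) is passed
    to the real-valued weight r only while |r| <= 1; otherwise it is cancelled.
    """
    return g_q if abs(r) <= 1.0 else 0.0


def update_real_weight(r: float, g_q: float, lr: float) -> float:
    """One plain gradient step on the real-valued weight, then clip (assumed)."""
    r_new = r - lr * ste_gradient(g_q, r)
    return max(-1.0, min(1.0, r_new))
```

The forward pass would use `quantize_binary(r)` or `quantize_ternary(r)`, while the update is always accumulated in the real-valued `r`, so that small gradient steps are not lost to quantization.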
Hardware-oriented Simulations
In order to measure the performance of each quantization method, we first synthetically produce and classify a number of time series as a preliminary proof of principle. In this example, we generate synthetic time-series data from well-known chaotic dynamical systems in different parameter regimes. In all parameter regimes, these systems have chaotic attractors. Each unique and isolated parameter regime corresponds to a discrete class. Only the RNN network (the CNN in Fig. 1 is not included and the input time series is directly connected to the LSTM network) is tasked with classifying the resulting time series as belonging to a unique class. The considered dynamical systems are the logistic map (discrete time) and the Lorenz system (continuous time). The logistic map is given by:
$$x_{n+1} = r\,x_n\,(1 - x_n)$$
where x is the state variable and r is a parameter [21]. The Lorenz system is given by:
$$\frac{dx}{dt} = \sigma\,(y - x), \qquad \frac{dy}{dt} = x\,(\rho - z) - y, \qquad \frac{dz}{dt} = x\,y - \beta\,z$$
where x, y and z are state variables and σ, ρ, β are parameters [22]. By sweeping r in the logistic map and σ in the Lorenz system, various responses can be observed (Figure 2A). The time-series data generated by the chaotic dynamical systems must be similar between the classes in terms of both time and frequency features so that the signals are not easily distinguishable by the classifier. To determine how similar the synthetic data classes are and to visualize our synthetic data set in a simple way, we computed a distance matrix in the Fourier domain as
where P(ω) is the logarithm of the power of the Fourier transform of x_i(t)/20, for i = 1, 2, ..., N_S, with a sample of N_S = 100 realizations per discrete parameter class. The factor of 1/20 multiplying x_i(t) bounds the trajectories in the unit interval for subsequent learning in the LSTM network. Then, we looked for a set of points (x_i, y_i) in R² (Figure 2B) such that
by stochastically minimizing the sum:
This approach finds low-dimensional (in this case 2D) manifolds that the data may lie on. Alternatively, one may use the singular value decomposition; however, we do not take that approach here. The stochastic minimization proceeds by initializing the (x_i, y_i) from a joint uniform distribution on [0, 1]² and then randomly perturbing every point to compute E. The perturbations are drawn from a normal distribution with mean 0 and standard deviation η = 10⁻³. At each step, E is computed after the (x_i, y_i) have been perturbed and compared to the smallest value of E so far, E*. If E < E*, we set E* = E and keep the perturbed (x_i, y_i); otherwise, we discard the perturbation and iterate. The results of this process are shown in Fig. 2A-C for the Lorenz system without noise. The results demonstrate that the data has no readily visible clusters in a 2-dimensional projection, although clustering may appear in a higher-dimensional projection. In order for the network to generalize the input features better, uniform random noise is added to the training and test data. Since binarization is a form of regularization [20], we do not use other regularization methods such as Dropout. All weights are initialized with normally distributed random numbers. The same analysis applied to the logistic map showed that the generated time series are separable in the Fourier domain but still difficult to classify visually in the time domain.
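The stochastic minimization described above can be sketched in a few lines of Python. The expression for E is not reproduced in this copy of the text, so the sketch assumes a common choice: the sum over all pairs of the squared difference between the planar distance of two points and the corresponding entry of the distance matrix D. The function names, the step count and the seed are illustrative assumptions.

```python
import math
import random


def embedding_error(points, D):
    """Assumed form of E: sum over pairs of (planar distance - D[i][j])^2."""
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            total += (d - D[i][j]) ** 2
    return total


def stochastic_embed(D, steps=2000, eta=1e-3, seed=0):
    """Find 2D points whose pairwise distances approximate D.

    Points start uniformly in [0, 1]^2. At each step every point is perturbed
    with N(0, eta) noise, and the perturbed set is kept only if E decreases.
    Returns the final points and the history of the best E found so far.
    """
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(len(D))]
    best = embedding_error(points, D)
    history = [best]
    for _ in range(steps):
        trial = [(x + rng.gauss(0.0, eta), y + rng.gauss(0.0, eta)) for x, y in points]
        e = embedding_error(trial, D)
        if e < best:  # keep the perturbation only when it improves on E*
            points, best = trial, e
        history.append(best)
    return points, history
```

Because a perturbation is only accepted when it lowers E, the recorded history is non-increasing by construction; with η = 10⁻³ each accepted step is small, so many iterations are typically needed.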
To compare the performance of the network with different weight precision, experiments with different free parameters were performed on the database. The free parameters are defined as follows:
• Window (ω_s): the length of each part of the input time series fed to the network to be classified at the output. The length of the input signal (u) is equal to M · ω_s, where M is the dimension of the input signal. For example, M for data collected from a three-dimensional gyroscope is 3, as the input signal is presented to the system as 3 independent time series. A highlighted window sample is shown in Fig. 1.
• Iteration (q): the number of successive windows that must be introduced to the network sequentially so that the network can classify the input signals properly. Partitioning the input signals into windows and introducing them to the network sequentially allows the RNNs to use recurrent feedback and internal memories to make decisions, leading to a significant reduction in hardware area consumption.
• Output Class (N_y): the number of classes that the network must classify based on the input signals. We can specify this by considering more discrete parameter sets in our chaotic systems.
• Hidden neurons (N_h): the number of neurons embedded in the network. Accuracy increases with the number of hidden neurons at the expense of higher hardware cost. After a certain point, there are diminishing returns in increasing the number of hidden neurons.
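The roles of the window length ω_s, the iteration count q and the input dimension M can be illustrated with a short Python sketch. It assumes non-overlapping successive windows and that the M channel segments are concatenated channel by channel to form the input vector u of length M · ω_s; neither detail is stated in the text, and `make_windows` is a hypothetical helper name.

```python
from typing import List, Sequence


def make_windows(signal: Sequence[Sequence[float]], window: int, q: int) -> List[List[float]]:
    """Split an M-channel signal into q successive input vectors.

    signal: M channels, each a sequence of samples.
    window: window length (omega_s) in samples.
    q:      number of successive windows presented to the network.
    Returns q vectors, each of length M * window.
    """
    needed = window * q
    if any(len(channel) < needed for channel in signal):
        raise ValueError("each channel needs at least window * q samples")
    steps = []
    for k in range(q):
        u = []
        for channel in signal:
            # Take the k-th window of this channel and append it to the input vector.
            u.extend(channel[k * window:(k + 1) * window])
        steps.append(u)
    return steps
```

For a three-axis gyroscope (M = 3) with ω_s = 2 and q = 2, this yields two vectors of six values each, which would be presented to the recurrent network one after the other.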
In these experiments, the LSTM network takes a sequence of continuous/discrete arrays defined by the ω s size as input, and after q steps classifies it into one of the output classes. The training objective (loss function) is the cross-entropy loss over all target sequences as follows:
where k is the array of class scores for a single example and p is a vector of the normalised output probabilities. In order to backpropagate the output error to all layers, ∂L_i/∂f_k can be derived using the chain rule as follows:
Adagrad is used as the learning algorithm with a learning rate of 5e-2 [23]. The weights are randomly drawn from a uniform distribution in [-0.01, 0.01]. After each iteration, the gradients are clipped to the range [-5, 5]. The results of sweeping the free parameters on the synthetic test database for the binary, ternary and full precision networks are shown in Fig. 3. As can be seen in Fig. 3 (a) and (e), increasing q increases the accuracy of the classifier at the expense of longer latency and higher power consumption for the hardware realization. It can also be observed that the quantized networks with the same number of neurons (64) classify the input signals with lower accuracy. However, this reduction in accuracy can be compensated by increasing the number of hidden neurons. For example, 128 neurons in the quantized network perform similarly to 64 neurons in a full precision network. Although more neurons are required in a quantized network, a significant improvement in hardware efficiency can still be seen. It is also observed that the ternary network achieves better accuracy than its binarized counterpart, thereby confirming the results in [20]. The effect of varying the window size ω_s on the accuracy of all networks is shown in Fig. 3 (b) and (f). Increasing ω_s increases the accuracy for all networks, but this imposes a higher hardware cost in terms of area and power, because increasing the length of the scanned input signals increases the number of operations and the memory bandwidth per input. Moreover, these plots show that the accuracy drops as quantization is applied to the weights. Again, this reduction in accuracy can be compensated by increasing the number of hidden neurons while still achieving better hardware area performance than the full precision network. Fig. 3 (c) and (g) show that the accuracy of the classifier drops in all networks as the number of output classes increases. However, as in the previous experiments, this reduction in accuracy can be compensated up to a certain point by increasing the number of hidden neurons, as observed in Fig. 3 (d) and (h).
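The training setup described in this section (cross-entropy loss averaged over the q steps, prediction taken from the final step, Adagrad with a learning rate of 5e-2 and gradients clipped to [-5, 5]) can be sketched in Python as follows. The loss and gradient equations are missing from this copy, so the sketch assumes the standard softmax cross-entropy, whose gradient with respect to score f_k is p_k minus 1 for the correct class and p_k otherwise. The function names and the small constant `eps` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Normalised class probabilities, computed stably by subtracting the max."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Loss for one example: minus the log-probability of the correct class."""
    return -math.log(softmax(scores)[label])


def cross_entropy_grad(scores, label):
    """Gradient of the loss w.r.t. each score: p_k - 1[k == label]."""
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(softmax(scores))]


def sequence_loss(scores_per_step, label):
    """Training loss: the same label is used at every step and losses are averaged."""
    return sum(cross_entropy(s, label) for s in scores_per_step) / len(scores_per_step)


def predict(scores_per_step):
    """Inference: the class with the highest score at the final step."""
    last = scores_per_step[-1]
    return max(range(len(last)), key=last.__getitem__)


def adagrad_step(weights, grads, cache, lr=5e-2, clip=5.0, eps=1e-8):
    """In-place Adagrad update with gradients clipped to [-clip, clip]."""
    for i, g in enumerate(grads):
        g = max(-clip, min(clip, g))
        cache[i] += g * g  # running sum of squared gradients
        weights[i] -= lr * g / (math.sqrt(cache[i]) + eps)
```

A quick numerical check of `cross_entropy_grad` against finite differences of `cross_entropy` is a cheap way to confirm that the assumed gradient matches the assumed loss.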
Then we investigate the impact of adding the feature extractor layer (CNN) to the classifier. We use a simple case study as follows:
where j is the number of output classes, which is 5, α is a constant equal to 3, and β is the frequency difference between the channels, ranging from 0.1 to 0.01. We choose such a separable case study because the similarity index (β) between the channels can be linearly altered. The smaller β is, the higher the similarity between the classes and consequently the more difficult it is to differentiate the output classes. As can be seen in Fig. 4, reducing β lowers the accuracy for all networks with different precisions. It is also shown that, due to the added feature extractor, the performance of the proposed CNN-LSTM architecture is always higher than that of the corresponding network without the CNN layer. The proposed network is also compared with its counterpart without the CNN layer on an inseparable dynamical system's time series, with certain parameter sets extracted from the Lorenz attractor [22], as the similarity index cannot be swept due to the chaotic nature of the dynamical system. The results shown in Table 1 confirm that the proposed CNN-LSTM architecture achieves higher performance at various precisions than the LSTM-only classifier.
Hardware Finite State Machine
The proposed hardware classifier functions as a finite state machine that iterates through eight states, with only one state active at a time. This structure can also be implemented in a pipelined form with multiple active states and higher throughput at the expense of increased power and area consumption. The general functionality of each state is briefly described as follows:
State 1: After initialization at the start state, the first input transmission according to the defined ω_s is carried out and the system enters the first state where max(0, x) is also applied, which adds (ω_s − m + 1) × f/32 operations overhead.
State 2: Then, according to (3), the FC layer is implemented in this state using 32 MAC operations in parallel and the result is added to x as seen in Fig. 1.
State 4: At the end of the calculations, the system enters the second state. In this state, according to (2), the nonlinear functions (σ(.) and tanh(.)) are applied to the previous values fetched from memories. After N_h clock pulses, the system enters the next state.
State 5: In this state, using two embedded multipliers, the value of variable c is calculated in N_h clock pulses. The critical path of the proposed architecture is limited by this state, which can be alleviated by using pipelined or serial multipliers at the expense of increased latency and hardware cost.
State 6: Once the variable c is available, the system enters this state, where the tanh(.) function is applied to the previous values, taking N_h clock pulses; the system then enters the next state.
State 7: In this state, using one of the two embedded multipliers, the value of variable h is calculated in N_h clock pulses and the system enters the next state.
State 8: Finally, if the number of scan times is equal to s, the system determines the classified output by calculating W_y^T h in N_h × N_y clock pulses and exits; otherwise it enters State 1. This process is repeated for each window of the input signal(s) and managed by a master controller circuit embedded in the system.
Hardware Architecture
In this section, the hardware description of the proposed classifier is presented. The main goal of the design is to exploit the slow nature of physiological signals, in particular heart activity, to reduce hardware complexity and cost. In principle, in such systems, given that the classification rate is low, a few calculations per high-speed clock cycle (100 MHz) are enough to handle the computational burden of the classifier. This architecture also allows us to actively and efficiently reconfigure the system according to the user's specifications for the number of neurons, window size, iterations and input/output classes.
Other design strategies such as implementing a large number of MAC units in hardware do not apply here as high throughput is not demanded. The architecture of the proposed system is shown in Fig. 5 in which the hardware modules (maximum 64 operations per clock cycle) are shared through a 96-bit bus. It should be stressed that since the accuracy of the ternary network is generally higher than its binary counterpart as shown in Fig. 4 , in the hardware implementation, the ternary quantization is used and two bits are allocated for storing each weight value. In the following, the architecture of each hardware module is explained in detail:
WBs (Weight Banks): This block contains five sets of buffers to store the truncated 2-bit weights (W_f, W_i, W_o, W_c, and W_cnn), and two sets to store full precision weights for the fully connected layers (W_f and W_y). The WBs module is able to read/write a maximum of 64 bits in each clock cycle. The utilised volume of each buffer is defined by the user and depends on the number of hidden neurons, f, ω_s, etc. However, the maximum volume of these buffers must be selected based on the available resources on the FPGA. The greater the volume, the wider the range of flexibility for the network/input size. The reading process of this block is controlled by the MC unit and the block is only activated upon its use. It should be noted that, as the proposed architecture is implemented on a Xilinx FPGA in this work, the buffers are realised using block RAMs and the address of each reading operation is provided on the negative clock edge by the MC module.
IMs (Internal Memories): This block contains sets of buffers to store the 12-bit values produced by the intermediate stages of both the CNN and the LSTM. The maximum and utilised volumes of the buffers are again determined by the available on-chip memory and the user's specifications respectively. The writing and reading of this block are also controlled by the MC unit, and the reading address is provided on the negative clock edge as the buffers are implemented using block RAMs. The maximum data bandwidth of this module is 48 bits and the module is active in almost all states.
MACs (Multiply-Accumulate): This block contains 32 parallel MAC units with full precision. Each unit computes the product of a 12-bit signed number and a 2-bit or 12-bit signed number and adds the product to an accumulator. The number of iterations that this unit needs to operate is defined by the user and assigned by the MC block.
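To illustrate why 2-bit ternary weights keep the multiply-accumulate step cheap, the following Python sketch models a single MAC operation on a ternary weight. The 2-bit encoding used here (two's complement: 0b00 for 0, 0b01 for +1, 0b11 for -1) is a hypothetical choice, since the text does not specify how the weights are encoded, and the text does not state whether the MAC units are specialised in this way.

```python
def decode_ternary(code: int) -> int:
    """Decode a 2-bit weight; the two's-complement encoding is an assumption."""
    table = {0b00: 0, 0b01: 1, 0b11: -1}
    if code not in table:
        raise ValueError("unused 2-bit code")
    return table[code]


def ternary_mac(acc: int, x: int, code: int) -> int:
    """Accumulate x * w for w in {-1, 0, +1} without a general multiplier."""
    w = decode_ternary(code)
    if w == 1:
        return acc + x
    if w == -1:
        return acc - x
    return acc  # a zero weight leaves the accumulator unchanged


def ternary_dot(xs, codes) -> int:
    """Dot product of integer activations with 2-bit ternary weights."""
    acc = 0
    for x, code in zip(xs, codes):
        acc = ternary_mac(acc, x, code)
    return acc
```

For example, `ternary_dot([100, -50, 7], [0b01, 0b11, 0b00])` accumulates 100 + 50 + 0: each product reduces to an add, a subtract or a skip.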
NFs (Nonlinear Functions) : This block is responsible for the calculation of nonlinear functions (σ(u) and tanh(u)) and ReLU employed in the hardware state machine where u is the input value of these functions. We store N discrete values of each nonlinear function in look-up tables with 10-bit length. The quantity u[i] is the ith stored value where i ∈ N ≡ 0, 1, ..., N − 1. The address of each stored value is defined as:
where ∆u = (u_max − u_min)/N. If the magnitudes of u_max and u_min are powers of 2, the division in (11) can be easily performed by arithmetic shifts. Therefore, the values of these parameters are respectively chosen to be 8 and -8 for the σ(.) function and 4 and -4 for the tanh(.) function. According to (11), once the address is prepared, the corresponding output value can be fetched from the look-up table in one clock pulse. In our experimental setup, N is set to 64, providing enough accuracy for the calculation of the nonlinear functions.
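The look-up-table scheme can be sketched in Python as follows. Equation (11) is not reproduced in this copy, so the address is assumed to be the floor of (u − u_min)/∆u, saturated to the valid range, and each table entry is assumed to be the function sampled at the left edge of its interval. The 10-bit storage of the table values is not modelled. The function names are hypothetical.

```python
import math


def build_lut(fn, u_min: float, u_max: float, n: int = 64):
    """Sample fn at n points spaced du = (u_max - u_min) / n apart, from u_min."""
    du = (u_max - u_min) / n
    return [fn(u_min + i * du) for i in range(n)]


def lut_address(u: float, u_min: float, u_max: float, n: int = 64) -> int:
    """Assumed address rule: floor((u - u_min) / du), saturated to [0, n - 1]."""
    du = (u_max - u_min) / n
    index = math.floor((u - u_min) / du)
    return max(0, min(n - 1, index))


def lut_eval(lut, u: float, u_min: float, u_max: float) -> float:
    """Fetch the stored value for input u (one table read)."""
    return lut[lut_address(u, u_min, u_max, len(lut))]


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


# Ranges and table size taken from the text: [-8, 8] for sigmoid, [-4, 4] for tanh.
SIGMOID_LUT = build_lut(sigmoid, -8.0, 8.0, 64)
TANH_LUT = build_lut(math.tanh, -4.0, 4.0, 64)
```

With these ranges and N = 64, ∆u is 0.25 for σ(.) and 0.125 for tanh(.), both powers of two, which is what allows the division to be replaced by a shift in hardware.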
MC (Master Controller): According to the parameters defined by the user, this block manages and controls all resources used in the architecture through a shared bus and control signals (En_MACs, En_NFs, En_IMs and En_WBs). In other words, this block actively changes the state of the FSM by assigning proper tasks to the hardware modules and actively turning off the unused modules.
Hardware Results
To verify the validity of the proposed hardware classifier, the architecture designed in the previous section is implemented on a Genesys 2 development system, which provides a high-performance Kintex-7 (XC7K325T) FPGA surrounded by a comprehensive collection of peripheral components. The device utilization for the implementation of the proposed hardware is summarized in Table 2 along with other state-of-the-art implementations. The focus of all the other implementations is mainly on the hardware realization of CNNs; however, as the nature of the computations in all deep learning algorithms is the same, the implementation results of those studies are included here for the sake of comparison. The results of the hardware implementations show that the proposed classifier reaches 1.3× higher GOPs/Slice than similar state-of-the-art FPGA-based accelerators. Lower power consumption is also achieved, as the number of FPGA slices used in the proposed system is lower than in other state-of-the-art hardware. Such a trade-off constrains the GOPs factor, which is not critical for most slow biomedical applications. The required response time of the system must be seriously considered upon such modifications. For example, when layers are added to the CNN or LSTM, the amount of calculation increases; therefore, the number of parallel MAC processors in the MACs module must be increased to keep the response time of the system constant.
Surface Electromyography (sEMG) Case Studies
To test the proposed architecture without the CNN layer, we use CapgMyo [28], a hand gesture time-series database recorded by instantaneous surface electromyography (sEMG), as activity patterns in such signals can be detected using signal amplitude. The data was collected by a non-invasive wearable device consisting of 8 acquisition modules. Each module contained a matrix-type (2 × 8) electrode array with an inter-electrode horizontal distance of 7.5 mm and a vertical distance of 10.05 mm. The 128 sEMG time series were band-pass filtered at 20-380 Hz and sampled at 1,000 Hz with 16-bit ADC conversion. Two different experiments were tested. In experiment 1, each of 18 subjects performed 8 basic isometric and isotonic hand gestures: thumb up; extension of index and middle, flexion of the others; flexion of ring and little finger, extension of the others; thumb opposing base of little finger; abduction of all fingers; fingers flexed together in fist; pointing index; and adduction of extended fingers. The result of this experiment is termed DB-a in the database. In experiment 2, each of 10 subjects performed 12 maximal voluntary contraction (MVC) force hand gestures: index flexion, index extension, middle flexion, middle extension, ring flexion, ring extension, little finger flexion, little finger extension, thumb adduction, thumb abduction, thumb flexion and thumb extension. The result of this experiment is termed DB-c in the database. First, we evaluate the system by classifying the DB-a actions. The output classes are therefore separated into 8 different actions, and the envelopes of the EMG signals (obtained using a Hilbert transform) are extracted and applied to the networks as inputs. The classification training loss and test accuracy rates of various networks with different sizes, along with hardware results, are shown in Table 3.
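The envelope extraction step mentioned above can be illustrated with a small, dependency-free Python sketch. It computes the magnitude of the analytic signal by zeroing the negative-frequency bins of the discrete Fourier transform and doubling the positive ones, which is the usual discrete construction of the Hilbert-transform envelope. The direct O(n²) DFT is used only to keep the sketch self-contained (an FFT would be used in practice), and the text does not say whether any further smoothing was applied to the envelopes; the function names are illustrative.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def envelope(x):
    """Envelope of a real signal: magnitude of its analytic signal."""
    n = len(x)
    if n == 0:
        return []
    spectrum = dft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies, drop negative ones.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        for k in range(1, n // 2):
            gain[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    analytic = dft([s * g for s, g in zip(spectrum, gain)], inverse=True)
    return [abs(v) for v in analytic]
```

As a sanity check, a pure cosine with an integer number of cycles in the window has a constant envelope equal to its amplitude.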
The table shows that the hardware classification rate obtained from the proposed structure is similar to the performance in [28] . It should be noted that 150 frames, (equivalent to 150 ms) is the window size suggested by several studies of It is evident that the ternary network converges to its final value slower that full precision networks. pattern recognition based prosthetic control [28] . Therefore various options for q and ω s can be considered while the multiplication of both these parameters should be no more than 150. As q · ω s is the latency of the system to make the final classification decision, the hardware can be efficiently used if ω s takes the lowest possible value while keeping q · ω s fixed by increasing q. In this case, the network with ω s of 10 is chosen to be implemented on hardware. Similar experiments are performed on DB-c actions and results are reported in Table 4 , however the classification rate of this network is not reported in [28] . Here, again the network architecture with narrower length of input is chosen to be implemented on hardware. Results from both tables show that the proposed hardware can achieve an accuracy comparable with a full precision network with about 40% and 30% more neurons respectively for DB-a and DB-c experiments. Although the number of neurons in the quantized ternary networks is higher than in the full precision ones, a significant hardware efficiency improvement is still seen in the quantized networks. Training loss traces for both experiments with different network structures are shown in Fig.7 . Results show that the loss function in the ternary network reaches to the required minimum value, albeit slower than the full precision networks in both DB-a and DB-c experiments.
Note that this slower convergence only lengthens the training phase, which is not critical as the network is trained once for every application. Considering the sampling rate of 1000 Hz and the chosen ω_s, the system must respond within (1/1000) × 10 s = 10 ms per input window in order to operate in real time. According to Table 3, the DB-a experiment requires (5 × 128 + 250) × 250 ≈ 220 K operations per input window, which the hardware classifier can deliver in ∼35 µs; this is negligible compared with the required response time (10 ms). These operations take longer for the DB-c experiment, as more neurons are embedded in the network. According to Table 4, the DB-c experiment requires (10 × 128 + 350) × 350 ≈ 570 K operations per input window, which can be delivered in ∼90 µs and is again negligible compared with the required response time (10 ms).
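The arithmetic above can be checked with a short script. The operation-count expression, the window parameters and the approximate 35 µs and 90 µs latencies are taken from the text; the function names and the derived utilization figures are illustrative only.

```python
def ops_per_window(w_s, channels, n_hidden):
    """Operations per input window, using the expression given in the text."""
    return (w_s * channels + n_hidden) * n_hidden


def realtime_budget_s(w_s, fs_hz):
    """Time available per window: w_s samples acquired at fs_hz."""
    return w_s / fs_hz


if __name__ == "__main__":
    budget = realtime_budget_s(w_s=10, fs_hz=1000)  # 10 ms per window
    # (label, w_s used in the count, hidden neurons, latency quoted in text)
    cases = [("DB-a", 5, 250, 35e-6), ("DB-c", 10, 350, 90e-6)]
    for label, w_s, n_hidden, latency in cases:
        ops = ops_per_window(w_s, channels=128, n_hidden=n_hidden)
        share = latency / budget
        print(f"{label}: {ops} operations, "
              f"{share:.2%} of the real-time budget")
```

Taking the approximate latencies at face value, both cases use well under 1% of the 10 ms budget, which is what the text means by negligible.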
The confusion matrices extracted from the CapgMyo dataset for DB-a and DB-c are illustrated in Tables 5 and 6, respectively. In these experiments, the trained classifier is run 200,000 times on the DB-a and DB-c databases. The confusion matrices compare target and predicted hand-gesture classes during the test stage to identify the nature of the classification errors as well as their quantities. The correct predictions for each output class are shown in bold in the tables. The misclassifications highlighted in the tables follow the similarities between hand gestures. For example, in Table 5, class 1 (thumb up) is misclassified in 0.43% of cases as class 6 (fingers flexed together in fist), which is the closest gesture to class 1 in the dataset. The same applies in Table 6, where class 6 (ring extension) is misclassified in 3.04% of cases as class 7 (little finger flexion). Fig. ?? illustrates the response time of the proposed hardware classifier for various input window sizes and numbers of hidden neurons (N_h). The response times for the employed datasets (DB-a and DB-c) are marked with red square boxes. It should be noted that the proposed architecture can be conveniently modified for larger networks while still delivering an adequate response time.
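A confusion matrix of the kind described above can be computed as in the following sketch. It assumes that rows are target classes, columns are predicted classes, and each row is normalized to percentages, which matches statements such as "class 1 is misclassified in 0.43% of cases as class 6" but is not stated explicitly; `confusion_matrix_percent` is a hypothetical name and classes are indexed from 0 here.

```python
def confusion_matrix_percent(targets, predictions, n_classes):
    """Row-normalized confusion matrix in percent.

    Entry [t][p] is the percentage of samples with target class t that
    were predicted as class p. The diagonal holds the correct predictions.
    """
    counts = [[0] * n_classes for _ in range(n_classes)]
    for target, predicted in zip(targets, predictions):
        counts[target][predicted] += 1
    percent = []
    for row in counts:
        total = sum(row)
        # Leave an all-zero row for classes that never occur as a target.
        percent.append([100.0 * c / total if total else 0.0 for c in row])
    return percent
```

For example, with targets `[0, 0, 0, 0, 1, 1]` and predictions `[0, 0, 0, 1, 1, 1]`, one quarter of the class-0 samples end up in the off-diagonal entry of the first row.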
Heart-Related Case Studies
To test the proposed generalized time-series classifier, we conduct our simulations on three datasets related to heart diseases, extracted from the well-known UCR datasets [29] and the PhysioNet 2016 and 2017 challenges [30]. The UCR datasets record heart activity using an electrocardiography (ECG) device. The mean and variance of the UCR datasets are close to zero and one, respectively. The ECG5000 dataset originates from [31], the BIDMC congestive heart failure database, which consists of records of 15 subjects with severe congestive heart failure (NYHA class 3-4). Each individual record spans 20 hours and contains two ECG signals sampled at 250 Hz with 12-bit resolution over a range of ±10 mV. ECG200 was formatted in [32] and includes two classes, normal heartbeat and myocardial infarction; the dataset is a subset of [33], which contains 35 half-hour records sampled at 125 Hz. In PhysioNet 2016 [34], heart sound recordings were collected from several contributors around the world, gathered in either clinical or nonclinical environments, from both healthy subjects and pathological patients. The Challenge training set consists of five databases (A through E) containing a total of 3,126 heart sound recordings, lasting from 5 seconds to just over 120 seconds. All recordings have been resampled to 2,000 Hz and are provided in .wav format. Each recording contains only one PCG lead. The PhysioNet 2017 challenge data were sampled and stored at 300 Hz with 16-bit A/D conversion, a bandwidth of 0.5-40 Hz and a dynamic range of ±5 mV. It should be noted that 70 percent of the data provided online is allocated to the training set and the rest to the test set. All of the datasets are recorded from a single channel. However, our model can readily handle multivariate time series by adding another dimension to the convolution layers. To evaluate our algorithms, four experiments are performed.
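The 70/30 partition can be sketched as below. The text does not say whether the split is random, stratified or sequential, so the seeded random shuffle used here is an assumption, and `split_train_test` is a hypothetical helper rather than the authors' code.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=0):
    """Split records into training and test subsets.

    Assumption: a seeded random shuffle; the paper only states the
    70/30 proportion, not how records are assigned.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(records))
    train = [records[i] for i in indices[:cut]]
    test = [records[i] for i in indices[cut:]]
    return train, test
```

Fixing the seed keeps the partition reproducible across runs, which matters when full-precision and ternary networks are compared on the same test set.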
Table 7 lists the learning parameters along with the characteristics of each network for the different datasets. It should be noted that no preprocessing was performed on the datasets, and the CNN automatically extracts the important features from the input signal. The CNN has two layers with 10 and 30 filters of size 1 × 5 and 1 × 3, respectively. As mentioned in Section III, the number of hidden neurons is chosen to be 350 for the ternary-precision experiments. Simulations were performed using both Python and MATLAB, and the results are given in Table 8, confirming that the full-precision version of the proposed model outperforms all previously reported state-of-the-art records. In addition, the quantized models achieve acceptable accuracy compared with the full-precision implementation, and even better accuracy on some benchmarks.
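The size of the feature extractor follows directly from the filter counts and widths given above. The sketch below assumes stride 1, no padding, no pooling and a single input channel, none of which is stated in this section; biases are excluded, and the helper names are illustrative.

```python
def conv1d_output_length(length, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution (standard formula)."""
    return (length + 2 * padding - kernel) // stride + 1


def conv1d_weight_count(in_channels, out_channels, kernel):
    """Number of filter weights in a 1-D convolution layer, biases excluded."""
    return in_channels * out_channels * kernel


# (number of filters, filter width) for the two CNN layers given in the text.
LAYERS = [(10, 5), (30, 3)]


def describe_cnn(input_length, in_channels=1):
    """Return (output length, output channels, total weights) of the stack.

    Assumption: stride 1, no padding, no pooling between layers.
    """
    length, channels, total_weights = input_length, in_channels, 0
    for filters, kernel in LAYERS:
        total_weights += conv1d_weight_count(channels, filters, kernel)
        length = conv1d_output_length(length, kernel)
        channels = filters
    return length, channels, total_weights
```

Under these assumptions, a multivariate input only changes `in_channels` of the first layer, which is the "extra dimension" referred to for multichannel time series.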
Training loss traces for all case studies with various weight precisions are shown in Fig. 7. The results show that the loss function of the ternary network reaches the required minimum value in all experiments, albeit more slowly than that of the full-precision networks. Note that this only lengthens the training phase, which is not critical as the network is trained once for every application.
where HN is the number of hidden neurons and ω_s is the input window size.
MAC Operations: In deep learning algorithms, the MAC unit normally dominates the other processing parts; therefore, a lower number of such units saves significant area and latency in the design. Here, we first estimate the number of MAC operations required by the classifier, and then calculate the latency of the proposed hardware classifier from the design specifications. Table 9 gives memory and MAC estimates for all case studies with various architectures and weight precisions. It should be stressed that the weight bit length considered for full precision (FP) is 32. As shown in the table, the memory required by the T-CNN-LSTM classifier is, for all case studies, lower than the capacity of the FPGA Block RAM, implying that the model can be conveniently implemented on the FPGA. Moreover, according to the number of required MAC operations and the GOPs of the proposed hardware classifier (see Table 2), the response time is less than 0.2 ms per input window, which is fast relative to the sampling rate of the input heart signals.
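The memory comparison between weight precisions can be approximated as below. Only the 32-bit full-precision width comes from the text; the 2-bit encoding for ternary weights {-1, 0, +1}, the weight count and the Block RAM capacity used in the example are assumptions chosen for illustration and are not values from Table 9.

```python
FP_BITS = 32       # full-precision weight width stated in the text
TERNARY_BITS = 2   # assumption: simple 2-bit code for {-1, 0, +1}


def weight_memory_bits(n_weights, bits_per_weight):
    """Storage needed for the weights alone, in bits."""
    return n_weights * bits_per_weight


def fits_in_bram(n_weights, bits_per_weight, bram_bits):
    """True if the weights fit in an on-chip memory of bram_bits bits."""
    return weight_memory_bits(n_weights, bits_per_weight) <= bram_bits


if __name__ == "__main__":
    n_weights = 500_000    # hypothetical model size
    bram_bits = 4_000_000  # hypothetical Block RAM capacity
    for label, bits in (("FP", FP_BITS), ("ternary", TERNARY_BITS)):
        size = weight_memory_bits(n_weights, bits)
        verdict = "fits" if fits_in_bram(n_weights, bits, bram_bits) else "does not fit"
        print(f"{label}: {size} bits, {verdict}")
```

With a 2-bit encoding the weight storage is 16 times smaller than at 32 bits, which is the kind of margin that lets a ternary model stay within on-chip memory where a full-precision one might not.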
