Empirical mode decomposition (EMD) is a data driven method for nonstationary data analysis which has proven to be extremely useful in diverse applications in biomedical engineering. In its original formulation, the EMD method works collectively on a batch of data and hence is not suitable for online applications that demand continuous data processing capability in real-time. To cater for such applications, few complete field-programmable gate array (FPGA) based designs for EMD computations have emerged in recent years, yet they have found limited practical applications. The main reason could be that existing complete FPGA based architectures invariably work on integer data sets thus ignoring the sensitivity EMD to truncation errors. Moreover, these architectures mostly implement simplistic linear interpolation technique for sifting process within EMD, thereby compromising its accuracy. To that end, we develop a complete FPGA-based architecture for EMD which works on fixed point data formats thereby alleviating the main source of error in existing EMD based architectures. We also implement well established and accurate cubic spline interpolation technique in addition to giving provision to linear interpolation within the sifting process in EMD, which further improves its accuracy. Our proposed pipeline design ensures that despite implementing computationally expensive EMD related tasks, the proposed architecture exhibits lowest execution time among all existing EMD architectures. We provide examples of computation of EMD decomposition through proposed architecture for both synthetic and real world biomedical signals to demonstrate the prowess of our work.
I. INTRODUCTION
Empirical mode decomposition (EMD) [1] is a data driven method for time-frequency analysis and decomposition of nonlinear and nonstationary signals. Unlike conventional multiscale methods, such as windowing Fourier transform and wavelet transform, EMD does not make any a priori assumptions on input data. It uses sifting process to decompose data into its intrinsic multiple scales, known as intrinsic mode functions (IMFs). IMFs have been designed to be roughly zero mean oscillatory components which admit well-behaved Hilbert spectrum; the resulting transform is known as Hilbert-Hang transform (HHT) [1] .
The strength of EMD lies in its ability to process non-stationary signals effectively owing to its data driven nature. As a result, EMD and the associated HHT have been widely used in many real world applications [2] - [6] .
The associate editor coordinating the review of this manuscript and approving it for publication was Jing Liang . Specifically, EMD has been found useful in a wide range of biomedical applications: some of those applications include EEG motor imagery data classification [7] , EEG based epileptic seizure detection [8] , automated estimation of fetal cardiac timing events from Doppler ultrasound [9] , heart rate variability (HRV) signal processing for locally anesthetized patients [10] and analysis of cardiotocographic signals for classification of vaginal vs C-section delivery [11] . However, computation of EMD decomposition is a time consuming task, mainly due to its inherent spline interpolation step and recursive operation. A software based approach to achieve high-speed EMD computation with high accuracy is difficult. Moreover, in many real world biomedical applications, both online and real-time computations are required, e.g., online optical dialysis monitoring [12] , monitoring the effects of hypoxia in bowel dialysates in metabolites [13] and online motor imagery data monitoring [14] .
To achieve that, an efficient hardware architecture for the computations of EMD is required which is both online and VOLUME 7, 2019 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ can operate in real-time. Earlier attempts to obtain architectures for faster implementation of EMD included a graphics processing unit (GPU) based parallel implementation of the algorithm by Waskito et al. [15] . The architecture could achieve up to 28 times speedup as compared to the serial C-based implementation of EMD. A field programmable gate array (FPGA)-based accelerator for HHT was reported in [16] . In that design though, the core function of EMD was performed by a software and FPGA only served as an auxiliary accelerator which limited its overall performance. FPGA and a digital signal processor (DSP)-based architecture of EMD was presented in [17] . The FPGA was used as a controller, and the EMD computations were carried out on a DSP chip. However, a limitation of that design was that it was not able to process the signals with frequencies higher than 1 kHz. Moreover, the design was restricted in its performance, as the core EMD algorithm was not computed on the FPGA. More recently, EMD architecture employing cubic spline interpolation (CSI) was presented in [18] with limited performance capability. The first complete FPGA-based architecture for EMD computation online and in real-time was proposed in [19] . In that design, the interpolation of minimum and maximum points to compute lower and upper envelopes was performed in parallel for real-time processing. In addition, multiple modules were cascaded to enable serial pipeline structure to calculate each iteration of an IMF. Due to the pipeline structure and linear interpolation, computation time had been decreased and clock frequency was increased up to 12.5 MHz. However, the architecture used a simple linear interpolation scheme instead of CSI, resulting in sub-optimal IMFs. Recently, an online hardware architecture for EMD computation has been proposed that achieves high throughput with low memory requirements [20] . That is achieved by using online interpolation technique with data reuse capability and decomposition of iteration loop. Similarly, an FPGA based architecture for bivariate extension of EMD was presented in [21] that achieved the maximum clock frequency of 24 MHz. This design was implemented on Xilinx Kintex 7 FPGA and used linear interpolation technique for the interpolation of upper and lower envelopes with pipelined structure. Both of the above designs used parallel architecture albeit offered limited performance improvement due to linear interpolation technique.
More recently, Chen et. al. have presented a silicon intellectual property (SIP) core for EMD which has been parametrized by using auxiliary software system [22] . Three designs with different interpolation techniques namely linear, Hermitian and cubic spline interpolation (CSI) have been used. A major drawback of their design is that it employs batch processing with CSI which results in high execution time for IMFs generation and limited application solely for offline EMD computation. All designs were implemented on FPGA with maximum operating frequency of 50MHz; the execution time to perform the EMD operation on 1024 samples with CSI was 560µs. Another architecture based on Xilinx System Generator (XSG) has been presented in [23] for implementation of EMD on FPGA using linear interpolation scheme. The design was able to achieve maximum frequency up to 33MHz and the input signal was sampled at 50 kHz. The design took 16ms to perform EMD on 8192 sample.
Despite recent efforts to develop a real-time hardware architecture for EMD computation, existing online EMD architectures have found limited real world applications owing to their accuracy constraints. That could be attributed to the fact that all current complete FPGA based architectures either use simplistic linear interpolation scheme in EMD or work on integer data set only. In this article, we specifically show that: i) existing EMD architectures are highly sensitive to both round-off and truncation errors; ii) EMD output depends heavily on interpolation scheme being used in the sifting process; iii) existing EMD architectures exhibit low throughput. To counter the above challenges, it is important to develop an EMD architecture that can operate on fixed point data inputs. Moreover, while CSI is widely used in software based EMD computation for batch data processing, it is yet to be employed for online and real time computation of EMD using FPGA based architectures. To that end, our contributions in this article are specifically listed below: i) a low-cost FPGA based hardware architecture for computation of online EMD has been presented that uses fixed-point architecture; ii) we implement both CSI and linear interpolation for the sifting process within EMD; iii) the proposed design achieves highest throughput among all existing EMD designs. Our proposed design exploits inherent parallelism in EMD operation and propose an efficient pipelined architecture to execute the sifting process within EMD. That ensures significantly higher throughput for the computation of IMFs at the cost of slight increase in hardware resource utilization, using our proposed design as compared to existing EMD architectures. That is despite the fact that we implement computationally expensive tasks including CSI and fixed point architecture in our design. Our design, therefore, offers increased accuracy for EMD computation owing to the CSI scheme and fixed point architecture which exhibiting higher throughput among all available FPGA based architectures of EMD. The proposed designs also handle online data which in combination with its accuracy and computational efficiency should pave the way for its wider use in diverse engineering and biomedical applications. We provide a detailed comparison of our design with the existing state-of-the-art designs for EMD.
The rest of the paper is organized as follows. The details of the EMD algorithm are presented in Section II. Section III describes the implementation of the proposed design. The simulation results and discussion is presented in section IV. Finally, conclusions are given in section V.
II. REVIEW OF EMD ALGORITHM
Empirical mode decomposition is a data driven method used for the decomposition of input data into its inherent scales, known as intrinsic mode functions (IMFs). EMD employs an iterative sifting procedure to decompose input data x(t) into multiple IMFs, c m (t), as follows:
(1)
where r(t) denotes the monotonic signal which represents the trend in input data. The IMFs c m (t) have been designed to fulfill the following two properties: I. The difference between the numbers of extrema points and zero-crossings must either be equal or differ at most by one. II. The value of the local mean envelope that is defined by local minima and local maxima must be equal to zero. The sifting process involves the computation of maximum and minimum local envelopes of a signal which are averaged to obtain the local mean. It is then iteratively subtracted from the signal to yield high oscillatory component of the signal. If the high oscillating signal fulfills the IMF criteria, it is designated as an IMF, otherwise the sifting process continues till an IMF is extracted. Once extracted, an IMF is subtracted from input signal to yield the residual signal which then becomes input to the sifting process for the extraction of next IMF. This process is stopped when all oscillatory components are extracted from the signal.
Algorithm 1 gives step-wise details of the EMD sifting process. Several criteria have been proposed in literature to stop the sifting process and detect an IMF [24] - [26] . Reader are referred to [25] for further details. In our work, a simple S-number criterion is employed which stops the sifting process after fixed S-number of iterations [25] .
Algorithm 1 The Sifting Process in EMD
1: Find the maximum and minimum location of input signal x(t); 2: Calculate the upper e max (t) and lower e min (t) envelop by respectively interpolating the maxima and minima of x(t) using cubic spline interpolation; 3: Compute the local mean of the signal:
(e min (t) + e max (t)); 4: subtract the local mean m(t) from the input signal x (t) to yield
5: If h (t) satisfies the IMF criteria as mentioned above, then c (t) = h(t) is an IMF, otherwise set x (t) = h (t) and repeat from step 1.
A. CUBIC SPLINE INTERPOLATION
Cubic spline interpolation of extrema is a critical and the most time consuming step in EMD computation. Here, we give a theoretical background of cubic spline interpolation (CSI) process. That is followed by the illustration of our proposed architecture to implement CSI in the next section. Given n function values (x 0 , f 0 ), (x 1 , f 1 ), . . . (x n , f n ) n lower-order polynomials q 0 (x), q 1 (x), . . . q n−1 (x) could be used for interpolation over n subintervals (x 0 , x 1 ), (x 1 , x 2 ), . . . (x n−1 , x n ) respectively. This way, a single interpolation function g(x), called a spline, can be composed such that
Note that, g (x) = q 0 (x) in the interval x 0 ≤ x ≤ x 1 and g (x) = q 1 (x) in the interval x 1 ≤ x ≤ x 2 and so on.
In other words, spline interpolation is piecewise polynomial interpolation. Cubic spline refers to g(x) which is continuous over the whole interval (a, b) and has continuous first and second order derivatives, in addition to satisfying (2) . Moreover, between adjacent nodes (x i , x i+1 ), g(x) is specified by a polynomial of degree 3 or less. In many cases, we also require the tangents (derivative of g(x)) be specified at end points i.e.,
which results in a unique cubic spline g(x). Let c j = 1 h j = 1 (x j+1 −x j ) , where j=0, 1, 2 . . .n − 1 and let k 1 , k 2 , . . . , k n−1 be the derivates of individual cubic polynomials at nodes i.e., q j x j = k j and q j x j+1 = k j+1 , then the unique cubic polynomial q j (x) satisfying (3) is given by [27] 
We can simplify (4) by using:
to get the following (n − 1) equations,
A free or natural conditions can be used for the practical implementation of cubic splines:
The linear system of equation given in (5) under natural conditions (6) , can be written in the matrix form as follows:
where
Note that the matrix A is tridiagonal and can be easily solved by using the tridiagonal matrix algorithm (TDMA), also known as the Thomas algorithm [28] .
The solution of (7) yields the k vector which in turn could be used to obtain the coefficient of q j (x) given below:
The coefficients a j , b j , c j and d j are given by
It is clear from the above equations that the CSI implementation within EMD is a computationally expensive task. A computationally lighter alternative of CSI could be the linear or sawtooth interpolation (STI) which can be considered as a special case of CSI by keeping c j = d j = 0 in (8), resulting in q j (x) which is given as follows:
Note that, in linear interpolation scheme, the derivative k j is effectively set to k j = 0, causing discontinuity of q j (x), at nodes that propagates throughout the sifting process thus resulting in some loss of accuracy. Having said that, the loss of accuracy due to linear interpolation is significantly less in the online EMD computation since only finite number of extrema points are considered in the formulation of spline; this is in contrast to the standard EMD where batch processing is performed and all extrema points are used only once for spline computation. Choosing higher-order interpolation scheme (such as spline) becomes critical in those cases and that has been demonstrated in many studies.
III. FPGA BASED IMPLEMENTATION OF EMD WITH FIXED POINT ARCHITECTURE
In this section, we describe the proposed FPGA architecture for online computation of EMD. Firstly, a top-level block diagram of the proposed architecture is given followed by details of subsequent individual modules of the proposed design.
A. TOP-LEVEL ARCHITECTURAL DESIGN OF EMD Fig. 1 illustrates the operational flow of the proposed architecture for EMD. Here, the input data x(t) is a 16-bit signal that is obtained from an analog to digital (A/D) converter which has the sampling frequency equal to the operational frequency of our system i.e., f max = 46.44 MHz.
The signal x(t) is fed to the IMF generator module to produce the first IMF c 1 (t); a residue r 1 (t) is also generated by subtracting c 1 (t) from the delayed version of input signal. The residue is given as an input to the second stage IMF generator to produce the second IMF c 2 (t) and so on. An IMF generator block consists of upper envelope and lower envelope computation modules with an adder, a divider and a delay module (denoted by z −d 2 ). The upper and lower envelope of the given input signal are first generated in parallel, then their average is computed to obtain the local mean. The local mean value is then subtracted from the delayed input signal to yield the output h 1 1 (t), which becomes input of the next level IMF generator block for the computation of second stage local mean and the output of the second iteration h 2 1 (t). Since we employ S-number stopping criterion [25] to generate an IMF, the proposed architecture computes fixed S number of iterations to yield a set of outputs h j 1 } s j=1 for the first IMF. The last iteration output h s 1 (t) is taken as the first IMF i.e., c 1 (t) = h s 1 (t). Next, the residue signal,r 1 
is computed that is given as input to the next block for the generation of second IMF c 2 (t) by using the same process as shown in Fig. 1 . In this way, M number of IMFs c i } M i=1 are obtained in addition to the monotonic residue signal r(t) = r M (t).
In our design, there are two types of delays, denoted by d 1 and d 2 , which are required for the implementation of the proposed architecture. The first delay d 1 is needed to generate the residue signal r 1 (t) for input to each IMF generator block (as shown in Fig. 1 ), whereas the delay d 2 (as shown in extended lower part of Fig. 1 ) is needed to obtain at least 3 extrema values before proceeding to calculate h 1 1 (t). Since S number of iterations are required to calculate each IMF, so the total delay d 1 associated with each iteration block to generate an IMF is
and k is the initial delay that will be explained in next section.
B. ENVELOPE GENERATION
Envelope generation is an important step in the sifting process within EMD. Two envelopes (upper and lower) are required for the computation of local mean of the input signal. In the proposed architecture, cubic spline interpolation (CSI) is used for the generation of envelopes which is one of the most widely used approaches to generating envelopes in EMD.
There is also a provision to use linear interpolation scheme which has been implemented as a special case of CSI in our design. Fig. 2 shows the block diagram of the proposed envelope generation module. The module consists of three basic blocks: i) extrema identification block for the detection of extrema (maxima and minima) of the signal; ii) controller module to control the data flow (extrema points and location) for interpolation; and iii) interpolation block to perform cubic spline interpolation or linear interpolation on extrema points to generate upper and lower envelopes.
The input data values, denoted by x in , are sent to 16-bit Registers A, B and C sequentially at each clock cycle. This way, three consecutive data values of x in are given as input to the extrema identification (EI) block. The EI block detects extrema points of a signal and stores their associated indices. The extrema points are detected by comparing the values of Register A, Register B and Register C by using comparators.
We used an improved extrema detection architecture, from our previous published work [21] , which is robust and can also detect extrema on flat signal regions. Whenever the EI module detects the extrema points, it generates a trigger Trg signal and subsequently three tasks are performed in parallel: (i) the extrema value is moved to a circular buffer M by the trigger enable pin (En A) of a tristate buffer, (ii) the location of extrema is stored in a circular buffer P, and (iii) the address location (j) of the circular buffer is incremented to point next storing location.
A controller block takes extrema values and locations from the circular buffers M and P respectively and passes those to the cubic spline interpolation (CSI) module. The CSI module requires at least three extrema points for interpolation. The controller, therefore, has to wait initially for the three extrema. To achieve that, an initial delay of k clock cycles is added to the output.
After the delay of k clock cycles, the controller checks the total number of extrema count from the i Register, which holds the address of current extrema points. If i > 3, the three extrema values M i , M i+1 , M i+2 and their associated position indices P i , P i+1 , P i+2 are extracted from the M and P circular buffers respectively and transferred to the CSI module for interpolation. Once the CSI module completes an interpolation step for every missing point in the range of (P i − P i+1 ), it generates a status signal ''CSI-done'' to the controller and the value of Register i is incremented.
In the next round, the new extrema values and their corresponding positions P i+1 , P i+2 and P i+3 are sent to the CSI block for interpolation. Sliding window of length equal to three is operated on the M and P buffers to fetch the input for CSI block for each EMD round. The CSI block computes the interpolation value at each clock cycle while the extrema identification module also checks for the new extrema in parallel. The envelope generation block is also used, in parallel, for the generation of upper and lower envelope by using the extrema point of x in respectively.
1) SYNCHRONIZATION BETWEEN ENVELOPE MODULES
Local mean estimation within our architecture depends on the proper synchronization between the modules computing the upper and the lower envelopes. That is challenging since EMD is fully data-driven in nature with locations and values of maxima and minima varying with the input signal. VOLUME 7, 2019 To address this problem, we propose an ad hoc solution that is based on the following two operations: i) Firstly, the starting point of the signal is considered as both maxima and minima in proposed design. That is really important to ensure that the maximum and minimum envelopes are both calculated from the signal starting point. Had we not take starting point as both minimum and maximum, the two envelopes would not be synchronized at the start and we would not be able to define the local mean for some portion of the signal especially at the start; ii) Secondly, in the proposed architecture, the delay of k = 30 clock cycles has been used to make sure that adequate extrema locations and corresponding points are obtained to initiate the envelope generation process. As mentioned before that EMD algorithm, by its design, is dependent on extrema points. Therefore, any architecture that computes EMD must first wait for the arrival of adequate extrema points to start envelope generation for EMD computation. In our design, the value of k = 30 is chosen based on our observation regarding many real world signals. The size of the circular buffers M and P is also set to depth (30) × width (20) which provided a good compromise between the flexibility of the design and the required computational performance.
Having said that, for some low frequency signals, k = 30 clock cycles may not be adequate in which case the architecture would fail to synchronize the two envelop generations. In that case, higher values of k could be used.
C. PROPOSED CSI HARDWARE IMPLEMENTATION
In our proposed architecture, since only 3 extrema points are employed for spline formulation, we have implemented both linear interpolation and spline interpolation for online EMD computation; the user can therefore choose the appropriate scheme depending on application at hand.
The architectural block diagram of the proposed CSI implementation is shown in Fig. 3 . CSI block is a combination of three modules: Tridiagonal matrix algorithm (TDMA), CSI parameter and Spline formulation module to calculate x out = q j (x) using (8) . The CSI module have a latency of one clock cycle. The proposed CSI module takes three extrema values for interpolation thus producing an A matrix of size 3 × 3. Using the natural conditions as given in (6), the TDMA block computes the parameter k 1 (k 0 = k 2 = 0) by solving (7) ; specifically, computation block for k 1 is shown in the lower part of Fig. 3 .
The CSI parameter module calculates the coefficients a j , b j , c j , d j by using (9) . The spline formation module takes the above parameters in addition to x and I as input from the controller to perform the interpolation. Hence, x denotes the location (position index) of the value that needs to be interpolated, whereas I denotes the user-defined parameter that selects between the linear and CSI interpolation. Specifically, if I = 0, CSI interpolation is performed by computing (8); if I = 1, linear interpolation is performed by calculating (10) .
The proposed design takes one clock cycle to estimate a single missing value through interpolation. Connecting this CSI module with the overall system design shown in Fig. 2 , we note that the controller transfers extrema points M i and their associated position P i to the CSI module for interpolation.The CSI module generates the required function q j (x) (8) after computing its relevant parameters. In parallel, the envelope generation block computes new extrema values and stores to circular buffer M for the next cycle calculations.
D. DISCUSSION ON CONCURRENCY AND DEPENDENCY OF THE PROPOSED DESIGN
A discussion on the proposed parallel and pipelined architecture of EMD is in order. Being a recursive algorithm, there is a lot of scope for pipelining in the EMD architecture which has been exploited in our proposed design. Moreover, local mean estimation in EMD relies on two independent processes, i.e., the construction of upper and lower envelopes, which in turn depend on maxima and minima identification. Being independent processes, maxima and minima identification and consequently upper and lower envelope construction are implemented in parallel in our design.
To illustrate the overall concurrency and pipelining in the proposed architecture, a block diagram of the different steps in EMD algorithm and their dependencies are shown in Fig. 4 . The computational modules in EMD have been divided into three levels in the proposed design: at the lowest level (Level 3), extrema identification and envelope construction is performed. Note that the proposed design computes maxima and minima and consequently constructs upper and lower envelopes in parallel using the CSI module. However, there is a pipelining involved between the two processes since extrema locations and values are required for envelope construction. At level 2, multiple iterations of the sifting process in EMD are performed: owing to the inherent dependency between the successive sifting iterations, e.g., iteration 2 requires the output of iteration 1 and so on, a pipelined architecture has been employed. Finally, at Level 1, multiple IMFs are obtained through a pipelined architecture. Again, construction of nth IMF requires the output of n − 1th IMF and so on.
Another important step of the EMD algorithm (step 3) is the computation of local mean m(t) of a signal that is achieved through the division of the sum of upper and lower envelopes with a constant 2. In the proposed design, we replace the division by a right shift operation that increases the speed of the proposed architecture. Overall, the high throughput is achieved due to concurrent (parallel) implementation of extrema identification and envelopes construction with multi-level pipelining in the proposed design.
IV. RESULTS AND DISCUSSION
The performance comparison between the proposed architecture of EMD and the available state-of-the-art designs is evaluated in terms of accuracy of the decomposed modes (IMFs) and computational timing analysis. To verify the decomposition mode accuracy, experiments were carried out which involved decomposition of synthetic combination of tones and real-world EEG signal by using the proposed EMD architecture. In both case studies, the input signals had been carefully chosen so that their inherent modes (IMFs) carry some physical meaning.
The proposed architecture was implemented on Xilinx KC705 Evaluation platform (FPGA KINTEX 7, XC7K325T) using Verilog language with fixed point data format of Q16.4. The proposed EMD design was synthesized on Xilinx ISE Design Suite 14.7 and the results were gathered after performing post, place and route procedure. The sampling frequency was set to f sam = 46.44MHz. The S-number stopping criterion was used for the sifting process with the fixed value of S = 4.
A. CASE STUDY 1: DECOMPOSITION OF A SYNTHETIC SIGNAL
A synthetic signal denoted by x(t) is fed as input to the proposed system consisting of additive combination of multiple tones of 100 kHz, 300 kHz and 600 kHz. Mathematically, x(t) is defined as:
x(t) = 150 cos(2π100kt) + 150 cos(2π300kt) +150 cos(2π600kt). (12) where the sampling frequency of x(t) was taken as F s = 46.44 MHz. Fig. 5 illustrates the pictorial representation of x(t).
The input signal x(t) was decomposed into M = 3 IMFs using the proposed FPGA-based EMD architecture. Fig. 6 shows the decomposition results obtained from existing EMD architectural designs, namely CSI-Int [18] , [22] , Linear-Int [19] and the proposed design when using linear interpolation and cubic spline interpolation (CSI).
Note from Fig. 6 (a) that the IMFs obtained from the proposed architecture accurately correspond to the three tones which were present in the input signal x(t). On the other hand, there is significant mode-mixing 1 within IMFs 2 and IMFs 3 obtained from the CSI integer-based architecture, where 100 kHz and 300 kHz tones are spread across both IMFs. Similar mode-mixing problem is observed within IMFs obtained from the linear integer type architecture ( Fig. 6 (d) ). The proposed architecture with linear interpolation (shown in Fig. 6 (c) ) also outperforms and generates better signal decomposition when compared to the existing techniques [16] , [18] , [19] , [22] . Specifically, FIGURE 6. Decomposition of a synthetic signal into IMF1, IMF2 and IMF3/residue for (a) the proposed design CSI-fixed, (b) CSI-Int [18] , [22] (c) the proposed linear-fixed and (d) linear-Int [19] . mode mixing was not observed in the IMFs generated by the proposed linear-fixed point architecture. However, the linear interpolation affected the smoothness of tones in the decomposed IMFs which can be seen in IMF2 and IMF3 ( Fig. 6 (c) ). It is important to highlight that the mode mixing seriously degrades the performance in EMD related applications, including denoising [29] , fusion [30] and classification [7] , and therefore, should be avoided. The proposed design has been shown to do a good job in that regard.
From Fig. 6 , it can be observed that, in all decomposition schemes, there is an initial delay in the computation of all IMFs which is due to the latency of the architecture. The signal decomposition results, shown in Fig. 6 , highlight the superiority of the proposed fixed-point architecture for EMD implementation over existing designs. Specifically, the fixedpoint architectures have been found to be more accurate and provide better precision than the integer based architectures because of less truncation errors. These truncation errors, if present, are propagated throughout the algorithm process, thereby degrading the accuracy of the overall system. This degradation in accuracy due to integer-based architecture is depicted in Fig. 6 . To validate the accuracy of the decomposed modes obtained from the proposed EMD architecture, we give quantitative evaluation of the accuracy of the modes in terms of the root mean squared error (RMSE) of individual IMFs in Table 1 . Note that while the error in the first IMFs is comparable in all designs, which was also evident in Fig. 6 , the RMSE in IMF2 and IMF3 are significantly lower in the case of the proposed CSI-fixed point architecture. The truncation errors due to the integer-based implementations introduced higher RMSE values than the proposed fixed-point EMD architecture. The proposed CSI-fixed EMD architecture achieved the best results as compared to the other techniques including the proposed Linear-Fixed EMD architecture. This could be attributed to the higher modeling accuracy of lower-order cubic spline polynomials as compared to the linear interpolation function.
B. CASE STUDY 2: DECOMPOSITION OF BIOMEDICAL EEG DATA
The performance of the proposed architecture in terms of accuracy of decomposed modes was also evaluated for a real-world electroencephalogram (EEG) signals. The EEG signals were acquired to measure the brain activity during i) eyes-closed (relaxed state) and ii) eyes-open (attention state). During the acquisition, the subject was first asked to close his eyes in the relaxed state for 5 seconds duration followed by the eyes open (attention state) for the next 5 seconds.
The EEG data was acquired from the Open BCI Cyton board. We were interested to decompose alpha wave (in the range of ∼10Hz) from the input EEG dataset by using EMD. The input EEG data was decomposed into M = 4 IMFs using the proposed architecture.
The acquired EEG signal along with its power spectral density (PSD) estimate are shown in Fig. 7, whereas Fig. 8 shows the IMFs obtained from the proposed EMD architecture. Fig. 7 (bottom) , it can be easily observed that a significant 10Hz signal component is present in the PSD of the input EEG data which corresponds to the alpha wave due to the brain activity during relaxed (eye-closed) state. Therefore, this 10Hz activity is the desirable (or target) component in the EEG signal, which needs to be decomposed by the EMD. This mode was captured in the 3 rd IMF obtained from the proposed EMD architecture as shown in Fig. 8 ; the other IMFs correspond to noise, and other artifacts present in the EEG signal. However, there are few instances of mode-mixing across IMFs in the decomposition result as well, which is due to the inherit limitation of the EMD-based methods, especially in the case of real-world signals. We also show the IMFs obtained from the EMD architecture using CSI and integer data type, as shown in Fig. 9 . Unlike the proposed architecture involving CSI with fixed data type, it can be observed in this case that the 10Hz component, corresponding to the alpha wave, was spread across multiple IMFs (c 2 (t), c 3 (t) and c 4 (t)) due to severe mode-mixing arising from truncation errors.
From the
EMD is widely used in many biomedical applications including those involving EEG signals. However, mostly those applications involve offline computations due to lack of efficient online hardware architectures of EMD. The proposed architecture can bridge that gap and provide a good platform for online biomedical applications of EMD.
C. HARDWARE RESOURCE UTILIZATION AND COMPUTATIONAL TIME ANALYSIS
In this section, hardware resource utilization and computational timing analysis of the proposed design are presented. The proposed architecture is synthesized using Xilinx ISE design suite for Xilinx KC705 Evaluation platform (FPGA Kintex 7, XC7K325T). The information associated to the number of resources used, the maximum operating frequency, and the number of slices (required area) used by our architecture is given after performing the post place and route procedure. Table 2 illustrates the device utilization of the proposed design. Note that a small number of slice look up tables (LUTs) were utilized by the proposed design, which ensures less silicon area usage. Only 13% slice registers, and 18% of LUTs with 13% DSP48E1 slices from the available resources are used in the proposed EMD architecture. Table 3 summarizes the timing details of the proposed architecture. The proposed design achieves maximum operating frequency of 46.44MHz. It requires latency of at least three extrema points to detect before starting computational process, so there will be some initial delay about 0.56 µs to fill the pipeline stages. The computation time for the first IMF is 24.19 µs for 1000 input samples. The proposed system merely consumes 32.21 µs to complete the decomposition of input signal into four IMFs and a residue for 1000 samples. Furthermore, the proposed system achieves a high data throughput of 620 Mbps.
The above resource utilization report has been given for the proposed architecture using CSI and fixed point data format. The same design but using linear interpolation requires less resources as expected due to very simple mathematical operations required to compute (10) . Specifically, the CSI based design requires six division operations as opposed to a single division operation required for linear interpolation in each cycle.
D. COMPARISON WITH THE STATE-OF-THE-ART DESIGNS
In Table 4 , we give a detailed comparison between the proposed architecture and existing state-of-the-art designs [16] - [19] , [21] - [23] in terms of maximum operating frequency, throughput, hardware platform, envelope generation technique, architecture design and used data format. Since existing FPGA-based architectures for EMD have been implemented on Xilinx Spartan 6 (XC6SLX150), Altera Stratix III (EP3SL150F1152C2) equivalent to Xilinx Virtex-5 (ML506) and Kintex 7 platforms, we also implemented our proposed design on all the above platforms to give a fair comparison with state-of-the-art designs.
According to the Table 4 , the proposed design achieves significantly high throughput to compute the IMFs as compared to the conventional designs. Specifically, the proposed design achieves 620Mbps using the Xilinx FPGA Kintex 7 XC7K325T.
The maximum operating clock rate and sampling frequency achieved by our proposed design is 46.44MHz which is lower than the operating frequency of designs proposed in [16] , [17] and [22] . Having said that, the throughput of the proposed design is significantly higher than other architecture owing to its efficient pipelined architecture that consumes less number of clock cycles to compute the EMD decomposition at higher clock rate.
The EMD accelerator presented in [16] implemented a hardware and software co-design (a partial hardware realization) to improve operating frequency up to 100MHz but that design consumes large number of clock cycles due to only partial hardware implementation leading to a higher execution time (equivalent throughput: 0.045Mbps) than the proposed architecture. The FPGA + DSP design [17] achieves the maximum frequency of 50MHz. This combination of FPGA and DSP based design used CSI for interpolation but only managed the computation time of 2.78 seconds (equivalent throughput: 0.0083Mbps) which is quite high thus prohibiting its use in any meaningful real-time and online application of EMD. The CSI-partial [18] has low operating frequency (i.e., 8.9MHz) that limits its practical applications in high speed systems. The Lin-Int [19] has a limitation of using linear interpolation for envelope generation. This design [19] was the first complete hardware realization of EMD so its performance was better than [16] - [18] , but it still has significantly lesser throughput and lower operating frequency than the proposed design.
In Table 5 , we give a detailed comparison between the proposed architecture and the existing FPGA based designs [19] , [22] in terms of resource utilization. To compare with the architectures of [19] and [22] fairly, we have implement our proposed architecture on Altera Stratix III (EP3SL150F1152C2) and Xilinx Spartan 6 (XC6SLX150) respectively in addition to Kintex 7 platform. Furthermore, in Table 5 , we provide resource utilization of the proposed design using two different variants: a high-speed design employing advanced division module and its low-resource variant using fewer resources. Note that the hardware resource utilization of Lin-Int [19] and the proposed design are comparable except the higher number of DSP slices used in the proposed design (as shown in Table 5 ). The proposed high-speed design uses higher number of DSP slices but achieves almost 2.5 times higher throughput than Lin-Int [19] . However, the resource optimized based proposed architecture is equivalently faster as Lin-Int [19] with less number of resources used to compute EMD.
It is important to emphasize here that the proposed design performs computationally expensive spline interpolation instead of linear interpolation in [19] for higher accuracy. Still, the proposed architectures compare well when compared with [19] in terms of resource utilization.
The design CSI-Int [22] shows the best performance when compared to the previous designs in terms of throughput. The design uses multiple envelope generation techniques within sifting process and has a throughput of 58.50Mbps mainly due to the large number of clock cycles used to implement the CSI. The design uses Xilinx FPGA Spartan 6 (XC6SLX150) for hardware implementation. For a fair comparison with the proposed design, we have implemented our design on the same FPGA platform and the results show that the proposed design outperformed CSI-Int [22] in terms of throughput.
The proposed system achieved maximum operating frequency of 23.3MHz on Spartan 6 FPGA, which is lower than the operating frequency of CSI-Int [22] , but the proposed design exhibits much higher throughput when using both CSI and linear interpolation i.e., 310Mbps, thus achieves 5× speed up when comparing to the design proposed in [22] . Table 5 shows that the proposed high-speed design achieves higher throughput (i.e., 310Mbps) than CSI-Int [22] at the cost of hardware resources. However, the resource optimized version of the proposed design yields comparable performance to the CSI-Int [22] in terms of hardware resource utilization. The resource optimized version of the proposed EMD architecture also achieves 1.3 times speedup over CSI-Int [22] while keeping higher data rate (i.e., 75Mbps).
It is important to highlight that CSI-Int [22] implements CSI only for the offline computation (using batch processing of input data) of EMD; for online EMD computation, only suboptimal linear interpolation scheme has been implemented in [22] that prohibits the use of CSI-Int [22] in any meaningful online application of EMD. The proposed architecture on the other hand implements CSI (and linear interpolation) for online EMD computation that makes it suitable for real-world online application of the algorithm. In fact, to our knowledge, the proposed architecture is the only complete hardware-based design available for online EMD computation using CSI. Moreover, the efficient pipelined structure in our design renders it with high throughput for IMF computation among all available EMD hardware designs as shown in Table 4 .
Finally, the proposed design is compared with XSG [23] and BEMD-Lin [21] by implementing it on the same FPGA platform (i.e. Kintex XC7K480T/XC7K325T). The XSG [23] uses Xilinx system generation(XSG) tools for implementation of EMD. The design uses linear interpolation scheme and achieves the maximum frequency of 33.3MHz. The execution time for 8192 samples is 16ms and throughput is 12.28Mbps which is lower than our proposed design. BEMD-Lin [21] is based on bivariate extension of EMD but can also handle single channel datasets. This design uses linear interpolation technique for envelope generation and its throughput is higher as compared to [16] - [19] , [22] . However, the proposed design performs better than [21] since it has higher clock rate of 46.44MHz with higher throughput of 620Mbps which is equivalent to around 2× speed as compared to [21] .
The main difference between EMD and BEMD algorithms is in the process through which local mean of a signal is estimated. Contrary to BEMD, which requires multiple projections to compute the local mean, EMD requires only two projections to do the same task. The architecture presented in [21] used linear interpolation scheme for generation of envelope, whereas, the proposed design implements cubic spline interpolation for envelope generation to increase the accuracy of the resultant decomposed signals. However, CSI is implemented on extrema value using sliding window operation on circular buffers M and P to generate envelope in parallel. The throughput of the proposed EMD design is higher than [21] due to fact that the proposed design only uses two projections (upper and lower envelope values) to compute the local mean, therefore, the local mean is calculated by using only shift by one bit instead of division operation which takes higher processing time. On the other hand, the previous architecture [21] takes multiple projections due to algorithmic difference so the division cannot be replaced with shift operator, therefore, the design [21] consumes higher computational time for each iteration which eventually decreases the overall throughput of the system. Moreover, our proposed CSI block used less number of clock cycle for generation of spline which contributes to its improved performance.
Another important highlight of our design is its ability to process fixed point data formats. As highlighted in Table 4 , no other complete FPGA-based design for EMD works on fixed-point data sets. Since EMD is known to be highly sensitive to truncation errors, lack of fixed point architectures for EMD could be main reason why online applications of EMD are almost non-existent. Moreover, the two versions of the proposed design are presented: (i) high speed proposed design for achieving higher throughput and best for the high speed real-time online applications and (ii) low resources based proposed architecture for such low speed real-time online applications where resource usage is more crucial. Our proposed design is unique/novel in that sense and is expected to gain widespread use in online applications of EMD.
V. CONCLUSION
In this paper, we have demonstrated that cubic spline interpolation and employing fixed point data format within EMD computation is crucial for obtaining accurate EMD modes (IMFs). Consequently, a fully FPGA based architecture for the online computation of EMD modes (IMFs) have been presented that can handle fixed point data inputs thus avoiding any errors due to inherent sensitivity of EMD to truncation errors (noise). To our knowledge, no existing complete FPGA-based architecture of EMD works on fixed point data which could be the main reason why EMD based architectures have not gained considerable success in online practical applications. Furthermore, our architecture implements cubic spline interpolation (CSI) within the sifting process of EMD in addition to providing user with the choice of using linear interpolation. Despite using fixed point architecture and computing envelopes with the CSI scheme, our proposed design has been shown to achieve high throughput as compared to other architecture to obtain a given number of IMFs. The clock rate of the proposed design (46.44MHz) is expected to handle high-speed data emerging from many real world applications. We have shown through carefully designed experiments involving synthetic signals that the proposed architecture yields accurate (free of artifacts) IMFs through the proposed architecture which is not possible to obtain from existing architectures due to their inherit source of errors involving integer data format and linear interpolation. An avenue for the future research could be to test the proposed architecture in some real world online application of EMD. Our case study involving the separation of alpha waves from EEG during the relaxed behavior of a subject provides a good option. Moreover, the proposed architecture can only process single-channel data sets, yet many real world biomedical applications demand simultaneous decomposition of multiple channels from multivariate data e.g., EEG and ECG. To that end, hardware architectures for online computation of multivariate extensions of EMD are required.
