ABSTRACT This paper presents an efficient implementation of multivariate empirical mode decomposition (MEMD) algorithm, a multivariate extension of EMD algorithm. Analogous to EMD, MEMD decomposes a multivariate signal into its intrinsic mode functions using joint rotational mode. The algorithm is computationally intensive because of its recursive nature and any increase in input data size results in a nonlinear increase in its execution time. Therefore, it is extremely time-consuming to obtain a decomposition of signal, such as EEG into its intrinsic modes using MEMD. As the interest in applying MEMD algorithm in various domains is increasing, there is a need to develop an optimized implementation of the algorithm, since it requires repeated execution of the same operations and computationally extensive interpolations on each projected vector. This can be done using GPGPU, because it has the power to process similar function on different blocks of data. We have compared the optimized implementation of MEMD, using GPU, with the MATLAB implementation for hexa-variate and hexa-deca-variate data sets, and observed that the GPU-based optimized implementation results in approximately 6× ∼ 16× performance improvements in terms of time consumption.
I. INTRODUCTION
Graphical Processing Units (GPUs) are being used exceedingly to accelerate the performance of signal processing applications because of their ability to perform parallel processing efficiently [1] - [3] . A few years ago, GPUs were expensive and were not readily available. However, nowadays they are readily available in devices ranging from home computers to mobile phones. Thus, any algorithm optimized for GPU can be executed efficiently on a variety of devices.
The computational power of a GPU can be best utilized for tasks requiring parallel processing. The throughput of compute intensive tasks or algorithms can be increased by transforming their structure. Transforming their structure from sequential to parallel may result in an efficient execution on GPU. Therefore, GPU implementations of most signal processing algorithms focus on making the algorithm execution parallel [4] - [7] .
Recently, a signal processing algorithm titled Multivariate Empirical Mode Decomposition (MEMD) has been proposed by Rehman and Mandic [8] . It is a multivariate extension of Empirical Mode Decomposition (EMD) algorithm [9] . The fact that MEMD algorithm is slow, because it is highly recursive [8] , limits its application to large data sets, especially image and video data sets. The interest in using MEMD algorithm for signal processing applications is evident from its recent use for fusion of images from different sensors acquired at different exposures and times [10] - [12] . Although a sequential MATLAB based implementation of the algorithm has been made available by the authors but the available implementation is slow and hence not practical when processing large images and videos including hyperspectral satellite images or ultra high-definition video.
The main contribution of this work is the proposition of an efficient GPU-based implementation of MEMD algorithm. As a first step, an efficient CPU-based implementation of the algorithm was developed. The proposed implementations were compared with the original implementation (available in MATLAB from authors website). 1 The tests were conducted 1 http://www.commsp.ee.ic.ac.uk/mandic/research/emd.htm using hexa-variate and hexa-deca-variate data sets, and the accuracy and speed of execution were compared for each type of implementation.
The rest of the paper is organized as follows: in Section II, we provide the theoretical foundations and the algorithm for multivariate empirical mode decomposition, including some important aspects of CUDA programming model for GPU. Our main contributions are presented in Section III. In particular, in Section III-D we describe the parallel implementation with optimization on GPU. In Section IV, we report comprehensive experimental results to validate our study followed by discussion in Section V. Finally, in Section VI conclusions and future research directions are outlined.
II. RELATED WORK A. EMPIRICAL MODE DECOMPOSITION
Empirical mode decomposition (EMD) [9] is a data-driven technique to decompose non-linear and non-stationary signals into a time series of high-frequency and low-frequency components called intrinsic mode functions (IMFs). It should be highlighted here that unlike wavelet transform, which uses a basis function for frequency domain analysis, the EMD analysis is data dependent and basis function independent [13] . The IMFs obtained using EMD give meaningful instantaneous frequencies after subsequent application of Hilbert transform [9] . The algorithm can be used for pre-processing (denoising, orthogonalization) and timefrequency analysis of real world signals [10] . Although EMD has been used extensively in signal processing related applications it cannot be used for analysis and processing of multivariate signals [8] , [14] - [16] . For multivariate signals, EMD can calculate IMFs of each signal separately with no frequency alignment between each channel. To solve this problem of frequency alignment in a multivariate signal, a multivariate extension of EMD was proposed by Rehman and Mandic [17] that aligns intrinsic joint rotational modes of a multivariate signal.
Consider the application of EMD algorithm on a signal x of length N pts resulting in M oscillatory modes called intrinsic mode functions (IMFs). An IMF represents the inherent temporal modes that describe the input data [11] . The last component without full mode in it is called the residue. EMD represents the sum of all extracted IMFs and residue [11] . The input signal x can be written as:
where C j (k) are IMFs, r is residue and M is the number of IMFs extracted from the input signal.
An IMF is defined by a function that satisfies two conditions:
1) The number of extrema and the number of zerocrossings must be equal to or differ by one. 2) The mean value of envelope defined by local maxima and envelope defined by the local minima is zero.
EMD algorithm requires no information about the input data or the basis functions like Fourier or wavelet type transforms [9] , [18] . MEMD basis is adaptive and sparse [10] . To practically calculate IMFs of an input signal, EMD algorithm applies sifting process till the number of extrema is equal to, or differ by one from the residue of the previous iteration [9] . During the sifting process [9] , we calculate extrema of the signal. Then we interpolate maxima and minima to the maximum length of signal. After that, we take mean of signal by using these interpolated signal and subtract it from input of sifting process. For an input signal x, IMFs using EMD can calculated using Algorithm 1. [8] 1: Initialize the algorithm by setting x (k) = x(k) 2: Find extrema locations of x (k) 3: Interpolate the minima and maxmia to obtain lower and upper signal envelopes, emin(k) and emax(k) respec-
Algorithm 1 Empirical Mode Decomposition (EMD)
as an IMF, otherwise set x (k) = s(k) and repeat Steps 1 to 6
B. MULTIVARIATE EMPIRICAL MODE DECOMPOSITION
In MEMD algorithm, a uniform sampling scheme based on hammersley sequence is used to calculate direction vectors [8] . We take projections of the direction vectors to the input signal [10] . After taking the projections extrema are located. The signal projections result in a multi-dimensional vector and hence extrema calculations result in multiple envelopes. These envelopes are estimated using cubic spline interpolation. We calculate the mean using these mulidimensional envelopes. The difference of the mean of all the envelopes with the original signal results in one IMF. This process is repeated either to get sufficient number of IMFs or when the stopping criterion is met. For stopping, we have extrema less than 3 or squared amplitude of mutivariate signal is less than predefined value.
C. CUDA ARCHITECTURE
Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model developed by NVIDIA for graphics processing units (GPUs) [19] . It enables a tremendous increase in the compute performance by exploiting GPUs raw computational power. The platform is scalable supporting future generations of NVIDIA GPUs. It is built as an extension of C programming language that allows the host processor (the CPU) to launch parallel compute kernels on the coprocessor (the GPU) [19] . These compute kernels are launched with specific launch parameters like grid size and thread block size [20] . Before launching the 3: Find the extrema from the set of projected signal 4: Interpolate to obtain multivariate envelope curves 5: For a set of K direction vectors, the mean m(t) of the envelope curves is calculated as
If the vector meets the stopping criterion then it is a multivariate IMF, otherwise repeat Steps 2 to 5 kernels memory is allocated from GPUs global memory and the input data is transferred to it from CPU main memory. The GPU memory is mapped to CUDA using a specialized hierarchical memory model with GPU global memory accessible to every thread block. The read/write access times to, and from this memory is highest; however, there are other faster memories available for instance read-only constant memory (L1 cache) and texture memory. Moreover, there is also shared memory (L2 cache) which is shared among the threads in a thread block, as well as, registers and local memory associated with every thread block active on a streaming multiprocessor (SM). The maximum number of active blocks depends on available resources on GPU and the number of registers and memories directly affects the SM occupancy. Therefore utilization of too many resources by a single thread can limit the occupancy. Block size also directly affects the occupancy. Change in block size changes the number of blocks and resources required by a block leads to change in the maximum number of active blocks. So we had to choose the block size according to our desired application.
III. CUDA IMPLEMENTATION
We exploit the raw computational power of GPUs by parallelizing MEMD algorithm using CUDA. In the parallelized implementation, the host processor (the CPU) invokes all the CUDA kernels and performs all memory transfer operations. The parallel MEMD implementation involves recursive sifting process, which is carried out until the stopping criterion is met. Every pass of this sifting process results in an IMF as shown in Figure 1 . The parallel MEMD algorithm starts off with calculating hammersley sequence to calculate direction vectors [21] , [22] . Other uniform sampling and random sampling techniques give noise and aliasing [21] , and hence, Rehman and Mandic [8] , [17] propose a quasi-random sampling technique to generate sampling points on a sphere. As mentioned in [8] and [17] , the number of direction vectors should be at least twice the number of input data samples. We adhered to this suggestion by the authors and generated hammersley sequence on GPU taking into account the number of inputs and the number of direction vectors [8] . The implementation in C++ and CUDA was quite different from MATLAB. We got a significant gain in absolute terms but in the overall algorithm it is the smallest execution part and executed once in MEMD. So, in the overall algorithm has very less gain.
We calculated direction vectors using the generated hammersley sequence. We took projections of the input signal along the direction vectors to get the projected signal [8] . After projection, in sifting process, we performed peak detection, ensuring mirror symmetry at edges and got the corresponding input data for each channel. Next, we applied spline interpolation on that data using corresponding maxima and minima [23] . We calculated envelope mean and amplitude using the maxima and minima interpolated signal as explained in [8] . By using the envelope mean and amplitude, we checked the stopping criterion of sifting process [8] .
Here, we treated all the projected vectors in sifting process independently [8] . Once the sifting process was completed, we extracted the detail signal from our input signal and repeated the same process till the signal does not have sufficient mode in it, or the number of frequency components required was achieved [8] .
A. PREPROCESSING
We generated the hammersley sequence on GPU and stored it on the device memory. To increase memory coalescing and reduce overall branch divergence between threads, each thread in a thread block calculated a unique point-set on a hemisphere for an input channel. For instance, we took the number of blocks equal to the number of input channels, whereas the number of threads in a thread block was set to the number of directions, as shown in Figure 2 . We have maximum active blocks with no branch divergence in it. We also have un-coalescing write operation in memory as well.
After calculating the sequence, we generated direction vectors equal to the number of point-sets per channel and its length equal to the number of channels using hammersley sequence. To reduce global memory accesses, we fetch the input in shared memory and calculate projection vectors in shared memory as well.
B. SIFTING PROCESS
Taking the projections as input for the sifting process, we located the maxima and the minima of the input. Moreover, we ensured sufficient input length and mirror symmetry for the subsequent interpolation to get maxima and minima multiple envelopes. These resulting envelopes were then used to get the mean envelope, as shown in Figure 3 .
In Step 1, the CUDA kernel located local peaks in the input projected signal, which was obtained from the previous pre-processing step. The data was organized in global memory such that each column represents one unique projected vector, or in other words each row represents corresponding projected vectors data. This was done to reduce the total number of global memory accesses, as well as, to avoid branch divergence when making global memory accesses. The number of thread blocks was set equal to the number of direction vectors divided by the block size. Each unique thread, corresponding to a direction vector, read the data for the entire length of the input signal calculating its maxima and minima locations, and storing true at these locations in the projected vector, as shown in Step 2. In Steps 4 and 5, the locations were interpolated to get minima and maxima envelopes that were stored as 2D grids having dimensions equal to the signal length and the number of projections divided by the block size.
C. POST-PROCESSING
In post-processing, a stopping criterion based upon the amplitude of mean-envelopes and the number of extrema was used. If there are less than 3 (three) extrema in the input, or if the summation of the squared signal values was less than a predefined threshold then the stopping criterion was reached. The sifting process was stopped and the averaged envelope was an IMF. If the stopping criterion was not reached then a difference of mean envelopes and the input signal for the current pass was taken as an input for the next pass of the sifting process. All these steps were repeated until an IMF was extracted.
The post-processing block as shown in Figure 4 , comprised of three CUDA kernels: the first calculates mean of the envelope of the entire projected signal having 2D grid of size (N pts /blocksize) × (N dim /blocksize). Here N pts was the input signal length and N dim the number of input channels. The second kernel calculates the average amplitude of projected signal and the third kernel evaluated the stopping criterion having a grid size (N pts /blocksize). 
D. OPTIMIZATIONS 1) MAPPING STRATEGY ON GPU
To eliminate per-transfer overhead, rather than using many small CPU-GPU transfers for transferring every channel from the input signal to the device memory, we made one larger transfer by arranging the 2D multichannel input signal into a continuous 1D array. Furthermore, it is also possible to overlap input data transfers with kernel executions, for instance with the calculation of direction vectors on GPU.
2) MEMORY ACCESS OPTIMIZATION ON A GPU
Global memory accesses are a performance bottleneck in multi-threaded CUDA computations. Such accesses should be kept minimal and should be coalesced across a thread block; that is, group together all contiguous global memory accesses in a thread block (more precisely thread warps) into a single global memory access. In the proposed parallel implementation, most CUDA kernels ensured data locality in order to maximize overall memory throughput. For instance, threads in a thread block of the next kernel access the stored data in a contiguous manner.
3) MEMORY LAYOUT
Modern GPUs have several levels of local caches such as shared, texture and constant memories. Using these on-chip caches generally gives exceptionally fast access times compared to that of global memory. In our implementation, we use shared memory to eliminate redundant memory accesses. Since it is shared within a thread group, all required elements are loaded only once, and then operated on by fetching them from the faster shared memory rather than accessing global memory. Moreover, shared memory is also used to avoid unnecessary global memory accesses; that is, to store the partial results into shared memory instead of storing them in global memory.
4) DATA TRANSFERS BETWEEN CPU AND GPU
There is a huge disparity between peak bandwidths between device memory and GPU (144 GB/s on the NVIDIA Tesla C2050) [24] compared to between host memory and device memory (8 GB/s on PCIe x16 Gen2). To minimize the adverse effects of this disparity on the overall performance of the parallel implementation, we minimized the amount of data transferred between the host memory and the device memory. Only the input data is transferred between the two, but all the direction vectors, as well as, the projections of the input data along all these direction vectors were calculated on GPU's memory accesses; that is, to store the partial results into shared memory instead of storing them in global memory.
5) DIVERGENT FLOW CONTROL IN PARALLEL THREADS
GPUs are built on single instruction multiple threads (SIMT) execution model, hence same instruction is executed for each and every thread in a warp. It is logically not possible to execute different instructions on different threads on the same warp at the same time. In case if the whole warp does not follow the same path, then there is divergence, resulting in a penalty [25] .
In the proposed parallel MEMD implementation when generating the hammersley sequence, the ''if statement cuts'' are multiples of warp size, so that full warps of threads execute the same instructions. Furthermore, during the sifting process, interpolation is performed in every direction, even if there is not enough mode present, in order to avoid divergent control flow.
6) MAXIMIZE THE OCCUPANCY
On current GPUs, warps of threads map to compute units. They reside on SM until all of its threads complete execution. To increase the occupancy, all threads in a warp should be executing the same instruction and have enough memory resources for it. To reduce memory access and instruction latencies, if there are enough hardware resources, multiple thread warps can be mapped to a single compute unit at the same time increasing warp occupancy. In the implementation, the grid and block sizes selected are multiples of warp size, multiples of available SMs. Moreover, the number of active threads per SM are increased by using lesser registers per thread and lower usage of shared memory per thread block. For instance we are storing direction vectors in the shared memory on same locations as pre-calculated cosine values were present.
IV. EXPERIMENTAL RESULTS
We tested the serial C/C++ implementation of MEMD algorithm on Intel Core i5 3 rd generation CPU with 8GBs of system memory. The parallel implementation of MEMD algorithm was evaluated on an NVIDIA GeForce GTX 450 GPU with four SMs (192 CUDA cores in total) and 1GB of global memory [26] . Moreover, we also tested the implementation on an NVIDIA GTX 980 GPU with 4GBs of memory and 16 SMs with 2048 CUDA cores [27] .
A. TESTING CASES
The parallel MEMD implementation was tested using a synthetic hexa-variate signal. In total, 64 direction vectors were used to get projection vectors of the input signal. The result VOLUME 5, 2017
FIGURE 5. Test Case I. Decomposition of a synthetic hexa-variate signal (U,V,W,X,Y,Z) into IMFs (U1-U7), (V1-V7), (W1-W7), (X1-X7), (Y1-Y7), (Z1-Z7).
was a perfectly decomposed input signal into different modes and with all IMFs of different channels aligned as shown in Figure 5 .
In addition to testing the parallel implementation using the hexa-variate signal, we tested the implementation with do-deca-variate (twelve), hexa-deca-variate (sixteen), and do-triaconta-variate (thirty-two) channel EEG data. Moreover, we also increased the number of projection vectors up to 512. The resulting IMFs of hexa-deca-variate are shown in Figure 6 . Table 1 summarizes the number of blocks and the number of threads per block used by the GPU kernels for hexavariate and hexa-deca-variate input signals. Our implementation yields the best performance in this setting. In the first kernel, to compute hammersley sequence, we took the number of threads per block equal to the number of projection vectors and the number of blocks equal to the number of channels in the datasets used. For the rest of the kernels, we computed block and grid sizes based on the number of channels and projection vectors used.
B. KERNEL CONFIGURATION

C. PERFORMANCE EVALUATION
We compared C/C++ and CUDA implementations using hexa-variate and hexa-deca-variate datasets with the number of direction vectors equal to 64. The parallel implementation also performed additional memory transfers between RAM to GPU global memory and vice versa. In the serial implementation, the most expensive part in terms of execution time was the spline interpolation. Thus, we primarily focused on this function in the parallel implementation. We ended up getting 5.98× gain for hexa-variate dataset and 11.21× for hexa-deca-variate dataset as shown in Table 2 .
Our results for occupancy scores for different kernels reveal that most of them contribute to achieving the overall 8696 VOLUME 5, 2017
TABLE 2.
Processing times (in milliseconds) and speedups achieved for the GPU implementation of the algorithm on the CPU-GPU platform.
parallel implementation speedup. We note that spline interpolation takes most of the execution time so our main focus is to optimize spline interpolation. Hence, to optimize computing interpolation, we divided the task into four separate kernels. Here, the second and third kernels involved recursion to smoothen coefficients using forward and backward filtering. We used 1-D thread blocks to update these coefficients that were precomputed in the first kernel. The result is less TABLE 3. Processing times (in milliseconds) and gains achieved for the initial and optimized GPU implementations of the algorithm on the CPU-GPU platform, tested with the hexa-deca-variate input signal.
occupancy for these kernels as compared to the first and fourth kernels using 2-D thread blocks.
We also evaluated the timings of several CUDA kernels with sixteen channel input data before and after applying different GPU-related optimizations. These optimizations include (a) memory coalescing of read/write operations in CUDA kernels, (b) exploiting faster memory caches to avoid global memory accesses, (c) maximizing GPU occupancy by making better choices for grid and block sizes, and (d) reducing the memory usage per thread block. Table 3 shows the execution times before and after optimization of GPU MEMD implementation.
The performance of parallelized version of MEMD algorithm was evaluated using hexa-variate and hexa-deca-variate channel EEG data. The input data length was 1001 data points, while the number of projections was set to 64. The mean execution times were computed over 20 iterations of both the serial and the parallelized MEMD implementations tested on two different GPUs, NVIDIA GeForce GTS 450 and NVIDIA GeForce GTX 980. The resulting execution times are shown in Figure 7 .
Furthermore, we also compared the execution times for CPU and GPU MEMD implementations with projections computed along the different number of direction vectors (64, 128, 256, 384 and 512 direction vectors). The results from Figure 8 show that GPU attains approximately 6× ∼ 16× performance speedup.
D. MEMORY USAGE
Hammersley sequence and direction vectors require same memory in C++ and CUDA. The kernels that belong to sifting process require more memory in GPU than in CPU due to parallel processing of data. GPU occupies the number of direction vectors times the actual memory was required by CPU as shown in Table 4 . As we take more projection vectors, memory requirements for GPU implementation become more significant.
In GPU implementation, most of the time is consumed on parallel computations rather than on transferring the data from the host (CPU) memory to the device (GPU) memory. Therefore two time measurements are performed as shown in Table 5 : (1) time taken exclusively for MEMD computation by the kernel, (2) time taken by the kernel and data transfer time (transferring input data from the host memory to the device memory and copying back the results from the device memory to the host memory). The results from Table 5 show that GPU attains approximately 6× ∼ 16× performance improvements in terms of time consumption by kernels for MEMD algorithm exclusively, and approximately 6× ∼ 16× performance improvements for kernel as well as data transfer; this clearly demonstrates the efficiency of GPU over CPU because of parallel executions.
E. ACCURACY EVALUATION
The original signal can be recovered precisely by summing each all the resulting IMFs of each channel [9] . A difference between the recovered signal and the original signal gives an absolute error measure, which is marginal as shown in Table 6 . Moreover, all correlation coefficients measured between CPU and GPU implementations give perfect match up to three decimal points; that is, they are equal to 1.000.
1) THEORETICAL ANALYSIS WITH P PROCESSORS
In pre-processing and sifting stages, execution time is inversely proportional to the number of processors till the number of direction vectors. For instance we have P processors, number of input channels N dim and number of direction vectors N dir .
where T is the execution time for the two stages. Furthermore, there is an increase in performance for other kernels of the parallel algorithm. For instance, in hammersley sequence kernel, there is a gain in performance linearly till N dir × N dim , and gain in performance in projections kernel till N dir × N pts .
In spline interpolation, there is more parallelism such that a gain in performance linearly till
where T spline is the execution time for spline interpolation. Moreover, in the last kernel of spline interpolation there is an increase in performance of spline interpolation till
In post-processing, there is an increase in performance till N dim × N pts in calculating the mean of all the interpolated envelopes, taking the difference between signal and computing IMF to get the residual signal for the current iteration. In amplitude computation and stopping criterion kernels we also get an increase in performance till N pts .
V. DISCUSSION
GPGPU has been successfully used to optimize different signal processing algorithms [2] , [3] , [28] in order to make them more efficient for practical use. In the proposed MEMD implementation, the parallel design is scalable and flexible. It can efficiently process more input channels and projection vectors by using more GPU cores. Furthermore, all projections of the input signal were done independently in many directions; that is, all resulting projected signals are independent. Each of these projected signals can be generated and processed in parallel on a single GPU, as well as, on multiple GPUs to get more parallelism. In the latter case, the input signal can be moved to all GPUs available with each GPU responsible for solving one, or several projected signals. After completion of the sifting process, the master GPU will calculate the mean envelope curves and extract the details from the input signal to get the corresponding IMFs. Moreover, for further improvement, each signal channel can be interpolated in parallel on a multiGPU machine.
In the case of input signals with either larger data lengths, or real-time streaming data from some acquisition device, there is a possibility to process the incoming signal in chunks using different approximate computing techniques. Furthermore, the extrema envelope curves can be produced using different interpolation techniques that subsequently translate into a different number of IMFs. In our implementation, we used cubic spline interpolation to calculate these multi-dimensional envelopes; that is, MEMD algorithm results in well-suited frequency resolution for different applications.
Though MEMD algorithm is designed for time varying signals; however, it can be used for image processing applications [29] , [30] . A simple technique to process such 2D signals is to convert them into 1D signals, and then apply MEMD algorithm. Moreover, a generalized MEMD algorithm can also be used to process 2D signals, while preserving the higher dimension of signals.
All in all, there are many practical applications of an efficient MEMD algorithm, for instance in fields of biomedical signal and image processing [13] , [31] . It can be used to remove artifacts from time varying signals [14] , as well as, to extract textures [15] and to remove noise [32] , [33] from image data. Moreover, MEMD algorithm can process multichannel data directly for multi-scale analysis [11] , [16] .
VI. CONCLUSION
In this paper, we started off with a serial implementation of MEMD algorithm, which was then ported onto GPU using CUDA. The parallel implementation in comparison with serial implementation resulted in average seven times speedup for two data-sets with 64 direction vectors. This speedup was further improved significantly to nine times after including different optimizations specific to graphics hardware and the CUDA programming model. In conclusion, the faster parallel version of MEMD algorithm can be efficiently used in many practical signal and image processing applications, for instance, denoising of EEG and EOG signals, and fusion of satellite images. His research interests are focused on image processing and machine learning with a specific interest in satellite image processing, high dynamic range imaging, multi-projector displays, CAPTCHA strength analysis, human activity recognition, and image processing for bio-medical applications. VOLUME 5, 2017 
