In this paper, the field programmable gate array (FPGA) implementation of a fetal heart rate monitoring system is presented. A least mean squares algorithm based adaptive filter (LMS-AF) is used for fetal electrocardiogram (FECG) extraction. Two different architectures, namely series and parallel, are proposed for the LMS-AF, with the series architecture targeting lower utilization of hardware resources, and the parallel architecture enabling less convergence time and lower power consumption. The results show that the system effectively detects the fetal R peaks with a sensitivity of 95.74% to 100% and a specificity of 100%. The parallel architecture shows up to 85.88% reduction in the convergence time for the non-invasive FECG database, while the series architecture shows 27.41% reduction in the number of flip-flops used when compared with the existing methods.
INTRODUCTION
Over the past few decades, analysis of the fetal electrocardiogram (FECG) has proven to be a tool of great importance for monitoring the well-being of the fetus during pregnancy and labour, unearthing vital information like fetal heart rate (FHR), heart rate variability, etc. FHR extraction using FECG recordings serves as a suitable method for mobile, low-cost, regular, real-time monitoring of the fetus. However, the recorded ECG signal contains the FECG contaminated with maternal ECG (MECG), power line interference, muscle noise, and motion artifacts. Various statistical and time domain techniques [1] have been exploited to extract the FECG, namely adaptive filtering, blind source separation, wavelet transform, etc.
As the adaptive filter is an accurate method for FECG extraction and its computational complexity is relatively low [1], a least mean squares algorithm based adaptive filter (LMS-AF) is chosen for this study. The system is implemented on an FPGA as it is a better prototyping platform for hardware implementation compared to digital signal processors (DSPs). This can also serve as a step towards the development of a low-cost FHR monitoring system as a system on chip.
Previous hardware implementations of LMS-AF based FECG extraction include [2], which was tested only on raw synthetic signals without any preprocessing, [3], which used analog preprocessing and FPGA based FECG extraction, and [4], wherein LMS-AF was implemented on a digital signal controller. Some other methods for FECG extraction [5] [6] [7] [8] [9] have also been implemented on hardware. Some of these works [3] [6] [7] reportedly use fixed-point arithmetic, which leads to lower precision than floating point (FP) arithmetic.
† Corresponding author: bvasudeva@ec.iitr.ac.in
The main contributions of this paper are as follows:
• For fetal R peak detection, a norm to determine the threshold is proposed to avoid false positive detection.
• A floating point unit (FPU) is developed for the FPGA implementation to support FP calculations, and hence improve the precision and accuracy of the system.
• For the implementation of the LMS-AF module, two different architectures, namely series and parallel, are proposed. While the former is developed for lower hardware utilization, the latter is better in terms of lower latency and power consumption.
METHODOLOGY
In order to retain the MECG and FECG components [10] and attenuate the sources of noise, the signals are preprocessed. To remove the high frequencies, a fourth order low pass Butterworth filter is used. The cutoff of the filter is kept at 45 Hz, so that the ECG components in the signal are retained [10]. In order to suppress the peak at 50 Hz due to the power line interference, a notch filter [11] centered at 50 Hz (quality factor 25) is used. A two-stage moving average filter is used to obtain an approximation of the baseline wander (low frequency noise) present in the signal. To remove baseline wander, the output of this filter is subsequently subtracted from the input signal. The operations performed are summarized below:
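The equations themselves are not reproduced in this text. One reading that is consistent with the definitions that follow and with the hardware description of the moving average filter is sketched here. It assumes causal windows, and z[n] is a name introduced here for the baseline-corrected output:

```latex
\begin{align}
M_1[n] &= \frac{1}{N_1}\sum_{i=0}^{N_1-1} x[n-i] \tag{1}\\
M_2[n] &= \frac{1}{N_2}\sum_{j=0}^{N_2-1} M_1[n-j] \tag{2}\\
z[n]   &= x[n] - M_2[n] \notag
\end{align}
```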
where x is the input signal, n is the sample index, and M_1 and M_2 are the first and second stage means with window sizes N_1 and N_2, respectively.
For the LMS-AF, the criterion for convergence of the filter weights is satisfied at around 12 000 samples, with m = 19 and µ = 7 × 10^−5.
A modified version of the Pan and Tompkins algorithm [13] is used to detect the fetal R peaks. The output of the LMS-AF is differentiated, squared, and then passed through a mean filter of length 40 to obtain the signal sdm. Since the extracted FECG contains residual maternal R peaks as well as sharper fetal R peaks, these operations enhance the fetal R peaks in sdm. In order to determine the threshold th which can be used to distinguish between the fetal and maternal R peaks, a new norm is proposed. The mean m_1 of sdm is used as a threshold to determine the local maxima present in it. The mean m_2 of these local maxima is calculated. th is then set as the mean of m_1 and m_2. Among the local maxima already determined, those with amplitude less than th are discarded. The maximum FHR can be 200 beats per minute (bpm) [14], which corresponds to 300 samples (for a sampling frequency of 1 kHz). For the remaining local maxima, if the immediately following local maximum lies within 200 samples, the location of the local maximum with the larger amplitude of the two denotes the fetal R peak.
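The thresholding rule above can be sketched in software as follows. This is a minimal illustration, not the hardware procedure: the function name, the strict/non-strict comparisons used to define a local maximum, and the way close peaks are merged against the last accepted peak are all assumptions made here.

```python
def detect_fetal_r_peaks(sdm, window=200):
    """Return (peak_locations, th) for an enhanced signal sdm.

    Assumed reading of the text: local maxima above the mean m1 are
    collected, th is the mean of m1 and the mean m2 of those maxima,
    maxima below th are dropped, and of two maxima closer than
    `window` samples only the larger one is kept.
    """
    m1 = sum(sdm) / len(sdm)
    # Local maxima of sdm that exceed the first threshold m1.
    maxima = [i for i in range(1, len(sdm) - 1)
              if sdm[i] > m1 and sdm[i] > sdm[i - 1] and sdm[i] >= sdm[i + 1]]
    if not maxima:
        return [], m1
    m2 = sum(sdm[i] for i in maxima) / len(maxima)
    th = (m1 + m2) / 2.0
    # Discard maxima with amplitude less than th (residual maternal peaks).
    candidates = [i for i in maxima if sdm[i] >= th]
    peaks = []
    for i in candidates:
        if peaks and i - peaks[-1] <= window:
            # Two candidates within the window: keep the larger one.
            if sdm[i] > sdm[peaks[-1]]:
                peaks[-1] = i
        else:
            peaks.append(i)
    return peaks, th
```

With a signal that has tall peaks at samples 300, 800 and 1300, a smaller peak at 500 and a nearby secondary peak at 900, only the three tall peaks are returned.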
The difference between the consecutive R peaks is the RR interval. The average of these RR intervals is taken, and divided by the sampling frequency to get the average RR interval length in seconds. The FHR is calculated as follows:
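The equation is not reproduced in this text. From the description above it can be written as follows, where p_k (location of the k-th detected fetal R peak, in samples), K (number of detected peaks) and f_s (sampling frequency in Hz) are symbols assumed here:

```latex
\begin{align}
\overline{RR} &= \frac{1}{f_s\,(K-1)}\sum_{k=1}^{K-1}\left(p_{k+1}-p_k\right) \notag\\
\mathrm{FHR} &= \frac{60}{\overline{RR}}\ \text{bpm} \tag{6}
\end{align}
```

For example, an average RR interval of 0.4 s corresponds to an FHR of 150 bpm.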
IMPLEMENTATION ON FPGA
For the purpose of FPGA implementation, the proposed system is divided into four units as shown in Fig. 1 .
FPU
An FPU is developed for performing arithmetic operations (addition, subtraction and multiplication) and comparison. The FP numbers are converted to their 32-bit binary representation as per the IEEE 754 standard [15]. The sign, exponent and mantissa are denoted by s_a, e_a, m_a and s_b, e_b, m_b for the inputs A and B, respectively. s_out, e_out, and m_out denote the sign, exponent and mantissa of the output. The procedure followed for the FP adder is listed in Fig. 2(a). >> denotes the right shift operation. A similar procedure is followed for the FP subtractor, except that when the sign bits are the same, subtraction is performed after comparing the mantissas, and when they are opposite, addition is performed.
For all three operations, when m_out is not of the form 1.f_out, m_out is repeatedly shifted left by one place and 1 is subtracted from e_out until the first bit of m_out becomes 1. The procedure for FP comparison is listed in Fig. 2(b). c_out denotes the three cases, A > B (c_out = 01), A = B (c_out = 00), and A < B (c_out = 10).
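To make the field layout and the comparison codes concrete, the following sketch splits a value into its IEEE 754 single-precision sign, exponent and fraction fields and compares two values field by field. It is only an illustration: the procedure of Fig. 2(b) is not reproduced in this text, the function names are hypothetical, NaN and infinity are not handled, and treating +0 and −0 as equal is an assumption.

```python
import struct

def fp32_fields(value):
    """Return (sign, biased exponent, 23-bit fraction) of value as float32."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

def fp32_compare(a, b):
    """Return c_out: '01' if A > B, '00' if A = B, '10' if A < B."""
    sa, ea, fa = fp32_fields(a)
    sb, eb, fb = fp32_fields(b)
    # Assumption: +0 and -0 compare equal.
    if (ea, fa) == (0, 0) and (eb, fb) == (0, 0):
        return "00"
    if sa != sb:
        # The operand with sign bit 1 is negative, hence smaller.
        return "10" if sa else "01"
    # Same sign: compare magnitudes via exponent first, then fraction.
    mag_a = (ea << 23) | fa
    mag_b = (eb << 23) | fb
    if mag_a == mag_b:
        return "00"
    a_has_larger_magnitude = mag_a > mag_b
    # For negative operands the larger magnitude is the smaller number.
    a_is_greater = a_has_larger_magnitude != bool(sa)
    return "01" if a_is_greater else "10"
```

Comparing exponent and fraction as one unsigned integer works because, for a fixed sign, the IEEE 754 encoding is ordered by magnitude.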
Preprocessing

Butterworth Filter
In this module, the output is obtained as follows [11] :
where I[k] is the sample value at instant k and O[k] is the output value. The constants are obtained from the transfer function of the filter. In this work, α = 0.00308, β = 3.28391, γ = −4.08689, δ = 2.28117 and ε = −0.48140.
Notch filter
This module works in a similar manner as the previous module, following the equation [11] :
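The equation is not reproduced in this text. A minimal sketch, assuming the usual second-order difference equation O[k] = αI[k] + βI[k−1] + γI[k−2] + δO[k−1] + εO[k−2], is given below. The assignment of the five constants to these terms is an assumption (it gives unity gain at DC for the values reported for this module), and the name eps for the fifth constant is introduced here.

```python
def notch_filter(samples, alpha=0.99405, beta=-1.31278, gamma=0.99405,
                 delta=1.31272, eps=-0.98804):
    """Apply the assumed second-order difference equation sample by sample."""
    out = []
    i1 = i2 = 0.0  # previous two inputs, I[k-1] and I[k-2]
    o1 = o2 = 0.0  # previous two outputs, O[k-1] and O[k-2]
    for i0 in samples:
        o0 = alpha * i0 + beta * i1 + gamma * i2 + delta * o1 + eps * o2
        out.append(o0)
        i2, i1 = i1, i0
        o2, o1 = o1, o0
    return out
```

Only two past inputs and two past outputs need to be stored, which is why such a module maps onto a small number of registers.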
In this case, α = 0.99405, β = −1.31278, γ = 0.99405, δ = 1.31272 and ε = −0.98804.
Baseline wander removal
Fig. 3(a) shows the structure of the two-stage moving average filter. As in (1), M_1 is the average of N_1 values. In every clock cycle, the input is added to M_1 and x[N_1 − 1] is subtracted from M_1, both after being multiplied by 1/N_1. For the moving average operation, all the values in Memory 1 are shifted by one position, so that x[N_1 − 1] is discarded and a new value is stored in x[0]. A similar procedure is followed for calculating M_2 as per (2). M_1 is multiplied by 1/N_2, stored in Memory 2 and also added to M_2. y[N_2 − 1] can then be directly subtracted from M_2 to obtain the second stage mean. M_2 represents the baseline wander approximation. This output is used to remove the baseline wander from the input by performing one subtraction operation every clock cycle. The latency of these three modules is 1 clock cycle.
Series Architecture
[...] (3), and x^T[n] has shifted by one index. In the following clock cycle, the error is calculated using (4), and the updated value of the first weight of the filter is also obtained. This updated weight value is stored in its position in the next clock cycle. This sequential process is repeated until all the weights are updated, which corresponds to m + 1 clock cycles. After a total of 2m + 1 clock cycles, a new input value is stored in x[0] so that x^T[n] is updated. The register containing d[n] also gets updated. The required output for a particular pair of x^T[n] and d[n] is obtained after 2m + 1 clock cycles.
Parallel Architecture
In Fig. 3(c), Memory 1 (vector x^T[n]) gets updated with the next input value in every clock cycle. Each element of x^T[n] is scaled, and then multiplied with the elements from Memory 2 (vector w[n]). These are added to obtain y[n], as in (3). The register containing d[n] is updated every clock cycle, and used to calculate the error, using (4). Since 2µe[n] is used in every weight update, it is calculated first, and subsequently multiplied with the values from Memory 1 to update the weights, using (5). The updated weights are stored in Memory 2. All operations are performed in 1 clock cycle.
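The per-sample computation described above can be sketched in software as follows. Equations (3) to (5) are not reproduced in this text, so the sketch assumes the standard LMS forms y[n] = w^T[n]x[n], e[n] = d[n] − y[n] and w[n+1] = w[n] + 2µe[n]x[n]; the function name is hypothetical, the weights are assumed to start at zero, and the scaling of x^T[n] mentioned above is omitted.

```python
def lms_filter(x, d, m=19, mu=7e-5):
    """Run an LMS adaptive filter with m + 1 weights over x and d.

    Returns (y, e, w): filter outputs, errors, and final weights.
    """
    taps = m + 1
    w = [0.0] * taps    # Memory 2: weight vector w[n], assumed zero-initialized
    buf = [0.0] * taps  # Memory 1: most recent inputs, newest first
    ys, es = [], []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]                        # shift in the new input
        y = sum(wi * xi for wi, xi in zip(w, buf))   # assumed form of (3)
        e = dn - y                                   # assumed form of (4)
        g = 2.0 * mu * e                             # computed once, reused per tap
        w = [wi + g * xi for wi, xi in zip(w, buf)]  # assumed form of (5)
        ys.append(y)
        es.append(e)
    return ys, es, w
```

In the parallel architecture all the multiplications in one loop iteration happen in the same clock cycle, whereas the series architecture spreads them over 2m + 1 cycles; the arithmetic per sample is the same in both cases.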
FHR Detection
Peak Enhancement
In this module, the operations listed in Fig. 4(a) are executed in every clock cycle. cval and pval denote the current and previous input values, respectively. sdiff denotes the differentiated and squared signal, M is a memory of size P, and N denotes the number of inputs.
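Since Fig. 4(a) is not reproduced in this text, the following is only a sketch of how the differentiate, square and mean-filter steps could be carried out one sample at a time. It reuses the names cval, pval, sdiff, M and P from the text; initializing pval and M to zero and maintaining the mean as a running sum are assumptions.

```python
from collections import deque

def peak_enhance(samples, P=40):
    """Return sdm: the mean over the last P values of the squared difference."""
    M = deque([0.0] * P, maxlen=P)  # memory of the last P sdiff values
    running = 0.0                   # sum of the values currently in M
    pval = 0.0                      # previous input, assumed to start at zero
    sdm = []
    for cval in samples:
        sdiff = (cval - pval) ** 2  # differentiate, then square
        running += sdiff - M[0]     # add the newest value, drop the oldest
        M.append(sdiff)             # maxlen=P evicts the oldest entry
        pval = cval
        sdm.append(running / P)
    return sdm
```

Updating a running sum needs one addition and one subtraction per sample regardless of P, which mirrors the approach described for the moving average filter.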
Detection of Local Maxima
In this module, the local maxima are determined, using m_1 as threshold. The operations executed are summarized in Fig. 4(b). in denotes the current input, R_1 (R_2) is used to store the input value (location) for the next clock cycle, and R_3 (R_4) is used to conditionally store the input value (location). The locations and values of local maxima are denoted by pl and pv, respectively. m_1 and m_2 are initialized to zero.
Fetal R Peak Detection
In the first cycle, the inputs pl and pv are stored in R_1 and R_2, respectively. The operations executed are summarized in Fig. 4(c). out denotes the locations of the fetal R peaks detected.
FHR Calculation
The RR intervals are estimated using the differences between consecutive outputs of the previous module. Two registers are used for storing the current and previous input. The estimated RR intervals are accumulated and averaged out, following which FHR is obtained using (6).
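A minimal software sketch of this accumulation is given below, assuming that (6) has the form FHR = 60 / (average RR interval in seconds). The function name is hypothetical, and the default sampling frequency of 1 kHz is taken from the R peak detection description.

```python
def fhr_bpm(peak_locations, fs=1000.0):
    """Return the FHR in bpm from R peak locations given in samples."""
    prev = None  # register holding the previous peak location
    total = 0    # accumulated RR intervals, in samples
    count = 0    # number of RR intervals accumulated
    for cur in peak_locations:
        if prev is not None:
            total += cur - prev
            count += 1
        prev = cur
    if count == 0:
        raise ValueError("at least two R peaks are needed")
    avg_rr_seconds = (total / count) / fs
    return 60.0 / avg_rr_seconds
```

For instance, peaks spaced 400 samples apart at 1 kHz give an average RR interval of 0.4 s, that is, 150 bpm.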
RESULTS AND DISCUSSION
To test the system for real signals, the non-invasive FECG (NiFECG) dataset [16] and the database for identification of systems (DaISy) [17] are used. The synthetic signals were simulated using the FECGSYN toolbox [18]. The thoracic and abdominal signals are shown in Fig. 5. Figs. 6(a) and (d) show the preprocessed real and synthetic signals. The frequencies between 3 and 35 Hz are retained, while the other frequencies are suppressed. The peak at 50 Hz is attenuated. The output of the LMS-AF is shown in Figs. 6(b) and (e). The signal obtained after peak enhancement (sdm) and the detected fetal R peaks (fpk) are shown in Figs. 6(c) and (f). Table 1 lists the quantitative results for the tested datasets. It is observed that the proposed norm for the determination of th results in no false positives. Table 2 summarizes the comparison of performance of the proposed work with various FECG extraction methods. This work shows a 1.34% increase in sensitivity and 2% in accuracy for DaISy, along with a 1.02% increase in sensitivity and 7.51% in accuracy when compared to works tested on both NiFECG and DaISy.
Table 1 Results obtained for different datasets using the proposed approach.
Dataset                 FHR (bpm)   Sensitivity   Specificity   Accuracy
ecgca444 [16]           152         95.74%        100%          97.37%
ecgca840 [16]           161         96%           100%          97.37%
ecgca746 [16]           147         97.78%        100%          98.53%
ecgca771 [16]           153         100%          100%          100%
DaISy Channel 2 [17]    143         100%          100%          100%
DaISy Channel 3 [17]    143         100%          100%          100%
Synthetic [18]          115         100%          100%          100%
The system is implemented on the Xilinx Artix-7 FPGA (XC7A100TCSG324-1). The baseline wander removal module consumes 2.691 W power, and utilizes 820 LUTs and 94 FFs. The power per cycle is 89.683 µW. The detection of local maxima module consumes 0.167 W power, and utilizes 45 LUTs and 34 FFs. The power per cycle is 9.278 µW. All other modules, except for the LMS-AF module, have minimal resource (∼0 LUTs and FFs) and power utilization (0.068 W).
For the parallel design, the number of operations in every clock cycle is greater than in the series design, and hence the resource utilization is greater. On the other hand, the series architecture distributes the same number of operations across more clock cycles, and hence needs more time for convergence, and consumes more power. Table 3 summarizes the comparison between the existing implementations of various FECG extraction methods on different hardware platforms and the proposed architectures after mapping the power consumption and convergence time to an operating frequency of 50 MHz. The power per cycle is 7.823 µW for the series and 65.133 µW for the parallel architecture. As per the latencies (2m + 1 = 39 clock cycles per sample for the series architecture with m = 19, against 1 clock cycle for the parallel architecture), the convergence time for the former is 39 times the convergence time for the latter. The series architecture shows 27.41% reduction in the number of FFs, whereas the number of LUTs is comparable to the other methods. The parallel architecture shows up to 85.88% reduction in the convergence time when compared with the methods [3] [6] [8] using the NiFECG database. It has also been reported that implementation of FP operations on FPGA leads to excessive consumption of logic elements [22]. The use of fixed-point numbers would have resulted in lower resource utilization and power consumption, as the operations involving FP numbers are computationally intensive [2] [8] [22]. However, the use of fixed-point numbers compromises the accuracy of the system.
CONCLUSION
In this paper, the FPGA implementation of an FHR monitoring system is presented. For FECG extraction, an LMS-AF is used, and series and parallel architectures are designed for its implementation. The precision and accuracy of the system are significantly enhanced by the use of the FPU. Comparison with previous works shows that the parallel architecture requires the least time for convergence of filter weights, while the series architecture has the lowest resource utilization.
