The emphasis of this project lies in the development and evaluation of new robust and high fidelity fetal electrocardiogram (FECG) systems to determine the fetal heart rate (FHR). Recently several powerful algorithms have been suggested to improve the FECG fidelity. Until now it is unknown if these algorithms allow a real-time processing, can be used in mobile systems (low power), and which algorithm produces the best error rate for a given system configuration. In this work we have developed high performance, low power microprocessor-based biomedical systems that allow a fair comparison of proposed, state-of-the-art FECG algorithms. We will evaluate different soft-core microprocessors and compare these solutions to other commercial off-the-shelf (COTS) hardcore solutions in terms of price, size, power, and speed.
INTRODUCTION
Early knowledge about a fetus status during pregnancy and labor is an essential tool for detecting likely damages to the fetus. For this purpose, the main source of information is the fetal heart rate (FHR) monitor. The noninvasive fetal electrocardiogram (FECG) detection using abdomen leads of the mother is much less risky than the ultrasound method that may damage the fetus. Reliable knowledge about a fetus status during pregnancy and labor is essential, leads to a substantial reduction in damages to a fetus, and avoids unnecessary caesarean sections. FECG system design is challenging since the signals are small (μV range), the mother ECG (MECG) signal amplitudes are ten times larger than FECG, and noise such as thermal or power-line-hum need to be eliminated. [14] [15] [16] [17] [18] An electrocardiogram (ECG) measures the heart's electrical activity and in case of the FHR it is a superposition of two signals: MECG (maternal ECG) and FECG (fetal ECG). In a standard ECG we would have from 3 to 12 such measurements taken from different locations on the body, i.e., 3 bipolar and 3 unipolar between the arms and legs and 6 around the heart. In a FECG processing, however, the arm/leg signals become very weak and even the Thorax lead signals show little to no FECG activity as the channels 5-8 from the Data Identification Systems (DaISy) database demonstrate. These thorax signals may become useful in an adaptive filter configuration, where this signals can be used as reference signals, but not in the principle component analysis (PCA) processing. Indeed we observed that using this thorax signals as PCA input will degrade the FECG due to the over emphasize of the MECG.
A healthy adult ECG signal would consist of a regular, ca. 70 beats per minute (bpm), pulses. For MECG we should expect 75-90 BPM 17 and for FECG maximum range 1.3-3.5 Hz or 78-210 BPM and 110-180 BPM normal range 18 . Each pulse consists of a pre-wave P (ca. 80 ms), a main peak QRS (80-120 ms), and a post wave T (160 ms) complex as shown in Fig. 1a . Typical amplitudes are P≤0.2 mV; Q=R/4; R=0.6-2.6 mV; S<0.6 mV, and T=R/7 for the MECG, for the FECG we should expect about 1/5 to 1/10 amplitudes. The task would be to combine the (noisy) lead signals into a reliable single signal or into 2 (or 3) if we have to monitor fetal (or twin) heart rates, too. For a fetal monitor a typical task would be to make sure that the fetal heart rate stays in the desired range between 110 and 180 BPM. An example of this challenging signal processing involved is shown in Fig. 2b . The DaISy base of length 10 seconds of the FHR monitoring is used. The original signal was sampled at 250 samples per second or a total of 2500 samples per channel, see Fig, 2a . The data set available online consists of the input time based in column one followed by the 8 channels: first 5 abdominal lead signals and finally the 3 thorax leads. The LCD display of mother and fetus BPM using the Altera DE2 FPGA board.
Mother and Fetus ECG cannot be measured separately; they are superimposed as shown in Fig. 2b . To determine the heart rates, we therefore need first to descramble the ECGs using a blind source separation (BSS) method such as PCA or independent component analysis (ICA), then determine the BPM from the QRS maxima periods and provide a reliable score for the given BPM data. Although quite old, the method of choice to reduce the dimensions of the input space is the Kahunen-Loeve transform, aka, principle component analysis (PCA). To implement PCA we build the cross correlation matrix between our input signals, compute the eigenvalues and eigenvectors, and then transform the input signals using the eigenvectors and eigenvalues. PCA will deliver a ranking of the PCs in the input data based on the eigenvalues of the cross correlation matrix. Putting this into MatLab code gives In the following sections we will elaborate in more detail this processing step and then explain HW options and performance results. 
COMPUTING DC FREE COVARIANCE MATRIX
The computation of the covariance matrix for PCA requires that the input data are DC free. Before we can therefore compute C xx we need to calculate the DC part of each channel and then subtract this mean from the lead signals. To make our discussion realistic we will use the data of the first 5 channels of the DaISy base in the following.
The first step in the processing includes reading the data from an input buffer. Depending on the HW configurations that data may have been directly written into the microprocessor memory via DMA or burst read from a FIFO, via a periodic polling of the peripheral, or through HW interrupt. We assume that the data are available in memory to the microprocessor as floating-point data. Analysis of the BPM also shows that about 1.5 seconds (or 375 samples at 250 Hz sampling rate) were enough to have good stable statistical properties. The DC or mean of the first 375 samples was rather small, see Table 1 . After the mean was removed the cross correlation was computed by building the cross vector product (aka inner product) between two channel signals and multiplication. For the first 375 samples we got the following data for the first 5 channels: 
COMPUTING EIGENVECTOR AND EIGENVALUES
Computing the Eigenvectors and Eigenvalues (EV) is one of the great challenges in matrix computation and a wide variety of methods exists [21] [22] [23] [24] [25] [26] . It can easily be argued that the cross-correlation is a symmetric matrix since x' n *x k =x' k *x n , and it follows C xx (p,q)=C xx (q,p). This symmetry will simplify the computations since for symmetric matrices the Eigenvalues and Eigenvectors are real. Again the first 5 channel data from the DaISy set are used as demonstration.
Some EV algorithms such as Oja iterative learning or the "Power method" are well suited if we only need one or two EVs, but produce prohibitive large errors if all EVs are required. 1 There are two major types of EV algorithms that are better suited if all EVs need to be computed: These are the class of QR/QL algorithms and the Jacobi method. 21, 25 The idea behind both algorithms is that we can do similar transformations such as
such that eigenvector and eigenvalues are unchanged. If finally after several iterations the matrix A becomes diagonal (the eigenvalues) and the products of the U's becomes orthogonal (i.e., the Eigenvectors) the algorithm terminates.
In the Jacobi method the matrix U is an orthogonal matrix with ones in the diagonal except for U(p,p)=U(q,q)=cos(θ)=c and U(p,q)=-U(q,p)=sin(θ)=s. The collection of all rotation matrices
will produce the Eigenvector matrix. Such elementary rotations are also sometimes called Given's rotations.
This choice of c and s will force the coefficient A(p,q)=0. Together with the condition c 2 + s 2 =1 this will give
Let us demonstrate this function with a rotation of our matrix C xx from (1). Let's assume we like to cancel C xx (1,2)= C xx (2,1)=-70.962 from the matrix. Eq. (4) gives c=0.9638 and s=-0.2665. Then the Jacobi matrix is formed with If we apply this rotation to our original matrix C xx the two off diagonal elements are eliminated: 
The overall goal now of the Jacobi algorithm is to minimize the off diagonal elements, i.e.,
Jacobi suggest that we use the largest C xx (p,q) for the fastest convergence, however, this will require a sorting of the elements after each rotation. A more practical approach is to go through all elements line-by-line to avoid the costly sorting at a little increased number of rotations required. Going through all off diagonal elements is usually called a sweep. In addition Rutishauser 10 made the following two suggestions:
In the first three sweeps rotations are done only if the coefficient
After four sweeps the rotation is skipped if | C xx (p,q)|<<| C xx (p,p)| and |C xx (p,q)|<<| C xx (q,q)|. To check the "substantial lesser" criterion << we use the floating-point arithmetic check "b +100*eps == b". If left and right compute to the same FP number then the number called "eps" is indeed much smaller than the big number "b".
The algorithm design by Rutishauser 22 is the basis for the C/C++-implementation jacobi() in the Numerical Recipes collection 25 and requires about 6n FLOPs per rotation for a n-by-n symmetric real matrix. A typical matrix needs 6 to 10 sweeps (each with (n(n-1)/2) rotations) and we arrive at a total effort of 18n 3 to 30n 3 . The algorithm provides EVs in not sorted order and a final O(n 2 ) sorting function eigsrt() is used to get the data in order.
By looking at the online available routine we found that our FSU colleague John Burkardt 26 had posted a variation (available for free under GNU public license) of the original algorithm from Rutishauser with the following two major differences:
The algorithm only works on the lower off diagonal elements and the full matrix is restored at the end for error computation.
The threshold is computed via the Frobenius norm
. This threshold is used in all sweeps not just the first 4 and the gap function "b+10eps" is 10 instead of 100 as in the NR version, i.e. "b+10*eps==b".
Overall these modifications by Burkardt will slightly reduce the number of rotations (29 instead of 33 in our n=5 case) but due to the repeated computation of the Frobenius threshold the overall computation time remain about the same.
DETERMINE BPM AND SCORE
The PCA preprocessing step should allow us to have a good separation into mother and fetus ECG. The remaining processing is needed to compute the actual BPM of the ECGs and to provide a reliable score for each PC. This peakpicking period detection task 27 that looks easy for the human eye, is more difficult in the technical implementation due to (a) the pre-wave P and post-wave T that may be considered main QRT peaks, (b) any cross-channel talk of an imperfect separation, and (c) finally the differential character of some measured signal that produce a zigzag QRS complex, see Table 2 it can be observed that the first 2 PCs display the mother BPM, while 3 and 4 represent the fetus BPM. However, the eigenvalues ratios associated with the PCs indicate that only the first 3 PCs are reliable PC 4 and 5 eigenvalues seemed too small and should not be used to determine the mother and fetus BPM. Future research will be to develop a robust procedure to determine the BPM. 
EMBEDDED SYSTEM DESIGN AND PERFORMANCE DATA
Embedded systems (ES) are microprocessor or computer systems that not necessarily look like a computer from the outside, i.e., may not have a keyboard, monitor etc. as a usual computer. 28, 29 The ES usually performs a more specialized task than a general purpose computer, but also requires typically less resources and are often low power portable devices. The design of ESs is a multifaceted task, which not only requires a detailed knowledge of the application or problem at hand, but also a wide variety of knowledge from digital logic, microprocessors, to compiler optimization and operating systems. The potential application base is in general very large. A modern household for instance may have a few general purpose computers, but may have hundreds of embedded systems, ranging from microwave oven, clocks, washer to GPS and telephones. A modern car typically has over 50 embedded microprocessors. The developing of a biomedical embedded system for example would start with a typical system specification/requirement list such as With non-recurring cost of cell-based ASICs at $4M for a 60 nm prototype, it is no surprise that Field Programmable Gates Arrays (FPGAs) have become the dominating DSP technology [1] [2] [3] [4] [5] [6] [7] [8] for real-time, high speed systems replacing cellbased ASIC 9-13 designs. FPGAs enjoy many attributes of cell-based ASICs such as security against illegal copies, reduced development time, reduced board and test cost when compared to TTL boards. Cell-based ASICs on the other side are now only used in very high volume or extreme low power applications. Modern FPGA boards are usually a good starting point that allows a rapid assembly of an embedded system, with a broad debug capability. In this project we used a Digilent ATLYS board for designs with Xilinx soft-core MicroBlaze, a ZYBO board for designs with ARM Cortex-A9 hard core processor, and a TERASIC DE2 for designs with the Altera Nios II software core processor. Table  3 compares a few features of the three boards. Assembling an FPGA-based embedded microprocessor system typically can be done with two different design strategies 28 . In the bottom-up method we start with just the microprocessor core and then add all components from component libraries that are needed in the desired system. This usually requires a substantial amount of experience both in design flow as well as with each component since these usually have many different configuration options including bit width, bus type, clock domain, enable and control signals etc. While this GUI-based configuration allows only selection of valid parameters it still is a large source of possible configuration mismatches and should be reserved for the more experience designer. Fig. 3 shows a Nios II/s design using the bottom-up approach. The system includes the Nios II/s, switches, LEDs, on-chip RAM and SDRAM, clock driver LCD, timer and a JTAG controller for communication with the host computer.
Figure 3. Bottom-up Nios II design configuration for the ECG system.
A substantial simpler approach that requires less design experience is the top-down method. The majority of FPGA boards come with a starting-point design for quick evaluation purpose. Since vendor want to demonstrate all the great features a board has, this starting-point design are usually not minimum systems, instead usually most or all peripheral component are accessed in the start-up design to make the system more appealing and demonstrate the large number of functions of the board. Since these systems are configured such that all peripheral component work ok, the mistake of a wrong configuration of a peripheral driver will not happen in a top-down design. We simply remove all design component that are not needed in the system design target application. The ATLYS start-up design we used was the AC97_EDK_demo that included a full feature MicroBlaze system and allowed to read data from AD, play back, and use buttons, switches and display the LEDs. The only addition needed was an axi_timer to measure the performance of the systems. Integration was not difficult since the component comes with a collection of small test programs that allow an easy integration of the new timer component into the MicroBlaze system, see Fig. 4 .
Adding the floating-point accelerator is substantial different in the Xilinx EDK and Altera Qsys. In the Altera Qsys the floating point hardware that improves the four standard arithmetic operations (+,-,*,/) is added to the system as a standard system component, see Fig. 3 the third system component from top. In the Xilinx EDK system the addition of ("none", "basic" i.e. arithmetic, or "extended", i.e. conversion and sqrt support) is part of the initial processor configuration, see Fig. 5 that immediately also gives an estimate of the additional system resources needed. Since a substantial part of the algorithms will need floating-point operations, we were forced to use a more powerful 32 bit microprocessor than an 8 or 16-bit low power version often used in portable systems. Even though the GCC compiler supports FP numbers it is important to find HW support for these functions; otherwise the programs will run very slow.
It is important to notice that an additional HW FP support may increases the silicon footprint; however the overall energy consumption of the augmented system will be reduced 33 . The ARM Cortex A9 has FP support for float and double precision, while basic operations (+,-,*) will run with about the same speed, for more complex FP operations such as division or square root two times the latency should be expected 32 for double precision. Nios and MicroBlaze can be augmented via custom HW to speed-up single precision FP operations. 30, 31 This will have a substantial impact on the runtime as we will see in the next section. Overall Table 4 shows for Nios and MicroBlaze the extra cost associated with the custom FP instruction support. With 13%-21% the additional silicon is reasonable in XPS design given the large speed-up factor as shown next. The Nios II floating-point co-processor need substantially more resource and the required LEs are more than twice compared to the system without FP co-processor. Finally the ZYBO synthesis results show that although the embedded processor itself does not need any logic resource, attaching fast memory to the processor requires substantial logic resource. Since a dual core is in use this come out twice as high as needed in the ECG design. The total resources used in the ZYBO design are 5443 from 17600 slice LUTs, and 6 out of 60 BRAM blocks, no embedded multiplier. Figure 6 shows the ZYBO RTL view of the system build on the FPGA. Note the substantial resources needed (almost the size of a MicoBlaze) to provide a fast memory interface for the dual core ARM processor.
iiiiim-JP . Figure 6 . The ZYBO RTL view of the system build on the FPGA.
PERFORMANCE DATA
We have first measured the FP performance of basic operations needed in the algorithms described in sections 2-4. These data are displayed in Fig. 7 . The run-times for the different parts of the ECG algorithm for the different microprocessors are shown in Fig. 8 . There are several important findings from these graphs:
The measured performance substantially lower than the data from the manual [30] [31] [32] since here C/C++ code and no assembler coding was used.
The code compilation with GCC option -O3 (most) compared to no optimization -O0 takes insignificant longer, but the performance of the code improves overall runtime by a factor of 2.3 for µB+FP2 to 15.9 for ARM.
Remember that standard math.h function such as sin(), sqrt(), or log10() are defined for argument double. Soft-core provide HW support via a co-processor only for 32 bit FP type but not 64 bit double.
MicroBlaze and Nios support float and not double types using the custom instructions. A factor 3 slower performance was measured for the double sqrt() compared with the C99 standard sqrtf() function.
The runtime for the straight forward computation of the correlation matrix C xx was substantially larger (factor 4.8 for ARM and 2.4 for µB+FP2) than the more sophisticated computation of the eigenvalues and eigenvectors.
The algorithms need to be carefully analyzed since operations such as sqrt() or log10() are already a dominating run-time factor and the use of the 32-bit float type will substantially reduce run-time, a 48% saving in run time, by using the C99 float specific functions was observed. The performance data for the Nios II show overall an improvement of a factor ca. 20 when using the FP coprocessor for the 4 basic arithmetic operations. Conversion and sqrt()/sqrtf() computations seemed not to utilize the FP enhancements.
When using the FP basic option the MicroBlaze performance improves by a factor of ca. 30. The extended FP option also conversion function (factor 7) and the 32 bit sqrtf() by a factor 17. The double precision sqrt() shows no improvement.
Clearly the best performance is observed with the ARM processor due to a substantial higher clock rate (666 MHz vs. 100/50 MHz for MicroBlaze/Nios II) and the close coupled FP co-processor in the ARM. To determine the overall computation time of the algorithm for a complete data set for 375 samples or 1.5s, from DC removal to BPM score computations we need to add up the latency of the 6 components. Usually we then build the reciprocal of this time to determine the number of 375 point frames that can be computed per second (FPS). This data are shown in Table 5 . From the data from Table 5 we can conclude that only the ARM cortex 9 has the performance to do a true real-time processing, i.e., a computation for each new sample coming in at 250 Hz. If we relax the requirement and only require a new computation every new FECG QRS pulse at 3.5 Hz then also the MicroBlaze or Nios II/s with FP can be used.
ACKNOWLEDGEMENT
U. Meyer-Baese gratefully acknowledges support for this work from the DAAD program entitled "Research stay for Professors in Germany" during summer 2015. The authors would like to thanks Altera and Xilinx for the provided hardware and software under the University programs. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the sponsors. The help of the following consultants and program manager is acknowledged: B. Esposito, S. Brown, R. Maroccia, M.Phipps (Altera) and J. Weintraub, A. Acevedo, C. Sepulveda, C. Dick (Xilinx) . Products and company names used in this article may be trademarks of their respective owners.
CONCLUSION AND FUTURE STUDIES
The study of a real-time capable, biomedical embedded microprocessor system to compute the fetal ECG has several important findings:
• Sophisticated algorithms such as Eigenvectors computations via Jacobi method require over 30 (sparse) matrix multiplications and the use of floating-point arithmetic is recommended, since scaling of fixed-point implementations will take prohibitive long.
• Any C/C++ algorithm used should not be used as black-box design. A detail analysis of C/C++ algorithm is necessary to avoid costly 64-bit double precision ANSI C87 function such as sqrt() or log10() and replace these with C99 32 bit float functions. Although only small parts of the code use this instruction due to the long runtime these become the dominant factor. Over 100% run-time deduction was achieved by using C99 functions.
• A soft-core microprocessor allows an optimization to find a power/size/speed optimum, but a hard-core microprocessor provides substantially more computing power when floating-point operations are needed.
• Only the ARM cortex A9 had enough FLOP power to provide a true real-time design; if less stringent requirement based on FECG period length are implemented than also a soft-core can be used.
The new overall system is compact, fast and has a high fidelity and will open great paths for commercialization or further refinements such as an added confidence scores. Looking at the experimental results, it is obvious that analyzing streaming data using hard core or soft core processors (on FPGAs) only leaves much room for improvements. The availability of FPGA resources, which can be included into a processor environment, opens the opportunity to further improve performance and also power efficiency. Specifically, parts of the required computations can be assigned to specialized hardware components, which can augment soft-core processors to become competitive -or even superiorto a hard-core processor. A consequent future development will thus be the analysis of the instruction stream to find candidate sequences for a hardware implementation.
