Abstract: Electrocardiogram (ECG) analysis has been established as a key element in the evaluation of human health status. The computational complexity, along with the strict constraints of real-time assessment of a heart beat, has made the ECG analysis flow a very challenging application for embedded medical devices. Recent advancements in cyber-physical and IoT systems are shifting medical processing towards embedded and wearable devices, thus making energy consumption a first-class design objective. In this work, we focus on analysing the power, performance and energy profiles of an ECG analysis and arrhythmia detection software pipeline during its execution on a Zynq-based SoC. We evaluate a large set of design alternatives spanning from a software-only implementation to HW/SW co-designed solutions, in which High-Level Synthesis capabilities are utilized. Using the medically validated MIT-BIH ECG database, we examine the efficiency and the sensitivity of the design solutions at different operating frequencies and for three Quality of Service (QoS) levels concerning the sampling rate of the ECG signal.
I. INTRODUCTION
Healthcare is one of the most rapidly expanding application areas of Big Data and Internet of Things (IoT) technology. Indeed, the remote health monitoring of patients is expected to reduce healthcare cost while simultaneously delivering high-quality health services [1]. In this direction, the users of these new medical systems have increased Quality of Service expectations, which in turn translate into stricter requirements and constraints on all the design levels of the utilized medical devices.
To meet these increased requirements, new sensor-based medical systems are envisioned which take advantage of the distributed and heterogeneous nature of IoT systems in order to alleviate the computational burden of single processing nodes and maximize their autonomy [2]. Inevitably, these proposed system architectures dictate the development of new portable and/or wearable biomedical devices able to perform powerful analysis and facilitate communication with other nodes of the system. This embedded nature of modern biomedical devices has given rise to the need for an energy-aware re-design of medical analysis algorithms.
In an effort to study and quantify the power and processing requirements of such an algorithm, we utilize electrocardiogram (ECG) signal analysis as our driver application. ECG is considered one of the most fundamental and crucial biological signals for monitoring and assessing the health status of a person due to its inherent relation to heart physiology. Its analysis and interpretation have been established as an important field in modern medicine, with the digital processing of the signal gaining a leading role. Lately, ECG analysis has been dominated by machine learning techniques that derive models for assessing and predicting the state of the heart [3], [4], [5]. One representative example is the use of Support Vector Machine classifiers, which have gained popularity as a key component of ECG analysis due to their increased accuracy in complex, non-linear classification problems. On top of that, constant monitoring and the need for real-time heart condition assessment have imposed new requirements for acceleration and low-power execution of ECG signal analysis flows.
In this work we focus on arrhythmia detection in a real-time operating ECG signal analysis flow. Our aim is to evaluate the energy efficiency of a candidate device in an IoT-based ECG analysis system. Specifically, we employ the Zynq-7000 All Programmable SoC device [6] for implementing an embedded ECG analysis and arrhythmia detection application which can have various roles in a future IoT system, either as a dedicated ECG analysis device or as a node which supports computational off-loading from other, less powerful wearable nodes. Emphasis is given to the differing design alternatives offered by this heterogeneous platform, i.e., pure software implementation vs. hardware/software co-design solutions.
Our implemented ECG analysis flow includes the removal of noise from the heart signal, heart beat detection, feature extraction and finally the diagnosis of a particular heart beat using an SVM classifier. The hardware/software co-designed versions of the flow were designed using High-Level Synthesis (HLS) tools. HLS has proven to be an effective choice for quick prototyping of system designs, given that its starting point is an abstract algorithmic description of the desired functionality in a high-level language instead of a detailed description at the RTL [7]. On this basis, the tool can also be used as the main component of a design exploration framework which optimizes the produced HW accelerator according to the requirements of the designer.
The IoT system under analysis is user-oriented and consequently the Quality of Service offered is a parameter of major importance. In this case study we utilize the element of QoS levels by analyzing ECG signals of different sampling rates, as proposed in [2]. The higher the sampling rate, the higher the resolution of the ECG signal. As a consequence, this more detailed description of a heart beat leads to more accurate analysis and diagnosis. The inherent trade-off is that a higher sampling rate increases the number of computations needed to execute the analysis flow, which in turn leads to increased power consumption.
The software-only version of the ECG signal analysis flow was implemented as a Linux application executed on the Processing System (PS) side of our target device, while the hardware/software version was executed on both the PS and the Programmable Logic (PL) sides of the device. The utilized device was a Zynq-7000 APSoC and, more specifically, the ZC702 Evaluation Kit. One of the reasons for this selection was that the ZC702 board includes PMBus controllers which support a wide range of commands that, among others, allow an external host to monitor the controller and measure the voltage and current on the voltage rails of the device [8]. These measurements were taken while experimenting with alterations in the operating frequency of the PS-side using cpufreq, a tool which allows the speed of a processor to be altered at run-time.
The rest of the paper is organized as follows. In Section II related work is presented, while Section III gives an overview of the ECG analysis flow. Section IV presents a proposed methodology for ECG analysis HW/SW co-design. Section V offers technical background on the utilized device and the voltage and current supplies that are necessary for the operation of the Zynq-7000 APSoC, and explains how the consumed power is measured through the PMBus. In Section VI the different versions of our implementations are discussed, and in Section VII the experimental results in terms of consumed energy are provided. Finally, Section VIII concludes this paper.
II. RELATED WORK
Most biomedical devices used for monitoring and detection of abnormalities in biological signals aim to provide accurate and trustworthy results in real-time; hence, in many cases significant acceleration is essential. On top of that, a possibly portable and/or wearable device needs to satisfy certain requirements in terms of power consumption, or even the transmission of data to/from a data center where they could be evaluated.
On this ground, a multi-sensor system for human activity evaluation is developed in [9], including a wireless module for communicating with a smartphone and exploiting the benefits of the Zynq-7000 APSoC in terms of HW/SW programmability and power management. The combination of the PS and PL sides proves considerably more beneficial for system performance than previous ARM-based wearable systems, not only in execution latency but also in power consumption, as frequency scaling is applied when the processing system does not need to operate at full speed.
In the HW acceleration direction, an HLS approach is adopted in [3] for the design of an SVM classifier for arrhythmia detection as a HW accelerator. This classifier is the final stage of an ECG analysis flow, and a systematic Design Space Exploration has been applied to determine an optimal solution. The authors report execution latency gains of up to 94% compared to the original software-only version of the SVM code.
As far as power consumption is concerned, in [10] and [11] the PMBus controllers available on some FPGA boards are employed not only to monitor the instant power consumption but also to apply power management methods, such as dynamic voltage and frequency scaling, through this controller. The target devices are defined there as CPU-FPGA hybrids, a characterization which could also be attributed to a device like the Zynq-7000.
III. DESCRIPTION OF ECG ANALYSIS FOR ARRHYTHMIA DETECTION
Arrhythmia is considered one of the most commonly encountered heart malfunctions. Hence, the detection of signs of arrhythmia in an ECG signal has been extensively investigated, and machine learning techniques are commonly employed for this task. Machine learning solutions require a rich data set for their training and, to aid this research, many institutes have composed publicly available ECG signal databases [3]. A well-known and commonly used database is the MIT-BIH Arrhythmia Database [12], a combined effort of MIT and Beth Israel Deaconess Medical Center [13]. The database consists of 48 fully annotated half-hour two-lead ECG recordings, obtained from patients with different medical profiles and physiological characteristics. What is more, the signals of this database have been annotated by medical experts, thus forming an ideal starting point for creating a training dataset for the final stage of the ECG analysis flow, i.e., the classification of a heart beat.
A. Overview of ECG analysis flow
As already mentioned, the processing of an ECG signal is composed of various stages with distinct characteristics and requirements. The employed stages include Noise Removal, R peak detection, Feature Extraction and Diagnosis Classification:
• Noise Removal: The first stage of the ECG analysis is the filtering of the signal, usually using a low-pass digital filter, to remove noise introduced into the signal by power line interference.
• R peak detection: An R peak is a special point in the physiology of each heart beat which can be used to detect a heart beat inside a stream of incoming ECG data. Due to this significant role, R peak detectors have been widely investigated, and in the context of this work we utilize an existing one provided by PhysioNet [14].
• Feature Extraction: Having determined a new heart beat, a feature extraction process follows in order to extract its unique characteristics. We make use of the Discrete Wavelet Transform (DWT), which is very popular for ECG signal analysis [5], [15], [16] since it is capable of efficiently capturing the signal characteristics while imposing a very small execution overhead on the flow.
• Diagnosis Classification: The final stage of the ECG analysis flow is detecting whether the heart beat exhibits arrhythmia signs or not. A classification algorithm is used to detect the pattern of problematic beats. In this work a Support Vector Machine (SVM) classifier with an RBF kernel function is employed and trained for this purpose, mainly due to its ability to support non-linear classification with good accuracy at an acceptable computation cost.
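To make the cost of the final stage more concrete, the following is a minimal sketch of how a trained two-class RBF SVM scores a single feature vector. It is not the implementation used in this work: the function names, the toy support vectors, the coefficients and the label convention (1 for an arrhythmic beat) are assumptions chosen purely for illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Signed score of a trained two-class SVM.

    dual_coefs[i] is assumed to hold alpha_i * y_i for support vector i.
    """
    return bias + sum(
        coef * rbf_kernel(x, sv, gamma)
        for coef, sv in zip(dual_coefs, support_vectors)
    )


def classify_beat(x, support_vectors, dual_coefs, bias, gamma):
    """Return 1 (assumed here to mean 'arrhythmic') if the score is non-negative."""
    score = svm_decision(x, support_vectors, dual_coefs, bias, gamma)
    return 1 if score >= 0 else 0


# Toy model with two support vectors; all values are illustrative only.
svs = [[0.0, 0.0], [1.0, 1.0]]
coefs = [-1.0, 1.0]
print(classify_beat([0.9, 1.1], svs, coefs, bias=0.0, gamma=1.0))
print(classify_beat([0.1, -0.1], svs, coefs, bias=0.0, gamma=1.0))
```

In this formulation every classified beat requires one kernel evaluation per support vector, so the work per beat grows with both the number of support vectors and the length of the feature vector.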
B. ECG Sampling Rates
We examine and evaluate the performance and the power consumption for three different Quality of Service (QoS) levels regarding the sampling rate of the ECG signal. The selected sampling rates were 180 Hz, 360 Hz and 720 Hz. In signal processing, a signal with a higher sampling rate has a higher temporal resolution: there are more samples to be analyzed and the sampled activity is captured in more detail. In other words, a high sampling rate gives a signal of better quality. An ECG signal with more samples than another is expected to require more time and power to be analyzed; however, its analysis can be considered more trustworthy than that of a signal with lower resolution.
In this work, the ECG analysis flow was executed for all patient records included in the MIT-BIH database and for all of the aforementioned sampling rates. The number of detected pulses for each sampling rate was as anticipated and is depicted in Fig. 2. At a sampling rate of 180 Hz, the lowest number of heart beats was detected by the analysis flow. This number increased significantly at the higher sampling frequency of 360 Hz, and even more at 720 Hz. In this way, we observe how the different QoS levels affect user experience: at the expense of computational resources, the input ECG signal is analyzed better.
A major part of our profiling campaign is to successfully capture the energy and power consumption characteristics of the heterogeneous nature of a Zynq-based system where CPUs and FPGA fabric co-exist. Consequently, the previously described ECG analysis flow needs to be re-designed from a HW/SW co-design perspective. The first step towards this goal is to identify which parts of the flow will be instantiated as HW components. We chose reduced execution latency as the optimization goal of our co-design, and thus the most computationally intensive part of the ECG analysis flow has to be identified.
A SW-only version of the ECG analysis flow was executed on one of the CPUs of the Zynq, using as input the complete set of records of the input database. The processor operated at its maximum feasible frequency, and the results of the profiling process are illustrated in Fig. 3, where the percentage of the total execution latency for each part of the ECG analysis flow is presented.
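The kind of per-stage profiling described above can be sketched as follows. This is only an illustration of accumulating wall-clock time per stage and converting it to percentages; the `StageProfiler` class and the stand-in stage functions are hypothetical and are not the instrumentation used in this work.

```python
import time
from collections import defaultdict


class StageProfiler:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.seconds = defaultdict(float)

    def run(self, name, func, *args):
        """Call func(*args), add its elapsed time to the stage, return its result."""
        start = time.perf_counter()
        result = func(*args)
        self.seconds[name] += time.perf_counter() - start
        return result

    def shares(self):
        """Percentage of the total measured latency spent in each stage."""
        total = sum(self.seconds.values())
        if total <= 0:
            return {}
        return {name: 100.0 * t / total for name, t in self.seconds.items()}


# Stand-in workloads; a real flow would call its own stage functions here.
profiler = StageProfiler()
profiler.run("noise_removal", sorted, list(range(50000, 0, -1)))
profiler.run("classification", sum, range(200000))
for stage, pct in profiler.shares().items():
    print(f"{stage}: {pct:.1f}%")
```

Accumulating the times over all records before computing the percentages gives one share per stage for the whole database, which is the form in which such a profile is usually reported.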
After a significant number of executions we concluded that the dominant stages of the ECG analysis flow in terms of latency, and thus in power consumption as well, were the Noise Removal and the Diagnosis Classification stages. The classification of a heart beat, which is implemented with a trained SVM classifier, is the most dominant stage in terms of execution latency at sampling rates of 180 Hz and 360 Hz. In order to examine the HW/SW co-designed alternatives in the ECG arrhythmia detection flow, we consider the employed SVM classifier either as a software module or as a HW accelerated module. The introduction of HW acceleration led to a substantial decrease in latency and power consumption, as the measured execution time of the classifier dropped to the level of the preceding stages, i.e., R peak detection and Feature Extraction. It should be noted that different accelerators were designed for each of the sampling rates following the methodology presented in [17].
The first step in our Zynq-based power profiling campaign is to construct and tailor the different parts of the ECG analysis flow in order to be able to quantify different behaviors. The addition of the HW/SW co-design perspective creates an even more complex design space compared to pure SW implementations. Consequently, to efficiently tackle this increased complexity, the different design steps have to be clearly defined.
Our implemented system offers three different QoS levels regarding the ECG sampling rate. An important consideration is that alterations have to be made to both the SW and HW modules of the system in order to handle an ECG signal with different sampling rates. Considering 360 Hz as the default sampling rate, the system has to process more samples for the 720 Hz case and fewer samples for the 180 Hz case. Therefore, every part of the flow is parametrized in order to readily support signals sampled at three different frequencies. For instance, during the noise removal stage the signal is windowed and filtered. However, the length of the windows cannot be identical for every sampling rate, since the same time interval is described by a different number of discrete samples. The same applies when evaluating the models for the SVM classifier. These issues can be handled in the software versions of the flow in a relatively easy and straightforward manner. This is not the case when a software module is replaced with a HW accelerator.
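The dependence of the window length on the sampling rate can be illustrated with a short sketch. The 0.25 s window duration below is a hypothetical value chosen for the example, not a parameter reported in this work; only the three sampling rates are taken from the text.

```python
def window_length(duration_s, fs_hz):
    """Number of samples that cover duration_s seconds at fs_hz (at least 1)."""
    return max(1, round(duration_s * fs_hz))


SAMPLING_RATES_HZ = (180, 360, 720)  # the three QoS levels from the text
WINDOW_S = 0.25                      # hypothetical window duration in seconds

for fs in SAMPLING_RATES_HZ:
    print(f"{fs} Hz -> {window_length(WINDOW_S, fs)} samples per window")
```

Expressing the parameters of each stage in seconds and deriving the sample counts from the sampling rate keeps a software flow generic, whereas a HW accelerator typically fixes such sizes at synthesis time.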
As already mentioned, the SVM classifier is a dominant part of the ECG analysis flow in terms of execution latency, and a possible solution is to replace it with a HW accelerator. Since the HW cannot be altered after its generation, three different HW accelerators have been tailored, one for each of the three sampling rates. HLS is exploited for the design of those accelerators. A quite extensive Design Space Exploration (DSE) follows in order to determine the optimal HW accelerator in terms of latency and utilized FPGA resources for each of the required sampling rates [3]. Before the DSE is carried out, whether it targets SW or HW implementations, the designer has to define the minimum desired accuracy of the classification part of the flow: when a lower accuracy does not affect the reliability of the system, it might be preferred over more accurate, and thus more expensive and power-hungry, configurations.
V. UTILIZED DEVICE
The target device for our implementations is the Zynq-7000 APSoC, launched by Xilinx in 2010. The Zynq-7000 device combines a Dual-Core ARM Cortex A9 Processing System (PS) with Programmable Logic (PL), i.e., a dual-core processor with FPGA fabric.
A. Brief architectural description of Zynq-7000
The combination of the software programmability of an ARM-based processor with the hardware programmability of an FPGA in a single device offers developers the capability of applying a unified hardware-software approach to embedded system designs, combining serial and parallel processing [6]. On top of that, the Zynq-7000 APSoC is architected for low system power and system-level performance through an optimized architecture. Besides the software and hardware programmability, the Zynq-7000 APSoC is AMBA compliant and offers a variety of interfaces. It includes a 256 KB on-chip RAM along with a 512 MB external DDR3 memory. An important feature of the device for this work is the available I2C interface.
The aforementioned methodology (Section IV) has been implemented utilizing Vivado-HLS and Vivado Design Suite for the SVM accelerator and system implementation, respectively. To support and configure Linux OS operated ARM cores, PetaLinux tools have been used for generating the OS image and the file system support on the platform. Our application was implemented as a user space program using libraries provided by Xilinx in order to achieve the communication among the CPU cores and the accelerators instantiated on the FPGA fabric.
B. Power monitoring on Zynq-7000
As a part of this work, we used the Power Management Bus (PMBus) controller for enabling power monitoring and voltage scaling on the targeted device. PMBus is an open-standard power management protocol that eases the communication with power converters and other devices in a system, thus allowing power monitoring and scaling. The ZC702 Evaluation Kit, our target device, includes a TI UCD92xx PMBus controller [10]. The communication with the PMBus controller is accomplished through the I2C interface which is available in Zynq-based devices. Linux distributions commonly include drivers for accessing devices attached to an I2C bus.
The ZC702 board uses power regulators and a PMBus compliant system controller to supply core and auxiliary voltages. This is depicted in Fig. 5. The board is powered by a 12V input supply. Furthermore, there are five switching regulators and a linear regulator which generate the different voltages required for the operation of the Zynq-7000 APSoC as well as the on-board components. The voltage outputs of these regulators are monitored and controlled by three separate power controllers [8]. The instant monitoring of these output voltages is accomplished by communicating with the PMBus controller through the available I2C bus interface.
Our application, i.e., the ECG analysis flow, is executed in a Linux OS; hence the application for accessing the PMBus should also be executed in Linux. We take advantage of the I2C driver which is available in most Linux distributions in order to access the PMBus controller. A multiplexer is attached on the I2C bus, allowing us to access each of the three PMBus controllers separately. The commands that are sent to the controllers are written according to the PMBus specification [18]. First, a PMBus command selects a specific voltage rail by its address in the controller; it is followed by a command that reads the output voltage or current. For a single voltage and current measurement, the values are read repeatedly from the PMBus controller and averaged in order to ensure the validity of the measurement. Having measured the voltage and current on a specific rail, the instant power is computed as their product. The measurements are taken periodically, every 0.2 seconds. The power monitoring application and the ECG analysis flow are explicitly executed on different processors, so that the monitoring does not inflate the energy consumption and execution latency of the analysis flow.
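The arithmetic behind these measurements can be sketched as follows. The sketch deliberately leaves out the I2C/PMBus access itself and operates on already decoded readings; the function names and the numeric readings are made up for illustration, and only the 0.2 s period is taken from the text.

```python
from statistics import mean

SAMPLE_PERIOD_S = 0.2  # measurement period stated in the text


def instant_power(voltage_reads_v, current_reads_a):
    """Average the repeated readings of one rail and return P = V * I in watts."""
    return mean(voltage_reads_v) * mean(current_reads_a)


def consumed_energy(power_samples_w, period_s=SAMPLE_PERIOD_S):
    """Approximate energy in joules as the sum of power samples times the period."""
    return sum(power_samples_w) * period_s


# Made-up readings for a single rail (volts and amperes), illustration only.
rail_power = instant_power([1.00, 1.01, 0.99], [0.50, 0.49, 0.51])
print(f"instant power: {rail_power:.3f} W")

# Ten consecutive samples at the 0.2 s period cover two seconds of execution.
print(f"energy: {consumed_energy([rail_power] * 10):.3f} J")
```

Summing the per-rail powers before integrating would give a figure for the whole device; this is an assumption about how such measurements are usually combined rather than a statement about the exact procedure followed here.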
VI. DESIGN CONFIGURATION PARAMETERS OF ECG ARRHYTHMIA DETECTION ON ZYNQ-7000
Different versions of the ECG analysis flow were executed on our target device. A software-only version was executed on the PS-side with alterations in the operating frequency of the processing system. In addition, the code was executed for three different sampling rates of the ECG signal. Subsequently, the final stage of the ECG analysis flow, i.e., the classification of a heart beat, was replaced with a HW version implemented on the PL-side of our target device.
A. Operating Frequencies
The Zynq-7000 APSoC can be regarded as a hybrid CPU-FPGA device because it combines a Dual-Core ARM A9 PS with PL. The existence of the PS-side makes it possible to boot the processor with a Linux distribution and develop a Linux application. In general, the power consumption of the processing system is dominated by its operating frequency and application workload. Therefore, the power consumption can be partially controlled by lowering the operating frequency [9]. We utilize Linux cpufreq [19], a tool which enables the operating system to scale the CPU frequency up or down in order to save power or improve the execution time of an application. Our approach was to take advantage of this potential and experiment with operating frequencies of 666 MHz and 333 MHz. As a first approach, the frequency was set before the execution of the ECG analysis flow and kept unaltered for the whole execution time. A second approach included frequency scaling during the execution of the flow.
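One common way to fix the operating frequency from user space is through the cpufreq sysfs interface, sketched below. This is an assumption about the mechanism rather than a description of the exact procedure followed in this work: it relies on the `userspace` governor being available in the kernel, it requires root privileges, and the value written must be one of the frequencies (in kHz) listed in `scaling_available_frequencies` on the target.

```python
from pathlib import Path

# Standard cpufreq sysfs directory for CPU0 on Linux.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def set_cpu_frequency(khz, cpufreq_dir=CPUFREQ_DIR):
    """Pin the CPU clock to khz using the 'userspace' governor (needs root)."""
    cpufreq_dir = Path(cpufreq_dir)
    (cpufreq_dir / "scaling_governor").write_text("userspace\n")
    (cpufreq_dir / "scaling_setspeed").write_text(f"{khz}\n")


def current_frequency_khz(cpufreq_dir=CPUFREQ_DIR):
    """Read back the frequency the kernel currently reports, in kHz."""
    return int((Path(cpufreq_dir) / "scaling_cur_freq").read_text())
```

A call such as `set_cpu_frequency(333333)` before launching the analysis flow would correspond to the first approach described above, with 333333 kHz being an assumed entry of `scaling_available_frequencies`; calling the function between stages would correspond to the second approach.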
B. Hardware Accelerator Configurations
The SVM accelerators have been derived by applying a systematic latency-area HLS exploration tailored to the specific kernel [3]. This exploration was supported by our DSE framework presented in Section IV. Table I summarizes the output of the DSE in terms of the configuration of the SVM classifier accelerator per sampling rate and the utilized FPGA resources. We notice that the HW accelerators produced for sampling rates of 180 Hz and 360 Hz are identical in terms of utilized resources, while the accelerator for a sampling rate of 720 Hz presents higher requirements.
VII. EXPERIMENTAL ANALYSIS
In this section the impact of our design choices on the consumed energy is presented. Our results and observations rely on the execution latency of all implementations, accompanied by measurements of the consumed energy and instant power consumption of the system. Moreover, we focus on the number of detected pulses per patient record and examine how it affects the performance of the system in terms of energy consumption and latency. In this way, the results are correlated to the QoS level of ECG analysis.
Firstly, Fig. 6 and 7 show the average execution latency and consumed energy per patient record, respectively, for each HW/SW configuration. The vertical axis lists the different configurations. For instance, configuration "hw sw 666 360" refers to the HW/SW co-designed version executed with an operating frequency of 666 MHz for an ECG sampling rate of 360 Hz. The best configuration for every sampling rate proves to be the HW/SW co-designed version executed with an operating frequency of 666 MHz, as it not only includes the SVM classifier in the form of a HW accelerator but is also executed at the maximum operating frequency. On the contrary, as expected, the software-only version operating at 333 MHz proved to be the worst configuration in terms of energy consumption and latency.
Secondly, the impact of the operating frequency on both latency and power consumption is clear. An execution with a lower operating frequency tends to be more economical in terms of instant power consumption, by 8.6%. This fact is depicted in Fig. 8. On the other hand, the processing capability of the processing system is reduced, resulting in a higher latency. In this respect, the gain in instant power consumption is not enough to compensate for the increased execution latency, and consequently such a configuration is not suggested for this type of application. A better alternative might have been to switch to a lower frequency when the PS-side of the device is idle or when the data processing is handled by a HW accelerator on the PL-side. However, the non-dominant parts of the flow are executed so fast that frequency scaling would make little or no difference.
Additionally, a comment should be made on the relation between the consumed energy and the number of detected pulses per patient record. For this purpose, the measurements for a sampling rate of 180 Hz are employed.
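The trade-off between the power saving and the longer latency can be made explicit with the relation E = P · t. The sketch below uses the 8.6% power saving quoted above; the latency ratio of 2.0 is a hypothetical value that assumes a fully CPU-bound flow slowing down in proportion to the clock, and is not a measurement from this work.

```python
def energy_ratio(power_saving, slowdown):
    """E_low / E_high, given the fractional power saving at the lower frequency
    and the latency ratio t_low / t_high."""
    return (1.0 - power_saving) * slowdown


def break_even_slowdown(power_saving):
    """Latency ratio at which both frequencies consume the same energy."""
    return 1.0 / (1.0 - power_saving)


POWER_SAVING = 0.086  # 8.6% lower instant power at the lower frequency

print(f"break-even slowdown: {break_even_slowdown(POWER_SAVING):.3f}")
# Hypothetical case: latency doubles when the clock is halved.
print(f"energy ratio at 2x latency: {energy_ratio(POWER_SAVING, 2.0):.3f}")
```

Under these assumptions the lower frequency saves energy only if the flow slows down by less than roughly 9.4%, which is consistent with the observation that the power gain does not compensate for the increased latency.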
To begin with, it should be noted that since all patient records contain the same number of samples for a specific sampling rate and are filtered with a window of a specific length, the noise removal part of the ECG analysis flow exhibits no particular variation in execution latency. On the other hand, there is a definite dependency between the number of detected pulses for each patient record and the execution latency of the feature extraction and diagnosis classification parts of the flow. The more pulses are detected in a patient record, the more pulses have to be classified, leading to an ascending trend in energy consumption as more heart beats are detected, as shown in Fig. 9.
The aforementioned trend is observed for all configurations, yet a careful look at the approximate trend for the HW/SW co-designed versions shows that its gradient is lower than that of the software-only versions. The explanation is as follows. The noise removal stage of the ECG analysis flow is executed in approximately constant time for each patient record, as already mentioned. In the HW/SW versions the SVM classifier is no longer a dominant part of the flow, as its execution latency is greatly reduced. On the other hand, after the power monitoring application takes a voltage and current measurement, it sleeps for 0.2 seconds and then wakes up to take another measurement.
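The gradient comparison discussed above can be reproduced on any set of (detected pulses, energy) pairs with an ordinary least-squares line fit. The sketch below is illustrative only: the data points are synthetic, generated from an assumed fixed cost plus a per-beat cost, and do not correspond to the measurements of Fig. 9.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic example: energy = fixed cost + per-beat cost * detected beats.
beats = [1500, 2000, 2500, 3000]
energy_sw = [10.0 + 0.004 * b for b in beats]     # assumed software-only trend
energy_hw_sw = [10.0 + 0.001 * b for b in beats]  # assumed HW/SW trend

slope_sw, _ = fit_line(beats, energy_sw)
slope_hw_sw, _ = fit_line(beats, energy_hw_sw)
print(f"software-only gradient: {slope_sw:.4f} J per beat")
print(f"HW/SW gradient: {slope_hw_sw:.4f} J per beat")
```

In such a model the intercept captures the stages whose cost does not depend on the number of beats, such as noise removal, while the slope captures the per-beat cost of feature extraction and classification.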
A final comment on the results concerns the aforementioned Quality of Service levels. As already stated, the different ECG sampling rates define three distinct QoS levels, as a higher sampling frequency leads to more trustworthy results. Though expected, these results confirm that a trustworthy ECG analysis system comes with higher power consumption and resource needs.
VIII. CONCLUSION
In this work we have used an ECG analysis and arrhythmia detection flow in order to obtain the energy profile of an embedded medical device based on the Zynq-7000 All Programmable SoC. In addition, the analysis flow has been enhanced with the concept of Quality of Service of ECG analysis by operating on signals of higher or lower sampling frequency, which also translates into higher or lower computational requirements. To utilize the full extent of the aforementioned SoC, its FPGA fabric has been used to host High-Level Synthesis derived HW accelerators which offload computations of the ECG analysis flow to the HW. Multiple configurations of this HW/SW co-design scheme and ECG analysis QoS have been examined on all available records of the MIT-BIH Arrhythmia Database, to quantify the energy profile of the target device. Results indicate that different combinations of QoS levels and analysis flow configurations have a different impact both on instant power consumption and on the energy required to complete the experiments. In general, the most power-hungry configurations result in maximum instant power consumption but diminished energy requirements due to their small execution latency.
