In this article we focus on multiprocessor system-on-chip (MPSoC) architectures for real-time analysis of the human heart electrocardiogram (ECG), as a hardware/software (HW/SW) platform offering an advance relative to state-of-the-art solutions. This is a relevant biomedical application with a good potential market, since heart diseases are responsible for the largest number of yearly deaths. Hence, it is a good target for an application-specific system-on-chip (SoC) and HW/SW codesign.
INTRODUCTION
The advance in embedded systems and multiprocessor trends paves the way for the development of single-chip solutions for computationally intensive biomedical applications with potential health benefits for a large number of individuals. One important application in this respect is the accurate, real-time, remote analysis of human heart activity, which has always been a challenging problem for biomedical engineers. Heart disorders such as cardiovascular disease (CVD) and stroke are the leading causes of death in the world for both women and men of all ethnic backgrounds [Fuster 1999]. In 2003, CVD alone was responsible for 29.2% of total global deaths, according to the World Health Organization (WHO), and this percentage is increasing every year [Fuster 1999]. More than 50% of these deaths could be prevented with a reliable combination of cost-effective monitoring and accurate analysis [Fuster 1999].
Heart activity is electrically recorded as a set of electrocardiogram (ECG) signals, which can readily reveal a number of heart malfunctions [Lo et al. 2005; CIMIT 2006; BIOPAC 1985]. The most reliable ECG analysis technique is the 12-lead ECG, which requires the reading and analysis of 12 different signals sensed from the patient's body. The main challenge arises from the high computational demand of processing huge amounts of ECG data in parallel under stringent time constraints, relatively high sampling frequencies, and life-critical conditions [Harland et al. 2003]. The challenges become even more complex when the patient is mobile and remotely monitored (as in cases of homecare and emergency at the point-of-need) [HSFC 1999], because state-of-the-art biomedical equipment for heart monitoring lacks the ability to provide large-scale analysis and remote, real-time computation at the patient's location. This necessitates the transmission of huge amounts of life-critical data over a communication link to a large set of computing devices at another location [CIMIT 2006]. For mobile patients, this requires a fully functional, always-on wireless connection, since losing even a few items of heartbeat data may be life threatening.
To overcome the aforementioned challenges and the problem of transmitting life-critical data over a wireless link (which is neither reliable nor secure enough), our solution is to process the complex biomedical computations of the 12-lead ECG in parallel on a wearable MPSoC. The solution therefore transmits only secure remote-alarm signals and reports on the results of the analysis. These result reports are much smaller (a few bytes) than the ECG data (megabytes), and if transmission fails, they can be retransmitted until reception is acknowledged by the healthcare remote-monitoring center, since they are saved in an off-chip memory for every analyzed ECG data chunk.
This technical objective calls for the design of special-purpose SoC architectures featuring increased energy efficiency while providing high computation capabilities. In this article we introduce a novel MPSoC architecture for ECG analysis which improves upon the state of the art, mostly through its capability to perform a number of real-time analyses of input data with high sampling frequencies, leveraging the computation horsepower and functional flexibility provided by many (up to 12) concurrent DSPs. The proposed architecture addresses usability, security, and safety of patients in emergency situations and long-term treatments. Comparison between our design and previous work shows the advantages of our design from both an SoC performance point of view and an application point of view. For instance, the comparison in Section 5 with the relatively old but still widely used Pan-Tompkins algorithm [Pan and Tompkins 1985] shows that the Pan-Tompkins algorithm may fail to detect the correct heart period in cases where our solution succeeds. The biochip system builds upon some of the most advanced industrial components for MPSoC design (multi-issue VLIW DSPs, system interconnect from STMicroelectronics, and commercial off-the-shelf biomedical sensors), which have been composed into a scalable and flexible platform. We have thereby ensured its reusability for future generations of ECG analysis algorithms and its suitability for the porting of other biomedical applications, in particular those collecting input data from wired/wireless sensor networks.
The article goes through all steps of the design methodology, from application functional specification to hardware definition and modeling. System performance has been validated through functional, timing-accurate simulation on a virtual platform. A technology-homogeneous power estimation framework for 0.13 μm technology, leveraging industrial power models, is used for power management considerations.
BACKGROUND
Biomedical sensors today exhibit increased energy efficiency and therefore prolonged lifetimes (up to 24 hours), as well as higher sampling frequencies (up to 10 kHz for ECG), and often provide wireless connectivity [Ambu 2007]. Unfortunately, a mismatch exists between advances in sensor technology and the capabilities of state-of-the-art heartbeat analyzers [Harland et al. 2003]. The latter usually cannot keep up with the data acquisition rate and are usually wall-plugged, thus preventing mobile monitoring. We aim at using state-of-the-art commercial sensors, namely the silver/silver chloride "Blue Sensor R" from the Ambu corporation [Ambu 2007]. Moreover, most present-day solutions look at only a part of the ECG signal to detect whether the heartbeat was healthy, and hence make no use of the huge amount of information that may be gained from state-of-the-art sensors and modern advances in electronics (SoC, in particular). For instance, although many hospitals use modern sensors that give very accurate data recordings, they still use relatively old methods and platforms to analyze the recorded data. Those methods may fail, and even when they do not, they analyze only a part of the heart signal rather than the full ECG signal. This in turn gives only partial knowledge about the heart, keeping many disease cases obscure and dependent on the nurse's eyes (i.e., not as accurate as modern computer processing).
Application-Specific Background
Our application is the 12-lead ECG, which uses 9 sensors on the patient's body. With 3 of these sensors, physicians can use a method known as the 3-lead ECG, which fails to detect many diseases or malfunctions of the heart. Using 9 sensors gives a more detailed view of the heart, which maximizes the detection of heart problems. Hence, increasing the number of sensors increases the ability to detect disease. By interconnecting the 9 sensors for the 12-lead ECG, we get 12 biomedical voltage signals; this produces huge amounts of data, especially when recorded over many hours. Physicians use the 12-lead ECG method because it allows them to view the heart in its three-dimensional form, thus enabling detection of many abnormalities that may not be apparent with the 3-lead technique. Figure 1 shows an example of a typical ECG signal, where the most important peaks are labeled P, Q, R, S, T, and U. Each of these peaks and their relative interpeak distances are related to a heart activity that is important for analysis. In addition, each abnormal combination of interpeak interval values points to a type of heart malfunction. The higher the sampling frequency, the more accurate the analysis, since there are diseases in which two peaks are very close together (especially the R and T peaks in the case of the R-on-T phenomenon [Segal 1997]), so that it becomes hard to detect the interpeak distances and the heart period.
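The text does not spell out how the 12 signals are formed from the electrode potentials. A minimal sketch, assuming the conventional clinical definitions (Einthoven limb leads, Goldberger augmented leads, and Wilson precordial leads), is given below. The function name and the electrode arguments are illustrative only.

```python
def derive_12_leads(ra, la, ll, chest):
    """Return the 12 lead voltages for one sample instant.

    ra, la, ll: right-arm, left-arm, and left-leg electrode potentials.
    chest: the six precordial electrode potentials (C1..C6).
    Assumption: conventional lead definitions, not taken from the article.
    """
    if len(chest) != 6:
        raise ValueError("expected six chest electrode potentials")
    wct = (ra + la + ll) / 3.0  # Wilson central terminal (reference potential)
    leads = {
        "I": la - ra,
        "II": ll - ra,
        "III": ll - la,
        "aVR": ra - (la + ll) / 2.0,
        "aVL": la - (ra + ll) / 2.0,
        "aVF": ll - (ra + la) / 2.0,
    }
    for i, c in enumerate(chest, start=1):
        leads[f"V{i}"] = c - wct  # precordial leads V1..V6
    return leads
```

Under these definitions the limb leads are not independent (lead II equals lead I plus lead III), which is one reason a small number of electrodes can yield 12 signals.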
Previous Work
ECG monitoring and analysis have been explored by many companies and research organizations. However, we are not aware of any single-chip real-time analysis solution for the full 12-lead ECG that is able to accurately study the rhythmic period of the heart and to analyze all the peaks P, Q, R, S, T, and U and their interpeak intervals in order to arrive at a disease diagnosis. Previous work on ECG analysis can be classified into four types of solutions: (i) classical stationary-machine solutions [BIOPAC 1985]; (ii) SoC solutions [Chang et al. 2004; Freescale 2007]; (iii) handheld device solutions [Hung et al. 2004; Jun and Hong-Hai 2004]; and (iv) ASIC solutions [Desel et al. 1996]. The classical solutions do not allow for patient mobility or remote analysis since they are wall-plugged, and thus require many beds in the healthcare center. Moreover, in the classical medical technique for ECG analysis, the 12-lead signals are printed on paper for eyeballing, which makes checking the different heart peaks and rhythms difficult and inaccurate because it depends on the physician's eyes. With digital recording and filtering, on the other hand, the peaks can be determined more accurately. The SoC solution in Chang et al. [2004] does not run 12-lead analyses, but runs 1 lead per SoC. Consequently, running 12-lead analyses with this solution means using 12 chips. One commercial solution [Freescale 2007] takes 8 input sensor lines and calculates and analyzes the lead signals on a single DSP; hence it is time consuming. It only detects whether the heart is healthy or unhealthy without analyzing diseases, since it only detects the QRS peaks and not the P and T peaks, and it is not scalable. It uses 12 bits for the signals while we use 16 bits, which adds accuracy to our analysis. The handheld solutions only read and transmit data. The ASIC solutions are used only for data acquisition before transmission.
THE PAN-TOMPKINS ALGORITHM
The Pan-Tompkins algorithm is the most widely used algorithm for heartbeat abnormality detection. Unlike our solution, which detects all peaks in order to check for many abnormalities and diseases, the Pan-Tompkins solution only detects whether the heartbeat is normal or abnormal. Hence, the Pan-Tompkins algorithm is built to detect the QRS interval only. In what follows we discuss the Pan-Tompkins solution, which we implemented on the basis of the existing Pan-Tompkins algorithm [Pan and Tompkins 1985]. We do so in order to compare it with our solution, thereby showing the efficiency of our HW/SW codesign in achieving accurate real-time analysis of the ECG signal via MPSoC. Many ECG instruments require an accurate QRS detector (see Figure 1). Such detection is usually difficult due to the presence of noise in the ECG signal. The Pan-Tompkins algorithm combines the three known types of QRS detection: linear digital filtering, nonlinear transformations, and decision-rule algorithms. The technique consists of the following stages: analog filter, band-pass filter, derivation, squaring, and windowing. The algorithm was made to run on a Z80 (Zilog) or NSC800 (National Semiconductor) microprocessor. Hence it is quite fast, but may fail. The processing is done in integer arithmetic so that the algorithm operates in real time [Pan and Tompkins 1985] without consuming excessive power. The algorithm detects QRS complexes only, using slope, amplitude, and width information.
First, an analog filter is used to band-limit the ECG signal at 50 Hz. Then, an A/D converter samples the signal at Fs = 200 Hz. The resultant vector is passed through a band-pass filter implemented using cascaded low-pass (cut-off frequency 15 Hz) and high-pass (cut-off frequency 5 Hz) filter stages to remove high-frequency noise, P-waves, T-waves, and other artifacts. The result is a 3 dB pass-band from about 5 Hz to 12 Hz, which approximates the desirable pass-band for maximizing the QRS energy (5 Hz to 15 Hz). In this stage, not only is noise removed, but the SNR is also improved so that lower thresholds can be used to enhance detection sensitivity. The band-pass-filtered signal is then passed through a local peak detection algorithm which identifies and marks all peaks found in the signal. It also uses a set of two moving threshold parameters $T_1$ and $T_2$ ($T_1 = 2T_2$) to select candidate QRS complexes. If within a certain time interval, defined as the peak-search time, no new QRS peak is found with an amplitude larger than the $T_1$ threshold, a search-back routine is executed with the $T_2$ threshold level. Both thresholds are adaptive and are thus modified based on the amplitude of each new peak found.
The filtered signal is then sent to the nonlinear transformation stage, where the derivative of the signal is calculated. The derivative contains information about the slope of the QRS. The squaring process intensifies the slope of the frequency response of the differentiated signal, which helps to reject false peaks such as T-waves.
A moving-window integrator obtains information about the width of the QRS complex. This result is then passed through the same local peak detection and threshold-setting algorithms as the original band-pass-filtered signal to identify QRS slope information.
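The processing chain described above can be sketched in a few lines. The sketch assumes the integer-coefficient recursive filters published in Pan and Tompkins [1985] for a 200 Hz sampling rate, with the high-pass stage normalized to unity gain; the function names and the zero initial conditions are illustrative choices and do not describe the code running on our platform.

```python
def _at(seq, i):
    """Return seq[i], treating samples before the start of the record as zero."""
    return seq[i] if i >= 0 else 0.0


def low_pass(x):
    """y[n] = 2y[n-1] - y[n-2] + x[n] - 2x[n-6] + x[n-12] (DC gain 36)."""
    y = []
    for n in range(len(x)):
        y.append(2 * _at(y, n - 1) - _at(y, n - 2)
                 + x[n] - 2 * _at(x, n - 6) + _at(x, n - 12))
    return y


def high_pass(x):
    """Delayed input minus a 32-sample moving average: y[n] = x[n-16] - mean."""
    y, running = [], 0.0
    for n in range(len(x)):
        running += x[n] - _at(x, n - 32)  # sum of the last 32 samples
        y.append(_at(x, n - 16) - running / 32.0)
    return y


def derivative(x, fs=200.0):
    """Causal five-point derivative, delayed by two samples."""
    scale = fs / 8.0  # 1 / (8T) with T = 1 / fs
    return [scale * (x[n] + 2 * _at(x, n - 1) - 2 * _at(x, n - 3) - _at(x, n - 4))
            for n in range(len(x))]


def square(x):
    """Point-by-point squaring."""
    return [v * v for v in x]


def moving_window_integrate(x, n_window=30):
    """Average of the last n_window samples (N = 30 at 200 Hz)."""
    y, running = [], 0.0
    for n in range(len(x)):
        running += x[n] - _at(x, n - n_window)
        y.append(running / n_window)
    return y


def preprocess(ecg, fs=200.0):
    """Return (band-passed signal, integrated signal) for peak detection."""
    filtered = high_pass(low_pass(ecg))
    integrated = moving_window_integrate(square(derivative(filtered, fs)))
    return filtered, integrated
```

Both returned waveforms are then fed to the peak detector, which is why the function returns the band-passed signal alongside the integrated one.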
All the candidate QRS peaks found in both the filtered and the transformed waveforms are then compared. Only those appearing in both processed waveforms are classified as valid QRS complexes. The output is a stream of pulses indicating the locations of QRS complexes. Such an algorithm relies not only on slope information, but also on the amplitude and width information of the QRS complex. The digital band-pass filter is implemented as a cascade of a low-pass and a high-pass filter. The transfer functions of these filters, following Pan and Tompkins [1985], are shown in Eq. (1) for the low-pass filter and Eq. (2) for the high-pass filter.

$$H_{LP}(z) = \frac{(1 - z^{-6})^{2}}{(1 - z^{-1})^{2}} \tag{1}$$

$$H_{HP}(z) = z^{-16} - \frac{1}{32}\,\frac{1 - z^{-32}}{1 - z^{-1}} \tag{2}$$

The high-pass filter is the result of subtracting a first-order low-pass filter from an all-pass filter (a pure delay). The derivative used is a five-point derivative with the transfer function shown in Eq. (3), where T is the sampling period.

$$H_{D}(z) = \frac{1}{8T}\left(-z^{-2} - 2z^{-1} + 2z + z^{2}\right) \tag{3}$$
The squaring function squares the signal point-by-point according to the operation in Eq. (4).

$$y(nT) = \left[x(nT)\right]^{2} \tag{4}$$
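In addition to slope information, the moving-window integrator helps in obtaining information about other waveform features. It is calculated from Eq. (5),

$$y(nT) = \frac{1}{N}\left[x(nT - (N-1)T) + x(nT - (N-2)T) + \cdots + x(nT)\right] \tag{5}$$

where N is the number of samples in the width of the integration window, and must be approximately equal to the widest QRS complex. For the given sampling frequency of 200 Hz, N = 30 is used. With the band-pass filter, lower thresholds can be used because of the improved SNR. Two adaptive thresholds are used for QRS detection, one double the other. The higher threshold is used for the first analysis of the signal, while the lower one is used in the search-back technique whenever no QRS is detected within a certain time interval. The set of thresholds applied to the output of the moving-window integrator is given in Eq. (6), following Pan and Tompkins [1985].

$$\begin{aligned}
\text{SPKI} &= 0.125\,\text{PEAKI} + 0.875\,\text{SPKI} && \text{(if PEAKI is a signal peak)}\\
\text{NPKI} &= 0.125\,\text{PEAKI} + 0.875\,\text{NPKI} && \text{(if PEAKI is a noise peak)}\\
\text{THRESHOLDI1} &= \text{NPKI} + 0.25\,(\text{SPKI} - \text{NPKI})\\
\text{THRESHOLDI2} &= 0.5\,\text{THRESHOLDI1}
\end{aligned} \tag{6}$$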
PEAKI is the overall peak, SPKI is the most recent running estimate of the signal peak, NPKI is the most recent running estimate of the noise peak, THRESHOLDI1 is the first threshold applied, and THRESHOLDI2 is the second threshold applied. In the case where the QRS complex is found using the second threshold, Eq. (7), again following Pan and Tompkins [1985], is used.

$$\text{SPKI} = 0.25\,\text{PEAKI} + 0.75\,\text{SPKI} \tag{7}$$
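The threshold bookkeeping can be sketched as a small state machine. The sketch assumes the update coefficients published in Pan and Tompkins [1985] (0.125/0.875 for regular updates, 0.25/0.75 after a search-back detection, and a first threshold placed one quarter of the way from the noise estimate to the signal estimate); the class and method names are illustrative.

```python
class AdaptiveThreshold:
    """Running signal/noise peak estimates and the two derived thresholds."""

    def __init__(self, spk=0.0, npk=0.0):
        self.spk = spk  # running estimate of the signal peak (SPKI)
        self.npk = npk  # running estimate of the noise peak (NPKI)

    @property
    def threshold1(self):
        return self.npk + 0.25 * (self.spk - self.npk)

    @property
    def threshold2(self):
        return 0.5 * self.threshold1

    def update(self, peak, searchback=False):
        """Classify one peak, update the estimates, return True for a QRS."""
        limit = self.threshold2 if searchback else self.threshold1
        if peak > limit:
            # Search-back detections pull the signal estimate faster.
            weight = 0.25 if searchback else 0.125
            self.spk = weight * peak + (1.0 - weight) * self.spk
            return True
        self.npk = 0.125 * peak + 0.875 * self.npk
        return False
```

One instance would be kept for the integrator output and a second one for the band-pass output, since the two waveforms have different amplitude scales.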
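The same technique is implemented for peak detection on the output signal of the band-pass filter, with THRESHOLDF1 and THRESHOLDF2. Moreover, the first thresholds of the two sets (THRESHOLDI1 and THRESHOLDF1) are halved in the case of irregular heart rates. The average RR interval in Eq. (8) corresponds to the mean of the 8 most recent RR intervals. For managing irregular heart rates, another average RR interval, given in Eq. (9), is calculated from the 8 most recent beats having RR intervals that fall within certain limits. The equations follow Pan and Tompkins [1985].

$$\text{RR AVERAGE1} = 0.125\,(RR_{n-7} + RR_{n-6} + \cdots + RR_{n}) \tag{8}$$

where $RR_n$ is the most recent RR interval.

$$\text{RR AVERAGE2} = 0.125\,(RR'_{n-7} + RR'_{n-6} + \cdots + RR'_{n}) \tag{9}$$

where $RR'_n$ is the most recent RR interval that fell between the acceptable low and high RR-interval limits in Eq. (10).

$$\begin{aligned}
\text{RR LOW LIMIT} &= 92\%\ \text{RR AVERAGE2}\\
\text{RR HIGH LIMIT} &= 116\%\ \text{RR AVERAGE2}\\
\text{RR MISSED LIMIT} &= 166\%\ \text{RR AVERAGE2}
\end{aligned} \tag{10}$$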
If no QRS is found during the RR MISSED LIMIT interval, the maximal detected peak is considered to be a QRS candidate, and the lower of the two thresholds is applied for the search-back technique. In this way, we avoid using long memory buffers for storing past ECG data, and require less computation time to implement the search-back technique, since we only need the values within RR MISSED LIMIT. Whenever the RR interval is less than 360 ms and the maximal slope occurring during the detected QRS is less than half that of the QRS waveform that preceded it, the detected waveform is identified as a T-wave and not a QRS complex.
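The RR-interval bookkeeping and the T-wave test can be sketched as follows. The 92%, 116%, and 166% limits are those published in Pan and Tompkins [1985]; the class name, the warm-up behavior with fewer than 8 intervals, and the use of seconds as the unit are assumptions made for illustration.

```python
from collections import deque


class RRTracker:
    """Keeps the two running RR averages and the derived limits."""

    def __init__(self):
        self.recent = deque(maxlen=8)   # the 8 most recent RR intervals
        self.regular = deque(maxlen=8)  # the 8 most recent in-limit RR intervals

    @staticmethod
    def _mean(values):
        return sum(values) / len(values)

    def average1(self):
        return self._mean(self.recent)

    def average2(self):
        return self._mean(self.regular)

    def limits(self):
        """Return (low, high, missed) RR limits derived from the second average."""
        avg = self.average2()
        return 0.92 * avg, 1.16 * avg, 1.66 * avg

    def add(self, rr):
        """Record a new RR interval (seconds)."""
        self.recent.append(rr)
        if not self.regular:  # assumption: the first beat seeds the average
            self.regular.append(rr)
            return
        low, high, _ = self.limits()
        if low < rr < high:
            self.regular.append(rr)


def is_t_wave(rr, slope, previous_qrs_slope):
    """T-wave rule: short RR interval and less than half the previous slope."""
    return rr < 0.360 and slope < 0.5 * previous_qrs_slope
```

A detector would trigger the search-back routine once the time since the last accepted QRS exceeds the third value returned by `limits()`.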
ACF-BASED ALGORITHM
Since MPSoC technology provides scalable computation horsepower, it allows running more computation-demanding applications [Ruggiero et al. 2005], which from an application viewpoint means more accurate analysis algorithms. Consequently, we propose an algorithm based on the autocorrelation function (ACF). In this section we discuss this ACF-based algorithm, which has the advantage of detecting more diseases and of analyzing the whole heart signal with all its peaks and interpeak intervals.
Filtering
Data provided by biomedical sensors suffers from several types of noise: DC-offset and hum noise, patient movements, and signal interference [Company-Bosch and Hartmann 2003]. To overcome the problems related to sensor noise, we designed an IIR filter (implemented in hardware on a dedicated chip feeding an external SDRAM memory) that outputs its results in 16-bit binary format. Our IIR filter is of order 3, which proves sufficient for our ECG analysis. Figure 1 shows an example of our filter results. Thanks to this filtering step, the ECG analysis algorithm does not have to take care of filtering and noise suppression. The filtering is discussed in more detail in Khatib et al. [2006]. It is worth mentioning that an A/D step (before filtering) is needed, and that the more data produced by the analog-to-digital conversion, the better the analysis, since cases like the R-on-T phenomenon require a higher resolution of input data.
Algorithm
Our proposed ECG-analysis algorithm (see Figure 2) is conceived as parallel, and hence scalable, from the ground up. Since each lead is sensed and analyzed independently, each lead can be assigned to a different processor. To extend the analysis to a 15-lead ECG or more, for example, all that is required is to change the number of processing elements in the system. The program reads a data file in chunks of 4 sec.; the reason for this choice is discussed below. The data file holds the values of the ECG at the lead in binary format. By reading the data continuously every 4 sec., we emulate a real sensor sending continuous data to an intermediate buffer that holds 4 sec. of data sampled at a certain frequency, typically 1000 Hz. We use an autocorrelation-function (ACF) based methodology to calculate the period and other parameters of the heartbeat, since it gives more accurate results than conventional methods that search for the distance between two peaks. The autocorrelation we use, shown in Eq. (11), has a limited number of lags (L) to minimize the computation for our specific application, as discussed next. We validated our algorithm over several medical traces [Physiobank 2000].
$$R_y[k] = \sum_{n} y[n]\,y[n-k], \qquad k = 0, 1, \ldots, L \tag{11}$$

where $R_y$ is the autocorrelation function, y is the filtered signal under study, n is the index of the signal y, and k is the lag, which runs up to the number of lags L of the autocorrelation. L affects performance because of the high number of multiplications involved. We ran the experiments for n = 1250, 5000, and 50,000, corresponding to sampling frequencies of 250, 1000, and 10,000 Hz, respectively. Running this algorithm with Eq. (11) takes around 1.75 million multiplications. To minimize errors and execution time we use the derivative of the filtered ECG signal, since if a function is periodic then its derivative is also periodic. Hence, the autocorrelation function of the derivative can give the period, as shown in Figure 3. In order to analyze ECG data in real time and to be reactive in transmitting alarm signals to healthcare centers (in less than 1 min.), a minimum amount of acquired data has to be processed at one time without losing the validity of the results. For the heartbeat period, we need at least 4 sec. of ECG data in order for the ACF to give correct results.
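A minimal sketch of the per-lead processing is given below: it reads 4-sec. chunks of 16-bit samples and estimates the heart period from the autocorrelation of the signal derivative. Several details are assumptions made for illustration and are not taken from our implementation: the little-endian signed sample layout, the use of a first difference as the derivative, the search range for the period (0.25 to 2.0 sec.), and all function names.

```python
import io
import struct


def read_chunks(stream, fs=1000, seconds=4):
    """Yield successive chunks of fs * seconds 16-bit samples from a binary stream."""
    samples_per_chunk = fs * seconds
    bytes_per_chunk = 2 * samples_per_chunk
    while True:
        raw = stream.read(bytes_per_chunk)
        if len(raw) < bytes_per_chunk:
            return  # an incomplete trailing chunk is not analyzed
        yield list(struct.unpack(f"<{samples_per_chunk}h", raw))


def autocorrelation(y, max_lag):
    """R[k] = sum over n of y[n] * y[n-k], for k = 0..max_lag."""
    n = len(y)
    return [sum(y[i] * y[i - k] for i in range(k, n)) for k in range(max_lag + 1)]


def estimate_period(signal, fs, min_period=0.25, max_period=2.0):
    """Estimate the period (seconds) as the lag of the largest ACF value."""
    # First difference as a discrete derivative of the filtered signal.
    d = [signal[i] - signal[i - 1] for i in range(1, len(signal))]
    lo = int(min_period * fs)
    hi = min(int(max_period * fs), len(d) - 1)
    if hi < lo:
        raise ValueError("chunk too short for the requested period range")
    r = autocorrelation(d, hi)
    best_lag = max(range(lo, hi + 1), key=lambda k: r[k])
    return best_lag / fs
```

Computing the ACF this way costs on the order of n times L multiplications per chunk, which is why limiting the number of lags matters. Because each lead has its own stream, one such loop can run on each processor without any data sharing.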
From a technical viewpoint, real-time processing of ECG data allows a finer-granularity analysis than traditional eyeball monitoring of the paper ECG readout.
ALGORITHM COMPARISON
We see a need to compare our solution with the existing and still widely used Pan-Tompkins algorithm, since many medical institutes and hospitals depend on it, and we show that it can fail. Our solution is not amplitude dependent, but time dependent. In this respect, we show that the Pan-Tompkins solution fails in some cases in which our solution does not. The reason is that our algorithm does not depend on amplitude peak values (P, Q, R, S, T, or U) to find the period between two peaks. Instead, it looks at the repetitiveness of the function with respect to itself, using the characteristics of the autocorrelation. In the following we discuss these advantages.
Although the Pan-Tompkins algorithm (discussed in Section 3) involves many computations, it is quite fast, since it detects slopes and amplitudes. To assess its efficiency and applicability, we ran it over the standard MIT/BIH arrhythmia database [MIT-BIH Database 1980].
In this respect, our ACF-based solution ran over all MIT/BIH arrhythmia database cases without any failure, and detected the heart period with an average percentage error of 3%. The error was calculated as discussed in the case studies given next. However, when we ran the Pan-Tompkins implementation on our platforms, we found that it fails in some cases and may lead to erroneous results.
We list two cases of the MIT-BIH arrhythmia database, which failed (with a high percentage of error) when using the Pan-Tompkins solution, while our solution showed much better performance.
First Case for Pan-Tompkins Failure
The first case in which the Pan-Tompkins algorithm fails is record 217 (male, age 65), with data taken from Lead II [MIT-BIH Database 1980]. The trace characteristics are: paced beats, pacemaker fusion beats, and normal beats. The periods obtained using the eyeballing technique, the Pan-Tompkins algorithm, and our ACF-based solution are 0.5361, 0.8722, and 0.5500 sec., respectively. The eyeballing period was calculated from the first two R-peaks of the ECG data. The error when using the Pan-Tompkins solution (the difference between the periods calculated by Pan-Tompkins and by eyeballing, divided by the eyeballing period) is 62.7%. Although it converges in a time similar to that of our ACF solution (<3.5 sec.), the results of Pan-Tompkins can be devastating for a patient. In the ACF case, the error (the difference between the periods calculated by the ACF method and by eyeballing, divided by the eyeballing period) is only 2.6%, which is acceptable and still gives the physician a good margin to identify the right disease.
Second Case for Pan-Tompkins Failure
The second case in which the Pan-Tompkins algorithm fails is record 208 (female, age 23), with data taken from Lead II [MIT-BIH Database 1980]. The trace characteristics are ventricular couplets. The periods obtained using the eyeballing technique, the Pan-Tompkins algorithm, and our ACF-based solution are 0.6111, 0.2583, and 0.6556 sec., respectively. In this case, the error using the Pan-Tompkins solution is 57.8%. Nurses and medical teams can readily appreciate the advantages of our HW/SW solution, which gives more accurate results than present-day solutions and converges to the result in real time.
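The error figures quoted in the two cases can be reproduced from the reported periods with the short calculation below, which takes the eyeballing period as the reference. The function and dictionary names are illustrative.

```python
def percent_error(estimate, reference):
    """Absolute difference relative to the reference, in percent."""
    return abs(estimate - reference) / reference * 100.0


# Periods in seconds, as reported for the two MIT-BIH records.
cases = {
    "record 217": {"eyeball": 0.5361, "pan_tompkins": 0.8722, "acf": 0.5500},
    "record 208": {"eyeball": 0.6111, "pan_tompkins": 0.2583, "acf": 0.6556},
}

for name, p in cases.items():
    pt = percent_error(p["pan_tompkins"], p["eyeball"])
    acf = percent_error(p["acf"], p["eyeball"])
    print(f"{name}: Pan-Tompkins {pt:.1f}%, ACF {acf:.1f}%")
```

For record 217 this gives 62.7% and 2.6%, matching the figures above. For record 208 the rounded periods give about 57.7% for Pan-Tompkins, in line with the reported 57.8% up to rounding of the listed periods, and about 7.3% for the ACF-based solution.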
The Pan-Tompkins solution fails because of the nature of the algorithm described above: it tries to find correct thresholds, which ultimately depend on human choice. Moreover, it detects amplitude peaks; hence, when the T and R waves are close together, as in the R-on-T phenomenon, the Pan-Tompkins method cannot differentiate R from T, and an R-T interval is mistaken for an R-R interval, as revealed by the tests. With our ACF-based SoC solution, in contrast, we minimize the number of computations (although still in the millions) to suit the HW/SW codesign, and we provide a method that takes the time axis as its point of reference by considering periodicity. Accordingly, we detect the period first, without the need for an amplitude threshold.
Comparing our solution with the widely used Pan-Tompkins algorithm shows that we do not spend any power or time on looking for thresholds, but rather use a concrete form of the ACF and the ACF of the derivative to find whether the heartbeat is periodic, and we calculate the period directly from the ACF. In this way we avoid the threshold-related failures described above, since we simply do not look at the amplitude axis. At the same time, we are able to design a hardware MPSoC that can cope with the required software and algorithm.
MPSoC ARCHITECTURE
Having observed that the ACF-based algorithm is clearly more accurate (while also providing more analysis, since it checks all the peaks), let us build a system that can handle its computations and tune the hardware platform for this algorithm. In order to process filtered ECG data in real time, we choose to deploy a parallel multiprocessor system-on-chip architecture. The key point of such systems is to break up functions into parallel operations, thus speeding up execution and allowing individual cores to run at a lower frequency than traditional monolithic processor cores. Technology today allows the integration of tens of cores onto the same silicon die, and we therefore designed a parallel system with up to 13 masters and 16 slaves (see Figure 4). Since we are targeting a platform of practical interest, we choose advanced industrial components [Loghi et al. 2004a]. The processing elements are multi-issue VLIW DSP cores from STMicroelectronics, featuring 32 KB instruction and data caches. These cores have four execution unit stages and rely on a highly optimized cross-compiler in order to exploit the parallelism. They combine the flexibility of programmable cores with the computation efficiency of DSP cores. Moreover, these features allow reusing this platform for biomedical applications other than the 12-lead ECG, thus making it cost effective. Each processor core has its own private memory (512 KB each), which is accessible through the bus, and can access an on-chip shared memory (8 KB is enough for this application) for storing computation results. Other relevant slave components are a semaphore slave, which implements the test-and-set operation in hardware and is used by the processors for synchronization or for accessing critical sections, and an interrupt slave, which distributes interrupt signals to the processors. Interrupts to a certain processor are generated by writing to a specific location mapped to this slave core.
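The behavior of the semaphore slave can be illustrated with a small software model. This is only a sketch of test-and-set semantics under assumed conventions (a read returns the old value and sets the location to 1, a release writes 0); the class name, the lock used to model one bus transaction at a time, and the method names are hypothetical and do not describe the actual hardware interface.

```python
import threading


class SemaphoreSlave:
    """Software model of a bank of hardware test-and-set locations."""

    def __init__(self, count):
        self._flags = [0] * count
        self._bus = threading.Lock()  # models one bus transaction at a time

    def test_and_set(self, index):
        """Atomically return the old value and set the location to 1."""
        with self._bus:
            old = self._flags[index]
            self._flags[index] = 1
            return old

    def release(self, index):
        """Free the semaphore by writing 0."""
        with self._bus:
            self._flags[index] = 0


def enter_critical_section(slave, index):
    """Spin until the semaphore is acquired (the old value was 0)."""
    while slave.test_and_set(index) != 0:
        pass
```

A processor that reads 0 has acquired the semaphore; any other processor reading the same location in the meantime sees 1 and keeps polling until the owner releases it.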
The STBus interconnect from STMicroelectronics was instantiated as the system communication backbone. STBus can be instantiated as a shared bus or as a partial or full crossbar. Thus, it allows efficient interconnect design and provides flexible support for design space exploration.
In our first implementation, we target a shared bus to reduce system complexity (again see Figure 4), and we assess whether application requirements can already be met with this configuration. We then also explore a crossbar-based system, which is sketched in Figure 5. The increased parallelism inherent in a crossbar topology decreases contention on shared communication resources, thus reducing overall execution time. In our implementation, only the instantiation of a 3 × 6 crossbar was of interest for the experiments. We put a private memory on each branch of the crossbar, which can be accessed by the associated processor core or by a DMA engine for off-chip to on-chip data transfers. Finally, we have a critical component for system performance: the memory controller. It allows efficient access to the external 64 MB off-chip SDRAM. A DMA engine is embedded in the memory controller tile, featuring multiple programming channels. The controller tile has two ports on the system interconnect: one slave port for control and one master port for data transfers. The overall controller is optimized to perform long DMA-driven data transfers and can reach a maximum speed of 600 MB/sec. Embedding the DMA engine in the controller has the additional benefit of minimizing overall bus traffic with respect to traditional stand-alone solutions. Our implementation is particularly suitable for I/O-intensive applications such as the one we are targeting in this work.
In the preceding description, we have reported the worst-case system configurations. Fewer cores can easily be instantiated if needed. Conversely, the architectural template described is very scalable and allows for a further increase in the number of processors in the future. This will allow running even more accurate ECG analyses in real time at the highest sampling frequency available in sensors (e.g., 10 kHz and 15 leads). The entire system has been simulated by means of the MPSIM simulation environment [Loghi et al. 2004a], which provides cycle-accurate functional simulation of complete MPSoCs at a simulation speed of 200 Kcycles/sec. (on average), running on a P4 at 3.5 GHz. The simulator also provides a power characterization framework leveraging 0.13 μm technology-homogeneous industrial power models from STMicroelectronics [Loghi et al. 2004b; Bona et al. 2004]. We believe that for life-critical applications, low-level accurate simulation is worth doing, although potentially slow, in order to fully understand system-level behavior and to have a predictable system with minimum degrees of uncertainty. Each processor core programs the DMA engine to periodically transfer input data chunks into its private on-chip memory. The moved data corresponds to 4 sec. of data acquisition at the sensors: 10 KB at 1000 Hz sampling frequency, transferred on average in 319,279 clock cycles (DMA programming plus actual data transfer) on a shared bus with 12 processors. The consumed bus bandwidth is about 6 MB/sec., which is negligible for an STBus interconnect whose maximum theoretical bandwidth with 1-wait-state memories exceeds 400 MB/sec. Each processor then performs its computation independently and accesses its own private memory for cache line refills. Different solutions can be explored, such as processing more leads on the same processor, which impacts the final execution time.
Output data amounting to 64 bytes is written to the on-chip shared memory, but its contribution to the consumed bus bandwidth is negligible. In principle, when the shared memory is filled beyond a certain level, its content can be swapped by the DMA engine to the off-chip SDRAM, where a history of 8 hours of computation can be stored. Data can also be remotely transmitted via a telemedicine link.
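As a rough check of the 8-hour figure, assume (the text does not say so) that the 64 bytes are produced per lead for every 4-sec. chunk and that all 12 leads are logged:

$$\frac{8 \times 3600\ \text{s}}{4\ \text{s}} = 7200\ \text{chunks}, \qquad 7200 \times 64\ \text{B} \times 12 = 5{,}529{,}600\ \text{B} \approx 5.3\ \text{MB}$$

This is well within the 64 MB off-chip SDRAM. If the 64 bytes instead cover all leads together, the history shrinks to 460,800 B, or about 450 KB.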
EXPERIMENTS AND RESULTS
We ran experiments to check the limits within which the timing figure of merit is respected, since our MPSoC targets a real-time application. We therefore checked the performance of each system design with increasing sampling frequencies (up to 10 kHz). We also ran experiments to optimize the algorithm together with the design, by changing some algorithm parameters and observing the overall performance of the specific biomedical application on each MPSoC design. Finally, we ran experiments distributing the application over different numbers of DSP cores for each design (shared bus, crossbar, and partial crossbar). The results of these experiments are presented next.
As a first exploration, we compared the performance of an ARM7TDMI with that of the ST220 DSP in order to verify the performance of the chosen VLIW on the computation kernel of our specific application. To make the comparison fair, we set similar cache sizes (32 KB) for the two solutions and ran two simulations, each processing one ECG lead at 250 Hz sampling frequency. The comparison is thus between two single cores. We adopt this one-core setup because our first aim is to investigate the computation efficiency of the two cores for our specific biomedical application and to de-emphasize system-level interaction effects such as synchronization mismatches or contention latency for bus access. Hence, the performance of the ARM7 core serves as a reference to assess the computation efficiency of the VLIW DSP core for the same application. In Figure 6, we can observe that the ST220 DSP behaves better in terms of both execution time and energy consumption. In detail, the ARM core is 9 times slower than the ST220 and consumes more than twice the energy of the DSP. These results can be explained by the three considerations given in the following.
(i) The ST220 has better software development tools, which results in smaller executable code: the executable for the ARM is 1.7 times larger than that for the ST220. (ii) The ST220 is a VLIW DSP core and can therefore theoretically achieve a maximum of 4 instructions per cycle (i.e., 1 bundle). (iii) A metric related to both previous considerations is the static instructions per cycle, which depends on compiler efficiency and on the multipipeline execution path of the ST220. For our application, this metric turns out to be 2.9 instructions per bundle.
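To make the third consideration concrete, the sketch below shows one way a static instructions-per-bundle figure could be computed from compiled code and related to the 4-wide issue limit. The bundle representation (a list of opcode strings per bundle) and the toy program are assumptions made for illustration; they do not reflect the actual ST220 toolchain output.

```python
from typing import List

ISSUE_WIDTH = 4  # maximum instructions per bundle, as stated for the ST220


def static_instructions_per_bundle(bundles: List[List[str]]) -> float:
    """Average number of instructions packed into each bundle of the code."""
    if not bundles:
        raise ValueError("at least one bundle is required")
    return sum(len(bundle) for bundle in bundles) / len(bundles)


def issue_slot_utilization(avg_per_bundle: float, width: int = ISSUE_WIDTH) -> float:
    """Fraction of the available issue slots that the compiler filled."""
    return avg_per_bundle / width


# Hypothetical toy program: four bundles holding 4, 3, 2 and 3 instructions.
toy_bundles = [
    ["ld", "add", "mul", "cmp"],
    ["ld", "add", "st"],
    ["add", "br"],
    ["ld", "mul", "add"],
]
print(static_instructions_per_bundle(toy_bundles))  # average for the toy code

# The value reported in the text for the ECG application.
print(f"{issue_slot_utilization(2.9):.1%} of issue slots filled")
```

With the reported 2.9 instructions per bundle, roughly 72.5% of the theoretical 4-wide issue capacity is used.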
We therefore select the ST220 as the best processor core for our computation kernel from both the performance and energy viewpoints. We now want to configure the system so as to satisfy the application requirements at minimum hardware cost. To this end, we measure execution time and energy dissipation for an increasing number of DSP cores. Since commercially available ECG solutions target sampling frequencies ranging from 250 to 1000 Hz, we performed the exploration for these two extreme cases for the 12-lead ECG signal. We analyze a chunk of 4 sec. of input data, which provides a reasonable margin for safe detection of heartbeat disorders. Note that the computation workload for the processor cores increases polynomially with the sampling frequency (due to the specific application algorithm). Figure 7 shows that as we increase the number of processors, the execution time scales linearly. This indicates that the second-order effects typical of multiprocessor systems (e.g., bus contention reducing the bandwidth offered to processor cores below the requested one) have only a negligible impact on system performance, that the system is well configured, and that a single shared bus communication architecture is well suited to this application. This does not mean that the amount of data moved across the bus is negligible: it is around 100 KB (at 1000 Hz). This data is, however, read by the processor cores throughout the entire execution time, thus absorbing only a small portion of the bus bandwidth. In this regime, the bus performance is still additive.
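The sizing reasoning above can be sketched as a small model. It assumes an ideal additive regime: leads are split as evenly as possible across cores, each core processes its leads one after another, and the per-lead time does not depend on how many cores are active. The function names and the per-lead times used below are illustrative assumptions, not measured values.

```python
import math
from typing import Optional


def chunk_execution_time(num_leads: int, num_cores: int, per_lead_time_s: float) -> float:
    """Time to process one input chunk when leads are split evenly across cores.

    Assumes no contention, so the busiest core determines the total time.
    """
    if num_leads < 1 or num_cores < 1:
        raise ValueError("num_leads and num_cores must be at least 1")
    leads_on_busiest_core = math.ceil(num_leads / num_cores)
    return leads_on_busiest_core * per_lead_time_s


def min_cores_for_deadline(num_leads: int, per_lead_time_s: float,
                           deadline_s: float) -> Optional[int]:
    """Smallest core count that meets the deadline, or None if none does."""
    for cores in range(1, num_leads + 1):
        if chunk_execution_time(num_leads, cores, per_lead_time_s) <= deadline_s:
            return cores
    return None


# Illustrative inputs: 12 leads, a 4 sec. chunk deadline, and two assumed
# per-lead processing times.
print(min_cores_for_deadline(12, 3.1, 4.0))  # a heavy per-lead workload
print(min_cores_for_deadline(12, 1.0, 4.0))  # a lighter per-lead workload
```

When the per-lead time approaches the chunk length, no core can take a second lead within the deadline, so one core per lead is needed; lighter workloads allow several leads to share a core.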
Moreover, the perfect scalability of the application is also due to memory controller performance. At the beginning of the computation, each processor loads its processing data from off-chip to on-chip memory, hence requiring peak memory controller bandwidth. The memory controller architecture proves capable of providing the required bandwidth in an additive fashion. Looking at the 1000 Hz plot (see Figure 8), we observe that 1 DSP is able to process an ECG lead in slightly more than 3 sec. We therefore still have about 1 sec. left before the 4 sec. deadline, which is enough to perform additional analysis of the results of the individual lead computations and converge to a decision about the heart period and malfunctions. Looking forward, we try to understand how our solution is positioned with respect to the demand for higher sampling frequencies, raised by the need for higher-accuracy analysis and by the evolution of state-of-the-art sensor technology. We therefore measure and plot the maximum sampling frequency at which our MPSoC solution can be operated while still meeting real-time requirements.
We simulate the 12-processor system to get the upper bound on system performance and push the architecture to the limit. For the same reason, we restrict the analysis period to 3.5 sec., which is the minimum value of the input data-chunk timespan derived from the biomedical algorithm. The results presented in Figure 9 show that with the shared bus architecture, the maximum sampling frequency that the MPSoC can handle without violating the real-time constraint is only around 2200 Hz. This frequency translates to poor scalability, mainly because the interconnect performance no longer scales. In fact, the bus-busy metric (the number of bus-busy cycles over the total execution time) at the critical frequency of 2200 Hz is almost 100% (99.95%); that is, the bus is fully saturated. This is because the amount of data transferred across the bus increases linearly with the sampling frequency. In order to make the platform performance more scalable, we turn to a full-crossbar solution for the communication architecture.
The benefits are clearly observed in Figure 9, where the maximum analyzable frequency (with respect to real-time constraints) now amounts to 4000 Hz, that is, nearly twice that achieved with a shared bus. Moreover, we observe that average bus transaction latencies at the critical frequency are still very close to the minimum latencies, indicating that the crossbar is very lightly loaded. Another informative metric is the bus efficiency (the number of cycles during which the bus transfers useful data over the bus-busy cycles), which amounts to 71.83%.
This good performance results from the lack of contention on crossbar branches, which is in turn due to the high performance of the memory controller and to the match between application traffic patterns and the underlying parallel communication architecture. As a consequence, with a full crossbar the system performance is no longer interconnect-limited but computation-limited. Since the computation workload of the system grows polynomially with the sampling frequency, task execution times increase rapidly and the available slack time with respect to the deadline shrinks.
We observe that the performance with a partial crossbar closely matches that of a full crossbar (less than 2% average difference), but with almost 3 times fewer hardware resources. We found the optimal partial crossbar configuration (5 × 5 instead of 13 × 13) by accurately characterizing shared bus performance: on a shared bus, we increased the number of processors and observed when the execution times started deviating as an effect of bus contention.
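The characterization step described above can be sketched as a simple knee search over measured execution times. The tolerance threshold and the sample measurements below are hypothetical and are not data from our experiments. The sketch assumes each core processes the same per-core workload, so that the execution time would stay flat in the absence of contention.

```python
from typing import Dict


def max_cores_in_additive_regime(exec_times: Dict[int, float],
                                 tolerance: float = 0.02) -> int:
    """Largest core count whose execution time stays near the smallest-count baseline.

    exec_times maps the number of cores on one shared bus to the measured
    execution time. Scanning stops at the first core count whose time exceeds
    the baseline by more than the given relative tolerance.
    """
    if not exec_times:
        raise ValueError("at least one measurement is required")
    core_counts = sorted(exec_times)
    baseline = exec_times[core_counts[0]]
    best = core_counts[0]
    for cores in core_counts:
        if exec_times[cores] <= baseline * (1.0 + tolerance):
            best = cores
        else:
            break  # contention has started to stretch execution time
    return best


# Hypothetical normalized execution times on a single shared bus.
sample = {1: 1.00, 2: 1.00, 3: 1.01, 4: 1.01, 5: 1.08, 6: 1.20}
print(max_cores_in_additive_regime(sample))  # cores per shared segment
```

The returned core count suggests how many processors can share one communication resource before contention appears, which in turn bounds the number of crossbar branches needed.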
With up to 4 cores connected to the same shared communication resource, the latter is able to work in an additive regime. Although the architecture cannot work in real time at more than 4000 Hz, we also measured the execution time under non-real-time computation. From the execution time needed to process 3.5 sec. of input data, we can derive the amount of buffering required to store incoming data from the leads. Knowing the overall capacity of the off-chip memory, we then derive the maximum analysis time that we can afford at such high frequencies. Results are shown in Figure 10.
Let us focus on the shared bus case at 10 kHz. The execution time for analyzing 3.5 sec. of data is a little less than 1 min. (around 57 sec.), which is about 16 times the real-time constraint. As a consequence, we can still decide to perform this kind of analysis, but we need to buffer 16 input data chunks while we are processing one chunk. Since an off-chip SDRAM memory can be 512 MB, we can perform 3.5 min. of analysis before saturating the memory. With a full crossbar, this time amounts to around 14 min.
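The first step of this buffering argument is plain arithmetic, sketched below with the shared bus figures quoted above (about 57 sec. to process a 3.5 sec. chunk). The function names are hypothetical. The sketch covers only the slowdown factor and the number of chunks to buffer, not the memory-capacity step, which depends on the input data rate.

```python
import math


def slowdown_factor(exec_time_s: float, chunk_span_s: float) -> float:
    """How many times longer processing takes than acquiring the same chunk."""
    if chunk_span_s <= 0:
        raise ValueError("chunk span must be positive")
    return exec_time_s / chunk_span_s


def chunks_to_buffer(exec_time_s: float, chunk_span_s: float) -> int:
    """Complete input chunks that arrive while one chunk is being processed."""
    return math.floor(slowdown_factor(exec_time_s, chunk_span_s))


# Shared bus at 10 kHz: about 57 sec. to analyze a 3.5 sec. chunk.
print(f"slowdown: {slowdown_factor(57.0, 3.5):.1f}x")
print(f"chunks to buffer per processed chunk: {chunks_to_buffer(57.0, 3.5)}")
```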
SOLUTIONS COMPARISON
We compare our platform to existing solutions from both an application point of view and an SoC point of view. From the SoC viewpoint, the advantages of our design lie mainly in real-time analysis performance and power consumption: our experiments show that we can converge to a full analysis in less time than is needed to read the incoming data from the heart. Further results of our comparison are shown in Table I. From the application side, we are interested in the comparison with contemporary applications used on SoCs (Table I), having already compared our application (in Section 5) with the presently used Pan-Tompkins algorithm.
Comparing our application-based MPSoC designs, we can choose the best architecture for a given biomedical purpose. Hence, for a solution that competes with and performs better than existing commercial solutions [Freescale 2007] (input sampling frequency from 250 to 1000 Hz), we adopt our shared bus system architecture (Figure 4), since its advantages over existing solutions are as follows.
- The available ECG SoC solution most similar to ours is the one presented in Chang et al. [2004], but our design performs better in terms of analysis time. Our solution is also easier to deploy and less expensive, since the 12 leads are input to one SoC instead of requiring 12 SoCs (i.e., more SoCs as the number of leads increases).
- In our designs, we can perform full 12-lead ECG analysis at relatively high frequencies. Our design is especially advantageous in that it offers a choice of SoC architecture (shared bus, crossbar, or partial crossbar) based on the frequency range required by the biomedical need.
Table I shows the advantage of our three designs over the best available designs we are aware of in research and in the market.
[Table I entry recovered from the original layout: heart-period discovery; P, Q, R, S, T peak detection; allows disease detection]
The comparison is specifically between an ECG analysis SoC research solution [Chang et al. 2004], a commercially available SoC solution [Freescale 2007], and our three MPSoC designs: ST-PCB is the partial crossbar, ST-CB is the full crossbar, and ST-SB is the shared bus solution; Fs is the sampling frequency.
CONCLUSIONS
We present an application-specific MPSoC architecture for real-time ECG analysis as a solution to a major medical problem (CVD and stroke), which causes the highest number of deaths each year. Our solution leverages the computational power of many (up to 12) concurrent DSP cores to process ECG data. This solution paves the way for novel healthcare delivery scenarios (e.g., mobility) and for accurate diagnosis of heart-related diseases. We describe the design methodology for the MPSoC and explore the configuration space, looking for the most effective solution in terms of both performance and energy. We present three interconnect architectures (single bus, full crossbar, and partial crossbar) and compare them with existing solutions. The sampling frequencies of 2200 Hz and 4000 Hz with 12 DSPs are found to be the critical points for our shared bus design and crossbar architecture, respectively. We compare our solution with existing solutions in research, in the market, and in many hospitals. Our solutions offer significant advances in SoC HW/SW codesign and showed no failures when run and tested on the standard MIT-BIH arrhythmia database, whereas the widely used Pan-Tompkins solution failed in some cases, two of which we report.
