Abstract. In this study, we explore sequential and parallel process- The results show that with respect to the single-core architecture, the multi-core solution consumes 62% less power for high computation requirements (167 MOps/s), while consuming 46% more power for extremely low computation needs when the power consumption is dominated by leakage. Additionally, we show that our proposed ULP processing core, using a simplified instruction set architecture (ISA), achieves energy savings of 54% compared to a reference microcontroller ISA (PIC24).
Introduction and Related Work
According to the World Health Organization, cardiovascular and modern human behavior-related diseases are the major cause of mortality worldwide [1]. Close and potentially continuous medical supervision is strongly needed to control these types of diseases. They are thus expected to soon incur healthcare costs and medical management needs that are unsustainable for traditional healthcare delivery systems. Personal health monitoring systems are poised to offer large-scale and cost-effective solutions to this problem. Wireless body sensor networks (WBSNs) are the enabling technology for such personal health systems [2], [3]. A WBSN for health monitoring consists of a number of light-weight sensor nodes attached to the human body, where each node is responsible for processing a specific low-rate physiological signal. For instance, one of the most important physiological signals is the electrocardiogram (ECG), which is typically acquired at sampling rates between 125 Hz and 1 kHz to capture the often important details of the waveform. In order to monitor the heart rate for extended periods of time (up to multiple days or weeks), an ultra-low-power (ULP) design with embedded biomedical signal processing for feature extraction on the sensor node is necessary [4] to reduce the costly signal storage or transmission [5] to the essence.
An effective technique to reduce the computational power consumption is supply voltage scaling, potentially all the way to sub-threshold operation. In the literature, voltage scaling and its limitations and disadvantages, such as performance loss, the risk of functional failure, and performance variability, have been analyzed extensively [6], [7], [8], [9], and various low-power architectures have been presented. For example, Chen et al. [10] proposed a sensor platform capable of nearly perpetual operation by harvesting energy from solar cells. The proposed single-processor architecture has an ARM Cortex M3 core with both retentive and non-retentive SRAM and a power management unit which controls the active and ultra-low-power sleep modes. In another work, Hanson et al. [11] presented a new ultra-low-energy processor with low-voltage operation for wireless monitoring systems. They optimized the standby power consumption of the processor with the help of new low-leakage memory macros, memory size and instruction set adjustments, and power gating. However, the main issue with low-voltage operation is the performance loss, which, for a given processing requirement, can limit the degree to which voltage scaling can be used. Parallel computing using multiple cores can alleviate this issue, provided that the algorithms to be executed can be parallelized. To this end, Dreslinski et al. [12] proposed a near-threshold computing (NTC), cluster-based multi-processor architecture with a shared cache that operates at a higher supply voltage to be able to serve multiple cores at the same time. Finally, Yu et al. [13] introduced a sub/near-threshold co-processor for low-energy mobile image processing using architecture-level parallelism to compensate for the performance loss.
Unfortunately, even though researchers have focused on low-energy solutions in both multi-core and single-core approaches individually, the two approaches have not been compared in terms of energy efficiency for the moderate workloads that are typical for biomedical applications. Thus, in this paper we propose as a main contribution a single-core and a multi-core architecture for embedded biomedical signal processing on WBSNs, where algorithms have a limited, yet, at near-threshold voltage, non-negligible complexity and where a significant part of the processing can be done in parallel. We also propose a ULP custom processing core with a minimal instruction set, which achieves up to 54% energy saving with respect to a well-established instruction set architecture (ISA), namely the PIC24 ISA [14]. This custom core is used in the single- and multi-core architectures as the processing element. We explore the power/performance trade-offs between the single- and multi-core architectures for different online biomedical signal processing applications while exploiting near-threshold computing. Our results show that the multi-core approach achieves 62% power saving with respect to the single-core approach for high biosignal computation workloads (i.e., 167 MOps/s); however, it consumes up to 46% more power than the single-core approach for relatively lighter workloads when the power consumption is dominated by leakage.
The rest of this paper is organized as follows. First, Section 2 analyzes the biomedical processing features and introduces several reference benchmarks. Next, Section 3 describes our ULP processing core as well as the single-core and multi-core processor architectures. Then, Section 4 gives the experimental setup and results. Finally, we summarize the main conclusions of this work in Section 5.
Biopotentials Processing Features
Signal processing on wearable personal health monitoring systems consists mostly of arithmetic computations of relatively low complexity on single- or multi-input biological signals. Hence, it has recently been shown that these computations can be optimized to run in real time on typical embedded low-power microcontrollers [15], [16].
For instance, Rincon et al. [16] showed how delineation of multi-lead ECG signals, using a complex multi-scale wavelet transform algorithm, can be realized on a commercially available personal health monitoring system node with limited computation capability. In fact, multi-lead biological signal analysis is often needed to obtain an accurate view of biological events. Moreover, the analysis of these multi-lead signals offers considerable opportunities for parallel computation, which can be exploited on multi-core processing platforms.
In this work, we consider three different reference benchmarks: two different ECG signal compression applications and one ECG signal conditioning application. The first data compression application is based on compressed sensing (CS) theory [3], while the second one is a discrete wavelet transform (DWT)-based data compression algorithm. Our reference benchmarks have fundamental applications in WBSN systems for automated ECG analysis as well as data compression [3], [16]. All of our reference benchmarks are real-time multi-lead ECG processing applications that operate on 8 leads (a typical configuration) to make the system more accurate and resilient to noise artifacts. Moreover, all the benchmarks perform computations on a block of 512 samples of ECG data (sampled at 250 Hz) per lead. However, to investigate different processing requirements related to the application, we consider ECG sampling rates between 125 Hz and 1 kHz for capturing signals with quality levels from barely acceptable to excellent.
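To make the block parameters concrete, the following minimal sketch computes how much signal time one 512-sample block covers and how many samples arrive per second across 8 leads for the sampling rates mentioned above. The function names are illustrative and not part of the original work.

```python
BLOCK_SAMPLES = 512  # samples per lead and per block (from the text)
NUM_LEADS = 8        # number of ECG leads (from the text)


def block_duration_s(sampling_rate_hz: float) -> float:
    """Signal time covered by one block of one lead, in seconds."""
    return BLOCK_SAMPLES / sampling_rate_hz


def total_sample_rate(sampling_rate_hz: float, leads: int = NUM_LEADS) -> float:
    """Samples arriving per second, summed over all leads."""
    return sampling_rate_hz * leads


for rate_hz in (125, 250, 1000):
    print(rate_hz, block_duration_s(rate_hz), total_sample_rate(rate_hz))
```

At the 250 Hz reference rate, one block therefore spans roughly two seconds of signal per lead.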
The first reference benchmark in this work is an ECG processing application which comprises two components: compressed sensing (CS) and Huffman coding. CS [3] performs a 50% compression on a block of ECG data per lead, whereas the Huffman coding part further encodes the compressed data for wireless transmission. In CS-based data compression, only a few linear random measurements are acquired from the ECG signal. The algorithm implicitly relies on the sparse characteristics of the ECG signal to guarantee accurate reconstruction.
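The measurement step of CS can be sketched as a matrix-vector product y = Φx that maps a 512-sample block to 256 measurements (50% compression). The sketch below assumes a random ±1 sensing matrix and a synthetic input; the sensing matrix actually used in [3] may differ, and the Huffman coding stage is omitted.

```python
import numpy as np


def cs_measure(block, keep_fraction=0.5, seed=0):
    """Project a signal block onto random linear measurements.

    The random +/-1 sensing matrix is an assumption for illustration only.
    """
    x = np.asarray(block, dtype=float)
    n = x.shape[0]
    m = int(n * keep_fraction)                  # 512 -> 256 measurements
    rng = np.random.default_rng(seed)
    phi = rng.choice([-1.0, 1.0], size=(m, n))  # hypothetical sensing matrix
    return phi @ x, phi


# Synthetic stand-in for one ECG block (not real ECG data).
block = np.sin(2 * np.pi * np.arange(512) / 250.0)
measurements, sensing_matrix = cs_measure(block)
print(measurements.shape)
```

Because the operation is linear and data independent, the encoder needs only additions and subtractions per sample with this choice of matrix, which is one reason CS is attractive on the sensor node.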
The second reference benchmark, DWT-based data compression [3], performs a 50% compression on a block of ECG data per lead, similar to the CS-based data compression. As opposed to CS, DWT-based data compression explicitly exploits the sparsity of the ECG signal by computing its sparse expansion and adaptively encoding it with Huffman coding. The DWT-based data compression requires more complex computations than the CS-based data compression due to the discrete wavelet transform.
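As a rough illustration of exploiting sparsity explicitly, the sketch below applies a multi-level Haar wavelet transform and keeps only the largest half of the coefficients. The Haar wavelet, the number of levels, and the keep-largest rule are assumptions chosen for brevity; the wavelet and the adaptive encoding used in [3] are not reproduced here.

```python
import numpy as np


def haar_dwt(x, levels=3):
    """Orthonormal multi-level Haar DWT; the length must be divisible by 2**levels."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail coefficients
        approx = (even + odd) / np.sqrt(2.0)         # approximation coefficients
    # Layout: [coarsest approximation, coarsest detail, ..., finest detail]
    return np.concatenate([approx] + details[::-1])


def keep_largest(coeffs, fraction=0.5):
    """Zero all but the largest-magnitude `fraction` of the coefficients."""
    k = int(len(coeffs) * fraction)
    out = np.zeros_like(coeffs)
    if k > 0:
        idx = np.argsort(np.abs(coeffs))[-k:]
        out[idx] = coeffs[idx]
    return out


# Synthetic stand-in for one ECG block (not real ECG data).
signal = np.sin(2 * np.pi * np.arange(512) / 250.0)
coeffs = haar_dwt(signal)
compressed = keep_largest(coeffs)
print(len(coeffs), np.count_nonzero(compressed))
```

Unlike the CS sketch, the selection of coefficients depends on the data, which reflects the higher computational complexity noted above.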
The last reference benchmark in this work, ECG signal conditioning, is referred to herein as ECG2 and is based on the morphological filtering given in [17]. Raw ECG signals, even when recorded in a controlled environment, contain various types of noise and baseline drifts. ECG2 performs baseline correction and noise suppression on a block of ECG data per lead. The corresponding kernel has broad application in WBSN systems, especially in automated ECG analysis.
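A common way to perform baseline correction with morphological filtering is to estimate the baseline with an opening followed by a closing and to subtract it from the signal. The sketch below follows that general idea with flat structuring elements; the window lengths (about 0.2 s and 0.3 s) and the edge handling are assumptions, and the noise suppression stage of [17] is not shown.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _windows(x, width):
    """All length-`width` windows of x, using edge padding to keep the length."""
    left = width // 2
    padded = np.pad(x, (left, width - 1 - left), mode="edge")
    return sliding_window_view(padded, width)


def erode(x, width):
    return _windows(x, width).min(axis=1)


def dilate(x, width):
    return _windows(x, width).max(axis=1)


def remove_baseline(x, fs=250):
    """Subtract a baseline estimated by opening followed by closing."""
    x = np.asarray(x, dtype=float)
    w_open = int(0.2 * fs) | 1   # odd window lengths; values are assumptions
    w_close = int(0.3 * fs) | 1
    opened = dilate(erode(x, w_open), w_open)           # removes narrow peaks
    baseline = erode(dilate(opened, w_close), w_close)  # removes narrow pits
    return x - baseline


# Synthetic test signal: constant offset plus one narrow pulse (not real ECG).
raw = np.full(512, 2.0)
raw[100:105] += 1.0
corrected = remove_baseline(raw)
print(corrected[0], corrected[102])
```

Erosion and dilation need only comparisons, so a kernel of this kind maps well onto a minimal instruction set.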
Processing Platform Architectures

Processing Cores
We consider two different processing core architectures for the processing platforms: Firat and TamaRISC. Firat has a well-established and extensive instruction set architecture (ISA), which is a subset of the PIC24 ISA from Microchip [14]. TamaRISC is a custom-designed processor with a core architecture similar to that of Firat, especially regarding memory interfaces. The main differences are a minimal ISA and an overall reduction of processor complexity (true RISC). The following subsections explain both processing architectures in detail.
Each single-word instruction is divided into an 8-bit opcode, which specifies the instruction type (up to three different opcodes per operation), and one or more operands, which further specify the operation of the instruction. The instructions operate on either two or three operands. The core uses a dual-port data memory with 16-bit wide addresses, and can hence read one 16-bit data word and write another in the same cycle. The architecture additionally comprises a specific PIC24 feature that allows addressing of the register file inside the data memory address space, i.e., a mapping of all processor registers onto data memory addresses. Especially in the context of data bypassing (hazards), combined with address generation, this feature requires additional register ports.
The options for the instruction operands depend on the corresponding operation, and can be divided into 24 distinct operand mode groups. In general, the operand modes can be described as follows. Both operands of the two-operand
The ISA is a subset of the PIC24H/F ISA [14] and includes 66 instructions in total, which is still rather extensive and complex for a RISC-like ISA.
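As a small illustration of the instruction layout described above, the sketch below splits a 24-bit instruction word into an 8-bit opcode and the remaining 16 operand bits. Placing the opcode in the most significant byte is an assumption made for illustration; the exact field layout and the operand mode encoding are not specified here.

```python
INSTRUCTION_BITS = 24  # instruction word width of the instruction memory
OPCODE_BITS = 8        # opcode width (from the text)
OPERAND_BITS = INSTRUCTION_BITS - OPCODE_BITS


def split_instruction(word: int):
    """Return (opcode, operand_bits), assuming the opcode sits in the top byte."""
    if not 0 <= word < (1 << INSTRUCTION_BITS):
        raise ValueError("instruction word must fit in 24 bits")
    opcode = word >> OPERAND_BITS
    operands = word & ((1 << OPERAND_BITS) - 1)
    return opcode, operands


opcode, operands = split_instruction(0xAB1234)
print(hex(opcode), hex(operands))
```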
Processing Platforms
The single-core and multi-core configurations include the same processing unit (PU) and a data memory (DM). However, the multi-core processing platform also involves a central data crossbar interconnect (D-Xbar), connecting the PUs with the shared DM.
Processing unit: A PU comprises a processing core and a 24-bit wide instruction memory (IM) for 4k instruction words (12 kBytes), which is sufficient for many typical biomedical applications on WBSNs such as delineation and compressed sensing data compression [3], [16]. One of the introduced processing core architectures, Firat or TamaRISC, is used in the PUs. [18]. The interconnect is intended to connect a number of processing cores (in our case 8 cores) to a multi-banked memory (i.e., 16 banks). The total memory access latency is one clock cycle. However, when multiple requests conflict on the same bank, a round-robin scheduler arbitrates access to ensure fairness, and a higher number of cycles is needed depending on the number of conflicting requests, with no latency in between.
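The arbitration behavior described above can be modeled in a few lines. The sketch below assumes one round-robin pointer per memory bank and one grant per bank and cycle; the class and function names are hypothetical, and the actual arbiter of the interconnect may be organized differently.

```python
class RoundRobinCrossbar:
    """Toy model of a crossbar between cores and a multi-banked memory."""

    def __init__(self, num_cores=8, num_banks=16):
        self.num_cores = num_cores
        # Per-bank pointer to the core with the highest priority (assumption).
        self.next_core = [0] * num_banks

    def arbitrate(self, requests):
        """requests maps core -> bank; returns the cores granted this cycle."""
        by_bank = {}
        for core, bank in requests.items():
            by_bank.setdefault(bank, []).append(core)
        granted = set()
        for bank, cores in by_bank.items():
            start = self.next_core[bank]
            # Grant the requester closest to the pointer in cyclic order.
            winner = min(cores, key=lambda c: (c - start) % self.num_cores)
            granted.add(winner)
            self.next_core[bank] = (winner + 1) % self.num_cores
        return granted


def cycles_to_serve(xbar, requests):
    """Number of cycles until every pending request has been granted."""
    pending = dict(requests)
    cycles = 0
    while pending:
        for core in xbar.arbitrate(pending):
            del pending[core]
        cycles += 1
    return cycles


xbar = RoundRobinCrossbar()
no_conflict = cycles_to_serve(xbar, {core: core for core in range(8)})
three_way = cycles_to_serve(xbar, {0: 5, 1: 5, 2: 5})
print(no_conflict, three_way)
```

In this model, requests to distinct banks complete in a single cycle, while k cores contending for the same bank are served in k consecutive cycles, which matches the stall behavior described in the text.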
Single-core Processor Architecture
The single-core processor architecture is shown in Fig. 2(a). A simple selection logic (SL) connects the single PU to the individual memory banks and multiplexes the data. The system processes the 8-lead ECG signals sequentially.
Multi-core Processor Architecture
The multi-core processor design, shown in Fig. 2(b) 
Experimental Results
Our experiments comprise two main phases: 1) analysis of the introduced processing cores in terms of energy efficiency for biomedical signal processing and 2) analysis of the architectures with respect to the ECG sampling rates corresponding to our application requirements. All the designs (two single-core designs and one multi-core design) are implemented in a 90 nm low-leakage process technology, trading peak performance for significant leakage power reduction, especially in the memories.
Power Characterization Framework
The evaluation and implementation flow for the architectures is shown in Fig. 3. As shown in Fig. 4(a) and Fig. 4(b), 6 ns and 10 ns clock constraints provide an energy-efficient design point for the single- and multi-core designs, respectively. The single-core design optimized with a 6 ns clock constraint consumes less power than the other single-core designs for workloads higher than 11 MOps/s, and consumes only slightly more power for workloads lighter than 11 MOps/s. Similarly, the multi-core design optimized with a 10 ns clock constraint consumes less power than the other multi-core designs for workloads higher than 22 MOps/s, and consumes only slightly more power for workloads lighter than 22 MOps/s. To obtain the respective minimum-power solutions for the performance range of interest, we optimized the single-core and multi-core designs with clock constraints of 6 ns and 10 ns, respectively.
The occupied silicon area of the single- and multi-core designs is given in Table 2. As expected, the total area of the PUs in the multi-core design is almost 8 times the area of the PU in the single-core design. However, the total area of the multi-core design is only 1.72 times the total area of the single-core design, since the DM is responsible for most of this area.
cycles and 637 cycles on average per sample for CS, DWT and ECG2 benchmarks, respectively. When accounting for the 8-times parallel processing, these correspond to a penalty between 6.6% and 11.8% in terms of execution time due to stall cycles, compared to the corresponding number of cycles required for a single lead in the single-core architecture. We always adjust the clock frequency of the single- and multi-core designs to match the processing requirement. In particular, Table 3 shows the distribution of the power consump-
The total power consumption of the processing cores in the multi-core design is less than the corresponding one in the single-core design, since the single-core design is optimized for a higher clock frequency, as explained in Section 4.3.1. As seen from Table 3, the overhead of the D-Xbar in terms of power consumption is insignificant, only 4.3% of the power of the entire multi-core design. At the nominal voltage, the multi-core design consumes 27.9 µW of leakage power, whereas the leakage power consumption of the single-core design is 14.9 µW. This difference is mainly due to the individual instruction memory banks for each PU.
and 167 MOps/s, the multi-core design is more energy efficient than the single-core design, because the multi-core design can meet the workload requirements at a lower operating voltage compared to the single-core design (cf. Fig. 5). In particular, to meet a high workload requirement (167 MOps/s), the single-core design operates at 1.2 V and consumes 15.8 mW, whereas the multi-core design operates at 0.75 V and consumes only 6.0 mW.
Thus, the multi-core solution consumes 62% less power than the single-core design. Even though both designs are supplied around the transistor threshold voltage level (cf. Fig. 5) at 2.2 MOps/s, the multi-core design still consumes less power than the single-core design due to its lower dynamic power consumption (cf. Table 3). However, if the required workload is light (lower than 1.7 MOps/s), the single-core design consumes less power than the multi-core design because of its lower leakage power consumption. The power saving of the single-core design with respect to the multi-core design reaches its maximum of 46% when the designs only leak.
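The quoted percentages can be checked against the reported power figures with a short calculation. The sketch below uses only numbers stated in the text. Applying the nominal-voltage leakage figures to the leakage-only case, and using a quadratic voltage dependence as a rough model for dynamic power, are assumptions made for illustration rather than the authors' evaluation method.

```python
def saving_percent(reference: float, candidate: float) -> float:
    """Power saved by `candidate` relative to `reference`, in percent."""
    return 100.0 * (reference - candidate) / reference


# High workload (167 MOps/s): single-core 15.8 mW at 1.2 V, multi-core 6.0 mW at 0.75 V.
high_workload_saving = saving_percent(15.8, 6.0)

# Leakage only: multi-core 27.9 uW, single-core 14.9 uW (nominal-voltage figures).
leakage_saving = saving_percent(27.9, 14.9)

# Rough model (assumption): dynamic power at a fixed total throughput scales with V^2.
voltage_ratio_squared = (0.75 / 1.2) ** 2

print(round(high_workload_saving, 1), round(leakage_saving, 1), voltage_ratio_squared)
```

The first value reproduces the reported 62% saving, and the second lies close to the reported 46%. The squared voltage ratio of about 0.39 is of the same order as the measured power ratio of 6.0 mW to 15.8 mW, which is consistent with the lower supply voltage being the main source of the saving.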
Conclusion
Health monitoring systems require energy-efficient processing platforms due to their limited power resources as well as long operational times. Online biomedical signal processing on such systems involves relatively low-complexity and highly parallel computations on low-rate physiological data, which creates the opportunity for low-voltage operation as well as parallel processing. In this paper, to address the energy efficiency and data throughput requirements for biomedical signal processing on health monitoring systems, we have proposed: 1) an ultra-low-power processing core with a minimal instruction set; 2) a single-core processor architecture; and 3) a multi-core processor architecture with a data memory shared through a crossbar interconnect. Our results have shown that an instruction set architecture with only the few required instructions leads to a significant energy saving for biomedical signal analysis compared to a well-established, extensive instruction set architecture (in our case up to 54% energy saving compared to the PIC24 instruction set architecture). We have also explored the power vs. performance trade-offs between the single- and multi-core architectures, including near-threshold voltage computing, for different target workloads of online biomedical signal analysis. Our results have shown that the multi-core approach consumes 62% less power than the single-core approach for high biosignal computation workloads (i.e., 167 MOps/s). Moreover, as opposed to the common belief that single-core approaches are more energy efficient than multi-core ones in the ultra-low-power domain, since required workloads are typically light and can be handled effectively in single-core architectures, we have shown that a multi-core architecture, with a data memory shared by an interconnect between cores and data memory, is a promising solution also in the ultra-low-power parallel processing domain (in our case it achieves higher energy efficiency compared to the single-core approach for workloads as light as 1.7 MOps/s).
This is because our multi-core solution requires neither memories with a large number of read/write ports (multi-port memories cause a significant memory area overhead, and thus high leakage dissipation) nor a higher clock frequency compared to the rest of the circuit (overclocking causes complexity and energy efficiency issues, such as the need for voltage level shifters and a complex clock tree scheme). However, a multi-core architecture is still penalized by its leakage power consumption at extremely light workloads, where the power consumption is dominated by leakage (in our case it consumes 46% more power than the single-core approach). To alleviate this issue, as future work, our proposed multi-core processing platform will include configurability, such as turning off processing cores and memories, to reduce the leakage overhead when workloads are extremely light. The issues with low-voltage operation, such as process variability, functional failure, and the temperature dependency of the operating point, which occur mainly at sub-threshold voltages, are not examined in this paper; these issues will also be a subject of our future work.
port under the project number PP002-119052. We finally acknowledge the support of the ENIAC under the project JTI-END.
