Abstract-In the last decade, improvements in technology scaling have enabled the design of a novel generation of wearable biosensing monitors. These smart Wireless Body Sensor Nodes (WBSNs) are able to acquire and process biological signals, such as electrocardiograms, for periods of time extending from hours to days. The energy required for the on-node digital signal processing (DSP) is a crucial limiting factor in the conception of these devices. To address this design challenge, we introduce a domain-specific ultra-low power (ULP) architecture dedicated to bio-signal processing. The platform features a light-weight strategy to support different operating modes and synchronization among cores. Our approach effectively reduces the power consumption, harnessing the intrinsic parallelism and the workload requirements characterizing the target domain. Operations at low voltage levels are supported by a heterogeneous memory subsystem comprising a standard-cell based ultra-low voltage reliable partition. Experimental results show that, when executing real-world bio-signal DSP applications, a state-of-the-art multi-core architecture can improve its energy efficiency by up to 50 percent by utilizing our proposed approach, outperforming traditional single-core alternatives.

INTRODUCTION
THE increasing social impact of chronic cardiovascular disorders presents a major challenge for healthcare provision [1]. In this context, wearable and miniaturized health monitoring systems, termed Wireless Body Sensor Nodes (WBSNs), offer a large-scale and cost-effective solution [2].
The latest WBSNs are able to perform complex on-node Digital Signal Processing (DSP) routines, such as Electrocardiogram (ECG) compression [3], automated feature extraction [4] and classification [5]. DSP applications embedded in such "smart" WBSNs greatly reduce the required transmission bandwidth, thus increasing the overall energy efficiency. In fact, in this scenario only the retrieved features, as opposed to the acquired samples, have to be sent over the power-hungry wireless link. This improvement, coupled with the advances in the design of low-voltage and low-rate Analog-to-Digital Converters (ADCs) [6], [7], has led to a change of the dominant contributor to the power consumption of these platforms.
As an illustrative example, Fig. 1 (left) reports the energy breakdown of a system performing an ECG feature extraction application (named 3L-MMD in [8]). The data shown assume that samples are acquired by the low-voltage ADC described in [7], that processing is performed by the architecture proposed in [8], and that a Bluetooth Low Energy protocol is used for wireless transmission [9]. The breakdown highlights that the energy bottleneck resides in the embedded DSP stage, which is often the case in the smart WBSN scenario. To maximize the overall efficiency of smart WBSNs, signal processing must therefore be supported within a tight power budget, while at the same time respecting real-time constraints. In this regard, many efforts have been made in recent years, proposing solutions ranging from ad-hoc accelerators [10], [11] to ultra-low-power (ULP) multi-core architectures [8], [12].
In this work, we propose a novel WBSN multi-core architecture for bio-signal processing, which leverages the energy-saving opportunities derived from real-world workloads in this domain. The platform embeds a low-overhead strategy to synchronize computing elements, and allows different execution modes operating at different voltage supplies. It allows the efficient management of Single Instruction-Multiple Data (SIMD) execution and producer/consumer relationships among processors. Moreover, it supports an ultra-low voltage sensing state in-between computation phases.
Our work is motivated by the limits of conventional Dynamic Voltage Frequency Scaling (DVFS), especially when applied to the memory subsystem. In fact, the failure probability of the conventional 6-Transistor (6T) SRAM cells increases considerably as the supply voltage is reduced [13]. This situation results in 6T-SRAM memories being the limiting factor for aggressive voltage scaling. At the same time, other low-voltage memory implementations such as Standard-Cell Memories (SCM) lead to substantial area overheads, as outlined in [14], due to the relatively large storage requirements of biomedical DSP applications. The authors of [15] present a comprehensive comparison of SRAM and SCM implementations, showcasing how SCM can operate at ultra-low supply levels.
Stemming from these observations, we propose a hybrid memory scheme, combining dense 6T memories with SCMs, which present an extended reliable voltage range, but are less area-efficient. By adopting this scheme, the target architecture can efficiently support two different operating modes, namely sensing and processing. These two modes are characterized by different voltage levels and working frequencies.
In sensing mode, the system works in a low-voltage/low-power regime, where only a small memory region, mapped into SCM, can be written in order to store input samples. The vast majority of the memory cells, realized as 6T-SRAM, while not accessible at this low-voltage supply level, still reliably retain their content. In processing mode, the system operates at a higher voltage level, so that the whole memory and the computing elements are active and reliable. Our strategy therefore goes beyond DVFS, by trading off the voltage supply level with the memory portion which can be reliably accessed at a given voltage level. State-of-the-art works use multiple voltage islands to achieve low-voltage and low-power operations in the logic, while ensuring reliable access in the memory. Conversely, our approach requires a single voltage domain, avoiding the design overheads of multi-Vdd designs [16]. When processing is required on at least one core, the system is supplied at a high Vdd level, while when all cores are idle a lower voltage level is selected. By coupling standard 6T-SRAM and SCM regions, our approach enables reliable operations in the full voltage swing, without requiring complex mechanisms for error detection and/or correction. Moreover, since the architecture conserves its state (the values stored in memories and register files) when idle, it does not require memory transfers to and from an external storage across idle periods. Hence, it supports high-frequency switches between sensing and processing with a much smaller timing and energy overhead than an alternative based on power gating.
Our proposed architecture further improves its energy efficiency, when executing in the high-workload processing mode, by adopting fine-grained synchronization among cores, allowing for efficient producer-consumer notification and lock-step execution of parallel algorithms with data-dependent branches. The first mechanism avoids wasteful active waiting between algorithmic steps executed in different sets of cores, while the second one allows for SIMD-like execution of algorithms that are applied in parallel over different streams of data. This last feature maximizes the synchrony among cores, which is exploited by coalescing concurrent and identical memory requests, originating from different processors, into a single access. Such a strategy greatly reduces the power consumed by the instruction memory. The management of synchronization and the computing modes are concurrently implemented in an integrated hardware/software solution comprising a custom instruction set extension (ISE) and a dedicated hardware component for synchronization.
Fig. 1. Left: Energy consumption breakdown of a smart WBSN executing the three-input features extraction application described in [8]. Right: Scheme of a typical bio-signal DSP application (top) and its mapping on a ULP multi-core WBSN architecture (bottom).
Our approach is particularly beneficial for bio-signal DSP, where applications usually acquire multiple signals at rather low sampling rates (e.g., hundreds of Hz in the case of ECGs [3], [4], [5]). The majority of run-time execution is therefore spent on acquisition alone, as opposed to data processing. As illustrated in Fig. 1 (top-right), in most cases multiple input signals are independently processed, before combining them into a single stream for further analysis [4]. In this scenario, a multi-core architecture (such as the one depicted in Fig. 1, bottom-right) allows the DSP to be effectively distributed over multiple resources, connected to peripherals, instruction and data memory. Nonetheless, the presence of multiple cores mandates an efficient strategy to synchronize the execution of an application at run-time. The synchronization mechanism illustrated in this work allows SIMD phases to be executed efficiently in parallel by enforcing lock-step execution, while at the same time managing the producer-consumer relationships between the different phases and avoiding unwanted active waiting periods. The combined benefits of efficient parallel execution and operating modes are more than additive. By effectively distributing the workload over multiple computing units (CUs), it is in fact possible to reduce the ratio between processing and sensing time, giving ample opportunity for dynamic voltage scaling.
Summing up, our work presents a novel multi-core architecture, featuring a unified strategy to support both different voltage/accessibility modes and fine-grained synchronization. It employs a hybrid memory organization, dedicated synchronization instructions and a hardware synchronization unit (SU). The main contributions of this paper are therefore the following:
- We propose a ULP multi-core system for bio-signal processing, supporting ultra-low voltage operating modes by featuring a heterogeneous memory architecture.
- We detail a low-complexity synchronization technique, able to effectively manage operating modes, lock-step execution and producer-consumer relationships.
- By exploring different partitionings between the 6T and the SCM portions, we devise an optimal hybrid memory architecture, and we evaluate the performance and the power consumption of the resulting platform.
The rest of the paper is organized as follows. Section 2 acknowledges related efforts in the field, while Section 3 describes the target architecture, its hybrid memory subsystem and the proposed synchronization technique. Next, in Section 4 we detail the experimental setup and the results in terms of energy efficiency. Finally, the conclusions are presented in Section 5.
RELATED WORK
Power consumption is a first-class optimization goal in the design of digital architectures; as such, it is the focus of a vast body of research, as summarized in [17]. At the architectural level, the support of low-voltage operation modes is a widely used strategy to increase the energy efficiency of processors [18], because of its generality and flexibility. Nonetheless, voltage scaling limits the maximum operating frequency of systems, ultimately penalizing their performance.
To overcome this performance loss, processors can be enriched with application-specific custom instructions or accelerators [19] , that efficiently support the most frequent operations of a target domain. In the WBSN context, the authors of [10] , [11] , and [20] have indeed proposed systems employing dedicated filtering, signal compression and FFT engines. The presence of single-function hardware blocks can nonetheless lead to over-specialized architectures, resulting in a loss of flexibility that can only be partially palliated by adopting reconfigurable accelerators [21] .
A more generic approach, which we adopt in this paper, is to employ multiple and homogeneous processing units, able to support a target workload at a low clock frequency. This second strategy, popular in many domains such as multimedia [22], [23], is particularly effective in the WBSN scenario, where multiple signals are usually acquired in parallel and processed within a time window [10], [24].
Dynamic Voltage Frequency Scaling, carried out by adjusting the performance and the power consumption at run-time according to the workload [25], [26], is often used in conjunction with a multi-core strategy. For bio-signal analysis applications, such workload is dictated by the acquisition rate of signals, resulting in the presence of both high-activity and idle periods, which can be exploited by the adoption of deep sleep modes to increase the energy efficiency [27]. Nonetheless, the reliability of SRAMs decreases when operating at ultra-low voltages [2], posing a hard limit on the voltage range that can be safely employed. To overcome such a problem, the authors of [24] propose a system where different voltage domains are used for computing and storage resources. As opposed to our work, this choice mandates the use of voltage level shifters, which incur a non-negligible area overhead [16].
A striking alternative is offered by specialized SRAM implementations that, with cells larger than the standard six-transistor ones [13], [28], can reliably operate at extremely low Vdd. Similarly to [14], [29], and [30], herein we explore the benefits of a hybrid solution, which employs large and low-power SRAM cells only for a small portion of the memory subsystem. As opposed to [14], [30], our solution does not incur any error in the computations related to the 6T memory at low voltages. Moreover, compared to [29] and [30] and similarly to [14], [31], we employ only a single and tunable voltage domain for the entire system, resulting in a simpler and leaner implementation. Differently from our previous work [31], which proposed a hybrid memory design based on 6T and 8T cells, in this work we (i) extend the original design by replacing the 8T cells with SCM cells, which deliver better energy efficiency and increase the design flexibility by allowing the integration of smaller memory cuts without additional area overheads; and (ii) propose a novel framework for effective handling of the hybrid memory and transition phases, based on the combination of an additional HW synchronization component and a set of programming directives.
The run-time management of multiple resources under tight run-time and memory constraints is a challenging task. To this end, the authors of [32] introduced an approach based on software libraries, which nonetheless incurs substantial overheads due to busy waiting and system calls. Alternatives relying on hardware locks [33] are also resource-intensive, and thus not suited for low-power computing architectures such as WBSNs. As in our previous works [8], [12], we instead support run-time synchronization among different cores with dedicated instruction set extensions, presenting a small area and timing footprint. In [8] and [12], synchronization is only supported to manage parallelism. Herein, we generalize this methodology both to orchestrate parallel execution on multiple resources and to dynamically set the operating voltage. In this last respect, leveraging a domain-specific heterogeneous memory system, we go beyond classic DVFS, trading off the accessibility of resources (in addition to the operating frequency) with power consumption.
BIO-SIGNAL PROCESSING ARCHITECTURE
The proposed architecture employs a joint synchronization policy to transition between operating modes at different voltage levels, as well as to perform clock-gating of individual computing units. These two strategies target different time granularities: at a coarser granularity, an ultra-low voltage operating point of the entire system is adopted when only data buffering is required (i.e., during sensing phases), as dictated by the workload of the DSP application and the data acquisition rate. At a finer granularity and while in processing mode, clock-gating enables an efficient synchronization of cores executing code in lock-step or waiting for input data in a producer-consumer relationship, as detailed in Section 3.2.
Both energy-saving strategies are embedded in our multi-core platform, which is composed of an array of Computing Units interfaced to Instruction and Data Memories (IM and DM), as depicted in Fig. 2. IM and DM are divided into multiple banks, so that each bank can be accessed independently and power-gated if it is not required by the application. Each DM bank is itself composed of an area-efficient 6T region (6T-DM) and a highly-reliable SCM region (SC-DM). The communication between cores and memories is based on high-bandwidth logarithmic interconnects, implementing a mesh-of-trees topology and supporting single-cycle communication between cores and memory banks [34]. In the absence of bank conflicts, data routing is done in parallel for each core, thus enabling a high sustainable bandwidth for processor-memory communication. In case of multiple conflicting requests, to ensure fair access to memory banks, a round-robin scheduler arbitrates the access to different locations of the same bank by different processors. In addition, the interconnect merges simultaneous read requests to the same memory address, reducing the number of memory accesses and therefore increasing the energy efficiency of the system [8].
A Synchronization Unit is employed to (i) orchestrate the execution of the system, (ii) clock-gate individual cores and (iii) dynamically select the voltage supply level and, therefore, the operating mode. The SU pauses and resumes cores, either after data-dependent branches (to recover lock-step execution) or to manage producer-consumer relationships. Moreover, the SU also dictates the voltage supply of the platform. At the high-Vdd processing supply level, all computing and storage elements can be reliably employed. Conversely, when all the cores are idle waiting for a window of samples to be acquired, the low-Vdd sensing mode is enforced, in which only the SC-DM regions are reliably accessible, while the 6T-DM regions are state-retentive. In sensing mode, the analog-to-digital converter is in charge of periodically moving the data sampled by the analog front-end to the SC-DM region.
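The arbitration policy described above can be illustrated with a small behavioural model. The sketch below is only an assumption-laden illustration, not the interconnect of [34]: the function name `arbitrate`, the word-interleaved bank mapping (`addr % num_banks`) and the per-bank round-robin pointer are hypothetical choices made for the example. It resolves one cycle of read requests, merging reads to the same address into a single bank access and stalling cores that target a different address of a busy bank.

```python
def arbitrate(requests, num_banks, num_cores, rr_pointer):
    """Resolve one cycle of read requests (behavioural sketch).

    requests:   dict core_id -> word address requested this cycle
    rr_pointer: dict bank -> core id with highest priority (updated in place)
    Returns (granted_cores, stalled_cores, number_of_bank_accesses).
    """
    # Group requests per bank and per address (bank mapping is an assumption).
    by_bank = {}
    for core, addr in requests.items():
        by_bank.setdefault(addr % num_banks, {}).setdefault(addr, []).append(core)

    granted, stalled, accesses = set(), set(), 0
    for bank, by_addr in by_bank.items():
        start = rr_pointer.get(bank, 0)
        candidates = [c for cores in by_addr.values() for c in cores]
        # Round-robin: the requester closest to the pointer wins the bank.
        winner = min(candidates, key=lambda c: (c - start) % num_cores)
        winning_addr = requests[winner]
        # Reads to the winning address are merged into one access.
        granted.update(by_addr[winning_addr])
        stalled.update(c for c in candidates if requests[c] != winning_addr)
        accesses += 1
        rr_pointer[bank] = (winner + 1) % num_cores
    return granted, stalled, accesses


pointer = {}
# Three cores in lock-step fetch the same word: one access serves all.
merged = arbitrate({0: 64, 1: 64, 2: 64}, 16, 8, pointer)
# Cores 0 and 1 hit different words of bank 0: one of them is stalled.
pointer = {}
first = arbitrate({0: 0, 1: 16, 2: 1}, 16, 8, pointer)
second = arbitrate({0: 0, 1: 16}, 16, 8, pointer)
```

In the lock-step case a single access serves every core, which is the effect the text credits for the energy savings; in the conflicting case the priority rotates so that the core stalled in one cycle wins the next.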
In the following, we detail the implementation of the proposed strategy to support these energy-saving mechanisms, as well as the description of the synchronizer unit which gives the necessary hardware support.
Hybrid Memory Management Strategy
Considering typical sampling frequencies for biomedical signals (around a few hundred Hertz), the time needed to acquire a window of samples exceeds the time to perform the required computation. Therefore, the workload profile of WBSN applications presents periods of low activity, where only data collection is performed. In this sensing state, the only requirement for the architecture is to make available enough memory to locally store the data sampled by the ADC. As shown in Fig. 3, the only active elements during sensing are the ADC and the reliable SC-DM, where samples are stored for future analysis. In this mode, all the cores, the 6T-DM portion and the IM are inactive. Memory elements besides the SC-DM are not accessible, but their content is reliably retained.
Once the ADC has transferred the desired number of samples to the data memory, the system switches to execution mode (cf. Fig. 3), performing a burst of computation on the available data. This operating point is characterized by a high workload, with the required processing elements active and working on the sampled data. The execution mode requires reliable access to the IM and DM banks storing the binary code and application data.
To support this run-time behavior, we consider a hybrid data memory architecture, which overcomes the limitation imposed by classic 6T-SRAM when operating under aggressive voltage scaling. The memory bank structure combines 6T and SCM regions, extending the reliable operating range to low supply voltages. In the case of our target CMOS technology, the SCM portion of the DM is able to reliably operate down to 600 mV, while the 6T portion of the DM can be reliably accessed at a minimum level of 800 mV (see Section 4.3). Due to the utilization of a single voltage domain, our proposed strategy allows a low-overhead transition between sensing and processing modes, which is only dependent on the rise time of the voltage supply level. This solution extends the one presented in [31] by substituting the 8T reliable memory portion with an SCM design. This leads to a better energy efficiency, a wider operating range and the possibility of integrating small memory cuts, which would incur large overheads with a standard SRAM design.
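As a rough worked example of this workload profile, the following sketch computes the fraction of time spent processing for one acquisition window. The 20 MHz processing clock is the value reported in the experimental setup of this paper; the window length, the sampling rate and the cycle count are purely hypothetical numbers chosen for illustration, and the function name is not taken from the paper.

```python
def processing_duty_cycle(window_samples, sampling_hz, cycles_per_window, clock_hz):
    """Fraction of each acquisition window spent in processing mode."""
    acquisition_s = window_samples / sampling_hz   # time to fill the window
    processing_s = cycles_per_window / clock_hz    # time to process it
    return processing_s / acquisition_s


# Hypothetical workload: 512 samples at 250 Hz, 2 million cycles per window.
duty = processing_duty_cycle(window_samples=512, sampling_hz=250.0,
                             cycles_per_window=2_000_000, clock_hz=20e6)
sensing_fraction = 1.0 - duty
```

With these assumed numbers the window takes about 2 s to acquire and 0.1 s to process, so the platform could remain in the sensing state for roughly 95 percent of the time.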
Synchronization Strategy
To determine the dynamic voltage supply level of the platform, as well as to properly clock-gate individual cores, we propose a hybrid hardware/software (HW/SW) synchronization mechanism. The approach extends the one described in [8] by also including the management of multiple operation modes. Its hardware support is provided by the above-mentioned SU, which orchestrates the execution of the multi-core system based on the interrupts received from the ADC and the synchronization instructions issued by the cores. Software support consists of a set of dedicated instructions (SINC, SDEC and SNOP), which modify a number of reserved locations (synchronization points) in the data memory. Synchronization points, implemented as single data words, store the information regarding (i) which cores have started and ended the execution of a data-dependent branch, and (ii) which consumer cores are clock-gated while waiting for data from producer cores or from the ADC. One synchronization point is therefore required for each data-dependent branch and each producer/consumer relationship. Each of these words consists of a 1-bit flag per core, which indicates whether the core is registered for the corresponding event, and a core counter, which keeps track of how many registered cores have not yet reached the end of the event. In addition, a SLEEP instruction requests the synchronizer to clock-gate the issuing core until the next synchronization event happens (e.g., new data to process is available).
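To make the structure of a synchronization point concrete, the sketch below packs the per-core flags and the core counter into one data word. The bit layout is an assumption made for illustration only (flags in the low bits, counter above them); the paper states which fields exist but not how they are arranged. With eight cores, the assumed layout fits in a 16-bit word, the data width reported for the cores in the experimental setup.

```python
NUM_CORES = 8                       # eight cores, as in the evaluated system
FLAG_MASK = (1 << NUM_CORES) - 1    # assumed: one flag bit per core, bits 0..7
COUNTER_SHIFT = NUM_CORES           # assumed: counter stored above the flags


def pack(flags, counter):
    """Build a synchronization-point word from its two fields."""
    assert 0 <= flags <= FLAG_MASK and 0 <= counter <= NUM_CORES
    return (counter << COUNTER_SHIFT) | flags


def unpack(word):
    """Return (flags, counter) stored in a synchronization-point word."""
    return word & FLAG_MASK, word >> COUNTER_SHIFT


def registered_cores(word):
    """List the cores whose flag is set."""
    flags, _ = unpack(word)
    return [core for core in range(NUM_CORES) if (flags >> core) & 1]


# Cores 0 and 2 are registered and both are still inside the event.
word = pack(0b00000101, 2)
```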
Code
      conditional_code_C()
    }
    SDEC(<synch_point_A>)
    SLEEP();
    ... continue in lock-step
  }
Software Adaptations and Mapping
To enforce lock-step execution after data-dependent blocks of code, each core executes a SINC instruction before conditional branches, to notify the synchronizer about a possible desynchronization. When the core finishes executing the branch, it issues an SDEC and enables clock-gating with a SLEEP instruction. After all cores that diverged finish executing the conditional section, the synchronizer wakes them up to resume their execution in lock-step. A simple example showcasing a typical desynchronization due to a data-dependent branch is presented in Code Excerpt 1. A graphic representation and a time diagram of this run-time behavior are also depicted in Fig. 4a. Producer-consumer relationships require the consumer cores waiting for data to execute a SNOP instruction, registering themselves in the corresponding synchronization point. Afterwards, such cores request to be clock-gated by issuing a SLEEP instruction, thus avoiding active waiting. Producers, instead, use SINC to register in the synchronization point when starting to compute data for the consumer cores, and SDEC when data is ready. The synchronizer detects when all the necessary input data from the producers is available (i.e., all the producers have issued the SDEC instruction), and resumes execution of all the registered cores. The pseudo-code presented in Code Excerpts 2 and 3 showcases a generic example of a producer-consumer relationship, which is also represented in the diagram depicted in Fig. 4b.
To map an application to the proposed platform starting from an equivalent single-core implementation, the code (written in C) must be partitioned into phases that can be executed in parallel in a pipelined manner. Phases should correspond to a non-negligible workload, but a finely-tuned load balancing is not required. Subsequently, the custom synchronization instructions are properly placed to manage data-dependent branches and producer/consumer relationships. While these two steps must be manually performed by the application programmer at present, they can be automated. Each phase is then assigned to a number of computing units corresponding to the number of parallel computing streams within the phase (e.g., three CUs are assigned for the "conditioning" phase in Fig. 1), and the IM and DM contents referring to different phases are mapped to different banks of IM and DM in order to reduce access conflicts. The assignment of computing units and memory banks is performed semi-automatically through linking directives.
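The two usage patterns described above can be traced with a small behavioural model of one synchronization point. This is a sketch under stated assumptions rather than the hardware implementation: the class `SyncPoint`, the `armed` marker (release only after at least one SINC has been seen) and the rule that SLEEP gates a core only while it is still registered are hypothetical modelling choices, introduced so that the last core to finish is not gated after the point has already been released.

```python
class SyncPoint:
    """Behavioural model of one synchronization point (assumed semantics)."""

    def __init__(self):
        self.flags = set()    # cores registered for this event
        self.counter = 0      # registered cores that have not finished yet
        self.armed = False    # assumption: a SINC was seen since last release
        self.gated = set()    # cores currently clock-gated on this point

    def sinc(self, core):
        # Register the core and count it as "still running".
        self.flags.add(core)
        self.counter += 1
        self.armed = True

    def snop(self, core):
        # Register the core without touching the counter (consumer side).
        self.flags.add(core)

    def sleep(self, core):
        # Assumption: gating only applies while the core is registered.
        if core in self.flags:
            self.gated.add(core)

    def sdec(self, core):
        # Decrement the counter; on zero, wake every gated registered core.
        self.counter -= 1
        if self.armed and self.counter == 0:
            woken = set(self.gated)
            self.flags.clear()
            self.gated.clear()
            self.armed = False
            return woken
        return set()


# Lock-step recovery: three cores diverge, core 2 is the slowest.
branch = SyncPoint()
for core in (0, 1, 2):
    branch.sinc(core)
woken_lockstep = set()
for core in (1, 0, 2):
    woken_lockstep |= branch.sdec(core)
    branch.sleep(core)

# Producer-consumer: core 3 waits for data produced by cores 0 and 1.
handoff = SyncPoint()
handoff.snop(3)
handoff.sleep(3)
handoff.sinc(0)
handoff.sinc(1)
woken_early = handoff.sdec(0)
woken_consumers = handoff.sdec(1)
```

In the first trace cores 0 and 1 are woken when core 2 issues the last SDEC; in the second, the consumer stays gated until both producers have signalled that their data is ready.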
Hardware Support: Synchronization Unit
The aforementioned Synchronization Unit is interfaced between the read-write ports of the cores and the interconnect networks, to monitor the state of each computing unit and orchestrate their execution. In addition, this module is also connected to the ADC interrupt line and to the stall, sleep and wake-up pins of each of the processors. The SU is composed of a sequential and a combinational part, which are detailed in the following and whose behaviours are depicted at a high level of abstraction in the flowcharts in Fig. 5.
On one hand, the sequential (clocked) logic is responsible for controlling the transitions between sensing and processing modes (cf. Fig. 5a ). A lack of activity while in processing mode (i.e., when all cores have issued a SLEEP instruction as showcased in the producer-consumer relationship of Fig. 4b ) triggers a transition towards the lowpower state. This condition is detected by the synchronizer, which lowers the clock frequency and the voltage supply to the low-Vdd level, setting the system to sensing mode. When a new window of data becomes available, the ADC makes the system transit to the processing mode. In such a case, the synchronizer raises the platform voltage, waits for a stabilization period, increases the clock frequency and wakes up the corresponding cores.
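The mode transitions handled by the sequential logic can be summarized as a two-state machine. The sketch below follows the order of actions given in the text (clock lowered before the supply, supply raised and stabilized before the clock); the class and method names and the textual action log are hypothetical and only serve the illustration.

```python
from enum import Enum


class Mode(Enum):
    SENSING = "sensing"
    PROCESSING = "processing"


class ModeController:
    """Sketch of the sensing/processing transitions (names are assumptions)."""

    def __init__(self, num_cores):
        self.mode = Mode.PROCESSING
        self.sleeping = [False] * num_cores
        self.log = []

    def core_sleep(self, core):
        # A core issued SLEEP; enter sensing mode once every core is idle.
        self.sleeping[core] = True
        if self.mode is Mode.PROCESSING and all(self.sleeping):
            self.log += ["lower clock", "lower Vdd"]
            self.mode = Mode.SENSING

    def adc_window_ready(self, cores_to_wake):
        # The ADC signals that a new window of samples is available.
        if self.mode is Mode.SENSING:
            self.log += ["raise Vdd", "wait for stabilization", "raise clock"]
            self.mode = Mode.PROCESSING
        for core in cores_to_wake:
            self.sleeping[core] = False
        self.log.append(f"wake cores {sorted(cores_to_wake)}")


ctrl = ModeController(num_cores=3)
ctrl.core_sleep(0)
ctrl.core_sleep(1)
mode_with_one_core_active = ctrl.mode
ctrl.core_sleep(2)
mode_all_idle = ctrl.mode
ctrl.adc_window_ready([0])
```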
On the other hand, the combinational circuitry coordinates the execution among cores while in processing mode (cf. Fig. 5b). First, explicit stalls due to memory conflicts coming from the interconnect are handled and forwarded to the corresponding cores. Second, lock-step execution is ensured among all those cores issuing the same instruction during the same clock cycle by stalling all of them if one is explicitly stalled due to a memory conflict. Third, when synchronization instructions are issued, the value to be written into the corresponding synchronization point is derived by setting the necessary flags, modifying the core counter and merging into a single write request the results of possibly concurrent manipulations of the same point. Moreover, the synchronizer is also in charge of waking up the registered cores when the core counter to be stored reaches zero, as described in Section 3.2.
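The merging of concurrent manipulations into a single write can be sketched as a pure function of the current point value and the instructions issued in one cycle. The function name, the tuple-based request format and the rule that a wake-up is raised only in a cycle containing at least one SDEC are assumptions made for this illustration.

```python
def merge_sync_requests(flags, counter, requests):
    """Combine same-cycle instructions on one point into one new value.

    flags:    bit mask with one bit per registered core
    counter:  number of registered cores that have not finished
    requests: list of (core, op) with op in {"SINC", "SDEC", "SNOP"}
    Returns (new_flags, new_counter, wake_up).
    """
    for core, op in requests:
        if op in ("SINC", "SNOP"):
            flags |= 1 << core        # register the issuing core
        if op == "SINC":
            counter += 1
        elif op == "SDEC":
            counter -= 1
    # Wake registered cores when the counter to be stored reaches zero.
    wake_up = counter == 0 and any(op == "SDEC" for _, op in requests)
    return flags, counter, wake_up


# Three cores enter a branch in the same cycle: a single merged write.
state = merge_sync_requests(0, 0, [(0, "SINC"), (1, "SINC"), (2, "SINC")])
# Two of them finish together, the third one later.
partial = merge_sync_requests(state[0], state[1], [(0, "SDEC"), (1, "SDEC")])
final = merge_sync_requests(partial[0], partial[1], [(2, "SDEC")])
```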
EXPERIMENTAL RESULTS
In this section we first present the chosen set-up and simulation framework. Then, we make a detailed exploration to choose an optimal balance between the SCM and the 6T memory regions composing the data memory sub-system. Finally, we comparatively evaluate the energy efficiency of the studied architecture featuring the proposed techniques. 
Setup and Evaluation Framework
We consider a target system composed of eight ULP TamaRISC cores [12], featuring a three-stage pipeline, 16-bit data width and 24-bit instructions. The energy consumption of this core is comparable to that of commercial low-power processors such as the ARM Cortex-M0 [35]. Nonetheless, different cores can be embedded in the proposed system, as long as they allow extensions to incorporate the custom synchronization instructions described in Section 3.2.
Cores are interfaced with a 96 KByte Instruction Memory (32 KWords of 24 bits width) divided into eight banks and a 64 KByte Data Memory (32 KWords of 16 bits width) divided into 16 banks. Each DM bank presents a reliable region of SCM cells and an area-efficient one implemented as 6T cells.
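The memory sizes quoted above follow directly from the word counts and widths, as the short check below shows. It assumes that 1 KWord is 1024 words, that 1 KByte is 1024 bytes and that banks are of equal size; the per-bank figures are derived under that assumption and are not stated in the text.

```python
KWORD = 1024


def size_kbytes(words, bits_per_word):
    """Memory capacity in KBytes (1 KByte = 1024 bytes assumed)."""
    return words * bits_per_word / 8 / 1024


im_kbytes = size_kbytes(32 * KWORD, 24)   # instruction memory
dm_kbytes = size_kbytes(32 * KWORD, 16)   # data memory

# Assuming equally sized banks:
im_bank_words = 32 * KWORD // 8           # eight IM banks
dm_bank_words = 32 * KWORD // 16          # sixteen DM banks
dm_bank_kbytes = size_kbytes(dm_bank_words, 16)
```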
The system clock frequency in the processing mode is 20 MHz (the maximum possible in the chosen technology for the target system with an 800 mV supply voltage), while in sensing mode the clock is set at 10 kHz to lower the dynamic power. The resources that are not required to be active by the application, such as unnecessary cores and IM banks, can be powered down at boot time.
Similarly to [8], the developed experimental framework combines detailed post-layout characterization of the system with faster cycle-accurate simulations of complex bio-signal analysis applications. Processors and their C compiler are designed using ASIP Designer from Synopsys. The resulting RTL and SystemC implementations are then embedded as components of the multi-core virtual platform. The lower-level RTL description is used to characterize each of the employed architectural blocks at a 40 nm technology node through an EDA toolchain: Design Compiler from Synopsys and Encounter from Cadence are used in the synthesis and place-and-route steps, while Modelsim from Mentor Graphics is employed to retrieve the switching activity of the platform when executing synthetic benchmarks. In a second step, the obtained energy values are imported into the higher-level SystemC platform simulator, allowing the evaluation of energy consumption when executing complete real-world applications under different architectural configurations.
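To give a sense of scale for the clock and voltage reduction, the sketch below applies the textbook first-order model of dynamic power, proportional to C·V²·f. It is only an estimate under explicit assumptions: the sensing supply is taken as 600 mV (the minimum reliable SCM voltage quoted in Section 3.1), the switched capacitance is assumed unchanged between modes, and leakage is ignored, so the result is not a figure from the paper.

```python
def dynamic_power_ratio(v_low, f_low, v_high, f_high):
    """Ratio of dynamic power between two operating points (P ~ C * V^2 * f)."""
    return (v_low / v_high) ** 2 * (f_low / f_high)


# Sensing (assumed 0.6 V, 10 kHz) versus processing (0.8 V, 20 MHz).
ratio = dynamic_power_ratio(v_low=0.6, f_low=10e3, v_high=0.8, f_high=20e6)
```

Under these assumptions the dynamic component drops by more than three orders of magnitude, which suggests that leakage, not modelled here, is the term that matters in sensing mode.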
Bio-Signal Processing Benchmarks
We have considered four bio-signal analysis benchmarks, which are widely used in the field of electrocardiogram embedded processing [3] , [8] and present different levels of complexity and parallelism. They also present different tradeoffs in terms of results elaboration and runtime requirements (i.e., computational and memory resources). Their characteristics are summarized next.
Compressed Sensing (8L-CS). This signal compression algorithm has been extensively investigated in different domains, including low-power sensing and image processing. CS assumes that the input data has a sparse representation in a transformed domain, so that the data dimensionality can be dramatically reduced. Mamaghanian et al., [3] used CS to implement a low-complexity ECG compression algorithm based on the multiplication of the input vector of samples by a sparse sensing matrix resulting in a much smaller set of measurements which can be later used to reconstruct the original signal. The algorithm used in this benchmark utilizes a software version of the energy-efficient pseudo-random number generator introduced in [36] to generate the sensing matrix used to achieve a 50 percent compression. The resulting 8L-CS does not present any data-dependent branch nor code divergence leading to an almost full lockstep execution of code among cores. In our implementation, eight ECG leads (8L) are processed in parallel employing all the cores of the platform.
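The core operation of this benchmark, the product of a sparse sensing matrix with a window of samples, can be sketched as follows. This is an illustration under assumptions, not the implementation of [3]: Python's `random` module stands in for the generator of [36], and the binary matrix entries, the number of non-zeros per column and the window length are hypothetical choices.

```python
import random


def sparse_sensing_matrix(rows, cols, nonzeros_per_col, seed=0):
    """Binary sensing matrix with a fixed number of ones per column."""
    rng = random.Random(seed)   # stand-in for the PRNG of [36]
    phi = [[0] * cols for _ in range(rows)]
    for col in range(cols):
        for row in rng.sample(range(rows), nonzeros_per_col):
            phi[row][col] = 1
    return phi


def compress(phi, samples):
    """Measurement vector y = phi * x, using integer arithmetic only."""
    return [sum(p * s for p, s in zip(row, samples)) for row in phi]


N = 16                                   # hypothetical window length
x = list(range(N))                       # placeholder input samples
phi = sparse_sensing_matrix(N // 2, N, nonzeros_per_col=2)
y = compress(phi, x)                     # half as many values: 50 percent
```

Because the matrix is binary and sparse, each measurement reduces to a few additions with no data-dependent branch, which is consistent with the almost full lock-step execution noted for 8L-CS.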
Morphological Filtering (3L-MF). ECG acquisitions are normally corrupted by different sources of noise and artifacts (including human perspiration, muscular activity or small displacements of the employed electrodes), which must be filtered to retrieve a high-quality signal. Morphological filtering [37] performs this task by employing structuring elements to unwanted components from input streams. Herein, we consider an optimized version of this algorithm [4] , which removes both low and high frequency noise components using flat and peak-shaped structuring elements. This benchmark filters in parallel ECG signals from a standard threechannel acquisition using three computing cores. In contrast to 8L-CS, the presence of numerous data-dependent branches in the 3L-MF code highlights the ability of the platform to recover lockstep execution after diverging sections of code.
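For readers unfamiliar with morphological operators, the sketch below shows the textbook erosion, dilation, opening and closing with a flat structuring element, and how they can estimate and subtract a baseline. It is a generic illustration, not the optimized algorithm of [4]: peak-shaped structuring elements are omitted, and the truncated windows at the signal borders are an assumption of this example.

```python
def erode(signal, width):
    """Flat erosion: sliding minimum over a window of the given width."""
    half, n = width // 2, len(signal)
    return [min(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def dilate(signal, width):
    """Flat dilation: sliding maximum over a window of the given width."""
    half, n = width // 2, len(signal)
    return [max(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def opening(signal, width):
    return dilate(erode(signal, width), width)    # suppresses narrow peaks


def closing(signal, width):
    return erode(dilate(signal, width), width)    # fills narrow valleys


def remove_baseline(signal, width):
    """Subtract the baseline estimated by an opening followed by a closing."""
    baseline = closing(opening(signal, width), width)
    return [s - b for s, b in zip(signal, baseline)]


# A narrow peak riding on a constant offset of 5.
corrected = remove_baseline([5, 5, 5, 9, 5, 5, 5], width=3)
```

The minimum and maximum searches are where data-dependent comparisons arise, which is in line with the numerous data-dependent branches the text attributes to 3L-MF.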
ECG Delineation (3L-MMD). This application performs the automated identification of the fiducial points of ECGs, i.e., the start, peak and end of the three main ECG waves (QRS complex, P and T waves). This process is known as delineation. On top of the filtering stage of 3L-MF, this benchmark performs a Root Mean Square (RMS) fusion of the filtered signals, resulting in a single ECG stream, which is later delineated employing an algorithm based on Multiscale Morphological Derivatives as in [8]. This benchmark requires synchronization for both lockstep execution and producer-consumer notifications to transfer data among the three processing stages, namely filtering, combination and delineation. 3L-MMD employs five cores of the platform, three of them executing code in lockstep.
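The combination stage can be sketched as a sample-wise RMS over the filtered leads. This is a minimal sketch: the function name is hypothetical, and normalizing by the number of leads is an assumption, since the exact form of the fusion is not spelled out here.

```python
import math


def rms_combine(leads):
    """Fuse equally long leads into one stream, sample by sample.

    Assumption: the squares are averaged over the number of leads before
    taking the square root.
    """
    if not leads:
        raise ValueError("at least one lead is required")
    n = len(leads)
    return [math.sqrt(sum(v * v for v in sample) / n) for sample in zip(*leads)]


# Three filtered leads, two samples each, fused into a single two-sample stream.
combined = rms_combine([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
```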
Selective ECG Processing (RP-CLASS). This benchmark, detailed in [5], embeds a neuro-fuzzy classifier that detects abnormal heartbeats. When a detection occurs, a further analysis is executed on the abnormal heartbeat. By default, a single ECG channel is filtered and analyzed by the classifier. Only for abnormal heartbeats are three-channel filtering and delineation performed, as described above for 3L-MMD. This benchmark presents a complex structure, requiring lockstep execution in some cases and a sophisticated control flow across cores. RP-CLASS utilizes six cores of the platform and benefits from both proposed synchronization mechanisms.
Table 1 reports the most relevant workload characteristics of the four bio-signal processing benchmarks considered in this work, when mapped on the target multi-core platform. It highlights the small overhead caused by the insertion of synchronization instructions, in terms of run-time as well as code size. Moreover, the table shows that the sensing periods, where the processing cores stay idle, are dominant, accounting in all cases for more than 90 percent of the time.
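The selective control flow of RP-CLASS described above can be summarized by the following sketch. It is sequential and purely illustrative: the function and parameter names and the beat representation are hypothetical, the three callables stand in for the single-channel filter, the neuro-fuzzy classifier of [5] and the three-channel chain of 3L-MMD, and the mapping onto cores is not modeled.

```python
def selective_processing(beats, is_abnormal, light_path, full_path):
    """Run the cheap single-channel path on every beat and the full path on demand.

    beats       -- list of dicts, each with a "leads" entry (one list per channel)
    is_abnormal -- hypothetical stand-in for the classifier decision
    light_path  -- hypothetical stand-in for single-channel filtering
    full_path   -- hypothetical stand-in for three-channel filtering and delineation
    """
    results = []
    for beat in beats:
        filtered = light_path(beat["leads"][0])
        if is_abnormal(filtered):
            # Only abnormal beats trigger the expensive multi-channel analysis.
            results.append(("abnormal", full_path(beat["leads"])))
        else:
            results.append(("normal", None))
    return results
```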
Reliable Memory Requirements
In the first round of experiments we have explored the energy efficiency of the multi-core platform when different sizes of highly-reliable SCMs are employed (Fig. 6). The considered SCM design uses a cross-coupled pair of AND-OR-INV (AOI) gates as the storage element, which is more energy efficient than 6T-SRAM. The choice of this memory element, combined with the use of regular place and route, results in more than 3× area saving [14] compared to the SCM design in [38] that uses a latch as the storage element.
In applications with multiple producer-consumer computation phases, the availability of a large SCM region enables the acquisition of wider sample windows and thus maximizes the pipelined execution of different phases. The supply voltage levels for sensing and processing were determined from the measurement results presented in [14]. The authors of [14] measured the minimum operating voltage over nine chips; their results show that, for the majority of the chips, the SCMEM operated correctly at voltages below 0.4 V and that, on average, its minimum operating voltage is 400 mV lower than that of the 6T memory. However, we considered the worst-case scenario, i.e., the highest minimum voltage for both SCMEM and 6T among the measured chips, which conservatively led to 600 mV for sensing and 800 mV for processing.
As shown in Fig. 6, the illustrated trade-off results in an optimal size of the SCM region of 64 bytes for three out of the four considered benchmarks. Thus, we used this size in the experiments of the following sections. This choice increases the area of the data memory by 0.2 percent and leads to a negligible system area overhead (≈0.1 percent) w.r.t. a design including only 6T-SRAM. Since irregular 6T memory banks cannot be generated with standard memory compilers, the addition of small SCM regions does not imply a reduction of the 6T part but a superposition. As expected, the benefits of employing wider SCM regions are most evident in the 3L-MMD and RP-CLASS benchmarks, which expose producer-consumer relationships. For these cases, the ability to process a larger window of data in a pipelined fashion across multiple processors is leveraged to increase parallelism and reduce the time spent in processing mode. For the other two benchmarks, only modest gains can be achieved by employing bigger SCMs, due to the reduced number of transitions between sensing and processing modes. The time overhead of each transition between processing and sensing modes has been conservatively modeled as 100 ns in our experiments, taking into account wide margins with respect to silicon implementations [39], [40].
Table 2 reports the obtained values for leakage and dynamic power for both processing and sensing phases. In all cases, leakage power is effectively reduced by ≈40 percent when transitioning to the state-retentive sensing mode. In the context of WBSN applications, this aspect is particularly relevant, since the benchmarks spend up to 95 percent of their execution time in this state. As expected, dynamic power is negligible during the sensing periods, where most of the system is clock-gated and the voltage is reduced.
Power Consumption Evaluation
To highlight the efficiency of our proposed solution, we compared it with two different baseline systems. The first baseline system (no Hybrid in Fig. 7) does not implement the hybrid memory subsystem and it is always running at the higher voltage level of 800 mV, while still employing synchronization for managing lock-step execution and efficient producer-consumer waiting. We set the working frequency of this baseline at 2.5 MHz (which allows it to barely meet the real-time constraints of the considered benchmarks) in order to minimize the power consumption of elements which are not clock-gated, such as the clock tree itself. The energy profile of the no Hybrid architecture has been investigated in detail in our previous work [8] , which showcases how synchronization alone leads to tangible efficiency gains with respect to a single-core alternative (40 percent less energy) and with respect to a multi-core which does not support synchronization (32 percent less energy).
In the second case (no Sync in Fig. 7), we employ active waiting instead of clock-gating to manage producer-consumer relationships and lock-step execution is disabled, but we still allow the system to transition to the low-power sensing mode when all the cores are idle. To make a fair comparison, in this setting access conflicts are reduced by assigning different IM and DM banks to each processor, even when they execute the same computing phase. As in the target system, no Sync adopts a 20 MHz clock when in processing mode, with the aim of increasing as much as possible the time spent in the energy-efficient sensing mode.
Fig. 7 shows the breakdown of the average power consumption for 60 s of activity for all three architectures, considering the time spent in sensing and processing modes. Two main conclusions can be drawn from this comparison. First, energy savings are consistently achieved in all benchmarks by employing different operation modes supported by a hybrid data memory. Savings derive from a reduction of up to 32 percent in the leakage power of all system components, as well as from a reduction in the dynamic power of the clock tree (reaching 60 percent in 3L-MF) due to the lower frequency employed in sensing mode. Efficiency gains grow linearly with the ratio between the time spent in sensing mode and the total run-time. The overhead deriving from the use of SCMs in hybrid banks is instead negligible, due to their small required size of just a few bytes. Therefore, a high-workload application, always residing in processing mode, would require the same energy in our target system and in the no Hybrid one.
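The linear dependence on the sensing ratio follows from a simple duty-cycle model, sketched below. This is a first-order illustration under stated assumptions, not the evaluation flow used for Fig. 7: the function names and the power values are hypothetical, transition overheads are ignored, and both systems are assumed to draw the same power for the same fraction of time while processing, whereas the compared configurations differ in clock frequency.

```python
def average_power(p_processing, p_idle, sensing_ratio):
    """Average power when a fraction sensing_ratio of the time is spent idle/sensing."""
    if not 0.0 <= sensing_ratio <= 1.0:
        raise ValueError("sensing_ratio must lie in [0, 1]")
    return (1.0 - sensing_ratio) * p_processing + sensing_ratio * p_idle


def hybrid_saving(p_processing, p_idle_high_voltage, p_sensing, sensing_ratio):
    """Power saved by dropping to the sensing mode instead of idling at high voltage."""
    baseline = average_power(p_processing, p_idle_high_voltage, sensing_ratio)
    hybrid = average_power(p_processing, p_sensing, sensing_ratio)
    return baseline - hybrid


# Hypothetical values in arbitrary units: the saving scales linearly with the ratio.
for ratio in (0.0, 0.5, 0.9):
    print(ratio, hybrid_saving(100.0, 10.0, 6.0, ratio))
```

Under these assumptions the saving equals the sensing ratio times the difference between the two idle power levels, so it vanishes for an application that never leaves processing mode.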
Second, synchronization can effectively increase the system efficiency. In fact, synchronization allows merging memory requests of data and instruction words, thus minimizing the accesses to memories and the number of active banks, leading to a reduction in leakage and dynamic energy. These two aspects are especially beneficial when multiple cores execute the same processing phase, as in the case of 8L-CS where memory consumption is reduced by 83 percent.
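The effect of merging requests can be illustrated with a simple access count. The sketch below is a behavioral abstraction under assumptions, not a model of the actual interconnect: the function name and the trace format are hypothetical, and requests are assumed to be merged whenever several cores ask for the same address in the same cycle.

```python
def count_accesses(requests_per_cycle, merge):
    """Count memory accesses for a trace of per-cycle address requests.

    requests_per_cycle -- one list per cycle with the addresses requested by the cores
    merge              -- if True, identical addresses in a cycle are served by one access
    """
    total = 0
    for addresses in requests_per_cycle:
        total += len(set(addresses)) if merge else len(addresses)
    return total


# Eight cores fetching the same instruction stream in lockstep for 100 cycles.
lockstep_trace = [[0x100 + cycle] * 8 for cycle in range(100)]
unmerged = count_accesses(lockstep_trace, merge=False)
merged = count_accesses(lockstep_trace, merge=True)
```

In this idealized trace every cycle is fully merged, which is a best case for instruction fetches only; the measured reduction reported above also reflects data accesses and diverging code sections.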
CONCLUSIONS
Nowadays, very promising opportunities for increasing the energy efficiency of digital platforms reside at the architectural and system design level. Such solutions require the specialization of computing resources for a target application domain. In this work, we have presented a dedicated computing architecture for bio-medical signal processing, which harnesses the high-level features of applications in this domain. The proposed system adapts to the varying workload requirements typical of bio-signal analysis applications, which are leveraged as an energy-saving opportunity by a hybrid memory scheme. Moreover, the employed multi-core structure exploits parallel and pipelined execution, matching application-level characteristics, while allowing SIMD execution and avoiding active waiting.
The platform supports two operating modes, which determine the accessibility of resources and the system energy consumption: a high-performance processing mode and a low-power sensing mode. Such dual-mode operation is supported by employing specialized data memory banks, which include an area-efficient 6T SRAM partition and a low-voltage reliable SCM partition. At the same time, our strategy includes a light-weight mechanism to perform the transitions between modes and to synchronize cores, enabling the broadcasting of data and instructions and the clock-gating of individual cores. Experimental results showed that, by using our proposed methodology, overall energy gains of up to 50 percent can be achieved while requiring a negligible 0.1 percent area increase in a multi-core platform devoted to bio-medical DSP applications.
Rubén Braojos received the MSc degree in computer science and engineering from Complutense University of Madrid, Spain, in 2010. He is currently working toward the PhD degree in the Embedded Systems Laboratory, École Polytechnique Fédérale de Lausanne, Switzerland. His research interests include embedded bio-signal processing, ultra-low power embedded systems, and WBSNs applied to the field of healthcare. He is a member of the IEEE.
Daniele Bortolotti received the MS degree in electronic engineering and the PhD degree in electronics, computer science, and telecommunications from the University of Bologna, Italy, in 2010 and 2014, respectively. He is currently a post-doctoral researcher in the Department of Electrical, Electronic and Information Engineering Guglielmo Marconi, University of Bologna. The focus of his research has initially been on virtual platforms and architectural aspects for multi-processors systems-on-chip. Recently his focus comprises HW/SW design strategies for ultra-low power bio-sensors nodes operating in near-threshold for WBSN applications and low-level power management techniques for many-cores HPC nodes.
Andrea Bartolini received the PhD degree in electrical engineering from the University of Bologna, Italy, in 2011. He is currently a post-doctoral researcher in the Department of Electrical, Electronic and Information Engineering Guglielmo Marconi, University of Bologna. He also holds a postdoc position in the Integrated Systems Laboratory, ETH Zurich. His research interests concern dynamic resource management ranging from embedded to large scale HPC systems with special emphasis on software-level thermal and power-aware techniques. His research interest also includes ultra-low power design strategies for bio-sensors nodes operating in near-threshold. He is a member of the IEEE.
Giovanni Ansaloni received the MS degree in electronic engineering from the University of Ferrara, Italy, in 2003, the MAS degree from ALaRI Institute, Switzerland, in 2005, and the PhD degree from the University of Lugano, Switzerland, in 2011. He is currently a post-doctoral researcher in the Faculty of Informatics, Università della Svizzera italiana, USI-Lugano, Switzerland. From 2011 to 2015, he was a researcher with École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. His research efforts focus on smart Wireless Body Sensor Nodes systems and applications, including software optimizations of processing algorithms for bio-signal analysis, and architectural explorations of ultra-low-power WBSN platforms. He is a member of the IEEE.
Luca Benini is a full professor with the University of Bologna and he is the chair of Digital Circuits and Systems, ETHZ. He has served as chief architect of the Platform2012/STHORM project in STMicroelectronics, Grenoble, in the period 2009-2013. He has held visiting and consulting researcher positions with EPFL, IMEC, Hewlett-Packard Laboratories, and Stanford University. His research interests include energy-efficient system design and multi-core SoC design. He is also active in the area of energy-efficient smart sensors and sensor networks for biomedical and ambient intelligence applications. He has published more than 700 papers in peer-reviewed international journals and conferences, four books and several book chapters. He is a fellow of the IEEE and a member of the Academia Europaea.
David Atienza (M'05-SM'13-F'16) received the MSc and the PhD degrees in computer science and engineering from UCM, Spain, and IMEC, Belgium, in 2001 and 2005, respectively. He is an associate professor of electrical and computer engineering, and director of the Embedded Systems Laboratory, Swiss Federal Institute of Technology, Lausanne, Switzerland. His research interests include system-level hardware-software co-design methodologies for high-performance multi-processor system-on-chip and ultra-low-power embedded systems, including especially new 2-D/3-D thermal-aware design for MPSoCs, and ultra-low power system architectures for wireless body sensor nodes. He is a co-author of more than 250 papers in peer-reviewed international journals and conferences, several book chapters, and five U.S. patents.
