9 research outputs found

    HEAL-WEAR: an Ultra-Low Power Heterogeneous System for Bio-Signal Analysis

    Personalized healthcare devices enable low-cost, unobtrusive and long-term acquisition of clinically-relevant biosignals. These appliances, termed Wireless Body Sensor Nodes (WBSNs), are fostering a revolution in health monitoring for patients affected by chronic ailments. Nowadays, WBSNs often embed complex digital processing routines, which must be performed within an extremely tight energy budget. Addressing this challenge, in this paper we introduce a novel computing architecture devoted to the ultra-low power analysis of biosignals. Its heterogeneous structure comprises multiple processors interfaced with a shared acceleration resource, implemented as a Coarse Grained Reconfigurable Array (CGRA). The CGRA mesh effectively supports the execution of the intensive loops that characterize bio-signal analysis applications, while requiring a low reconfiguration overhead. Moreover, both the processors and the reconfigurable fabric feature Single-Instruction / Multiple-Data (SIMD) execution modes, which increase efficiency when multiple data streams are concurrently processed. The run-time behavior of the system is orchestrated by a light-weight hardware mechanism, which concurrently synchronizes processors for SIMD execution and regulates access to the reconfigurable accelerator. By jointly leveraging run-time reconfiguration and SIMD execution, the illustrated heterogeneous system achieves, when executing complex bio-signal analysis applications, speedups of up to 11.3x on the considered kernels and up to 37.2% overall energy savings, with respect to an ultra-low power multicore platform which does not feature CGRA acceleration.
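A kernel-level speedup does not translate one-to-one into a whole-application gain. The sketch below uses Amdahl's law to show how the overall speedup depends on the share of runtime spent in accelerated kernels. Only the 11.3x figure comes from the abstract; the kernel shares are hypothetical, and the abstract's 37.2% figure refers to energy, which this runtime-only model does not capture.

```python
# Amdahl's-law sketch; the kernel fractions below are hypothetical, only the
# 11.3x kernel speedup comes from the abstract.

def overall_speedup(kernel_fraction, kernel_speedup):
    """Whole-application speedup when only `kernel_fraction` of the runtime is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

KERNEL_SPEEDUP = 11.3  # best-case kernel speedup quoted in the abstract

for fraction in (0.25, 0.50, 0.90):
    print(f"kernel share {fraction:.0%}: overall speedup "
          f"{overall_speedup(fraction, KERNEL_SPEEDUP):.2f}x")
```

The larger the share of runtime spent in the accelerated loops, the closer the overall speedup gets to the kernel speedup.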

    Low-Power Processor Architecture Exploration for Online Biomedical Signal Analysis

    A. Y. Dogan, J. Constantin, D. Atienza, A. Burg, L. Benini
    In this study, we explore sequential and parallel processing architectures, utilizing a custom ultra-low-power (ULP) processing core, to extend the lifetime of health monitoring systems, where slow biosignal events and highly parallel computations exist. To this end, a single- and a multi-core architecture are proposed and compared. The single-core architecture is composed of one ULP processing core, an instruction memory (IM) and a data memory (DM), while the multi-core architecture consists of several ULP processing cores, individual IMs for each core, a shared DM and an interconnection crossbar between the cores and the DM. These architectures are compared with respect to power/performance trade-offs for different target workloads of online biomedical signal analysis, while exploiting near threshold computing. The results show that with respect to the single-core architecture, the multi-core solution consumes 62% less power for high computation requirements (167 MOps/s), while consuming 46% more power for extremely low computation needs when the power consumption is dominated by leakage. Additionally, we show that our proposed ULP processing core, using a simplified instruction set architecture (ISA), achieves energy savings of 54% compared to a reference microcontroller

    Low-power processor architecture exploration for online biomedical signal analysis

    In this study, the authors explore sequential and parallel processing architectures, utilising a custom ultra-low-power (ULP) processing core, to extend the lifetime of health monitoring systems, where slow biosignal events and highly parallel computations exist. To this end, a single- and a multi-core architecture are proposed and compared. The single-core architecture is composed of one ULP processing core, an instruction memory (IM) and a data memory (DM), while the multi-core architecture consists of several ULP processing cores, individual IMs for each core, a shared DM and an interconnection crossbar between the cores and the DM. These architectures are compared with respect to power/performance trade-offs for different target workloads of online biomedical signal analysis, while exploiting near threshold computing. The results show that with respect to the single-core architecture, the multi-core solution consumes 62% less power for high computation requirements (167 MOps/s), while consuming 46% more power for extremely low computation needs when the power consumption is dominated by leakage. Additionally, the authors show that the proposed ULP processing core, using a simplified instruction set architecture (ISA), achieves energy savings of 54% compared to a reference microcontroller ISA (PIC24)
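The direction of the trade-off described above (multi-core wins at high workloads, single-core wins when leakage dominates) can be reproduced with a first-order power model. The sketch below is a toy illustration only: the capacitance, leakage, voltage/frequency relation and the core count of 8 are assumptions, and the output is not expected to match the 62% and 46% figures reported in the abstract.

```python
# Toy first-order power model; every constant below is an illustrative
# assumption, not a value from the paper.

def core_power(freq_mhz, vdd, c_eff=1.0, leak_per_volt=5.0):
    """Power of one core in arbitrary units: dynamic (C * V^2 * f) plus leakage."""
    dynamic = c_eff * vdd ** 2 * freq_mhz
    leakage = leak_per_volt * vdd
    return dynamic + leakage

def min_vdd(freq_mhz, v_floor=0.4, volts_per_mhz=0.004):
    """Hypothetical linear relation between clock frequency and lowest usable supply."""
    return v_floor + volts_per_mhz * freq_mhz

def system_power(workload_mops, n_cores):
    """Split the workload evenly; each core runs at the lowest voltage meeting its share."""
    freq = workload_mops / n_cores      # assumes 1 operation per cycle per core
    vdd = min_vdd(freq)
    return n_cores * core_power(freq, vdd)

for workload in (0.1, 167.0):
    single = system_power(workload, 1)
    multi = system_power(workload, 8)
    print(f"{workload:6.1f} MOps/s  single={single:8.2f}  multi={multi:8.2f}")
```

At the high workload the parallel cores run slower and at a lower supply voltage, so the quadratic saving in dynamic power outweighs the extra leakage; at the very low workload the additional cores contribute almost nothing but leakage.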

    Microarchitectural Low-Power Design Techniques for Embedded Microprocessors

    With the omnipresence of embedded processing in all forms of electronics today, there is a strong trend towards wireless, battery-powered, portable embedded systems which have to operate under stringent energy constraints. Consequently, low power consumption and high energy efficiency have emerged as the two key criteria for embedded microprocessor design. In this thesis we present a range of microarchitectural low-power design techniques which enable the increase of performance for embedded microprocessors and/or the reduction of energy consumption, e.g., through voltage scaling. In the context of cryptographic applications, we explore the effectiveness of instruction set extensions (ISEs) for a range of different cryptographic hash functions (SHA-3 candidates) on a 16-bit microcontroller architecture (PIC24). Specifically, we demonstrate the effectiveness of light-weight ISEs based on lookup table integration and microcoded instructions using finite state machines for operand and address generation. On-node processing in autonomous wireless sensor node devices requires deeply embedded cores with extremely low power consumption. To address this need, we present TamaRISC, a custom-designed ISA with a corresponding ultra-low-power microarchitecture implementation. The TamaRISC architecture is employed in conjunction with an ISE and standard cell memories to design a sub-threshold capable processor system targeted at compressed sensing applications. We furthermore employ TamaRISC in a hybrid SIMD/MIMD multi-core architecture targeted at moderate to high processing requirements (> 1 MOPS). A range of different microarchitectural techniques for efficient memory organization are presented. Specifically, we introduce a configurable data memory mapping technique for private and shared access, as well as instruction broadcast together with synchronized code execution based on checkpointing. 
We then study an inherent suboptimality due to the worst-case design principle in synchronous circuits, and introduce the concept of dynamic timing margins. We show that dynamic timing margins exist in microprocessor circuits, and that these margins are to a large extent state-dependent and that they are correlated to the sequences of instruction types which are executed within the processor pipeline. To perform this analysis we propose a circuit/processor characterization flow and tool called dynamic timing analysis. Moreover, this flow is employed in order to devise a high-level instruction set simulation environment for impact-evaluation of timing errors on application performance. The presented approach improves the state of the art significantly in terms of simulation accuracy through the use of statistical fault injection. The dynamic timing margins in microprocessors are then systematically exploited for throughput improvements or energy reductions via our proposed instruction-based dynamic clock adjustment (DCA) technique. To this end, we introduce a 6-stage 32-bit microprocessor with cycle-by-cycle DCA. Besides a comprehensive design flow and simulation environment for evaluation of the DCA approach, we additionally present a silicon prototype of a DCA-enabled OpenRISC microarchitecture fabricated in 28 nm FD-SOI CMOS. The test chip includes a suitable clock generation unit which allows for cycle-by-cycle DCA over a wide range with fine granularity at frequencies exceeding 1 GHz. Measurement results of speedups and power reductions are provided
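The idea behind instruction-based dynamic clock adjustment can be sketched in a few lines: instead of clocking every cycle at the worst-case period, the period for each cycle is chosen from the instruction types currently in the pipeline. The delay table, the instruction trace and the rule "take the slowest instruction type in flight" are simplifying assumptions made for illustration; they are not the characterization data or the exact policy of the thesis.

```python
# Illustrative sketch of instruction-based dynamic clock adjustment (DCA).
# The delay table and the instruction trace are made-up assumptions.

# Hypothetical worst-case path delay (ns) excited by each instruction type.
DELAY_NS = {"add": 0.70, "load": 0.85, "mul": 1.00, "branch": 0.75, "nop": 0.50}

STATIC_PERIOD_NS = max(DELAY_NS.values())  # conventional worst-case clock period

def dca_periods(trace, depth=6):
    """Per-cycle clock period: the slowest instruction type currently in the pipeline."""
    pipeline = ["nop"] * depth
    periods = []
    for instr in trace + ["nop"] * (depth - 1):   # drain the pipeline at the end
        pipeline = [instr] + pipeline[:-1]        # advance every stage by one
        periods.append(max(DELAY_NS[i] for i in pipeline))
    return periods

trace = ["add", "add", "load", "add", "branch", "add", "add", "mul", "add", "add"]
periods = dca_periods(trace)
static_time = STATIC_PERIOD_NS * len(periods)
dca_time = sum(periods)
print(f"static: {static_time:.2f} ns, DCA: {dca_time:.2f} ns, "
      f"speedup: {static_time / dca_time:.2f}x")
```

The gain shrinks as slow instruction types become more frequent, since a single slow instruction stretches the clock for every cycle it spends in the pipeline.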

    Hardware/Software Co-Design of Ultra-Low Power Biomedical Monitors

    Ongoing changes in world demographics and the prevalence of unhealthy lifestyles are imposing a paradigm shift in healthcare delivery. Nowadays, chronic ailments such as cardiovascular diseases, hypertension and diabetes represent the most common causes of death according to the World Health Organization. It is estimated that 63% of deaths worldwide are directly or indirectly related to these non-communicable diseases (NCDs), and by 2030 it is predicted that the health delivery cost will reach an amount comparable to 75% of the current GDP. In this context, technologies based on Wireless Sensor Nodes (WSNs) effectively alleviate this burden, enabling the conception of wearable biomedical monitors composed of one or several devices connected through a Wireless Body Sensor Network (WBSN). Energy efficiency is of paramount importance for these devices, which must operate for prolonged periods of time with a single battery charge. In this thesis I propose a set of hardware/software co-design techniques to drastically increase the energy efficiency of biomedical monitors. To this end, I jointly explore different alternatives to reduce the required computational effort at the software level while optimizing the power consumption of the processing hardware by employing ultra-low power multi-core architectures that exploit DSP application characteristics. First, at the sensor level, I study the utilization of a heartbeat classifier to perform selective advanced DSP on state-of-the-art ECG biomedical monitors. To this end, I developed a framework to design and train real-time, lightweight heartbeat neuro-fuzzy classifiers, detailing the required optimizations to efficiently execute them on a resource-constrained platform. Then, at the network level I propose a more complex transmission-aware WBSN for activity monitoring that provides different tradeoffs between classification accuracy and transmission volume.
In this work, I study the combination of a minimal set of WSNs with a smartphone, and propose two classification schemes that trade accuracy for transmission volume. The proposed method can achieve accuracies ranging from 88% to 97% and can save up to 86% of wireless transmissions, outperforming the state-of-the-art alternatives. Second, I propose a synchronization-based low-power multi-core architecture for bio-signal processing. I introduce a hardware/software synchronization mechanism that achieves high energy efficiency while parallelizing the execution of multi-channel DSP applications. Then, I generalize the methodology to support bio-signal processing applications with an arbitrarily high degree of parallelism. Due to the benefits of SIMD execution and software pipelining, the architecture can reduce its power consumption by up to 38% when compared to an equivalent low-power single-core alternative. Finally, I focus on the optimization of the multi-core memory subsystem, which is the major contributor to the overall system power consumption. First, I consider a hybrid memory subsystem featuring a small reliable partition that can operate at ultra-low voltage, enabling low-power buffering of data and obtaining up to 50% energy savings. Second, I explore a two-level memory hierarchy based on non-volatile memories (NVM) that allows for aggressive fine-grained power gating enabled by emerging low-power NVM technologies and monolithic 3D integration. Experimental results show that, by adopting this memory hierarchy, power consumption can be reduced by 5.42x in the DSP stage.

    Study and development of low power consumption SRAMs on 28 nm FD-SOI CMOS process

    Since analog circuit designs in nanometer CMOS nodes (< 90 nm) can be substantially affected by manufacturing process variations, it is increasingly challenging to achieve efficient circuit performance using analytical models. Extensive simulations are thus commonly required to provide a high yield. On the other hand, because the classical bulk MOS structure is reaching its scaling limits (< 32 nm), alternative devices are being developed as successors, such as fully depleted silicon-on-insulator (FD-SOI), Multigate MOSFETs and FinFETs, among others, and new design techniques are emerging that take advantage of the improved features of these devices. This thesis focuses on the development of analytical expressions for the major performance parameters of an SRAM cache implemented in 28 nm FD-SOI CMOS, mainly to explore the transistor dimensions at low computational cost, thereby producing efficient designs in terms of energy consumption, speed and yield. By taking advantage of both the low computational cost and the close agreement of the developed models, in this thesis we were able to propose a non-traditional sizing procedure for the simple 6T-SRAM cell in which, unlike the traditional thin-cell design, transistor lengths are used as a design variable in order to reduce the static leakage. The single-P-well (SPW) structure in combination with the reverse-body-biasing (RBB) technique was used to achieve a better balance between P-type and N-type transistors. As a result, we developed a 128 kB SRAM cache, whose post-layout simulations show that the circuit consumes an average energy per operation of 0.604 pJ/word-access (64 I/O bits) at a supply voltage of 0.45 V and an operation frequency of 40 MHz.
The total chip area of the 128 kB SRAM cache is 0.060 mm².

    Hardware / Software Architectural and Technological Exploration for Energy-Efficient and Reliable Biomedical Devices

    Nowadays, the ubiquity of smart appliances in our everyday lives is increasingly strengthening the links between humans and machines. Beyond making our lives easier and more convenient, smart devices are now playing an important role in personalized healthcare delivery. This technological breakthrough is particularly relevant in a world where population aging and unhealthy habits have made non-communicable diseases the leading cause of death worldwide according to international public health organizations. In this context, smart health monitoring systems, termed Wireless Body Sensor Nodes (WBSNs), represent a paradigm shift in the healthcare landscape by greatly lowering the cost of long-term monitoring of chronic diseases, as well as improving patients' lifestyles. WBSNs are able to autonomously acquire biological signals and embed on-node Digital Signal Processing (DSP) capabilities to deliver clinically-accurate health diagnoses in real-time, even outside of a hospital environment. Energy efficiency and reliability are fundamental requirements for WBSNs, since they must operate for extended periods of time, while relying on compact batteries. These constraints, in turn, impose carefully designed hardware and software architectures for hosting the execution of complex biomedical applications. In this thesis, I develop and explore novel solutions at the architectural and technological level of the integrated circuit design domain, to enhance the energy efficiency and reliability of current WBSNs. Firstly, following a top-down approach driven by the characteristics of biomedical algorithms, I perform an architectural exploration of a heterogeneous and reconfigurable computing platform devoted to bio-signal analysis. By interfacing a shared Coarse-Grained Reconfigurable Array (CGRA) accelerator, this domain-specific platform can achieve higher performance and energy savings, beyond the capabilities offered by a baseline multi-processor system.
More precisely, I propose three CGRA architectures, each contributing differently to the maximization of the application parallelization. The proposed Single, Multi and Interleaved-Datapath CGRA designs allow the developed platform to achieve substantial energy savings of up to 37%, when executing complex biomedical applications, with respect to a multi-core-only platform. Secondly, I investigate how the modeling of technology reliability issues in logic and memory components can be exploited to adequately adjust the frequency and supply voltage of a circuit, with the aim of optimizing its computing performance and energy efficiency. To this end, I propose a novel framework for analyzing the workload-dependent impact of Bias Temperature Instability (BTI) on the quality of biomedical application results. Remarkably, the framework is able to determine the range of safe circuit operating frequencies without introducing worst-case guard bands. Experiments highlight the possibility to safely raise the frequency up to 101% above the maximum obtained with the classical static timing analysis. Finally, through the study of several well-known biomedical algorithms, I propose an approach that saves energy by dynamically and unequally protecting an under-powered data memory, in contrast to regular error protection schemes. This solution relies on the Dynamic eRror compEnsation And Masking (DREAM) technique, which reduces by approximately 21% the energy consumed by traditional error correction codes.