12 research outputs found

    Voltage stacking for near/sub-threshold operation

    Get PDF

    Low energy digital circuits in advanced nanometer technologies

    Get PDF
    The demand for portable devices and the continuing trend towards the Internet ofThings (IoT) have made of energy consumption one of the main concerns in the industry and researchers. The most efficient way of reducing the energy consump-tion of digital circuits is decreasing the supply voltage (Vdd) since the dynamicenergy quadratically depends onVdd. Several works have shown that an optimumsupply voltage exists that minimizes the energy consumption of digital circuits. This optimum supply voltage is usually around 200 mV and 400 mV dependingon the circuit and technology used. To obtain these low supply voltages, on-chipdc-dc converters with high efficiency are needed.This thesis focuses on the study of subthreshold digital systems in advancednanometer technologies. These systems usually can be divided into a Power Man-agement Unit (PMU) and a digital circuit operating at the subthreshold regime.In particular, while considering the PMU, one of the key circuits is the dc-dcconverter. This block converts the voltage from the power source (battery, supercapacitor or wireless power transfer link) to a voltage between 200 mV and 400mV in order to power the digital circuit. In this thesis, we developed two chargerecycling techniques in order to improve the efficiency of switched capacitors dc-dcconverters. The first one is based on a technique used in adiabatic circuits calledstepwise charging. This technique was used in circuits and applications wherethe switching consumption of a big capacitance is very important. We analyzedthe possibility of using this technique in switched capacitor dc-dc converters withintegrated capacitors. We showed through measurements that a 29% reductionin the gate drive losses can be obtained with this technique. The second one isa simplification of stepwise charging which can be applied in some architecturesof switched capacitors dc-dc converters. We also fabricated and tested a dc-dcconverter with this technique and obtained a 25% energy reduction in the drivingof the switches that implement the converter.Furthermore, we studied the digital circuit working in the subthreshold regime,in particular, operating at the minimum energy point. We studied different modelsfor circuits working in these conditions and improved them by considering thedifferences between the NMOS and PMOS transistors. We obtained an optimumNMOS/PMOS leakage current imbalance that minimizes the total leakage energy per operation. This optimum depends on the architecture of the digital circuitand the input data. However, we also showed that important energy reductionscan be obtained by operating at a mean optimum imbalance. We proposed two techniques to achieve the optimum imbalance. We used aFully Depleted Silicon on Insulator (FD-SOI) 28 nm technology for most of the simulations, but we also show that these techniques can be applied in traditionalbulk CMOS technologies. The first one consists in using the back plane voltage of the transistors (or bulk voltage in traditional CMOS) to adjust independently theleakage current of the NMOS and PMOS transistor to work under the optimum NMOS/PMOS leakage current imbalance. We called this approach the OptimumBack Plane Biasing (OBB). A second technique consists of using the length of the transistors to adjust this leakage current imbalance. In the subthreshold regimeand in advanced nanometer technologies a moderate increase in the length has little impact in the output capacitance of the gates and thus in the dynamic energy.We called this approach an Asymmetric Length Biasing (ALB). Finally, we use these techniques in some basic circuits such as adders. We show that around 50% energy reduction can be obtained, in a wide range of frequency while working near the minimum energy point and using these techniques. The main contributions of this thesis are: • Analysis of the stepwise charging technique in small capacitances. •Implementation of stepwise charging technique as a charge recycling tech-nique for efficiency improvement in switched capacitor dc-dc converters. • Development of a charge sharing technique for efficiency improvement inswitched capacitor dc-dc converters. • Analysis of minimum operating voltage of digital circuits due to intrinsicnoise and the impact of technology scaling in this minimum. • Improvement in the modeling of the minimum energy point while considering NMOS and PMOS transistors difference. • Demonstration of the existence of an optimum leakage current imbalance be-tween the NMOS and PMOS transistors that minimizes energy consumptionin the subthreshold regiion. • Development of a back plane (bulk) voltage strategy for working in this optimum.• Development of a sizing strategy for working in the aforementioned optimum. • Analysis of the impact of architecture and input data on the optimum im-balance. The thesis is based on the publications [1–8]. During the Ph.D. program, other publications were generated [9–16] that are partially related with the thesis butwere not included in it.La constante demanda de dispositivos portables y los avances hacia la Internet de las Cosas han hecho del consumo de energía uno de los mayores desafíos y preocupación en la industria y la academia. La forma más eficiente de reducir el consumo de energía de los circuitos digitales es reduciendo su voltaje de alimentación ya que la energía dinámica depende de manera cuadrática con dicho voltaje. Varios trabajos demostraron que existe un voltaje de alimentación óptimo, que minimiza la energía consumida para realizar cierta operación en un circuito digital, llamado punto de mínima energía. Este óptimo voltaje se encuentra usualmente entre 200 mV y 400 mV dependiendo del circuito y de la tecnología utilizada. Para obtener estos voltajes de alimentación de la fuente de energía, se necesitan conversores dc-dc integrados con alta eficiencia. Esta tesis se concentra en el estudio de sistemas digitales trabajando en la región sub umbral diseñados en tecnologías nanométricas avanzadas (28 nm). Estos sistemas se pueden dividir usualmente en dos bloques, uno llamado bloque de manejo de potencia, y el segundo, el circuito digital operando en la region sub umbral. En particular, en lo que corresponde al bloque de manejo de potencia, el circuito más crítico es en general el conversor dc-dc. Este circuito convierte el voltaje de una batería (o super capacitor o enlace de transferencia inalámbrica de energía o unidad de cosechado de energía) en un voltaje entre 200 mV y 400 mV para alimentar el circuito digital en su voltaje óptimo. En esta tesis desarrollamos dos técnicas que, mediante el reciclado de carga, mejoran la eficiencia de los conversores dc-dc a capacitores conmutados. La primera es basada en una técnica utilizada en circuitos adiabáticos que se llama carga gradual o a pasos. Esta técnica se ha utilizado en circuitos y aplicaciones en donde el consumo por la carga y descarga de una capacidad grande es dominante. Nosotros analizamos la posibilidad de utilizar esta técnica en conversores dc-dc a capacitores conmutados con capacitores integrados. Se demostró a través de medidas que se puede reducir en un 29% el consumo debido al encendido y apagado de las llaves que implementan el conversor dc-dc. La segunda técnica, es una simplificación de la primera, la cual puede ser aplicada en ciertas arquitecturas de conversores dc-dc a capacitores conmutados. También se fabricó y midió un conversor con esta técnica y se obtuvo una reducción del 25% en la energía consumida por el manejo de las llaves del conversor. Por otro lado, estudiamos los circuitos digitales operando en la región sub umbral y en particular cerca del punto de mínima energía. Estudiamos diferentes modelos para circuitos operando en estas condiciones y los mejoramos considerando las diferencias entre los transistores NMOS y PMOS. Mediante este modelo demostramos que existe un óptimo en la relación entre las corrientes de fuga de ambos transistores que minimiza la energía de fuga consumida por operación. Este óptimo depende de la arquitectura del circuito digital y ademas de los datos de entrada del circuito. Sin embargo, demostramos que se puede reducir el consumo de manera considerable al operar en un óptimo promedio. Propusimos dos técnicas para alcanzar la relación óptima. Utilizamos una tecnología FD-SOI de 28nm para la mayoría de las simulaciones, pero también mostramos que estas técnicas pueden ser utilizadas en tecnologías bulk convencionales. La primer técnica, consiste en utilizar el voltaje de la puerta trasera (o sustrato en CMOS convencional) para ajustar de manera independiente las corrientes del NMOS y PMOS para que el circuito trabaje en el óptimo de la relación de corrientes. Esta técnica la llamamos polarización de voltaje de puerta trasera óptimo. La segunda técnica, consiste en utilizar los largos de los transistores para ajustar las corrientes de fugas de cada transistor y obtener la relación óptima. Trabajando en la región sub umbral y en tecnologías avanzadas, incrementar moderadamente el largo del transistor tiene poco impacto en la energía dinámica y es por eso que se puede utilizar. Finalmente, utilizamos estas técnicas en circuitos básicos como sumadores y mostramos que se puede obtener una reducción de la energía consumida de aproximadamente 50%, en un amplio rango de frecuencias, mientras estos circuitos trabajan cerca del punto de energía mínima. Las principales contribuciones de la tesis son: • Análisis de la técnica de carga gradual o a pasos en capacidades pequeñas. • Implementación de la técnica de carga gradual para la mejora de eficiencia de conversores dc-dc a capacitores conmutados. • Simplificación de la técnica de carga gradual para mejora de la eficiencia en algunas arquitecturas de conversores dc-dc de capacitores conmutados. • Análisis del mínimo voltaje de operación en circuitos digitales debido al ruido intrínseco del dispositivo y el impacto del escalado de las tecnologías en el mismo. • Mejoras en el modelado del punto de energía mínima de operación de un circuito digital en el cual se consideran las diferencias entre el transistor PMOS y NMOS. • Demostración de la existencia de un óptimo en la relación entre las corrientes de fuga entre el NMOS y PMOS que minimiza la energía de fugas consumida en la región sub umbral. • Desarrollo de una estrategia de polarización del voltaje de puerta trasera para que el circuito digital trabaje en el óptimo antes mencionado. • Desarrollo de una estrategia para el dimensionado de los transistores que componen las compuertas digitales que permite al circuito digital operar en el óptimo antes mencionado. • Análisis del impacto de la arquitectura del circuito y de los datos de entrada del mismo en el óptimo antes mencionado

    Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add

    Get PDF
    The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P. The work of I. Ratkovic was supported by a FPU research grant from the Spanish MECD.Peer ReviewedPostprint (author's final draft

    Power Profiling Model for RISC-V Core

    Get PDF
    The reduction of power consumption is considered to be a critical factor for efficient computation of microprocessors. Therefore, it is necessary to implement a power management system that is aware of the computational load of the CPU cores. To enable such power management, this project aims to develop a power profiling model for the RISC-V core. TheSyDeKick verification environment was used to develop the power profiling models. Additionally, Python-controlled mixed mode simulations of C-programs compiled for A-Core were conducted to obtain needed data for the power profiling of the digital circuitry. The proposed methodology could employ a time-varying power consumption profiling for the A-Core RISC-V microprocessor core which depends on software, voltage, and clock frequency. The results of this project allow for the creation of parameterized power profiles for the A-Core, which can contribute to more efficient and sustainable computing

    Mr.Wolf: An Energy-Precision Scalable Parallel Ultra Low Power SoC for IoT Edge Processing

    Get PDF
    This paper presents Mr. Wolf, a parallel ultra-low power (PULP) system on chip (SoC) featuring a hierarchical architecture with a small (12 kgates) microcontroller (MCU) class RISC-V core augmented with an autonomous IO subsystem for efficient data transfer from a wide set of peripherals. The small core can offload compute-intensive kernels to an eight-core floating-point capable of processing engine available on demand. The proposed SoC, implemented in a 40-nm LP CMOS technology, features a 108-mu W fully retentive memory (512 kB). The IO subsystem is capable of transferring up to 1.6 Gbit/s from external devices to the memory in less than 2.5 mW. The eight-core compute cluster achieves a peak performance of 850 million of 32-bit integer multiply and accumulate per second (MMAC/s) and 500 million of 32-bit floating-point multiply and accumulate per second (MFMAC/s) -1 GFlop/s-with an energy efficiency up to 15 MMAC/s/mW and 9 MFMAC/s/mW. These building blocks are supported by aggressive on-chip power conversion and management, enabling energy-proportional heterogeneous computing for always-on IoT end nodes improving performance by several orders of magnitude with respect to traditional single-core MCUs within a power envelope of 153 mW. We demonstrated the capabilities of the proposed SoC on a wide set of near-sensor processing kernels showing that Mr. Wolf can deliver performance up to 16.4 GOp/s with energy efficiency up to 274 MOp/s/mW on real-life applications, paving the way for always-on data analytics on high-bandwidth sensors at the edge of the Internet of Things

    Battery-sourced switched-inductor multiple-output CMOS power-supply systems

    Get PDF
    Wireless microsystems add intelligence to larger systems by sensing, processing and transmitting information which can ultimately save energy and resources. Each function has their own power profile and supply level to maximize performance and save energy since they are powered by a small battery. Also, due to its small size, the battery has limited energy and therefore the power-supply system cannot consume much power. Switched-inductor converters are efficient across wide operating conditions but one fundamental challenge is integration because miniaturized dc-dc converters cannot afford to accommodate more than one off-chip power inductor. The objective of this research is to explore, develop, analyze, prototype, test, and evaluate how one switched inductor can derive power from a small battery to supply, regulate, and respond to several independent outputs reliably and accurately. Managing and stabilizing the feedback loops that supply several outputs at different voltages under diverse and dynamic loading conditions with one CMOS chip and one inductor is also challenging. Plus, since a single inductor cannot supply all outputs at once, steady-state ripples and load dumps produce cross-regulation effects that are difficult to manage and suppress. Additionally, as the battery depletes the power-supply system must be able to regulate both buck and boost voltages. The presented system can efficiently generate buck and boost voltages with the fastest response time while having a low silicon area consumption per output in a low-cost technology which can reduce the overall size and cost of the system.Ph.D

    Microarchitectures pour la sauvegarde incrémentale, robuste et efficace dans les systèmes à alimentation intermittente

    Get PDF
    Embedded devices powered with environmental energy harvesting, have to sustain computation while experiencing unexpected power failures.To preserve the progress across the power interruptions, Non-Volatile Memories (NVMs) are used to quickly save the state. This dissertation first presents an overview and comparison of different NVM technologies, based on different surveys from the literature. The second contribution we propose is a dedicated backup controller, called Freezer, that implements an on-demand incremental backup scheme. This can make the size of the backup 87.7% smaller then a full-memory backup strategy from the state of the art (SoA). Our third contribution addresses the problem of corruption of the state, due to interruptions during the backup process. Two algorithms are presented, that improve on the Freezer incremental backup process, making it robust to errors, by always guaranteeing the existence of a correct state, that can be restored in case of backup errors. These two algorithms can consume 23% less energy than the usual double-buffering technique used in the SoA. The fourth contribution, addresses the scalability of our proposed approach. Combining Freezer with Bloom filters, we introduce a backup scheme that can cover much larger address spaces, while achieving a backup size which is half the size of the regular Freezer approach.Les appareils embarqués alimentés par la récupération d'énergie environnementale doivent maintenir le calcul tout en subissant des pannes de courant inattendues. Pour préserver la progression à travers les interruptions de courant, des mémoires non volatiles (NVM) sont utilisées pour enregistrer rapidement l'état. Cette thèse présente d'abord une vue d'ensemble et une comparaison des différentes technologies NVM, basées sur différentes enquêtes de la littérature. La deuxième contribution que nous proposons est un contrôleur de sauvegarde dédié, appelé Freezer, qui implémente un schéma de sauvegarde incrémentale à la demande. Cela peut réduire la taille de la sauvegarde de 87,7% à celle d'une stratégie de sauvegarde à mémoire complète de l'état de l'art. Notre troisième contribution aborde le problème de la corruption de l'état, due aux interruptions pendant le processus de sauvegarde. Deux algorithmes sont présentés, qui améliorent le processus de sauvegarde incrémentale de Freezer, le rendant robuste aux erreurs, en garantissant toujours l'existence d'un état correct, qui peut être restauré en cas d'erreurs de sauvegarde. Ces deux algorithmes peuvent consommer 23%23\% d'énergie en moins que la technique de ``double-buffering'' utilisée dans l'état de l'art. La quatrième contribution porte sur l'évolutivité de notre approche proposée. En combinant Freezer avec des filtres Bloom, nous introduisons un schéma de sauvegarde qui peut couvrir des espaces d'adressage beaucoup plus grands, tout en obtenant une taille de sauvegarde qui est la moitié de la taille de l'approche Freezer habituelle

    Deep in-memory computing

    Get PDF
    There is much interest in embedding data analytics into sensor-rich platforms such as wearables, biomedical devices, autonomous vehicles, robots, and Internet-of-Things to provide these with decision-making capabilities. Such platforms often need to implement machine learning (ML) algorithms under stringent energy constraints with battery-powered electronics. Especially, energy consumption in memory subsystems dominates such a system's energy efficiency. In addition, the memory access latency is a major bottleneck for overall system throughput. To address these issues in memory-intensive inference applications, this dissertation proposes deep in-memory accelerator (DIMA), which deeply embeds computation into the memory array, employing two key principles: (1) accessing and processing multiple rows of memory array at a time, and (2) embedding pitch-matched low-swing analog processing at the periphery of bitcell array. The signal-to-noise ratio (SNR) is budgeted by employing low-swing operations in both memory read and processing to exploit the application level's error immunity for aggressive energy efficiency. This dissertation first describes the system rationale underlying the DIMA's processing stages by identifying the common functional flow across a diverse set of inference algorithms. Based on the analysis, this dissertation presents a multi-functional DIMA to support four algorithms: support vector machine (SVM), template matching (TM), k-nearest neighbor (k-NN), and matched filter. The circuit and architectural level design techniques and guidelines are provided to address the challenges in achieving multi-functionality. A prototype integrated circuit (IC) of a multi-functional DIMA was fabricated with a 16 KB SRAM array in a 65 nm CMOS process. Measurement results show up to 5.6X and 5.8X energy and delay reductions leading to 31X energy delay product (EDP) reduction with negligible (<1%) accuracy degradation as compared to the conventional 8-b fixed-point digital implementation optimally designed for each algorithm. Then, DIMA also has been applied to more complex algorithms: (1) convolutional neural network (CNN), (2) sparse distributed memory (SDM), and (3) random forest (RF). System-level simulations of CNN using circuit behavioral models in a 45 nm SOI CMOS demonstrate that high probability (>0.99) of handwritten digit recognition can be achieved using the MNIST database, along with a 24.5X reduced EDP, a 5.0X reduced energy, and a 4.9X higher throughput as compared to the conventional system. The DIMA-based SDM architecture also achieves up to 25X and 12X delay and energy reductions, respectively, over conventional SDM with negligible accuracy degradation (within 0.4%) for 16X16 binary-pixel image classification. A DIMA-based RF was realized as a prototype IC with a 16 KB SRAM array in a 65 nm process. To the best of our knowledge, this is the first IC realization of an RF algorithm. The measurement results show that the prototype achieves a 6.8X lower EDP compared to a conventional design at the same accuracy (94%) for an eight-class traffic sign recognition problem. The multi-functional DIMA and extension to other algorithms naturally motivated us to consider a programmable DIMA instruction set architecture (ISA), namely MATI. This dissertation explores a synergistic combination of the instruction set, architecture and circuit design to achieve the programmability without losing DIMA's energy and throughput benefits. Employing silicon-validated energy, delay and behavioral models of deep in-memory components, we demonstrate that MATI is able to realize nine ML benchmarks while incurring negligible overhead in energy (< 0.1%), and area (4.5%), and in throughput, over a fixed four-function DIMA. In this process, MATI is able to simultaneously achieve enhancements in both energy (2.5X to 5.5X) and throughput (1.4X to 3.4X) for an overall EDP improvement of up to 12.6X over fixed-function digital architectures

    Neural network computing using on-chip accelerators

    Get PDF
    The use of neural networks, machine learning, or artificial intelligence, in its broadest and most controversial sense, has been a tumultuous journey involving three distinct hype cycles and a history dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application, but a ubiquitous component of applications. This view necessitates a different approach towards the deployment of machine learning computation that spans not only hardware design of accelerator architectures, but also user and supervisor software to enable the safe, simultaneous use of machine learning accelerator resources. In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, encompassing a decoupled backend accelerator for inference and learning from hardware and software for managing neural network transactions can be achieved with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model to increase energy efficiency improvements and improve overall accelerator throughput for machine learning applications
    corecore