19 research outputs found

    Axp: A hw-sw co-design pipeline for energy-efficient approximated convnets via associative matching

    The reduction in energy consumption is key for deep neural networks (DNNs) to ensure usability and reliability, whether they are deployed on low-power end-nodes with limited resources or on high-performance platforms that serve large pools of users. Leveraging the over-parametrization shown by many DNN models, convolutional neural networks (ConvNets) in particular, energy efficiency can be improved substantially while preserving the model accuracy. The solution proposed in this work exploits the intrinsic redundancy of ConvNets to maximize the reuse of partial arithmetic results during inference. Specifically, the weight-set of a given ConvNet is discretized through a clustering procedure such that the largest possible number of inner multiplications falls into predefined bins; this allows an off-line computation of the most frequent results, which in turn can be stored locally and retrieved when needed during the forward pass. Such a reuse mechanism leads to remarkable energy savings with the aid of a custom processing element (PE) that integrates an associative memory with a standard floating-point unit (FPU). Moreover, the adoption of an approximate associative rule based on a partial bit-match increases the hit rate over the pre-computed results, maximizing the energy reduction even further. Results collected on a set of ConvNets trained for computer vision and speech processing tasks reveal that the proposed associative-based hw-sw co-design achieves up to 77% energy savings with less than 1% accuracy loss.
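
    The reuse mechanism described in this abstract can be illustrated in software. The sketch below is a minimal, purely illustrative Python model under assumed details: cluster_weights, partial_bit_key, and lut_multiply are hypothetical names, the simple 1-D k-means and mantissa-masking scheme are stand-ins for the paper's clustering procedure and partial bit-match rule, and no claim is made about the actual PE design.

```python
import numpy as np

def cluster_weights(weights, n_bins=16, seed=0):
    """Discretize a weight set to n_bins centroids via a simple 1-D k-means
    (a stand-in for the clustering procedure mentioned in the abstract)."""
    rng = np.random.default_rng(seed)
    w = weights.ravel()
    centroids = rng.choice(w, n_bins, replace=False)
    for _ in range(20):
        labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_bins):
            if np.any(labels == k):
                centroids[k] = w[labels == k].mean()
    return centroids[labels].reshape(weights.shape)

def partial_bit_key(x, drop_bits=12):
    """Approximate associative key: drop low mantissa bits so that nearly equal
    activations map to the same precomputed entry (assumed matching rule)."""
    return int(np.float32(x).view(np.uint32) & ~np.uint32((1 << drop_bits) - 1))

def lut_multiply(w_clustered, activations, table, drop_bits=12):
    """Look up (clustered weight, masked activation) pairs; reuse the stored
    product on a hit, fall back to a regular multiplication (the FPU) on a miss."""
    out = np.empty(len(activations), dtype=np.float32)
    hits = 0
    for i, (w, a) in enumerate(zip(w_clustered, activations)):
        key = (float(np.float32(w)), partial_bit_key(a, drop_bits))
        if key in table:
            out[i] = table[key]
            hits += 1
        else:
            out[i] = np.float32(w) * np.float32(a)
            table[key] = out[i]
    return out, hits

# Tiny usage example: count how many products could be reused from the table.
w = cluster_weights(np.random.randn(64).astype(np.float32), n_bins=8)
a = np.random.randn(64).astype(np.float32)
_, hits = lut_multiply(w, a, table={})
print(f"associative hits: {hits}/64")
```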

    Mr.Wolf: An Energy-Precision Scalable Parallel Ultra Low Power SoC for IoT Edge Processing

    This paper presents Mr. Wolf, a parallel ultra-low-power (PULP) system on chip (SoC) featuring a hierarchical architecture with a small (12 kgates) microcontroller (MCU)-class RISC-V core augmented with an autonomous IO subsystem for efficient data transfer from a wide set of peripherals. The small core can offload compute-intensive kernels to an eight-core floating-point-capable processing engine available on demand. The proposed SoC, implemented in a 40-nm LP CMOS technology, features a 108-μW fully retentive memory (512 kB). The IO subsystem is capable of transferring up to 1.6 Gbit/s from external devices to the memory while consuming less than 2.5 mW. The eight-core compute cluster achieves a peak performance of 850 million 32-bit integer multiply-accumulate operations per second (MMAC/s) and 500 million 32-bit floating-point multiply-accumulate operations per second (MFMAC/s), i.e., 1 GFLOP/s, with an energy efficiency of up to 15 MMAC/s/mW and 9 MFMAC/s/mW. These building blocks are supported by aggressive on-chip power conversion and management, enabling energy-proportional heterogeneous computing for always-on IoT end nodes and improving performance by several orders of magnitude with respect to traditional single-core MCUs within a power envelope of 153 mW. We demonstrate the capabilities of the proposed SoC on a wide set of near-sensor processing kernels, showing that Mr. Wolf can deliver up to 16.4 GOp/s with an energy efficiency of up to 274 MOp/s/mW on real-life applications, paving the way for always-on data analytics on high-bandwidth sensors at the edge of the Internet of Things.
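
    As a quick sanity check on how the headline numbers above relate to each other, the short Python snippet below recomputes them. It uses only figures quoted in the abstract, plus the standard convention that one multiply-accumulate counts as two floating-point operations, and the assumption that the quoted peak efficiencies apply at peak throughput.

```python
# Back-of-the-envelope figures derived only from the numbers quoted above.
peak_int_macs = 850e6   # 32-bit integer MAC/s (MMAC/s)
peak_fp_macs  = 500e6   # 32-bit floating-point MAC/s (MFMAC/s)
int_eff       = 15e6    # integer MAC/s per mW
fp_eff        = 9e6     # floating-point MAC/s per mW

# One MAC = one multiply + one add, hence 500 MFMAC/s is quoted as 1 GFLOP/s.
print(peak_fp_macs * 2 / 1e9, "GFLOP/s")                      # 1.0

# Approximate cluster power at peak, assuming the peak efficiencies hold there.
print(round(peak_int_macs / int_eff, 1), "mW (integer)")      # ~56.7 mW
print(round(peak_fp_macs / fp_eff, 1), "mW (floating-point)") # ~55.6 mW
```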

    A Biosensor-CMOS Platform and Integrated Readout Circuit in 0.18-μm CMOS Technology for Cancer Biomarker Detection

    This paper presents a biosensor-CMOS platform for measuring the capacitive coupling of biorecognition elements. The biosensor is designed, fabricated, and tested for the detection and quantification of a protein that reveals the presence of early-stage cancer. For the first time, the spermidine/spermine N1-acetyltransferase (SSAT) enzyme has been screened and quantified on the surface of a capacitive sensor. The sensor surface is treated to immobilize antibodies, and the baseline capacitance of the biosensor is reduced by connecting an array of capacitors in series for a fixed area exposed to the analyte. A large sensing area with a small baseline capacitance is implemented to achieve high sensitivity to SSAT enzyme concentrations. The sensed capacitance is digitized by a 12-bit highly digital successive-approximation capacitance-to-digital converter implemented in a 0.18-μm CMOS technology. The readout circuit operates in the near-subthreshold regime and provides power- and area-efficient operation. The capacitance range is 16.137 pF with a 4.5-fF absolute resolution, which adequately covers SSAT enzyme concentrations of 10 mg/L, 5 mg/L, 2.5 mg/L, and 1.25 mg/L. The concentrations were selected as a pilot study, and the platform demonstrated high sensitivity to SSAT enzymes on the surface of the capacitive sensor. The tested prototype achieves a 42.5-μs measurement time and a total power consumption of 2.1 μW.
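
    The converter figures quoted above can be cross-checked with a one-line calculation (not taken from the paper): a 12-bit converter spanning 16.137 pF has a nominal LSB of roughly 3.9 fF, which is consistent with the stated 4.5 fF absolute resolution once noise and nonlinearity are included.

```python
# Nominal LSB of a 12-bit capacitance-to-digital converter over a 16.137 pF range.
full_scale_ff = 16.137e3            # full-scale range in femtofarads
nominal_lsb = full_scale_ff / 2**12
print(f"{nominal_lsb:.2f} fF per LSB")   # ~3.94 fF, vs. the quoted 4.5 fF
                                         # absolute resolution
```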

    A Primer on Software Defined Radios

    The commercial success of cellular phone systems during the late 1980s and early 1990s heralded the wireless revolution that became apparent at the turn of the 21st century and has led modern society to a highly interconnected world where ubiquitous connectivity and mobility are enabled by powerful wireless terminals. Software defined radio (SDR) technology has played a major role in accelerating the pace at which wireless capabilities have advanced, in particular over the past 15 years, and SDRs are now at the core of modern wireless communication systems. In this paper we give an overview of SDRs that includes a discussion of the drivers and technologies that have contributed to their continuous advancement and presents the theory needed to understand the architecture and operation of current SDRs. We also review the choices for SDR platforms and the programming options that are currently available for SDR research, development, and teaching, and we present case studies illustrating SDR use. Our hope is that the paper will be useful as a reference to wireless researchers and developers working in industry or in academic settings on further advancing and refining the capabilities of wireless systems.

    Implementation of an LPWAN for Temperature and Humidity Monitoring in a Greenhouse

    This paper presents the implementation of an LPWAN connected to the Internet for monitoring the temperature and humidity of a greenhouse. The LPWAN consists of five nodes distributed in the greenhouse and a gateway connected to the Internet. It is an IoT solution in which each node comprises a microcontroller, an ambient temperature and relative humidity sensor, and a LoRa radio transceiver. The nodes periodically send the information collected by the sensors to the gateway, which in turn transmits it to a server located in the cloud. Through a user interface, users can access the server and monitor the temperature and humidity values sent by the nodes. The user interface can also access information from other greenhouses equipped with an LPWAN like the one designed in this work. The LPWAN achieved a range of 12.5 kilometers with line of sight. Keywords: Gateway, IoT, LoRa, LPWAN, temperature sensor, transceiver.
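
    The node behaviour described above (sample, packetize, transmit to the gateway) can be sketched in a few lines of Python. This is a hypothetical illustration, not code from the paper: read_sensor and lora_send are placeholder stubs standing in for the actual sensor driver and LoRa transceiver, and the JSON payload format and 60-second period are assumptions.

```python
import json
import random
import time

def read_sensor():
    """Placeholder for the temperature/relative-humidity sensor driver."""
    return {"temp_c": round(random.uniform(18.0, 35.0), 1),
            "rh_pct": round(random.uniform(40.0, 90.0), 1)}

def lora_send(payload: bytes):
    """Placeholder for the LoRa transceiver; a real node would radio this packet
    to the gateway, which forwards it to the cloud server behind the user interface."""
    print("uplink:", payload.decode())

def node_loop(node_id="node-1", period_s=60, iterations=3):
    """Periodically sample and transmit, mirroring the node -> gateway -> cloud flow."""
    for _ in range(iterations):
        reading = {"node": node_id, "ts": int(time.time()), **read_sensor()}
        lora_send(json.dumps(reading).encode())
        time.sleep(period_s)

if __name__ == "__main__":
    node_loop(period_s=1)   # shortened period so the sketch runs quickly
```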

    Microarchitectural Low-Power Design Techniques for Embedded Microprocessors

    With the omnipresence of embedded processing in all forms of electronics today, there is a strong trend towards wireless, battery-powered, portable embedded systems which have to operate under stringent energy constraints. Consequently, low power consumption and high energy efficiency have emerged as the two key criteria for embedded microprocessor design. In this thesis we present a range of microarchitectural low-power design techniques which increase the performance of embedded microprocessors and/or reduce their energy consumption, e.g., through voltage scaling. In the context of cryptographic applications, we explore the effectiveness of instruction set extensions (ISEs) for a range of different cryptographic hash functions (SHA-3 candidates) on a 16-bit microcontroller architecture (PIC24). Specifically, we demonstrate the effectiveness of lightweight ISEs based on lookup-table integration and microcoded instructions using finite state machines for operand and address generation. On-node processing in autonomous wireless sensor node devices requires deeply embedded cores with extremely low power consumption. To address this need, we present TamaRISC, a custom-designed ISA with a corresponding ultra-low-power microarchitecture implementation. The TamaRISC architecture is employed in conjunction with an ISE and standard-cell memories to design a sub-threshold-capable processor system targeted at compressed sensing applications. We furthermore employ TamaRISC in a hybrid SIMD/MIMD multi-core architecture targeted at moderate to high processing requirements (> 1 MOPS). A range of microarchitectural techniques for efficient memory organization is presented. Specifically, we introduce a configurable data memory mapping technique for private and shared access, as well as instruction broadcast together with synchronized code execution based on checkpointing. We then study an inherent suboptimality due to the worst-case design principle in synchronous circuits, and introduce the concept of dynamic timing margins. We show that dynamic timing margins exist in microprocessor circuits, that they are to a large extent state-dependent, and that they are correlated with the sequences of instruction types executed within the processor pipeline. To perform this analysis we propose a circuit/processor characterization flow and tool called dynamic timing analysis. This flow is then employed to devise a high-level instruction set simulation environment for evaluating the impact of timing errors on application performance. The presented approach improves the state of the art significantly in terms of simulation accuracy through the use of statistical fault injection. The dynamic timing margins in microprocessors are then systematically exploited for throughput improvements or energy reductions via our proposed instruction-based dynamic clock adjustment (DCA) technique. To this end, we introduce a 6-stage 32-bit microprocessor with cycle-by-cycle DCA. Besides a comprehensive design flow and simulation environment for evaluating the DCA approach, we also present a silicon prototype of a DCA-enabled OpenRISC microarchitecture fabricated in 28-nm FD-SOI CMOS. The test chip includes a suitable clock generation unit which allows for cycle-by-cycle DCA over a wide range with fine granularity at frequencies exceeding 1 GHz. Measurement results of speedups and power reductions are provided.
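
    To make the instruction-based DCA idea concrete, the following sketch models it at a very high level: each cycle's clock period is chosen from the instruction types currently in the pipeline instead of a fixed worst-case period. The period table, pipeline depth, and trace are invented for illustration and are not measurements or data from the thesis.

```python
# A minimal sketch of instruction-based dynamic clock adjustment (DCA):
# pick each cycle's clock period from the instruction types currently in the
# pipeline instead of always paying the worst-case period.
WORST_CASE_NS = 2.0
PERIOD_NS = {"alu": 1.4, "load": 1.8, "store": 1.7, "mul": 2.0, "branch": 1.5}

def run_with_dca(instruction_trace, pipeline_depth=6):
    """Return (total time with DCA, total time at the fixed worst-case clock)."""
    window = []
    t_dca = t_fixed = 0.0
    for opcode in instruction_trace:
        window = (window + [opcode])[-pipeline_depth:]
        # Each cycle must satisfy the slowest instruction currently in flight.
        t_dca += max(PERIOD_NS[op] for op in window)
        t_fixed += WORST_CASE_NS
    return t_dca, t_fixed

trace = ["alu", "alu", "load", "branch", "alu", "store", "mul", "alu"] * 100
t_dca, t_fixed = run_with_dca(trace)
print(f"speedup with DCA: {t_fixed / t_dca:.2f}x")
```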

    Low-Power Design of Digital VLSI Circuits around the Point of First Failure

    As intelligent, self-powered devices are forecast to pervade our everyday life, the implementation of energy-autonomous devices that can wirelessly communicate data from sensors is crucial. Even though techniques such as voltage scaling have proved effective at reducing the energy consumption of digital circuits, additional energy savings are still required for a longer battery life. One of the main limitations of essentially any low-energy technique is the potential degradation of the quality of service (QoS). Thus, a thorough understanding of how circuits behave when operated around the point of first failure (PoFF) is key for the effective application of conventional energy-efficient methods as well as for the development of future low-energy techniques. In this thesis, a variety of circuits, techniques, and tools is described to reduce the energy consumption of digital systems operated either in the safe and conservative exact region, close to the PoFF, or even inside the inexact region. A straightforward approach to reducing the power consumed by clock distribution while safely operating in the exact region is dual-edge-triggered (DET) clocking. However, the DET approach is rarely taken, primarily due to the perceived complexity of its integration. In this thesis, a fully automated design flow is introduced for applying DET clocking to a conventional single-edge-triggered (SET) design. In addition, the first static true-single-phase-clock DET flip-flop (DET-FF) that completely avoids the clock-overlap hazards of DET registers is proposed. Even though the correct timing of synchronous circuits is ensured under worst-case conditions, the critical path might not always be excited. Thus, dynamic clock adjustment (DCA) has been proposed to trim any available dynamic timing margin by changing the operating clock frequency at runtime. This thesis describes a dynamically adjustable clock generator (DCG), capable of modifying the period of the produced clock signal on a cycle-by-cycle basis, that enables the DCA technique. In addition, a timing-monitoring sequential (TMS) is proposed that detects input transitions on either clock phase, enabling the selection of the best timing-monitoring strategy at runtime. Energy-quality scaling techniques aim at trading lower energy consumption for a small degradation in QoS whenever approximations can be tolerated. In this thesis, a low-power methodology for the perturbation of baseline coefficients in reconfigurable finite impulse response (FIR) filters is proposed. The baseline coefficients are optimized to reduce the switching activity of the multipliers in the FIR filter, enabling the power consumption of the filter to be scaled at runtime. The area as well as the leakage power of many systems-on-chip is often dominated by embedded memories. Gain-cell embedded DRAM (GC-eDRAM) is a compact, low-power, CMOS-compatible alternative to conventional static random-access memory (SRAM) when a higher memory density is desired. However, because GC-eDRAMs rely on many interdependent variables, adapting existing memories and designing future GC-eDRAMs are highly complex tasks. Thus, the first modeling tool that estimates the timing, memory availability, bandwidth, and area of GC-eDRAMs for a fast exploration of their design space is proposed in this thesis.
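
    The FIR coefficient-perturbation idea can be illustrated with a toy search: perturb quantized baseline coefficients within a small tolerance so that consecutive coefficients presented to a shared multiplier toggle fewer input bits. The objective, coefficient values, and exhaustive search below are illustrative assumptions, not the optimization flow or cost model used in the thesis.

```python
# Toy illustration of the energy-quality idea described above: small coefficient
# perturbations that reduce bit toggling (a crude proxy for switching activity).
from itertools import product

def toggles(coeffs, width=8):
    """Total bit transitions between consecutive coefficients (two's complement)."""
    mask = (1 << width) - 1
    return sum(bin((a & mask) ^ (b & mask)).count("1")
               for a, b in zip(coeffs, coeffs[1:]))

def perturb(baseline, max_delta=1, width=8):
    """Exhaustively try +/-max_delta perturbations and keep the lowest-toggle set."""
    best, best_cost = list(baseline), toggles(baseline, width)
    for deltas in product(range(-max_delta, max_delta + 1), repeat=len(baseline)):
        cand = [c + d for c, d in zip(baseline, deltas)]
        cost = toggles(cand, width)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

baseline = [12, -7, 33, -7, 12]           # toy symmetric low-pass coefficients
tuned, cost = perturb(baseline)
print("baseline toggles:", toggles(baseline), "-> perturbed:", cost, tuned)
```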

    Neuromorphic Engineering Editors' Pick 2021

    This collection showcases well-received spontaneous articles from the past couple of years, handpicked by our Chief Editors, Profs. André van Schaik and Bernabé Linares-Barranco. The work presented here highlights the broad diversity of research performed across the section and aims to put a spotlight on its main areas of interest. All of the research displays strong advances in theory, experiment, and methodology with applications to compelling problems. This collection further supports Frontiers’ strong community by recognizing highly deserving authors.

    Rethinking FPGA Architectures for Deep Neural Network applications

    The prominence of machine-learning-powered solutions has driven an unprecedented trend of integration into virtually all applications, with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the advantages of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where latency and energy consumption are strictly limited, their architectures, optimised before the machine learning era, remain a barrier to overall efficiency and performance. Recognizing this shortcoming, this thesis presents an architectural study aimed at unlocking hidden potential in FPGA technology, primarily for machine learning algorithms. In particular, it shows how slight alterations to state-of-the-art architectures can make FPGAs significantly more machine-learning-friendly while maintaining near-baseline performance for the remaining applications. It then presents a novel systematic approach to deriving new block architectures, guided by design limitations and the characteristics of machine learning algorithms, through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to the FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Finally, with the goal of exploring this new design space in a methodical manner, a problem formulation involving the computation of nested loops over multiply-accumulate (MAC) operations is proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested, together with a family of new embedded blocks called MLBlocks.
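
    The kind of MAC-centric loop nest such a problem formulation targets can be written down directly. The sketch below is a plain 2-D convolution expressed as nested multiply-accumulate loops; it is only a generic example of that class of kernels, not the thesis's actual formulation or notation.

```python
import numpy as np

def conv2d_mac_loops(inp, weights):
    """Nested loops of multiply-accumulate operations for a 2-D convolution,
    the kind of kernel a MAC-centric problem formulation covers."""
    C, H, W = inp.shape            # input channels, height, width
    K, _, R, S = weights.shape     # output channels, input channels, kernel h/w
    out = np.zeros((K, H - R + 1, W - S + 1), dtype=inp.dtype)
    for k in range(K):                         # output channel
        for y in range(out.shape[1]):          # output row
            for x in range(out.shape[2]):      # output column
                acc = 0.0
                for c in range(C):             # input channel
                    for r in range(R):         # kernel row
                        for s in range(S):     # kernel column
                            acc += inp[c, y + r, x + s] * weights[k, c, r, s]
                out[k, y, x] = acc
    return out

inp = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv2d_mac_loops(inp, w).shape)   # (4, 6, 6)
```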