130 research outputs found

    PULP-NN: Accelerating Quantized Neural Networks on Parallel Ultra-Low-Power RISC-V Processors

    Get PDF
    We present PULP-NN, an optimized computing library for a parallel ultra-low-power tightly coupled cluster of RISC-V processors. The key innovation in PULP-NN is a set of kernels for quantized neural network inference, targeting byte and sub-byte data types, down to INT-1, tuned for the recent trend toward aggressive quantization in deep neural network inference. The proposed library exploits both the digital signal processing extensions available in the PULP RISC-V processors and the cluster\u2019s parallelism, achieving up to 15.5 MACs/cycle on INT-8 and improving performance by up to 63 7 with respect to a sequential implementation on a single RISC-V core implementing the baseline RV32IMC ISA. Using PULP-NN, a CIFAR-10 network on an octa-core cluster runs in 30 7 and 19.6 7 less clock cycles than the current state-of-the-art ARM CMSIS-NN library, running on STM32L4 and STM32H7 MCUs, respectively. The proposed library, when running on a GAP-8 processor, outperforms by 36.8 7 and by 7.45 7 the execution on energy efficient MCUs such as STM32L4 and high-end MCUs such as STM32H7 respectively, when operating at the maximum frequency. The energy efficiency on GAP-8 is 14.1 7 higher than STM32L4 and 39.5 7 higher than STM32H7, at the maximum efficiency operating point. This article is part of the theme issue \u2018Harmonizing energy-autonomous computing and intelligence\u2019

    PULP-NN: A computing library for quantized neural network inference at the edge on RISC-V based parallel ultra low power clusters

    Get PDF
    We present PULP-NN, a multicore computing library for a parallel ultra-low-power cluster of RISC-V based processors. The library consists of a set of kernels for Quantized Neural Network (QNN) inference on edge devices, targeting byte and sub-byte data types, down to INT-1. Our software solution exploits the digital signal processing (DSP) extensions available in the PULP RISC-V processors and the cluster's parallelism, improving performance by up to 63 7 with respect to a baseline implementation on a single RISC-V core implementing the RV32IMC ISA. Using the PULP-NN routines, the inference of a CIFAR-10 QNN model runs in 30 7 and 19.6 7 less clock cycles than the current state-of-the-art ARM CMSIS-NN library, running on an STM32L4 and an STM32H7 MCUs, respectively. By running the library kernels on the GAP-8 processor at the maximum efficiency operating point, the energy efficiency on GAP-8 is 14.1 7 higher than STM32L4 and 39.5 7 than STM32H7

    A High-performance, Energy-efficient Modular DMA Engine Architecture

    Full text link
    Data transfers are essential in today's computing systems as latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAEs) are critically needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous systems, DMAEs must operate efficiently in increasingly diverse environments. This work proposes a modular and highly configurable open-source DMAE architecture called intelligent DMA (iDMA), split into three parts that can be composed and customized independently. The front-end implements the control plane binding to the surrounding system. The mid-end accelerates complex data transfer patterns such as multi-dimensional transfers, scattering, or gathering. The back-end interfaces with the on-chip communication fabric (data plane). We assess the efficiency of iDMA in various instantiations: In high-performance systems, we achieve speedups of up to 15.8x with only 1 % additional area compared to a base system without a DMAE. We achieve an area reduction of 10 % while improving ML inference performance by 23 % in ultra-low-energy edge AI systems over an existing DMAE solution. We provide area, timing, latency, and performance characterization to guide its instantiation in various systems.Comment: 14 pages, 14 figures, accepted by an IEEE journal for publicatio

    DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs

    Get PDF
    The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler to support pervasive Deep Learning-enhanced applications. Low-Cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads, to reduce area overheads and increase energy efficiency -- requiring explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY) - an automatic tool to deploy DNNs on low cost MCUs with typically less than 1MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power MCU-class devices on the market. On this device, DORY achieves up to 2.5x better MAC/cycle than the GreenWaves proprietary software solution and 18.1x better than the state-of-the-art result on an STM32-F746 MCU on single layers. Using our tool, GAP-8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average @ 4.3 fps - 15.4x better than an STM32-F746. We release all our developments - the DORY framework, the optimized backend kernels, and the related heuristics - as open-source software.Comment: 14 pages, 12 figures, 4 tables, 2 listings. Accepted for publication in IEEE Transactions on Computers (https://ieeexplore.ieee.org/document/9381618

    Always-On 674uW @ 4GOP/s Error Resilient Binary Neural Networks with Aggressive SRAM Voltage Scaling on a 22nm IoT End-Node

    Full text link
    Binary Neural Networks (BNNs) have been shown to be robust to random bit-level noise, making aggressive voltage scaling attractive as a power-saving technique for both logic and SRAMs. In this work, we introduce the first fully programmable IoT end-node system-on-chip (SoC) capable of executing software-defined, hardware-accelerated BNNs at ultra-low voltage. Our SoC exploits a hybrid memory scheme where error-vulnerable SRAMs are complemented by reliable standard-cell memories to safely store critical data under aggressive voltage scaling. On a prototype in 22nm FDX technology, we demonstrate that both the logic and SRAM voltage can be dropped to 0.5Vwithout any accuracy penalty on a BNN trained for the CIFAR-10 dataset, improving energy efficiency by 2.2X w.r.t. nominal conditions. Furthermore, we show that the supply voltage can be dropped to 0.42V (50% of nominal) while keeping more than99% of the nominal accuracy (with a bit error rate ~1/1000). In this operating point, our prototype performs 4Gop/s (15.4Inference/s on the CIFAR-10 dataset) by computing up to 13binary ops per pJ, achieving 22.8 Inference/s/mW while keeping within a peak power envelope of 674uW - low enough to enable always-on operation in ultra-low power smart cameras, long-lifetime environmental sensors, and insect-sized pico-drones.Comment: Submitted to ISICAS2020 journal special issu

    Many-core architectures with time predictable execution Support for hard real-time applications

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (p. 183-193).Hybrid control systems are a growing domain of application. They are pervasive and their complexity is increasing rapidly. Distributed control systems for future "Intelligent Grid" and renewable energy generation systems are demanding high-performance, hard real-time computation, and more programmability. General-purpose computer systems are primarily designed to process data and not to interact with physical processes as required by these systems. Generic general-purpose architectures even with the use of real-time operating systems fail to meet the hard realtime constraints of hybrid system dynamics. ASIC, FPGA, or traditional embedded design approaches to these systems often result in expensive, complicated systems that are hard to program, reuse, or maintain. In this thesis, we propose a domain-specific architecture template targeting hybrid control system applications. Using power electronics control applications, we present new modeling techniques, synthesis methodologies, and a parameterizable computer architecture for these large distributed control systems. We propose a new system modeling approach, called Adaptive Hybrid Automaton, based on previous work in control system theory, that uses a mixed-model abstractions and lends itself well to digital processing. We develop a domain-specific architecture based on this modeling that uses heterogeneous processing units and predictable execution, called MARTHA. We develop a hard real-time aware router architecture to enable deterministic on-chip interconnect network communication. We present several algorithms for scheduling task-based applications onto these types of heterogeneous architectures. We create Heracles, an open-source, functional, parameterized, synthesizable many-core system design toolkit, that can be used to explore future multi/many-core processors with different topologies, routing schemes, processing elements or cores, and memory system organizations. Using the Heracles design tool we build a prototype of the proposed architecture using a state-of-the-art FPGA-based platform, and deploy and test it in actual physical power electronics systems. We develop and release an open-source, small representative set of power electronics system applications that can be used for hard real-time application benchmarking.by Michel A. Kinsy.Ph.D

    XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V Based IoT End Nodes

    Get PDF
    Heavily quantized fixed-point arithmetic is becoming a common approach to deploy Convolutional Neural Networks (CNNs) on limited-memory low-power IoT end-nodes. However, this trend is narrowed by the lack of support for low-bitwidth in the arithmetic units of state-of-the-art embedded Microcontrollers (MCUs). This work proposes a multi-precision arithmetic unit fully integrated into a RISC-V processor at the micro-architectural and ISA level to boost the efficiency of heavily Quantized Neural Network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we show near-linear speedup with respect to higher precision integer computation on the key kernels for QNN computation. Also, we propose a custom execution paradigm for SIMD sum-of-dot-product operations, which consists of fusing a dot product with a load operation, with an up to 1.64 × peak MAC/cycle improvement compared to a standard execution scenario. To further push the efficiency, we integrate the RISC-V extended core in a parallel cluster of 8 processors, with near-linear improvement with respect to a single core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extension run 6 × and 8 × faster when considering 4- and 2-bit data operands, respectively, compared to a baseline processing cluster only supporting 8-bit SIMD instructions. With a peak of 2.22 TOPs/s/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU

    An investigation of the XMOS XSl architecture as a platform for development of audio control standards

    Get PDF
    This thesis investigates the feasiblity of using a new microcontroller architecture, the XMOS XS1, in the research and development of control standards for audio distribution networks. This investigation is conducted in the context of an emerging audio distribution network standard, Ethernet Audio/Video Bridging (`Ethernet AVB'), and an emerging audio control standard, AES-64. The thesis describes these emerging standards, the XMOS XS1 architecture (including its associated programming language, XC), and the open-source implementation of an Ethernet AVB streaming audio device based on the XMOS XS1 architecture. It is shown how the XMOS XS1 architecture and its associated features, focusing on the XC language's mechanisms for concurrency, event-driven programming, and integration of C software modules, enable a powerful implementation of the AES-64 control standard. Feasibility is demonstrated by the implementation of an AES-64 protocol stack and its integration into an XMOS XS1-based Ethernet AVB streaming audio device, providing control of Ethernet AVB features and audio hardware, as well as implementations of advanced AES-64 control mechanisms. It is demonstrated that the XMOS XS1 architecture is a compelling platform for the development of audio control standards, and has enabled the implementation of AES-64 connection management and control over standards-compliant Ethernet AVB streaming audio devices where no such implementation previously existed. The research additionally describes a linear design method for applications based on the XMOS XS1 architecture, and provides a baseline implementation reference for the AES-64 control standard where none previously existed
    • …
    corecore