138 research outputs found
An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics
Near-sensor data analytics is a promising direction for IoT endpoints, as it
minimizes energy spent on communication and reduces network load - but it also
poses security concerns, as valuable data is stored or sent over the network at
various stages of the analytics pipeline. Using encryption to protect sensitive
data at the boundary of the on-chip analytics engine is a way to address data
security issues. To cope with the combined workload of analytics and encryption
in a tight power envelope, we propose Fulmine, a System-on-Chip based on a
tightly-coupled multi-core cluster augmented with specialized blocks for
compute-intensive data processing and encryption functions, supporting software
programmability for regular computing tasks. The Fulmine SoC, fabricated in
65nm technology, consumes less than 20mW on average at 0.8V achieving an
efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to
25MIPS/mW in software. As a strong argument for real-life flexible application
of our platform, we show experimental results for three secure analytics use
cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN
consuming 3.16pJ per equivalent RISC op; local CNN-based face detection with
secured remote recognition in 5.74pJ/op; and seizure detection with encrypted
data collection from EEG within 12.7pJ/op.Comment: 15 pages, 12 figures, accepted for publication to the IEEE
Transactions on Circuits and Systems - I: Regular Paper
A 64mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones
Fully-autonomous miniaturized robots (e.g., drones), with artificial
intelligence (AI) based visual navigation capabilities are extremely
challenging drivers of Internet-of-Things edge intelligence capabilities.
Visual navigation based on AI approaches, such as deep neural networks (DNNs)
are becoming pervasive for standard-size drones, but are considered out of
reach for nanodrones with size of a few cm. In this work, we
present the first (to the best of our knowledge) demonstration of a navigation
engine for autonomous nano-drones capable of closed-loop end-to-end DNN-based
visual navigation. To achieve this goal we developed a complete methodology for
parallel execution of complex DNNs directly on-bard of resource-constrained
milliwatt-scale nodes. Our system is based on GAP8, a novel parallel
ultra-low-power computing platform, and a 27 g commercial, open-source
CrazyFlie 2.0 nano-quadrotor. As part of our general methodology we discuss the
software mapping techniques that enable the state-of-the-art deep convolutional
neural network presented in [1] to be fully executed on-board within a strict 6
fps real-time constraint with no compromise in terms of flight results, while
all processing is done with only 64 mW on average. Our navigation engine is
flexible and can be used to span a wide performance range: at its peak
performance corner it achieves 18 fps while still consuming on average just
3.5% of the power envelope of the deployed nano-aircraft.Comment: 15 pages, 13 figures, 5 tables, 2 listings, accepted for publication
in the IEEE Internet of Things Journal (IEEE IOTJ
A Microcontroller is All You Need: Enabling Transformer Execution on Low-Power IoT Endnodes
Transformer networks have become state-of-the-art for many tasks such as NLP and are closing the gap on other tasks like image recognition. Similarly, Transformers and Attention methods are starting to attract attention on smaller-scale tasks, which fit the typical memory envelope of MCUs. In this work, we propose a new set of execution kernels tuned for efficient execution on MCU-class RISC-V and ARM Cortex-M cores. We focus on minimizing memory movements while maximizing data reuse in the Attention layers. With our library, we obtain 3.4x, 1.8 x, and 2.1 x lower latency and energy on 8-bit Attention layers, compared to previous state-of-the-art (SoA) linear and matrix multiplication kernels in the CMSIS-NN and PULP-NN libraries on the STM32H7 (Cortex M7), STM32L4 (Cortex M4), and GAP8 (RISC-V IMC-Xpulp) platforms, respectively. As a use case for our TinyTransformer library, we also demonstrate that we can fit a 263 kB Transformer on the GAP8 platform, outperforming the previous SoA convolutional architecture on the TinyRadarNN dataset, with a latency of 9.24 ms and 0.47 mJ energy consumption and an accuracy improvement of 3.5%
TCN Mapping Optimization for Ultra-Low Power Time-Series Edge Inference
Temporal Convolutional Networks (TCNs) are emerging lightweight Deep Learning
models for Time Series analysis. We introduce an automated exploration approach
and a library of optimized kernels to map TCNs on Parallel Ultra-Low Power
(PULP) microcontrollers. Our approach minimizes latency and energy by
exploiting a layer tiling optimizer to jointly find the tiling dimensions and
select among alternative implementations of the causal and dilated
1D-convolution operations at the core of TCNs. We benchmark our approach on a
commercial PULP device, achieving up to 103X lower latency and 20.3X lower
energy than the Cube-AI toolkit executed on the STM32L4 and from 2.9X to 26.6X
lower energy compared to commercial closed-source and academic open-source
approaches on the same hardware target
DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs
The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme
edge of the Internet-of-Things is a critical enabler to support pervasive Deep
Learning-enhanced applications. Low-Cost MCU-based end-nodes have limited
on-chip memory and often replace caches with scratchpads, to reduce area
overheads and increase energy efficiency -- requiring explicit DMA-based memory
transfers between different levels of the memory hierarchy. Mapping modern DNNs
on these systems requires aggressive topology-dependent tiling and
double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY)
- an automatic tool to deploy DNNs on low cost MCUs with typically less than
1MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint Programming
(CP) problem: it maximizes L1 memory utilization under the topological
constraints imposed by each DNN layer. Then, it generates ANSI C code to
orchestrate off- and on-chip transfers and computation phases. Furthermore, to
maximize speed, DORY augments the CP formulation with heuristics promoting
performance-effective tile sizes. As a case study for DORY, we target
GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power
MCU-class devices on the market. On this device, DORY achieves up to 2.5x
better MAC/cycle than the GreenWaves proprietary software solution and 18.1x
better than the state-of-the-art result on an STM32-F746 MCU on single layers.
Using our tool, GAP-8 can perform end-to-end inference of a 1.0-MobileNet-128
network consuming just 63 pJ/MAC on average @ 4.3 fps - 15.4x better than an
STM32-F746. We release all our developments - the DORY framework, the optimized
backend kernels, and the related heuristics - as open-source software.Comment: 14 pages, 12 figures, 4 tables, 2 listings. Accepted for publication
in IEEE Transactions on Computers
(https://ieeexplore.ieee.org/document/9381618
Neuraghe: Exploiting CPU-FPGA synergies for efficient and flexible CNN inference acceleration on zynQ SoCs
Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: While the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169GOps/s and an energy efficiency of 17GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state-of-the-art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6fps on ResNet-18
Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers
Enabling On-Device Learning (ODL) for Ultra-Low-Power Micro-Controller Units
(MCUs) is a key step for post-deployment adaptation and fine-tuning of Deep
Neural Network (DNN) models in future TinyML applications. This paper tackles
this challenge by introducing a novel reduced precision optimization technique
for ODL primitives on MCU-class devices, leveraging the State-of-Art
advancements in RISC-V RV32 architectures with support for vectorized 16-bit
floating-point (FP16) Single-Instruction Multiple-Data (SIMD) operations. Our
approach for the Forward and Backward steps of the Back-Propagation training
algorithm is composed of specialized shape transform operators and Matrix
Multiplication (MM) kernels, accelerated with parallelization and loop
unrolling. When evaluated on a single training step of a 2D Convolution layer,
the SIMD-optimized FP16 primitives result up to 1.72 faster than the
FP32 baseline on a RISC-V-based 8+1-core MCU. An average computing efficiency
of 3.11 Multiply and Accumulate operations per clock cycle (MAC/clk) and 0.81
MAC/clk is measured for the end-to-end training tasks of a ResNet8 and a DS-CNN
for Image Classification and Keyword Spotting, respectively -- requiring 17.1
ms and 6.4 ms on the target platform to compute a training step on a single
sample. Overall, our approach results more than two orders of magnitude faster
than existing ODL software frameworks for single-core MCUs and outperforms by
1.6 previous FP32 parallel implementations on a Continual Learning
setup.Comment: Pre-print version submitted to Elsevier's Future Generation Computer
Systems journal. For the associated open-source release, see
https://github.com/pulp-platform/pulp-trainli
Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO
Smart glasses are rapidly gaining advanced functionality thanks to
cutting-edge computing technologies, accelerated hardware architectures, and
tiny AI algorithms. Integrating AI into smart glasses featuring a small form
factor and limited battery capacity is still challenging when targeting
full-day usage for a satisfactory user experience. This paper illustrates the
design and implementation of tiny machine-learning algorithms exploiting novel
low-power processors to enable prolonged continuous operation in smart glasses.
We explore the energy- and latency-efficient of smart glasses in the case of
real-time object detection. To this goal, we designed a smart glasses prototype
as a research platform featuring two microcontrollers, including a novel
milliwatt-power RISC-V parallel processor with a hardware accelerator for
visual AI, and a Bluetooth low-power module for communication. The smart
glasses integrate power cycling mechanisms, including image and audio sensing
interfaces. Furthermore, we developed a family of novel tiny deep-learning
models based on YOLO with sub-million parameters customized for
microcontroller-based inference dubbed TinyissimoYOLO v1.3, v5, and v8, aiming
at benchmarking object detection with smart glasses for energy and latency.
Evaluations on the prototype of the smart glasses demonstrate TinyissimoYOLO's
17ms inference latency and 1.59mJ energy consumption per inference while
ensuring acceptable detection accuracy. Further evaluation reveals an
end-to-end latency from image capturing to the algorithm's prediction of 56ms
or equivalently 18 fps, with a total power consumption of 62.9mW, equivalent to
a 9.3 hours of continuous run time on a 154mAh battery. These results
outperform MCUNet (TinyNAS+TinyEngine), which runs a simpler task (image
classification) at just 7.3 fps per second
Reduced precision floating-point optimization for Deep Neural Network On-Device Learning on microcontrollers
Enabling On-Device Learning (ODL) for Ultra-Low-Power Micro-Controller Units (MCUs) is a key step for post-deployment adaptation and fine-tuning of Deep Neural Network (DNN) models in future TinyML applications. This paper tackles this challenge by introducing a novel reduced precision optimization technique for ODL primitives on MCU-class devices, leveraging the State-of-Art advancements in RISC-V RV32 architectures with support for vectorized 16-bit floating-point (FP16) Single-Instruction Multiple-Data (SIMD) operations. Our approach for the Forward and Backward steps of the Back Propagation training algorithm is composed of specialized shape transform operators and Matrix Multiplication (MM) kernels, accelerated with parallelization and loop unrolling. When evaluated on a single training step of a 2D Convolution layer, the SIMD-optimized FP16 primitives result up to 1.72x faster than the FP32 baseline on a RISC-V-based 8+1-core MCU. An average computing efficiency of 3.11 Multiply and Accumulate operations per clock cycle (MAC/clk) and 0.81 MAC/clk is measured for the end-to-end training tasks of a ResNet8 and a DS-CNN for Image Classification and Keyword Spotting, respectively - requiring 17.1 ms and 6.4 ms on the target platform to compute a training step on a single sample. Overall, our approach results more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs and outperforms by 1.6x previous FP32 parallel implementations on a Continual Learning setup.& COPY; 2023 Elsevier B.V. All rights reserved
A Construction Kit for Efficient Low Power Neural Network Accelerator Designs
Implementing embedded neural network processing at the edge requires
efficient hardware acceleration that couples high computational performance
with low power consumption. Driven by the rapid evolution of network
architectures and their algorithmic features, accelerator designs are
constantly updated and improved. To evaluate and compare hardware design
choices, designers can refer to a myriad of accelerator implementations in the
literature. Surveys provide an overview of these works but are often limited to
system-level and benchmark-specific performance metrics, making it difficult to
quantitatively compare the individual effect of each utilized optimization
technique. This complicates the evaluation of optimizations for new accelerator
designs, slowing-down the research progress. This work provides a survey of
neural network accelerator optimization approaches that have been used in
recent works and reports their individual effects on edge processing
performance. It presents the list of optimizations and their quantitative
effects as a construction kit, allowing to assess the design choices for each
building block separately. Reported optimizations range from up to 10'000x
memory savings to 33x energy reductions, providing chip designers an overview
of design choices for implementing efficient low power neural network
accelerators
- …