Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems
A personalized KeyWord Spotting (KWS) pipeline typically requires the
training of a Deep Learning model on a large set of user-defined speech
utterances, preventing fast customization directly applied on-device. To fill
this gap, this paper investigates few-shot learning methods for open-set KWS
classification by combining a deep feature encoder with a prototype-based
classifier. With user-defined keywords from 10 classes of the Google Speech
Command dataset, our study reports an accuracy of up to 76% in a 10-shot
scenario while the false acceptance rate of unknown data is kept to 5%. In the
analyzed settings, the usage of the triplet loss to train an encoder with
normalized output features performs better than the prototypical networks
jointly trained with a generator of dummy unknown-class prototypes. This design
is also more effective than encoders trained on a classification problem and
features fewer parameters than other iso-accuracy approaches.
Comment: Accepted at INTERSPEECH 202
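The combination of a feature encoder with a prototype-based open-set classifier described in this abstract can be illustrated roughly as follows. This is a minimal sketch, not the paper's implementation: the embeddings are hand-written toy vectors standing in for encoder outputs, the cosine-similarity rule is one common choice, and the names `build_prototypes`, `classify`, and the `threshold` value of 0.7 are hypothetical. In practice the rejection threshold would be tuned on held-out data to reach a target false acceptance rate.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (normalized output features)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_prototypes(support):
    """support maps each keyword to its few-shot embedding vectors.

    Each prototype is the normalized mean of the normalized shots.
    """
    prototypes = {}
    for label, shots in support.items():
        normed = [l2_normalize(s) for s in shots]
        dim = len(normed[0])
        mean = [sum(s[i] for s in normed) / len(normed) for i in range(dim)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def classify(embedding, prototypes, threshold=0.7):
    """Return (label, similarity); label is "unknown" below the threshold.

    The threshold is a hypothetical value chosen for this toy example.
    """
    query = l2_normalize(embedding)
    best_label, best_sim = None, -1.0
    for label, proto in prototypes.items():
        sim = sum(a * b for a, b in zip(query, proto))  # cosine similarity
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < threshold:
        return "unknown", best_sim
    return best_label, best_sim


# Toy 3-dimensional "embeddings" for two user-defined keywords (2 shots each).
support = {
    "lights_on": [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1]],
    "lights_off": [[0.0, 0.1, 0.9], [0.1, 0.0, 1.0]],
}
prototypes = build_prototypes(support)

known = classify([0.95, 0.15, 0.05], prototypes)
unknown = classify([0.0, 1.0, 0.0], prototypes)
print(known[0], unknown[0])
```

Adding a new keyword only requires computing one more prototype from a handful of examples, which is why this kind of classifier suits on-device customization without retraining the encoder.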
A sub-mW IoT-endnode for always-on visual monitoring and smart triggering
This work presents a fully-programmable Internet of Things (IoT) visual
sensing node that targets sub-mW power consumption in always-on monitoring
scenarios. The system features a spatial-contrast binary
pixel imager with focal-plane processing. The sensor, when working at its
lowest power mode ( at 10 fps), provides as output the number of
changed pixels. Based on this information, a dedicated camera interface,
implemented on a low-power FPGA, wakes up an ultra-low-power parallel
processing unit to extract context-aware visual information. We evaluate the
smart sensor on three always-on visual triggering application scenarios.
Triggering accuracy comparable to RGB image sensors is achieved at nominal
lighting conditions, while consuming an average power between and
, depending on context activity. The digital sub-system is extremely
flexible, thanks to a fully-programmable digital signal processing engine, but
still achieves 19x lower power consumption compared to MCU-based cameras with
significantly lower on-board computing capabilities.
Comment: 11 pages, 9 figures, submitted to IEEE IoT Journa
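The wake-up mechanism described in this abstract, where the sensor reports the number of changed pixels and the camera interface decides whether to activate the processor, can be sketched as a simple threshold test. This is only an illustration under stated assumptions: the function name `frames_that_wake`, the threshold of 50 changed pixels, and the per-frame counts are all hypothetical and do not come from the paper.

```python
def frames_that_wake(changed_counts, threshold=50):
    """Return indices of frames whose changed-pixel count reaches the threshold.

    The threshold is a hypothetical tuning parameter; below it the
    processing unit is assumed to stay asleep.
    """
    return [i for i, count in enumerate(changed_counts) if count >= threshold]


# Made-up per-frame changed-pixel counts: mostly static scene, brief activity.
counts = [3, 5, 2, 120, 340, 8, 1]
wake_frames = frames_that_wake(counts)
activity = len(wake_frames) / len(counts)  # fraction of frames that trigger processing

print(wake_frames, round(activity, 2))
```

The fraction of frames that trigger a wake-up is what makes average power depend on context activity: the less often the threshold is crossed, the longer the processor remains in its low-power state.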
Ultra-Low Power IoT Smart Visual Sensing Devices for Always-ON Applications
This work presents the design of a Smart Ultra-Low Power visual sensor architecture that couples together an ultra-low power event-based image sensor with a parallel and power-optimized digital architecture for data processing. By means of mixed-signal circuits, the imager generates a stream of address events after the extraction and binarization of spatial gradients.
When targeting monitoring applications, the sensing and processing energy costs can be reduced by two orders of magnitude thanks to the mixed-signal imaging technology, the event-based data compression, and the use of event-driven computing approaches.
From a system-level point of view, a context-aware power management scheme is enabled by means of a power-optimized sensor peripheral block that requests processor activation only when relevant information is detected within the focal plane of the imager. When targeting a smart visual node for triggering purposes, the event-driven approach brings a 10x power reduction with respect to other presented visual systems, while leading to comparable results in terms of detection accuracy. To further enhance the recognition capabilities of the smart camera system, this work introduces the concept of event-based binarized neural networks. By coupling the theory of binarized neural networks with focal-plane processing, a 17.8% energy reduction is demonstrated on a real-world data classification task, with a performance drop of 3% with respect to a baseline system featuring commercial visual sensors and a Binary Neural Network (BNN) engine. Moreover, when the BNN engine is coupled with the event-driven triggering detection flow, the average power consumption can be as low as the sleep power of 0.3 mW in case of infrequent events, which is 8x lower than a smart camera system featuring a commercial RGB imager.
Reduced precision floating-point optimization for Deep Neural Network On-Device Learning on microcontrollers
Enabling On-Device Learning (ODL) for Ultra-Low-Power Micro-Controller Units (MCUs) is a key step for post-deployment adaptation and fine-tuning of Deep Neural Network (DNN) models in future TinyML applications. This paper tackles this challenge by introducing a novel reduced precision optimization technique for ODL primitives on MCU-class devices, leveraging the state-of-the-art advancements in RISC-V RV32 architectures with support for vectorized 16-bit floating-point (FP16) Single-Instruction Multiple-Data (SIMD) operations. Our approach for the Forward and Backward steps of the Back-Propagation training algorithm is composed of specialized shape transform operators and Matrix Multiplication (MM) kernels, accelerated with parallelization and loop unrolling. When evaluated on a single training step of a 2D Convolution layer, the SIMD-optimized FP16 primitives are up to 1.72x faster than the FP32 baseline on a RISC-V-based 8+1-core MCU. An average computing efficiency of 3.11 Multiply and Accumulate operations per clock cycle (MAC/clk) and 0.81 MAC/clk is measured for the end-to-end training tasks of a ResNet8 and a DS-CNN for Image Classification and Keyword Spotting, respectively, requiring 17.1 ms and 6.4 ms on the target platform to compute a training step on a single sample. Overall, our approach is more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs and outperforms previous FP32 parallel implementations by 1.6x on a Continual Learning setup. © 2023 Elsevier B.V. All rights reserved
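The idea of expressing a convolution step as a shape transform followed by a matrix multiplication can be sketched as below. The abstract does not name the transform it uses, so this example assumes the common im2col layout (stride 1, no padding) purely for illustration. It runs in ordinary Python floats and does not model FP16 SIMD, parallelization, or loop unrolling. The function names `im2col`, `matmul`, and `conv2d_forward` are hypothetical.

```python
def im2col(x, kh, kw):
    """Unfold a C x H x W input into a (C*kh*kw) x (out_h*out_w) matrix.

    Assumes stride 1 and no padding. Rows are ordered by (channel, i, j).
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = []
    for ch in range(c):
        for i in range(kh):
            for j in range(kw):
                cols.append([x[ch][oy + i][ox + j]
                             for oy in range(out_h)
                             for ox in range(out_w)])
    return cols, out_h, out_w


def matmul(a, b):
    """Plain matrix multiplication on nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv2d_forward(x, weights, kh, kw):
    """weights: one flattened filter of length C*kh*kw per output channel."""
    cols, out_h, out_w = im2col(x, kh, kw)
    flat = matmul(weights, cols)  # C_out x (out_h*out_w)
    return [[row[r * out_w:(r + 1) * out_w] for r in range(out_h)]
            for row in flat]


# One 3x3 input channel and one 2x2 filter (top-left minus bottom-right).
x = [[[1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]]]
weights = [[1, 0, 0, -1]]
y = conv2d_forward(x, weights, 2, 2)
print(y)
```

Once the layer is in this form, the bulk of the work sits in the matrix multiplication, which is the part that benefits from vectorized arithmetic and multi-core parallelism.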
Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers
Enabling On-Device Learning (ODL) for Ultra-Low-Power Micro-Controller Units
(MCUs) is a key step for post-deployment adaptation and fine-tuning of Deep
Neural Network (DNN) models in future TinyML applications. This paper tackles
this challenge by introducing a novel reduced precision optimization technique
for ODL primitives on MCU-class devices, leveraging the state-of-the-art
advancements in RISC-V RV32 architectures with support for vectorized 16-bit
floating-point (FP16) Single-Instruction Multiple-Data (SIMD) operations. Our
approach for the Forward and Backward steps of the Back-Propagation training
algorithm is composed of specialized shape transform operators and Matrix
Multiplication (MM) kernels, accelerated with parallelization and loop
unrolling. When evaluated on a single training step of a 2D Convolution layer,
the SIMD-optimized FP16 primitives are up to 1.72x faster than the
FP32 baseline on a RISC-V-based 8+1-core MCU. An average computing efficiency
of 3.11 Multiply and Accumulate operations per clock cycle (MAC/clk) and 0.81
MAC/clk is measured for the end-to-end training tasks of a ResNet8 and a DS-CNN
for Image Classification and Keyword Spotting, respectively -- requiring 17.1
ms and 6.4 ms on the target platform to compute a training step on a single
sample. Overall, our approach is more than two orders of magnitude faster
than existing ODL software frameworks for single-core MCUs and outperforms
previous FP32 parallel implementations by 1.6x on a Continual Learning
setup.
Comment: Pre-print version submitted to Elsevier's Future Generation Computer
Systems journal. For the associated open-source release, see
https://github.com/pulp-platform/pulp-trainli