Keyword Spotting System and Evaluation of Pruning and Quantization Methods on Low-power Edge Microcontrollers
Keyword spotting (KWS) is beneficial for voice-based user interaction with
low-power devices at the edge. Since edge devices are usually always-on, edge
computing brings bandwidth savings and privacy protection. These devices, for
example Cortex-M based microcontrollers, typically have tight limits on memory,
computational performance, power, and cost. The challenge is to meet the
high-computation and low-latency requirements of deep learning on such devices.
This paper first presents our small-footprint KWS system running on an STM32F7
microcontroller with a Cortex-M7 core @ 216 MHz and 512 KB of static RAM. Our
selected convolutional neural network (CNN) architecture reduces the number of
operations for KWS to meet the constraints of edge devices. Our baseline system
generates a classification result every 37 ms, including the real-time audio
feature extraction stage. The paper further evaluates the actual performance of
different pruning and quantization methods on the microcontroller, including
different granularities of sparsity, skipping zero weights, weight-prioritized
loop ordering, and SIMD instructions. The results show that accelerating
unstructured pruned models on microcontrollers is considerably challenging, and
that structured pruning is more hardware-friendly than unstructured pruning.
The results also verify the performance improvements from quantization and SIMD
instructions.

Comment: Submitted to DCASE2022 Workshop. Code available at:
https://github.com/RoboBachelor/Keyword-Spotting-STM3
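The trade-off between unstructured and structured pruning described above can be sketched in a few lines of NumPy (an illustrative toy, not the paper's Cortex-M implementation): unstructured pruning scatters zeros through a tensor that keeps its shape, so a dense kernel still performs every MAC unless it checks each weight, whereas structured pruning removes whole channels and yields a genuinely smaller dense multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))  # toy weight matrix: 8 output channels x 16 inputs

# Unstructured pruning: zero the 50% smallest-magnitude weights anywhere.
thresh = np.median(np.abs(W))
W_unstructured = np.where(np.abs(W) >= thresh, W, 0.0)
# The matrix keeps its 8x16 shape; zeros are scattered, so a dense
# multiply still performs every MAC unless the kernel skips zeros.

# Structured pruning: drop the 4 output channels with the smallest L2 norm.
norms = np.linalg.norm(W, axis=1)
keep = np.argsort(norms)[4:]       # indices of the 4 strongest channels
W_structured = W[np.sort(keep)]    # a genuinely smaller 4x16 dense matrix

x = rng.standard_normal(16)
y_dense = W_structured @ x         # dense multiply over fewer rows: real speedup
```

The structured variant needs no sparse bookkeeping at inference time, which is why it maps more naturally onto dense SIMD kernels on a Cortex-M core.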
Target-Aware Neural Architecture Search and Deployment for Keyword Spotting
Keyword spotting (KWS) utilities have become increasingly popular on a wide range of mobile and home devices, representing a prolific application field for Convolutional Neural Networks (CNNs), which are commonly exploited to perform keyword classification. Addressing the challenges of targeting such resource-constrained platforms requires a careful definition of the CNN architecture and of the overall system implementation. These reasons have led to a growing need for design and optimization flows able to intrinsically take into account the system's performance when ported to the target platform. In this work, we present a design methodology based on Neural Architecture Search (NAS), exploited to combine the exploration of the optimal network topology, the audio pre-processing scheme, and the data quantization policy. The proposed design flow includes target-awareness in the exploration loop, comparing the different design alternatives according to a model-based pre-evaluation of metrics such as execution latency, memory footprint, and energy consumption, evaluated with respect to the application's execution on the target processing platform. We have tested our design flow by obtaining target-specific CNNs for a resource-constrained commercial platform, the ST SensorTile. Considering two different application scenarios, enabling the comparison with the state of the art of efficient CNN-based models for KWS, we have obtained up to a 1.8% accuracy improvement and a 40% footprint reduction in the most favorable case.
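The model-based pre-evaluation that drives such an exploration loop can be illustrated with a toy cost model (all function names, layer shapes, and the throughput figure below are hypothetical, not the paper's actual estimator): candidate CNNs are ranked by estimated weight memory and an idealized MAC-based latency proxy before any on-device measurement.

```python
# Illustrative sketch (all names hypothetical): a model-based pre-evaluation
# of memory footprint and latency, as a target-aware NAS loop might use to
# rank candidate CNNs without deploying them on the target platform.

def conv2d_cost(in_ch, out_ch, k, out_h, out_w, weight_bits=8):
    """Weight bytes and MAC count for one k x k conv layer."""
    params = in_ch * out_ch * k * k
    macs = params * out_h * out_w
    return params * weight_bits // 8, macs

def pre_evaluate(layers, macs_per_second=2e8):
    """Rank-ordering proxy: total weight memory (bytes) and an
    idealized latency (seconds) from a flat throughput assumption."""
    mem = sum(conv2d_cost(*layer)[0] for layer in layers)
    macs = sum(conv2d_cost(*layer)[1] for layer in layers)
    return mem, macs / macs_per_second

# Two candidate topologies for a KWS model (shapes are made up):
candidate_a = [(1, 16, 3, 32, 32), (16, 32, 3, 16, 16)]
candidate_b = [(1, 8, 3, 32, 32), (8, 16, 3, 16, 16)]
mem_a, lat_a = pre_evaluate(candidate_a)
mem_b, lat_b = pre_evaluate(candidate_b)
# The smaller candidate should dominate on both proxy metrics.
```

A real flow would calibrate such proxies against the target platform (here, something like the SensorTile) so that the exploration loop's rankings track measured latency and energy.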
Machine Learning for Microcontroller-Class Hardware -- A Review
Advancements in machine learning have opened a new opportunity to bring
intelligence to low-end Internet-of-Things nodes such as microcontrollers.
Conventional machine learning deployments have high memory and compute
footprints, hindering their direct deployment on ultra resource-constrained
microcontrollers. This paper highlights the unique requirements of enabling
onboard machine learning on microcontroller-class devices. Researchers use a
specialized model development workflow for resource-limited applications to
ensure that the compute and latency budgets stay within the device limits while
still maintaining the desired performance. We characterize a closed-loop,
widely applicable workflow of machine learning model development for
microcontroller-class devices and show that several classes of applications
adopt a specific instance of it. We present both qualitative and numerical
insights into different stages of model development by showcasing several use
cases. Finally, we identify the open research challenges and unsolved questions
that demand careful consideration moving forward.

Comment: Accepted for publication at IEEE Sensors Journal
Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes
Quantization is widely employed in both cloud and edge systems to reduce the memory occupation, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains with limited accuracy drops, especially with optimized bit-width assignments determined by automated Neural Architecture Search (NAS) tools. State-of-the-art mixed-precision quantization works layer-wise, i.e., it uses different bit-widths for the weights and activations tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs. model size and accuracy vs. energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce the memory and energy for inference by up to 63% and 27%, respectively, compared to a layer-wise approach, at the same accuracy.
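The idea of channel-wise bit-width assignment can be sketched with simple fake-quantization in NumPy (an illustrative toy; the bit-width vector here is fixed by hand, whereas the proposed NAS would search it): each output channel is quantized at its own precision, and lower-precision channels incur a larger reconstruction error.

```python
import numpy as np

def quantize_channel(w, bits):
    """Symmetric uniform fake-quantization of one weight channel."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 127 for 8-bit, 7 for 4-bit
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                             # dequantized values for simulation

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 64))                 # 4 output channels

# A channel-wise bit-width assignment (fixed here; a NAS tool would search it):
bits_per_channel = [8, 4, 2, 8]
W_q = np.stack([quantize_channel(W[c], b)
                for c, b in enumerate(bits_per_channel)])

# Per-channel reconstruction error: lower precision -> larger error,
# so high precision is worth spending only on the most informative channels.
err = np.mean((W - W_q) ** 2, axis=1)
```

The widened search space simply means choosing one entry of `bits_per_channel` per channel instead of a single bit-width per layer, which is the extra flexibility the abstract describes.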
Optimality Assessment of Memory-Bounded ConvNets Deployed on Resource-Constrained RISC Cores
A cost-effective implementation of Convolutional Neural Nets on the mobile edge of the Internet-of-Things (IoT) requires smart optimizations to fit large models into memory-constrained cores. Reduction methods that use a joint combination of filter pruning and weight quantization have proven efficient in searching for the compression that ensures minimum model size without accuracy loss. However, there exist other optimal configurations that stem from the memory constraint. The objective of this work is to assess such memory-bounded implementations and to show that most of them are centred on specific parameter settings that are difficult to implement on a low-power RISC core. Hence, the focus is on quantifying the distance to optimality of the closest implementations that can actually be deployed on hardware. The analysis is powered by a two-stage framework that efficiently explores the memory-accuracy space using a lightweight, hardware-conscious heuristic optimization. Results are collected from three realistic IoT tasks (Image Classification on CIFAR-10, Keyword Spotting on the Speech Commands Dataset, Facial Expression Recognition on Fer2013) run on RISC cores (Cortex-M by ARM) with a few hundred KB of on-chip RAM.
Are We There Yet? Product Quantization and its Hardware Acceleration
Conventional multiply-accumulate (MAC) operations have long dominated
computation time for deep neural networks (DNNs). Recently, product
quantization (PQ) has been successfully applied to these workloads, replacing
MACs with memory lookups of pre-computed dot products. While this property
makes PQ an attractive solution for model acceleration, little is understood
about the associated trade-offs in terms of compute and memory footprint, and
the impact on accuracy. Our empirical study investigates the impact of
different PQ settings and training methods on layerwise reconstruction error
and end-to-end model accuracy. When studying the efficiency of deploying PQ
DNNs, we find that metrics such as FLOPs, number of parameters, and even
CPU/GPU performance can be misleading. To address this issue, and to more
fairly assess PQ in terms of hardware efficiency, we design the first custom
hardware accelerator to evaluate the speed and efficiency of running PQ models.
We identify PQ configurations that improve performance-per-area for ResNet20 by
40%-104%, even when compared to a highly optimized conventional DNN
accelerator. Our hardware outperforms recent PQ solutions by 4x, with only a
0.6% accuracy degradation. This work demonstrates the practical, hardware-aware
design of PQ models, paving the way for wider adoption of this emerging DNN
approximation methodology.
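The core PQ mechanism, replacing a dot product's per-element MACs with a handful of table lookups, can be sketched as follows (an illustrative toy with random codebooks, not the paper's accelerator; real PQ learns the codebooks with k-means).

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 16, 4, 8          # vector dim, number of subspaces, codewords per subspace
d = D // M                  # subvector length

# A toy codebook per subspace (real PQ learns these with k-means).
codebooks = rng.standard_normal((M, K, d))

def encode(x):
    """Assign each subvector of x to its nearest codeword index."""
    codes = []
    for m in range(M):
        sub = x[m * d:(m + 1) * d]
        dists = np.sum((codebooks[m] - sub) ** 2, axis=1)
        codes.append(int(np.argmin(dists)))
    return codes

def pq_dot(codes, w):
    """Dot product via lookups: precompute codeword . w per subspace,
    then sum M table reads instead of D multiply-accumulates."""
    tables = np.array([codebooks[m] @ w[m * d:(m + 1) * d] for m in range(M)])
    return sum(tables[m, codes[m]] for m in range(M))

x = rng.standard_normal(D)
w = rng.standard_normal(D)
codes = encode(x)
approx = pq_dot(codes, w)
# 'approx' equals the dot product of w with the codebook reconstruction of x;
# with random codebooks the point is the mechanism, not the accuracy.
```

In a deployed PQ model the lookup tables are precomputed once per weight vector, which is exactly the property that makes the trade-offs hardware-dependent: lookups shift cost from arithmetic units to memory, so FLOP counts alone are misleading.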