13 research outputs found
MobileTL: On-device Transfer Learning with Inverted Residual Blocks
Transfer learning on edge devices is challenging due to limited on-device
resources. Existing work addresses this issue by training only a subset of
parameters or by adding model patches. Developed with inference in mind,
Inverted Residual Blocks (IRBs) split a convolutional layer into depthwise and
pointwise convolutions, leading to more stacked layers, e.g., convolution,
normalization, and activation layers. Though efficient for inference, IRBs
require additional activation maps to be stored in memory when training the
convolution weights and normalization scales. As a result, their high memory
cost prohibits training IRBs on resource-limited edge devices and makes them
unsuitable in the context of transfer learning. To address this issue, we
present MobileTL, a memory- and computation-efficient on-device
transfer learning method for models built with IRBs. MobileTL trains the shifts
for internal normalization layers to avoid storing activation maps for the
backward pass. In addition, MobileTL approximates the backward computation of
activation layers (e.g., Hard-Swish and ReLU6) with a signed function, which
makes it possible to store a binary mask instead of full activation maps for the
backward pass.
MobileTL fine-tunes a few top blocks (close to the output) rather than
propagating the gradient through the whole network, which reduces the
computation cost. Our method reduces memory usage by 46% and 53% for
MobileNetV2 and V3 IRBs, respectively. For MobileNetV3, we observe a 36%
reduction in floating-point operations (FLOPs) when fine-tuning 5 blocks,
while incurring only a 0.6% accuracy reduction on CIFAR10. Extensive
experiments on multiple datasets demonstrate that our method is Pareto-optimal
(best accuracy under given hardware constraints) compared to prior work on
transfer learning for edge devices.
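The signed-function approximation described in the abstract can be pictured with a small sketch. This is an illustrative NumPy reconstruction, not the authors' code: the forward pass computes exact Hard-Swish, while the backward pass approximates the derivative with the sign of the input, so only a 1-bit mask per element needs to be kept for the backward pass instead of the full activation map.

```python
import numpy as np

def hard_swish_forward(x):
    """Exact Hard-Swish forward; also returns the binary mask
    saved for the backward pass (~1 bit/element instead of fp32)."""
    mask = x > 0
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0, mask

def hard_swish_backward_approx(grad_out, mask):
    """Approximate gradient: d/dx Hard-Swish(x) ~= 1[x > 0]."""
    return grad_out * mask
```

For a feature map of N fp32 elements, this shrinks the stored tensor from 32N bits to N bits, which is one reading of how the reported 46-53% IRB memory savings arise.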
APQ: Joint Search for Network Architecture, Pruning and Quantization Policy
We present APQ for efficient deep learning inference on resource-constrained
hardware. Unlike previous methods that separately search the neural
architecture, pruning policy, and quantization policy, we optimize them in a
joint manner. To deal with the larger design space it brings, a promising
approach is to train a quantization-aware accuracy predictor to quickly get the
accuracy of the quantized model and feed it to the search engine to select the
best fit. However, training this quantization-aware accuracy predictor requires
collecting a large number of (quantized model, accuracy) pairs, which involves
quantization-aware fine-tuning and is thus highly time-consuming. To tackle
this challenge, we propose transferring the knowledge from a full-precision
(i.e., fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy
predictor, which greatly improves sample efficiency. Moreover, collecting the
dataset for the fp32 accuracy predictor only requires evaluating neural
networks sampled from a pretrained once-for-all network, without any training
cost, which is highly efficient. Extensive experiments on ImageNet
demonstrate the benefits of our joint optimization approach. With the same
accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
Compared to the separate optimization approach (ProxylessNAS+AMC+HAQ), APQ
achieves 2.3% higher ImageNet accuracy while reducing GPU hours and CO2
emissions by orders of magnitude, pushing the frontier of environmentally
friendly green AI. The code and video are publicly available.
Comment: Accepted by CVPR 2020
Lightweight Neural Architecture Search for Temporal Convolutional Networks at the Edge
Neural Architecture Search (NAS) is quickly becoming the go-to approach to
optimize the structure of Deep Learning (DL) models for complex tasks such as
Image Classification or Object Detection. However, many other relevant
applications of DL, especially at the edge, are based on time-series processing
and require models with unique features, for which NAS is less explored. This
work focuses in particular on Temporal Convolutional Networks (TCNs), a
convolutional model for time-series processing that has recently emerged as a
promising alternative to more complex recurrent architectures. We propose the
first NAS tool that explicitly targets the optimization of the most distinctive
architectural parameters of TCNs, namely dilation, receptive field, and the
number of features in each layer. The proposed approach searches for networks that
offer good trade-offs between accuracy and number of parameters/operations,
enabling an efficient deployment on embedded platforms. We test the proposed
NAS on four real-world, edge-relevant tasks, involving audio and bio-signals.
Results show that, starting from a single seed network, our method obtains a
rich collection of Pareto-optimal architectures, among which are models with
the same accuracy as the seed and 15.9-152x fewer
parameters. Compared to three state-of-the-art NAS tools, ProxylessNAS,
MorphNet and FBNetV2, our method explores a larger search space for TCNs (up to
10^12x) and obtains superior solutions, while requiring low GPU memory and
search time. We deploy our NAS outputs on two distinct edge devices, the
multicore GreenWaves Technology GAP8 IoT processor and the single-core
STMicroelectronics STM32H7 microcontroller. Compared to state-of-the-art
hand-tuned models, we reduce latency and energy by up to 5.5x and 3.8x on the
two targets, respectively, without any accuracy loss.
Comment: Accepted for publication at the IEEE Transactions on Computers
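Dilation and receptive field, the TCN parameters this NAS searches over, are tightly coupled: for a stack of causal dilated convolutions, the receptive field is 1 + sum over layers of (kernel_size - 1) * dilation. A minimal helper (illustrative, not part of the paper's tool) shows why dilation is such an effective search knob:

```python
def tcn_receptive_field(kernel_sizes, dilations):
    """Receptive field of stacked causal dilated conv layers:
    RF = 1 + sum_l (k_l - 1) * d_l."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Doubling the dilation at each layer grows the receptive field
# exponentially with depth, at constant parameter cost per layer.
print(tcn_receptive_field([3, 3, 3], [1, 2, 4]))  # -> 15
```

This is why a NAS can trade parameters for temporal context by tuning dilations instead of widening or deepening the network.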
Multi-Complexity-Loss DNAS for Energy-Efficient and Memory-Constrained Deep Neural Networks
Neural Architecture Search (NAS) is increasingly popular to automatically
explore the accuracy versus computational complexity trade-off of Deep Learning
(DL) architectures. When targeting tiny edge devices, the main challenge for DL
deployment is matching the tight memory constraints, hence most NAS algorithms
consider model size as the complexity metric. Other methods reduce the energy
or latency of DL models by trading off accuracy and number of inference
operations. Energy and memory are rarely considered simultaneously, in
particular by low-search-cost Differentiable NAS (DNAS) solutions. We overcome
this limitation by proposing the first DNAS that directly addresses the most
realistic scenario from a designer's perspective: the co-optimization of
accuracy and energy (or latency) under a memory constraint determined by the
target HW. We do so by combining two complexity-dependent loss functions during
training, each with an independently tunable strength. Testing on three edge-relevant tasks from
the MLPerf Tiny benchmark suite, we obtain rich Pareto sets of architectures in
the energy vs. accuracy space, with memory footprint constraints spanning from
75% to 6.25% of the baseline networks. When deployed on a commercial edge
device, the STM NUCLEO-H743ZI2, our networks span a range of 2.18x in energy
consumption and 4.04% in accuracy for the same memory constraint, and reduce
energy by up to 2.2x with negligible accuracy drop with respect to the
baseline.
Comment: Accepted for publication at the ISLPED 2022 ACM/IEEE International Symposium on Low Power Electronics and Design
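The "two complexity-dependent loss functions with independent strength" can be sketched as follows. This is a hypothetical formulation (the function names and the exact form of the penalty are assumptions, not the paper's equations): one differentiable term trades accuracy against energy, while a hinge-style term penalizes only the memory footprint that exceeds the hardware budget.

```python
def multi_complexity_loss(task_loss, energy_cost, mem_footprint,
                          mem_budget, energy_strength, mem_strength):
    """Combine task loss with two complexity terms of independent
    strength: a soft energy trade-off and a memory-budget penalty."""
    # Only memory above the budget is penalized (hinge-style term),
    # so the search is free below the constraint.
    mem_excess = max(mem_footprint - mem_budget, 0.0)
    return (task_loss
            + energy_strength * energy_cost
            + mem_strength * mem_excess)
```

Sweeping `energy_strength` while keeping `mem_strength` large enough to enforce the budget would then trace an energy-vs-accuracy Pareto front under a fixed memory constraint, matching the scenario the abstract describes.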
Optimizing AI at the Edge: from network topology design to MCU deployment
The first topic analyzed in the thesis will be Neural Architecture Search (NAS).
I will focus on two different tools that I developed, one to optimize the architecture of Temporal Convolutional Networks (TCNs), a convolutional model for time-series processing that has recently emerged, and one to optimize the data precision of tensors inside CNNs.
The first NAS proposed explicitly targets the optimization of the most peculiar architectural parameters of TCNs, namely dilation, receptive field, and the number of features in each layer. Note that this is the first NAS that explicitly targets these networks.
The second NAS proposed instead focuses on finding the most efficient data format for a target CNN, at the granularity of individual layer filters. Applying these two NASes in sequence allows an "application designer" to shrink the structure of the employed neural network, minimizing either the number of operations or the memory usage of the network.
After that, the second topic described is the optimization of neural network deployment on edge devices. Exploiting the scarce resources of edge platforms is critical for efficient NN execution on MCUs.
To do so, I will introduce DORY (Deployment Oriented to memoRY) -- an automatic tool to deploy CNNs on low-cost MCUs.
DORY, through a series of steps, can automatically manage the different levels of memory inside the MCU, offload the computation workload (i.e., the different layers of a neural network) to dedicated hardware accelerators, and generate ANSI C code that orchestrates off- and on-chip transfers together with the computation phases.
On top of this, I will introduce two optimized computation libraries that DORY can exploit to deploy TCNs and Transformers on edge efficiently.
I conclude the thesis with two different applications in bio-signal analysis, i.e., heart rate tracking and sEMG-based gesture recognition.