    Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural Network Inference

    Memory optimization for deep neural network (DNN) inference gains high relevance with the emergence of TinyML, which refers to the deployment of DNN inference tasks on tiny, low-power microcontrollers. Applications such as audio keyword detection or radar-based gesture recognition are heavily constrained by the limited memory of such tiny devices, because DNN inference requires large intermediate run-time buffers to store activations and other intermediate data. In this paper, we propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs which, compared to existing tiling methods, reduces memory usage without inducing any run-time overhead. FDT applies to a larger variety of network layers than existing tiling methods, which focus on convolutions. It improves TinyML memory optimization significantly by reducing the memory of models where this was not possible before, and by providing alternative design points for models that show high run-time overhead with existing methods. To identify the best tiling configuration, we propose an end-to-end flow with a new path discovery method that applies FDT and existing tiling methods in a fully automated way, including the scheduling of operations and the planning of the buffer layout in memory. Out of seven evaluated models, FDT achieved significant memory reductions of 76.2% and 18.1% for two models where existing tiling methods could not be applied. Two other models showed significant run-time overhead with existing methods, and FDT provided alternative design points with no overhead but reduced memory savings.
    Comment: Accepted as a full paper by the TinyML Research Symposium 202
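    As an illustration of the memory argument behind depthwise tiling (a toy sketch with invented shapes, not the paper's FDT algorithm or tool flow), the following Python snippet compares the peak run-time buffer requirement of executing two layers with a full intermediate activation against a fused execution that keeps only a few channel slices of the intermediate alive at a time:

```python
# Toy peak-memory comparison: running two layers back to back with a full
# intermediate buffer vs. fusing them and tiling the intermediate activation
# along its depth (channel) dimension.  All shapes are invented for illustration.

def peak_memory_unfused(in_size, mid_size, out_size):
    # Layer 1 keeps input + full intermediate alive; layer 2 keeps intermediate + output.
    return max(in_size + mid_size, mid_size + out_size)

def peak_memory_fused_depthwise(in_size, mid_channel_size, out_size, tile_channels):
    # With fused execution, only `tile_channels` channel slices of the
    # intermediate buffer are alive at once; input and output stay resident.
    return in_size + tile_channels * mid_channel_size + out_size

if __name__ == "__main__":
    in_size = 32 * 32 * 16                      # input activation, in elements
    mid_channels, mid_channel_size = 64, 32 * 32
    out_size = 16 * 16 * 32

    print("unfused peak:",
          peak_memory_unfused(in_size, mid_channels * mid_channel_size, out_size))
    print("fused, 4-channel tiles:",
          peak_memory_fused_depthwise(in_size, mid_channel_size, out_size, tile_channels=4))
```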

    TinyProp -- Adaptive Sparse Backpropagation for Efficient TinyML On-device Learning

    Training deep neural networks using backpropagation is very memory- and computationally intensive. This makes it difficult to run on-device learning or to fine-tune neural networks on tiny, embedded devices such as low-power microcontroller units (MCUs). Sparse backpropagation algorithms try to reduce the computational load of on-device learning by training only a subset of the weights and biases. Existing approaches use a static number of weights to train; a poor choice of this so-called backpropagation ratio either limits the computational gain or can lead to severe accuracy losses. In this paper we present TinyProp, the first sparse backpropagation method that dynamically adapts the backpropagation ratio during on-device training for each training step. TinyProp induces a small calculation overhead for sorting the elements of the gradient, which does not significantly impact the computational gains. TinyProp works particularly well for fine-tuning trained networks on MCUs, a typical use case for embedded applications. On three typical datasets, MNIST, DCASE2020 and CIFAR10, TinyProp is on average 5 times faster than non-sparse training with an average accuracy loss of 1%. On average, TinyProp is 2.9 times faster than existing static sparse backpropagation algorithms, and the accuracy loss is reduced on average by 6% compared to a typical static setting of the backpropagation ratio.
    Comment: 7 Pages, AIPE Conference 202
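    The following rough sketch (not the authors' implementation; the adaptation rule and layer structure are stand-ins) illustrates the core idea: per training step, rank the gradient elements by magnitude and update only the fraction selected by a dynamically chosen backpropagation ratio:

```python
import numpy as np

def adaptive_topk_indices(grad, base_ratio=0.1, error_scale=1.0):
    """Pick which gradient entries to backpropagate this step.

    The ratio is adapted per step from the gradient magnitude (hypothetical
    adaptation rule; the paper's exact formula may differ)."""
    ratio = min(1.0, base_ratio * (1.0 + error_scale * np.mean(np.abs(grad))))
    k = max(1, int(ratio * grad.size))
    # Sort by magnitude and keep only the top-k entries.
    return np.argsort(np.abs(grad))[-k:]

def sparse_weight_update(weights, inputs, grad_out, lr=0.01):
    """Update only the output rows selected by the adaptive ratio (single dense layer)."""
    idx = adaptive_topk_indices(grad_out)
    # Outer-product gradient restricted to the selected output units.
    weights[idx, :] -= lr * np.outer(grad_out[idx], inputs)
    return weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 32))          # 10 output units, 32 inputs
    x = rng.normal(size=32)
    err = rng.normal(size=10)              # output-layer error terms
    W = sparse_weight_update(W, x, err)
```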

    Neural network-based vehicle image classification for IoT devices

    Convolutional Neural Networks (CNNs) have previously provided unforeseen results in automatic image analysis and interpretation, an area with numerous applications in both consumer electronics and industry. However, the signal processing related to CNNs is computationally very demanding, which has prohibited their use on the smallest embedded computing platforms, to which many Internet of Things (IoT) devices belong. Fortunately, in recent years researchers have developed many approaches for optimizing the performance and shrinking the memory footprint of CNNs. This paper presents a neural-network-based image classifier that has been trained to classify vehicle images into four different classes. The neural network is optimized by a technique called binarization, and the resulting binarized network is deployed on an IoT-class processor core for execution. Binarization reduces the memory footprint of the CNN by around 95% and increases performance by more than 6×. Furthermore, we show that by utilizing a custom 'popcount' instruction of the processor, the performance of the binarized vehicle classifier can be increased by more than another 2×, making the CNN-based image classifier suitable for the smallest embedded processors.
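    The speedup comes from the binarized arithmetic itself: with weights and activations reduced to ±1 and packed into machine words, a dot product collapses to XNOR followed by a population count. A minimal illustrative sketch (not the paper's kernel, which runs on an IoT-class core with a custom popcount instruction):

```python
# Illustrative binarized dot product: values are +1/-1, packed into a bitmask
# (1 bit per value), so the dot product reduces to XNOR + popcount.

def pack_bits(values):
    """Pack a list of +1/-1 values into an integer bitmask (bit set means +1)."""
    mask = 0
    for i, v in enumerate(values):
        if v > 0:
            mask |= 1 << i
    return mask

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed ±1 vectors of length n via XNOR + popcount."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    matches = bin(xnor).count("1")      # popcount; a custom instruction on the target core
    return 2 * matches - n              # agreements minus disagreements

if __name__ == "__main__":
    a = [1, -1, 1, 1, -1, -1, 1, -1]
    b = [1, 1, -1, 1, -1, 1, 1, -1]
    assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == sum(x * y for x, y in zip(a, b))
```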

    MonTM: Monitoring-Based Thermal Management for Mixed-Criticality Systems

    With the rapidly growing functionality of embedded real-time applications, it becomes inevitable to integrate tasks of different safety integrity levels on one many-core processor, leading to a large-scale mixed-criticality system. In this process, it is not sufficient to only isolate shared architectural resources, as tasks executing on different cores can also interfere via the many-core processor's thermal management. This can lead to best-effort tasks causing deadline violations for safety-critical tasks. To prevent such a scenario, we propose a monitoring-based hardware extension that communicates imminent thermal violations between cores via a lightweight interconnect. Building on this infrastructure, we propose a thermal strategy in which best-effort tasks can be throttled in favor of safety-critical tasks. Furthermore, assigning static voltage/frequency (V/f) levels to each safety-critical task based on its worst-case execution time may result in unnecessarily high V/f levels when the actual execution finishes faster. To free the otherwise wasted thermal resources, our solution monitors the progress of safety-critical tasks to detect slack and safely reduce their V/f levels. This increases the thermal headroom for best-effort tasks, boosting their performance. In our evaluation, we demonstrate our approach on an 80-core processor and show that it satisfies the thermal and deadline requirements while reducing the run time of best-effort tasks by up to 45% compared to the state of the art.
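    A simplified sketch of the decision logic described above, with invented core/task structures and thresholds: when any core signals an imminent thermal violation, best-effort cores are throttled first; when a safety-critical task runs ahead of its worst-case schedule, its V/f level is lowered to free thermal headroom:

```python
# Illustrative decision logic for monitoring-based thermal management.
# Core/task structures and thresholds are invented for this sketch.

from dataclasses import dataclass

@dataclass
class Core:
    core_id: int
    critical: bool          # runs a safety-critical task?
    temp: float             # current temperature (degrees C)
    vf_level: int           # voltage/frequency level index (higher = faster)
    progress: float = 0.0   # fraction of the task's work completed
    elapsed: float = 0.0    # fraction of the WCET budget consumed

THERMAL_LIMIT = 85.0
SLACK_MARGIN = 0.10         # must be 10% ahead of the WCET schedule before slowing down

def thermal_step(cores):
    hot = any(c.temp >= THERMAL_LIMIT for c in cores)
    for c in cores:
        if hot and not c.critical:
            # Throttle best-effort cores first so critical deadlines still hold.
            c.vf_level = max(0, c.vf_level - 1)
        elif c.critical and c.progress > c.elapsed + SLACK_MARGIN:
            # Task is ahead of its worst-case schedule: lower V/f to free headroom.
            c.vf_level = max(0, c.vf_level - 1)
        elif not hot and not c.critical:
            # Thermal headroom available: boost best-effort performance again.
            c.vf_level += 1

if __name__ == "__main__":
    cores = [Core(0, True, 70.0, 3, progress=0.6, elapsed=0.4),
             Core(1, False, 86.0, 4)]
    thermal_step(cores)
    print([(c.core_id, c.vf_level) for c in cores])
```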

    Key-Recovery Fault Injection Attack on the Classic McEliece KEM

    We present a key-recovery fault injection attack on the Classic McEliece Key Encapsulation Mechanism (KEM). The fault injections target the error-locator polynomial of the Goppa code and the validity checks in the decryption algorithm, making a chosen-ciphertext attack possible. Faulty decryption outputs are used to generate a system of polynomial equations in the secret support elements of the Goppa code. After solving the equations, we can determine a suitable Goppa polynomial and form an alternative secret key. To demonstrate the feasibility of the attack on hardware, we simulate the fault injections on virtual prototypes of two RISC-V cores at the register-transfer level.

    Memory Latency Distribution-Driven Regulation for Temporal Isolation in MPSoCs

    Temporal isolation is one of the most significant challenges that must be addressed before Multi-Processor Systems-on-Chip (MPSoCs) can be widely adopted in mixed-criticality systems with both time-sensitive real-time (RT) applications and performance-oriented non-real-time (NRT) applications. Specifically, the main memory subsystem is one of the most prevalent causes of interference, performance degradation and loss of isolation. Existing memory bandwidth regulation mechanisms use static, dynamic, or predictive DRAM bandwidth management techniques to restore the execution time of an application under contention as close as possible to its execution time in isolation. In this paper, we propose a novel distribution-driven regulation whose goal is to achieve a timeliness objective formulated as a constraint on the probability of meeting a certain target execution time for the RT applications. Using existing interconnect-level Performance Monitoring Units (PMUs), we observe the Cumulative Distribution Function (CDF) of the per-request memory latency. Regulation is then triggered to enforce first-order stochastic dominance with respect to a desired reference. Consequently, it is possible to enforce that the overall observed execution time random variable is dominated by the reference execution time. The mechanism requires no prior information about the contending applications and treats the DRAM subsystem as a black box. We provide a full-stack implementation of our mechanism on a Commercial Off-The-Shelf (COTS) platform (Xilinx Ultrascale+ MPSoC), evaluate it using real and synthetic benchmarks, experimentally validate that the timeliness objectives are met for the RT applications, and demonstrate that it provides 2.2x more overall throughput for NRT applications compared to DRAM bandwidth management-based regulation approaches.
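    The dominance test at the heart of such a regulation loop can be sketched as follows (simplified; PMU sampling and the actual regulation actuation are not shown): build the empirical CDF of the observed per-request latencies and check first-order stochastic dominance against a reference CDF, triggering regulation whenever the observed distribution is not dominated:

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Empirical CDF of `samples` evaluated on `grid`."""
    samples = np.sort(np.asarray(samples))
    return np.searchsorted(samples, grid, side="right") / len(samples)

def is_dominated_by_reference(observed, reference, grid):
    """True if the observed latencies are stochastically no larger than the reference,
    i.e. the observed CDF lies at or above the reference CDF everywhere on `grid`
    (first-order stochastic dominance)."""
    return bool(np.all(empirical_cdf(observed, grid) >= empirical_cdf(reference, grid)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    reference = rng.gamma(shape=2.0, scale=50.0, size=10_000)   # latencies in ns (made up)
    observed = rng.gamma(shape=2.0, scale=80.0, size=10_000)    # heavier contention
    grid = np.linspace(0, 1_000, 200)
    if not is_dominated_by_reference(observed, reference, grid):
        print("observed latency CDF violates the reference: trigger regulation")
```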

    The Universal Safety Format in Action: Tool Integration and Practical Application

    Designing software that meets the stringent requirements of functional safety standards imposes a significant development effort compared to conventional software. A key aspect is the integration of safety mechanisms into the functional design to ensure a safe state during operation even in the event of hardware errors. These safety mechanisms can be applied at different levels of abstraction during the development process and are usually implemented and integrated manually into the design. This not only causes significant effort but also reduces the overall maintainability of the software. To mitigate this, we present the Universal Safety Format (USF), which enables the generation of safety mechanisms based on the separation-of-concerns principle in a model-driven approach. Safety mechanisms are described as generic patterns using a transformation language independent of the functional design or any particular programming language. The USF was designed to be easily integrated into existing tools and workflows that can support different programming languages. Tools supporting the USF can apply the patterns to a functional design to generate and integrate specific safety mechanisms for different languages using the transformation rules contained within the patterns. This enables not only the reuse of safety patterns in different designs, but also across different programming languages. The approach is demonstrated with an automotive use case as well as different tools supporting the USF.

    Automated HW/SW co-design for edge AI : State, challenges and steps ahead

    Gigantic rates of data production in the era of Big Data, the Internet of Things (IoT), and Smart Cyber-Physical Systems (CPS) pose incessantly escalating demands for massive data processing, storage, and transmission while continuously interacting with the physical world using edge sensors and actuators. For IoT systems, there is now a strong trend to move the intelligence from the cloud to the edge or the extreme edge (known as TinyML). Yet, this shift to edge AI systems requires designing powerful machine learning systems under very strict resource constraints. This poses a difficult design task that needs to take the complete system stack into account, from the machine learning algorithm, to model optimization and compression, to software implementation, to hardware platform and ML accelerator design. This paper discusses the open research challenges in achieving such a holistic Design Space Exploration for HW/SW Co-design of Edge AI Systems and presents the current state with three flows currently under development: one design flow for systems with tightly coupled accelerator architectures based on RISC-V, one approach using loosely coupled, application-specific accelerators, and one framework that integrates software and hardware optimization techniques to build efficient Deep Neural Network (DNN) systems.

    Design and validation of fault-tolerant embedded controllers

    Embedded control systems are an important and often safety-critical class of applications that need to operate reliably even in the presence of faults. We show that intermittent fault scenarios caused by wear-out effects, due to the higher density and smaller geometry of embedded electronic components, may become a reliability concern for real-time embedded control applications. To mitigate the effects of such intermittent faults, we propose a novel fault-tolerant controller design method such that the resulting controllers ensure closed-loop stability (i.e., guarantee safety) with only possibly degraded performance under such fault scenarios. To measure the amortized performance offered by software implementations of such fault-tolerant controllers, we provide a program analysis methodology that statically estimates the quality of control guaranteed by the C-code implementation of the fault-tolerant control law. This combination of fault-tolerant controller design followed by performance feedback computed using a formal analysis is illustrated with a case study from the automotive domain.
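    A toy illustration, not the paper's design method, of the kind of stability argument involved: for a scalar plant whose control input can be lost for a bounded number of consecutive steps due to an intermittent fault, the closed loop stays stable as long as the worst-case growth over one applied-plus-dropped cycle is below one (all numbers invented):

```python
# Toy stability check for a scalar plant x[k+1] = a*x[k] + b*u[k] whose control
# input can be lost (u = 0) for at most `max_drop` consecutive steps due to an
# intermittent fault.  All numbers are invented for illustration.

def worst_case_cycle_gain(a, b, K, max_drop):
    """Growth of |x| over one applied control step followed by `max_drop` open-loop steps."""
    return abs(a - b * K) * abs(a) ** max_drop

def simulate(a, b, K, fault_pattern, x0=1.0):
    """Simulate the closed loop; fault_pattern[k] == True means the input is dropped."""
    x = x0
    for faulty in fault_pattern:
        u = 0.0 if faulty else -K * x
        x = a * x + b * u
    return x

if __name__ == "__main__":
    a, b, K, max_drop = 1.3, 1.0, 1.1, 3      # open-loop unstable plant, candidate gain
    gain = worst_case_cycle_gain(a, b, K, max_drop)
    verdict = "stable under this fault model" if gain < 1.0 else "not guaranteed stable"
    print(f"worst-case cycle gain: {gain:.3f} -> {verdict}")
    # One applied update followed by three dropped updates, repeated five times.
    pattern = [False, True, True, True] * 5
    print("state after 20 steps:", simulate(a, b, K, pattern))
```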