273 research outputs found

    Sparsification and Separation of Deep Learning Layers for Constrained Resource Inference on Wearables

    Get PDF
    Deep learning has revolutionized the way sensor data are analyzed and interpreted. The accuracy gains these approaches offer make them attractive for the next generation of mobile, wearable and embedded sensory applications. However, state-of-the-art deep learning algorithms typically require a significant amount of device and processor resources, even just for the inference stages that are used to discriminate high-level classes from low-level data. The limited availability of memory, computation, and energy on mobile and embedded platforms thus poses a significant challenge to the adoption of these powerful learning techniques. In this paper, we propose SparseSep, a new approach that leverages the sparsification of fully connected layers and separation of convolutional kernels to reduce the resource requirements of popular deep learning algorithms. As a result, SparseSep allows large-scale DNNs and CNNs to run efficiently on mobile and embedded hardware with only minimal impact on inference accuracy. We experiment using SparseSep across a variety of common processors such as the Qualcomm Snapdragon 400, ARM Cortex M0 and M3, and Nvidia Tegra K1, and show that it allows inference for various deep models to execute more efficiently; for example, on average requiring 11.3 times less memory and running 13.3 times faster on these representative platforms.
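
    A minimal sketch of the fully-connected-layer sparsification idea: the weight matrix is replaced by two thin low-rank factors, so inference needs two small matrix-vector products instead of one large one. This NumPy example uses a truncated SVD with an arbitrary rank k; it illustrates the general technique, not SparseSep's exact factorization.

        import numpy as np

        def factorize_fc(W, k):
            """Approximate an (out x in) FC weight matrix with rank-k factors.

            Storage and multiplies drop from out*in to k*(out + in).
            """
            U, S, Vt = np.linalg.svd(W, full_matrices=False)
            A = U[:, :k] * S[:k]   # (out x k)
            B = Vt[:k, :]          # (k x in)
            return A, B

        rng = np.random.default_rng(0)
        # Hypothetical near-low-rank layer, so a small k captures it well.
        W = rng.standard_normal((1024, 48)) @ rng.standard_normal((48, 4096))
        A, B = factorize_fc(W, k=64)
        x = rng.standard_normal(4096)
        y_full = W @ x
        y_low = A @ (B @ x)   # two thin matmuls replace one large one
        print(np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))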

    Machine Learning for Microcontroller-Class Hardware -- A Review

    Full text link
    Advances in machine learning have opened a new opportunity to bring intelligence to low-end Internet-of-Things nodes such as microcontrollers. Conventional machine learning deployments have high memory and compute footprints, hindering their direct deployment on ultra-resource-constrained microcontrollers. This paper highlights the unique requirements of enabling onboard machine learning for microcontroller-class devices. Researchers use a specialized model development workflow for resource-limited applications to ensure that the compute and latency budgets stay within device limits while still maintaining the desired performance. We characterize a closed-loop, widely applicable workflow of machine learning model development for microcontroller-class devices and show that several classes of applications adopt a specific instance of it. We present both qualitative and numerical insights into different stages of model development by showcasing several use cases. Finally, we identify the open research challenges and unsolved questions that demand careful consideration moving forward.
    Comment: Accepted for publication at IEEE Sensors Journal
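
    The closed-loop workflow described here can be pictured as an iterate-until-it-fits loop: propose a model configuration, estimate its footprint against the device budget, train and evaluate, and shrink until both the resource constraints and the accuracy target are met. The Python sketch below is purely schematic: the budgets, the footprint model, and the train_and_score hook are made-up placeholders, not the paper's procedure.

        FLASH_BUDGET = 32 * 1024   # bytes for 8-bit weights (hypothetical)
        RAM_BUDGET = 16 * 1024     # bytes for peak activations (hypothetical)

        def footprint(width):
            """Toy estimate for a 2-layer MLP: 64-dim input, 10 classes."""
            weight_bytes = 64 * width + width * 10   # 8-bit weights
            peak_activation_bytes = max(64, width, 10)
            return weight_bytes, peak_activation_bytes

        def train_and_score(width):
            """Placeholder for real training; wider models score higher."""
            return min(0.99, 0.70 + 0.05 * width ** 0.5 / 4)

        width, target = 1024, 0.85
        while width >= 8:
            flash, ram = footprint(width)
            if flash <= FLASH_BUDGET and ram <= RAM_BUDGET:
                acc = train_and_score(width)
                print(f"width={width}: {flash} B flash, {ram} B RAM, acc={acc:.2f}")
                if acc >= target:
                    break   # fits the device and meets the accuracy target
            width //= 2     # shrink the model and try again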

    Sparsifying Deep Learning Layers for Constrained Resource Inference on Wearables

    Get PDF
    Deep learning has revolutionized the way sensor data are analyzed and interpreted. The accuracy gains these approaches offer make them attractive for the next generation of mobile, wearable and embedded sensory applications. However, state-of-the-art deep learning algorithms typically require a significant amount of device and processor resources, even just for the inference stages that are used to discriminate high-level classes from low-level data. The limited availability of memory, computation, and energy on mobile and embedded platforms thus poses a significant challenge to the adoption of these powerful learning techniques. In this paper, we propose SparseSep, a new approach that leverages the sparsification of fully connected layers and separation of convolutional kernels to reduce the resource requirements of popular deep learning algorithms. As a result, SparseSep allows large-scale DNNs and CNNs to run efficiently on mobile and embedded hardware with only minimal impact on inference accuracy. We experiment using SparseSep across a variety of common processors such as the Qualcomm Snapdragon 400, ARM Cortex M0 and M3, and Nvidia Tegra K1, and show that it allows inference for various deep models to execute more efficiently; for example, on average requiring 11.3 times less memory and running 13.3 times faster on these representative platforms.
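
    A quick way to see why separating convolutional kernels saves resources is to count multiply-accumulates: a k x k convolution over all input channels is replaced by a per-channel (depthwise) k x k pass followed by a 1 x 1 cross-channel (pointwise) pass. The sketch below does that arithmetic for an arbitrary layer shape; it illustrates the general separable-convolution idea, not the paper's specific decomposition.

        def conv_macs(h, w, k, c_in, c_out):
            """MACs for a standard k x k convolution with same padding."""
            return h * w * k * k * c_in * c_out

        def separable_macs(h, w, k, c_in, c_out):
            """Depthwise k x k per channel, then 1 x 1 pointwise mixing."""
            depthwise = h * w * k * k * c_in
            pointwise = h * w * c_in * c_out
            return depthwise + pointwise

        h, w, k, c_in, c_out = 32, 32, 3, 64, 128   # hypothetical layer shape
        full = conv_macs(h, w, k, c_in, c_out)
        sep = separable_macs(h, w, k, c_in, c_out)
        print(f"full: {full:,} MACs  separable: {sep:,} MACs  savings: {full / sep:.1f}x")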

    NASA JSC neural network survey results

    Get PDF
    A survey of Artificial Neural Systems in support of NASA's (Johnson Space Center) Automatic Perception for Mission Planning and Flight Control Research Program was conducted. Several of the world's leading researchers contributed papers containing their most recent results on artificial neural systems. These papers were grouped into categories, and descriptive accounts of the results make up a large part of this report. Also included is material on sources of information on artificial neural systems, such as books, technical reports, and software tools.

    A low-power, high-performance speech recognition accelerator

    Get PDF
    Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost, which is not affordable for power-constrained mobile devices. Hardware acceleration reduces the energy consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in these accelerators. It includes a prefetching scheme tailored to the needs of ASR systems that hides main-memory latency for a large fraction of memory accesses with negligible impact on area. Additionally, we introduce a novel bandwidth-saving technique that reduces off-chip memory accesses by 20 percent. Finally, we present a power-saving technique that significantly reduces the leakage power of the accelerator's scratchpad memories, providing between an 8.5 and 29.2 percent reduction in overall power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on a GeForce GTX 980 GPU, while reducing energy by 123-454x.
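
    For context on the bottleneck the accelerator targets, below is a textbook log-domain Viterbi decode over a small HMM in NumPy. A production ASR decoder searches a far larger graph with beam pruning, which is exactly where the memory-latency and bandwidth problems addressed above arise; this is a minimal sketch of the algorithm, not the accelerator's implementation.

        import numpy as np

        def viterbi(log_A, log_B, log_pi, obs):
            """Most likely state sequence for the observations `obs`.

            log_A: (S, S) log transition probs; log_B: (S, V) log emission
            probs; log_pi: (S,) log initial probs.
            """
            S, T = log_A.shape[0], len(obs)
            delta = np.empty((T, S))            # best log-prob per state
            psi = np.zeros((T, S), dtype=int)   # backpointers
            delta[0] = log_pi + log_B[:, obs[0]]
            for t in range(1, T):
                scores = delta[t - 1][:, None] + log_A   # prev -> current
                psi[t] = scores.argmax(axis=0)
                delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
            path = [int(delta[-1].argmax())]
            for t in range(T - 1, 0, -1):
                path.append(int(psi[t][path[-1]]))
            return path[::-1]

        # Tiny two-state, two-symbol example.
        log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
        log_B = np.log([[0.9, 0.1], [0.2, 0.8]])
        log_pi = np.log([0.6, 0.4])
        print(viterbi(log_A, log_B, log_pi, [0, 0, 1, 1, 1]))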