1,323 research outputs found

Energy-efficient embedded machine learning algorithms for smart sensing systems

Author: OSTA MARIO
Publication venue: Università degli studi di Genova
Publication date: 27/02/2020

Embedded autonomous electronic systems are required in numerous application domains such as the Internet of Things (IoT), wearable devices, and biomedical systems. Embedded electronic systems usually host sensors, and each sensor hosts multiple input channels (e.g., tactile, vision), tightly coupled to the electronic computing unit (ECU). The ECU often extracts information by employing sophisticated methods, e.g., Machine Learning. However, embedding Machine Learning algorithms poses essential challenges in terms of hardware resources and energy consumption because of: 1) the high amount of data to be processed; 2) computationally demanding methods. Leveraging the trade-off between quality requirements on one side and computational complexity and latency on the other could reduce the system complexity without affecting the performance. The objectives of the thesis are to develop: 1) energy-efficient arithmetic circuits outperforming state-of-the-art solutions for embedded machine learning algorithms, 2) an energy-efficient embedded electronic system for the “electronic-skin” (e-skin) application. As such, this thesis exploits two main approaches: 1) Approximate Computing (AC): In recent years, the approximate computing paradigm has become a major field of research since it is able to enhance the energy efficiency and performance of digital systems. AC has turned out to be a practical approach to trade accuracy for better power, latency, and size. AC targets error-resilient applications and offers promising benefits by conserving some resources. Usually, approximate results are acceptable for many applications, e.g., tactile data processing, image processing, and data mining; thus, it is highly recommended to take advantage of energy reduction with minimal variation in performance. 
In our work, we developed two approximate multipliers: 1) the first one is called the “META” multiplier and is based on the Error Tolerant Adder (ETA), 2) the second one is called the “Approximate Baugh-Wooley (BW)” multiplier, where the approximations are implemented in the generation of the partial products. We showed that the proposed approximate arithmetic circuits could achieve a relevant reduction in power consumption and time delay of around 80.4% and 24%, respectively, with respect to the exact BW multiplier. Next, to prove the feasibility of AC in real-world applications, we explored the approximate multipliers in a case study, the e-skin application. The e-skin application is defined as multiple sensing components, including 1) structural materials, 2) signal processing, 3) data acquisition, and 4) data processing. In particular, processing the data originating from the e-skin into low- or high-level information is the main problem to be addressed by the embedded electronic system. Many studies have shown that Machine Learning is a promising approach for processing tactile data when classifying input touch modalities. In our work, we proposed a methodology for evaluating the behavior of the system when introducing approximate arithmetic circuits in the main stages (i.e., signal and data processing stages) of the system. Based on the proposed methodology, we first implemented the approximate multipliers in the low-pass Finite Impulse Response (FIR) filter in the signal processing stage of the application. We noticed that the FIR filter based on the approximate BW multiplier (Approx-BW) outperforms state-of-the-art solutions, while respecting the trade-off between accuracy and power consumption, with an SNR degradation of 1.39 dB. 
Second, we implemented approximate adders and multipliers, respectively, in the Coordinate Rotation Digital Computer (CORDIC) and the Singular Value Decomposition (SVD) circuits, since CORDIC and SVD account for a significant part of the computationally expensive Machine Learning algorithms employed in tactile data processing. We showed benefits of up to 21% and 19% in power reduction at the cost of less than 5% accuracy loss for the CORDIC and SVD circuits when scaling the number of approximated bits. 2) Parallel Computing Platforms (PCP): Exploiting parallel architectures for near-threshold computing based on multi-core clusters is a promising approach to improve the performance of smart sensing systems. In our work, we exploited a novel computing platform embedding a Parallel Ultra Low Power processor (PULP), called “Mr. Wolf,” for the implementation of Machine Learning (ML) algorithms for touch modalities classification. First, we tested the ML algorithms at the software level; using RGB images as a case study and a tactile dataset, we achieved accuracies of 97% and 83.5%, respectively. After validating the effectiveness of the ML algorithm at the software level, we performed the on-board classification of two touch modalities, demonstrating the promising use of Mr. Wolf for smart sensing systems. Moreover, we proposed a memory management strategy for storing the needed amount of trained tensors (i.e., 50 trained tensors for each class) in the on-chip memory. We evaluated the execution cycles for Mr. Wolf using a single core, 2 cores, and 3 cores, taking advantage of the benefits of parallelization. We presented a comparison with the popular low-power ARM Cortex-M4F microcontroller, usually employed in battery-operated devices. We showed that the ML algorithm on the proposed platform runs 3.7 times faster than on the ARM Cortex-M4F (STM32F40), consuming only 28 mW. 
The proposed platform achieves 15× better energy efficiency than the classification done on the STM32F40, consuming 81 mJ per classification and 150 pJ per operation.
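
As a rough illustration of what approximating the generation of the partial products can look like, the Python sketch below compares an exact array multiplier with one that never generates the partial-product bits in its lowest columns. This is not the META or Approximate BW design from the thesis: it assumes unsigned operands, and the function names and the `dropped_columns` parameter are hypothetical choices made only for this example.

```python
def exact_multiply(a: int, b: int, width: int = 8) -> int:
    """Unsigned array multiplication: sum every AND partial-product bit."""
    total = 0
    for i in range(width):              # bit position in b
        for j in range(width):          # bit position in a
            bit = ((a >> j) & 1) & ((b >> i) & 1)
            total += bit << (i + j)     # the bit lands in column i + j
    return total


def approx_multiply(a: int, b: int, width: int = 8,
                    dropped_columns: int = 4) -> int:
    """Same array, but bits in the lowest columns are never generated.

    Skipping those bits removes AND gates and adder cells in hardware,
    at the cost of a bounded, always-negative error.
    """
    total = 0
    for i in range(width):
        for j in range(width):
            if i + j < dropped_columns:
                continue                # assumed approximation policy
            bit = ((a >> j) & 1) & ((b >> i) & 1)
            total += bit << (i + j)
    return total


if __name__ == "__main__":
    a, b = 255, 255
    print(exact_multiply(a, b), approx_multiply(a, b))
```

With 8-bit operands and four dropped columns, column c holds c + 1 bits of weight 2^c, so the worst-case error is 1·1 + 2·2 + 3·4 + 4·8 = 49, reached when every dropped bit is 1 (for example 255 × 255).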

Archivio istituzionale della ricerca - Università di Genova

A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

Author: Azarkhish Erfan
Benini Luca
Bonetti Andrea
Emery Stephane
Jokic Petar
Pons Marc
Publication date: 24/06/2021

Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each utilized optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing down research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the design choices for each building block separately. Reported optimizations range from up to 10,000x memory savings to 33x energy reductions, providing chip designers an overview of design choices for implementing efficient low-power neural network accelerators.

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

Author: Armeniakos Giorgos
Hanif Muhammad Abdullah
Jiao Xun
Leon Vasileios
Pekmestzi Kiamal
Shafique Muhammad
Soudris Dimitrios
Publication date: 20/07/2023

The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus, typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field in the last 15 years has established power as a first-class design concern. As a result, the computing systems community is forced to find alternative design approaches to facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at software, hardware, and architectural levels. Over the last decade, a plethora of approximation techniques has emerged in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing; it reviews its motivation, terminology, and principles, and it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques. Comment: Under Review at ACM Computing Surveys

arXiv.org e-Print Archive

Binarized Convolutional Neural Networks with Separable Filters for Efficient Hardware Acceleration

Author: Gupta Rajesh K.
Lin Jeng-Hau
Srivastava Mani
Tu Zhuowen
Xing Tianwei
Zhang Zhiru
Zhao Ritchie
Publication date: 15/07/2017

State-of-the-art convolutional neural networks are enormously costly in both compute and memory, demanding massively parallel GPUs for execution. Such networks strain the computational capabilities and energy available to embedded and mobile processing platforms, restricting their use in many important applications. In this paper, we push the boundaries of hardware-effective CNN design by proposing BCNN with Separable Filters (BCNNw/SF), which applies Singular Value Decomposition (SVD) on BCNN kernels to further reduce computational and storage complexity. To enable its implementation, we provide a closed form of the gradient over SVD to calculate the exact gradient with respect to every binarized weight in backward propagation. We verify BCNNw/SF on the MNIST, CIFAR-10, and SVHN datasets, and implement an accelerator for CIFAR-10 on FPGA hardware. Our BCNNw/SF accelerator realizes memory savings of 17% and execution time reduction of 31.3% compared to BCNN with only minor accuracy sacrifices. Comment: 9 pages, 6 figures, accepted for Embedded Vision Workshop (CVPRW)
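
The reason separable filters reduce computation can be seen in a small Python sketch. It does not reproduce the paper's SVD step or its gradient derivation: it simply assumes that a binarized k × k kernel has already been factored into a column vector and a row vector (the names `col` and `row` are hypothetical) and checks that two 1-D passes give the same result as the direct 2-D filter, using 2k instead of k² multiplications per output.

```python
def conv2d_valid(image, kernel):
    """Direct 2-D 'valid' cross-correlation on lists of lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def separable_conv2d_valid(image, col, row):
    """Same result when kernel[i][j] == col[i] * row[j] (a rank-1 kernel)."""
    kh, kw = len(col), len(row)
    # Horizontal pass: filter every image line with the row vector.
    tmp = [[sum(line[c + j] * row[j] for j in range(kw))
            for c in range(len(line) - kw + 1)]
           for line in image]
    # Vertical pass: filter the intermediate result with the column vector.
    return [[sum(tmp[r + i][c] * col[i] for i in range(kh))
             for c in range(len(tmp[0]))]
            for r in range(len(tmp) - kh + 1)]


if __name__ == "__main__":
    col, row = [1, -1, 1], [1, 1, -1]           # binarized (+1/-1) factors
    kernel = [[a * b for b in row] for a in col]
    image = [[(3 * r + 5 * c) % 7 - 3 for c in range(5)] for r in range(5)]
    print(conv2d_valid(image, kernel) == separable_conv2d_valid(image, col, row))
```

Storage shrinks in the same way: a separable 3 × 3 binary kernel needs 6 weights instead of 9.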

arXiv.org e-Print Archive

Crossref

Challenges and Opportunities in Near-Threshold DNN Accelerators around Timing Errors

Author: Basu Prabal
Chakraborty Koushik
Gundi Noel Daniel
Pandey Pramesh
Patrick Mitchell Craig
Roy Sanghamitra
Shabanian Tahmoures
Publication venue: Hosted by Utah State University Libraries
Publication date: 16/10/2020

AI evolution is accelerating and Deep Neural Network (DNN) inference accelerators are at the forefront of ad hoc architectures that are evolving to support the immense throughput required for AI computation. However, much more energy-efficient design paradigms are essential to realize the complete potential of AI evolution and curtail energy consumption. The Near-Threshold Computing (NTC) design paradigm can serve as the best candidate for providing the required energy efficiency. However, NTC operation is plagued with ample performance and reliability concerns arising from timing errors. In this paper, we dive deep into DNN architecture to uncover some unique challenges and opportunities for operation in the NTC paradigm. By performing rigorous simulations in a TPU systolic array, we reveal the severity of timing errors and their impact on inference accuracy at NTC. We analyze various attributes (such as data–delay relationship, delay disparity within arithmetic units, utilization pattern, hardware homogeneity, and workload characteristics) and uncover unique localized and global techniques to deal with the timing errors in NTC.

DigitalCommons@USU

Efficient Learning in Heterogeneous Internet of Things Ecosystems

Author: Kim Yeseong
Publication venue: eScholarship, University of California
Publication date: 01/01/2020

The Internet of Things (IoT) is a growing network of heterogeneous devices, combining various sensing and computing nodes at different scales, which creates a large volume of data. Many IoT applications use machine learning (ML) algorithms to analyze the data. The high computational complexity of ML workloads poses significant computational challenges to IoT computing platforms, which tend to be less powerful and resource-constrained devices. Transmitting such large volumes of data to the cloud also raises various issues such as scalability, security, and privacy. In this dissertation, we propose efficient solutions to perform the ML tasks while decreasing power consumption and improving performance. We first leverage the heterogeneous and interconnected nature of IoT systems, where IoT applications run on many different architectures (e.g., x86 server or ARM-based edge device) while communicating with each other. We present a cross-platform power and performance prediction technique for intelligent task allocation. The proposed technique estimates the time-variant energy consumption with only 7% error across completely different architectures, enabling intelligent task allocation that reduces energy consumption by 16.5% for state-of-the-art ML workloads. We next show how to further advance the learning procedures towards real-time and online processing by distributing such learning tasks onto the hierarchy of IoT devices. Our solution leverages brain-inspired high-dimensional (HD) computing to derive a new class of learning algorithms that can easily run on IoT devices, while providing high accuracy comparable to the state of the art. We show that the HD-based learning algorithms can cover various real-world problems, from conventional classification to other cognitive tasks beyond classical ML such as DNA pattern matching. 
We demonstrate that HD-based learning can enable secure, collaborative learning by efficiently distributing a large volume of learning tasks across heterogeneous computing nodes. We have implemented the proposed learning solution on various platforms while offering superior computing efficiency. For example, our solution achieves 486× and 7× performance improvements for the training and inference phases, respectively, on a low-power ARM processor, as compared to state-of-the-art deep learning.

eScholarship - University of California

Model-Architecture Co-design of Deep Neural Networks for Embedded Systems

Author: Maji Partha
Publication venue: University of Cambridge
Publication date: 24/06/2020

In deep learning, a convolutional neural network (ConvNet or CNN) is a powerful tool for building interesting embedded applications that use data to make predictions. An application running on an embedded system typically has limited access to memory resources, processing power, and storage. Implementing deep convolutional neural network-based inference on resource-constrained devices can be very challenging, as these environments cannot usually make use of the massive computing power and storage that are present in cloud server environments. Furthermore, the constantly evolving nature of modern deep network architecture aggravates the problem by making it necessary to balance flexibility against specialisation to avoid the inability to adapt. However, much of the baseline architecture of a deep convolutional neural network has stayed the same. With careful optimisation of the most common and widely occurring layer architectures, it is typically possible to accelerate these emerging workloads for resource-constrained embedded systems. This thesis makes four contributions. I first developed a lossy three-stage low-rank approximation scheme that can reduce the computational complexity of a pre-trained model by 3-5x and up to 8-9x for individual convolutional layers. This scheme requires restructuring of the convolutional layers and generally suits the scenario where both the training data and trained model are available. In many scenarios, the training data is not available to recover, through fine-tuning, any loss in prediction accuracy caused by structural changes made to a model as a post-processing step. Besides the lack of availability of training data, there are other situations where the architecture of a model cannot be changed after training. My second contribution handles this scenario by using a low-level optimisation scheme that requires no changes to the model architecture, unlike the low-rank approximation scheme. 
This novel scheme uses a modified version of the Cook-Toom algorithm to reduce the computational intensity of commonly occurring dense and spatial convolutional layers and speed up inference by 2-4x. My third contribution is an efficient implementation of the Cook-Toom class of algorithms on Arm's ubiquitous low-power Cortex processors. Unlike direct convolution, computing convolutions using the modified Cook-Toom algorithm requires a different data processing pipeline, as it involves pre- and post-transformations of the intermediate activations. I introduced a multi-channel multi-region (MCMR) scheme to enable an efficient implementation of the fast Cook-Toom algorithm. I demonstrate that by effectively using SIMD instructions and the MCMR scheme, an average 2-3x and a peak 4x per-layer speedup is easily achievable. My final contribution is the Cook-Toom accelerator, a custom hardware architecture for modern convolutional neural networks. This accelerator architecture is designed from the ground up to address some of the limitations of a resource-constrained SIMD processor. I also illustrate how new emerging layer types can be mapped efficiently to the same flexible architecture without any modification.
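
To make the pre- and post-transformations concrete, here is a minimal Python sketch of the textbook Cook-Toom (Winograd) minimal filtering algorithm F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. It is not the thesis's modified algorithm or its MCMR scheme; the function names are hypothetical, and `Fraction` is used only so that the check is exact.

```python
from fractions import Fraction


def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def cook_toom_f23(d, g):
    """Same two outputs with 4 multiplications (textbook F(2,3))."""
    # Filter transform: can be precomputed once per filter.
    u0 = Fraction(g[0])
    u1 = Fraction(g[0] + g[1] + g[2], 2)
    u2 = Fraction(g[0] - g[1] + g[2], 2)
    u3 = Fraction(g[2])
    # Input (pre-)transform: additions and subtractions only.
    v0, v1, v2, v3 = d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]
    # Element-wise products: the only 4 multiplications.
    m0, m1, m2, m3 = v0 * u0, v1 * u1, v2 * u2, v3 * u3
    # Output (post-)transform: additions and subtractions only.
    return [m0 + m1 + m2, m1 - m2 - m3]


if __name__ == "__main__":
    d, g = [1, 2, 3, 4], [5, 6, 7]
    print(direct_f23(d, g), cook_toom_f23(d, g))
```

The saving grows in 2-D: nesting F(2,3) in both dimensions yields a 2 × 2 output tile of a 3 × 3 filter with 16 multiplications instead of 36, which is consistent in spirit with the 2-4x speedups quoted above.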

Apollo (Cambridge)

A Survey on Approximate Multiplier Designs for Energy Efficiency: From Algorithms to Circuits

Author: Chen Chuangtao
Han Jie
Qian Weikang
Wang Xuan
Wen Chenyi
Wu Ying
Xiao Weihua
Yin Xunzhao
Zhuo Cheng
Publication date: 29/06/2023

Given the stringent requirements of energy efficiency for Internet-of-Things edge devices, approximate multipliers, as a basic component of many processors and accelerators, have been constantly proposed and studied for decades, especially in error-resilient applications. The computation error and energy efficiency largely depend on how and where the approximation is introduced into a design. Thus, this article aims to provide a comprehensive review of the approximation techniques in multiplier designs ranging from algorithms and architectures to circuits. We have implemented representative approximate multiplier designs in each category to understand the impact of the design techniques on accuracy and efficiency. The designs can then be effectively deployed in high-level applications, such as machine learning, to gain energy efficiency at the cost of slight accuracy loss. Comment: 38 pages, 37 figures

arXiv.org e-Print Archive