Search CORE

373 research outputs found

CUTIE: Beyond PetaOp/s/W Ternary DNN Inference Acceleration with Better-than-Binary Energy Efficiency

Author: Benini Luca
Cavigelli Lukas
Rutishauser Georg
Scherer Moritz
Publication venue
Publication date: 04/02/2021
Field of study

We present a 3.1 POp/s/W fully digital hardware accelerator for ternary neural networks. CUTIE, the Completely Unrolled Ternary Inference Engine, focuses on minimizing non-computational energy and switching activity so that dynamic power spent on storing (locally or globally) intermediate results is minimized. This is achieved by 1) a data path architecture completely unrolled in the feature map and filter dimensions to reduce switching activity by favoring silencing over iterative computation and maximizing data re-use, 2) targeting ternary neural networks which, in contrast to binary NNs, allow for sparse weights which reduce switching activity, and 3) introducing an optimized training method for higher sparsity of the filter weights, resulting in a further reduction of the switching activity. Compared with state-of-the-art accelerators, CUTIE achieves greater or equal accuracy while decreasing the overall core inference energy cost by a factor of 4.8x-21x

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

CNNParted: An open source framework for efficient Convolutional Neural Network inference partitioning in embedded systems

Author: Becker Jürgen
Harbaum Tanja
Hoefer Julian
Hotfilter Tim
Kreß Fabian
Schmidt Patrick
Sidorenko Vladimir
Walter Iris
Publication venue: Elsevier
Publication date: 20/04/2023
Field of study

Applications such as autonomous driving or assistive robotics heavily rely on the usage of Deep Neural Networks. In particular, Convolutional Neural Networks (CNNs) provide precise and reliable results in image processing tasks like camera-based object detection or semantic segmentation. However, to achieve even better results, CNNs are becoming more and more complex. Deploying these networks in distributed embedded systems thereby imposes new challenges, due to additional constraints regarding performance and energy consumption in the near-sensor compute platforms, i.e. the sensor nodes. Processing all data in the central node, however, is disadvantageous since raw data of camera consumes large bandwidth and running CNN inference of multiple tasks requires certain performance. Moreover, sending raw data over the interconnect is not advisable for privacy reasons. Hence, offloading CNN workload to the sensor nodes in the system can lead to reduced traffic on the link and a higher level of data security. However, due to the limited hardware-resources on the sensor nodes, partitioning CNNs has to be done carefully to meet overall latency requirements and energy constraints. Therefore, we present CNNParted, an open-source framework for efficient, hardware-aware CNN inference partitioning targeting embedded AI applications. It automatically searches for potential partitioning points in the CNN to find a beneficial workload distribution between sensor nodes and a central edge node. Thereby, CNNParted not only analyzes the CNN architecture but also takes hardware components, such as dedicated hardware accelerators and memories, into consideration to evaluate inference partitioning regarding latency and energy consumption. Exemplary, we apply CNNParted to three commonly used feed forward CNNs in embedded systems. Thereby, the framework first searches for several potential partitioning points and then evaluates the latter regarding inference latency and energy consumption. Based on the results, beneficial partitioning points can be identified depending on the system constraints. Using the framework, we are able to find and evaluate 10 potential partitioning points for FCN ResNet-50, 13 partitioning points for GoogLeNet, and 8 partitioning points for SqueezeNet V1.1 within 520 s, 330 s, and 140 s, respectively, on an AMD EPYC 7702P running 8 concurrent threads. For GoogLeNet, we determine two partitioning points that provide a good trade-off between required bandwidth, latency and energy consumption. We also provide insights into further interesting findings that can be derived from the evaluation results

KITopen

Gen-acceleration: Pioneering work for hardware accelerator generation using large language models

Author: Vungarala Durga Lakshmi Venkata Deepak
Publication venue: Digital Commons @ NJIT
Publication date: 31/12/2023
Field of study

Optimizing computational power is critical in the age of data-intensive applications and Artificial Intelligence (AI)/Machine Learning (ML). While facing challenging bottlenecks, conventional Von-Neumann architecture with implementing such huge tasks looks seemingly impossible. Hardware Accelerators are critical in efficiently deploying these technologies and have been vastly explored in edge devices. This study explores a state-of-the-art hardware accelerator; Gemmini is studied; we leveraged the open-sourced tool. Furthermore, we developed a Hardware Accelerator in the study we compared with the Non-Von-Neumann architecture. Gemmini is renowned for efficient matrix multiplication, but configuring it for specific tasks requires manual effort and expertise. We propose implementing it by reducing manual intervention and domain expertise, making it easy to develop and deploy hardware accelerators that are time-consuming and need expertise in the field; by leveraging the Large Language Models (LLMs), they enable data-informed decision-making, enhancing performance. This work introduces an innovative method for hardware accelerator generation by undertaking the Gemmini to generate optimizing hardware accelerators for AI/ML applications and paving the way for automation and customization in the field

Digital Commons @ New Jersey Institute of Technology (NJIT)

Fully Quantized Always-on Face Detector Considering Mobile Image Sensors

Author: Chun Se Young
Je Hyunwoo
Jeong Wongi
Kim Kijeong
Lee Haechang
No Albert
Ryu Dongil
Publication venue
Publication date: 02/11/2023
Field of study

Despite significant research on lightweight deep neural networks (DNNs) designed for edge devices, the current face detectors do not fully meet the requirements for "intelligent" CMOS image sensors (iCISs) integrated with embedded DNNs. These sensors are essential in various practical applications, such as energy-efficient mobile phones and surveillance systems with always-on capabilities. One noteworthy limitation is the absence of suitable face detectors for the always-on scenario, a crucial aspect of image sensor-level applications. These detectors must operate directly with sensor RAW data before the image signal processor (ISP) takes over. This gap poses a significant challenge in achieving optimal performance in such scenarios. Further research and development are necessary to bridge this gap and fully leverage the potential of iCIS applications. In this study, we aim to bridge the gap by exploring extremely low-bit lightweight face detectors, focusing on the always-on face detection scenario for mobile image sensor applications. To achieve this, our proposed model utilizes sensor-aware synthetic RAW inputs, simulating always-on face detection processed "before" the ISP chain. Our approach employs ternary (-1, 0, 1) weights for potential implementations in image sensors, resulting in a relatively simple network architecture with shallow layers and extremely low-bitwidth. Our method demonstrates reasonable face detection performance and excellent efficiency in simulation studies, offering promising possibilities for practical always-on face detectors in real-world applications.Comment: Accepted to ICCV 2023 Workshop on Low-Bit Quantized Neural Networks (LBQNN), Ora

arXiv.org e-Print Archive

2022 roadmap on neuromorphic computing and engineering

Author: Christensen Dennis V
Datta Suman
Dittmann Regina
et al
Feldmann Johannes
Grollier Julie
Indiveri Giacomo
Keene Scott T
Lanza Mario
Le Gallo Manuel
Liang Shi-Jun
Linares-Barranco Bernabe
Marković Danijela
Menzel Stephan
Miao Feng
Mikolajick Thomas
Milano Gianluca
Mizrahi Alice
Quill Tyler J
Redaelli Andrea
Ricciardi Carlo
Salleo Alberto
Sebastian Abu
Slesazeck Stefan
Spiga Sabina
Strachan John Paul
Valentian Alexandre
Valov Ilia
Vianello Elisa
Yang J Joshua
Yao Peng
Publication venue: 'IOP Publishing'
Publication date: 01/06/2022
Field of study

Modern computation based on von Neumann architecture is now a mature cutting-edge science. In the von Neumann architecture, processing and memory units are implemented as separate blocks interchanging data intensively and continuously. This data transfer is responsible for a large part of the power consumption. The next generation computer technology is expected to solve problems at the exascale with 10

^{18}

calculations each second. Even though these future computers will be incredibly powerful, if they are based on von Neumann type architectures, they will consume between 20 and 30 megawatts of power and will not have intrinsic physically built-in capabilities to learn or deal with complex data as our brain does. These needs can be addressed by neuromorphic computing systems which are inspired by the biological concepts of the human brain. This new generation of computers has the potential to be used for the storage and processing of large amounts of digital information with much lower power consumption than conventional processors. Among their potential future applications, an important niche is moving the control from data centers to edge devices. The aim of this roadmap is to present a snapshot of the present state of neuromorphic technology and provide an opinion on the challenges and opportunities that the future holds in the major areas of neuromorphic technology, namely materials, devices, neuromorphic circuits, neuromorphic algorithms, applications, and ethics. The roadmap is a collection of perspectives where leading researchers in the neuromorphic community provide their own view about the current state and the future challenges for each research area. We hope that this roadmap will be a useful resource by providing a concise yet comprehensive introduction to readers outside this field, for those who are just entering the field, as well as providing future perspectives for those who are well established in the neuromorphic computing community

ZORA

2022 roadmap on neuromorphic computing and engineering

Author: et al
Furber Steve
Publication venue: 'IOP Publishing'
Publication date: 20/05/2022
Field of study

The University of Manchester - Institutional Repository

Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead

Author: Bussolino B.
Capra M.
Marchisio A.
Martina M.
Masera G.
Shafique M.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020
Field of study

Currently, Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications ranging from computer vision for medicine to autonomous driving of modern cars as well as other sectors in security, healthcare, and finance. However, to achieve impressive performance, these algorithms employ very deep networks, requiring a significant computational power, both during the training and inference time. A single inference of a DL model may require billions of multiply-and-accumulated operations, making the DL extremely compute-and energy-hungry. In a scenario where several sophisticated algorithms need to be executed with limited energy and low latency, the need for cost-effective hardware platforms capable of implementing energy-efficient DL execution arises. This paper first introduces the key properties of two brain-inspired models like Deep Neural Network (DNN), and Spiking Neural Network (SNN), and then analyzes techniques to produce efficient and high-performance designs. This work summarizes and compares the works for four leading platforms for the execution of algorithms such as CPU, GPU, FPGA and ASIC describing the main solutions of the state-of-the-art, giving much prominence to the last two solutions since they offer greater design flexibility and bear the potential of high energy-efficiency, especially for the inference process. In addition to hardware solutions, this paper discusses some of the important security issues that these DNN and SNN models may have during their execution, and offers a comprehensive section on benchmarking, explaining how to assess the quality of different networks and hardware systems designed for them

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)