21 research outputs found

YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems

Author: Grimaldi Matteo
Khan Shahrukh
Kumar Ravish
Lazarevich Ivan
Mitra Saptarshi
Sah Sudhakar
Publication date: 21/08/2023

We present YOLOBench, a benchmark comprised of 550+ YOLO-based object detection models on 4 different datasets and 4 different embedded hardware platforms (x86 CPU, ARM CPU, Nvidia GPU, NPU). We collect accuracy and latency numbers for a variety of YOLO-based one-stage detectors at different model scales by performing a fair, controlled comparison of these detectors with a fixed training environment (code and training hyperparameters). Pareto-optimality analysis of the collected data reveals that, if modern detection heads and training techniques are incorporated into the learning process, multiple architectures of the YOLO series achieve a good accuracy-latency trade-off, including older models like YOLOv3 and YOLOv4. We also evaluate training-free accuracy estimators used in neural architecture search on YOLOBench and demonstrate that, while most state-of-the-art zero-cost accuracy estimators are outperformed by a simple baseline like MAC count, some of them can be effectively used to predict Pareto-optimal detection models. We showcase that by using a zero-cost proxy to identify a YOLO architecture competitive against a state-of-the-art YOLOv8 model on a Raspberry Pi 4 CPU. The code and data are available at https://github.com/Deeplite/deeplite-torch-zo

arXiv.org e-Print Archive
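
The Pareto-optimality analysis mentioned in the YOLOBench abstract can be illustrated with a minimal sketch. It assumes each model is summarized by one latency and one accuracy value; the function name `pareto_front` and the sample numbers are hypothetical and are not taken from the benchmark or its code.

```python
from typing import List, Tuple

# (model name, latency in ms, accuracy); all values below are made up.
Model = Tuple[str, float, float]


def pareto_front(models: List[Model]) -> List[Model]:
    """Return the models that no other model beats on both axes.

    A model is kept if no other model is at least as fast and strictly
    more accurate. Exact duplicates of a kept point are dropped.
    """
    # Sort by latency (ascending); for equal latency, best accuracy first.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    front: List[Model] = []
    best_accuracy = float("-inf")
    for name, latency, accuracy in ordered:
        # Every model seen so far is at least as fast, so the current
        # model survives only if it is strictly more accurate.
        if accuracy > best_accuracy:
            front.append((name, latency, accuracy))
            best_accuracy = accuracy
    return front


# Hypothetical measurements for illustration only.
models = [
    ("a", 10.0, 0.30),
    ("b", 20.0, 0.45),
    ("c", 25.0, 0.40),
    ("d", 40.0, 0.50),
    ("e", 15.0, 0.28),
]
print([name for name, _, _ in pareto_front(models)])  # ['a', 'b', 'd']
```

Sorting first keeps the scan linear after an O(n log n) sort, which is enough for a few hundred models.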

DeepliteRT: Computer Vision at the Edge

Author: Ashfaq Saad
AskariHemmat MohammadHossein
Hoffman Alexander
Mitra Saptarshi
Saboori Ehsan
Sah Sudhakar
Publication date: 19/09/2023

The proliferation of edge devices has unlocked unprecedented opportunities for deep learning model deployment in computer vision applications. However, these complex models require considerable power, memory and compute resources that are typically not available on edge platforms. Ultra low-bit quantization presents an attractive solution to this problem by scaling down the model weights and activations from 32-bit to less than 8-bit. We implement highly optimized ultra low-bit convolution operators for ARM-based targets that outperform existing methods by up to 4.34x. Our operator is implemented within Deeplite Runtime (DeepliteRT), an end-to-end solution for the compilation, tuning, and inference of ultra low-bit models on ARM devices. Compiler passes in DeepliteRT automatically convert a fake-quantized model in full precision to a compact ultra low-bit representation, easing the process of quantized model deployment on commodity hardware. We analyze the performance of DeepliteRT on classification and detection models against optimized 32-bit floating-point, 8-bit integer, and 2-bit baselines, achieving significant speedups of up to 2.20x, 2.33x and 2.17x, respectively. Comment: Accepted at British Machine Vision Conference (BMVC) 202

arXiv.org e-Print Archive
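
The conversion from full-precision values to a compact ultra low-bit representation, described in the DeepliteRT abstract, can be sketched as below. This is an illustration under stated assumptions, not DeepliteRT's compiler pass: it assumes unsigned 2-bit codes, a single hypothetical `scale` parameter, and a packing order of four codes per byte with the first code in the lowest bits.

```python
from typing import List


def quantize_2bit(values: List[float], scale: float) -> List[int]:
    """Map floats to unsigned 2-bit codes (0..3): round(v / scale), clamped."""
    codes = []
    for v in values:
        q = int(round(v / scale))
        codes.append(max(0, min(3, q)))
    return codes


def pack_2bit(codes: List[int]) -> bytes:
    """Pack four 2-bit codes into each byte, first code in the lowest bits."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= (code & 0b11) << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` 2-bit codes from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


# Hypothetical values; scale chosen only for illustration.
codes = quantize_2bit([0.0, 0.26, 0.49, 0.9, 1.2], scale=0.25)
packed = pack_2bit(codes)
print(codes)         # [0, 1, 2, 3, 3]
print(list(packed))  # [228, 3]
```

Five 32-bit floats (20 bytes) become 2 bytes here, which is the kind of footprint reduction that motivates sub-8-bit deployment.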

Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime

Author: Ashfaq Saad
AskariHemmat MohammadHossein
Hoffman Alexander
Mastropietro Olivier
Saboori Ehsan
Sah Sudhakar
Publication date: 18/07/2022

Deep Learning has been one of the most disruptive technological advancements in recent times. The high performance of deep learning models comes at the expense of high computational, storage and power requirements. Sensing the immediate need for accelerating and compressing these models to improve on-device performance, we introduce Deeplite Neutrino for production-ready optimization of the models and Deeplite Runtime for deployment of ultra-low bit quantized models on Arm-based platforms. We implement low-level quantization kernels for Armv7 and Armv8 architectures enabling deployment on the vast array of 32-bit and 64-bit Arm-based devices. With efficient implementations using vectorization, parallelization, and tiling, we realize speedups of up to 2x and 2.2x compared to TensorFlow Lite with XNNPACK backend on classification and detection models, respectively. We also achieve significant speedups of up to 5x and 3.2x compared to ONNX Runtime for classification and detection models, respectively.

arXiv.org e-Print Archive

Automatic Sequential to Parallel Code Conversion

Author: . M. N. Babu
. Sudhakar Sah
. Vinay Vaidya
Athavale Aditi
Pawar Prasad
Rajguru Chaitanya
Ranadive Priti
Publication venue: GSTF Journal on Computing (JoC)
Publication date: 16/09/2014

The way software programs are written has been redefined since the introduction of multicore processors. Software developers have started writing parallel programs that are robust and scalable. This ensures that the processing power made available in the form of multiple cores is used. Though this trend is increasing, there are legacy applications that have been developed over the past few decades. Most of these applications are inherently sequential, making no use of multithreading or parallel programming. If such applications are ported to execute on multicore hardware as they are, then optimal usage of all cores is not guaranteed. Such applications would ideally utilize only one core and the other cores would remain idle, unless the operating system supports some parallelism while scheduling. Hence there is a need to convert such legacy sequential codes to their parallel versions so that multicore hardware is exploited to the fullest. In this paper we present a tool that we have developed to automatically convert sequential C code to parallel code. This Sequential to Parallel (S2P) tool is still in the development phase. We also discuss other parallelization tools available today, compare such tools with the S2P tool and present our performance analysis results on different kinds of multicore hardware.

GSTF Digital Library (GSTF-DL): Open Journal Systems (Global Science and Technology Forum)

DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables

Author: Ashfaq Saad
AskariHemmat MohammadHossein
Ganji Darshan C.
Hassanien Ahmed
Hoffman Alexander
Léonardon Mathieu
Mitra Saptarshi
Saboori Ehsan
Sah Sudhakar
Publication date: 18/04/2023

A lot of recent progress has been made in ultra low-bit quantization, promising significant improvements in latency, memory footprint and energy consumption on edge devices. Quantization methods such as Learned Step Size Quantization can achieve model accuracy that is comparable to full-precision floating-point baselines even with sub-byte quantization. However, it is extremely challenging to deploy these ultra low-bit quantized models on mainstream CPU devices because commodity SIMD (Single Instruction, Multiple Data) hardware typically supports no less than 8-bit precision. To overcome this limitation, we propose DeepGEMM, a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. The proposed method precomputes all possible products of weights and activations, stores them in a lookup table, and efficiently accesses them at inference time to avoid costly multiply-accumulate operations. Our 2-bit implementation outperforms corresponding 8-bit integer kernels in the QNNPACK framework by up to 1.74x on x86 platforms.

arXiv.org e-Print Archive
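
The lookup-table idea in the DeepGEMM abstract (precompute every weight-activation product, then index the table at inference time) can be sketched in scalar form. This is not the authors' SIMD kernel: the code-to-value mappings `W_VALUES` and `A_VALUES` and the function names are assumptions made for illustration, and a real implementation would operate on packed codes with vector instructions.

```python
from typing import List, Sequence

# Assumed mappings from 2-bit codes (0..3) to integer values.
W_VALUES = [-2, -1, 0, 1]  # signed weight levels (hypothetical)
A_VALUES = [0, 1, 2, 3]    # unsigned activation levels (hypothetical)

# With 2-bit operands there are only 4 x 4 = 16 possible products,
# so all of them fit in a tiny table built once, ahead of inference.
LUT: List[List[int]] = [[w * a for a in A_VALUES] for w in W_VALUES]


def lut_dot(w_codes: Sequence[int], a_codes: Sequence[int]) -> int:
    """Dot product that replaces each multiply with a table lookup."""
    return sum(LUT[w][a] for w, a in zip(w_codes, a_codes))


def reference_dot(w_codes: Sequence[int], a_codes: Sequence[int]) -> int:
    """Plain multiply-accumulate on the decoded values, for comparison."""
    return sum(W_VALUES[w] * A_VALUES[a] for w, a in zip(w_codes, a_codes))


w_codes = [0, 3, 1, 2]
a_codes = [3, 1, 2, 0]
print(lut_dot(w_codes, a_codes))  # -7
```

The table grows as 2^(weight bits + activation bits) entries, which is why the approach suits sub-byte precision in particular.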

DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables

Author: Ashfaq Saad
Askarihemmat Mohammadhossein
Ganji Darshan, C
Hassanien Ahmed
Hoffmann Alexander
Leonardon Mathieu
Mitra Saptarshi
Saboori Ehsan
Sah Sudhakar
Publication venue: HAL CCSD
Publication date: 19/06/2023

A lot of recent progress has been made in ultra low-bit quantization, promising significant improvements in latency, memory footprint and energy consumption on edge devices. Quantization methods such as Learned Step Size Quantization can achieve model accuracy that is comparable to full-precision floating-point baselines even with sub-byte quantization. However, it is extremely challenging to deploy these ultra low-bit quantized models on mainstream CPU devices because commodity SIMD (Single Instruction, Multiple Data) hardware typically supports no less than 8-bit precision. To overcome this limitation, we propose DeepGEMM, a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. The proposed method precomputes all possible products of weights and activations, stores them in a lookup table, and efficiently accesses them at inference time to avoid costly multiply-accumulate operations. Our 2-bit implementation outperforms corresponding 8-bit integer kernels in the QNNPACK framework by up to 1.74× on x86 platforms.

HAL-Université de Bretagne Occidentale

Disruption of tetR type regulator adeN by mobile genetic element confers elevated virulence in Acinetobacter baumannii

Author: Arunkumar KP
Madhangi M
Pagal Sudhakar
Prashanth K
Sah Suresh
Saranathan Rajagopalan
Satti Annapurna
Sawant Ajit R
Tomar Archana

Acinetobacter baumannii is an important human pathogen and is considered a major threat due to its extreme drug resistance. In this study, the genome of a hyper-virulent MDR strain PKAB07 of A. baumannii isolated from an Indian patient was sequenced and analyzed to understand its mechanisms of virulence, resistance and evolution. Comparative genome analysis of PKAB07 revealed virulence and resistance related genes scattered throughout the genome, instead of being organized as an island, indicating the highly mosaic nature of the genome. Many intermittent horizontal gene transfer events and insertion sequence (IS) element insertions were identified that augment the resistance machinery and elevate the SNP densities in A. baumannii, eventually aiding its swift evolution. ISAba1, the most widely distributed insertion sequence in A. baumannii, was found at multiple sites in PKAB07. Out of many ISAba1 insertions, we identified novel insertions in 9 different genes, among which the insertional inactivation of adeN (tetR type regulator) was significant. To assess the significance of this disruption in A. baumannii, adeN mutant and complement strains were constructed in the A. baumannii ATCC 17978 strain and studied. Biofilm levels were abrogated in the adeN knockout when compared with the wild type and the complemented strain of the adeN knockout. Virulence of the adeN knockout mutant strain was observed to be high, which was validated by in vitro experiments and the Galleria mellonella infection model. The overexpression of adeJ, a major component of the AdeIJK efflux pump, observed in the adeN knockout strain could be the possible reason for the elevated virulence in the adeN mutant and the PKAB07 strain. Knocking out adeN in the ATCC strain led to increased resistance and virulence on par with PKAB07. Disruption of the tetR type regulator adeN by ISAba1 has consequently led to elevated virulence in this pathogen.

Open Access Repository of IISc Research Publications