12 research outputs found
An Energy-Efficient Hardware Implementation of HOG-Based Object Detection at 1080HD 60 fps with Multi-Scale Support
A real-time and energy-efficient multi-scale object detector hardware implementation is presented in this paper. Detection is done using Histogram of Oriented Gradients (HOG) features and Support Vector Machine (SVM) classification. Multi-scale detection is essential for robust and practical applications to detect objects of different sizes. Parallel detectors with balanced workload are used to increase the throughput, enabling voltage scaling and energy consumption reduction. Image pre-processing is also introduced to further reduce power and area costs of the image scales generation. This design can operate on high definition 1080HD video at 60 fps in real-time with a clock rate of 270 MHz, and consumes 45.3 mW (0.36 nJ/pixel) based on post-layout simulations. The ASIC has an area of 490 kgates and 0.538 Mbit on-chip memory in a 45 nm SOI CMOS process. Sponsors: Texas Instruments Incorporated; United States. Defense Advanced Research Projects Agency (Young Faculty Award Grant N66001-14-1-4039).
Energy-Efficient HOG-based Object Detection at 1080HD 60 fps with Multi-Scale Support
In this paper, we present a real-time and energy-efficient multi-scale object detector using Histogram of Oriented Gradients (HOG) features and Support Vector Machine (SVM) classification. Parallel detectors with balanced workload are used to enable processing of multiple scales and increase the throughput such that voltage scaling can be applied to reduce energy consumption. Image pre-processing is also introduced to further reduce power and area cost of the image scales generation. This design can operate on high definition 1080HD video at 60 fps in real-time with a clock rate of 270 MHz, and consumes 45.3 mW (0.36 nJ/pixel) based on post-layout simulations. The ASIC has an area of 490 kgates and 0.538 Mbit on-chip memory in a 45 nm SOI CMOS process.
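The energy-per-pixel figure quoted in the abstract above can be checked from the reported power and video format. The following is a minimal sketch, assuming that "1080HD" means a 1920×1080 frame and that the reported power is the average over all pixels of every frame; the function name `energy_per_pixel_nj` is hypothetical and not taken from the paper.

```python
def energy_per_pixel_nj(power_mw: float, width: int, height: int, fps: float) -> float:
    """Average energy per pixel, in nanojoules, for a given power and video format.

    Assumes every pixel of every frame is processed and the power is an average.
    """
    pixels_per_second = width * height * fps
    power_w = power_mw * 1e-3
    # joules per pixel, converted to nanojoules
    return power_w / pixels_per_second * 1e9


# Reported figures: 45.3 mW at 60 fps; the 1920x1080 frame size is an assumption.
hog_nj = energy_per_pixel_nj(45.3, 1920, 1080, 60)
print(f"{hog_nj:.2f} nJ/pixel")  # prints "0.36 nJ/pixel"
```

Under these assumptions the result agrees with the 0.36 nJ/pixel stated in the abstract.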
A 58.6mW Real-Time Programmable Object Detector with Multi-Scale Multi-Object Support Using Deformable Parts Model on 1920x1080 Video at 30fps
This paper presents a programmable, energy-efficient and real-time object detection accelerator using deformable parts models (DPM), with 2× higher accuracy than traditional rigid body models. With 8 deformable parts detection, three methods are used to address the high computational complexity: classification pruning for 33× fewer parts classification, vector quantization for 15× memory size reduction, and feature basis projection for 2× reduction of the cost of each classification. The chip is implemented in 65nm CMOS technology, and can process HD (1920×1080) images at 30fps without any off-chip storage while consuming only 58.6mW (0.94nJ/pixel, 1168 GOPS/W). The chip has two classification engines to simultaneously detect two different classes of objects. With a tested high throughput of 60fps, the classification engines can be time multiplexed to detect even more than two object classes. It is energy scalable by changing the pruning factor or disabling the parts classification. Sponsor: United States. Defense Advanced Research Projects Agency.
A 58.6 mW 30 Frames/s Real-Time Programmable Multiobject Detection Accelerator With Deformable Parts Models on Full HD 1920×1080 Videos
This paper presents a programmable, energy-efficient, and real-time object detection hardware accelerator for low power and high throughput applications using deformable parts models, with 2x higher detection accuracy than traditional rigid body models. Three methods are used to address the high computational complexity of eight deformable parts detection: classification pruning for 33x fewer part classification, vector quantization for 15x memory size reduction, and feature basis projection for 2x reduction in the cost of each classification. The chip was fabricated in a 65 nm CMOS technology, and can process full high definition 1920 × 1080 videos at 60 frames/s without any off-chip storage. The chip has two programmable classification engines (CEs) for multiobject detection. At 30 frames/s, the chip consumes only 58.6 mW (0.94 nJ/pixel, 1168 GOPS/W). At a higher throughput of 60 frames/s, the CEs can be time multiplexed to detect even more than two object classes. This proposed accelerator enables object detection to be as energy-efficient as video compression, which is found in most cameras today. Sponsors: United States. Defense Advanced Research Projects Agency; Texas Instruments Incorporated.
Visual-Inertial Odometry on Chip: An Algorithm-and-Hardware Co-design Approach
Autonomous navigation of miniaturized robots (e.g., nano/pico aerial vehicles) is currently a grand challenge for robotics research, due to the need of processing a large amount of sensor data (e.g., camera frames) with limited on-board computational resources. In this paper we focus on the design of a visual-inertial odometry (VIO) system in which the robot estimates its ego-motion (and a landmark-based map) from on-board camera and IMU data. We argue that scaling down VIO to miniaturized platforms (without sacrificing performance) requires a paradigm shift in the design of perception algorithms, and we advocate a co-design approach in which algorithmic and hardware design choices are tightly coupled. Our contribution is four-fold. First, we discuss the VIO co-design problem, in which one tries to attain a desired resource-performance trade-off, by making suitable design choices (in terms of hardware, algorithms, implementation, and parameters). Second, we characterize the design space, by discussing how a relevant set of design choices affects the resource-performance trade-off in VIO. Third, we provide a systematic experiment-driven way to explore the design space, towards a design that meets the desired trade-off. Fourth, we demonstrate the result of the co-design process by providing a VIO implementation on specialized hardware and showing that such implementation has the same accuracy and speed of a desktop implementation, while requiring a fraction of the power. Sponsors: United States. Air Force Office of Scientific Research. Young Investigator Program (FA9550-16-1-0228); National Science Foundation (U.S.) (NSF CAREER 1350685).
Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision
Computer vision enables a wide range of applications in robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. For many of these applications, local embedded processing is preferred due to privacy and/or latency concerns. Accordingly, energy-efficient embedded vision hardware delivering real-time and robust performance is crucial. While deep learning is gaining popularity in several computer vision algorithms, a significant energy consumption difference exists compared to traditional hand-crafted approaches. In this paper, we provide an in-depth analysis of the computation, energy and accuracy trade-offs between learned features such as deep Convolutional Neural Networks (CNN) and hand-crafted features such as Histogram of Oriented Gradients (HOG). This analysis is supported by measurements from two chips that implement these algorithms. Our goal is to understand the source of the energy discrepancy between the two approaches and to provide insight about the potential areas where CNNs can be improved and eventually approach the energy-efficiency of HOG while maintaining its outstanding performance accuracy.
Navion: A Fully Integrated Energy-Efficient Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones
This paper presents Navion, an energy-efficient accelerator for visual-inertial odometry (VIO) that enables autonomous navigation of miniaturized robots (e.g., nano drones), and virtual/augmented reality on portable devices. The chip uses inertial measurements and mono/stereo images to estimate the drone’s trajectory and a 3D map of the environment. This estimate is obtained by running a state-of-the-art algorithm based on non-linear factor graph optimization, which requires large irregularly structured memories and heterogeneous computation flow. To reduce the energy consumption and footprint, the entire VIO system is fully integrated on chip to eliminate costly off-chip processing and storage. This work uses compression and exploits both structured and unstructured sparsity to reduce on-chip memory size by 4.1x. Parallelism is used under tight area constraints to increase throughput by 43%. The chip is fabricated in 65nm CMOS, and can process 752x480 stereo images at up to 171 fps and inertial measurements at up to 52 kHz, while consuming an average of 24mW. The chip is configurable to maximize accuracy, throughput and energy-efficiency across different environments. To the best of our knowledge, this is the first fully integrated VIO system in an ASIC. Keywords: VIO, localization, mapping, nano drones, navigation. Sponsors: National Science Foundation (U.S.) (CAREER Grant 1350685); United States. Air Force. Office of Scientific Research. Young Investigator Program (FA9550-16-1-0228).
Hardware for Machine Learning: Challenges and Opportunities
Machine learning plays a critical role in extracting meaningful information out of the zettabytes of sensor data collected every day. For some applications, the goal is to analyze and understand the data to identify trends (e.g., surveillance, portable/wearable electronics); in other applications, the goal is to take immediate action based on the data (e.g., robotics/drones, self-driving cars, smart Internet of Things). For many of these applications, local embedded processing near the sensor is preferred over the cloud due to privacy or latency concerns, or limitations in the communication bandwidth. However, at the sensor there are often stringent constraints on energy consumption and cost in addition to throughput and accuracy requirements. Furthermore, flexibility is often required such that the processing can be adapted for different applications or environments (e.g., update the weights and model in the classifier). In many applications, machine learning often involves transforming the input data into a higher dimensional space, which, along with programmable weights, increases data movement and consequently energy consumption. In this paper, we will discuss how these challenges can be addressed at various levels of hardware design ranging from architecture, hardware-friendly algorithms, mixed-signal circuits, and advanced technologies (including memories and sensors). Sponsors: United States. Defense Advanced Research Projects Agency (DARPA); Texas Instruments Incorporated; Intel Corporation.
Energy efficient accelerators for autonomous navigation in miniaturized robots
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 143-149). Autonomy is becoming an increasingly desirable feature for very small nano/pico robots to navigate cluttered and confined indoor environments such as collapsed buildings, caves, etc. Robot perception (i.e., semantic and geometric understanding) is considered the computation bottleneck in autonomous navigation systems because of the high dimensionality of the problem. For example, multi-scale object detection is desired for robustness, which requires significant data expansion. Additionally, a 3D map size grows over time while the robot explores the environment, which requires computation power and large memory size. In this thesis, we introduce ASIC solutions that enable real-time and low power perception. First, the thesis demonstrates energy-efficient and high-throughput object detection accelerators for semantic understanding, which can process full HD (1920×1080, 60 fps) videos with energy consumption between 0.36 and 1.74 nJ/pixel. On-the-fly processing, parallel architectures, and image pre-processing are used to reduce the overhead of multi-scale detection using rigid-body models. Detection accuracy can be doubled with deformable parts models, but this requires 35× more computation. To overcome this overhead, we exploit data compression, computation pruning, and basis projection for an overall 5× power reduction and 3.6× smaller memory size. Second, this thesis presents an algorithm and hardware co-design approach to enable real-time and energy-efficient localization and mapping for geometric understanding, using visual-inertial odometry. The chip (Navion) processes 752×480 stereo frames at up to 171 fps, with an energy consumption between 1.6 and 3.5 nJ/pixel. Parallelism, rescheduling, resource sharing, exploiting sparsity, and image compression are applied to overcome the high dimensionality of the problem, resulting in 4.1× memory size reduction, and enabling full integration. Navion can adapt to different environments to maximize accuracy, throughput and energy-efficiency trade-offs. To the best of our knowledge, this thesis presents the first fully integrated VIO system in an ASIC. By Amr Suleiman. Ph. D.