Real-Time Fully Unsupervised Domain Adaptation for Lane Detection in Autonomous Driving
While deep neural networks are being utilized heavily for autonomous driving,
they need to be adapted to new unseen environmental conditions for which they
were not trained. We focus on the safety-critical application of lane detection,
and propose a lightweight, fully unsupervised, real-time adaptation approach
that only adapts the batch-normalization parameters of the model. We
demonstrate that our technique can perform inference, followed by on-device
adaptation, under a tight constraint of 30 FPS on an NVIDIA Jetson Orin. It
achieves accuracy (avg. 92.19%) similar to that of a state-of-the-art
semi-supervised adaptation algorithm, which, unlike our approach, does not
support real-time adaptation.
Comment: Accepted at the 2023 Design, Automation & Test in Europe Conference (DATE
2023) - Late Breaking Result
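The abstract says only that the batch-normalization (BN) parameters are adapted without labels; it does not say whether that means the running statistics, the affine scale and shift, or both. As a minimal sketch, assuming the simplest label-free variant, the code below re-estimates a BN layer's running mean and variance from unlabeled target-domain frames with an exponential moving average, interleaving inference and adaptation per frame. The class name, the momentum value, and the toy data are hypothetical, not the authors' method. Other test-time methods (e.g. TENT) instead update the affine parameters by entropy minimization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchNormStats:
    """Running per-channel statistics of one batch-normalization layer.

    Hypothetical helper for illustration; not taken from the paper.
    """
    mean: List[float]
    var: List[float]
    momentum: float = 0.1  # assumed EMA rate; the paper does not state one

    def normalize(self, row: List[float], gamma: List[float],
                  beta: List[float], eps: float = 1e-5) -> List[float]:
        """Inference: normalize one feature vector with the running stats."""
        return [g * (x - m) / (v + eps) ** 0.5 + b
                for x, m, v, g, b in zip(row, self.mean, self.var, gamma, beta)]

    def adapt(self, batch: List[List[float]]) -> None:
        """Label-free update of the running stats from a batch of shape [N][C]."""
        n = len(batch)
        for c in range(len(self.mean)):
            values = [row[c] for row in batch]
            batch_mean = sum(values) / n
            batch_var = sum((v - batch_mean) ** 2 for v in values) / n
            # Move the stored statistics toward the target-domain statistics.
            self.mean[c] += self.momentum * (batch_mean - self.mean[c])
            self.var[c] += self.momentum * (batch_var - self.var[c])


# Source-domain stats (mean 0, var 1); toy target-domain features are shifted by +2.
stats = BatchNormStats(mean=[0.0, 0.0], var=[1.0, 1.0])
gamma, beta = [1.0, 1.0], [0.0, 0.0]
for _ in range(50):
    frame = [[2.0 + 0.1 * i, 2.0 - 0.1 * i] for i in range(-2, 3)]
    outputs = [stats.normalize(row, gamma, beta) for row in frame]  # inference
    stats.adapt(frame)  # followed by on-device adaptation
```

Only two small vectors per BN layer change, which is one reason BN-only adaptation is cheap enough to consider under a per-frame budget.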
RobotPerf: An Open-Source, Vendor-Agnostic, Benchmarking Suite for Evaluating Robotics Computing System Performance
We introduce RobotPerf, a vendor-agnostic benchmarking suite designed to
evaluate robotics computing performance across a diverse range of hardware
platforms using ROS 2 as its common baseline. The suite encompasses ROS 2
packages covering the full robotics pipeline and integrates two distinct
benchmarking approaches: black-box testing, which measures performance by
eliminating upper layers and replacing them with a test application, and
grey-box testing, an application-specific measure that observes internal system
states with minimal interference. Our benchmarking framework provides
ready-to-use tools and is easily adaptable for the assessment of custom ROS 2
computational graphs. Drawing on the expertise of leading robot architects
and system architecture experts, RobotPerf establishes a standardized approach
to robotics benchmarking. As an open-source initiative, RobotPerf remains
committed to evolving with community input to advance the future of
hardware-accelerated robotics.
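As a rough illustration of what a grey-box measurement computes, the sketch below assumes a hypothetical trace of timestamped events, each tagged with a probe name and a message ID, and derives per-message latency between two probes. RobotPerf's actual tracing tooling and trace format are not described in the abstract; `TraceEvent`, `latencies_ms`, and the probe names are invented for illustration. A black-box test application could feed the same function with timestamps it records itself when publishing inputs and receiving outputs.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class TraceEvent:
    """One hypothetical trace record; not RobotPerf's actual format."""
    probe: str   # where the event was recorded, e.g. "input" or "output"
    msg_id: int  # identifies one message as it flows through the graph
    t_ns: int    # timestamp in nanoseconds


def latencies_ms(events: List[TraceEvent], start: str, end: str) -> List[float]:
    """Per-message latency between two probes, in milliseconds.

    Messages seen at only one probe (e.g. dropped ones) are skipped.
    """
    starts: Dict[int, int] = {}
    ends: Dict[int, int] = {}
    for ev in events:
        if ev.probe == start:
            starts[ev.msg_id] = ev.t_ns
        elif ev.probe == end:
            ends[ev.msg_id] = ev.t_ns
    matched = sorted(starts.keys() & ends.keys())
    return [(ends[m] - starts[m]) / 1e6 for m in matched]


events = [
    TraceEvent("input", 1, 0),
    TraceEvent("output", 1, 2_000_000),
    TraceEvent("input", 2, 10_000_000),
    TraceEvent("output", 2, 13_000_000),
    TraceEvent("input", 3, 20_000_000),  # never reached "output"
]
lat = latencies_ms(events, "input", "output")
print(f"n={len(lat)} mean={mean(lat):.1f} ms max={max(lat):.1f} ms")
# prints: n=2 mean=2.5 ms max=3.0 ms
```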
QuaRL: Quantization for Sustainable Reinforcement Learning
Deep reinforcement learning has achieved significant milestones; however, the
computational demands of reinforcement learning training and inference remain
substantial. Quantization is an effective method to reduce the computational
overheads of neural networks, though in the context of reinforcement learning,
it is unknown whether quantization's computational benefits outweigh the
accuracy costs introduced by the corresponding quantization error. To quantify
this tradeoff we perform a broad study applying quantization to reinforcement
learning. We apply standard quantization techniques such as post-training
quantization (PTQ) and quantization aware training (QAT) to a comprehensive set
of reinforcement learning tasks (Atari, Gym), algorithms (A2C, DDPG, DQN, D4PG,
PPO), and models (MLPs, CNNs) and show that policies may be quantized to 8 bits
without degrading reward, enabling significant inference speedups on
resource-constrained edge devices. Motivated by the effectiveness of standard
quantization techniques on reinforcement learning policies, we introduce a
novel quantization algorithm, ActorQ, for quantized actor-learner
distributed reinforcement learning training. By leveraging full-precision
optimization on the learner and quantized execution on the actors,
ActorQ enables 8-bit inference while maintaining convergence. We
develop a system for quantized reinforcement learning training around
ActorQ and demonstrate end-to-end speedups of 1.5-2.5×
over full-precision training on a range of tasks (DeepMind Control
Suite). Finally, we break down the various runtime costs of distributed
reinforcement learning training (e.g., communication time, inference time,
and model load time) and evaluate the effects of quantization on these system
attributes.
Comment: Equal contribution from the first three authors. Updating with QuaRL for
sustainable (carbon emissions) RL results
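The two ideas in the QuaRL abstract can be sketched together under stated assumptions: post-training quantization of policy weights to 8 bits, and an ActorQ-style split in which the learner keeps and updates full-precision weights while actors receive and execute an int8 copy. The sketch assumes symmetric per-tensor linear quantization (one standard PTQ scheme; the abstract does not name the exact scheme) and a toy linear policy. `Learner`, `Actor`, and the helper functions are hypothetical names, not the authors' code.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor PTQ: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    return [scale * v for v in q]


class Learner:
    """Hypothetical learner: keeps and updates full-precision weights."""

    def __init__(self, weights: List[float]):
        self.weights = list(weights)

    def step(self, grads: List[float], lr: float = 0.1) -> None:
        self.weights = [w - lr * g for w, g in zip(self.weights, grads)]

    def broadcast(self) -> Tuple[List[int], float]:
        # Actors receive only int8 values plus one float scale.
        return quantize_int8(self.weights)


class Actor:
    """Hypothetical actor: runs a toy linear policy with the quantized weights."""

    def __init__(self) -> None:
        self.q: List[int] = []
        self.scale = 1.0

    def load(self, packed: Tuple[List[int], float]) -> None:
        self.q, self.scale = packed

    def act(self, obs: List[float]) -> float:
        return sum(w * x for w, x in zip(dequantize(self.q, self.scale), obs))


learner = Learner([0.5, -1.27, 0.02])
actor = Actor()
for _ in range(3):
    actor.load(learner.broadcast())      # push an 8-bit copy of the policy
    score = actor.act([1.0, 1.0, 1.0])   # quantized execution on the actor
    learner.step([0.1, 0.0, -0.1])       # full-precision update on the learner
```

Because optimization stays in full precision, the quantization error (at most half a quantization step per weight) affects only how actors act, not the accumulated learner updates.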
Improving compute in-memory ECC reliability with successive correction
Compute-in-memory (CIM) is an exciting technique that minimizes data transport, maximizes memory throughput, and performs computation on the bitlines of memory sub-arrays. This is especially attractive for machine learning applications, where increased memory bandwidth and analog-domain computation offer improved area and energy efficiency. Unfortunately, CIM faces new challenges that traditional CMOS architectures have avoided. In this work, we explore the impact of device variation (calibrated with measured data from foundry RRAM arrays) and propose a new class of error correcting codes (ECC) for hard and soft errors in CIM. We demonstrate single, double, and triple error correction, offering more than a 16,000× reduction in bit error rate compared with a design without ECC and more than 427× compared with prior work, while incurring only 29.1% area and 26.3% power overhead. © 2022 ACM
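For readers new to ECC, the sketch below shows classic single-error correction with a Hamming(7,4) code: three parity bits protect four data bits, and the syndrome computed on read gives the position of a single flipped bit. This is textbook Hamming coding, not the successive-correction codes proposed in the paper; the function names are hypothetical. Hamming(7,4) has minimum distance 3, so it corrects one error; correcting two or three errors, as the paper reports, requires codes with larger minimum distance (2t + 1 for t errors).

```python
from typing import List


def hamming74_encode(d: List[int]) -> List[int]:
    """Encode 4 data bits into 7; 1-indexed positions 1, 2, 4 hold parity."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c: List[int]) -> List[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-indexed error position, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[5] ^= 1  # simulate one bit flip, e.g. from device variation
recovered = hamming74_decode(stored)
```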