QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
In this paper, we study the problem of learning vision-based dynamic
manipulation skills using a scalable reinforcement learning approach. We study
this problem in the context of grasping, a longstanding challenge in robotic
manipulation. In contrast to static learning behaviors that choose a grasp
point and then execute the desired grasp, our method enables closed-loop
vision-based control, whereby the robot continuously updates its grasp strategy
based on the most recent observations to optimize long-horizon grasp success.
To that end, we introduce QT-Opt, a scalable self-supervised vision-based
reinforcement learning framework that can leverage over 580k real-world grasp
attempts to train a deep neural network Q-function with over 1.2M parameters to
perform closed-loop, real-world grasping that generalizes to 96% grasp success
on unseen objects. Aside from attaining a very high success rate, our method
exhibits behaviors that are quite distinct from more standard grasping systems:
using only RGB vision-based perception from an over-the-shoulder camera, our
method automatically learns regrasping strategies, probes objects to find the
most effective grasps, learns to reposition objects and perform other
non-prehensile pre-grasp manipulations, and responds dynamically to
disturbances and perturbations.
Comment: CoRL 2018 camera ready. 23 pages, 14 figures
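The abstract describes closed-loop control driven by a learned Q-function; the full paper reports optimizing the action argmax with the cross-entropy method (CEM). Below is a minimal sketch of that selection rule, assuming a simple linear scorer as a hypothetical stand-in for the 1.2M-parameter Q-network:

```python
import numpy as np

def q_value(state, action, params):
    # Hypothetical stand-in for the learned Q-network: a linear scorer.
    return float(params @ np.concatenate([state, action]))

def cem_select_action(state, params, action_dim=4, iters=3, pop=64, elite=6):
    # Cross-entropy method: repeatedly refit a Gaussian to the elite actions.
    mu, sigma = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        samples = np.random.randn(pop, action_dim) * sigma + mu
        scores = np.array([q_value(state, a, params) for a in samples])
        elites = samples[np.argsort(scores)[-elite:]]  # best-scoring actions
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Closed-loop control: re-plan from the newest observation at every step.
state_dim, action_dim = 8, 4
params = np.random.randn(state_dim + action_dim)
state = np.random.randn(state_dim)
for step in range(10):
    action = cem_select_action(state, params, action_dim)
    state = state + 0.1 * np.random.randn(state_dim)  # placeholder dynamics
```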
Deep Joint Source-Channel Coding for Adaptive Image Transmission over MIMO Channels
This paper introduces a vision transformer (ViT)-based deep joint source and
channel coding (DeepJSCC) scheme for wireless image transmission over
multiple-input multiple-output (MIMO) channels, denoted as DeepJSCC-MIMO. We
consider DeepJSCC-MIMO for adaptive image transmission in both open-loop and
closed-loop MIMO systems. The novel DeepJSCC-MIMO architecture surpasses the
classical separation-based benchmarks with robustness to channel estimation
errors and showcases remarkable flexibility in adapting to diverse channel
conditions and antenna numbers without requiring retraining. Specifically, by
harnessing the self-attention mechanism of ViT, DeepJSCC-MIMO intelligently
learns feature mapping and power allocation strategies tailored to the unique
characteristics of the source image and prevailing channel conditions.
Extensive numerical experiments validate the significant improvements in
transmission quality achieved by DeepJSCC-MIMO for both open-loop and
closed-loop MIMO systems across a wide range of scenarios. Moreover,
DeepJSCC-MIMO exhibits robustness to varying channel conditions, channel
estimation errors, and different antenna numbers, making it an appealing
solution for emerging semantic communication systems.
Comment: arXiv admin note: text overlap with arXiv:2210.1534
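The learned ViT encoder cannot be reconstructed from the abstract alone; what can be sketched is the signal model around it: a transmit codeword normalized to a power budget, a Rayleigh-fading MIMO channel, and a classical zero-forcing equalizer of the kind the separation-based baselines rely on. The random `z` below is an assumed stand-in for a DeepJSCC encoder output:

```python
import numpy as np

def normalize_power(symbols, p_avg=1.0):
    # Scale complex symbols so the average power per symbol meets the budget.
    scale = np.sqrt(p_avg * symbols.size / np.sum(np.abs(symbols) ** 2))
    return symbols * scale

def mimo_channel(x, n_rx=2, n_tx=2, snr_db=10.0):
    # y = Hx + n with i.i.d. Rayleigh fading H and complex AWGN.
    H = (np.random.randn(n_rx, n_tx) + 1j * np.random.randn(n_rx, n_tx)) / np.sqrt(2)
    noise_var = 10 ** (-snr_db / 10)
    n = np.sqrt(noise_var / 2) * (
        np.random.randn(n_rx, x.shape[1]) + 1j * np.random.randn(n_rx, x.shape[1]))
    return H @ x + n, H

# A codeword from a hypothetical encoder, mapped across 2 transmit antennas.
z = np.random.randn(2, 128) + 1j * np.random.randn(2, 128)
x = normalize_power(z)
y, H = mimo_channel(x)          # what the decoder would receive
x_hat = np.linalg.pinv(H) @ y   # zero-forcing baseline equalizer
```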
Robustifying the Deployment of tinyML Models for Autonomous mini-vehicles
Standard-size autonomous navigation vehicles have rapidly improved thanks to
the breakthroughs of deep learning. However, scaling autonomous driving to
low-power systems deployed in dynamic environments poses several challenges
that prevent their adoption. To address them, we propose a closed-loop learning
flow for autonomous driving mini-vehicles that includes the target environment
in-the-loop. We leverage a family of compact and high-throughput tinyCNNs to
control the mini-vehicle, which learn in the target environment by imitating a
computer vision algorithm, i.e., the expert. Thus, the tinyCNNs, having
access only to an on-board fast-rate linear camera, gain robustness to lighting
conditions and improve over time. Further, we leverage GAP8, a parallel
ultra-low-power RISC-V SoC, to meet the inference requirements. When running
the family of CNNs, our GAP8 solution outperforms any other implementation on
the STM32L4 and NXP k64f (Cortex-M4), reducing latency by over 13x and energy
consumption by 92%.
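A rough sketch of the closed-loop learning flow with the target environment in-the-loop, written here as a DAgger-style imitation loop (the abstract does not name the exact algorithm, so that framing is an assumption); a linear policy and a toy lane-keeping model stand in for the tinyCNN and the real track:

```python
import numpy as np

def expert_policy(obs):
    # Stand-in for the computer-vision expert: steer toward the lane center.
    return -0.5 * obs[0]  # obs[0] ~ lateral offset from lane center

def rollout(policy_w, steps=50):
    # Drive in a toy environment, recording the observations actually visited.
    obs, traj = np.array([1.0, 0.0]), []
    for _ in range(steps):
        a = float(policy_w @ obs)
        traj.append(obs.copy())
        obs = np.array([obs[0] + 0.1 * a + 0.01 * np.random.randn(), a])
    return traj

# Closed-loop learning: the expert relabels states the student itself visits.
w, X, y = np.zeros(2), [], []
for it in range(5):
    for obs in rollout(w):
        X.append(obs)
        y.append(expert_policy(obs))
    A, b = np.array(X), np.array(y)
    w = np.linalg.lstsq(A, b, rcond=None)[0]  # refit on all aggregated data
```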
Generating Goal-Directed Visuomotor Plans Based on Learning Using a Predictive Coding-type Deep Visuomotor Recurrent Neural Network Model
This paper presents how a predictive-coding-type deep recurrent neural
network can generate vision-based goal-directed plans from prior learning
experience, examining experimental results obtained with a real arm robot. The proposed
deep recurrent neural network learns to predict visuo-proprioceptive sequences
by extracting an adequate predictive model from various visuomotor experiences
related to object-directed behaviors. The predictive model was developed as a
mapping from the intention state space to the space of expected
visuo-proprioceptive sequences through iterative learning. Our arm robot
experiments, conducted on three tasks with different levels of difficulty,
showed that the error minimization principle of the predictive coding
framework, applied to inferring the optimal intention states for given goal
states, can generate goal-directed plans that generalize even to unlearned
goal states. It was, however, shown that sufficient generalization requires a
relatively large number of learning trajectories. The paper discusses possible
countermeasures to overcome this problem.
Comment: 6 pages
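The plan-generation step reduces to gradient descent in the intention space: adjust the intention state until the forward model's predicted outcome matches the goal. A minimal sketch, with a fixed random tanh layer standing in for the learned recurrent forward model:

```python
import numpy as np

# Stand-in forward model: maps an intention state z to a predicted final
# visuo-proprioceptive state (the paper learns this with a recurrent network).
W = np.random.randn(6, 3)
predict = lambda z: np.tanh(W @ z)

def infer_intention(goal, lr=0.1, steps=200):
    # Error minimization: adjust z so the predicted outcome matches the goal.
    z = np.zeros(3)
    for _ in range(steps):
        pred = predict(z)
        err = pred - goal
        grad = W.T @ ((1 - pred ** 2) * err)  # chain rule through tanh
        z -= lr * grad
    return z, float(np.sum((predict(z) - goal) ** 2))

goal = np.tanh(W @ np.array([0.5, -0.2, 0.8]))  # a reachable goal state
z_star, final_err = infer_intention(goal)
```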
Learning a visuomotor controller for real world robotic grasping using simulated depth images
We want to build robots that are useful in unstructured real world
applications, such as doing work in the household. Grasping in particular is an
important skill in this domain, yet it remains a challenge. One of the key
hurdles is handling unexpected changes or motion in the objects being grasped
and kinematic noise or other errors in the robot. This paper proposes an
approach to learning a closed-loop controller for robotic grasping that
dynamically guides the gripper to the object. We use a wrist-mounted sensor to
acquire depth images in front of the gripper and train a convolutional neural
network to learn a distance function to true grasps for grasp configurations
over an image. The training sensor data is generated in simulation, a major
advantage over previous work that uses real robot experience, which is costly
to obtain. Despite being trained in simulation, our approach works well on real
noisy sensor images. We compare our controller in simulated and real robot
experiments to a strong baseline for grasp pose detection, and find that our
approach significantly outperforms the baseline in the presence of kinematic
noise, perceptual errors, and disturbances of the object during grasping.
Comment: 1st Conference on Robot Learning (CoRL), 13-15 November 2017, Mountain View, CA
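A hypothetical sketch of the resulting control rule: the trained CNN assigns each candidate gripper configuration a distance to the nearest true grasp, and the closed-loop controller greedily moves in whichever sampled direction the network scores lowest. Plain Euclidean distance stands in for the network here:

```python
import numpy as np

def distance_to_grasp(pose, target):
    # Hypothetical stand-in for the CNN's distance-to-nearest-true-grasp.
    return float(np.linalg.norm(pose - target))

def servo_step(pose, target, step=0.05, n_candidates=16):
    # Greedy descent: try small displacements, keep the best-scoring one.
    candidates = pose + step * np.random.randn(n_candidates, pose.size)
    scores = [distance_to_grasp(c, target) for c in candidates]
    return candidates[int(np.argmin(scores))]

pose, grasp = np.zeros(3), np.array([0.3, -0.1, 0.2])
for _ in range(40):              # closed loop: re-evaluate at every step
    pose = servo_step(pose, grasp)
```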
An inner-loop free solution to inverse problems using deep neural networks
We propose a new method that uses deep learning techniques to accelerate the
popular alternating direction method of multipliers (ADMM) solution for inverse
problems. The ADMM updates consist of a proximity operator, a least squares
regression that includes a big matrix inversion, and an explicit solution for
updating the dual variables. Typically, inner loops are required to solve the
first two sub-minimization problems due to the intractability of the prior and
the matrix inversion. To avoid such drawbacks or limitations, we propose an
inner-loop free update rule with two pre-trained deep convolutional
architectures. More specifically, we learn a conditional denoising auto-encoder
which imposes an implicit data-dependent prior/regularization on ground-truth
in the first sub-minimization problem. This design follows an empirical
Bayesian strategy, leading to so-called amortized inference. For matrix
inversion in the second sub-problem, we learn a convolutional neural network to
approximate the matrix inversion, i.e., the inverse mapping is learned by
feeding the input through the learned forward network. Note that training this
neural network does not require ground-truth or measurements, i.e., it is
data-independent. Extensive experiments on both synthetic data and real
datasets demonstrate the efficiency and accuracy of the proposed method
compared with the conventional ADMM solution using inner loops for solving
inverse problems.
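Concretely, the loop the abstract describes replaces two inner-loop solves with single network passes. In this runnable sketch on a toy sparse-recovery problem, the paper's conditional denoising auto-encoder is replaced by a soft-threshold and its learned inversion network by an exact linear solve; both substitutions are ours, for illustration only:

```python
import numpy as np

# Toy inverse problem: recover sparse x from y = A x + noise.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 40))
x_true = np.zeros(40)
x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(20)

# Stand-ins for the two learned networks described in the abstract:
denoise = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.05, 0)          # "denoising AE"
solve_ls = lambda rhs, rho: np.linalg.solve(A.T @ A + rho * np.eye(40), rhs)  # "inversion net"

rho, x, z, u = 1.0, np.zeros(40), np.zeros(40), np.zeros(40)
for _ in range(100):
    x = solve_ls(A.T @ y + rho * (z - u), rho)  # least-squares (matrix-inversion) step
    z = denoise(x + u)                          # prior / proximal step
    u = u + x - z                               # dual variable update
```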
Robustness via Retrying: Closed-Loop Robotic Manipulation with Self-Supervised Learning
Prediction is an appealing objective for self-supervised learning of
behavioral skills, particularly for autonomous robots. However, effectively
utilizing predictive models for control, especially with raw image inputs,
poses a number of major challenges. How should the predictions be used? What
happens when they are inaccurate? In this paper, we tackle these questions by
proposing a method for learning robotic skills from raw image observations,
using only autonomously collected experience. We show that even an imperfect
model can complete complex tasks if it can continuously retry, but this
requires the model to not lose track of the objective (e.g., the object of
interest). To enable a robot to continuously retry a task, we devise a
self-supervised algorithm for learning image registration, which can keep track
of objects of interest for the duration of the trial. We demonstrate that this
idea can be combined with a video-prediction based controller to enable complex
behaviors to be learned from scratch using only raw visual inputs, including
grasping, repositioning objects, and non-prehensile manipulation. Our
real-world experiments demonstrate that a model trained with 160 robot hours of
autonomously collected, unlabeled data is able to successfully perform complex
manipulation tasks with a wide range of objects not seen during training.
Comment: accepted at the Conference on Robot Learning (CoRL) 2018
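The tracking primitive that enables retrying can be sketched in its simplest classical form: locate the crop around the object of interest in each new frame by exhaustive template matching. The paper's contribution is a self-supervised learned registration; the hand-coded SSD search below is only a stand-in for it:

```python
import numpy as np

def register(prev_crop, frame, center, search=4):
    # Locate the tracked crop in the new frame by exhaustive SSD matching.
    h, w = prev_crop.shape
    cy, cx = center
    best, best_c = np.inf, center
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = cy + dy, cx + dx
            err = float(np.sum((frame[y:y + h, x:x + w] - prev_crop) ** 2))
            if err < best:
                best, best_c = err, (y, x)
    return best_c

rng = np.random.default_rng(1)
frame0 = rng.random((48, 48))
obj = frame0[20:28, 20:28]                            # object of interest at t=0
frame1 = np.roll(frame0, shift=(2, -1), axis=(0, 1))  # object moved between frames
print(register(obj, frame1, (20, 20)))                # -> (22, 19)
```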
Highly Efficient Regression for Scalable Person Re-Identification
Existing person re-identification models scale poorly to the large data
required in real-world applications due to: (1) Complexity: They employ
complex models for optimal performance resulting in high computational cost for
training at a large scale; (2) Inadaptability: Once trained, they are
unsuitable for incremental updates to incorporate newly available data. This
work proposes a truly scalable solution to re-id by addressing both problems.
Specifically, a Highly Efficient Regression (HER) model is formulated by
embedding Fisher's criterion into a ridge regression model for very fast
re-id model learning with scalable memory/storage usage. Importantly, this new
HER model supports faster-than-real-time incremental model updates, therefore
making real-time active learning with a human in the loop feasible in re-id.
Extensive experiments show that such a simple and fast model not only
notably outperforms state-of-the-art re-id methods, but is also more
scalable to large data, with additional benefits for active learning in
reducing human labelling effort during re-id deployment.
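The incremental-update property follows from ridge regression's closed form: keep the sufficient statistics X^T X and X^T Y and fold new batches into them. The minimal sketch below omits the Fisher's-criterion embedding, so it is plain ridge regression rather than HER itself:

```python
import numpy as np

class IncrementalRidge:
    """Ridge regression kept as sufficient statistics, so new data folds in
    with a cheap update instead of retraining from scratch."""
    def __init__(self, d, k, lam=1.0):
        self.S = lam * np.eye(d)    # accumulates X^T X + lam * I
        self.C = np.zeros((d, k))   # accumulates X^T Y

    def update(self, X, Y):
        self.S += X.T @ X
        self.C += X.T @ Y

    @property
    def W(self):
        # Closed-form solution: (X^T X + lam I)^{-1} X^T Y
        return np.linalg.solve(self.S, self.C)

model = IncrementalRidge(d=16, k=4)
for _ in range(3):                  # batches arriving over time
    X = np.random.randn(100, 16)
    Y = np.random.randn(100, 4)
    model.update(X, Y)              # incremental: old rows never revisited
W = model.W
```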
FutureMapping: The Computational Structure of Spatial AI Systems
We discuss and predict the evolution of Simultaneous Localisation and Mapping
(SLAM) into a general geometric and semantic `Spatial AI' perception capability
for intelligent embodied devices. A big gap remains between the visual
perception performance that devices such as augmented reality eyewear or
consumer robots will require and what is possible within the constraints
imposed by real products. Co-design of algorithms, processors and sensors will
be needed. We explore the computational structure of current and future Spatial
AI algorithms and consider this within the landscape of ongoing hardware
developments.
Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection
We describe a learning-based approach to hand-eye coordination for robotic
grasping from monocular images. To learn hand-eye coordination for grasping, we
trained a large convolutional neural network to predict the probability that
task-space motion of the gripper will result in successful grasps, using only
monocular camera images and independently of camera calibration or the current
robot pose. This requires the network to observe the spatial relationship
between the gripper and objects in the scene, thus learning hand-eye
coordination. We then use this network to servo the gripper in real time to
achieve successful grasps. To train our network, we collected over 800,000
grasp attempts over the course of two months, using between 6 and 14 robotic
manipulators at any given time, with differences in camera placement and
hardware. Our experimental evaluation demonstrates that our method achieves
effective real-time control, can successfully grasp novel objects, and corrects
mistakes by continuous servoing.
Comment: This is an extended version of "Learning Hand-Eye Coordination for
Robotic Grasping with Large-Scale Data Collection," ISER 2016. Draft modified
to correct a typo in Algorithm 1 and add a link to the publicly available
dataset.
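A hypothetical sketch of the servoing loop: the CNN scores candidate task-space motions by predicted grasp success from the newest monocular frame, and the controller either executes the best motion or, when staying put already scores nearly as well, closes the gripper (the 0.9 threshold is our assumption). A logistic scorer stands in for the large network:

```python
import numpy as np

def grasp_success_prob(image_feat, motion, w):
    # Stand-in for the grasp-prediction CNN g(I, v) -> P(success).
    s = float(w @ np.concatenate([image_feat, motion]))
    return 1.0 / (1.0 + np.exp(-s))

def servo_step(image_feat, w, motion_dim=3, n_samples=32):
    # Score sampled task-space motions; decide whether to move or grasp now.
    candidates = 0.05 * np.random.randn(n_samples, motion_dim)
    probs = [grasp_success_prob(image_feat, v, w) for v in candidates]
    best = int(np.argmax(probs))
    stay = grasp_success_prob(image_feat, np.zeros(motion_dim), w)
    if stay > 0.9 * probs[best]:   # grasping now looks nearly as good
        return None                # signal: close the gripper
    return candidates[best]

w = np.random.randn(8 + 3)
for step in range(20):             # closed loop over incoming frames
    image_feat = np.random.randn(8)  # features of the newest camera image
    v = servo_step(image_feat, w)
    if v is None:
        break                      # execute the grasp
```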