On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach
We revisit the problem of visual depth estimation in the context of
autonomous vehicles. Despite the progress on monocular depth estimation in
recent years, we show that the gap between monocular and stereo depth accuracy
remains large, a particularly relevant result due to the prevalent reliance
upon monocular cameras by vehicles that are expected to be self-driving. We
argue that the challenges of removing this gap are significant, owing to
fundamental limitations of monocular vision. As a result, we focus our efforts
on depth estimation by stereo. We propose a novel semi-supervised learning
approach to training a deep stereo neural network, along with a novel
architecture containing a machine-learned argmax layer and a custom runtime
(that will be shared publicly) that enables a smaller version of our stereo DNN
to run on an embedded GPU. Competitive results are shown on the KITTI 2015
stereo dataset. We also evaluate the recent progress of stereo algorithms by
measuring the impact of various design criteria on accuracy.
Comment: CVPR 2018 Workshop on Autonomous Driving. For video, see
https://youtu.be/0FPQdVOYoA
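The machine-learned argmax layer mentioned above replaces hard winner-take-all disparity selection with a differentiable operation. As a rough illustration only, and not the paper's actual layer, a minimal PyTorch sketch of the standard soft-argmax over a disparity cost volume, which such a learned layer generalizes, might look like this:

    # Hypothetical sketch of a differentiable soft-argmax over a disparity
    # cost volume; the paper's learned argmax layer is more general.
    import torch

    def soft_argmax_disparity(cost_volume: torch.Tensor) -> torch.Tensor:
        """cost_volume: [B, D, H, W] matching costs. Returns [B, H, W] disparities."""
        probs = torch.softmax(-cost_volume, dim=1)   # lower cost -> higher weight
        disp = torch.arange(cost_volume.shape[1],
                            dtype=cost_volume.dtype,
                            device=cost_volume.device).view(1, -1, 1, 1)
        return (probs * disp).sum(dim=1)             # expected disparity per pixel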
Toward Low-Flying Autonomous MAV Trail Navigation using Deep Neural Networks for Environmental Awareness
We present a micro aerial vehicle (MAV) system, built with inexpensive
off-the-shelf hardware, for autonomously following trails in unstructured,
outdoor environments such as forests. The system introduces a deep neural
network (DNN) called TrailNet for estimating the view orientation and lateral
offset of the MAV with respect to the trail center. The DNN-based controller
achieves stable flight without oscillations by avoiding overconfident behavior
through a loss function that includes both label smoothing and an entropy reward.
In addition to the TrailNet DNN, the system also utilizes vision modules for
environmental awareness, including another DNN for object detection and a
visual odometry component for estimating depth for the purpose of low-level
obstacle detection. All vision systems run in real time on board the MAV via a
Jetson TX1. We provide details on the hardware and software used, as well as
implementation details. We present experiments showing the ability of our
system to navigate forest trails more robustly than previous techniques,
including autonomous flights of 1 km.
Comment: 7 pages, 9 figures, IROS 2017 conference submission 1657; accompanying
videos are posted on YouTube at https://www.youtube.com/watch?v=H7Ym3DMSGms
and https://www.youtube.com/watch?v=USYlt9t0lZ
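The abstract attributes the controller's stability to a loss that combines label smoothing with an entropy reward. A minimal sketch of such a loss is shown below; the function name and hyperparameter values are assumptions, not TrailNet's exact formulation:

    # Hypothetical sketch: cross-entropy with label smoothing plus an entropy
    # reward that penalizes overconfident predictions. Hyperparameters are guesses.
    import torch
    import torch.nn.functional as F

    def smoothed_entropy_loss(logits, targets, smoothing=0.1, entropy_weight=0.01):
        n_classes = logits.shape[-1]
        log_probs = F.log_softmax(logits, dim=-1)
        smooth = torch.full_like(log_probs, smoothing / (n_classes - 1))
        smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)
        cross_entropy = -(smooth * log_probs).sum(dim=-1).mean()
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        return cross_entropy - entropy_weight * entropy  # reward high entropy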
Falling Things: A Synthetic Dataset for 3D Object Detection and Pose Estimation
We present a new dataset, called Falling Things (FAT), for advancing the
state-of-the-art in object detection and 3D pose estimation in the context of
robotics. By synthetically combining object models and backgrounds of complex
composition and high graphical quality, we are able to generate photorealistic
images with accurate 3D pose annotations for all objects in all images. Our
dataset contains 60k annotated photos of 21 household objects taken from the
YCB dataset. For each image, we provide the 3D poses, per-pixel class
segmentation, and 2D/3D bounding box coordinates for all objects. To facilitate
testing different input modalities, we provide mono and stereo RGB images,
along with registered dense depth images. We describe in detail the generation
process and statistical analysis of the data.
Comment: CVPR 2018 Workshop on Real World Challenges and New Benchmarks for
Deep Learning in Robotic Vision
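For concreteness, the per-image annotations listed above could be organized along the following lines; this is a hypothetical schema for illustration only, not the dataset's actual file format:

    # Hypothetical record illustrating the annotation types described in the
    # abstract; the actual FAT file format may differ.
    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class ObjectAnnotation:
        class_name: str            # one of the 21 YCB household objects
        pose: np.ndarray           # 4x4 homogeneous 3D object pose
        bbox_2d: np.ndarray        # [x_min, y_min, x_max, y_max] in pixels
        bbox_3d: np.ndarray        # 8x3 corners of the 3D bounding box

    @dataclass
    class FrameAnnotation:
        rgb_left: np.ndarray       # stereo left RGB image
        rgb_right: np.ndarray      # stereo right RGB image
        depth: np.ndarray          # registered dense depth image
        segmentation: np.ndarray   # per-pixel class labels
        objects: List[ObjectAnnotation]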
Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects
Using synthetic data for training deep neural networks for robotic
manipulation holds the promise of an almost unlimited amount of pre-labeled
training data, generated safely out of harm's way. One of the key challenges of
synthetic data, to date, has been to bridge the so-called reality gap, so that
networks trained on synthetic data operate correctly when exposed to real-world
data. We explore the reality gap in the context of 6-DoF pose estimation of
known objects from a single RGB image. We show that for this problem the
reality gap can be successfully spanned by a simple combination of domain
randomized and photorealistic data. Using synthetic data generated in this
manner, we introduce a one-shot deep neural network that is able to perform
competitively against a state-of-the-art network trained on a combination of
real and synthetic data. To our knowledge, this is the first deep network
trained only on synthetic data that is able to achieve state-of-the-art
performance on 6-DoF object pose estimation. Our network also generalizes
better to novel environments including extreme lighting conditions, for which
we show qualitative results. Using this network we demonstrate a real-time
system estimating object poses with sufficient accuracy for real-world semantic
grasping of known household objects in clutter by a real robot.
Comment: Conference on Robot Learning (CoRL) 201
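The key training ingredient described above is a simple mixture of domain-randomized and photorealistic synthetic images. A minimal sketch of such a mixture is given below; the 50/50 sampling ratio is an assumption, not the paper's setting:

    # Hypothetical sketch: draw each training sample from one of two synthetic
    # sources (domain-randomized or photorealistic). Ratio is an assumption.
    import random

    def mixed_batch(domain_randomized, photorealistic, batch_size=32, p_dr=0.5):
        """domain_randomized / photorealistic: sequences of (image, label) pairs."""
        batch = []
        for _ in range(batch_size):
            source = domain_randomized if random.random() < p_dr else photorealistic
            batch.append(random.choice(source))
        return batch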
Efficient Hierarchical Graph-Based Segmentation of RGBD Videos
We present an efficient and scalable algorithm for segmenting 3D RGBD point
clouds by combining depth, color, and temporal information using a multistage,
hierarchical graph-based approach. Our algorithm processes a moving window over
several point clouds to group similar regions over a graph, resulting in an
initial over-segmentation. These regions are then merged to yield a dendrogram
using agglomerative clustering via a minimum spanning tree algorithm. Bipartite
graph matching at a given level of the hierarchical tree yields the final
segmentation of the point clouds by maintaining region identities over
arbitrarily long periods of time. We show that a multistage segmentation with
depth then color yields better results than a linear combination of depth and
color. Due to its incremental processing, our algorithm can process videos of
any length and in a streaming pipeline. The algorithm's ability to produce
robust, efficient segmentation is demonstrated with numerous experimental
results on challenging sequences from our own as well as public RGBD data sets.
Comment: CVPR 201
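The agglomerative clustering step described above merges over-segmented regions along a minimum spanning tree. A minimal union-find sketch that merges regions in order of increasing edge weight is shown below; the fixed merge threshold is an assumption for illustration:

    # Hypothetical sketch: agglomerative merging of regions along MST edges
    # sorted by dissimilarity, using union-find. The merge threshold is a guess.

    def merge_regions(num_regions, edges, threshold=0.5):
        """edges: list of (weight, region_a, region_b) dissimilarity edges."""
        parent = list(range(num_regions))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        for weight, a, b in sorted(edges):
            if weight > threshold:
                break                            # remaining edges are too dissimilar
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra                  # merge the two regions
        return [find(i) for i in range(num_regions)]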
Hierarchical Planning for Long-Horizon Manipulation with Geometric and Symbolic Scene Graphs
We present a visually grounded hierarchical planning algorithm for
long-horizon manipulation tasks. Our algorithm offers a joint framework of
neuro-symbolic task planning and low-level motion generation conditioned on the
specified goal. At the core of our approach is a two-level scene graph
representation, namely geometric scene graph and symbolic scene graph. This
hierarchical representation serves as a structured, object-centric abstraction
of manipulation scenes. Our model uses graph neural networks to process these
scene graphs for predicting high-level task plans and low-level motions. We
demonstrate that our method scales to long-horizon tasks and generalizes well
to novel task goals. We validate our method in a kitchen storage task in both
physical simulation and the real world. Our experiments show that our method
achieves a success rate of over 70% and a subgoal completion rate of nearly 90%
on the real robot, while being four orders of magnitude faster in computation
time than a standard search-based task-and-motion planner.
Comment: Accepted to ICRA 202
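The graph neural networks over the scene graphs mentioned above can be summarized by a single message-passing step. A minimal sketch follows; the feature sizes, MLPs, and aggregation choice are assumptions, not the paper's architecture:

    # Hypothetical sketch of one message-passing step over a scene graph;
    # feature sizes and MLPs are assumptions, not the paper's model.
    import torch
    import torch.nn as nn

    class SceneGraphLayer(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.message = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
            self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

        def forward(self, node_feats, edges):
            """node_feats: [N, dim]; edges: [E, 2] (source, target) index pairs."""
            src, dst = edges[:, 0], edges[:, 1]
            msgs = self.message(torch.cat([node_feats[src], node_feats[dst]], dim=-1))
            agg = torch.zeros_like(node_feats).index_add_(0, dst, msgs)
            return self.update(torch.cat([node_feats, agg], dim=-1))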
Indirect Object-to-Robot Pose Estimation from an External Monocular RGB Camera
We present a robotic grasping system that uses a single external monocular
RGB camera as input. The object-to-robot pose is computed indirectly by
combining the output of two neural networks: one that estimates the
object-to-camera pose, and another that estimates the robot-to-camera pose.
Both networks are trained entirely on synthetic data, relying on domain
randomization to bridge the sim-to-real gap. Because the latter network
performs online camera calibration, the camera can be moved freely during
execution without affecting the quality of the grasp. Experimental results
analyze the effect of camera placement, image resolution, and pose refinement
in the context of grasping several household objects. We also present results
on a new set of 28 textured household toy grocery objects, which have been
selected to be accessible to other researchers. To aid reproducibility of the
research, we offer 3D scanned textured models, along with pre-trained weights
for pose estimation.
Comment: IROS 2020. Video at https://youtu.be/E0J91llX-y
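The indirect computation described above amounts to composing the two estimated rigid transforms. A minimal sketch using 4x4 homogeneous matrices is shown below; the function and variable names are illustrative:

    # Hypothetical sketch: compose the two network outputs to obtain the
    # object-to-robot transform from 4x4 homogeneous matrices.
    import numpy as np

    def object_to_robot(T_camera_object: np.ndarray,
                        T_camera_robot: np.ndarray) -> np.ndarray:
        """Both inputs map points into the camera frame; the result maps
        points from the object frame into the robot frame."""
        T_robot_camera = np.linalg.inv(T_camera_robot)
        return T_robot_camera @ T_camera_object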
Robust Learning of Tactile Force Estimation through Robot Interaction
Current methods for estimating force from tactile sensor signals are either
inaccurate analytic models or task-specific learned models. In this paper, we
explore learning a robust model that maps tactile sensor signals to force. We
specifically explore learning a mapping for the SynTouch BioTac sensor via
neural networks. We propose a voxelized input feature layer for spatial signals
and leverage information about the sensor surface to regularize the loss
function. To learn a robust tactile force model that transfers across tasks, we
generate ground truth data from three different sources: (1) the BioTac rigidly
mounted to a force-torque (FT) sensor, (2) a robot interacting with a ball
rigidly attached to the same FT sensor, and (3) through force inference on a
planar pushing task by formalizing the mechanics as a system of particles and
optimizing over the object motion. A total of 140k samples were collected from
the three sources. We achieve a median angular accuracy of 3.5 degrees in
predicting force direction (66% improvement over the current state of the art)
and a median magnitude accuracy of 0.06 N (93% improvement) on a test dataset.
Additionally, we evaluate the learned force model in a force feedback grasp
controller performing object lifting and gentle placement. Our results can be
found at https://sites.google.com/view/tactile-force.
Comment: Accepted to ICRA 2019 (camera-ready version)
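The voxelized input feature layer mentioned above rasterizes spatially distributed sensor signals onto a 3D grid. A minimal sketch follows; the electrode coordinates, grid resolution, and bounds are assumptions, not the BioTac's actual geometry:

    # Hypothetical sketch: scatter per-electrode signals into a voxel grid.
    # Electrode positions, grid resolution, and bounds are assumptions.
    import numpy as np

    def voxelize_signals(positions, values, grid_size=8, bounds=1.0):
        """positions: [N, 3] electrode coordinates in [-bounds, bounds];
        values: [N] signals. Returns a [grid_size]^3 feature volume."""
        grid = np.zeros((grid_size, grid_size, grid_size), dtype=np.float32)
        idx = ((positions + bounds) / (2 * bounds) * grid_size).astype(int)
        idx = np.clip(idx, 0, grid_size - 1)
        for (i, j, k), v in zip(idx, values):
            grid[i, j, k] += v
        return grid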
Few-Shot Viewpoint Estimation
Viewpoint estimation for known categories of objects has been improved
significantly thanks to deep networks and large datasets, but generalization to
unknown categories is still very challenging. With an aim towards improving
performance on unknown categories, we introduce the problem of category-level
few-shot viewpoint estimation. We design a novel framework to successfully
train viewpoint networks for new categories with few examples (10 or fewer). We
formulate the problem as one of learning to estimate category-specific 3D
canonical shapes, their associated depth estimates, and semantic 2D keypoints.
We apply meta-learning to learn weights for our network that are amenable to
category-specific few-shot fine-tuning. Furthermore, we design a flexible
meta-Siamese network that maximizes information sharing during meta-learning.
Through extensive experimentation on the ObjectNet3D and Pascal3D+ benchmark
datasets, we demonstrate that our framework, which we call MetaView,
significantly outperforms fine-tuning the state-of-the-art models with few
examples, and that the specific architectural innovations of our method are
crucial to achieving good performance.
Comment: BMVC 201
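The meta-learning step described above produces weights that adapt quickly to a new category from a handful of examples. As a rough illustration only, a generic few-shot fine-tuning loop (not MetaView's exact procedure; the optimizer and step count are assumptions) could look like this:

    # Hypothetical sketch: few-shot adaptation by fine-tuning a copy of a
    # meta-trained network on a handful of support examples.
    import copy
    import torch

    def few_shot_adapt(meta_model, loss_fn, support_x, support_y, lr=0.01, steps=10):
        model = copy.deepcopy(meta_model)          # keep meta-trained weights intact
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(model(support_x), support_y)
            loss.backward()
            opt.step()
        return model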
Multi-View Fusion for Multi-Level Robotic Scene Understanding
We present a system for multi-level scene awareness for robotic manipulation.
Given a sequence of camera-in-hand RGB images, the system calculates three
types of information: 1) a point cloud representation of all the surfaces in
the scene, for the purpose of obstacle avoidance; 2) the rough pose of unknown
objects from categories corresponding to primitive shapes (e.g., cuboids and
cylinders); and 3) full 6-DoF pose of known objects. By developing and fusing
recent techniques in these domains, we provide a rich scene representation for
robot awareness. We demonstrate the importance of each of these modules, their
complementary nature, and the potential benefits of the system in the context
of robotic manipulation.
Comment: Presented at IROS 2021. Video is at https://youtu.be/FuqMxuODGl
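The point-cloud component described above fuses depth from multiple camera-in-hand views into a common frame. A minimal sketch is given below; the intrinsics and pose conventions are assumptions, not the system's actual pipeline:

    # Hypothetical sketch: back-project depth images from several views and
    # accumulate the points in a common world frame. Intrinsics/poses assumed known.
    import numpy as np

    def fuse_depth_views(depth_images, intrinsics, camera_to_world_poses):
        """depth_images: list of [H, W] arrays (meters); intrinsics: 3x3 K;
        camera_to_world_poses: list of 4x4 transforms. Returns [M, 3] points."""
        fx, fy = intrinsics[0, 0], intrinsics[1, 1]
        cx, cy = intrinsics[0, 2], intrinsics[1, 2]
        all_points = []
        for depth, T in zip(depth_images, camera_to_world_poses):
            v, u = np.nonzero(depth > 0)                  # valid pixels
            z = depth[v, u]
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # [N, 4]
            all_points.append((T @ pts_cam.T).T[:, :3])              # world frame
        return np.concatenate(all_points, axis=0)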