On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach
We revisit the problem of visual depth estimation in the context of
autonomous vehicles. Despite the progress on monocular depth estimation in
recent years, we show that the gap between monocular and stereo depth accuracy
remains large, a particularly relevant result due to the prevalent reliance
upon monocular cameras by vehicles that are expected to be self-driving. We
argue that the challenges of removing this gap are significant, owing to
fundamental limitations of monocular vision. As a result, we focus our efforts
on depth estimation by stereo. We propose a novel semi-supervised learning
approach to training a deep stereo neural network, along with a novel
architecture containing a machine-learned argmax layer and a custom runtime
(that will be shared publicly) that enables a smaller version of our stereo DNN
to run on an embedded GPU. Competitive results are shown on the KITTI 2015
stereo dataset. We also evaluate the recent progress of stereo algorithms by
measuring the impact of various design criteria on accuracy.
Comment: CVPR 2018 Workshop on Autonomous Driving. For video, see
https://youtu.be/0FPQdVOYoA
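The machine-learned argmax layer mentioned above replaces hard winner-take-all disparity selection with a differentiable operation. As a rough illustration only, and not the paper's actual layer, a minimal PyTorch sketch of the standard soft-argmax over a disparity cost volume, which such a learned layer generalizes, might look like this:

    # Hypothetical sketch of a differentiable soft-argmax over a disparity
    # cost volume; the paper's learned argmax layer is more general.
    import torch

    def soft_argmax_disparity(cost_volume: torch.Tensor) -> torch.Tensor:
        """cost_volume: [B, D, H, W] matching costs. Returns [B, H, W] disparities."""
        probs = torch.softmax(-cost_volume, dim=1)   # lower cost -> higher weight
        disp = torch.arange(cost_volume.shape[1],
                            dtype=cost_volume.dtype,
                            device=cost_volume.device).view(1, -1, 1, 1)
        return (probs * disp).sum(dim=1)             # expected disparity per pixel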
Toward Low-Flying Autonomous MAV Trail Navigation using Deep Neural Networks for Environmental Awareness
We present a micro aerial vehicle (MAV) system, built with inexpensive
off-the-shelf hardware, for autonomously following trails in unstructured,
outdoor environments such as forests. The system introduces a deep neural
network (DNN) called TrailNet for estimating the view orientation and lateral
offset of the MAV with respect to the trail center. The DNN-based controller
achieves stable flight without oscillations by avoiding overconfident behavior
through a loss function that includes both label smoothing and an entropy reward.
In addition to the TrailNet DNN, the system also utilizes vision modules for
environmental awareness, including another DNN for object detection and a
visual odometry component for estimating depth for the purpose of low-level
obstacle detection. All vision systems run in real time on board the MAV via a
Jetson TX1. We provide details on the hardware and software used, as well as
implementation details. We present experiments showing the ability of our
system to navigate forest trails more robustly than previous techniques,
including autonomous flights of 1 km.
Comment: 7 pages, 9 figures, IROS 2017 conference submission 1657; accompanying
videos are posted on YouTube at https://www.youtube.com/watch?v=H7Ym3DMSGms
and https://www.youtube.com/watch?v=USYlt9t0lZ
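The abstract attributes the controller's stability to a loss that combines label smoothing with an entropy reward. A minimal sketch of such a loss is shown below; the function name and hyperparameter values are assumptions, not TrailNet's exact formulation:

    # Hypothetical sketch: cross-entropy with label smoothing plus an entropy
    # reward that penalizes overconfident predictions. Hyperparameters are guesses.
    import torch
    import torch.nn.functional as F

    def smoothed_entropy_loss(logits, targets, smoothing=0.1, entropy_weight=0.01):
        n_classes = logits.shape[-1]
        log_probs = F.log_softmax(logits, dim=-1)
        smooth = torch.full_like(log_probs, smoothing / (n_classes - 1))
        smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)
        cross_entropy = -(smooth * log_probs).sum(dim=-1).mean()
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        return cross_entropy - entropy_weight * entropy  # reward high entropy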
Falling Things: A Synthetic Dataset for 3D Object Detection and Pose Estimation
We present a new dataset, called Falling Things (FAT), for advancing the
state-of-the-art in object detection and 3D pose estimation in the context of
robotics. By synthetically combining object models and backgrounds of complex
composition and high graphical quality, we are able to generate photorealistic
images with accurate 3D pose annotations for all objects in all images. Our
dataset contains 60k annotated photos of 21 household objects taken from the
YCB dataset. For each image, we provide the 3D poses, per-pixel class
segmentation, and 2D/3D bounding box coordinates for all objects. To facilitate
testing different input modalities, we provide mono and stereo RGB images,
along with registered dense depth images. We describe in detail the generation
process and statistical analysis of the data.
Comment: CVPR 2018 Workshop on Real World Challenges and New Benchmarks for
Deep Learning in Robotic Vision
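For concreteness, the per-image annotations listed above could be organized along the following lines; this is a hypothetical schema for illustration only, not the dataset's actual file format:

    # Hypothetical record illustrating the annotation types described in the
    # abstract; the actual FAT file format may differ.
    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class ObjectAnnotation:
        class_name: str            # one of the 21 YCB household objects
        pose: np.ndarray           # 4x4 homogeneous 3D object pose
        bbox_2d: np.ndarray        # [x_min, y_min, x_max, y_max] in pixels
        bbox_3d: np.ndarray        # 8x3 corners of the 3D bounding box

    @dataclass
    class FrameAnnotation:
        rgb_left: np.ndarray       # stereo left RGB image
        rgb_right: np.ndarray      # stereo right RGB image
        depth: np.ndarray          # registered dense depth image
        segmentation: np.ndarray   # per-pixel class labels
        objects: List[ObjectAnnotation]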
Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects
Using synthetic data for training deep neural networks for robotic
manipulation holds the promise of an almost unlimited amount of pre-labeled
training data, generated safely out of harm's way. One of the key challenges of
synthetic data, to date, has been to bridge the so-called reality gap, so that
networks trained on synthetic data operate correctly when exposed to real-world
data. We explore the reality gap in the context of 6-DoF pose estimation of
known objects from a single RGB image. We show that for this problem the
reality gap can be successfully spanned by a simple combination of domain
randomized and photorealistic data. Using synthetic data generated in this
manner, we introduce a one-shot deep neural network that is able to perform
competitively against a state-of-the-art network trained on a combination of
real and synthetic data. To our knowledge, this is the first deep network
trained only on synthetic data that is able to achieve state-of-the-art
performance on 6-DoF object pose estimation. Our network also generalizes
better to novel environments including extreme lighting conditions, for which
we show qualitative results. Using this network we demonstrate a real-time
system estimating object poses with sufficient accuracy for real-world semantic
grasping of known household objects in clutter by a real robot.
Comment: Conference on Robot Learning (CoRL) 201
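The key training ingredient described above is a simple mixture of domain-randomized and photorealistic synthetic images. A minimal sketch of such a mixture is given below; the 50/50 sampling ratio is an assumption, not the paper's setting:

    # Hypothetical sketch: draw each training sample from one of two synthetic
    # sources (domain-randomized or photorealistic). Ratio is an assumption.
    import random

    def mixed_batch(domain_randomized, photorealistic, batch_size=32, p_dr=0.5):
        """domain_randomized / photorealistic: sequences of (image, label) pairs."""
        batch = []
        for _ in range(batch_size):
            source = domain_randomized if random.random() < p_dr else photorealistic
            batch.append(random.choice(source))
        return batch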
Efficient Hierarchical Graph-Based Segmentation of RGBD Videos
We present an efficient and scalable algorithm for segmenting 3D RGBD point
clouds by combining depth, color, and temporal information using a multistage,
hierarchical graph-based approach. Our algorithm processes a moving window over
several point clouds to group similar regions over a graph, resulting in an
initial over-segmentation. These regions are then merged to yield a dendrogram
using agglomerative clustering via a minimum spanning tree algorithm. Bipartite
graph matching at a given level of the hierarchical tree yields the final
segmentation of the point clouds by maintaining region identities over
arbitrarily long periods of time. We show that a multistage segmentation with
depth then color yields better results than a linear combination of depth and
color. Due to its incremental processing, our algorithm can process videos of
any length and in a streaming pipeline. The algorithm's ability to produce
robust, efficient segmentation is demonstrated with numerous experimental
results on challenging sequences from our own as well as public RGBD data sets.
Comment: CVPR 201
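The agglomerative clustering step described above merges over-segmented regions along a minimum spanning tree. A minimal union-find sketch that merges regions in order of increasing edge weight is shown below; the fixed merge threshold is an assumption for illustration:

    # Hypothetical sketch: agglomerative merging of regions along MST edges
    # sorted by dissimilarity, using union-find. The merge threshold is a guess.

    def merge_regions(num_regions, edges, threshold=0.5):
        """edges: list of (weight, region_a, region_b) dissimilarity edges."""
        parent = list(range(num_regions))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        for weight, a, b in sorted(edges):
            if weight > threshold:
                break                            # remaining edges are too dissimilar
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra                  # merge the two regions
        return [find(i) for i in range(num_regions)]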
Hierarchical Planning for Long-Horizon Manipulation with Geometric and Symbolic Scene Graphs
We present a visually grounded hierarchical planning algorithm for
long-horizon manipulation tasks. Our algorithm offers a joint framework of
neuro-symbolic task planning and low-level motion generation conditioned on the
specified goal. At the core of our approach is a two-level scene graph
representation, namely geometric scene graph and symbolic scene graph. This
hierarchical representation serves as a structured, object-centric abstraction
of manipulation scenes. Our model uses graph neural networks to process these
scene graphs for predicting high-level task plans and low-level motions. We
demonstrate that our method scales to long-horizon tasks and generalizes well
to novel task goals. We validate our method in a kitchen storage task in both
physical simulation and the real world. Our experiments show that our method
achieves a success rate of over 70% and a subgoal completion rate of nearly 90%
on the real robot, while being four orders of magnitude faster in computation
time than a standard search-based task-and-motion planner.
Comment: Accepted to ICRA 202
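The graph neural networks over the scene graphs mentioned above can be summarized by a single message-passing step. A minimal sketch follows; the feature sizes, MLPs, and aggregation choice are assumptions, not the paper's architecture:

    # Hypothetical sketch of one message-passing step over a scene graph;
    # feature sizes and MLPs are assumptions, not the paper's model.
    import torch
    import torch.nn as nn

    class SceneGraphLayer(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.message = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
            self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

        def forward(self, node_feats, edges):
            """node_feats: [N, dim]; edges: [E, 2] (source, target) index pairs."""
            src, dst = edges[:, 0], edges[:, 1]
            msgs = self.message(torch.cat([node_feats[src], node_feats[dst]], dim=-1))
            agg = torch.zeros_like(node_feats).index_add_(0, dst, msgs)
            return self.update(torch.cat([node_feats, agg], dim=-1))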
Indirect Object-to-Robot Pose Estimation from an External Monocular RGB Camera
We present a robotic grasping system that uses a single external monocular
RGB camera as input. The object-to-robot pose is computed indirectly by
combining the output of two neural networks: one that estimates the
object-to-camera pose, and another that estimates the robot-to-camera pose.
Both networks are trained entirely on synthetic data, relying on domain
randomization to bridge the sim-to-real gap. Because the latter network
performs online camera calibration, the camera can be moved freely during
execution without affecting the quality of the grasp. Experimental results
analyze the effect of camera placement, image resolution, and pose refinement
in the context of grasping several household objects. We also present results
on a new set of 28 textured household toy grocery objects, which have been
selected to be accessible to other researchers. To aid reproducibility of the
research, we offer 3D scanned textured models, along with pre-trained weights
for pose estimation.
Comment: IROS 2020. Video at https://youtu.be/E0J91llX-y
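The indirect computation described above amounts to composing the two estimated rigid transforms. A minimal sketch using 4x4 homogeneous matrices is shown below; the function and variable names are illustrative:

    # Hypothetical sketch: compose the two network outputs to obtain the
    # object-to-robot transform from 4x4 homogeneous matrices.
    import numpy as np

    def object_to_robot(T_camera_object: np.ndarray,
                        T_camera_robot: np.ndarray) -> np.ndarray:
        """Both inputs map points into the camera frame; the result maps
        points from the object frame into the robot frame."""
        T_robot_camera = np.linalg.inv(T_camera_robot)
        return T_robot_camera @ T_camera_object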
Robust Learning of Tactile Force Estimation through Robot Interaction
Current methods for estimating force from tactile sensor signals are either
inaccurate analytic models or task-specific learned models. In this paper, we
explore learning a robust model that maps tactile sensor signals to force. We
specifically explore learning a mapping for the SynTouch BioTac sensor via
neural networks. We propose a voxelized input feature layer for spatial signals
and leverage information about the sensor surface to regularize the loss
function. To learn a robust tactile force model that transfers across tasks, we
generate ground truth data from three different sources: (1) the BioTac rigidly
mounted to a force-torque (FT) sensor, (2) a robot interacting with a ball
rigidly attached to the same FT sensor, and (3) through force inference on a
planar pushing task by formalizing the mechanics as a system of particles and
optimizing over the object motion. A total of 140k samples were collected from
the three sources. We achieve a median angular accuracy of 3.5 degrees in
predicting force direction (66% improvement over the current state of the art)
and a median magnitude accuracy of 0.06 N (93% improvement) on a test dataset.
Additionally, we evaluate the learned force model in a force feedback grasp
controller performing object lifting and gentle placement. Our results can be
found at https://sites.google.com/view/tactile-force.
Comment: Accepted to ICRA 2019 (camera-ready version)
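The voxelized input feature layer mentioned above rasterizes spatially distributed sensor signals onto a 3D grid. A minimal sketch follows; the electrode coordinates, grid resolution, and bounds are assumptions, not the BioTac's actual geometry:

    # Hypothetical sketch: scatter per-electrode signals into a voxel grid.
    # Electrode positions, grid resolution, and bounds are assumptions.
    import numpy as np

    def voxelize_signals(positions, values, grid_size=8, bounds=1.0):
        """positions: [N, 3] electrode coordinates in [-bounds, bounds];
        values: [N] signals. Returns a [grid_size]^3 feature volume."""
        grid = np.zeros((grid_size, grid_size, grid_size), dtype=np.float32)
        idx = ((positions + bounds) / (2 * bounds) * grid_size).astype(int)
        idx = np.clip(idx, 0, grid_size - 1)
        for (i, j, k), v in zip(idx, values):
            grid[i, j, k] += v
        return grid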
Few-Shot Viewpoint Estimation
Viewpoint estimation for known categories of objects has been improved
significantly thanks to deep networks and large datasets, but generalization to
unknown categories is still very challenging. With an aim towards improving
performance on unknown categories, we introduce the problem of category-level
few-shot viewpoint estimation. We design a novel framework to successfully
train viewpoint networks for new categories with few examples (10 or fewer). We
formulate the problem as one of learning to estimate category-specific 3D
canonical shapes, their associated depth estimates, and semantic 2D keypoints.
We apply meta-learning to learn weights for our network that are amenable to
category-specific few-shot fine-tuning. Furthermore, we design a flexible
meta-Siamese network that maximizes information sharing during meta-learning.
Through extensive experimentation on the ObjectNet3D and Pascal3D+ benchmark
datasets, we demonstrate that our framework, which we call MetaView,
significantly outperforms fine-tuning the state-of-the-art models with few
examples, and that the specific architectural innovations of our method are
crucial to achieving good performance.
Comment: BMVC 201
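The meta-learning step described above produces weights that adapt quickly to a new category from a handful of examples. As a rough illustration only, a generic few-shot fine-tuning loop (not MetaView's exact procedure; the optimizer and step count are assumptions) could look like this:

    # Hypothetical sketch: few-shot adaptation by fine-tuning a copy of a
    # meta-trained network on a handful of support examples.
    import copy
    import torch

    def few_shot_adapt(meta_model, loss_fn, support_x, support_y, lr=0.01, steps=10):
        model = copy.deepcopy(meta_model)          # keep meta-trained weights intact
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(model(support_x), support_y)
            loss.backward()
            opt.step()
        return model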
Multi-View Fusion for Multi-Level Robotic Scene Understanding
We present a system for multi-level scene awareness for robotic manipulation.
Given a sequence of camera-in-hand RGB images, the system calculates three
types of information: 1) a point cloud representation of all the surfaces in
the scene, for the purpose of obstacle avoidance; 2) the rough pose of unknown
objects from categories corresponding to primitive shapes (e.g., cuboids and
cylinders); and 3) full 6-DoF pose of known objects. By developing and fusing
recent techniques in these domains, we provide a rich scene representation for
robot awareness. We demonstrate the importance of each of these modules, their
complementary nature, and the potential benefits of the system in the context
of robotic manipulation.
Comment: Presented at IROS 2021. Video is at https://youtu.be/FuqMxuODGl
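The point-cloud component described above fuses depth from multiple camera-in-hand views into a common frame. A minimal sketch is given below; the intrinsics and pose conventions are assumptions, not the system's actual pipeline:

    # Hypothetical sketch: back-project depth images from several views and
    # accumulate the points in a common world frame. Intrinsics/poses assumed known.
    import numpy as np

    def fuse_depth_views(depth_images, intrinsics, camera_to_world_poses):
        """depth_images: list of [H, W] arrays (meters); intrinsics: 3x3 K;
        camera_to_world_poses: list of 4x4 transforms. Returns [M, 3] points."""
        fx, fy = intrinsics[0, 0], intrinsics[1, 1]
        cx, cy = intrinsics[0, 2], intrinsics[1, 2]
        all_points = []
        for depth, T in zip(depth_images, camera_to_world_poses):
            v, u = np.nonzero(depth > 0)                  # valid pixels
            z = depth[v, u]
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # [N, 4]
            all_points.append((T @ pts_cam.T).T[:, :3])              # world frame
        return np.concatenate(all_points, axis=0)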