IROS 2019 Lifelong Robotic Vision Challenge -- Lifelong Object Recognition Report
This report summarizes the IROS 2019 Lifelong Robotic Vision Competition
(Lifelong Object Recognition Challenge), presenting methods and results from the top
finalists (out of over~ teams). The competition dataset, (L)ifel(O)ng
(R)obotic V(IS)ion (OpenLORIS) - Object Recognition (OpenLORIS-object), is
designed to drive lifelong/continual learning research and applications in the
robotic vision domain, with everyday objects in home, office, campus, and mall
scenarios. The dataset explicitly quantifies variations in illumination,
object occlusion, object size, camera-object distance/angle, and clutter.
The competition rules are designed to quantify the learning capability of a
robotic vision system when faced with objects appearing in dynamic
environments. Individual reports, dataset information, rules,
and released source code can be found at the project homepage:
"https://lifelong-robotic-vision.github.io/competition/".Comment: 9 pages, 11 figures, 3 tables, accepted into IEEE Robotics and
Automation Magazine. arXiv admin note: text overlap with arXiv:1911.0648
Recognizing and Tracking High-Level, Human-Meaningful Navigation Features of Occupancy Grid Maps
This paper describes a system whereby a robot detects and tracks
human-meaningful navigational cues as it navigates in an indoor environment. It
is intended as the sensor front-end for a mobile robot system that can
communicate its navigational context with human users. From simulated LiDAR
scan data we construct a set of 2D occupancy grid bitmaps, then hand-label
these with human-scale navigational features such as closed doors, open
corridors and intersections. We train a Convolutional Neural Network (CNN) to
recognize these features on input bitmaps. In our demonstration system, these
features are detected at every time step then passed to a tracking module that
does frame-to-frame data association to improve detection accuracy and identify
stable unique features. We evaluate the system in both simulation and the real
world. We compare performance when the input occupancy grids are obtained
directly from LiDAR data, incrementally constructed with SLAM, or a combination of both.
Comment: Video: https://youtu.be/zNnsH9tNUN
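A minimal sketch of the kind of CNN front-end the abstract above describes, classifying hand-labeled navigational features from 2D occupancy-grid bitmaps; the feature list, grid size, and layer sizes are illustrative assumptions, not the authors' architecture:

```python
# Sketch (PyTorch): a small CNN that classifies navigational features from
# 2D occupancy-grid bitmaps. Grid size and layers are illustrative assumptions.
import torch
import torch.nn as nn

FEATURES = ["closed_door", "open_corridor", "intersection"]  # hand-labeled classes

class GridFeatureCNN(nn.Module):
    def __init__(self, num_classes=len(FEATURES)):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, grid):              # grid: (B, 1, 64, 64) occupancy bitmap
        return self.head(self.conv(grid))

if __name__ == "__main__":
    model = GridFeatureCNN()
    fake_grid = torch.rand(4, 1, 64, 64)  # stand-in for LiDAR-derived grids
    logits = model(fake_grid)             # (4, 3) class scores per grid
    print(logits.argmax(dim=1))           # predicted feature label per grid
```

In the full system described above, per-frame detections like these would then be passed to a tracking module for frame-to-frame data association.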
Real-Time, Highly Accurate Robotic Grasp Detection using Fully Convolutional Neural Networks with High-Resolution Images
Robotic grasp detection for novel objects is a challenging task, but for the
last few years, deep learning based approaches have achieved remarkable
performance improvements, up to 96.1% accuracy, with RGB-D data. In this paper,
we propose fully convolutional neural network (FCNN) based methods for robotic
grasp detection. Our methods also achieved state-of-the-art detection accuracy
(up to 96.6%) with state-of-the-art real-time computation for
high-resolution images (6-20 ms per 360x360 image) on the Cornell dataset. Because the
networks are fully convolutional, our proposed methods can be applied to images of any
size to detect multiple grasps on multiple objects. The proposed methods were evaluated
using a 4-axis robot arm with a small parallel gripper and an RGB-D camera for grasping
challenging small, novel objects. With accurate vision-robot coordinate calibration
through our learning-based, fully automatic approach, the proposed method
yielded a 90% success rate.
Comment: This work was superseded by arXiv:1812.0776
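A minimal sketch of a fully convolutional grasp detector in the spirit of the abstract above, producing per-pixel grasp quality, angle, and width maps from an RGB-D image; the channel counts and output heads are assumptions, not the paper's model:

```python
# Sketch (PyTorch): fully convolutional grasp detection. Every layer is
# convolutional, so the same weights apply to images of any size, enabling
# multi-grasp, multi-object detection. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FCNGrasp(nn.Module):
    def __init__(self, in_ch=4):                      # RGB-D input
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.quality = nn.Conv2d(64, 1, 1)            # grasp success probability map
        self.angle = nn.Conv2d(64, 2, 1)              # cos/sin of grasp angle per pixel
        self.width = nn.Conv2d(64, 1, 1)              # gripper opening map

    def forward(self, x):
        feat = self.encoder(x)
        return torch.sigmoid(self.quality(feat)), self.angle(feat), self.width(feat)

if __name__ == "__main__":
    net = FCNGrasp()
    rgbd = torch.rand(1, 4, 360, 360)                 # high-resolution input
    q, ang, w = net(rgbd)
    # The best grasp center is the argmax of the quality map.
    print(q.shape, q.flatten(1).argmax(dim=1))
```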
What Makes Good Synthetic Training Data for Learning Disparity and Optical Flow Estimation?
The finding that very large networks can be trained efficiently and reliably
has led to a paradigm shift in computer vision from engineered solutions to
learning formulations. As a result, the research challenge shifts from devising
algorithms to creating suitable and abundant training data for supervised
learning. How can such training data be created efficiently? The dominant data
acquisition method in visual recognition is based on web data and manual
annotation. Yet, for many computer vision problems, such as stereo or optical
flow estimation, this approach is not feasible because humans cannot manually
enter a pixel-accurate flow field. In this paper, we promote the use of
synthetically generated data for the purpose of training deep networks on such
tasks. We suggest multiple ways to generate such data and evaluate the influence
of dataset properties on the performance and generalization properties of the
resulting networks. We also demonstrate the benefit of learning schedules that
use different types of data at selected stages of the training process.
Comment: added references (UCL dataset); added IJCV copyright information
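To make the idea of pixel-accurate synthetic supervision concrete, here is a minimal sketch that generates one optical-flow training pair by pasting a textured rectangle onto a background and shifting it by a known offset; the shapes, sizes, and single-object setup are illustrative assumptions, not the paper's rendering pipeline:

```python
# Sketch (NumPy): one synthetic optical-flow training sample. The known shift of
# the foreground patch yields pixel-accurate ground-truth flow, which a human
# annotator could never produce by hand.
import numpy as np

def make_flow_sample(h=128, w=128, rng=np.random.default_rng(0)):
    frame1 = rng.random((h, w)).astype(np.float32)        # textured background
    frame2 = frame1.copy()
    flow = np.zeros((h, w, 2), np.float32)                # ground-truth (dx, dy)

    # Random foreground rectangle and its known motion.
    y, x = rng.integers(8, h - 40), rng.integers(8, w - 40)
    dy, dx = rng.integers(-8, 9, size=2)
    patch = rng.random((32, 32)).astype(np.float32)

    frame1[y:y+32, x:x+32] = patch
    frame2[y+dy:y+dy+32, x+dx:x+dx+32] = patch            # same patch, shifted
    flow[y:y+32, x:x+32] = (dx, dy)                       # exact flow on the object
    return frame1, frame2, flow

if __name__ == "__main__":
    f1, f2, gt = make_flow_sample()
    print(f1.shape, f2.shape, gt.shape)                   # (128,128) (128,128) (128,128,2)
```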
The CoSTAR Block Stacking Dataset: Learning with Workspace Constraints
A robot can now grasp an object more effectively than ever before, but once
it has the object, what happens next? We show that a mild relaxation of the task
and workspace constraints implicit in existing object grasping datasets can
cause neural network based grasping algorithms to fail on even a simple block
stacking task when executed under more realistic circumstances.
To address this, we introduce the JHU CoSTAR Block Stacking Dataset (BSD),
where a robot interacts with 5.1 cm colored blocks to complete an
order-fulfillment style block stacking task. It contains dynamic scenes and
real time-series data in a less constrained environment than comparable
datasets. There are nearly 12,000 stacking attempts and over 2 million frames
of real data. We discuss the ways in which this dataset provides a valuable
resource for a broad range of other topics of investigation.
We find that hand-designed neural networks that work on prior datasets do not
generalize to this task. Thus, to establish a baseline for this dataset, we
demonstrate an automated search of neural network based models using a novel
multiple-input HyperTree MetaModel, and find a final model which makes
reasonable 3D pose predictions for grasping and stacking on our dataset.
The CoSTAR BSD, code, and instructions are available at
https://sites.google.com/site/costardataset.
Comment: This is a major revision refocusing the topic towards the JHU CoSTAR
Block Stacking Dataset, workspace constraints, and a comparison of HyperTrees
with hand-designed algorithms. 12 pages, 10 figures, and 3 tables
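As a hedged illustration of a multiple-input model for this kind of task (explicitly not the HyperTree MetaModel itself), the sketch below fuses an image branch with a vector branch, such as the current gripper pose, to regress a 3D goal pose; all branch sizes and inputs are assumptions:

```python
# Sketch (PyTorch): a generic multiple-input network regressing a 3D goal pose
# (translation + unit quaternion). NOT the HyperTree MetaModel; inputs and layer
# sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MultiInputPoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # -> (B, 32)
        )
        self.vector_branch = nn.Sequential(nn.Linear(7, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 7))

    def forward(self, image, gripper_pose):
        fused = torch.cat([self.image_branch(image),
                           self.vector_branch(gripper_pose)], dim=1)
        out = self.head(fused)
        xyz, quat = out[:, :3], nn.functional.normalize(out[:, 3:], dim=1)
        return xyz, quat                                  # predicted 3D pose

if __name__ == "__main__":
    net = MultiInputPoseNet()
    xyz, quat = net(torch.rand(2, 3, 128, 128), torch.rand(2, 7))
    print(xyz.shape, quat.shape)                          # (2, 3) (2, 4)
```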
A Survey on Deep Learning Methods for Robot Vision
Deep learning has allowed a paradigm shift in pattern recognition, from using
hand-crafted features together with statistical classifiers to using
general-purpose learning procedures for learning data-driven representations,
features, and classifiers together. The application of this new paradigm has
been particularly successful in computer vision, in which the development of
deep learning methods for vision applications has become a hot research topic.
Given that deep learning has already attracted the attention of the robot
vision community, the main purpose of this survey is to address the use of deep
learning in robot vision. To achieve this, a comprehensive overview of deep
learning and its usage in computer vision is given, that includes a description
of the most frequently used neural models and their main application areas.
Then, the standard methodology and tools used for designing deep-learning based
vision systems are presented. Afterwards, a review of the principal work using
deep learning in robot vision is presented, as well as current and future
trends related to the use of deep learning in robotics. This survey is intended
to be a guide for developers of robot vision systems.
Bayesian Grasp: Robotic visual stable grasp based on prior tactile knowledge
Robotic grasp detection is a fundamental capability for intelligent
manipulation in unstructured environments. Previous work mainly employed visual
and tactile fusion to achieve stable grasps; however, the whole process depends
heavily on regrasping, which wastes considerable time on regulation and evaluation. We
propose a novel way to improve robotic grasping: by using learned tactile
knowledge, a robot can achieve a stable grasp from an image. First, we
construct a prior tactile knowledge learning framework with a novel grasp quality
metric determined by measuring a grasp's resistance to external
perturbations. Second, we propose a multi-phase Bayesian Grasp architecture to
generate stable grasp configurations from a single RGB image based on prior
tactile knowledge. Results show that this framework can classify the outcome of
grasps with an average accuracy of 86% on known objects and 79% on novel
objects. The prior tactile knowledge improves the success rate by 55% over
traditional vision-based strategies.
Comment: submitted to ICRA202
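A toy sketch of the Bayesian idea in the abstract above: a tactile-learned prior over grasp stability is combined with per-class image likelihoods via Bayes' rule. The numbers are made up, and the paper's actual multi-phase learned architecture is not reproduced here:

```python
# Sketch: Bayes' rule combining a tactile-learned prior with visual evidence.
def posterior_stable(p_stable_prior, p_obs_given_stable, p_obs_given_unstable):
    """P(stable | image) from a tactile prior and per-class image likelihoods."""
    num = p_obs_given_stable * p_stable_prior
    den = num + p_obs_given_unstable * (1.0 - p_stable_prior)
    return num / den

if __name__ == "__main__":
    # Hypothetical values: the tactile prior says this grasp configuration is
    # stable 60% of the time; the image evidence favours "stable" 0.8 vs 0.3.
    print(round(posterior_stable(0.6, 0.8, 0.3), 3))      # 0.8
```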
The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes?
A successful grasp requires careful balancing of the contact forces. Deducing
whether a particular grasp will be successful from indirect measurements, such
as vision, is therefore quite challenging, and direct sensing of contacts
through touch sensing provides an appealing avenue toward more successful and
consistent robotic grasping. However, in order to fully evaluate the value of
touch sensing for grasp outcome prediction, we must understand how touch
sensing can influence outcome prediction accuracy when combined with other
modalities. Doing so using conventional model-based techniques is exceptionally
difficult. In this work, we investigate the question of whether touch sensing
aids in predicting grasp outcomes within a multimodal sensing framework that
combines vision and touch. To that end, we collected more than 9,000 grasping
trials using a two-finger gripper equipped with GelSight high-resolution
tactile sensors on each finger, and evaluated visuo-tactile deep neural network
models to directly predict grasp outcomes from either modality individually,
and from both modalities together. Our experimental results indicate that
incorporating tactile readings substantially improves grasping performance.
Comment: 10 pages, accepted at the 1st Annual Conference on Robot Learning (CoRL)
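A minimal sketch of a visuo-tactile outcome predictor of the kind evaluated above: an RGB branch and two GelSight tactile-image branches are fused to predict grasp success. Layer sizes and input resolutions are illustrative assumptions, not the paper's model:

```python
# Sketch (PyTorch): multimodal fusion of vision and touch for grasp outcome
# prediction. Branch sizes are illustrative assumptions.
import torch
import torch.nn as nn

def small_cnn(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 5, stride=2), nn.ReLU(),
        nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),            # -> (B, 32)
    )

class VisuoTactileNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = small_cnn(3)                        # external RGB camera
        self.touch_left = small_cnn(3)                    # GelSight image, finger 1
        self.touch_right = small_cnn(3)                   # GelSight image, finger 2
        self.classifier = nn.Sequential(nn.Linear(96, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, rgb, gel_l, gel_r):
        feat = torch.cat([self.vision(rgb), self.touch_left(gel_l),
                          self.touch_right(gel_r)], dim=1)
        return torch.sigmoid(self.classifier(feat))       # P(grasp succeeds)

if __name__ == "__main__":
    net = VisuoTactileNet()
    p = net(torch.rand(2, 3, 96, 96), torch.rand(2, 3, 96, 96), torch.rand(2, 3, 96, 96))
    print(p.shape)                                        # (2, 1)
```

Dropping either tactile branch gives the single-modality baselines the abstract compares against.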
SFV: Reinforcement Learning of Physical Skills from Videos
Data-driven character animation based on motion capture can produce highly
naturalistic behaviors and, when combined with physics simulation, can provide
for natural procedural responses to physical perturbations, environmental
changes, and morphological discrepancies. Motion capture remains the most
popular source of motion data, but collecting mocap data typically requires
heavily instrumented environments and actors. In this paper, we propose a
method that enables physically simulated characters to learn skills from videos
(SFV). Our approach, based on deep pose estimation and deep reinforcement
learning, allows data-driven animation to leverage the abundance of publicly
available video clips from the web, such as those from YouTube. This has the
potential to enable fast and easy design of character controllers simply by
querying for video recordings of the desired behavior. The resulting
controllers are robust to perturbations, can be adapted to new settings, can
perform basic object interactions, and can be retargeted to new morphologies
via reinforcement learning. We further demonstrate that our method can predict
potential human motions from still images, by forward simulation of learned
controllers initialized from the observed pose. Our framework is able to learn
a broad range of dynamic skills, including locomotion, acrobatics, and martial
arts.
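A hedged sketch of the pose-imitation reward idea underlying this kind of approach: the simulated character is rewarded for tracking the video-estimated reference pose at each time step. The exponential form and weighting are assumptions, not the paper's exact objective:

```python
# Sketch (NumPy): a pose-imitation reward for RL from video-estimated motion.
import numpy as np

def imitation_reward(sim_pose, ref_pose, scale=2.0):
    """sim_pose, ref_pose: (num_joints, 3) joint rotations (e.g. axis-angle)."""
    err = np.sum((sim_pose - ref_pose) ** 2)
    return float(np.exp(-scale * err))                    # in (0, 1], 1 = perfect match

if __name__ == "__main__":
    ref = np.zeros((15, 3))                               # video-estimated reference frame
    good = ref + 0.01                                     # simulated pose close to reference
    bad = ref + 0.5                                       # simulated pose far from reference
    print(imitation_reward(good, ref), imitation_reward(bad, ref))
```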
Curvature: A signature for Action Recognition in Video Sequences
In this paper, a novel signature for human action recognition, namely the
curvature of a video sequence, is introduced. In this way, the distribution of
sequential data is modeled, which enables few-shot learning. Instead of
depending on recognizing features within images, our algorithm views actions as
sequences on the universal time scale across a whole sequence of images. The
video sequence, viewed as a curve in pixel space, is aligned by
reparameterization using the arclength of the curve in pixel space. Once such
curvatures are obtained, statistical indexes are extracted and fed into a
learning-based classifier. Overall, our method is simple but powerful.
Preliminary experimental results show that our method is effective and achieves
state-of-the-art performance in video-based human action recognition. Moreover,
we see latent capacity in transferring this idea into other sequence-based
recognition applications such as speech recognition, machine translation, and
text generation.
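A minimal sketch of the curvature signature described above: the video is treated as a curve in pixel space, resampled by arclength, and its discrete curvature summarized into statistics that could feed a classifier. The resampling resolution and choice of statistics are illustrative assumptions:

```python
# Sketch (NumPy): curvature of a video viewed as a curve in pixel space.
import numpy as np

def curvature_signature(frames, num_samples=64):
    """frames: (T, H, W) grayscale video; returns (mean, std, max) of curvature."""
    curve = frames.reshape(len(frames), -1).astype(np.float64)   # points in pixel space

    # Arclength reparameterization: resample to equal spacing along the curve.
    seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s_new = np.linspace(0.0, s[-1], num_samples)
    curve = np.stack([np.interp(s_new, s, curve[:, d])
                      for d in range(curve.shape[1])], axis=1)

    # Discrete curvature: magnitude of the second derivative w.r.t. arclength.
    d1 = np.gradient(curve, s_new, axis=0)
    d2 = np.gradient(d1, s_new, axis=0)
    kappa = np.linalg.norm(d2, axis=1)

    return kappa.mean(), kappa.std(), kappa.max()

if __name__ == "__main__":
    video = np.random.rand(30, 32, 32)                    # stand-in for an action clip
    print(curvature_signature(video))
```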