4,729 research outputs found
Analysis of Hand Segmentation in the Wild
A large number of works in egocentric vision have concentrated on action and
object recognition. Detection and segmentation of hands in first-person videos,
however, has less been explored. For many applications in this domain, it is
necessary to accurately segment not only hands of the camera wearer but also
the hands of others with whom he is interacting. Here, we take an in-depth look
at the hand segmentation problem. In the quest for robust hand segmentation
methods, we evaluated the performance of the state of the art semantic
segmentation methods, off the shelf and fine-tuned, on existing datasets. We
fine-tune RefineNet, a leading semantic segmentation method, for hand
segmentation and find that it does much better than the best contenders.
Existing hand segmentation datasets are collected in the laboratory settings.
To overcome this limitation, we contribute by collecting two new datasets: a)
EgoYouTubeHands including egocentric videos containing hands in the wild, and
b) HandOverFace to analyze the performance of our models in presence of similar
appearance occlusions. We further explore whether conditional random fields can
help refine generated hand segmentations. To demonstrate the benefit of
accurate hand maps, we train a CNN for hand-based activity recognition and
achieve higher accuracy when a CNN was trained using hand maps produced by the
fine-tuned RefineNet. Finally, we annotate a subset of the EgoHands dataset for
fine-grained action recognition and show that an accuracy of 58.6% can be
achieved by just looking at a single hand pose which is much better than the
chance level (12.5%).Comment: Accepted at CVPR 201
Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching
This paper presents a robotic pick-and-place system that is capable of
grasping and recognizing both known and novel objects in cluttered
environments. The key new feature of the system is that it handles a wide range
of object categories without needing any task-specific training data for novel
objects. To achieve this, it first uses a category-agnostic affordance
prediction algorithm to select and execute among four different grasping
primitive behaviors. It then recognizes picked objects with a cross-domain
image classification framework that matches observed images to product images.
Since product images are readily available for a wide range of objects (e.g.,
from the web), the system works out-of-the-box for novel objects without
requiring any additional training data. Exhaustive experimental results
demonstrate that our multi-affordance grasping achieves high success rates for
a wide variety of objects in clutter, and our recognition algorithm achieves
high accuracy for both known and novel grasped objects. The approach was part
of the MIT-Princeton Team system that took 1st place in the stowing task at the
2017 Amazon Robotics Challenge. All code, datasets, and pre-trained models are
available online at http://arc.cs.princeton.eduComment: Project webpage: http://arc.cs.princeton.edu Summary video:
https://youtu.be/6fG7zwGfIk
DeepNav: Learning to Navigate Large Cities
We present DeepNav, a Convolutional Neural Network (CNN) based algorithm for
navigating large cities using locally visible street-view images. The DeepNav
agent learns to reach its destination quickly by making the correct navigation
decisions at intersections. We collect a large-scale dataset of street-view
images organized in a graph where nodes are connected by roads. This dataset
contains 10 city graphs and more than 1 million street-view images. We propose
3 supervised learning approaches for the navigation task and show how A* search
in the city graph can be used to generate supervision for the learning. Our
annotation process is fully automated using publicly available mapping services
and requires no human input. We evaluate the proposed DeepNav models on 4
held-out cities for navigating to 5 different types of destinations. Our
algorithms outperform previous work that uses hand-crafted features and Support
Vector Regression (SVR)[19].Comment: CVPR 2017 camera ready versio
Multi-Modal Human-Machine Communication for Instructing Robot Grasping Tasks
A major challenge for the realization of intelligent robots is to supply them
with cognitive abilities in order to allow ordinary users to program them
easily and intuitively. One way of such programming is teaching work tasks by
interactive demonstration. To make this effective and convenient for the user,
the machine must be capable to establish a common focus of attention and be
able to use and integrate spoken instructions, visual perceptions, and
non-verbal clues like gestural commands. We report progress in building a
hybrid architecture that combines statistical methods, neural networks, and
finite state machines into an integrated system for instructing grasping tasks
by man-machine interaction. The system combines the GRAVIS-robot for visual
attention and gestural instruction with an intelligent interface for speech
recognition and linguistic interpretation, and an modality fusion module to
allow multi-modal task-oriented man-machine communication with respect to
dextrous robot manipulation of objects.Comment: 7 pages, 8 figure
A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction
Picking up objects requested by a human user is a common task in human-robot
interaction. When multiple objects match the user's verbal description, the
robot needs to clarify which object the user is referring to before executing
the action. Previous research has focused on perceiving user's multimodal
behaviour to complement verbal commands or minimising the number of follow up
questions to reduce task time. In this paper, we propose a system for reference
disambiguation based on visualisation and compare three methods to disambiguate
natural language instructions. In a controlled experiment with a YuMi robot, we
investigated real-time augmentations of the workspace in three conditions --
mixed reality, augmented reality, and a monitor as the baseline -- using
objective measures such as time and accuracy, and subjective measures like
engagement, immersion, and display interference. Significant differences were
found in accuracy and engagement between the conditions, but no differences
were found in task time. Despite the higher error rates in the mixed reality
condition, participants found that modality more engaging than the other two,
but overall showed preference for the augmented reality condition over the
monitor and mixed reality conditions
- …