KOVIS: Keypoint-based Visual Servoing with Zero-Shot Sim-to-Real Transfer for Robotics Manipulation
We present KOVIS, a novel learning-based, calibration-free visual servoing
method for fine robotic manipulation tasks with an eye-in-hand stereo camera
system. We train the deep neural network only in a simulated environment, and
the trained model can be used directly for real-world visual servoing tasks.
KOVIS consists of two networks. The keypoint network learns a keypoint
representation of the image with an autoencoder. The visual servoing network
then learns the motion from keypoints extracted from the camera images. The
two networks are trained end-to-end in the simulated environment by
self-supervised learning without manual data labeling. After training with data
augmentation, domain randomization, and adversarial examples, we are able to
achieve zero-shot sim-to-real transfer to real-world robotic manipulation
tasks. We demonstrate the effectiveness of the proposed method in both
simulated and real-world experiments on different robotic manipulation tasks,
including grasping, peg-in-hole insertion with 4 mm clearance, and M13 screw
insertion. A demo video is available at http://youtu.be/gfBJBR2tDzA
Comment: Accepted by IROS 2020
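To make the two-network design concrete, below is a minimal PyTorch sketch of a keypoint extractor with a spatial-softmax readout feeding a small servoing network. The layer sizes, keypoint count, and readout are illustrative assumptions, not the authors' exact architecture, and the autoencoder decoder used for self-supervised training is omitted.

```python
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Extracts K 2D keypoints from an image with a spatial softmax."""
    def __init__(self, num_keypoints=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, num_keypoints, 5, stride=2, padding=2),
        )

    def forward(self, img):
        feat = self.conv(img)                       # (B, K, H, W) heatmaps
        b, k, h, w = feat.shape
        probs = feat.flatten(2).softmax(-1).view(b, k, h, w)
        ys = torch.linspace(-1.0, 1.0, h, device=img.device)
        xs = torch.linspace(-1.0, 1.0, w, device=img.device)
        kp_y = (probs.sum(3) * ys).sum(2)           # expected y per keypoint
        kp_x = (probs.sum(2) * xs).sum(2)           # expected x per keypoint
        return torch.stack([kp_x, kp_y], dim=-1)    # (B, K, 2)

class ServoNet(nn.Module):
    """Maps keypoints from both stereo views to a motion command."""
    def __init__(self, num_keypoints=8, dof=6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_keypoints * 2, 64), nn.ReLU(),
            nn.Linear(64, dof),
        )

    def forward(self, kp_left, kp_right):
        x = torch.cat([kp_left.flatten(1), kp_right.flatten(1)], dim=1)
        return self.mlp(x)

kp_net, servo_net = KeypointNet(), ServoNet()
left, right = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
motion = servo_net(kp_net(left), kp_net(right))     # (1, 6) velocity command
```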
Manifestation of Image Contrast in Deep Networks
Contrast is subject to dramatic changes across the visual field, depending on
the source of light and scene configurations. Hence, the human visual system
has evolved to be more sensitive to contrast than absolute luminance. This
feature is equally desired for machine vision: the ability to recognise
patterns even when aspects of them are transformed due to variation in local
and global contrast. In this work, we thoroughly investigate the impact of
image contrast on prominent deep convolutional networks, during both the
training and testing phases. The results show an evident deterioration in the
accuracy of all state-of-the-art networks on low-contrast images. We
demonstrate that "contrast-augmentation" is a sufficient condition to endow a
network with invariance to contrast. This practice shows no negative side
effects; on the contrary, it may also prevent other illuminance-related
overfitting. The same invariance can be achieved by a short fine-tuning
procedure, which opens new lines of investigation into the mechanisms at work
in two networks whose weights are over 99.9% correlated yet produce utterly
different outcomes. Our further analysis suggests that the optimisation
algorithm is an influential factor, though with a significantly smaller
effect; and while the choice of architecture has a negligible impact on this
phenomenon, the first layers appear to be the most critical.
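A minimal sketch of the "contrast-augmentation" practice using torchvision; the jitter range here is an assumption, not the value used in the experiments.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    # Randomly scale image contrast between 5% and 100% of the original,
    # so the network sees low-contrast variants of every training image.
    T.ColorJitter(contrast=(0.05, 1.0)),
    T.ToTensor(),
])
```

Applying this transform during training exposes the network to the full range of contrasts it will be tested on, which is the mechanism the abstract identifies as sufficient for contrast invariance.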
A Causal View on Robustness of Neural Networks
We present a causal view on the robustness of neural networks against input
manipulations, which applies not only to traditional classification tasks but
also to general measurement data. Based on this view, we design a deep causal
manipulation augmented model (deep CAMA) which explicitly models possible
manipulations on certain causes leading to changes in the observed effect. We
further develop data augmentation and test-time fine-tuning methods to improve
deep CAMA's robustness. When compared with discriminative deep neural networks,
our proposed model shows superior robustness against unseen manipulations. As a
by-product, our model achieves a disentangled representation that separates
the representation of manipulations from that of the other latent causes.
Comment: NeurIPS 2020
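As a schematic of the generative structure described here, the sketch below builds a decoder in which the observation is generated from the class label, other latent causes, and a separate manipulation variable. Layer sizes are assumptions, and the inference network and test-time fine-tuning are omitted; see the paper for the full model.

```python
import torch
import torch.nn as nn

class CamaDecoder(nn.Module):
    """Generates x from label y, latent causes z, and manipulation m."""
    def __init__(self, num_classes=10, z_dim=32, m_dim=32, x_dim=784):
        super().__init__()
        # Separate branches keep the manipulation information disentangled
        # from the other latent causes.
        self.f_y = nn.Linear(num_classes, 64)
        self.f_z = nn.Linear(z_dim, 64)
        self.f_m = nn.Linear(m_dim, 64)
        self.f_x = nn.Sequential(nn.Linear(192, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, y, z, m):
        h = torch.cat([self.f_y(y), self.f_z(z), self.f_m(m)], dim=-1)
        return self.f_x(h)

dec = CamaDecoder()
y = torch.zeros(1, 10); y[0, 3] = 1.0          # one-hot class label
x = dec(y, torch.randn(1, 32), torch.randn(1, 32))
```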
Fine-Tuning VGG Neural Network For Fine-grained State Recognition of Food Images
State recognition of food images can be considered one of the promising
applications of object recognition and fine-grained image classification in
computer vision. In this paper, evidence is provided for the power of
convolutional neural networks (CNNs) for food state recognition, even with a
small data set. We fine-tuned a CNN initially trained on a large natural
image recognition dataset (ImageNet ILSVRC) and transferred the learned
feature representations to the food state recognition task. A small-scale
dataset consisting of 5978 images of seven categories was constructed and
annotated manually. Data augmentation was applied to increase the size of the
data.
Comment: 5 pages, 7 figures
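A minimal sketch of this transfer-learning setup with torchvision's ImageNet-pretrained VGG-16; the freezing policy and hyperparameters are assumptions for illustration.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1")   # ImageNet ILSVRC weights

# Replace the final classifier layer for the seven food-state categories.
model.classifier[6] = nn.Linear(4096, 7)

# Freeze the convolutional feature extractor; fine-tune the classifier only.
for p in model.features.parameters():
    p.requires_grad = False

optimizer = optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9,
)
```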
RGB-D Object Detection and Semantic Segmentation for Autonomous Manipulation in Clutter
Autonomous robotic manipulation in clutter is challenging. A large variety of
objects must be perceived in complex scenes, where they are partially occluded
and embedded among many distractors, often in restricted spaces. To tackle
these challenges, we developed a deep-learning approach that combines object
detection and semantic segmentation. The manipulation scenes are captured with
RGB-D cameras, for which we developed a depth fusion method. Employing
pretrained features makes learning from small annotated robotic data sets
possible. We evaluate our approach on two challenging data sets: one captured
for the Amazon Picking Challenge 2016, where our team NimbRo came in second in
the Stowing and third in the Picking task, and one captured in
disaster-response scenarios. The experiments show that object detection and
semantic segmentation complement each other and can be combined to yield
reliable object perception.
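As an illustration of reusing pretrained features for RGB-D input, here is a schematic two-stream sketch that encodes RGB and a three-channel depth rendering (e.g. colorized depth) with ImageNet-pretrained backbones and fuses their features to score object classes for a crop. The paper's depth fusion method and detection/segmentation heads differ, so treat this only as a sketch of the general idea.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        # Pretrained backbones for RGB and a 3-channel depth encoding,
        # so both streams reuse ImageNet features.
        self.rgb = models.resnet18(weights="IMAGENET1K_V1")
        self.depth = models.resnet18(weights="IMAGENET1K_V1")
        self.rgb.fc = nn.Identity()     # keep the pooled 512-d features
        self.depth.fc = nn.Identity()
        self.head = nn.Linear(512 * 2, num_classes)

    def forward(self, rgb, depth3):
        feat = torch.cat([self.rgb(rgb), self.depth(depth3)], dim=1)
        return self.head(feat)

net = TwoStreamNet()
logits = net(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```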
Deep Multimodal Embedding: Manipulating Novel Objects with Point-clouds, Language and Trajectories
A robot operating in a real-world environment needs to perform reasoning over
a variety of sensor modalities such as vision, language and motion
trajectories. However, it is extremely challenging to manually design features
relating such disparate modalities. In this work, we introduce an algorithm
that learns to embed point-cloud, natural language, and manipulation trajectory
data into a shared embedding space with a deep neural network. To learn
semantically meaningful spaces throughout our network, we use a loss-based
margin to bring embeddings of relevant pairs closer together while driving
less-relevant cases from different modalities further apart. We use this loss
both to pre-train the network's lower layers and to fine-tune the final
embedding space, leading
to a more robust representation. We test our algorithm on the task of
manipulating novel objects and appliances based on prior experience with other
objects. On a large dataset, we achieve significant improvements in both
accuracy and inference time over the previous state of the art. We also perform
end-to-end experiments on a PR2 robot utilizing our learned embedding space.
Comment: IEEE International Conference on Robotics and Automation (ICRA), 2017
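A minimal sketch of a loss-based margin over a shared embedding space: the distance between a matching (point-cloud/language, trajectory) pair is pushed below the distance to a less-relevant trajectory by at least a margin. The Euclidean distance and margin value are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def margin_loss(anchor, positive, negative, margin=1.0):
    """anchor: point-cloud + language embedding; positive/negative:
    trajectory embeddings in the same shared space, all shaped (B, D)."""
    d_pos = F.pairwise_distance(anchor, positive)   # distance to relevant pair
    d_neg = F.pairwise_distance(anchor, negative)   # distance to less-relevant
    return F.relu(margin + d_pos - d_neg).mean()

a = torch.randn(4, 64, requires_grad=True)  # stand-ins for network outputs
p, n = torch.randn(4, 64), torch.randn(4, 64)
loss = margin_loss(a, p, n)
loss.backward()                             # gradients flow to the encoders
```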
Bayesian Grasp: Robotic visual stable grasp based on prior tactile knowledge
Robotic grasp detection is a fundamental capability for intelligent
manipulation in unstructured environments. Previous work has mainly employed
visual-tactile fusion to achieve stable grasps, but the process depends
heavily on regrasping, which wastes much time on regulation and evaluation. We
propose a novel way to improve robotic grasping: by using learned tactile
knowledge, a robot can achieve a stable grasp from an image. First, we
construct a prior tactile knowledge learning framework with a novel grasp
quality metric, determined by measuring a grasp's resistance to external
perturbations. Second, we propose a multi-phase Bayesian Grasp architecture
to generate stable grasp configurations from a single RGB image based on prior
tactile knowledge. Results show that this framework can classify the outcome of
grasps with an average accuracy of 86% on known objects and 79% on novel
objects. The prior tactile knowledge improves the success rate by 55% over
traditional vision-based strategies.
Comment: submitted to ICRA202
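A schematic sketch of the scoring phase: grasp candidates extracted from an image are ranked by a quality model trained on tactile-derived stability labels, and the most stable one is executed. The candidate crops and the network below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

quality_net = nn.Sequential(          # stand-in for the learned quality model
    nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),   # P(stable grasp | image crop)
)

candidates = torch.randn(16, 3, 32, 32)    # 16 grasp-centred image crops
scores = quality_net(candidates).squeeze(1)
best = scores.argmax().item()              # index of the grasp to execute
```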
Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
Generating musical audio directly with neural networks is notoriously
difficult because it requires coherently modeling structure at many different
timescales. Fortunately, most music is also highly structured and can be
represented as discrete note events played on musical instruments. Herein, we
show that by using notes as an intermediate representation, we can train a
suite of models capable of transcribing, composing, and synthesizing audio
waveforms with coherent musical structure on timescales spanning six orders of
magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large
advance in the state of the art is enabled by our release of the new MAESTRO
(MIDI and Audio Edited for Synchronous TRacks and Organization) dataset,
composed of over 172 hours of virtuosic piano performances captured with fine
alignment (~3 ms) between note labels and audio waveforms. The networks and the
dataset together present a promising approach toward creating new expressive
and interpretable neural models of music.
Comment: Examples available at https://goo.gl/magenta/maestro-example
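The factorization can be summarized as a three-stage composition, sketched below with placeholder functions standing in for the transcription, note-level language, and synthesis models.

```python
from typing import List, Tuple

Note = Tuple[float, float, int]   # (onset_sec, offset_sec, midi_pitch)

def transcribe(audio: List[float]) -> List[Note]:
    """Placeholder for the piano transcription model (audio -> notes)."""
    return [(0.0, 0.5, 60)]

def compose(notes: List[Note]) -> List[Note]:
    """Placeholder for the note-level language model (notes -> notes)."""
    return notes + [(0.5, 1.0, 64)]

def synthesize(notes: List[Note]) -> List[float]:
    """Placeholder for the conditional synthesis model (notes -> audio)."""
    return [0.0] * 16000

# The Wave2Midi2Wave factorization: audio -> notes -> new notes -> audio.
new_audio = synthesize(compose(transcribe([0.0] * 16000)))
```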
Robobarista: Learning to Manipulate Novel Objects via Deep Multimodal Embedding
There is a large variety of objects and appliances in human environments,
such as stoves, coffee dispensers, juice extractors, and so on. It is
challenging for a roboticist to program a robot for each of these object types
and for each of their instantiations. In this work, we present a novel approach
to manipulation planning based on the idea that many household objects share
similarly-operated object parts. We formulate manipulation planning as a
structured prediction problem and learn to transfer manipulation strategy
across different objects by embedding point-cloud, natural language, and
manipulation trajectory data into a shared embedding space using a deep neural
network. In order to learn semantically meaningful spaces throughout our
network, we introduce a method for pre-training its lower layers for multimodal
feature embedding and a method for fine-tuning this embedding space using a
loss-based margin. In order to collect a large number of manipulation
demonstrations for different objects, we develop a new crowd-sourcing platform
called Robobarista. We test our model on our dataset consisting of 116 objects
and appliances with 249 parts along with 250 language instructions, for which
there are 1225 crowd-sourced manipulation demonstrations. We further show that
our robot, using our model, can even prepare a latte with appliances it has
never seen before.
Comment: Journal Version
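At inference time, transfer of this kind can be sketched as nearest-neighbour retrieval in the shared space: embed the new object part and its instruction, then reuse the known manipulation trajectory whose embedding is most similar. The embeddings below are random placeholders for the outputs of the trained networks.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
query = torch.randn(64)                 # embedding of point-cloud + language
trajectory_bank = torch.randn(100, 64)  # embeddings of known trajectories

sims = F.cosine_similarity(query.unsqueeze(0), trajectory_bank, dim=1)
best_idx = sims.argmax().item()   # trajectory to transfer to the new part
```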
Decompose to manipulate: Manipulable Object Synthesis in 3D Medical Images with Structured Image Decomposition
The performance of medical image analysis systems is constrained by the
quantity of high-quality image annotations. Such systems require data to be
annotated by experts with years of training, especially when diagnostic
decisions are involved. Such datasets are thus hard to scale up. In this
context, it is hard for supervised learning systems to generalize to the cases
that are rare in the training set but present in real-world clinical
practice. We believe that synthetic image samples generated by a system
trained on real data can be useful for improving supervised learning in
medical image analysis applications. Allowing the image synthesis
to be manipulable could help synthetic images provide complementary information
to the training data rather than simply duplicating the real-data manifold. In
this paper, we propose a framework for synthesizing 3D objects, such as
pulmonary nodules, in 3D medical images with manipulable properties. The
manipulation is enabled by decomposing the object of interest into its
segmentation mask and a 1D vector containing the residual information. The
synthetic object is refined and blended into the image context with two
adversarial discriminators. We evaluate the proposed framework on lung nodules
in 3D chest CT images and show that the proposed framework could generate
realistic nodules with manipulable shapes, textures, locations, etc. By
sampling from both the synthetic nodules and the real nodules from 2800 3D CT
volumes during classifier training, we show that the synthetic patches
improve the overall nodule detection performance by an average of 8.44% in
competition performance metric (CPM) score.
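A schematic PyTorch sketch of the decomposition: a 3D patch is split into a segmentation mask (shape) and a 1D residual vector (appearance), and editing the mask before decoding manipulates the synthesized object. Sizes are assumptions, and the adversarial refinement and blending stages are omitted.

```python
import torch
import torch.nn as nn

class MaskResidualAE(nn.Module):
    def __init__(self, res_dim=64):
        super().__init__()
        self.mask_head = nn.Sequential(          # shape branch: voxel mask
            nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv3d(8, 1, 3, padding=1), nn.Sigmoid(),
        )
        self.res_enc = nn.Sequential(            # appearance branch: 1D code
            nn.Flatten(), nn.Linear(16**3, res_dim), nn.ReLU(),
        )
        self.dec = nn.Linear(res_dim + 16**3, 16**3)

    def forward(self, patch, mask=None):
        m = self.mask_head(patch) if mask is None else mask  # editable shape
        r = self.res_enc(patch)                              # residual vector
        h = torch.cat([r, m.flatten(1)], dim=1)
        return self.dec(h).view_as(patch), m

ae = MaskResidualAE()
patch = torch.randn(1, 1, 16, 16, 16)              # 3D CT sub-volume
recon, mask = ae(patch)
edited = ae(patch, mask=(mask > 0.2).float())[0]   # shape-manipulated nodule
```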