Dense Piecewise Planar RGB-D SLAM for Indoor Environments
The paper exploits weak Manhattan constraints to parse the structure of
indoor environments from RGB-D video sequences in an online setting. We extend
the previous approach for single view parsing of indoor scenes to video
sequences and formulate the problem of recovering the floor plan of the
environment as an optimal labeling problem solved using dynamic programming.
The temporal continuity is enforced in a recursive setting, where labeling from
previous frames is used as a prior term in the objective function. In addition
to recovery of piecewise planar weak Manhattan structure of the extended
environment, the orthogonality constraints are also exploited by visual
odometry and pose graph optimization. This yields reliable estimates in the
presence of large motions and absence of distinctive features to track. We
evaluate our method on several challenging indoor sequences, demonstrating
accurate SLAM and dense mapping of low-texture environments. On the existing TUM
benchmark we achieve results competitive with alternative approaches, which
fail in our environments.
Comment: International Conference on Intelligent Robots and Systems (IROS)
201
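
As a rough illustration of the recursive labeling step, the sketch below runs a Viterbi-style dynamic program over floor-plan columns with a data cost, a Potts smoothness term between neighbouring columns, and a prior that favours the labeling from the previous frame. The cost terms, weights, and function names are assumptions for illustration, not the paper's exact objective.

import numpy as np

def label_columns(unary, prev_labels=None, smooth_w=1.0, prior_w=0.5):
    """Minimum-cost label sequence over a 1-D chain of columns.

    unary:       (N, L) data cost of assigning each of L plane labels to column i
    prev_labels: optional (N,) labeling from the previous frame, used as a prior
    """
    cost = np.array(unary, dtype=float)
    n, L = cost.shape
    if prev_labels is not None:
        # Temporal prior: penalize labels that disagree with the previous frame.
        cost += prior_w * (np.arange(L)[None, :] != prev_labels[:, None])

    # Pairwise Potts smoothness between neighbouring columns.
    pair = smooth_w * (1.0 - np.eye(L))

    dp = np.zeros((n, L))
    back = np.zeros((n, L), dtype=int)
    dp[0] = cost[0]
    for i in range(1, n):
        total = dp[i - 1][:, None] + pair          # (prev label, current label)
        back[i] = total.argmin(axis=0)
        dp[i] = cost[i] + total.min(axis=0)

    # Backtrack the optimal labeling.
    labels = np.empty(n, dtype=int)
    labels[-1] = dp[-1].argmin()
    for i in range(n - 2, -1, -1):
        labels[i] = back[i + 1, labels[i + 1]]
    return labels
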
Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras
Visual scene understanding is an important capability that enables robots to
purposefully act in their environment. In this paper, we propose a novel
approach to object-class segmentation from multiple RGB-D views using deep
learning. We train a deep neural network to predict object-class semantics that
are consistent across several viewpoints in a semi-supervised way. At test time,
the semantics predictions of our network can be fused more consistently in
semantic keyframe maps than predictions of a network trained on individual
views. We base our network architecture on a recent single-view deep learning
approach to RGB and depth fusion for semantic object-class segmentation and
enhance it with multi-scale loss minimization. We obtain the camera trajectory
using RGB-D SLAM and warp the predictions of RGB-D images into ground-truth
annotated frames in order to enforce multi-view consistency during training. At
test time, predictions from multiple views are fused into keyframes. We propose
and analyze several methods for enforcing multi-view consistency during
training and testing. We evaluate the benefit of multi-view consistency
training and demonstrate that pooling of deep features and fusion over multiple
views outperforms single-view baselines on the NYUDv2 benchmark for semantic
segmentation. Our end-to-end trained network achieves state-of-the-art
performance on the NYUDv2 dataset in single-view segmentation as well as
multi-view semantic fusion.
Comment: The 2017 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS 2017)
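
A minimal sketch of one fusion variant is shown below: simple averaging of class probability maps that have already been warped into a keyframe using the SLAM trajectory. The paper proposes and analyzes several consistency and fusion methods, so the averaging choice and names here are assumptions.

import numpy as np

def fuse_into_keyframe(warped_probs):
    """Fuse per-view semantic predictions into a keyframe map.

    warped_probs: list of (H, W, C) softmax outputs, one per view, already
                  warped into the keyframe's viewpoint.
    Returns the per-pixel argmax of the averaged class probabilities.
    """
    fused = np.mean(np.stack(warped_probs, axis=0), axis=0)  # (H, W, C)
    return fused.argmax(axis=-1)                             # (H, W) label map
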
Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation
In this paper, we address panoramic semantic segmentation, which is
under-explored due to two critical challenges: (1) image distortions and object
deformations on panoramas; (2) lack of semantic annotations in the 360-degree
imagery. To tackle these problems, first, we propose the upgraded Transformer
for Panoramic Semantic Segmentation, i.e., Trans4PASS+, equipped with
Deformable Patch Embedding (DPE) and Deformable MLP (DMLPv2) modules for
handling object deformations and image distortions whenever (before or after
adaptation) and wherever (shallow or deep levels). Second, we enhance the
Mutual Prototypical Adaptation (MPA) strategy via pseudo-label rectification
for unsupervised domain adaptive panoramic segmentation. Third, aside from
Pinhole-to-Panoramic (Pin2Pan) adaptation, we create a new dataset (SynPASS)
with 9,080 panoramic images, facilitating the Synthetic-to-Real (Syn2Real)
adaptation scheme in 360-degree imagery. Extensive experiments are conducted,
which cover indoor and outdoor scenarios, and each of them is investigated with
Pin2Pan and Syn2Real regimens. Trans4PASS+ achieves state-of-the-art
performance on four domain adaptive panoramic semantic segmentation
benchmarks. Code is available at https://github.com/jamycheung/Trans4PASS.
Comment: Extended version of CVPR 2022 paper arXiv:2203.01452. Code is
available at https://github.com/jamycheung/Trans4PASS
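
To give a flavour of the deformable patch embedding idea, the following sketch learns per-pixel sampling offsets and resamples the panorama before a standard patch projection, so the embedding can follow panoramic distortions. It is an illustrative approximation only; the actual DPE and DMLPv2 modules are defined in the released Trans4PASS code, and the layer names, offset scale, and dimensions here are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformablePatchEmbed(nn.Module):
    """Illustrative deformable patch embedding (not the released implementation)."""

    def __init__(self, in_ch=3, embed_dim=64, patch=4, max_shift=0.05):
        super().__init__()
        self.max_shift = max_shift
        # Predicts a per-pixel (dx, dy) sampling offset.
        self.offset = nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)
        # Standard patch embedding applied to the resampled image.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        b, _, h, w = x.shape
        # Regular sampling grid in normalized [-1, 1] coordinates.
        gy, gx = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # Predicted offsets warp the grid so sampling adapts to distortions.
        shift = self.max_shift * torch.tanh(self.offset(x)).permute(0, 2, 3, 1)
        warped = F.grid_sample(x, grid + shift, align_corners=True)
        # Patch projection on the deformably resampled image.
        return self.proj(warped)  # (B, embed_dim, H/patch, W/patch)
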
Synthesizing Training Data for Object Detection in Indoor Scenes
Detection of objects in cluttered indoor environments is one of the key
enabling functionalities for service robots. The best performing object
detection approaches in computer vision exploit deep Convolutional Neural
Networks (CNN) to simultaneously detect and categorize the objects of interest
in cluttered scenes. Training of such models typically requires large amounts
of annotated training data which is time consuming and costly to obtain. In
this work we explore the use of synthetically generated composite
images for training state-of-the-art object detectors, especially for object
instance detection. We superimpose 2D images of textured object models onto
images of real environments at a variety of locations and scales. Our experiments
evaluate different superimposition strategies ranging from purely image-based
blending all the way to depth and semantics informed positioning of the object
models into real scenes. We demonstrate the effectiveness of these object
detector training strategies on two publicly available datasets, the
GMU-Kitchens and the Washington RGB-D Scenes v2. As one observation, augmenting
some hand-labeled training data with synthetic examples carefully composed onto
scenes yields object detectors with performance comparable to using much more
hand-labeled data. Broadly, this work charts new opportunities for training
detectors for new objects by exploiting existing object model repositories in
either a purely automatic fashion or with only a very small number of
human-annotated examples.
Comment: Added more experiments and link to project webpage
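
The purely image-based end of the superimposition spectrum can be sketched as follows: a textured object crop with an alpha channel is pasted onto a real background at a random location and scale, and the resulting bounding box becomes a training annotation. This is a minimal sketch with hypothetical helper names; the paper's depth- and semantics-informed placement strategies are not shown.

import numpy as np
import cv2

def composite(background, obj_rgba, rng=np.random.default_rng()):
    """Paste an RGBA object crop onto an RGB background (both uint8).

    Assumes the object crop is smaller than the background.
    Returns the composited image and the pasted bounding box.
    """
    bh, bw = background.shape[:2]
    # Random scale and location for the object.
    scale = rng.uniform(0.3, 1.0)
    oh, ow = (np.array(obj_rgba.shape[:2]) * scale).astype(int)
    obj = cv2.resize(obj_rgba, (ow, oh))
    x0 = int(rng.integers(0, max(bw - ow, 1)))
    y0 = int(rng.integers(0, max(bh - oh, 1)))
    # Alpha-blend the object onto the background region.
    alpha = obj[..., 3:4] / 255.0
    out = background.copy()
    roi = out[y0:y0 + oh, x0:x0 + ow]
    out[y0:y0 + oh, x0:x0 + ow] = (alpha * obj[..., :3] + (1 - alpha) * roi).astype(np.uint8)
    # Bounding-box annotation for the detector: (x_min, y_min, x_max, y_max).
    return out, (x0, y0, x0 + ow, y0 + oh)
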