Dense Piecewise Planar RGB-D SLAM for Indoor Environments
The paper exploits weak Manhattan constraints to parse the structure of
indoor environments from RGB-D video sequences in an online setting. We extend
the previous approach for single view parsing of indoor scenes to video
sequences and formulate the problem of recovering the floor plan of the
environment as an optimal labeling problem solved using dynamic programming.
The temporal continuity is enforced in a recursive setting, where labeling from
previous frames is used as a prior term in the objective function. In addition
to recovering the piecewise-planar weak Manhattan structure of the extended
environment, the orthogonality constraints are also exploited in visual
odometry and pose-graph optimization. This yields reliable estimates in the
presence of large motions and the absence of distinctive features to track. We
evaluate our method on several challenging indoor sequences, demonstrating
accurate SLAM and dense mapping of low-texture environments. On the existing
TUM benchmark we achieve results competitive with alternative approaches,
which fail in our environments.
Comment: International Conference on Intelligent Robots and Systems (IROS) 201
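The optimal-labeling step can be illustrated with a small sketch. Below is a hypothetical Viterbi-style dynamic program over a 1D sequence of sites, not the paper's exact objective: unary[t, l] holds per-site data costs, a Potts pairwise term encourages piecewise-constant labelings, and disagreement with the previous frame's labels adds a penalty, playing the role of the recursive temporal prior.

```python
import numpy as np

def label_floorplan(unary, prev_labels=None, pairwise=1.0, prior_w=0.5):
    """Minimal DP labeling sketch: unary (T, L) data costs, Potts smoothness,
    and an optional temporal prior from the previous frame's labeling."""
    T, L = unary.shape
    cost = unary.astype(float).copy()
    if prev_labels is not None:
        for t, pl in enumerate(prev_labels):
            cost[t] += prior_w       # penalize disagreement with the prior
            cost[t, pl] -= prior_w   # no penalty when agreeing
    dp = np.zeros((T, L))
    back = np.zeros((T, L), dtype=int)
    dp[0] = cost[0]
    for t in range(1, T):
        # transition matrix: rows = previous label, cols = current label;
        # switching labels pays the Potts pairwise penalty
        trans = dp[t - 1][:, None] + pairwise * (1 - np.eye(L))
        back[t] = trans.argmin(axis=0)
        dp[t] = trans.min(axis=0) + cost[t]
    labels = np.zeros(T, dtype=int)
    labels[-1] = dp[-1].argmin()
    for t in range(T - 2, -1, -1):
        labels[t] = back[t + 1, labels[t + 1]]
    return labels
```

The temporal prior simply biases each site toward its previous-frame label, which is one common way to make such a labeling recursive.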
Recurrent Scene Parsing with Perspective Understanding in the Loop
Objects may appear at arbitrary scales in perspective images of a scene,
posing a challenge for recognition systems that process images at a fixed
resolution. We propose a depth-aware gating module that adaptively selects the
pooling field size in a convolutional network architecture according to the
object scale (inversely proportional to the depth) so that small details are
preserved for distant objects while larger receptive fields are used for those
nearby. The depth gating signal is provided by stereo disparity or estimated
directly from monocular input. We integrate this depth-aware gating into a
recurrent convolutional neural network to perform semantic segmentation. Our
recurrent module iteratively refines the segmentation results, leveraging the
depth and semantic predictions from the previous iterations.
Through extensive experiments on four popular large-scale RGB-D datasets, we
demonstrate that this approach achieves competitive semantic segmentation
performance with a substantially more compact model. We carry out
extensive analysis of this architecture including variants that operate on
monocular RGB but use depth as side-information during training, unsupervised
gating as a generic attentional mechanism, and multi-resolution gating. We find
that gated pooling for joint semantic segmentation and depth yields
state-of-the-art results for quantitative monocular depth estimation.
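A toy illustration of the gating idea on a single-channel feature map. The paper learns the gating inside a CNN; here the depth-to-field-size mapping is a fixed, hypothetical quantization, with larger pooling fields for nearby (small-depth) pixels and smaller fields for distant ones:

```python
import numpy as np

def avg_pool_same(x, k):
    # k x k mean filter with edge padding, output the same size as x
    if k == 1:
        return x
    p = k // 2
    xp = np.pad(x, p, mode='edge')
    out = np.zeros_like(x)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def depth_gated_pool(feat, depth, sizes=(7, 5, 3, 1), d_max=10.0):
    """Per-pixel pooling-field selection from depth: nearest pixels pick
    the largest field, farthest pick the smallest (a fixed quantization,
    standing in for the learned gating signal)."""
    pooled = np.stack([avg_pool_same(feat, k) for k in sizes])
    # quantize depth into len(sizes) bins: bin 0 = nearest -> largest field
    idx = np.clip((depth / d_max * len(sizes)).astype(int), 0, len(sizes) - 1)
    h, w = feat.shape
    return pooled[idx, np.arange(h)[:, None], np.arange(w)[None, :]]
```

In the actual architecture the per-pixel selection would be a soft, differentiable mixture over pooled branches rather than this hard argmax-style pick.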
On the Calibration of Active Binocular and RGBD Vision Systems for Dual-Arm Robots
This paper describes a camera and hand-eye calibration methodology for
integrating an active binocular robot head within a dual-arm robot. For this
purpose, we derive the forward kinematic model of our active robot head and
describe our methodology for calibrating and integrating it. This rigid
calibration provides a closed-form hand-to-eye solution. We then present an
approach for dynamically updating the camera extrinsic parameters for optimal
3D reconstruction, which is the foundation for robotic tasks such as grasping
and manipulating rigid and deformable objects. Experimental results show that
our robot head achieves an overall sub-millimetre accuracy (below 0.3 mm)
while recovering the 3D structure of a scene. In addition, we report a
comparative study between current RGBD cameras and our active stereo head
within two dual-arm robotic testbeds, demonstrating the accuracy and
portability of the proposed methodology.
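The closed-form flavour of such a calibration can be sketched as follows. This is a standard textbook simplification, not the paper's method: from pairs of gripper motions A_i and camera motions B_i satisfying the hand-eye relation A_i X = X B_i, the rotation axes obey a_i = R_X b_i, so the hand-eye rotation can be recovered by aligning the two axis sets with the Kabsch algorithm (rodrigues is just a helper to build rotations).

```python
import numpy as np

def rodrigues(axis, angle):
    # rotation matrix from axis-angle (Rodrigues' formula)
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def rotation_axis(R):
    # unit rotation axis from the skew-symmetric part of R
    a = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return a / np.linalg.norm(a)

def hand_eye_rotation(As, Bs):
    # Solve a_i ~= R_X b_i via Kabsch (orthogonal Procrustes):
    # maximize tr(R H) with H = sum_i b_i a_i^T
    P = np.array([rotation_axis(B) for B in Bs])  # camera motion axes
    Q = np.array([rotation_axis(A) for A in As])  # gripper motion axes
    U, _, Vt = np.linalg.svd(P.T @ Q)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])
    return Vt.T @ D @ U.T
```

A full solution would also recover the hand-to-eye translation from the same motion pairs; this sketch covers the rotational part only.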
2D+3D Indoor Scene Understanding from a Single Monocular Image
Scene understanding, as a broad field encompassing many
subtopics, has gained great interest in recent years. Among these
subtopics, indoor scene understanding, having its own specific
attributes and challenges compared to outdoor scene understanding, has drawn
a lot of attention. It has potential
applications in a wide variety of domains, such as robotic
navigation, object grasping for personal robotics, augmented
reality, etc. To our knowledge, existing research on indoor
scenes typically makes use of depth sensors, such as the Kinect,
which are, however, not always available.
In this thesis, we focused on addressing the indoor scene
understanding tasks in a general case, where only a monocular
color image of the scene is available. Specifically, we first
studied the problem of estimating a detailed depth map from a
monocular image. Then, benefiting from deep-learning-based depth
estimation, we tackled the higher-level tasks of 3D box proposal
generation, and scene parsing with instance segmentation,
semantic labeling and support relationship inference from a
monocular image. Our research on indoor scene understanding
provides a comprehensive scene interpretation at various
perspectives and scales.
For monocular image depth estimation, previous approaches are
limited in that they only reason about depth locally on a single
scale, and do not utilize the important information of geometric
scene structures. Here, we developed a novel graphical model,
which reasons about detailed depth while leveraging geometric
scene structures at multiple scales.
For 3D box proposals, to our best knowledge, our approach
constitutes the first attempt to reason about class-independent
3D box proposals from a single monocular image. To this end, we
developed a novel integrated, differentiable framework that
estimates depth, extracts a volumetric scene representation and
generates 3D proposals. At the core of this framework lies a
novel residual, differentiable truncated signed distance function
module, which is able to handle the relatively low accuracy of
the predicted depth map.
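A truncated signed distance function can be sketched in a few lines. This is a generic single-view illustration under assumed pinhole intrinsics (fx, fy, cx, cy are hypothetical parameters); the thesis module is residual and differentiable, which this plain-numpy version does not reproduce. Each voxel is projected into the depth map, and the clipped difference between observed depth and voxel depth gives the truncated signed distance.

```python
import numpy as np

def tsdf_from_depth(depth, fx, fy, cx, cy, voxels, trunc=0.05):
    """Single-view TSDF sketch: voxels is an (N, 3) array of points in the
    camera frame. Real systems fuse many views with running weighted
    averages; this computes one view's contribution."""
    X, Y, Z = voxels[:, 0], voxels[:, 1], voxels[:, 2]
    Zs = np.where(Z > 0, Z, 1.0)          # avoid divide-by-zero; masked below
    u = np.round(fx * X / Zs + cx).astype(int)
    v = np.round(fy * Y / Zs + cy).astype(int)
    h, w = depth.shape
    valid = (Z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sdf = np.full(len(voxels), np.nan)
    # signed distance along the viewing ray, truncated to [-trunc, trunc]
    sdf[valid] = np.clip(depth[v[valid], u[valid]] - Z[valid], -trunc, trunc)
    return sdf
```

Positive values lie in front of the observed surface, negative values behind it, and the truncation band is what makes the representation tolerant of the depth errors the thesis highlights.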
For scene parsing, we tackled its three subtasks of instance
segmentation, semantic labeling, and the support relationship
inference on instances. Existing work typically reasons about
these individual subtasks independently. Here, we leverage the
fact that they bear strong connections, which can facilitate
addressing these subtasks if modeled properly. To this end, we
developed an integrated graphical model that reasons about the
mutual relationships of the above subtasks.
In summary, in this thesis, we introduced novel and effective
methodologies for each of three indoor scene understanding tasks,
i.e., depth estimation, 3D box proposal generation, and scene
parsing, and exploited the dependencies on depth estimates of the
latter two tasks. Evaluation on several benchmark datasets
demonstrated the effectiveness of our algorithms and the benefits
of utilizing depth estimates for higher-level tasks.
Segmentation and semantic labelling of RGBD data with convolutional neural networks and surface fitting
We present an approach for segmentation and semantic labelling of RGBD data that exploits geometrical cues together with deep learning techniques. An initial over-segmentation is performed using spectral clustering, and a set of non-uniform rational B-spline surfaces is fitted to the extracted segments. A convolutional neural network (CNN) then receives as input the colour and geometry data together with the surface fitting parameters. The network is made of nine convolutional stages followed by a softmax classifier and produces a vector of descriptors for each sample. In the next step, an iterative merging algorithm recombines the output of the over-segmentation into larger regions matching the various elements of the scene. Pairs of adjacent segments with higher similarity according to the CNN features are candidates for merging, and the surface fitting accuracy is used to detect which pairs of segments belong to the same surface. Finally, a set of labelled segments is obtained by combining the segmentation output with the descriptors from the CNN. Experimental results show that the proposed approach outperforms state-of-the-art methods and provides accurate segmentation and labelling.
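The merging stage can be illustrated with a minimal greedy sketch. This is a simplification, not the paper's algorithm: the surface-fitting consistency check is omitted, similarity is plain cosine similarity of the CNN descriptors, and sim_threshold is a hypothetical parameter.

```python
import numpy as np

def greedy_merge(features, adjacency, sim_threshold=0.9):
    """Repeatedly merge the most similar pair of adjacent segments until
    no adjacent pair exceeds the similarity threshold. Returns, for each
    input segment, the index of the region it ended up in."""
    labels = list(range(len(features)))
    feats = {i: f / np.linalg.norm(f) for i, f in enumerate(features)}
    edges = set(tuple(sorted(e)) for e in adjacency)
    while True:
        best, best_sim = None, sim_threshold
        for i, j in edges:
            s = float(feats[i] @ feats[j])  # cosine similarity (unit vectors)
            if s > best_sim:
                best, best_sim = (i, j), s
        if best is None:
            return labels
        i, j = best
        # merge j into i: combine descriptors, relabel, rewire adjacency
        f = feats[i] + feats[j]
        feats[i] = f / np.linalg.norm(f)
        del feats[j]
        labels = [i if l == j else l for l in labels]
        edges = {tuple(sorted((i if a == j else a, i if b == j else b)))
                 for a, b in edges if {a, b} != {i, j}}
        edges = {e for e in edges if e[0] != e[1]}
```

In the full pipeline, a candidate pair passing this descriptor test would additionally be accepted only if a single NURBS surface fits the union of the two segments well.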