PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation
Depth-aware Video Panoptic Segmentation (DVPS) is a new and challenging
vision problem that aims to predict panoptic segmentation and depth in a video
simultaneously. Previous work solves this task by extending an existing
panoptic segmentation method with an extra dense depth prediction head and an
instance tracking head. However, the relationship between depth and panoptic
segmentation is not well explored: simply combining existing methods leads to
competition between the sub-tasks and requires careful weight balancing. In
this paper, we present PolyphonicFormer, a vision transformer that unifies
these sub-tasks under the DVPS task and leads to more robust results. Our
principal insight is that depth can be harmonized with panoptic segmentation
through our proposed new paradigm of predicting instance-level depth maps with
object queries; the relationship between the two tasks is then explored via
query-based learning. Our experiments demonstrate the benefits of this design
for both depth estimation and panoptic segmentation. Since each thing query
also encodes instance-wise information, it is natural to perform tracking
directly through appearance learning. Our method achieves state-of-the-art
results on two DVPS datasets (Semantic KITTI, Cityscapes) and ranks 1st on
the ICCV-2021 BMTT Challenge video + depth track. Code is available at
https://github.com/HarborYuan/PolyphonicFormer
Comment: Accepted by ECCV 2022
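The abstract's central mechanism (a single set of object queries decoded into both panoptic masks and per-instance depth maps) can be illustrated with a minimal PyTorch sketch. The `QueryDepthHead` name, the tensor shapes, and the sigmoid depth normalization are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class QueryDepthHead(nn.Module):
    """Sketch: one set of object queries decodes both masks and depth."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.mask_proj = nn.Linear(embed_dim, embed_dim)   # query -> mask kernel
        self.depth_proj = nn.Linear(embed_dim, embed_dim)  # query -> depth kernel

    def forward(self, queries, pixel_feats):
        # queries:     (N, Q, C), one embedding per thing/stuff query
        # pixel_feats: (N, C, H, W), dense features from a pixel decoder
        mask_logits = torch.einsum("nqc,nchw->nqhw",
                                   self.mask_proj(queries), pixel_feats)
        # The same queries also decode an instance-level depth map, so the
        # two tasks share one representation instead of competing heads.
        inst_depth = torch.einsum("nqc,nchw->nqhw",
                                  self.depth_proj(queries), pixel_feats).sigmoid()
        return mask_logits, inst_depth

# Example: 100 queries over a downsampled Cityscapes-like feature map.
head = QueryDepthHead()
masks, depths = head(torch.randn(2, 100, 256), torch.randn(2, 256, 128, 256))
```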
Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
We present an end-to-end joint training framework that explicitly models
6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular
camera setup without supervision. Our technical contributions are three-fold.
First, we highlight the fundamental difference between inverse and forward
projection while modeling the individual motion of each rigid object, and
propose a geometrically correct projection pipeline using a neural forward
projection module. Second, we design a unified instance-aware photometric and
geometric consistency loss that holistically imposes self-supervisory signals
for every background and object region. Lastly, we introduce a general-purpose
auto-annotation scheme using any off-the-shelf instance segmentation and
optical flow models to produce video instance segmentation maps that will be
utilized as input to our training pipeline. These proposed elements are
validated in a detailed ablation study. Through extensive experiments conducted
on the KITTI and Cityscapes datasets, our framework is shown to outperform
state-of-the-art depth and motion estimation methods. Our code, dataset, and
models are available at https://github.com/SeokjuLee/Insta-DM
Comment: Accepted to AAAI 2021. arXiv admin note: substantial text overlap with arXiv:1912.0935
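A rough sketch of the unified instance-aware photometric consistency idea: the target frame is compared region by region against views synthesized with ego-motion alone (background) and with each object's own predicted 6-DoF motion. The function name, the plain L1 error (the paper's loss would typically mix SSIM and L1), and the mask handling are assumptions:

```python
import torch

def instance_photometric_loss(tgt, synth_bg, synth_objs, obj_masks):
    # tgt, synth_bg: (N, 3, H, W); synth_objs: list of K (N, 3, H, W) images,
    # each warped with one object's predicted 6-DoF motion;
    # obj_masks: (N, K, H, W) binary instance masks.
    def photo_error(a, b):
        return (a - b).abs().mean(dim=1, keepdim=True)  # per-pixel L1, (N,1,H,W)

    # Background pixels are supervised by the ego-motion-only synthesis.
    bg_mask = (1.0 - obj_masks.sum(dim=1, keepdim=True)).clamp(min=0.0)
    loss = (photo_error(tgt, synth_bg) * bg_mask).sum() / bg_mask.sum().clamp(min=1.0)
    # Each object region is supervised by its own motion-compensated view.
    for k in range(obj_masks.shape[1]):
        m = obj_masks[:, k:k + 1]
        loss = loss + (photo_error(tgt, synth_objs[k]) * m).sum() / m.sum().clamp(min=1.0)
    return loss
```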
iPose: Instance-Aware 6D Pose Estimation of Partly Occluded Objects
We address the task of 6D pose estimation of known rigid objects from single
input images in scenarios where the objects are partly occluded. Recent
RGB-D-based methods are robust to moderate degrees of occlusion. For RGB
inputs, however, no previous method works well for partly occluded objects. Our
main contribution is the first deep learning-based system that estimates
accurate poses for partly occluded objects from both RGB-D and RGB input. We
achieve this with a new instance-aware pipeline that decomposes 6D object pose
estimation into a sequence of simpler steps, where each step removes a specific
aspect of the problem. The first step localizes all known objects in the image
using an instance segmentation network, thereby eliminating surrounding clutter
and occluders. The second step densely maps pixels to 3D object surface
positions, so-called object coordinates, using an encoder-decoder network,
thereby eliminating object appearance. The third and final step predicts the 6D
pose using geometric optimization. We demonstrate that we significantly
outperform the state of the art for pose estimation of partly occluded objects
with both RGB and RGB-D inputs.
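The third step is a standard geometric optimization; assuming the first two stages already produced an instance mask and dense object coordinates, it could be realized with RANSAC-based PnP as in this sketch (the function name and parameters are illustrative, and the paper's exact optimizer may differ):

```python
import numpy as np
import cv2

def pose_from_object_coordinates(obj_coords, mask, K):
    # obj_coords: (H, W, 3) predicted 3D surface positions in the object frame;
    # mask: (H, W) boolean instance mask; K: (3, 3) camera intrinsics.
    ys, xs = np.nonzero(mask)
    pts_2d = np.stack([xs, ys], axis=1).astype(np.float64)  # pixel locations
    pts_3d = obj_coords[ys, xs].astype(np.float64)          # matched 3D points
    # RANSAC-based PnP tolerates the noisy coordinate predictions that remain
    # near occlusion boundaries even after segmentation removed the occluder.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K, None, reprojectionError=3.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec  # 6D pose: rotation matrix and translation vector
```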
Robust Dense Mapping for Large-Scale Dynamic Environments
We present a stereo-based dense mapping algorithm for large-scale dynamic
urban environments. In contrast to existing methods, we separately
reconstruct the static background, the moving objects, and the potentially
moving but currently stationary objects, which is desirable for
high-level mobile robotic tasks such as path planning in crowded environments.
We use both instance-aware semantic segmentation and sparse scene flow to
classify objects as either background, moving, or potentially moving, thereby
ensuring that the system is able to model objects with the potential to
transition from static to dynamic, such as parked cars. Given camera poses
estimated from visual odometry, both the background and the (potentially)
moving objects are reconstructed separately by fusing the depth maps computed
from the stereo input. In addition to visual odometry, sparse scene flow is
also used to estimate the 3D motions of the detected moving objects, in order
to reconstruct them accurately. A map pruning technique is further developed to
improve reconstruction accuracy and reduce memory consumption, leading to
increased scalability. We evaluate our system thoroughly on the well-known
KITTI dataset. Our system runs on a PC at approximately 2.5 Hz, with the
primary bottleneck being the instance-aware semantic segmentation, a limitation
we hope to address in future work. The source code is available from the
project website (http://andreibarsan.github.io/dynslam).
Comment: Presented at the IEEE International Conference on Robotics and Automation (ICRA), 2018
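The background / moving / potentially-moving split could look roughly like the following sketch, which combines a per-instance semantic class with the sparse scene-flow residuals left after ego-motion compensation. The class list and motion threshold are invented for illustration and are not the paper's values:

```python
import numpy as np

# Illustrative class list and threshold, not the paper's exact values.
POTENTIALLY_DYNAMIC = {"car", "truck", "bus", "person", "bicycle"}

def classify_instance(semantic_class, flow_residuals, motion_thresh=0.1):
    # flow_residuals: per-feature scene-flow magnitudes (e.g. in meters)
    # remaining after compensating for the ego-motion estimated by VO.
    if semantic_class not in POTENTIALLY_DYNAMIC:
        return "background"          # fused directly into the static map
    if flow_residuals.size > 0 and np.median(flow_residuals) > motion_thresh:
        return "moving"              # reconstructed with its own 3D motion
    return "potentially_moving"      # e.g. a parked car, kept separate
```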