32 research outputs found
Pattern-Affinitive Propagation across Depth, Surface Normal and Semantic Segmentation
In this paper, we propose a novel Pattern-Affinitive Propagation (PAP)
framework to jointly predict depth, surface normal and semantic segmentation.
The motivation behind it comes from the statistical observation that
pattern-affinitive pairs recur frequently across different tasks as well
as within a task. Thus, we can conduct two types of propagations, cross-task
propagation and task-specific propagation, to adaptively diffuse those similar
patterns. The former integrates the affinity patterns of the other tasks and
adapts them to each task through the computation of non-local relationships.
The latter then performs an iterative diffusion in the feature space so that the
cross-task affinity patterns can be widely spread within the task. Accordingly,
the learning of each task can be regularized and boosted by the complementary
task-level affinities. Extensive experiments demonstrate the effectiveness and
the superiority of our method on the three joint tasks. Meanwhile, we achieve
state-of-the-art or competitive results on the three related datasets:
NYUD-v2, SUN-RGBD and KITTI. Comment: 10 pages, 9 figures, CVPR 201
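A minimal sketch of the two propagation steps described above, under our own assumptions about the feature shapes; the helper names `affinity` and `diffuse`, the mixing weights, and the number of diffusion steps are illustrative, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of cross-task and task-specific
# propagation, assuming per-task feature maps of shape (B, C, H, W).
import torch
import torch.nn.functional as F

def affinity(feat):
    """Non-local pairwise affinity: (B, HW, HW) from a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    x = feat.flatten(2)                      # (B, C, HW)
    a = torch.einsum('bci,bcj->bij', x, x)   # dot-product similarity
    return F.softmax(a, dim=-1)              # row-normalised affinity

def diffuse(feat, A, steps=3, alpha=0.5):
    """Task-specific propagation: iteratively spread features along affinity A."""
    b, c, h, w = feat.shape
    x = feat.flatten(2)                      # (B, C, HW)
    for _ in range(steps):
        x = alpha * torch.einsum('bij,bcj->bci', A, x) + (1 - alpha) * x
    return x.view(b, c, h, w)

# Cross-task propagation: mix the affinities of the tasks into each task with
# (in practice learnable, here fixed) weights, then diffuse that task's features.
feats = {t: torch.randn(2, 64, 30, 40) for t in ('depth', 'normal', 'seg')}
affs = {t: affinity(f) for t, f in feats.items()}
w = {'depth': 0.6, 'normal': 0.2, 'seg': 0.2}        # example mixing weights
A_depth = sum(w[t] * affs[t] for t in affs)
depth_out = diffuse(feats['depth'], A_depth)
```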
X-PDNet: Accurate Joint Plane Instance Segmentation and Monocular Depth Estimation with Cross-Task Distillation and Boundary Correction
Segmentation of planar regions from a single RGB image is a particularly
important task in the perception of complex scenes. To utilize both visual and
geometric properties in images, recent approaches often formulate the problem
as a joint estimation of planar instances and dense depth through feature
fusion mechanisms and geometric constraint losses. Despite promising results,
these methods do not consider cross-task feature distillation and perform
poorly in boundary regions. To overcome these limitations, we propose X-PDNet,
a framework for the multitask learning of plane instance segmentation and depth
estimation with improvements in the following two aspects. Firstly, we
construct a cross-task distillation design that promotes early information
sharing between the two tasks to improve each task individually. Secondly, we
highlight the current limitations of using ground-truth boundaries to formulate
a boundary regression loss, and propose a novel method that exploits depth
information to support precise boundary region segmentation. Finally, we
manually annotate more than 3,000 images from the Stanford 2D-3D-Semantics
dataset and make them available for the evaluation of plane instance
segmentation. In our experiments, the proposed methods outperform the baseline
by large margins in the quantitative results on the ScanNet and Stanford
2D-3D-S datasets, demonstrating the effectiveness of our proposals. Comment:
Accepted to BMVC 202
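The following is an illustrative sketch, not the released X-PDNet code, of what a cross-task distillation block between the segmentation and depth branches could look like: each branch keeps its own features and adds a gated, projected copy of the other branch's intermediate features.

```python
# Illustrative cross-task distillation block (our own assumption of the idea).
import torch
import torch.nn as nn

class CrossTaskDistillation(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_seg = nn.Conv2d(channels, channels, 1)    # depth -> seg projection
        self.to_depth = nn.Conv2d(channels, channels, 1)  # seg -> depth projection
        self.gate_seg = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_depth = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, seg_feat, depth_feat):
        # Each task keeps its own features and adds what its gate selects
        # from the other task's projected features.
        seg_out = seg_feat + self.gate_seg(seg_feat) * self.to_seg(depth_feat)
        depth_out = depth_feat + self.gate_depth(depth_feat) * self.to_depth(seg_feat)
        return seg_out, depth_out

block = CrossTaskDistillation(64)
s, d = block(torch.randn(1, 64, 48, 64), torch.randn(1, 64, 48, 64))
```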
Multimodal Scale Consistency and Awareness for Monocular Self-Supervised Depth Estimation
Dense depth estimation is essential to scene-understanding for autonomous
driving. However, recent self-supervised approaches on monocular videos suffer
from scale inconsistency across long sequences. Utilizing data from the
ubiquitously co-present Global Positioning System (GPS), we tackle this
challenge by proposing a dynamically-weighted GPS-to-Scale (g2s) loss to
complement the appearance-based losses. We emphasize that GPS is needed
only during multimodal training, not at inference. The relative
distance between frames captured through the GPS provides a scale signal that
is independent of the camera setup and scene distribution, resulting in richer
learned feature representations. Through extensive evaluation on multiple
datasets, we demonstrate scale-consistent and -aware depth estimation during
inference, improving the performance even when training with low-frequency GPS
data. Comment: Accepted at the 2021 IEEE International Conference on Robotics
and Automation (ICRA).
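A minimal sketch of a GPS-to-Scale style loss, assuming the pose network outputs an inter-frame translation vector and that consecutive GPS fixes have already been converted to metric displacements; the dynamic weighting is reduced to a single scalar here for brevity, which is our simplification rather than the paper's exact formulation.

```python
# Sketch of a GPS-to-Scale style loss: penalise the gap between the magnitude
# of the predicted inter-frame translation and the GPS-derived distance.
import torch

def g2s_loss(pred_translation, gps_displacement, weight=1.0):
    """pred_translation: (B, 3) predicted inter-frame translation.
    gps_displacement: (B,) metric distance between the two frames from GPS."""
    pred_scale = pred_translation.norm(dim=-1)            # magnitude of predicted motion
    return weight * torch.abs(pred_scale - gps_displacement).mean()

t = torch.randn(8, 3, requires_grad=True)
d = torch.rand(8) * 2.0                                    # metres travelled between frames
loss = g2s_loss(t, d)
loss.backward()
```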
RigNet: Repetitive Image Guided Network for Depth Completion
Depth completion deals with the problem of recovering dense depth maps from
sparse ones, where color images are often used to facilitate this task. Recent
approaches mainly focus on image-guided learning frameworks to predict dense
depth. However, blurry guidance in the image and unclear structure in the depth
still impede the performance of image-guided frameworks. To tackle these
problems, we explore a repetitive design in our image-guided network to
gradually and sufficiently recover depth values. Specifically, the repetition
is embodied in both the image guidance branch and depth generation branch. In
the former branch, we design a repetitive hourglass network to extract
discriminative image features of complex environments, which provide
powerful contextual guidance for depth prediction. In the latter branch, we
introduce a repetitive guidance module based on dynamic convolution, in which
an efficient convolution factorization is proposed to simultaneously reduce its
complexity and progressively model high-frequency structures. Extensive
experiments show that our method achieves superior or competitive results on
the KITTI benchmark and the NYUv2 dataset. Comment: Accepted by ECCV202
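A rough sketch of the repetitive guidance idea, under our own simplifications rather than the RigNet implementation: image features predict one kernel per pixel, shared across channels (a cheap factorization of a full dynamic convolution), and the kernels are applied to the depth features several times.

```python
# Rough sketch of repetitive guidance with a factorised dynamic convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepetitiveGuidance(nn.Module):
    def __init__(self, channels, kernel_size=3, steps=3):
        super().__init__()
        self.k, self.steps = kernel_size, steps
        # Predict one k*k kernel per pixel, shared across channels
        # (a cheap factorisation of a full per-channel dynamic convolution).
        self.kernel_head = nn.Conv2d(channels, kernel_size * kernel_size, 3, padding=1)

    def forward(self, img_feat, depth_feat):
        b, c, h, w = depth_feat.shape
        kernels = F.softmax(self.kernel_head(img_feat), dim=1)            # (B, k*k, H, W)
        for _ in range(self.steps):
            patches = F.unfold(depth_feat, self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
            patches = patches.view(b, c, self.k * self.k, h * w)
            depth_feat = (patches * kernels.view(b, 1, -1, h * w)).sum(2).view(b, c, h, w)
        return depth_feat

mod = RepetitiveGuidance(32)
out = mod(torch.randn(1, 32, 60, 80), torch.randn(1, 32, 60, 80))
```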
SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection
Recently, pure camera-based Bird's-Eye-View (BEV) perception has provided a
feasible solution for economical autonomous driving. However, existing
BEV-based multi-view 3D detectors generally transform all image features into
BEV features without considering that the large proportion of background
information may submerge the object information. In this paper, we
propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out
background information according to the semantic segmentation of image features
and transform image features into semantic-aware BEV features. Accordingly, we
propose BEV-Paste, an effective data augmentation strategy that closely matches
the semantic-aware BEV features. In addition, we design a Multi-Scale
Cross-Task (MSCT) head, which combines task-specific and cross-task information
to predict depth distribution and semantic segmentation more accurately,
further improving the quality of the semantic-aware BEV features. Finally, we
integrate the above modules into a novel multi-view 3D object detection
framework, namely SA-BEV. Experiments on nuScenes show that SA-BEV achieves
state-of-the-art performance. Code is available at
https://github.com/mengtan00/SA-BEV.git
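A simplified sketch, not the released SA-BEV code, of semantic-aware pooling: per-pixel image features are weighted by their foreground probability before being scattered into BEV cells, so background pixels contribute little. The pixel-to-cell assignment is assumed to be given, and the function name is illustrative.

```python
# Sketch of semantic-aware pooling into a BEV grid.
import torch

def sa_bev_pool(img_feat, fg_prob, bev_index, num_cells):
    """img_feat: (N, C) per-pixel features; fg_prob: (N,) foreground probability
    from the semantic head; bev_index: (N,) flat BEV cell index per pixel."""
    n, c = img_feat.shape
    weighted = img_feat * fg_prob.unsqueeze(-1)          # suppress background pixels
    bev = torch.zeros(num_cells, c)
    bev.index_add_(0, bev_index, weighted)               # sum-pool into BEV cells
    return bev

feat = torch.randn(1000, 64)
prob = torch.rand(1000)
idx = torch.randint(0, 128 * 128, (1000,))
bev = sa_bev_pool(feat, prob, idx, 128 * 128)
```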
Egocentric Scene Understanding via Multimodal Spatial Rectifier
In this paper, we study the problem of egocentric scene understanding, i.e.,
predicting depths and surface normals from an egocentric image. Egocentric
scene understanding poses unprecedented challenges: (1) due to large head
movements, the images are taken from non-canonical viewpoints (i.e., tilted
images) where existing models of geometry prediction do not apply; (2) dynamic
foreground objects including hands constitute a large proportion of visual
scenes. These challenges limit the performance of the existing models learned
from large indoor datasets, such as ScanNet and NYUv2, which comprise
predominantly upright images of static scenes. We present a multimodal spatial
rectifier that stabilizes the egocentric images to a set of reference
directions, which allows learning a coherent visual representation. Unlike a
unimodal spatial rectifier, which often produces excessive perspective warp for
egocentric images, the multimodal spatial rectifier learns from multiple
directions that can minimize the impact of the perspective warp. To learn
visual representations of the dynamic foreground objects, we present a new
dataset called EDINA (Egocentric Depth on everyday INdoor Activities) that
comprises more than 500K synchronized RGBD frames and gravity directions.
Equipped with the multimodal spatial rectifier and the EDINA dataset, our
proposed method on single-view depth and surface normal estimation
significantly outperforms the baselines not only on our EDINA dataset, but also
on other popular egocentric datasets, such as First Person Hand Action (FPHA)
and EPIC-KITCHENS. Comment: Appearing in the Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 202
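An illustrative sketch, our own simplification rather than the authors' code, of the rectification step: the estimated gravity vector is rotated onto its closest reference direction and the image is warped by the induced homography K R K^-1.

```python
# Sketch of spatial rectification via a gravity-aligned homography warp.
import numpy as np
import cv2

def rotation_between(a, b):
    """Smallest rotation matrix taking unit vector a onto unit vector b."""
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def rectify(image, K, gravity, references):
    """Warp `image` so that `gravity` aligns with its nearest reference direction."""
    gravity = gravity / np.linalg.norm(gravity)
    ref = max(references, key=lambda r: float(np.dot(gravity, r)))   # closest direction
    R = rotation_between(gravity, ref)
    H = K @ R @ np.linalg.inv(K)                                     # induced homography
    return cv2.warpPerspective(image, H, (image.shape[1], image.shape[0]))

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
refs = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]        # e.g. down, forward
img = np.zeros((480, 640, 3), dtype=np.uint8)
out = rectify(img, K, np.array([0.1, 0.9, 0.2]), refs)
```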