Plane-Based Optimization of Geometry and Texture for RGB-D Reconstruction of Indoor Scenes
We present a novel approach to reconstructing RGB-D indoor scenes with plane primitives. Our approach takes as input an RGB-D sequence and a dense coarse mesh reconstructed from that sequence by a 3D reconstruction method, and generates a lightweight, low-polygon mesh with clear face textures and sharp features, without losing geometric detail from the original scene. To achieve this, we first partition the input mesh with plane primitives and simplify it into a lightweight mesh; we then optimize plane parameters, camera poses, and texture colors to maximize photometric consistency across frames, and finally optimize the mesh geometry to maximize consistency between the geometry and the planes. Compared to existing planar reconstruction methods, which only cover large planar regions of a scene, our method builds the entire scene from adaptive planes without losing geometric detail, and preserves sharp features in the final mesh. We demonstrate the effectiveness of our approach by applying it to several RGB-D scans and comparing it to other state-of-the-art reconstruction methods.

Comment: in International Conference on 3D Vision 2018; models and code: see https://github.com/chaowang15/plane-opt-rgbd. arXiv admin note: text overlap with arXiv:1905.0885
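The pipeline above alternates between geometric and photometric objectives. As a minimal, hypothetical illustration of the photometric-consistency step (the function names and the scalar color model are our own simplifications, not the paper's implementation), the optimal texture color of a texel under a sum-of-squared-differences objective is simply the mean of its per-frame observations:

```python
# Hypothetical toy model of the photometric-consistency objective:
# a texel's color c is chosen to minimize sum_i (o_i - c)^2 over the
# frames i that observe it. (Scalar intensities for simplicity.)

def photometric_residuals(observations, color):
    """Per-frame residuals between observed colors and the texel color."""
    return [o - color for o in observations]

def optimize_texel_color(observations):
    """Closed-form minimizer of sum (o_i - c)^2: the mean observation."""
    return sum(observations) / len(observations)

# Observed intensities of one texel across five frames (made-up numbers).
obs = [0.42, 0.45, 0.40, 0.44, 0.43]
c = optimize_texel_color(obs)
energy = sum(r * r for r in photometric_residuals(obs, c))
```

In the full method, the plane parameters and camera poses enter this objective through the projection that decides which pixel observes each texel, so they must be optimized jointly rather than in closed form.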
Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding
Pretrained backbones with fine-tuning have been widely adopted in 2D vision and natural language processing tasks and have demonstrated significant advantages over task-specific networks. In this paper, we present a pretrained 3D backbone, named Swin3D, which is the first to outperform all state-of-the-art methods on downstream 3D indoor scene understanding tasks. Our backbone network is based on a 3D Swin transformer and is carefully designed to efficiently conduct self-attention on sparse voxels with linear memory complexity and to capture the irregularity of point signals via a generalized contextual relative positional embedding. Based on this backbone design, we pretrained a large Swin3D model on the synthetic Structured3D dataset, which is 10 times larger than the ScanNet dataset, and fine-tuned the pretrained model on various downstream real-world indoor scene understanding tasks. The results demonstrate that our model pretrained on the synthetic dataset not only exhibits good generality in both downstream segmentation and detection on real 3D point datasets, but also surpasses the state-of-the-art methods on downstream tasks after fine-tuning: +2.3 mIoU and +2.2 mIoU on S3DIS Area 5 and 6-fold semantic segmentation, +2.1 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. Our method demonstrates the great potential of pretrained 3D backbones with fine-tuning for 3D understanding tasks. The code and models are available at https://github.com/microsoft/Swin3D
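As a rough sketch of what a relative positional bias in voxel self-attention can look like (a generic formulation for illustration; the learned, contextual embedding in Swin3D is more elaborate), each attention logit is offset by a term indexed by the relative coordinate between the query and key voxels:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attend(feats, coords, bias_table):
    """Self-attention over voxel features with a relative positional bias.

    feats: list of feature vectors; coords: integer voxel coordinates;
    bias_table: maps relative-coordinate tuples to a scalar logit bias
    (a stand-in for the learned embedding).
    """
    out = []
    for qi, q in enumerate(feats):
        logits = []
        for ki, k in enumerate(feats):
            rel = tuple(a - b for a, b in zip(coords[ki], coords[qi]))
            dot = sum(a * b for a, b in zip(q, k))
            logits.append(dot + bias_table.get(rel, 0.0))
        w = softmax(logits)
        out.append([sum(wi * v[d] for wi, v in zip(w, feats))
                    for d in range(len(feats[0]))])
    return out

# A positive bias toward the neighbor one voxel to the right (+x).
feats = [[1.0, 0.0], [0.0, 1.0]]
coords = [(0, 0, 0), (1, 0, 0)]
out = self_attend(feats, coords, {(1, 0, 0): 2.0})
```

In this toy run, the bias makes the first voxel attend mostly to its right-hand neighbor even though its dot product with itself is larger; only attending within sparse local windows is what keeps memory linear in practice.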
Object as Query: Lifting any 2D Object Detector to 3D Detection
3D object detection from multi-view images has drawn much attention over the
past few years. Existing methods mainly establish 3D representations from
multi-view images and adopt a dense detection head for object detection, or
employ object queries distributed in 3D space to localize objects. In this
paper, we design Multi-View 2D Objects guided 3D Object Detector (MV2D), which
can lift any 2D object detector to multi-view 3D object detection. Since 2D
detections can provide valuable priors for object existence, MV2D exploits 2D
detectors to generate object queries conditioned on the rich image semantics.
These dynamically generated queries help MV2D to recall objects in the field of
view and show a strong capability of localizing 3D objects. For the generated
queries, we design a sparse cross attention module to force them to focus on
the features of specific objects, which suppresses interference from noise.
The evaluation results on the nuScenes dataset demonstrate that dynamic object
queries and sparse feature aggregation can promote 3D detection capability.
MV2D also achieves state-of-the-art performance among existing methods. We
hope MV2D can serve as a new baseline for future research.

Comment: technical report
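The query-generation idea can be caricatured in a few lines. In this hypothetical sketch (our own data layout, not MV2D's implementation), a 2D box selects the image features it covers, and an object query is initialized by pooling them; restricting later cross-attention to that same subset is what makes the attention "sparse":

```python
def features_in_box(feature_map, box):
    """feature_map: {(x, y): feature vector}; box: (x0, y0, x1, y1).
    Returns only the features a 2D detection covers -- the candidate
    set for sparse cross-attention."""
    x0, y0, x1, y1 = box
    return {p: f for p, f in feature_map.items()
            if x0 <= p[0] <= x1 and y0 <= p[1] <= y1}

def query_from_box(feature_map, box):
    """Initialize an object query by average-pooling features in the box."""
    feats = list(features_in_box(feature_map, box).values())
    return [sum(f[d] for f in feats) / len(feats)
            for d in range(len(feats[0]))]

# A tiny feature map: two foreground cells near the origin, one far away.
fm = {(0, 0): [1.0, 0.0], (1, 0): [3.0, 0.0], (5, 5): [100.0, 0.0]}
q = query_from_box(fm, (0, 0, 2, 2))  # pools only the two covered cells
```

The far-away feature never influences the query, which is the intuition behind suppressing background noise.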
Occlusion reasoning for multiple object visual tracking
Thesis (Ph.D.)--Boston University

Occlusion reasoning for visual object tracking in uncontrolled environments is a challenging problem. It becomes significantly more difficult when dense groups of indistinguishable objects are present in the scene, causing frequent inter-object interactions and occlusions. We present several practical solutions that tackle inter-object occlusions for video surveillance applications.
In particular, this thesis proposes three methods. First, we propose "reconstruction-tracking," an online multi-camera spatial-temporal data association method for tracking large groups of objects imaged with low resolution. As a variant of the well-known Multiple-Hypothesis-Tracker, our approach localizes the positions of objects in 3D space with possibly occluded observations from multiple camera views and performs temporal data association in 3D. Second, we develop "track linking," a class of offline batch processing algorithms for long-term occlusions, where the decision has to be made based on the observations from the entire tracking sequence. We construct a graph representation to characterize occlusion events and propose an efficient graph-based/combinatorial algorithm to resolve occlusions.
Third, we propose a novel Bayesian framework where detection and data association are combined into a single module and solved jointly. Almost all traditional tracking systems address the detection and data association tasks separately in sequential order. Such a design implies that the output of the detector has to be reliable in order to make the data association work. Our framework takes advantage of the often complementary nature of the two subproblems, which not only avoids the error propagation issue from which traditional "detection-tracking approaches" suffer but also eschews common heuristics such as "nonmaximum suppression" of hypotheses by modeling the likelihood of the entire image.
The thesis describes a substantial number of experiments, involving challenging and notably distinct simulated and real data, including infrared and visible-light datasets that we recorded ourselves or took from publicly available sources. In these videos, the number of objects ranges from a dozen to a hundred per frame, in both monocular and multiple views. The experiments demonstrate that our approaches achieve results comparable to those of state-of-the-art approaches.
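The track-linking idea lends itself to a small sketch. The version below is a deliberately simplified, hypothetical variant: tracklets are linked greedily by a cost mixing temporal gap and spatial displacement, whereas the thesis resolves the associations with a graph-based/combinatorial formulation:

```python
# Hypothetical sketch of offline "track linking": a tracklet that ends
# shortly before another begins is a candidate link. Costs and the greedy
# matching below are illustrative stand-ins for the thesis's graph method.

def link_cost(t_end, t_start, max_gap=10):
    """Cost of linking the end of one tracklet to the start of another."""
    gap = t_start["first_frame"] - t_end["last_frame"]
    if gap <= 0 or gap > max_gap:
        return None  # temporally infeasible link
    dx = t_start["first_pos"][0] - t_end["last_pos"][0]
    dy = t_start["first_pos"][1] - t_end["last_pos"][1]
    return gap + (dx * dx + dy * dy) ** 0.5

def link_tracklets(tracklets):
    """Greedily pick the cheapest feasible links, one per endpoint."""
    candidates = []
    for i, a in enumerate(tracklets):
        for j, b in enumerate(tracklets):
            if i == j:
                continue
            c = link_cost(a, b)
            if c is not None:
                candidates.append((c, i, j))
    links, used_src, used_dst = [], set(), set()
    for c, i, j in sorted(candidates):
        if i not in used_src and j not in used_dst:
            links.append((i, j))
            used_src.add(i)
            used_dst.add(j)
    return links
```

Replacing the greedy loop with an optimal assignment (e.g. min-cost matching over the same candidate graph) is what turns this sketch into a principled batch formulation.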
Unsupervised Learning of Edges
Data-driven approaches for edge detection have proven effective and achieve
top results on modern benchmarks. However, all current data-driven edge
detectors require manual supervision for training in the form of hand-labeled
region segments or object boundaries. Specifically, human annotators mark
semantically meaningful edges which are subsequently used for training. Is this
form of strong, high-level supervision actually necessary to learn to
accurately detect edges? In this work we present a simple yet effective
approach for training edge detectors without human supervision. To this end we
utilize motion, and more specifically, the only input to our method is noisy
semi-dense matches between frames. We begin with only a rudimentary knowledge
of edges (in the form of image gradients), and alternate between improving
motion estimation and edge detection in turn. Using a large corpus of video
data, we show that edge detectors trained using our unsupervised scheme
approach the performance of the same methods trained with full supervision
(within 3-5%). Finally, we show that when using a deep network for the edge
detector, our approach provides a novel pre-training scheme for object
detection.

Comment: Camera ready version for CVPR 201
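The alternation described above can be sketched schematically. Both update rules below are hypothetical stand-ins (finite differences for the rudimentary initial edges, and a simple blending step toward motion-discontinuity labels in place of retraining); the paper trains real edge detectors and motion estimators:

```python
def gradient_edges(image):
    """Rudimentary initial edge map: horizontal finite differences,
    the 'image gradient' starting point mentioned above."""
    w = len(image[0])
    return [[abs(row[x + 1] - row[x]) if x + 1 < w else 0.0
             for x in range(w)] for row in image]

def refine_edges(edges, motion_boundaries, alpha=0.5):
    """One alternation step: pull the edge map toward labels derived from
    motion discontinuities (a made-up update rule for illustration)."""
    return [[(1 - alpha) * e + alpha * m for e, m in zip(er, mr)]
            for er, mr in zip(edges, motion_boundaries)]

# One alternation round on a toy 1-row image with a step at column 2.
image = [[0, 0, 1, 1]]
edges = gradient_edges(image)
motion = [[0.0, 1.0, 0.0, 0.0]]  # pretend the matches flag the same step
edges = refine_edges(edges, motion)
```

The key property the real system exploits is that motion boundaries and intensity edges agree often enough that each signal can supervise the other, with no human labels anywhere in the loop.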