667 research outputs found
FollowMe: Efficient Online Min-Cost Flow Tracking with Bounded Memory and Computation
One of the most popular approaches to multi-target tracking is
tracking-by-detection. Current min-cost flow algorithms which solve the data
association problem optimally have three main drawbacks: they are
computationally expensive, they assume that the whole video is given as a
batch, and they scale badly in memory and computation with the length of the
video sequence. In this paper, we address each of these issues, resulting in a
computationally and memory-bounded solution. First, we introduce a dynamic
version of the successive shortest-path algorithm which solves the data
association problem optimally while reusing computation, resulting in
significantly faster inference than standard solvers. Second, we address the
optimal solution to the data association problem when dealing with an incoming
stream of data (i.e., online setting). Finally, we present our main
contribution which is an approximate online solution with bounded memory and
computation which is capable of handling videos of arbitrarily length while
performing tracking in real time. We demonstrate the effectiveness of our
algorithms on the KITTI and PETS2009 benchmarks and show state-of-the-art
performance, while being significantly faster than existing solvers
Natural Virtual Reality User Interface to Define Assembly Sequences for Digital Human Models
Digital human models (DHMs) are virtual representations of human beings. They are used to conduct, among other things, ergonomic assessments in factory layout planning. DHM software tools are challenging in their use and thus require a high amount of training for engineers. In this paper, we present a virtual reality (VR) application that enables engineers to work with DHMs easily. Since VR systems with head-mounted displays (HMDs) are less expensive than CAVE systems, HMDs can be integrated more extensively into the product development process. Our application provides a reality-based interface and allows users to conduct an assembly task in VR and thus to manipulate the virtual scene with their real hands. These manipulations are used as input for the DHM to simulate, on that basis, human ergonomics. Therefore, we introduce a software and hardware architecture, the VATS (virtual action tracking system). This paper furthermore presents the results of a user study in which the VATS was compared to the existing WIMP (Windows, Icons, Menus and Pointer) interface. The results show that the VATS system enables users to conduct tasks in a significantly faster way
Semantic Visual Localization
Robust visual localization under a wide range of viewing conditions is a
fundamental problem in computer vision. Handling the difficult cases of this
problem is not only very challenging but also of high practical relevance,
e.g., in the context of life-long localization for augmented reality or
autonomous robots. In this paper, we propose a novel approach based on a joint
3D geometric and semantic understanding of the world, enabling it to succeed
under conditions where previous approaches failed. Our method leverages a novel
generative model for descriptor learning, trained on semantic scene completion
as an auxiliary task. The resulting 3D descriptors are robust to missing
observations by encoding high-level 3D geometric and semantic information.
Experiments on several challenging large-scale localization datasets
demonstrate reliable localization under extreme viewpoint, illumination, and
geometry changes
Robust Dense Mapping for Large-Scale Dynamic Environments
We present a stereo-based dense mapping algorithm for large-scale dynamic
urban environments. In contrast to other existing methods, we simultaneously
reconstruct the static background, the moving objects, and the potentially
moving but currently stationary objects separately, which is desirable for
high-level mobile robotic tasks such as path planning in crowded environments.
We use both instance-aware semantic segmentation and sparse scene flow to
classify objects as either background, moving, or potentially moving, thereby
ensuring that the system is able to model objects with the potential to
transition from static to dynamic, such as parked cars. Given camera poses
estimated from visual odometry, both the background and the (potentially)
moving objects are reconstructed separately by fusing the depth maps computed
from the stereo input. In addition to visual odometry, sparse scene flow is
also used to estimate the 3D motions of the detected moving objects, in order
to reconstruct them accurately. A map pruning technique is further developed to
improve reconstruction accuracy and reduce memory consumption, leading to
increased scalability. We evaluate our system thoroughly on the well-known
KITTI dataset. Our system is capable of running on a PC at approximately 2.5Hz,
with the primary bottleneck being the instance-aware semantic segmentation,
which is a limitation we hope to address in future work. The source code is
available from the project website (http://andreibarsan.github.io/dynslam).Comment: Presented at IEEE International Conference on Robotics and Automation
(ICRA), 201
OctNetFusion: Learning Depth Fusion from Data
In this paper, we present a learning based approach to depth fusion, i.e.,
dense 3D reconstruction from multiple depth images. The most common approach to
depth fusion is based on averaging truncated signed distance functions, which
was originally proposed by Curless and Levoy in 1996. While this method is
simple and provides great results, it is not able to reconstruct (partially)
occluded surfaces and requires a large number frames to filter out sensor noise
and outliers. Motivated by the availability of large 3D model repositories and
recent advances in deep learning, we present a novel 3D CNN architecture that
learns to predict an implicit surface representation from the input depth maps.
Our learning based method significantly outperforms the traditional volumetric
fusion approach in terms of noise reduction and outlier suppression. By
learning the structure of real world 3D objects and scenes, our approach is
further able to reconstruct occluded regions and to fill in gaps in the
reconstruction. We demonstrate that our learning based approach outperforms
both vanilla TSDF fusion as well as TV-L1 fusion on the task of volumetric
fusion. Further, we demonstrate state-of-the-art 3D shape completion results.Comment: 3DV 2017, https://github.com/griegler/octnetfusio
Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer
Semantic annotations are vital for training models for object recognition,
semantic segmentation or scene understanding. Unfortunately, pixelwise
annotation of images at very large scale is labor-intensive and only little
labeled data is available, particularly at instance level and for street
scenes. In this paper, we propose to tackle this problem by lifting the
semantic instance labeling task from 2D into 3D. Given reconstructions from
stereo or laser data, we annotate static 3D scene elements with rough bounding
primitives and develop a model which transfers this information into the image
domain. We leverage our method to obtain 2D labels for a novel suburban video
dataset which we have collected, resulting in 400k semantic and instance image
annotations. A comparison of our method to state-of-the-art label transfer
baselines reveals that 3D information enables more efficient annotation while
at the same time resulting in improved accuracy and time-coherent labels.Comment: 10 pages in Conference on Computer Vision and Pattern Recognition
(CVPR), 201
- …