Motion Segmentation of Truncated Signed Distance Function based Volumetric Surfaces
© 2015 IEEE. Truncated signed distance function (TSDF) based volumetric surface reconstructions of static environments can be readily acquired using recent RGB-D camera based mapping systems. If objects in the environment move, a previously obtained TSDF reconstruction is no longer current. Handling this problem requires segmenting moving objects from the reconstruction. To this end, we present a novel solution to the motion segmentation of TSDF volumes. The segmentation problem is cast as CRF-based MAP inference in the voxel space. We propose a novel data term, obtained by solving sparse multi-body motion segmentation and computing likelihoods for each motion label in the RGB-D image space, and a novel pairwise term based on gradients of the TSDF volume. Experimental evaluation shows that the proposed approach achieves successful segmentations on reconstructions acquired with KinectFusion. Unlike existing solutions, which only work if the objects move completely out of their initially occupied space, the proposed method permits segmentation of objects as soon as they start to move.
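A minimal sketch of the two CRF ingredients named above, under assumed shapes and parameters (the paper's exact formulation is not reproduced here): a data term from per-voxel motion-label likelihoods, and a contrast-sensitive pairwise weight derived from TSDF gradients so that label boundaries become cheap where the gradient field changes abruptly.

```python
import numpy as np

def tsdf_pairwise_weights(tsdf, beta=2.0):
    # Contrast-sensitive weights between each voxel and its +x neighbour:
    # a strong change in the TSDF gradient suggests a surface boundary,
    # so the smoothness penalty for cutting the labeling there is reduced.
    gx, gy, gz = np.gradient(tsdf)
    grad = np.stack([gx, gy, gz], axis=-1)
    diff = grad[1:] - grad[:-1]                    # gradient change, (X-1, Y, Z, 3)
    return np.exp(-beta * np.linalg.norm(diff, axis=-1))

def unary_costs(label_likelihoods):
    # Data term: negative log-likelihood per voxel and motion label.
    return -np.log(np.clip(label_likelihoods, 1e-9, 1.0))

tsdf = np.random.randn(16, 16, 16).astype(np.float32)             # toy volume
likelihoods = np.random.dirichlet([1.0, 1.0], size=(16, 16, 16))  # two motion labels
print(tsdf_pairwise_weights(tsdf).shape)   # (15, 16, 16)
print(unary_costs(likelihoods).shape)      # (16, 16, 16, 2)
```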
Volume-based Semantic Labeling with Signed Distance Functions
Research on the two topics of Semantic Segmentation and SLAM
(Simultaneous Localization and Mapping) has largely followed separate tracks.
Here, we link them quite tightly by delineating a category label fusion
technique that allows for embedding semantic information into the dense map
created by a volume-based SLAM algorithm such as KinectFusion. Accordingly, our
approach is the first to provide a semantically labeled dense reconstruction of
the environment from a stream of RGB-D images. We validate our proposal using a
publicly available semantically annotated RGB-D dataset by a) employing ground
truth labels, b) corrupting such annotations with synthetic noise, and c)
deploying a state-of-the-art semantic segmentation algorithm based on
Convolutional Neural Networks.
Comment: Submitted to PSIVT201
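As a rough illustration of such label fusion (a sketch under assumed names, not the paper's exact scheme), per-frame class probabilities can be accumulated into a per-voxel histogram alongside the TSDF weights, with the arg-max giving the current semantic label:

```python
import numpy as np

NUM_LABELS = 5
hist = np.zeros((64, 64, 64, NUM_LABELS), dtype=np.float32)  # per-voxel label scores

def fuse_labels(hist, voxel_idx, label_probs, weight=1.0):
    # voxel_idx: (N, 3) indices of voxels hit by the current RGB-D frame;
    # label_probs: (N, NUM_LABELS) per-pixel class probabilities.
    # np.add.at accumulates correctly even when several pixels map to the
    # same voxel, mirroring weighted running-sum TSDF fusion.
    x, y, z = voxel_idx.T
    np.add.at(hist, (x, y, z), weight * label_probs)

def current_labels(hist):
    # The fused semantic map: per-voxel arg-max over accumulated scores.
    return hist.argmax(axis=-1)

idx = np.random.randint(0, 64, size=(100, 3))
probs = np.random.dirichlet(np.ones(NUM_LABELS), size=100).astype(np.float32)
fuse_labels(hist, idx, probs)
print(current_labels(hist).shape)   # (64, 64, 64)
```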
OctNetFusion: Learning Depth Fusion from Data
In this paper, we present a learning-based approach to depth fusion, i.e.,
dense 3D reconstruction from multiple depth images. The most common approach to
depth fusion is based on averaging truncated signed distance functions, which
was originally proposed by Curless and Levoy in 1996. While this method is
simple and provides great results, it is not able to reconstruct (partially)
occluded surfaces and requires a large number of frames to filter out sensor noise
and outliers. Motivated by the availability of large 3D model repositories and
recent advances in deep learning, we present a novel 3D CNN architecture that
learns to predict an implicit surface representation from the input depth maps.
Our learning-based method significantly outperforms the traditional volumetric
fusion approach in terms of noise reduction and outlier suppression. By
learning the structure of real world 3D objects and scenes, our approach is
further able to reconstruct occluded regions and to fill in gaps in the
reconstruction. We demonstrate that our learning-based approach outperforms
both vanilla TSDF fusion and TV-L1 fusion on the task of volumetric fusion.
Further, we demonstrate state-of-the-art 3D shape completion results.
Comment: 3DV 2017, https://github.com/griegler/octnetfusio
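For context, the vanilla TSDF fusion baseline referenced above is the weighted running average of Curless and Levoy; a compact sketch (the truncation distance and behind-surface cutoff are illustrative choices):

```python
import numpy as np

def fuse_tsdf(tsdf, weights, sdf_obs, w_obs=1.0, trunc=0.05):
    # Classic update: clamp the observed signed distance, skip voxels far
    # behind the observed surface, and blend into the running weighted mean.
    d = np.clip(sdf_obs, -trunc, trunc)
    valid = sdf_obs > -trunc
    w_new = weights + valid * w_obs
    fused = np.where(valid,
                     (weights * tsdf + w_obs * d) / np.maximum(w_new, 1e-9),
                     tsdf)
    return fused, w_new

tsdf = np.zeros((32, 32, 32))
weights = np.zeros_like(tsdf)
for _ in range(5):                              # fuse a few noisy observations
    obs = 0.02 + 0.01 * np.random.randn(*tsdf.shape)
    tsdf, weights = fuse_tsdf(tsdf, weights, obs)
print(tsdf.mean())                              # converges towards ~0.02
```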
Robust Dense Mapping for Large-Scale Dynamic Environments
We present a stereo-based dense mapping algorithm for large-scale dynamic
urban environments. In contrast to existing methods, we simultaneously
reconstruct the static background, the moving objects, and the potentially
moving but currently stationary objects separately, which is desirable for
high-level mobile robotic tasks such as path planning in crowded environments.
We use both instance-aware semantic segmentation and sparse scene flow to
classify objects as either background, moving, or potentially moving, thereby
ensuring that the system is able to model objects with the potential to
transition from static to dynamic, such as parked cars. Given camera poses
estimated from visual odometry, both the background and the (potentially)
moving objects are reconstructed separately by fusing the depth maps computed
from the stereo input. In addition to visual odometry, sparse scene flow is
also used to estimate the 3D motions of the detected moving objects, in order
to reconstruct them accurately. A map pruning technique is further developed to
improve reconstruction accuracy and reduce memory consumption, leading to
increased scalability. We evaluate our system thoroughly on the well-known
KITTI dataset. Our system is capable of running on a PC at approximately
2.5 Hz, with the primary bottleneck being the instance-aware semantic
segmentation, which is a limitation we hope to address in future work. The
source code is available from the project website
(http://andreibarsan.github.io/dynslam).
Comment: Presented at IEEE International Conference on Robotics and Automation (ICRA), 201
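The classification step described above can be sketched as follows, with a hypothetical motion threshold; a detection is labeled from its class prior and its egomotion-compensated sparse scene-flow magnitudes:

```python
import numpy as np

def classify_instance(flow_residuals, is_dynamic_class, motion_thresh=0.10):
    # flow_residuals: egomotion-compensated 3D scene-flow magnitudes (metres)
    # for sparse features inside one instance mask.
    # is_dynamic_class: True for classes that can move (cars, pedestrians, ...).
    if not is_dynamic_class:
        return "background"                      # fused into the static map
    if flow_residuals.size and np.median(flow_residuals) > motion_thresh:
        return "moving"                          # gets its own motion estimate
    return "potentially moving"                  # e.g. a parked car

print(classify_instance(np.array([0.02, 0.01, 0.03]), is_dynamic_class=True))
# -> 'potentially moving'
```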
Scalable Estimation of Precision Maps in a MapReduce Framework
This paper presents a large-scale strip adjustment method for LiDAR mobile
mapping data, yielding highly precise maps. It uses several concepts to achieve
scalability. First, an efficient graph-based pre-segmentation is used, which
directly operates on LiDAR scan strip data, rather than on point clouds.
Second, observation equations are obtained from a dense matching, which is
formulated in terms of an estimation of a latent map. As a result of this
formulation, the number of observation equations is not quadratic, but rather
linear in the number of scan strips. Third, the dynamic Bayes network, which
results from all observation and condition equations, is partitioned into two
sub-networks. Consequently, the estimation matrices for all position and
orientation corrections are linear instead of quadratic in the number of
unknowns and can be solved very efficiently using an alternating least squares
approach. It is shown how this approach can be mapped to a standard key/value
MapReduce implementation, where each of the processing nodes operates
independently on small chunks of data, leading to essentially linear
scalability. Results are demonstrated for a dataset of one billion measured
LiDAR points and 278,000 unknowns, leading to maps with a precision of a few
millimeters.
Comment: ACM SIGSPATIAL'16, October 31-November 03, 2016, Burlingame, CA, US
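The key/value structure might look as follows; this toy shows mappers emitting per-strip normal-equation blocks and a reducer solving each strip's correction, not the paper's actual observation and condition equations (the alternating position/orientation split is omitted):

```python
import numpy as np
from collections import defaultdict

def map_chunk(chunk):
    # Mapper: each data chunk emits (strip_id, (A^T A, A^T b)) blocks
    # from its local, linearized observation equations.
    for strip_id, A, b in chunk:
        yield strip_id, (A.T @ A, A.T @ b)

def reduce_strip(contribs):
    # Reducer: sum the normal-equation blocks of one strip and solve
    # for that strip's correction parameters.
    AtA = sum(c[0] for c in contribs)
    Atb = sum(c[1] for c in contribs)
    return np.linalg.solve(AtA, Atb)

rng = np.random.default_rng(0)
chunks = [[(s, rng.normal(size=(8, 6)), rng.normal(size=8)) for s in (0, 1)]
          for _ in range(3)]                    # 3 chunks, 2 strips each
grouped = defaultdict(list)
for chunk in chunks:
    for key, value in map_chunk(chunk):
        grouped[key].append(value)
corrections = {s: reduce_strip(v) for s, v in grouped.items()}
print({s: c.shape for s, c in corrections.items()})   # {0: (6,), 1: (6,)}
```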
DA-RNN: Semantic Mapping with Data Associated Recurrent Neural Networks
3D scene understanding is important for robots to interact with the 3D world
in a meaningful way. Most previous works on 3D scene understanding focus on
recognizing geometrical or semantic properties of the scene independently. In
this work, we introduce Data Associated Recurrent Neural Networks (DA-RNNs), a
novel framework for joint 3D scene mapping and semantic labeling. DA-RNNs use a
new recurrent neural network architecture for semantic labeling on RGB-D
videos. The output of the network is integrated with mapping techniques such as
KinectFusion in order to inject semantic information into the reconstructed 3D
scene. Experiments conducted on a real world dataset and a synthetic dataset
with RGB-D videos demonstrate the ability of our method to perform semantic
3D scene mapping.
Comment: Published in RSS 201
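As a loose stand-in for the data-association idea (the actual DA-RNN uses a learned recurrent unit over associated pixels), one can key a recurrent state by the voxel each pixel maps to and update it across frames:

```python
import numpy as np

hidden = {}   # data association: voxel id -> recurrent feature state

def recurrent_update(voxel_ids, features, decay=0.7):
    # Route each pixel's feature to the voxel it maps to and update that
    # voxel's state recurrently across frames (leaky integration stands in
    # for the learned recurrent unit).
    for vid, feat in zip(voxel_ids, features):
        prev = hidden.get(vid, np.zeros_like(feat))
        hidden[vid] = decay * prev + (1.0 - decay) * feat

frame_feats = np.random.randn(4, 8).astype(np.float32)   # 4 pixels, 8-d features
recurrent_update([10, 10, 42, 7], frame_feats)
print(len(hidden))   # 3 distinct voxels received state
```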
Combining Features and Semantics for Low-level Computer Vision
Visual perception of depth and motion plays a significant role in understanding and navigating the environment.
Reconstructing outdoor scenes in 3D and estimating the motion from video cameras are of utmost importance for applications like autonomous driving.
The corresponding problems in computer vision have witnessed tremendous progress over the last decades, yet some aspects still remain challenging today. Striking examples are reflective and textureless surfaces or large motions which cannot be easily recovered using traditional local methods. Further challenges include occlusions, large distortions and difficult lighting conditions. In this thesis, we propose to overcome these challenges by modeling non-local interactions leveraging semantics and contextual information.
Firstly, for binocular stereo estimation, we propose to regularize over larger areas on the image using object-category-specific disparity proposals which we sample using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image. The disparity proposals encode the fact that objects of certain categories are not arbitrarily shaped but typically exhibit regular structures. We integrate them as a non-local regularizer for the challenging object class 'car' into a superpixel-based graphical model and demonstrate its benefits especially in reflective regions.
Secondly, for 3D reconstruction, we leverage the fact that the larger the reconstructed area, the more likely objects of similar type and shape will occur in the scene. This is particularly true for outdoor scenes where buildings and vehicles often suffer from missing texture or reflections, but share similarity in 3D shape. We take advantage of this shape similarity by localizing objects using detectors and jointly reconstructing them while learning a volumetric model of their shape. This allows us to reduce noise while completing missing surfaces, as objects of similar shape benefit from all observations for the respective category. Evaluations with respect to LIDAR ground-truth on a novel challenging suburban dataset show the advantages of modeling structural dependencies between objects.
Finally, motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field size on top of a local network using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume from the network's output forms the data term for discrete MAP inference in a pairwise Markov random field. Extensive evaluations reveal the importance of context for feature matching.
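The all-pairs matching step in the final paragraph reduces to a single matrix multiplication over flattened feature maps; a minimal sketch with illustrative sizes:

```python
import numpy as np

def matching_cost_volume(feat_ref, feat_tgt):
    # feat_ref, feat_tgt: (H, W, C) per-pixel descriptors. Row i of the
    # result holds the correlation of reference pixel i with every target
    # pixel; on a GPU the same reshaped matmul runs very fast.
    h, w, c = feat_ref.shape
    scores = feat_ref.reshape(-1, c) @ feat_tgt.reshape(-1, c).T
    return -scores   # negate similarities to obtain matching costs

ref = np.random.randn(8, 8, 16).astype(np.float32)
tgt = np.random.randn(8, 8, 16).astype(np.float32)
print(matching_cost_volume(ref, tgt).shape)   # (64, 64)
```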
Live User-guided Intrinsic Video For Static Scenes
We present a novel real-time approach for user-guided intrinsic decomposition of static scenes captured by an RGB-D sensor. In the first step, we acquire a three-dimensional representation of the scene using a dense volumetric reconstruction framework. The obtained reconstruction serves as a proxy to densely fuse reflectance estimates and to store user-provided constraints in three-dimensional space. User constraints, in the form of constant shading and reflectance strokes, can be placed directly on the real-world geometry using an intuitive touch-based interaction metaphor, or using interactive mouse strokes. Fusing the decomposition results and constraints in three-dimensional space allows for robust propagation of this information to novel views by re-projection. We leverage this information to improve on the decomposition quality of existing intrinsic video decomposition techniques by further constraining the ill-posed decomposition problem. In addition to improved decomposition quality, we show a variety of live augmented reality applications such as recoloring of objects, relighting of scenes and editing of material appearance.
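A minimal sketch of the fusion idea under assumed names: reflectance estimates are averaged per voxel so that decomposition results and user constraints re-project consistently into novel views, and shading then follows from the intrinsic model I = R * S:

```python
import numpy as np

refl_sum, refl_wgt = {}, {}   # voxel id -> running reflectance estimate

def fuse_reflectance(voxel_ids, reflectance, w=1.0):
    # Weighted running average per voxel; user strokes could simply be
    # fused with a large weight w to act as strong constraints.
    for vid, r in zip(voxel_ids, reflectance):
        refl_sum[vid] = refl_sum.get(vid, 0.0) + w * r
        refl_wgt[vid] = refl_wgt.get(vid, 0.0) + w

def shading(intensity, voxel_ids):
    # Intrinsic model I = R * S: divide out the fused reflectance to get
    # shading for pixels that re-project onto known voxels.
    refl = np.array([refl_sum[v] / refl_wgt[v] for v in voxel_ids])
    return intensity / np.maximum(refl, 1e-6)

fuse_reflectance([1, 2], [0.4, 0.8])
print(shading(np.array([0.2, 0.4]), [1, 2]))   # -> [0.5, 0.5]
```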