5 research outputs found
Image-Graph-Image Translation via Auto-Encoding
This work presents the first convolutional neural network that learns an
image-to-graph translation task without needing external supervision. Obtaining
graph representations of image content, where objects are represented as nodes
and their relationships as edges, is an important task in scene understanding.
Current approaches follow a fully-supervised approach thereby requiring
meticulous annotations. To overcome this, we are the first to present a
self-supervised approach based on a fully-differentiable auto-encoder in which
the bottleneck encodes the graph's nodes and edges. This self-supervised
approach can currently encode simple line drawings into graphs and obtains
comparable results to a fully-supervised baseline in terms of F1 score on
triplet matching. Besides these promising results, we provide several
directions for future research on how our approach can be extended to cover
more complex imagery
Generalized Contrastive Optimization of Siamese Networks for Place Recognition
Visual place recognition is a challenging task in computer vision and a key
component of camera-based localization and navigation systems. Recently,
Convolutional Neural Networks (CNNs) achieved high results and good
generalization capabilities. They are usually trained using pairs or triplets
of images labeled as either similar or dissimilar, in a binary fashion. In
practice, the similarity between two images is not binary, but rather
continuous. Furthermore, training these CNNs is computationally complex and
involves costly pair and triplet mining strategies.
We propose a Generalized Contrastive loss (GCL) function that relies on image
similarity as a continuous measure, and use it to train a siamese CNN.
Furthermore, we propose three techniques for automatic annotation of image
pairs with labels indicating their degree of similarity, and deploy them to
re-annotate the MSLS, TB-Places, and 7Scenes datasets.
We demonstrate that siamese CNNs trained using the GCL function and the
improved annotations consistently outperform their binary counterparts. Our
models trained on MSLS outperform the state-of-the-art methods, including
NetVLAD, and generalize well on the Pittsburgh, TokyoTM and Tokyo 24/7
datasets. Furthermore, training a siamese network using the GCL function does
not require complex pair mining. We release the source code at
https://github.com/marialeyvallina/generalized_contrastive_loss.Comment: Under revie
Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs
Humans are able to form a complex mental model of the environment they move
in. This mental model captures geometric and semantic aspects of the scene,
describes the environment at multiple levels of abstractions (e.g., objects,
rooms, buildings), includes static and dynamic entities and their relations
(e.g., a person is in a room at a given time). In contrast, current robots'
internal representations still provide a partial and fragmented understanding
of the environment, either in the form of a sparse or dense set of geometric
primitives (e.g., points, lines, planes, voxels) or as a collection of objects.
This paper attempts to reduce the gap between robot and human perception by
introducing a novel representation, a 3D Dynamic Scene Graph(DSG), that
seamlessly captures metric and semantic aspects of a dynamic environment. A DSG
is a layered graph where nodes represent spatial concepts at different levels
of abstraction, and edges represent spatio-temporal relations among nodes. Our
second contribution is Kimera, the first fully automatic method to build a DSG
from visual-inertial data. Kimera includes state-of-the-art techniques for
visual-inertial SLAM, metric-semantic 3D reconstruction, object localization,
human pose and shape estimation, and scene parsing. Our third contribution is a
comprehensive evaluation of Kimera in real-life datasets and photo-realistic
simulations, including a newly released dataset, uHumans2, which simulates a
collection of crowded indoor and outdoor scenes. Our evaluation shows that
Kimera achieves state-of-the-art performance in visual-inertial SLAM, estimates
an accurate 3D metric-semantic mesh model in real-time, and builds a DSG of a
complex indoor environment with tens of objects and humans in minutes. Our
final contribution shows how to use a DSG for real-time hierarchical semantic
path-planning. The core modules in Kimera are open-source.Comment: 34 pages, 25 figures, 9 tables. arXiv admin note: text overlap with
arXiv:2002.0628