Visual place recognition using landmark distribution descriptors
Recent work by Suenderhauf et al. [1] demonstrated improved visual place
recognition using proposal regions coupled with features from convolutional
neural networks (CNN) to match landmarks between views. In this work we extend
the approach by introducing descriptors built from landmark features which also
encode the spatial distribution of the landmarks within a view. Matching
descriptors then enforces consistency of the relative positions of landmarks
between views. This has a significant impact on performance. For example, in
experiments on 10 image-pair datasets, each consisting of 200 urban locations
with significant differences in viewing positions and conditions, we recorded
average precision of around 70% (at 100% recall), compared with 58% obtained
using whole image CNN features and 50% for the method in [1]. Comment: 13 page
PlaNet - Photo Geolocation with Convolutional Neural Networks
Is it possible to build a system to determine the location where a photo was
taken using just its pixels? In general, the problem seems exceptionally
difficult: it is trivial to construct situations where no location can be
inferred. Yet images often contain informative cues such as landmarks, weather
patterns, vegetation, road markings, and architectural details, which in
combination may allow one to determine an approximate location and occasionally
an exact location. Websites such as GeoGuessr and View from your Window suggest
that humans are relatively good at integrating these cues to geolocate images,
especially en masse. In computer vision, the photo geolocation problem is
usually approached using image retrieval methods. In contrast, we pose the
problem as one of classification by subdividing the surface of the earth into
thousands of multi-scale geographic cells, and train a deep network using
millions of geotagged images. While previous approaches only recognize
landmarks or perform approximate matching using global image descriptors, our
model is able to use and integrate multiple visible cues. We show that the
resulting model, called PlaNet, outperforms previous approaches and even
attains superhuman levels of accuracy in some cases. Moreover, we extend our
model to photo albums by combining it with a long short-term memory (LSTM)
architecture. By learning to exploit temporal coherence to geolocate uncertain
photos, we demonstrate that this model achieves a 50% performance improvement
over the single-image model.
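As an illustration of the classification framing in the abstract above, the following minimal sketch shows one way a set of multi-scale cells could be built: a latitude/longitude box is split into quadrants until each cell holds at most a chosen number of photos, so dense regions end up with smaller cells. The quadtree scheme and the names `subdivide`, `cell_label`, `max_points` and `max_depth` are assumptions made for illustration; the abstract does not say how PlaNet partitions the earth.

```python
def subdivide(points, box, max_points, max_depth):
    """Recursively split a lat/lon box into quadrants (hypothetical scheme).

    points: list of (lat, lon) tuples for geotagged photos.
    box: (lat_min, lat_max, lon_min, lon_max), treated as half-open,
         so a point exactly on lat_max or lon_max is not counted.
    Returns the list of leaf boxes; each leaf is one class label.
    """
    lat_min, lat_max, lon_min, lon_max = box
    inside = [
        (lat, lon)
        for lat, lon in points
        if lat_min <= lat < lat_max and lon_min <= lon < lon_max
    ]
    # Stop splitting when the cell is sparse enough or the depth limit is hit.
    if len(inside) <= max_points or max_depth == 0:
        return [box]
    lat_mid = (lat_min + lat_max) / 2
    lon_mid = (lon_min + lon_max) / 2
    quadrants = [
        (lat_min, lat_mid, lon_min, lon_mid),
        (lat_min, lat_mid, lon_mid, lon_max),
        (lat_mid, lat_max, lon_min, lon_mid),
        (lat_mid, lat_max, lon_mid, lon_max),
    ]
    cells = []
    for quadrant in quadrants:
        cells.extend(subdivide(inside, quadrant, max_points, max_depth - 1))
    return cells


def cell_label(lat, lon, cells):
    """Return the index of the cell containing (lat, lon), or None."""
    for index, (lat_min, lat_max, lon_min, lon_max) in enumerate(cells):
        if lat_min <= lat < lat_max and lon_min <= lon < lon_max:
            return index
    return None


photos = [(10, 10), (20, 20), (50, 100), (-30, -60)]
cells = subdivide(photos, (-90, 90, -180, 180), max_points=2, max_depth=8)
labels = [cell_label(lat, lon, cells) for lat, lon in photos]
```

A classifier trained on such labels would output a distribution over cell indices rather than coordinates; a predicted location could then be taken as, for example, the centre of the most probable cell.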
Keyframe-based monocular SLAM: design, survey, and future directions
Extensive research in the field of monocular SLAM over the past fifteen years
has yielded workable systems that have found their way into various applications in
robotics and augmented reality. Although filter-based monocular SLAM systems
were once common, the more efficient keyframe-based solutions are
becoming the de facto methodology for building a monocular SLAM system. The
objective of this paper is threefold: first, the paper serves as a guideline
for people seeking to design their own monocular SLAM according to specific
environmental constraints. Second, it presents a survey that covers the various
keyframe-based monocular SLAM systems in the literature, detailing the
components of their implementation, and critically assessing the specific
strategies adopted in each proposed solution. Third, the paper provides insight
into directions for future research in this field that address the major
limitations still facing monocular SLAM, namely illumination changes,
initialization, highly dynamic motion, poorly textured scenes,
repetitive textures, map maintenance, and failure recovery.
Visual Landmark Recognition from Internet Photo Collections: A Large-Scale Evaluation
The task of a visual landmark recognition system is to identify photographed
buildings or objects in query photos and to provide the user with relevant
information on them. With their increasing coverage of the world's landmark
buildings and objects, Internet photo collections are now being used as a
source for building such systems in a fully automatic fashion. This process
typically consists of three steps: clustering large amounts of images by the
objects they depict; determining object names from user-provided tags; and
building a robust, compact, and efficient recognition index. To date,
however, there is little empirical information on how well current approaches
for those steps perform in a large-scale open-set mining and recognition task.
Furthermore, there is little empirical information on how recognition
performance varies for different types of landmark objects and where there is
still potential for improvement. With this paper, we intend to fill these gaps.
Using a dataset of 500k images from Paris, we analyze each component of the
landmark recognition pipeline in order to answer the following questions: How
many and what kinds of objects can be discovered automatically? How can we best
use the resulting image clusters to recognize the object in a query? How can
the object be efficiently represented in memory for recognition? How reliably
can semantic information be extracted? And finally: What are the limiting
factors in the resulting pipeline from query to semantics? We evaluate how
different choices of methods and parameters for the individual pipeline steps
affect overall system performance and examine their effects for different query
categories such as buildings, paintings or sculptures.
Data-Efficient Decentralized Visual SLAM
Decentralized visual simultaneous localization and mapping (SLAM) is a
powerful tool for multi-robot applications in environments where absolute
positioning systems are not available. Being visual, it relies on cameras,
which are cheap, lightweight and versatile sensors; being decentralized, it does not
rely on communication with a central ground station. In this work, we integrate
state-of-the-art decentralized SLAM components into a new, complete
decentralized visual SLAM system. To allow for data association and
co-optimization, existing decentralized visual SLAM systems regularly exchange
the full map data between all robots, incurring large data transfers at a
complexity that scales quadratically with the robot count. In contrast, our
method performs efficient data association in two stages: in the first stage a
compact full-image descriptor is deterministically sent to only one robot. In
the second stage, which is only executed if the first stage succeeded, the data
required for relative pose estimation is sent, again to only one robot. Thus,
data association scales linearly with the robot count and uses highly compact
place representations. For optimization, a state-of-the-art decentralized
pose-graph optimization method is used. It exchanges a minimal amount of data,
which scales linearly with trajectory overlap. We characterize the resulting system
and identify bottlenecks in its components. The system is evaluated on publicly
available data and we provide open access to the code. Comment: 8 pages, submitted to ICRA 201
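The scaling argument in the abstract above can be made concrete with a small sketch. It compares the number of transfers per round when every robot sends to every other robot with the case where each robot sends its query to exactly one recipient. The rule in `pick_recipient` (send to the robot whose reference vector is nearest to the descriptor) is a hypothetical stand-in, since the abstract only states that the recipient is chosen deterministically; `robot_centers` and `messages_per_round` are likewise illustrative names.

```python
import math


def pick_recipient(descriptor, robot_centers):
    """Deterministically choose one robot for a full-image descriptor.

    Assumption: each robot is associated with one reference vector and the
    descriptor goes to the robot with the nearest one (Euclidean distance).
    """
    return min(
        range(len(robot_centers)),
        key=lambda i: math.dist(descriptor, robot_centers[i]),
    )


def messages_per_round(num_robots, all_to_all):
    """Count transfers when each robot issues one query per round."""
    if all_to_all:
        # Every robot sends to every other robot: quadratic growth.
        return num_robots * (num_robots - 1)
    # Every robot sends to exactly one recipient: linear growth.
    return num_robots


recipient = pick_recipient([0.9, 0.1], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
quadratic = messages_per_round(4, all_to_all=True)
linear = messages_per_round(4, all_to_all=False)
```

Because the recipient depends only on the descriptor, similar views observed by different robots are routed to the same robot, which is what would let that robot detect the match without a broadcast.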
An accurate retrieval through R-MAC+ descriptors for landmark recognition
The landmark recognition problem is far from being solved, but with the use
of features extracted from intermediate layers of Convolutional Neural Networks
(CNNs), excellent results have been obtained. In this work, we propose some
improvements to the construction of R-MAC descriptors so that the
newly proposed R-MAC+ descriptors are more representative than the previous ones.
However, the main contribution of this paper is a novel retrieval technique
that exploits the fine representativeness of the MAC descriptors of the
database images. Using these descriptors, called "db regions", during the
retrieval stage greatly improves performance. The proposed method is
tested on three public datasets: Oxford5k, Paris6k and Holidays. It
outperforms the state-of-the-art results on Holidays and reaches excellent
results on Oxford5k and Paris6k, surpassed only by approaches based on
fine-tuning strategies.
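For readers unfamiliar with the descriptors named in the abstract above, the sketch below shows the basic idea of a MAC descriptor (the per-channel maximum of a convolutional feature map over a region) and a simplified regional aggregation in the spirit of R-MAC. It is not the R-MAC+ method: the abstract does not describe the proposed modifications, and this sketch also omits the PCA-whitening step normally applied to regional descriptors. The region list and the names `mac`, `rmac_like` and `l2_normalize` are assumptions for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def mac(feature_map, region=None):
    """Maximum activation per channel over a rectangular region.

    feature_map: list of channels, each an H x W list of lists.
    region: (top, left, height, width), or None for the whole map.
    """
    descriptor = []
    for channel in feature_map:
        if region is None:
            top, left, height, width = 0, 0, len(channel), len(channel[0])
        else:
            top, left, height, width = region
        descriptor.append(
            max(
                channel[row][col]
                for row in range(top, top + height)
                for col in range(left, left + width)
            )
        )
    return descriptor


def rmac_like(feature_map, regions):
    """Sum L2-normalized regional MAC vectors, then normalize the sum."""
    total = [0.0] * len(feature_map)
    for region in regions:
        regional = l2_normalize(mac(feature_map, region))
        total = [t + r for t, r in zip(total, regional)]
    return l2_normalize(total)


# Toy feature map with 2 channels of size 2 x 2.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 5.0], [1.0, 0.0]],
]
global_mac = mac(features)
aggregated = rmac_like(features, [(0, 0, 1, 1), (1, 0, 1, 2)])
```

Two images can then be compared by the dot product of their aggregated vectors, which equals cosine similarity because both are unit length.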
Towards Accurate Camera Geopositioning by Image Matching
In this work, we present a camera geopositioning system based on matching a
query image against a database of panoramic images. For matching, our system
uses memory vectors aggregated from global image descriptors based on
convolutional features to facilitate fast searching in the database. To speed
up searching, a clustering algorithm is used to balance geographical
positioning and computation time. We refine the position obtained for the
query image using a new outlier removal algorithm. Query matching
achieves a recall@5 above 90% for panorama-to-panorama
matching. We cluster available panoramas from geographically adjacent locations
into a single compact representation and observe computational gains of
approximately 50% at the cost of only a small (approximately 3%) recall loss.
Finally, we present a coordinate estimation algorithm that reduces the median
geopositioning error by up to 20%.
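The recall@5 figure quoted in the abstract above can be read with the following sketch, which assumes the definition commonly used in place recognition: a query counts as a success if at least one correct database item appears among its top K retrieved results. The abstract does not spell out its exact definition, and the function name and data layout here are illustrative.

```python
def recall_at_k(ranked_results, relevant, k=5):
    """Fraction of queries with at least one correct item in the top k.

    ranked_results: dict mapping query id to a ranked list of database ids.
    relevant: dict mapping query id to the set of correct database ids.
    """
    hits = 0
    for query_id, ranking in ranked_results.items():
        if any(item in relevant[query_id] for item in ranking[:k]):
            hits += 1
    return hits / len(ranked_results)


ranked = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
truth = {"q1": {"c"}, "q2": {"w"}}
recall_top2 = recall_at_k(ranked, truth, k=2)
recall_top3 = recall_at_k(ranked, truth, k=3)
```

In this toy case the only correct match for `q1` sits at rank 3 and `q2` has no correct match in its list, so the score rises from 0.0 at K=2 to 0.5 at K=3.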
Perceptually Motivated Shape Context Which Uses Shape Interiors
In this paper, we identify some of the limitations of current-day shape
matching techniques. We provide examples of how contour-based shape matching
techniques cannot provide a good match for certain visually similar shapes. To
overcome this limitation, we propose a perceptually motivated variant of the
well-known shape context descriptor. We observe that the interior properties
of the shape play an important role in object recognition and develop a
descriptor that captures these interior properties. We show that our method can
easily be augmented with any other shape matching algorithm. We also show from
our experiments that the use of our descriptor can significantly improve the
retrieval rates.