iQPP: A Benchmark for Image Query Performance Prediction
To date, query performance prediction (QPP) in the context of content-based
image retrieval remains a largely unexplored task, especially in the
query-by-example scenario, where the query is an image. To boost the
exploration of the QPP task in image retrieval, we propose the first benchmark
for image query performance prediction (iQPP). First, we establish a set of
four data sets (PASCAL VOC 2012, Caltech-101, ROxford5k and RParis6k) and
estimate the ground-truth difficulty of each query as the average precision or
the precision@k, using two state-of-the-art image retrieval models. Next, we
propose and evaluate novel pre-retrieval and post-retrieval query performance
predictors, comparing them with existing or adapted (from text to image)
predictors. The empirical results show that most predictors do not generalize
across evaluation scenarios. Our comprehensive experiments indicate that iQPP
is a challenging benchmark, revealing an important research gap that needs to
be addressed in future work. We release our code and data as open source at
https://github.com/Eduard6421/iQPP, to foster future research.
Comment: Accepted at SIGIR 202
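Since the ground-truth difficulty described above is just average precision or precision@k computed per query, both measures can be sketched in a few lines. A minimal illustration, assuming a binary relevance vector over the ranked results (the function names and example ranking are hypothetical, not iQPP's actual code):

```python
import numpy as np

def average_precision(relevant: np.ndarray, num_relevant_total: int) -> float:
    """AP for one query: mean of precision@rank over the relevant hits."""
    if num_relevant_total == 0:
        return 0.0
    hits = np.cumsum(relevant)
    ranks = np.arange(1, len(relevant) + 1)
    return float(((hits / ranks) * relevant).sum() / num_relevant_total)

def precision_at_k(relevant: np.ndarray, k: int) -> float:
    """Fraction of relevant items among the top-k retrieved results."""
    return float(relevant[:k].sum() / k)

# Hypothetical ranked list for one query: 1 = relevant, 0 = not relevant.
ranked = np.array([1, 0, 1, 1, 0, 0, 1, 0])
ap = average_precision(ranked, num_relevant_total=4)  # low AP = hard query
p_at_4 = precision_at_k(ranked, k=4)
```

A query whose retrieved list yields low AP or low precision@k is labelled as difficult, which is what the benchmark's predictors then try to anticipate.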
PlaNet - Photo Geolocation with Convolutional Neural Networks
Is it possible to build a system to determine the location where a photo was
taken using just its pixels? In general, the problem seems exceptionally
difficult: it is trivial to construct situations where no location can be
inferred. Yet images often contain informative cues such as landmarks, weather
patterns, vegetation, road markings, and architectural details, which in
combination may allow one to determine an approximate location and occasionally
an exact location. Websites such as GeoGuessr and View from your Window suggest
that humans are relatively good at integrating these cues to geolocate images,
especially en masse. In computer vision, the photo geolocation problem is
usually approached using image retrieval methods. In contrast, we pose the
problem as one of classification by subdividing the surface of the earth into
thousands of multi-scale geographic cells, and train a deep network using
millions of geotagged images. While previous approaches only recognize
landmarks or perform approximate matching using global image descriptors, our
model is able to use and integrate multiple visible cues. We show that the
resulting model, called PlaNet, outperforms previous approaches and even
attains superhuman levels of accuracy in some cases. Moreover, we extend our
model to photo albums by combining it with a long short-term memory (LSTM)
architecture. By learning to exploit temporal coherence to geolocate uncertain
photos, we demonstrate that this model achieves a 50% performance improvement
over the single-image model.
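Posing geolocation as classification means the network outputs one logit per geographic cell, and the predicted location is read off the argmax cell's centre. A minimal PyTorch sketch of this idea (the cell count, the ResNet stand-in backbone, and all names are illustrative assumptions; PlaNet itself used a different network and an adaptive subdivision of the earth):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CELLS = 20_000  # illustrative; PlaNet subdivides the earth adaptively

class GeoCellClassifier(nn.Module):
    """Classify an image into one of NUM_CELLS geographic cells."""
    def __init__(self, num_cells: int = NUM_CELLS):
        super().__init__()
        self.backbone = models.resnet18(weights=None)  # stand-in backbone
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_cells)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.backbone(images)  # (batch, num_cells) logits over cells

model = GeoCellClassifier()
logits = model(torch.randn(2, 3, 224, 224))
cells = logits.argmax(dim=1)  # map each cell id back to its centre coordinates
```

The album extension would then feed per-image features or cell distributions through an LSTM, so that confidently geolocated photos constrain their uncertain neighbours.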
Real-Time RGB-D Camera Pose Estimation in Novel Scenes using a Relocalisation Cascade
Camera pose estimation is an important problem in computer vision. Common
techniques either match the current image against keyframes with known poses,
directly regress the pose, or establish correspondences between keypoints in
the image and points in the scene to estimate the pose. In recent years,
regression forests have become a popular alternative to establish such
correspondences. They achieve accurate results, but have traditionally needed
to be trained offline on the target scene, preventing relocalisation in new
environments. Recently, we showed how to circumvent this limitation by adapting
a pre-trained forest to a new scene on the fly. The adapted forests achieved
relocalisation performance that was on par with that of offline forests, and
our approach was able to estimate the camera pose in close to real time. In
this paper, we present an extension of this work that achieves significantly
better relocalisation performance whilst running fully in real time. To achieve
this, we make several changes to the original approach: (i) instead of
accepting the camera pose hypothesis without question, we make it possible to
score the final few hypotheses using a geometric approach and select the most
promising; (ii) we chain several instantiations of our relocaliser together in
a cascade, allowing us to try faster but less accurate relocalisation first,
only falling back to slower, more accurate relocalisation as necessary; and
(iii) we tune the parameters of our cascade to achieve effective overall
performance. These changes allow us to improve significantly upon the
performance that our original state-of-the-art method achieved on the
well-known 7-Scenes and Stanford 4 Scenes benchmarks. As additional
contributions, we present a way of visualising the internal behaviour of our
forests and show how to entirely circumvent the need to pre-train a forest on a
generic scene.
Comment: Tommaso Cavallari, Stuart Golodetz, Nicholas Lord and Julien Valentin
assert joint first authorship
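The cascade in (ii) amounts to running relocalisers from fastest to slowest, geometrically scoring each stage's final hypotheses as in (i), and stopping as soon as one passes a quality gate. A schematic Python sketch of that control flow (the types, threshold, and stage interface are assumptions for illustration, not the authors' implementation):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

import numpy as np

@dataclass
class PoseHypothesis:
    pose: np.ndarray   # hypothesised 4x4 camera-to-world transform
    score: float       # geometric quality score, e.g. an inlier count

# A relocaliser maps an RGB-D frame to a few scored pose hypotheses.
Relocaliser = Callable[[object], List[PoseHypothesis]]

def relocalise_cascade(frame: object,
                       stages: List[Relocaliser],
                       min_score: float = 0.8) -> Optional[PoseHypothesis]:
    """Try each relocaliser in turn, fastest first; accept the first
    hypothesis that passes the quality gate, otherwise fall back to the
    next, slower but more accurate stage."""
    best: Optional[PoseHypothesis] = None
    for stage in stages:
        for hyp in stage(frame):
            if best is None or hyp.score > best.score:
                best = hyp
        if best is not None and best.score >= min_score:
            return best  # a fast stage was already good enough
    return best          # best effort from the slowest stage, or None
```

Ordering the stages by speed keeps the common case fast while preserving accuracy: the expensive stages only pay their cost on frames the cheap ones fail to relocalise.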