Presence-Only Geographical Priors for Fine-Grained Image Classification
Appearance information alone is often not sufficient to accurately differentiate between fine-grained visual categories. Human experts make use of additional cues such as where, and when, a given image was taken in order to inform their final decision. This contextual information is readily available in many online image collections but has been underutilized by existing image classifiers that focus solely on making predictions based on the image contents. We propose an efficient spatio-temporal prior that, when conditioned on a geographical location and time, estimates the probability that a given object category occurs at that location. Our prior is trained from presence-only observation data and jointly models object categories, their spatio-temporal distributions, and photographer biases. Experiments performed on multiple challenging image classification datasets show that combining our prior with the predictions from image classifiers results in a large improvement in final classification performance.
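A minimal sketch of the idea, assuming a small MLP over wrapped (sinusoidal) location and time inputs with per-category sigmoid outputs; the class names, network shape, and combination rule here are illustrative assumptions rather than the authors' exact model:

```python
# Sketch: a spatio-temporal prior that reweights an image classifier's
# predictions at test time. Architecture details are assumed for illustration.
import torch
import torch.nn as nn

class GeoPrior(nn.Module):
    """Maps (lon, lat, time) to per-category occurrence probabilities."""
    def __init__(self, num_categories: int, hidden: int = 256):
        super().__init__()
        # sin/cos of longitude, latitude, and day-of-year give the network
        # a continuous, periodic view of space and time.
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_categories),
        )

    def forward(self, lon, lat, t):
        # lon, lat in radians; t as fraction of the year in [0, 1)
        feats = torch.stack([
            torch.sin(lon), torch.cos(lon),
            torch.sin(lat), torch.cos(lat),
            torch.sin(2 * torch.pi * t), torch.cos(2 * torch.pi * t),
        ], dim=-1)
        return torch.sigmoid(self.net(feats))  # P(category | location, time)

def combine(image_probs, prior_probs):
    # At test time the prior simply reweights the classifier's prediction.
    scores = image_probs * prior_probs
    return scores / scores.sum(dim=-1, keepdim=True)
```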
Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells
Unsupervised text encoding models have recently fueled substantial progress
in NLP. The key idea is to use neural networks to convert words in texts to
vector space representations based on word positions in a sentence and their
contexts, which are suitable for end-to-end training of downstream tasks. We
see a strikingly similar situation in spatial analysis, which focuses on
incorporating both absolute positions and spatial contexts of geographic
objects such as POIs into models. A general-purpose representation model for
space is valuable for a multitude of tasks. However, no such general model
exists to date beyond simply applying discretization or feed-forward nets to
coordinates, and little effort has been put into jointly modeling distributions
with vastly different characteristics, which commonly emerge from GIS data.
Meanwhile, Nobel Prize-winning Neuroscience research shows that grid cells in
mammals provide a multi-scale periodic representation that functions as a
metric for location encoding and is critical for recognizing places and for
path-integration. Therefore, we propose a representation learning model called
Space2Vec to encode the absolute positions and spatial relationships of places.
We conduct experiments on two real-world geographic datasets for two different
tasks: 1) predicting types of POIs given their positions and context, 2) image
classification leveraging their geo-locations. Results show that because of its
multi-scale representations, Space2Vec outperforms well-established ML
approaches such as RBF kernels, multi-layer feed-forward nets, and tile
embedding approaches for location modeling and image classification tasks.
Detailed analysis shows that each baseline can handle distributions well at only one scale but performs poorly at other scales. In contrast, Space2Vec's multi-scale representation can handle distributions at different scales.
Comment: 15 pages; Accepted to ICLR 2020 as a spotlight paper
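A minimal sketch of a Space2Vec-style multi-scale sinusoidal encoder, following the grid-cell intuition of projecting a 2D point onto three directions 120 degrees apart and encoding each projection at geometrically spaced wavelengths; the parameter names and scale bounds are assumptions for illustration:

```python
# Sketch: multi-scale grid-cell-inspired location encoding.
import numpy as np

def space2vec_encode(xy, num_scales=16, lam_min=1.0, lam_max=10000.0):
    """xy: (N, 2) coordinates -> (N, num_scales * 3 * 2) features."""
    # Three unit vectors separated by 120 degrees, as in grid-cell firing fields.
    angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (3, 2)
    # Geometrically spaced wavelengths cover coarse-to-fine spatial scales.
    g = lam_max / lam_min
    scales = lam_min * g ** (np.arange(num_scales) / max(num_scales - 1, 1))
    proj = xy @ dirs.T  # (N, 3) projections onto the three directions
    feats = []
    for lam in scales:
        phase = 2 * np.pi * proj / lam
        feats.append(np.sin(phase))
        feats.append(np.cos(phase))
    # The concatenated features would then feed a small feed-forward net.
    return np.concatenate(feats, axis=1)
```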
PlaNet - Photo Geolocation with Convolutional Neural Networks
Is it possible to build a system to determine the location where a photo was
taken using just its pixels? In general, the problem seems exceptionally
difficult: it is trivial to construct situations where no location can be
inferred. Yet images often contain informative cues such as landmarks, weather
patterns, vegetation, road markings, and architectural details, which in
combination may allow one to determine an approximate location and occasionally
an exact location. Websites such as GeoGuessr and View from your Window suggest
that humans are relatively good at integrating these cues to geolocate images,
especially en masse. In computer vision, the photo geolocation problem is
usually approached using image retrieval methods. In contrast, we pose the
problem as one of classification by subdividing the surface of the earth into
thousands of multi-scale geographic cells, and train a deep network using
millions of geotagged images. While previous approaches only recognize
landmarks or perform approximate matching using global image descriptors, our
model is able to use and integrate multiple visible cues. We show that the
resulting model, called PlaNet, outperforms previous approaches and even
attains superhuman levels of accuracy in some cases. Moreover, we extend our
model to photo albums by combining it with a long short-term memory (LSTM)
architecture. By learning to exploit temporal coherence to geolocate uncertain
photos, we demonstrate that this model achieves a 50% performance improvement
over the single-image model.
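A minimal sketch of the classification formulation: the network predicts a distribution over precomputed geographic cells, and the estimate is the centre of the most probable cell. The paper's adaptive multi-scale cell construction is abstracted behind `cell_centers`, an assumed (C, 2) array of lat/lon centres:

```python
# Sketch: geolocation as classification over geographic cells.
import torch
import torch.nn as nn

class CellClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_cells: int):
        super().__init__()
        self.backbone = backbone          # any CNN producing (N, feat_dim)
        self.head = nn.Linear(feat_dim, num_cells)

    def forward(self, images):
        return self.head(self.backbone(images))  # logits over cells

def geolocate(model, images, cell_centers):
    # Pick the centre of the highest-probability cell as the estimate.
    logits = model(images)
    return cell_centers[logits.argmax(dim=-1)]

# Training reduces to ordinary cross-entropy against the cell containing
# each photo's ground-truth location:
# loss = nn.functional.cross_entropy(model(images), target_cell_ids)
```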
Part-guided Relational Transformers for Fine-grained Visual Recognition
Fine-grained visual recognition aims to classify objects with visually similar
appearances into subcategories, which has made great progress with the
development of deep CNNs. However, handling subtle differences between
different subcategories still remains a challenge. In this paper, we propose to
solve this issue in one unified framework from two aspects, i.e., constructing
feature-level interrelationships, and capturing part-level discriminative
features. This framework, namely PArt-guided Relational Transformers (PART), is
proposed to learn the discriminative part features with an automatic part
discovery module, and to explore the intrinsic correlations with a feature
transformation module by adapting the Transformer models from the field of
natural language processing. The part discovery module efficiently discovers discriminative regions that correspond closely to the gradient descent procedure. The second feature transformation module then builds correlations between the global embedding and multiple part embeddings, enhancing spatial interactions among semantic pixels. Moreover, our proposed approach
does not rely on additional part branches at inference time and reaches
state-of-the-art performance on three widely used fine-grained object recognition
benchmarks. Experimental results and explainable visualizations demonstrate the
effectiveness of our proposed approach. The code can be found at
https://github.com/iCVTEAM/PART.
Comment: Published in IEEE TIP 2021
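A minimal sketch of the feature transformation step as described: standard multi-head self-attention over the global embedding and K part embeddings, so each token can aggregate information from the others. This is a generic transformer layer under assumed dimensions, not the authors' exact module:

```python
# Sketch: relational modeling between global and part embeddings.
import torch
import torch.nn as nn

class RelationalBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_emb, part_embs):
        # global_emb: (N, dim); part_embs: (N, K, dim)
        tokens = torch.cat([global_emb.unsqueeze(1), part_embs], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)  # self-attention over all tokens
        tokens = self.norm(tokens + out)            # residual + layer norm
        return tokens[:, 0]                         # enriched global embedding
```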
Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks
Learning feature representations of geographical space is vital for any
machine learning model that integrates geolocated data, spanning application
domains such as remote sensing, ecology, or epidemiology. Recent work mostly
embeds coordinates using sine and cosine projections based on Double Fourier
Sphere (DFS) features -- these embeddings assume a rectangular data domain even
on global data, which can lead to artifacts, especially at the poles. At the
same time, relatively little attention has been paid to the exact design of the
neural network architectures these functional embeddings are combined with.
This work proposes a novel location encoder for globally distributed geographic
data that combines spherical harmonic basis functions, natively defined on
spherical surfaces, with sinusoidal representation networks (SirenNets) that
can be interpreted as a learned Double Fourier Sphere embedding. We
systematically evaluate the cross-product of positional embeddings and neural
network architectures across various classification and regression benchmarks
and synthetic evaluation datasets. In contrast to previous approaches that
require the combination of both positional encoding and neural networks to
learn meaningful representations, we show that both spherical harmonics and
sinusoidal representation networks are competitive on their own but set
state-of-the-art performance across tasks when combined. We provide source
code at www.github.com/marccoru/locationencode
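A minimal sketch of the combination the abstract describes, pairing a real spherical-harmonic positional embedding with a SIREN-style network (sine activations). The real-harmonic construction and the tiny two-layer SIREN below are generic textbook versions, assumed for illustration rather than taken from the authors' code:

```python
# Sketch: spherical-harmonic embedding of lon/lat fed to a SIREN.
import numpy as np
import torch
import torch.nn as nn
from scipy.special import sph_harm

def sh_embed(lon_deg, lat_deg, l_max=5):
    """Real spherical harmonics up to degree l_max for one lon/lat point."""
    theta = np.deg2rad(lon_deg) + np.pi      # azimuth in [0, 2*pi]
    phi = np.deg2rad(90.0 - lat_deg)         # polar angle in [0, pi]
    feats = []
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            y = sph_harm(abs(m), l, theta, phi)   # complex harmonic
            if m < 0:                             # standard real-form combination
                feats.append(np.sqrt(2) * (-1) ** m * y.imag)
            elif m == 0:
                feats.append(y.real)
            else:
                feats.append(np.sqrt(2) * (-1) ** m * y.real)
    return torch.tensor(feats, dtype=torch.float32)  # ((l_max + 1)^2,)

class Siren(nn.Module):
    """Two-layer sinusoidal representation network."""
    def __init__(self, in_dim, hidden=64, out_dim=1, w0=30.0):
        super().__init__()
        self.w0 = w0
        self.l1 = nn.Linear(in_dim, hidden)
        self.l2 = nn.Linear(hidden, out_dim)

    def forward(self, x):
        return self.l2(torch.sin(self.w0 * self.l1(x)))
```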