Improving Image Classification with Location Context
With the widespread availability of cellphones and cameras that have GPS
capabilities, images uploaded to the Internet today commonly have GPS
coordinates associated with them. Beyond research that tries to predict GPS
coordinates from visual features, this also opens the door to problems that
are conditioned on the availability of GPS coordinates. In this
work, we tackle the problem of performing image classification with location
context, in which we are given the GPS coordinates for images in both the train
and test phases. We explore different ways of encoding and extracting features
from the GPS coordinates, and show how to naturally incorporate these features
into a Convolutional Neural Network (CNN), the current state-of-the-art for
most image classification and recognition problems. We also show how it is
possible to simultaneously learn the optimal pooling radii for a subset of our
features within the CNN framework. To evaluate our model and to help promote
research in this area, we identify a set of location-sensitive concepts and
annotate a subset of the Yahoo Flickr Creative Commons 100M dataset that has
GPS coordinates with these concepts, which we make publicly available. By
leveraging location context, we are able to achieve almost a 7% gain in mean
average precision.
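The pipeline this abstract describes, extracting features from GPS coordinates and concatenating them with CNN features before classification, can be sketched as follows. This is a minimal illustration, not the paper's exact method: the Gaussian-kernel encoding, the reference centers, and the fixed pooling radius are all assumptions standing in for the learned features and radii described above.

```python
import numpy as np

def encode_location(lat, lon, centers, radius=1.0):
    """Soft membership of a GPS point to a set of reference centers,
    via a Gaussian kernel with a fixed pooling radius. (Hypothetical
    stand-in for the paper's location features; `centers` and
    `radius` are illustrative parameters.)"""
    d2 = np.sum((centers - np.array([lat, lon])) ** 2, axis=1)
    return np.exp(-d2 / (2 * radius ** 2))

def fuse(cnn_feat, loc_feat):
    """Late fusion: concatenate image and location features before
    the final classification layer."""
    return np.concatenate([cnn_feat, loc_feat])

centers = np.array([[40.7, -74.0], [34.0, -118.2], [51.5, -0.1]])
loc = encode_location(40.7, -74.0, centers)   # query point sits at center 0
fused = fuse(np.ones(4), loc)                 # 4 image dims + 3 location dims
```

In the full model the fused vector would feed a trained classifier, and the radius would itself be a learnable parameter, as the abstract notes.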
Evaluation of echosounder data preparation strategies for modern machine learning models
Fish stock assessment and management require accurate estimates of fish abundance, which are typically derived from echosounder observations using acoustic target classification (ATC). Skilled operators are regularly assisted in classifying acoustic targets by software, and there has been increasing interest in using machine learning to create improved tools. Recent studies have applied deep learning approaches to acoustic data; however, algorithm data-preparation strategies (which influence model output) are presently poorly understood, and standardization is needed to enable collaborative research and management. For example, a common pre-processing technique is to resample backscatter data coming from echosounder measurements from the original resolution to a coarser resolution in the horizontal (time) and vertical (range) directions. Using data values derived from the volume backscattering coefficient obtained during the Norwegian sandeel survey, we investigate which resampling resolutions are suitable for ATC using a convolutional neural network trained to classify single values of backscatter data. This process is known as pixel-level semantic segmentation. Our results indicate that it is possible to downsample the data as long as important information related to acoustic characteristics is not smoothed out. We also show that classification performance is improved when providing the network with contextual information relating to range. These findings will provide input to fisheries acoustic data standards and contribute to the ongoing development of automated ATC methods.
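The resampling step discussed above can be sketched as block-averaging an echogram to a coarser time/range grid. This is a generic sketch, not the survey's actual pipeline; the coarsening factors are illustrative. One real subtlety it does reflect: backscattering strength in dB must be averaged in the linear domain, since averaging dB values directly biases the result.

```python
import numpy as np

def downsample_sv(sv_db, f_time, f_range):
    """Downsample an echogram of volume backscattering strength (dB)
    by block-averaging. `f_time` and `f_range` are the coarsening
    factors along the ping (time) and range axes. Averaging happens
    in the linear domain, then converts back to dB."""
    t, r = sv_db.shape
    t2, r2 = t - t % f_time, r - r % f_range          # crop to multiples
    lin = 10.0 ** (sv_db[:t2, :r2] / 10.0)            # dB -> linear
    blocks = lin.reshape(t2 // f_time, f_time, r2 // f_range, f_range)
    return 10.0 * np.log10(blocks.mean(axis=(1, 3)))  # back to dB
```

A constant echogram survives this round trip unchanged, which is a quick sanity check that the dB/linear conversion is consistent.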
A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata
Images represent a commonly used form of visual communication among people.
Nevertheless, image classification can be challenging for ambiguous or
uncommon images that require additional context to be annotated correctly.
Metadata accompanying images on social media are an ideal source of
additional information for retrieving suitable neighborhoods, easing the
image annotation task. To this end, we blend visual features extracted from neighbors
and their metadata to jointly leverage context and visual cues. Our models use
multiple semantic embeddings to achieve the dual objective of being robust to
vocabulary changes between train and test sets and decoupling the architecture
from the low-level metadata representation. Convolutional and recurrent neural
networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors
and query images. We perform comprehensive experiments on the NUS-WIDE dataset
showing that our models outperform state-of-the-art architectures based on
images and metadata, and decrease both sensory and semantic gaps to better
annotate images.
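The neighborhood idea above, retrieving related images via their metadata and letting their labels inform the annotation of a query, can be sketched without the CNN-RNN machinery. The tag-set Jaccard similarity below is a simple stand-in for the paper's learned metadata embeddings, and the toy corpus is illustrative.

```python
import numpy as np

def jaccard(a, b):
    """Tag-set overlap, a hypothetical stand-in for the learned
    metadata similarity used in the paper."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def annotate_from_neighbors(query_tags, corpus_tags, corpus_labels, k=2):
    """Retrieve the k metadata-nearest images and average their label
    vectors, a minimal sketch of context-assisted annotation."""
    sims = np.array([jaccard(query_tags, t) for t in corpus_tags])
    top = np.argsort(-sims)[:k]
    return np.asarray(corpus_labels, dtype=float)[top].mean(axis=0)

corpus_tags = [{"beach", "sunset"}, {"beach", "sea"}, {"office"}]
corpus_labels = [[1, 0], [1, 0], [0, 1]]   # e.g. [outdoor, indoor]
scores = annotate_from_neighbors({"beach"}, corpus_tags, corpus_labels)
```

In the full model, the neighbors' visual features and metadata embeddings would be blended and compared to the query by the jointly trained CNN-RNN, rather than averaged directly.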
MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities
In this paper, we introduce the MLM (Multiple Languages and Modalities)
dataset - a new resource to train and evaluate multitask systems on samples in
multiple modalities and three languages. The generation process and inclusion
of semantic data provide a resource that further tests the ability of
multitask systems to learn relationships between entities. The dataset is
designed for researchers and developers who build applications that perform
multiple tasks on data encountered on the web and in digital archives. A second
version of MLM provides a geo-representative subset of the data with weighted
samples for countries of the European Union. We demonstrate the value of the
resource in developing novel applications in the digital humanities with a
motivating use case and specify a benchmark set of tasks to retrieve modalities
and locate entities in the dataset. Evaluation of baseline multitask and
single-task systems on the full and geo-representative versions of MLM
demonstrates the challenges of generalising on diverse data. In addition to the
digital humanities, we expect the resource to contribute to research in
multimodal representation learning, location estimation, and scene
understanding.
BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
We propose a metadata-aware self-supervised learning (SSL) framework useful
for fine-grained classification and ecological mapping of bird species around
the world. Our framework unifies two SSL strategies: Contrastive Learning (CL)
and Masked Image Modeling (MIM), while also enriching the embedding space with
metadata available with ground-level imagery of birds. We separately train
uni-modal and cross-modal ViTs on a novel cross-view global bird species dataset
containing ground-level imagery, metadata (location, time), and corresponding
satellite imagery. We demonstrate that our models learn fine-grained and
geographically conditioned features of birds, by evaluating on two downstream
tasks: fine-grained visual classification (FGVC) and cross-modal retrieval.
Pre-trained models learned using our framework achieve SotA performance on FGVC
of iNAT-2021 birds and in transfer learning settings for CUB-200-2011 and
NABirds datasets. Moreover, the impressive cross-modal retrieval performance of
our model enables the creation of species distribution maps across any
geographic region. The dataset and source code will be released at
https://github.com/mvrl/BirdSAT. Comment: Accepted at WACV 202
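The cross-view contrastive objective described above, pulling together embeddings of a ground-level photo and the satellite image of the same location, is typically an InfoNCE loss. The numpy sketch below covers only that contrastive half (the masked-image-modeling objective is omitted), and the temperature value is an illustrative assumption rather than the paper's setting.

```python
import numpy as np

def cross_view_nce(ground, sat, tau=0.07):
    """Symmetric-style InfoNCE over a batch: row i of `ground` and
    row i of `sat` form a positive pair; all other rows serve as
    negatives. Embeddings are L2-normalized before the dot product."""
    g = ground / np.linalg.norm(ground, axis=1, keepdims=True)
    s = sat / np.linalg.norm(sat, axis=1, keepdims=True)
    logits = g @ s.T / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = np.arange(len(g))
    return float(-np.log(p[idx, idx]).mean())        # -log p(positive)
```

Correctly matched pairs should score a lower loss than deliberately mismatched ones, which is the signal the encoder is trained on.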
Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells
Unsupervised text encoding models have recently fueled substantial progress
in NLP. The key idea is to use neural networks to convert words in texts to
vector space representations based on word positions in a sentence and their
contexts, which are suitable for end-to-end training of downstream tasks. We
see a strikingly similar situation in spatial analysis, which focuses on
incorporating both absolute positions and spatial contexts of geographic
objects such as POIs into models. A general-purpose representation model for
space is valuable for a multitude of tasks. However, no such general model
exists to date beyond simply applying discretization or feed-forward nets to
coordinates, and little effort has been put into jointly modeling distributions
with vastly different characteristics, which commonly emerge from GIS data.
Meanwhile, Nobel Prize-winning Neuroscience research shows that grid cells in
mammals provide a multi-scale periodic representation that functions as a
metric for location encoding and is critical for recognizing places and for
path-integration. Therefore, we propose a representation learning model called
Space2Vec to encode the absolute positions and spatial relationships of places.
We conduct experiments on two real-world geographic datasets for two different
tasks: 1) predicting types of POIs given their positions and context, 2) image
classification leveraging their geo-locations. Results show that because of its
multi-scale representations, Space2Vec outperforms well-established ML
approaches such as RBF kernels, multi-layer feed-forward nets, and tile
embedding approaches for location modeling and image classification tasks.
Detailed analysis shows that each baseline can handle distributions well at
only one scale and performs poorly at other scales. In contrast, Space2Vec's
multi-scale representation can handle distributions at different scales.
Comment: 15 pages; accepted to ICLR 2020 as a spotlight paper.
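The multi-scale periodic encoding described above can be sketched as projecting each coordinate onto sine/cosine waves at geometrically spaced wavelengths, so near and far structure are captured simultaneously. This is a sketch in the spirit of Space2Vec's grid-cell-inspired encoder; the parameter names and the default wavelength range are illustrative assumptions, not the paper's values.

```python
import numpy as np

def multiscale_encode(coords, n_scales=8, min_lam=1.0, max_lam=10000.0):
    """Map 2-D positions to sin/cos features at n_scales wavelengths
    spaced geometrically between min_lam and max_lam. Output has
    4 * n_scales features per point (sin and cos per axis)."""
    coords = np.atleast_2d(coords).astype(float)          # (n, 2)
    g = (max_lam / min_lam) ** (1.0 / max(n_scales - 1, 1))
    lams = min_lam * g ** np.arange(n_scales)             # wavelengths
    phase = 2 * np.pi * coords[:, :, None] / lams         # (n, 2, S)
    feats = np.concatenate([np.sin(phase), np.cos(phase)], axis=-1)
    return feats.reshape(len(coords), -1)                 # (n, 4*S)
```

Downstream, these features would feed a small feed-forward net; the point of the periodic basis is that a single fixed radius (as in an RBF kernel) cannot represent point distributions that vary across scales.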
Thinking like a naturalist: enhancing computer vision of citizen science images by harnessing contextual data
1. The accurate identification of species in images submitted by citizen scientists is currently a bottleneck for many data uses. Machine learning tools offer the potential to provide rapid, objective and scalable species identification for the benefit of many aspects of ecological science. Currently, most approaches only make use of image pixel data for classification. However, an experienced naturalist would also use a wide variety of contextual information such as the location and date of recording.
2. Here, we examine the automated identification of ladybird (Coccinellidae) records from the British Isles submitted to the UK Ladybird Survey, a volunteer-led mass participation recording scheme. Each image is associated with metadata: a date, location and recorder ID, which can be cross-referenced with other data sources to determine local weather at the time of recording, habitat types and the experience of the observer. We built multi-input neural network models that synthesize metadata and images to identify records to species level.
3. We show that machine learning models can effectively harness contextual information to improve the interpretation of images. Against an image-only baseline of 48.2%, we observe a 9.1 percentage-point improvement in top-1 accuracy with a multi-input model, compared to only a 3.6% increase when using an ensemble of image and metadata models. This suggests that contextual data are being used to interpret an image, beyond just providing a prior expectation. We show that our neural network models appear to be utilizing similar pieces of evidence as human naturalists to make identifications.
4. Metadata is a key tool for human naturalists. We show it can also be harnessed by computer vision systems. Contextualization offers considerable extra information, particularly for challenging species, even within small and relatively homogeneous areas such as the British Isles. Although complex relationships between disparate sources of information can be profitably interpreted by simple neural network architectures, there is likely considerable room for further progress. Contextualizing images has the potential to lead to a step change in the accuracy of automated identification tools, with considerable benefits for large-scale verification of submitted records.
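The multi-input architecture described in point 2, separate branches for image features and metadata that are merged before classification, can be sketched as a single forward pass. The branch sizes, weight shapes, and five-way output below are illustrative assumptions, not the study's architecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_input_forward(img_feat, meta_feat, params):
    """One forward pass of a minimal multi-input network: a ReLU
    branch for image features, a ReLU branch for metadata (date,
    location, weather, recorder experience), concatenated and fed
    to a softmax classifier over species."""
    h_img = np.maximum(img_feat @ params["W_img"], 0)     # image branch
    h_meta = np.maximum(meta_feat @ params["W_meta"], 0)  # metadata branch
    return softmax(np.concatenate([h_img, h_meta]) @ params["W_out"])

rng = np.random.default_rng(1)
params = {"W_img": rng.normal(size=(8, 4)),   # 8 image dims -> 4 hidden
          "W_meta": rng.normal(size=(3, 4)),  # 3 metadata dims -> 4 hidden
          "W_out": rng.normal(size=(8, 5))}   # 5 hypothetical species
probs = multi_input_forward(rng.normal(size=8), rng.normal(size=3), params)
```

Because the branches are fused before the classifier, the metadata can reshape how the image evidence is interpreted, which is the behavior point 3 contrasts with a post-hoc ensemble of separate image and metadata models.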