Memory-Augmented Relation Network for Few-Shot Learning
Metric-based few-shot learning methods concentrate on learning a transferable
feature embedding that generalizes well from seen categories to unseen
categories under the supervision of a limited number of labelled instances.
However, most of them treat each individual instance in the working context
separately without considering its relationships with the others. In this work,
we investigate a new metric-learning method, Memory-Augmented Relation Network
(MRN), to explicitly exploit these relationships. In particular, for an
instance, we choose the samples that are visually similar from the working
context, and perform weighted information propagation to attentively aggregate
helpful information from the chosen ones to enhance its representation. In MRN,
we also formulate the distance metric as a learnable relation module which
learns to compare for similarity measurement, and augment the working context
with memory slots, both contributing to its generality. We empirically
demonstrate that MRN yields significant improvement over its ancestor and
achieves competitive or even better performance when compared with other
few-shot learning approaches on the two major benchmark datasets, i.e.
miniImagenet and tieredImagenet.
Comment: To be submitted to ACM Multimedia 202
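
The weighted information propagation described in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the number of selected samples, or the update rule, so everything here is an assumption made for illustration: cosine similarity, top-k selection, softmax attention weights, and a residual update. The names `propagate`, `k`, and `temperature` are hypothetical and do not come from the paper.

```python
import numpy as np

def propagate(query, context, k=3, temperature=1.0):
    """Enhance `query` (shape (d,)) using its k most similar rows of `context` (shape (n, d)).

    Assumed design, not the paper's: cosine similarity, top-k selection,
    softmax weights, residual update. Inputs must have non-zero norm.
    """
    # Cosine similarity between the query and every context embedding.
    q = query / np.linalg.norm(query)
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    sims = c @ q
    # Indices of the k most similar context samples.
    top = np.argsort(sims)[-k:]
    # Softmax over the selected similarities gives the attention weights.
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Residual update: original embedding plus the weighted neighbour summary.
    return query + weights @ context[top]
```

In this sketch, the memory slots mentioned in the abstract could be modelled by appending extra rows to `context`, though how MRN actually stores and updates them is not stated here.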
Cross-view Geo-localization with Evolving Transformer
In this work, we address the problem of cross-view geo-localization, which
estimates the geospatial location of a street view image by matching it with a
database of geo-tagged aerial images. The cross-view matching task is extremely
challenging due to drastic appearance and geometry differences across views.
Unlike existing methods that predominantly fall back on CNN, here we devise a
novel evolving geo-localization Transformer (EgoTR) that utilizes the
properties of self-attention in Transformer to model global dependencies, thus
significantly decreasing visual ambiguities in cross-view geo-localization. We
also exploit the positional encoding of Transformer to help the EgoTR
understand and correspond geometric configurations between ground and aerial
images. Compared to state-of-the-art methods that impose strong assumptions on
geometry knowledge, the EgoTR flexibly learns the positional embeddings through
the training objective and hence becomes more practical in many real-world
scenarios. Although Transformer is well suited to our task, its vanilla
self-attention mechanism independently interacts within image patches in each
layer, which overlooks correlations between layers. Instead, this paper proposes
a simple yet effective self-cross attention mechanism to improve the quality of
learned representations. The self-cross attention models global dependencies
between adjacent layers, relating image patches to one another while modeling how
features evolve in the previous layer. As a result, the proposed self-cross
attention leads to more stable training, improves the generalization ability
and encourages representations to keep evolving as the network goes deeper.
Extensive experiments demonstrate that our EgoTR performs favorably against
state-of-the-art methods on standard, fine-grained and cross-dataset cross-view
geo-localization tasks.
Comment: Under Review
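
One possible reading of the self-cross attention described in the abstract above is sketched below. The abstract only says that the mechanism models dependencies between adjacent layers, so the specific form used here is an assumption: queries are taken from the current layer's patch features, while keys and values are taken from the previous layer's. The function name `self_cross_attention` and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical, and the sketch uses a single head with no positional encoding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_cross_attention(curr, prev, w_q, w_k, w_v):
    """Single-head attention across adjacent layers (assumed formulation).

    curr, prev: (n_patches, d) patch features of layer l and layer l-1.
    w_q, w_k, w_v: (d, d) projection matrices.
    """
    q = curr @ w_q   # queries from the current layer
    k = prev @ w_k   # keys from the previous layer
    v = prev @ w_v   # values from the previous layer
    # Scaled dot-product scores between current and previous patches.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn
```

Setting `prev` equal to `curr` reduces this sketch to ordinary single-head self-attention, which is the vanilla mechanism the abstract contrasts against.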