Multi-Modal Fusion by Meta-Initialization
When experience is scarce, models may have insufficient information to adapt
to a new task. In this case, auxiliary information - such as a textual
description of the task - can enable improved task inference and adaptation. In
this work, we propose an extension to the Model-Agnostic Meta-Learning
algorithm (MAML), which allows the model to adapt using auxiliary information
as well as task experience. Our method, Fusion by Meta-Initialization (FuMI),
conditions the model initialization on auxiliary information using a
hypernetwork, rather than learning a single, task-agnostic initialization.
Furthermore, motivated by the shortcomings of existing multi-modal few-shot
learning benchmarks, we constructed iNat-Anim - a large-scale image
classification dataset with succinct and visually pertinent textual class
descriptions. On iNat-Anim, FuMI significantly outperforms uni-modal baselines
such as MAML in the few-shot regime. The code for this project and a dataset
exploration tool for iNat-Anim are publicly available at
https://github.com/s-a-malik/multi-few.
Comment: The first two authors contributed equally.
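To make the idea concrete, here is a minimal PyTorch sketch of text-conditioned initialization followed by MAML-style adaptation, as described in the abstract. It is not the authors' released code: the hypernetwork, dimensions, and inner-loop settings are illustrative assumptions.

    # Sketch: a hypernetwork turns class-description embeddings into the initial
    # weights of a linear head; the head is then adapted on the support set.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HyperInit(nn.Module):
        """Maps class-level text embeddings to per-class head weights."""
        def __init__(self, text_dim, feat_dim):
            super().__init__()
            self.to_weight = nn.Linear(text_dim, feat_dim)

        def forward(self, class_text_emb):
            # (n_way, text_dim) -> initial head weights (n_way, feat_dim)
            return self.to_weight(class_text_emb)

    def adapt_head(init_w, support_feats, support_labels, steps=5, lr=0.1):
        """Inner-loop adaptation of the text-conditioned initialization."""
        w = init_w.clone()  # requires grad through the hypernetwork
        for _ in range(steps):
            logits = support_feats @ w.t()
            loss = F.cross_entropy(logits, support_labels)
            (grad,) = torch.autograd.grad(loss, w, create_graph=True)
            w = w - lr * grad  # differentiable update, as in MAML
        return w

    # Toy 5-way task with one support example per class (random stand-in features).
    n_way, text_dim, feat_dim = 5, 32, 64
    hyper = HyperInit(text_dim, feat_dim)
    class_text_emb = torch.randn(n_way, text_dim)   # from a text encoder
    support_feats = torch.randn(n_way, feat_dim)    # from an image encoder
    support_labels = torch.arange(n_way)
    w0 = hyper(class_text_emb)                      # task-conditioned initialization
    w_adapted = adapt_head(w0, support_feats, support_labels)
    query_logits = torch.randn(10, feat_dim) @ w_adapted.t()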
LPN: Language-guided Prototypical Network for few-shot classification
Few-shot classification aims to adapt to new tasks with limited labeled
examples. To make full use of the available data, recent methods explore
suitable similarity measures between query and support images and learn better
high-dimensional features via meta-training and pre-training strategies.
However, the potential of multi-modal information, which could bring promising
improvements for few-shot classification, has barely been explored. In this
paper, we propose a Language-guided Prototypical Network (LPN) for few-shot
classification, which leverages the complementarity of vision and language
modalities via two parallel branches. Concretely, to introduce language
modality with limited samples in the visual task, we leverage a pre-trained
text encoder to extract class-level text features directly from class names
while processing images with a conventional image encoder. Then, a
language-guided decoder is introduced to obtain text features corresponding to
each image by aligning class-level features with visual features. In addition,
to take advantage of class-level features and prototypes, we build a refined
prototypical head that generates robust prototypes in the text branch for
follow-up measurement. Finally, we aggregate the visual and text logits to
calibrate the deviation of a single modality. Extensive experiments demonstrate
the competitiveness of LPN against state-of-the-art methods on benchmark
datasets.
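As a rough illustration of the two-branch aggregation (details assumed, not taken from the paper), the sketch below computes cosine-similarity logits against visual prototypes and against class-level text features, then fuses them; the mixing weight and dimensions are hypothetical.

    # Sketch: fuse logits from a visual-prototype branch and a text branch.
    import torch
    import torch.nn.functional as F

    def prototypes(support_feats, support_labels, n_way):
        # Mean visual feature per class -> (n_way, feat_dim)
        return torch.stack([support_feats[support_labels == c].mean(0) for c in range(n_way)])

    def cosine_logits(query_feats, class_feats, scale=10.0):
        q = F.normalize(query_feats, dim=-1)
        c = F.normalize(class_feats, dim=-1)
        return scale * q @ c.t()

    # Toy 5-way, 5-shot task; random tensors stand in for encoder outputs.
    n_way, feat_dim = 5, 64
    support_feats = torch.randn(25, feat_dim)
    support_labels = torch.arange(n_way).repeat_interleave(5)
    query_feats = torch.randn(15, feat_dim)
    class_text_feats = torch.randn(n_way, feat_dim)  # e.g. projected class-name embeddings

    visual_logits = cosine_logits(query_feats, prototypes(support_feats, support_labels, n_way))
    text_logits = cosine_logits(query_feats, class_text_feats)
    alpha = 0.5                                      # fusion weight (hyperparameter)
    fused_logits = alpha * visual_logits + (1 - alpha) * text_logits
    predictions = fused_logits.argmax(dim=-1)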
Dual Adversarial Alignment for Realistic Support-Query Shift Few-shot Learning
Support-query shift few-shot learning aims to classify unseen examples (the query
set) using labeled data (the support set) based on an embedding learned in a
low-dimensional space under a distribution shift between the support set and
the query set. However, in real-world scenarios the shifts are usually unknown
and varied, making them difficult to estimate in advance. Therefore, in this
paper, we propose a new and more difficult challenge, RSQS, focusing on
Realistic Support-Query Shift few-shot learning. The key feature of RSQS is
that individual samples within each meta-task are subjected to multiple
distribution shifts. In addition, we propose a unified
adversarial feature alignment method called DUal adversarial ALignment
framework (DuaL) to relieve RSQS from two aspects, i.e., inter-domain bias and
intra-domain variance. On the one hand, for the inter-domain bias, we corrupt
the original data in advance and use the synthesized perturbed inputs to train
a repairer network by minimizing the distance at the feature level. On the other
hand, for intra-domain variance, we propose a generator network to synthesize
hard, i.e., less similar, examples from the support set in a self-supervised
manner and introduce regularized optimal transportation to derive a smooth
optimal transportation plan. Lastly, a benchmark of RSQS is built with several
state-of-the-art baselines on three datasets (CIFAR-100, mini-ImageNet, and
tieredImageNet). Experimental results show that DuaL significantly outperforms
the state-of-the-art methods in our benchmark.
Comment: Best student paper in PAKDD 202
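For intuition only, the sketch below illustrates two ingredients mentioned in the abstract under assumed details that may differ from the paper: a feature-level alignment loss between clean inputs and repaired corrupted inputs, and an entropy-regularized (Sinkhorn) optimal transportation plan between two feature sets.

    # Sketch: feature-level repair loss and a Sinkhorn transport plan.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def alignment_loss(encoder, repairer, clean_x, corrupted_x):
        """Train the repairer so repaired features match clean features."""
        with torch.no_grad():
            target = encoder(clean_x)
        repaired = encoder(repairer(corrupted_x))
        return F.mse_loss(repaired, target)

    def sinkhorn_plan(a_feats, b_feats, eps=0.1, n_iter=50):
        """Entropy-regularized OT plan between two uniform empirical distributions."""
        cost = torch.cdist(a_feats, b_feats) ** 2            # pairwise squared distances
        K = torch.exp(-cost / eps)
        a_marg = torch.full((cost.size(0),), 1.0 / cost.size(0))
        b_marg = torch.full((cost.size(1),), 1.0 / cost.size(1))
        u, v = torch.ones_like(a_marg), torch.ones_like(b_marg)
        for _ in range(n_iter):                              # Sinkhorn iterations
            u = a_marg / (K @ v)
            v = b_marg / (K.t() @ u)
        return torch.diag(u) @ K @ torch.diag(v)             # smooth transport plan

    # Toy usage with stand-in encoder/repairer networks and random features.
    encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
    repairer = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
    clean = torch.randn(8, 32)
    corrupted = clean + 0.3 * torch.randn_like(clean)
    loss = alignment_loss(encoder, repairer, clean, corrupted)
    plan = sinkhorn_plan(torch.randn(5, 64), torch.randn(7, 64))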