Learning Similarity Attention
We consider the problem of learning similarity functions. While there has
been substantial progress in learning suitable distance metrics, these
techniques in general lack decision reasoning, i.e., explaining why the input
set of images is similar or dissimilar. In this work, we solve this key problem
by proposing the first method to generate generic visual similarity
explanations with gradient-based attention. We demonstrate that our technique
is agnostic to the specific similarity model type, e.g., we show applicability
to Siamese, triplet, and quadruplet models. Furthermore, we make our proposed
similarity attention a principled part of the learning process, resulting in a
new paradigm for learning similarity functions. We demonstrate that our
learning mechanism results in more generalizable, as well as explainable,
similarity models. Finally, we demonstrate the generality of our framework by
means of experiments on a variety of tasks, including image retrieval, person
re-identification, and low-shot semantic segmentation.
Comment: 10 pages, 7 figures, 4 tables
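The gradient-based similarity attention described above can be illustrated with a minimal Grad-CAM-style sketch for a Siamese pair: the gradient of the cosine similarity with respect to one image's pooled embedding serves as channel weights over that image's feature map. This is an illustrative assumption about the mechanism, not the authors' exact formulation; all names here are hypothetical.

```python
import numpy as np

def similarity_attention(feat_a, feat_b):
    """Gradient-based attention for a Siamese cosine-similarity score.
    feat_a, feat_b: conv feature maps of shape (C, H, W); embeddings
    are obtained by global average pooling."""
    e_a = feat_a.mean(axis=(1, 2))            # (C,) embedding for image A
    e_b = feat_b.mean(axis=(1, 2))            # (C,) embedding for image B
    na, nb = np.linalg.norm(e_a), np.linalg.norm(e_b)
    s = e_a @ e_b / (na * nb)                 # cosine similarity score
    # Analytic gradient of s w.r.t. e_a; pooling spreads it uniformly
    # over spatial positions, so it acts as per-channel weights.
    grad_e_a = e_b / (na * nb) - s * e_a / na**2
    # Attention map = ReLU of the gradient-weighted channel sum, (H, W).
    attn = np.maximum(0.0, np.tensordot(grad_e_a, feat_a, axes=1))
    return s, attn
```

The same recipe applies to triplet or quadruplet losses by differentiating the corresponding similarity score instead of a single cosine term.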
Less than Few: Self-Shot Video Instance Segmentation
The goal of this paper is to bypass the need for labelled examples in
few-shot video understanding at run time. While proven effective, in many
practical video settings even labelling a few examples appears unrealistic.
This is especially true as the level of detail in spatio-temporal video
understanding, and with it the complexity of annotations, continues to increase.
Rather than performing few-shot learning with a human oracle to provide a few
densely labelled support videos, we propose to automatically learn to find
appropriate support videos given a query. We call this self-shot learning and
we outline a simple self-supervised learning method to generate an embedding
space well-suited for unsupervised retrieval of relevant samples. To showcase
this novel setting, we tackle, for the first time, video instance segmentation
in a self-shot (and few-shot) setting, where the goal is to segment instances
at the pixel-level across the spatial and temporal domains. We provide strong
baseline performances that utilize a novel transformer-based model and show
that self-shot learning can even surpass few-shot and can be positively
combined for further performance gains. Experiments on new benchmarks show that
our approach achieves strong performance, is competitive to oracle support in
some settings, scales to large unlabelled video collections, and can be
combined in a semi-supervised setting.
Comment: 25 pages, 5 figures, 13 tables
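The self-shot selection step above amounts to nearest-neighbour retrieval in a learned embedding space: given a query, pick the closest unlabelled videos as the support set. A minimal sketch of that retrieval rule, assuming plain cosine k-NN (the paper's actual embedding model and retrieval details are not specified here, and the names are illustrative):

```python
import numpy as np

def retrieve_support(query_emb, pool_embs, k=5):
    """Select k support videos for a query by cosine similarity in an
    embedding space (e.g. one trained with self-supervision).
    query_emb: (D,) query embedding; pool_embs: (N, D) unlabelled pool."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                       # cosine similarity of each pool video
    top = np.argsort(-sims)[:k]        # indices of the k best support videos
    return top, sims[top]
```

Because only embeddings are compared, the same retrieval scales to large unlabelled collections with an approximate-nearest-neighbour index in place of the exact sort.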
Image Captioning with Unseen Objects
Image caption generation is a long-standing and challenging problem at the
intersection of computer vision and natural language processing. A number of
recently proposed approaches utilize a fully supervised object recognition
model within the captioning approach. Such models, however, tend to generate
sentences which only consist of objects predicted by the recognition models,
excluding instances of the classes without labelled training examples. In this
paper, we propose a new challenging scenario that targets the image captioning
problem in a fully zero-shot learning setting, where the goal is to be able to
generate captions of test images containing objects that are not seen during
training. The proposed approach jointly uses a novel zero-shot object detection
model and a template-based sentence generator. Our experiments show promising
results on the COCO dataset.
Comment: To appear in British Machine Vision Conference (BMVC) 2019
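The two-stage pipeline above (a zero-shot detector proposing object labels, then a template-based generator producing the sentence) can be sketched as follows. The templates and helper name are illustrative assumptions; the paper's actual templates are not reproduced here.

```python
def caption_from_detections(detections, templates=None):
    """Turn detector outputs into a sentence via fixed templates, so that
    unseen object labels can still appear in the caption.
    detections: list of {"label": str, "score": float} dicts."""
    if templates is None:
        templates = {1: "A photo of a {}.",
                     2: "A {} and a {}."}
    # Rank labels by detector confidence, most confident first.
    labels = [d["label"] for d in sorted(detections, key=lambda d: -d["score"])]
    n = min(len(labels), max(templates))
    if n == 0:
        return "A photo."
    return templates[n].format(*labels[:n])
```

Because the sentence structure is fixed, the caption's novelty comes entirely from the detector: any label it can ground zero-shot can be slotted into a template.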
Symbolic Discovery of Optimization Algorithms
We present a method to formulate algorithm discovery as program search, and
apply it to discover optimization algorithms for deep neural network training.
We leverage efficient search techniques to explore an infinite and sparse
program space. To bridge the large generalization gap between proxy and target
tasks, we also introduce program selection and simplification strategies. Our
method discovers a simple and effective optimization algorithm, Lion
(EvoLved Sign Momentum).
It is more memory-efficient than Adam as it only keeps track of the momentum.
Different from adaptive optimizers, its update has the same magnitude for each
parameter calculated through the sign operation. We compare Lion with widely
used optimizers, such as Adam and Adafactor, for training a variety of models
on different tasks. On image classification, Lion boosts the accuracy of ViT by
up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On
vision-language contrastive learning, we achieve 88.3% zero-shot and 91.1%
fine-tuning accuracy on ImageNet, surpassing the previous best
results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms
Adam by achieving a better FID score and reducing the training compute by up to
2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion
exhibits a similar or better performance compared to Adam. Our analysis of Lion
reveals that its performance gain grows with the training batch size. It also
requires a smaller learning rate than Adam due to the larger norm of the update
produced by the sign function. Additionally, we examine the limitations of Lion
and identify scenarios where its improvements are small or not statistically
significant. The implementation of Lion is publicly available.
Comment: 30 pages, update the tuning instructions
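The update rule stated in the abstract, sign of an interpolation between momentum and gradient, with momentum as the only optimizer state, can be written as a minimal single-step sketch (numpy stand-in for illustration; default hyperparameters here are assumptions, not the paper's tuned values):

```python
import numpy as np

def lion_update(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion step: the update direction is the sign of an
    interpolation between the momentum and the current gradient, so
    every coordinate moves by the same magnitude lr (plus decoupled
    weight decay). Momentum is the only state kept per parameter."""
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)
    new_param = param - lr * (update + wd * param)
    new_m = beta2 * m + (1.0 - beta2) * grad   # momentum update
    return new_param, new_m
```

The sign operation is what makes the per-parameter update magnitude uniform, and why, as the abstract notes, Lion typically wants a smaller learning rate than Adam: the update norm is larger than Adam's normalized step.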