EventCLIP: Adapting CLIP for Event-based Object Recognition
Recent advances in zero-shot and few-shot classification heavily rely on the
success of pre-trained vision-language models (VLMs) such as CLIP. Due to a
shortage of large-scale datasets, training such models for event camera data
remains infeasible. Thus, adapting existing models across modalities is an
important research challenge. In this work, we introduce EventCLIP, a novel
approach that utilizes CLIP for zero-shot and few-shot event-based object
recognition. We first generalize CLIP's image encoder to event data by
converting raw events to 2D grid-based representations. To further enhance
performance, we propose a feature adapter to aggregate temporal information
over event frames and refine text embeddings to better align with the visual
inputs. We evaluate EventCLIP on N-Caltech, N-Cars, and N-ImageNet datasets,
achieving state-of-the-art few-shot performance. When fine-tuned on the entire
dataset, our method outperforms all existing event classifiers. Moreover, we
explore practical applications of EventCLIP including robust event
classification and label-free event recognition, where our approach surpasses
previous baselines designed specifically for these tasks.
Comment: Better few-shot accuracy. Added results on 1) model fine-tuning, 2) comparison with concurrent works, and 3) learning from unlabeled data (unsupervised & semi-supervised).
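As a rough illustration of the recipe the abstract describes (not the authors' code), the sketch below converts raw events into a 2D count grid and scores classes by cosine similarity between temporally averaged image features and class text features; the per-frame and text features are assumed to come from a frozen CLIP model and are passed in as arrays.

```python
# Minimal sketch of the zero-shot pipeline described above (not the authors' code).
# Assumes per-frame features come from a frozen CLIP image encoder and class
# features from its text encoder; only the event-to-grid and scoring steps are shown.
import numpy as np

def events_to_grid(events, height, width):
    """Accumulate raw events (x, y, t, polarity) into a 2-channel count grid."""
    grid = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _, p in events:
        grid[int(p > 0), int(y), int(x)] += 1.0
    return grid / max(grid.max(), 1e-6)          # normalise counts to [0, 1]

def zero_shot_scores(frame_feats, class_text_feats):
    """Aggregate per-frame CLIP features over time, then cosine-match to text."""
    v = frame_feats.mean(axis=0)                 # simple temporal aggregation
    v = v / np.linalg.norm(v)
    t = class_text_feats / np.linalg.norm(class_text_feats, axis=1, keepdims=True)
    return t @ v                                 # one similarity score per class
```

In the paper's terms, the feature adapter and the text-embedding refinement would replace the plain temporal mean and the raw text features used here.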
Noise-Tolerant Unsupervised Adapter for Vision-Language Models
Recent advances in large-scale vision-language models have achieved very
impressive performance in various zero-shot image classification tasks. While
prior studies have demonstrated significant improvements by introducing
few-shot labelled target samples, they still require labelling of target
samples, which greatly limits their scalability when handling diverse visual
recognition tasks. We design NtUA, a Noise-tolerant Unsupervised Adapter that
allows learning superior target models with few-shot unlabelled target samples.
NtUA works as a key-value cache that formulates visual features and predicted
pseudo-labels of the few-shot unlabelled target samples as key-value pairs. It
consists of two complementary designs. The first is adaptive cache formation
that combats pseudo-label noises by weighting the key-value pairs according to
their prediction confidence. The second is pseudo-label rectification, which
corrects both pair values (i.e., pseudo-labels) and cache weights by leveraging
knowledge distillation from large-scale vision language models. Extensive
experiments show that NtUA achieves superior performance consistently across
multiple widely adopted benchmarks.
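A hedged sketch of the cache construction described above, guessing at the general shape rather than reproducing the released implementation: keys are visual features of the unlabelled samples, values are their CLIP pseudo-labels, and each key-value pair is down-weighted by its prediction confidence (the rectification step via knowledge distillation is omitted).

```python
# Hedged sketch of a confidence-weighted key-value cache in the spirit of NtUA.
# Assumes L2-normalised features and CLIP zero-shot logits for the unlabelled samples.
import numpy as np

def build_cache(feats, zero_shot_logits):
    """Keys = unlabelled features; values = pseudo-labels scaled by confidence."""
    probs = np.exp(zero_shot_logits)
    probs /= probs.sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)                                  # per-sample confidence
    values = np.eye(probs.shape[1])[probs.argmax(axis=1)]     # hard pseudo-labels
    return feats, values * conf[:, None]                      # adaptive pair weighting

def cache_logits(query_feat, keys, weighted_values, beta=5.0):
    """Affinity to the cached keys mixes their (weighted) pseudo-labels."""
    affinity = np.exp(-beta * (1.0 - keys @ query_feat))      # cosine-based affinity
    return affinity @ weighted_values                         # class scores
```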
A Multimodal Prototypical Approach for Unsupervised Sound Classification
In the context of environmental sound classification, the adaptability of
systems is key: which sound classes are interesting depends on the context and
the user's needs. Recent advances in text-to-audio retrieval allow for
zero-shot audio classification, but performance compared to supervised models
remains limited. This work proposes a multimodal prototypical approach that
exploits local audio-text embeddings to provide more relevant answers to audio
queries, augmenting the adaptability of sound detection in the wild. We do this
by first using text to query a nearby community of audio embeddings that best
characterize each query sound, and selecting each group's centroid as a
prototype. Second, we compare unseen audio to these prototypes for
classification. We perform multiple ablation studies to understand the impact
of the embedding models and prompts. Our unsupervised approach improves upon
the zero-shot state-of-the-art in three sound recognition benchmarks by an
average of 12%.
Comment: Accepted to INTERSPEECH 202
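A minimal sketch of the prototype construction, assuming text and audio embeddings share one L2-normalised space (e.g. from a CLAP-style model); the names below are illustrative placeholders, not the authors' API.

```python
# Illustrative sketch: build one prototype per class from the audio neighbourhood
# of its text prompt, then classify unseen audio by nearest prototype.
import numpy as np

def build_prototypes(class_text_embs, pool_audio_embs, k=32):
    """For each class prompt, average its k nearest audio embeddings (the 'community')."""
    protos = []
    for t in class_text_embs:
        nearest = pool_audio_embs[np.argsort(-(pool_audio_embs @ t))[:k]]
        centroid = nearest.mean(axis=0)
        protos.append(centroid / np.linalg.norm(centroid))
    return np.stack(protos)

def classify(audio_emb, prototypes):
    """Assign unseen audio to the most similar prototype."""
    return int(np.argmax(prototypes @ audio_emb))
```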
Perceptual Grouping in Contrastive Vision-Language Models
Recent advances in zero-shot image recognition suggest that vision-language
models learn generic visual representations with a high degree of semantic
information that may be arbitrarily probed with natural language phrases.
Understanding an image, however, is not just about understanding what content
resides within an image, but importantly, where that content resides. In this
work we examine how well vision-language models are able to understand where
objects reside within an image and group together visually related parts of the
imagery. We demonstrate how contemporary vision and language representation
learning models based on contrastive losses and large web-based data capture
limited object localization information. We propose a minimal set of
modifications that results in models that uniquely learn both semantic and
spatial information. We measure this performance in terms of zero-shot image
recognition, unsupervised bottom-up and top-down semantic segmentations, as
well as robustness analyses. We find that the resulting model achieves
state-of-the-art results in terms of unsupervised segmentation, and demonstrate
that the learned representations are uniquely robust to spurious correlations
in datasets designed to probe the causal behavior of vision models.
Comment: Accepted and presented at ICCV 202
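As a rough illustration of how such grouping can be probed (this is a generic text-probe, not the paper's actual model modifications), per-patch image features from a contrastive vision-language model can be matched against class text embeddings to produce a coarse, text-driven segmentation.

```python
# Hedged sketch of a text probe for spatial grouping: per-patch features from a
# vision-language model are assigned to the closest class text embedding.
# The dense feature extractor itself is assumed and not shown.
import numpy as np

def text_probe_segmentation(patch_feats, class_text_feats, h, w):
    """patch_feats: (h*w, d), class_text_feats: (C, d); both L2-normalised."""
    sims = patch_feats @ class_text_feats.T      # patch-to-text similarity, (h*w, C)
    labels = sims.argmax(axis=1)                 # pick the best class per patch
    return labels.reshape(h, w)                  # coarse semantic segmentation map
```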
NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval
Recognizing entities in texts is a central need in many information-seeking
scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the
most successful examples of a widely adopted NLP task and corresponding NLP
technology. Recent advances in large language models (LLMs) appear to provide
effective solutions (also) for NER tasks that were traditionally handled with
dedicated models, often matching or surpassing the abilities of the dedicated
models. Should NER be considered a solved problem? We argue to the contrary:
the capabilities provided by LLMs are not the end of NER research, but rather
an exciting beginning. They allow taking NER to the next level, tackling
increasingly more useful, and increasingly more challenging, variants. We
present three variants of the NER task, together with a dataset to support
them. The first is a move towards more fine-grained -- and intersectional --
entity types. The second is a move towards zero-shot recognition and extraction
of these fine-grained types based on entity-type labels. The third, and most
challenging, is the move from the recognition setup to a novel retrieval setup,
where the query is a zero-shot entity type, and the expected result is all the
sentences from a large, pre-indexed corpus that contain entities of these
types, and their corresponding spans. We show that all of these are far from
being solved. We provide a large, silver-annotated corpus of 4 million
paragraphs covering 500 entity types, to facilitate research towards all of
these three goals.
Comment: Findings of EMNLP 202
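To make the third, retrieval-style variant concrete, a hypothetical interface might look like the following (field names and the retriever object are illustrative, not the dataset's actual schema): the query is a free-form zero-shot entity-type description, and the result is every pre-indexed sentence containing such an entity, together with its spans.

```python
# Hypothetical sketch of the zero-shot entity-type retrieval setup described above.
# The index object and its search() method are placeholders, not a real API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RetrievedSentence:
    doc_id: str
    text: str
    spans: List[Tuple[int, int]]    # character offsets of matching entity mentions

def retrieve(entity_type: str, index) -> List[RetrievedSentence]:
    """Return all pre-indexed sentences containing entities of the queried type."""
    return index.search(entity_type)
```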
Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective
This paper takes a problem-oriented perspective and presents a comprehensive
review of transfer learning methods, both shallow and deep, for cross-dataset
visual recognition. Specifically, it categorises the cross-dataset recognition
into seventeen problems based on a set of carefully chosen data and label
attributes. Such a problem-oriented taxonomy has allowed us to examine how
different transfer learning approaches tackle each problem and how well each
problem has been researched to date. The comprehensive problem-oriented review
of the advances in transfer learning with respect to each problem has not only
revealed the challenges in transfer learning for visual recognition, but also
highlighted the problems (e.g. eight of the seventeen) that have scarcely been
studied. This survey not only presents an up-to-date technical review for
researchers, but also a systematic approach and a reference for a machine
learning practitioner to categorise a real problem and to look up a possible
solution accordingly.
Semantic Autoencoder for Zero-Shot Learning
Existing zero-shot learning (ZSL) models typically learn a projection
function from a feature space to a semantic embedding space (e.g. attribute
space). However, such a projection function is only concerned with predicting
the semantic representation of the seen training classes (e.g. attribute prediction) or
classification. When applied to test data, which in the context of ZSL contains
different (unseen) classes without training data, a ZSL model typically suffers
from the projection domain shift problem. In this work, we present a novel
solution to ZSL based on learning a Semantic AutoEncoder (SAE). Taking the
encoder-decoder paradigm, an encoder aims to project a visual feature vector
into the semantic space as in the existing ZSL models. However, the decoder
exerts an additional constraint, that is, the projection/code must be able to
reconstruct the original visual feature. We show that with this additional
reconstruction constraint, the learned projection function from the seen
classes is able to generalise better to the new unseen classes. Importantly,
the encoder and decoder are linear and symmetric, which enables us to develop an
extremely efficient learning algorithm. Extensive experiments on six benchmark
datasets demonstrate that the proposed SAE significantly outperforms existing
ZSL models, with the additional benefit of lower computational cost.
Furthermore, when the SAE is applied to the supervised clustering problem, it
also beats the state-of-the-art.
Comment: accepted to CVPR201
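Because the encoder and decoder are linear and tied, the objective implied by the abstract, roughly min_W ||X - W^T S||^2 + lambda ||W X - S||^2, admits a closed-form solution via a Sylvester equation; below is a minimal sketch under that assumption (column-wise features X and semantic vectors S), not the authors' released code.

```python
# Minimal sketch of an SAE-style closed-form solver: the tied linear encoder W
# minimises ||X - W.T @ S||^2 + lam * ||W @ X - S||^2, which reduces to the
# Sylvester equation (S S^T) W + W (lam X X^T) = (1 + lam) S X^T.
import numpy as np
from scipy.linalg import solve_sylvester

def train_sae(X, S, lam=0.2):
    """X: (d, N) visual features; S: (k, N) semantic/attribute vectors."""
    A = S @ S.T                       # (k, k)
    B = lam * (X @ X.T)               # (d, d)
    C = (1.0 + lam) * (S @ X.T)       # (k, d)
    return solve_sylvester(A, B, C)   # W: (k, d); encoder = W, decoder = W.T

def predict_unseen(W, x_test, unseen_protos):
    """Project a test feature into semantic space and match unseen class prototypes."""
    s_hat = W @ x_test
    s_hat = s_hat / np.linalg.norm(s_hat)
    P = unseen_protos / np.linalg.norm(unseen_protos, axis=1, keepdims=True)
    return int(np.argmax(P @ s_hat))
```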