Self-supervised learning of a facial attribute embedding from video
We propose a self-supervised framework for learning facial attributes by
simply watching videos of a human face speaking, laughing, and moving over
time. To perform this task, we introduce a network, Facial Attributes-Net
(FAb-Net), that is trained to embed multiple frames from the same video
face-track into a common low-dimensional space. With this approach, we make
three contributions: first, we show that the network can leverage information
from multiple source frames by predicting confidence/attention masks for each
frame; second, we demonstrate that using a curriculum learning regime improves
the learned embedding; finally, we demonstrate that the network learns a
meaningful face embedding that encodes information about head pose, facial
landmarks and facial expression, i.e. facial attributes, without having been
supervised with any labelled data. We perform comparably to or better than
state-of-the-art self-supervised methods on these tasks and approach the
performance of supervised methods.
Comment: To appear in BMVC 2018. Supplementary material can be found at
http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.htm
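The sketch below is a minimal PyTorch illustration of the core idea described above: several source frames are embedded into a common low-dimensional space and fused via per-frame confidence/attention weights. It is an assumption-laden sketch, not the authors' released FAb-Net code; layer sizes, the 64x64 input resolution, and module names are illustrative.

```python
# Minimal sketch of a FAb-Net-style multi-frame embedding with per-frame
# confidence/attention weights. Illustrative assumptions throughout; this is
# not the authors' implementation.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Maps a face frame (assumed 3x64x64) to a low-dimensional embedding."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)
        # Scalar confidence score per frame, used as an attention weight.
        self.confidence = nn.Linear(128, 1)

    def forward(self, x):
        h = self.conv(x).flatten(1)             # (B, 128)
        return self.fc(h), self.confidence(h)   # embedding, confidence logit

def combine_source_frames(encoder, frames):
    """Embed multiple source frames and fuse them with softmax attention."""
    # frames: (B, N, 3, 64, 64) -> per-frame embeddings and confidences
    b, n = frames.shape[:2]
    emb, conf = encoder(frames.flatten(0, 1))
    emb, conf = emb.view(b, n, -1), conf.view(b, n, 1)
    attn = torch.softmax(conf, dim=1)           # confidence/attention per frame
    return (attn * emb).sum(dim=1)              # fused embedding (B, embed_dim)

if __name__ == "__main__":
    enc = FrameEncoder()
    fused = combine_source_frames(enc, torch.randn(2, 4, 3, 64, 64))
    print(fused.shape)  # torch.Size([2, 256])
```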
Zero-shot audio captioning with audio-language model guidance and audio context keywords
Zero-shot audio captioning aims at automatically generating descriptive
textual captions for audio content without prior training for this task.
Different from speech recognition which translates audio content that contains
spoken language into text, audio captioning is commonly concerned with ambient
sounds, or sounds produced by a human performing an action. Inspired by
zero-shot image captioning methods, we propose ZerAuCap, a novel framework for
summarising such general audio signals in a text caption without requiring
task-specific training. In particular, our framework exploits a pre-trained
large language model (LLM) for generating the text which is guided by a
pre-trained audio-language model to produce captions that describe the audio
content. Additionally, we use audio context keywords that prompt the language
model to generate text that is broadly relevant to sounds. Our proposed
framework achieves state-of-the-art results in zero-shot audio captioning on
the AudioCaps and Clotho datasets. Our code is available at
https://github.com/ExplainableML/ZerAuCap.
Comment: NeurIPS 2023 - Machine Learning for Audio Workshop (Oral)
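As a rough, framework-agnostic illustration of the guided-decoding idea, the sketch below lets a language model propose candidate next words while an audio-language model re-scores the candidate captions against the audio, with keywords placed in the prompt. The prompt wording, guidance weight `alpha`, and the stand-in callables are assumptions, not the released ZerAuCap implementation.

```python
# Illustrative sketch of audio-guided zero-shot captioning: an LLM proposes
# candidate next words, and a CLAP-style audio-language model re-scores the
# candidate captions against the audio. `lm_candidates` and
# `audio_text_similarity` are stand-ins for pre-trained models.
from typing import Callable, List

def guided_caption(
    audio,                                   # raw audio (placeholder)
    keywords: List[str],                     # audio context keywords
    lm_candidates: Callable[[str, int], List[str]],
    audio_text_similarity: Callable[[object, str], float],
    alpha: float = 0.7,                      # weight of audio guidance (assumption)
    max_words: int = 12,
) -> str:
    prompt = f"Keywords: {', '.join(keywords)}. This sound is"
    caption = prompt
    for _ in range(max_words):
        candidates = lm_candidates(caption, 5)    # top-k words from the LLM
        scored = []
        for rank, word in enumerate(candidates):
            lm_score = 1.0 - rank / len(candidates)              # LM rank prior
            audio_score = audio_text_similarity(audio, caption + " " + word)
            scored.append((alpha * audio_score + (1 - alpha) * lm_score, word))
        best = max(scored)[1]
        caption += " " + best
        if best.endswith("."):
            break
    return caption[len(prompt):].strip()

if __name__ == "__main__":
    # Toy stubs so the sketch runs end-to-end without any model weights.
    fake_lm = lambda text, k: ["a", "dog", "barking", "loudly", "outside."][:k]
    fake_sim = lambda audio, text: float(len(text) % 7) / 7.0
    print(guided_caption(None, ["dog", "bark"], fake_lm, fake_sim))
```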
Video-adverb retrieval with compositional adverb-action embeddings
Retrieving adverbs that describe an action in a video is a crucial step
towards fine-grained video understanding. We propose a framework for
video-to-adverb retrieval (and vice versa) that aligns video embeddings with
their matching compositional adverb-action text embedding in a joint embedding
space. The compositional adverb-action text embedding is learned using a
residual gating mechanism, along with a novel training objective consisting of
triplet losses and a regression target. Our method achieves state-of-the-art
performance on five recent benchmarks for video-adverb retrieval. Furthermore,
we introduce dataset splits to benchmark video-adverb retrieval for unseen
adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet
Adverbs datasets. Our proposed framework outperforms all prior works for the
generalisation task of retrieving adverbs from videos for unseen adverb-action
compositions. Code and dataset splits are available at
https://hummelth.github.io/ReGaDa/.
Comment: BMVC 2023 (Oral)
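The following PyTorch sketch illustrates a compositional adverb-action text embedding with residual gating, aligned to video embeddings via a triplet loss in a joint space. The gating form, embedding dimension, and loss details are assumptions for illustration and do not reproduce the released ReGaDa code.

```python
# Sketch of a compositional adverb-action embedding with residual gating,
# loosely following the abstract; layer sizes and the exact gating form are
# assumptions, not the released ReGaDa code.
import torch
import torch.nn as nn

class AdverbActionEmbedding(nn.Module):
    def __init__(self, num_actions: int, num_adverbs: int, dim: int = 300):
        super().__init__()
        self.action = nn.Embedding(num_actions, dim)
        self.adverb = nn.Embedding(num_adverbs, dim)
        # Residual gating: the adverb modulates the action embedding and the
        # gated update is added back to the action representation.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.delta = nn.Linear(2 * dim, dim)

    def forward(self, action_idx, adverb_idx):
        a = self.action(action_idx)
        m = self.adverb(adverb_idx)
        pair = torch.cat([a, m], dim=-1)
        out = a + self.gate(pair) * self.delta(pair)    # residual gating
        return nn.functional.normalize(out, dim=-1)

def triplet_alignment_loss(video_emb, pos_text, neg_text, margin: float = 0.2):
    """Pull a video embedding towards its matching adverb-action embedding
    and push it away from a mismatched one."""
    d_pos = 1 - nn.functional.cosine_similarity(video_emb, pos_text)
    d_neg = 1 - nn.functional.cosine_similarity(video_emb, neg_text)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

if __name__ == "__main__":
    model = AdverbActionEmbedding(num_actions=10, num_adverbs=6)
    text = model(torch.tensor([1, 2]), torch.tensor([3, 0]))
    video = nn.functional.normalize(torch.randn(2, 300), dim=-1)
    print(triplet_alignment_loss(video, text, text.flip(0)).item())
```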
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
Providing explanations in the context of Visual Question Answering (VQA)
presents a fundamental problem in machine learning. To obtain detailed insights
into the process of generating natural language explanations for VQA, we
introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with
natural language explanations. For each image-question pair in the CLEVR
dataset, CLEVR-X contains multiple structured textual explanations which are
derived from the original scene graphs. By construction, the CLEVR-X
explanations are correct and describe the reasoning and visual information that
is necessary to answer a given question. We conducted a user study to confirm
that the ground-truth explanations in our proposed dataset are indeed complete
and relevant. We present baseline results for generating natural language
explanations in the context of VQA using two state-of-the-art frameworks on the
CLEVR-X dataset. Furthermore, we provide a detailed analysis of the explanation
generation quality for different question and answer types. Additionally, we
study the influence of using different numbers of ground-truth explanations on
the convergence of natural language generation (NLG) metrics. The CLEVR-X
dataset is publicly available at
https://explainableml.github.io/CLEVR-X/.
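To illustrate why explanations derived from scene graphs are correct by construction, here is a hypothetical, heavily simplified sketch of templated explanation generation from a CLEVR-style scene graph; it is not the CLEVR-X generation pipeline, and the attribute keys and templates are assumptions.

```python
# Hypothetical illustration: objects satisfying the question's filters are
# selected from the scene graph and verbalised from their attributes, so the
# resulting explanation is consistent with the answer by construction.
scene_graph = [
    {"shape": "cube", "color": "red", "size": "large", "material": "metal"},
    {"shape": "sphere", "color": "blue", "size": "small", "material": "rubber"},
    {"shape": "cube", "color": "blue", "size": "small", "material": "rubber"},
]

def explain_count(scene, **filters):
    """Answer a counting question and return a templated explanation."""
    matches = [o for o in scene if all(o[k] == v for k, v in filters.items())]
    answer = len(matches)
    described = ", ".join(
        f"a {o['size']} {o['color']} {o['material']} {o['shape']}" for o in matches
    ) or "no matching object"
    constraint = " ".join(filters.values())
    explanation = f"There are {answer} {constraint} objects: {described}."
    return answer, explanation

if __name__ == "__main__":
    ans, expl = explain_count(scene_graph, color="blue")
    print(ans)   # 2
    print(expl)  # There are 2 blue objects: a small blue rubber sphere, ...
```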
Addressing caveats of neural persistence with deep graph persistence
Neural Persistence is a prominent measure for quantifying neural network
complexity, proposed in the emerging field of topological data analysis in deep
learning. In this work, however, we find both theoretically and empirically
that the variance of network weights and spatial concentration of large weights
are the main factors that impact neural persistence. Whilst this captures
useful information for linear classifiers, we find that no relevant spatial
structure is present in later layers of deep neural networks, making neural
persistence roughly equivalent to the variance of weights. Additionally, the
proposed averaging procedure across layers for deep neural networks does not
consider interaction between layers. Based on our analysis, we propose an
extension of the filtration underlying neural persistence to the whole neural
network instead of single layers, which is equivalent to calculating neural
persistence on one particular matrix. This yields our deep graph persistence
measure, which implicitly incorporates persistent paths through the network and
alleviates variance-related issues through standardisation. Code is available
at https://github.com/ExplainableML/Deep-Graph-Persistence
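As a rough illustration of the computation involved, the sketch below derives zero-dimensional persistence pairs for a single weight matrix by adding edges in order of decreasing absolute weight and recording the value at which two connected components merge (a union-find pass). Normalisation and the exact filtration and aggregation used in the paper are simplified assumptions.

```python
# Minimal sketch of the zero-dimensional persistence computation underlying
# (deep graph) neural persistence, applied to one weight matrix treated as a
# weighted bipartite graph. Simplified; not the released code.
import numpy as np

def zero_dim_persistence(weights: np.ndarray) -> np.ndarray:
    """weights[i, j] connects input node i to output node j.
    Returns the edge strengths at which components merge (death values)."""
    n_in, n_out = weights.shape
    w = np.abs(weights) / np.abs(weights).max()        # scale to [0, 1]
    # Edge list (strength, input node, output node), strongest first.
    edges = sorted(
        ((w[i, j], i, n_in + j) for i in range(n_in) for j in range(n_out)),
        reverse=True,
    )
    parent = list(range(n_in + n_out))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for strength, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                  # this edge merges two components
            parent[ru] = rv
            deaths.append(strength)   # component dies at this filtration value
    return np.array(deaths)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer = rng.normal(size=(16, 8))
    d = zero_dim_persistence(layer)
    # p=2 norm of the persistence pairs with births fixed at the maximum
    # filtration value 1, in the spirit of neural persistence.
    print(np.linalg.norm(1.0 - d))
```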
Image-free Classifier Injection for Zero-Shot Classification
Zero-shot learning models achieve remarkable results on image classification
for samples from classes that were not seen during training. However, such
models must be trained from scratch with specialised methods: therefore, access
to a training dataset is required when the need for zero-shot classification
arises. In this paper, we aim to equip pre-trained models with zero-shot
classification capabilities without the use of image data. We achieve this with
our proposed Image-free Classifier Injection with Semantics (ICIS) that injects
classifiers for new, unseen classes into pre-trained classification models in a
post-hoc fashion without relying on image data. Instead, the existing
classifier weights and simple class-wise descriptors, such as class names or
attributes, are used. ICIS has two encoder-decoder networks that learn to
reconstruct classifier weights from descriptors (and vice versa), exploiting
(cross-)reconstruction and cosine losses to regularise the decoding process.
Notably, ICIS can be cheaply trained and applied directly on top of pre-trained
classification models. Experiments on benchmark ZSL datasets show that ICIS
produces unseen classifier weights that achieve strong (generalised) zero-shot
classification performance. Code is available at
https://github.com/ExplainableML/ImageFreeZSL.
Comment: Accepted at ICCV 2023
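The sketch below captures the ICIS idea under simplifying assumptions (dimensions, loss weighting, and architecture details are illustrative, not the released code): descriptor and weight encoders map into a shared latent space, decoders reconstruct classifier weights and descriptors with (cross-)reconstruction and cosine terms, and unseen-class weights are decoded from descriptors alone and injected into the frozen classifier.

```python
# Sketch of image-free classifier injection with semantics (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(d_in, d_out, hidden=512):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

class ICISSketch(nn.Module):
    def __init__(self, desc_dim=85, weight_dim=2048, latent=256):
        super().__init__()
        self.enc_desc = mlp(desc_dim, latent)     # descriptor -> latent
        self.enc_w = mlp(weight_dim, latent)      # classifier weight -> latent
        self.dec_desc = mlp(latent, desc_dim)     # latent -> descriptor
        self.dec_w = mlp(latent, weight_dim)      # latent -> classifier weight

    def loss(self, desc, weights):
        zd, zw = self.enc_desc(desc), self.enc_w(weights)
        # (Cross-)reconstruction: decode weights/descriptors from both latents.
        rec = (F.mse_loss(self.dec_w(zw), weights)
               + F.mse_loss(self.dec_desc(zd), desc)
               + F.mse_loss(self.dec_w(zd), weights)      # cross-reconstruction
               + F.mse_loss(self.dec_desc(zw), desc))     # cross-reconstruction
        cos = 1 - F.cosine_similarity(zd, zw).mean()       # align the two latents
        return rec + cos

    @torch.no_grad()
    def inject(self, unseen_desc):
        """Predict classifier weights for unseen classes from descriptors only."""
        return self.dec_w(self.enc_desc(unseen_desc))

if __name__ == "__main__":
    model = ICISSketch()
    seen_desc, seen_w = torch.randn(40, 85), torch.randn(40, 2048)
    print(model.loss(seen_desc, seen_w).item())
    new_weights = model.inject(torch.randn(10, 85))   # rows for 10 unseen classes
    print(new_weights.shape)                          # torch.Size([10, 2048])
```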
A sound approach: using large language models to generate audio descriptions for egocentric text-audio retrieval
Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio information from video-text datasets, we introduce a methodology for generating audio-centric descriptions using Large Language Models (LLMs). In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks, and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions gives significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that using the same prompts, we can successfully employ LLMs to improve the retrieval on EpicSounds, compared to using the original audio class labels of the dataset. Finally, we confirm that LLMs can be used to determine the difficulty of identifying the action associated with a sound.
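A minimal sketch of the core step, turning visual-centric narrations into audio-centric descriptions via an LLM prompt, is shown below. The prompt wording and the `llm` callable are assumptions for illustration, not the paper's exact prompts or models.

```python
# Illustrative sketch: rewrite visual-centric egocentric narrations into
# audio-centric descriptions for text-audio retrieval using an LLM.
from typing import Callable, List

PROMPT_TEMPLATE = (
    "The following sentence describes what is seen in an egocentric video: "
    "'{narration}'. Describe only what could be HEARD in this clip, in one "
    "short sentence focusing on sounds."
)

def audio_centric_descriptions(narrations: List[str], llm: Callable[[str], str]) -> List[str]:
    """Query the LLM once per visual narration and collect audio descriptions."""
    return [llm(PROMPT_TEMPLATE.format(narration=n)) for n in narrations]

if __name__ == "__main__":
    # Stub LLM so the sketch runs; in practice this would call a real model.
    stub_llm = lambda prompt: "The sound of a knife chopping vegetables on a wooden board."
    out = audio_centric_descriptions(["#C C cuts a carrot on the chopping board"], stub_llm)
    print(out[0])
```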
Waffling around for Performance: Visual Classification with Random Words and Broad Concepts
The visual classification performance of vision-language models such as CLIP
has been shown to benefit from additional semantic knowledge from large
language models (LLMs) such as GPT-3. In particular, averaging over
LLM-generated class descriptors, e.g. "waffle, which has a round shape", can
notably improve generalization performance. In this work, we critically study
this behavior and propose WaffleCLIP, a framework for zero-shot visual
classification which simply replaces LLM-generated descriptors with random
character and word descriptors. Without querying external models, we achieve
comparable performance gains on a large number of visual classification tasks.
This allows WaffleCLIP to both serve as a low-cost alternative, as well as a
sanity check for any future LLM-based vision-language model extensions. We
conduct an extensive experimental study on the impact and shortcomings of
additional semantics introduced with LLM-generated descriptors, and showcase
how - if available - semantic context is better leveraged by querying LLMs for
high-level concepts, which we show can be done to jointly resolve potential
class name ambiguities. Code is available here:
https://github.com/ExplainableML/WaffleCLIP.
Comment: Accepted to ICCV 2023. Main paper with 9 pages
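The sketch below shows the WaffleCLIP-style prompt construction described above: each class name is paired with random word and random character descriptors, and the class embedding is the average of the resulting text embeddings. The prompt template, descriptor counts, word pool, and the `encode_text` stub (standing in for a CLIP text encoder) are assumptions, not the released code.

```python
# Sketch of zero-shot classification prompts built from random word and
# random character descriptors, averaged into one class embedding.
import random
import string
import numpy as np

WORD_POOL = ["photo", "object", "thing", "item", "scene", "entity"]  # placeholder vocabulary

def random_descriptors(n: int = 4, char_len: int = 8):
    descs = []
    for _ in range(n):
        if random.random() < 0.5:
            descs.append(" ".join(random.sample(WORD_POOL, 2)))  # random words
        else:
            descs.append("".join(random.choices(string.ascii_lowercase, k=char_len)))  # random chars
    return descs

def waffle_prompts(class_name: str) -> list:
    return [f"A photo of a {class_name}, which has {d}." for d in random_descriptors()]

def class_embedding(class_name: str, encode_text) -> np.ndarray:
    """Average the embeddings of all randomly-descriptored prompts for a class."""
    embs = np.stack([encode_text(p) for p in waffle_prompts(class_name)])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)        # unit norm for cosine classification

if __name__ == "__main__":
    random.seed(0)
    # Deterministic stub text encoder so the sketch runs without CLIP weights.
    stub_encoder = lambda text: np.random.default_rng(abs(hash(text)) % 2**32).normal(size=512)
    emb = class_embedding("golden retriever", stub_encoder)
    print(emb.shape, round(float(np.linalg.norm(emb)), 3))  # (512,) 1.0
```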