Deep Multimodal Feature Encoding for Video Ordering
True understanding of a video comes from a joint analysis of all of its
modalities: the video frames, the audio track, and any accompanying text such
as closed captions. We present a way to learn a compact multimodal feature
representation that encodes all these modalities. Our model parameters are
learned through a proxy task of inferring the temporal ordering of a set of
unordered videos in a timeline. To this end, we create a new multimodal dataset
for temporal ordering that consists of approximately 30K scenes (2-6 clips per
scene) based on the "Large Scale Movie Description Challenge". We analyze and
evaluate the individual and joint modalities on two challenging tasks: (i)
inferring the temporal ordering of a set of videos and (ii) action
recognition. We demonstrate empirically that multimodal representations are
indeed complementary, and can play a key role in improving the performance of
many applications.
Comment: IEEE International Conference on Computer Vision (ICCV) Workshop on Large Scale Holistic Video Understanding. The datasets and code are available at https://github.com/vivoutlaw/tcb
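A minimal sketch of the kind of ordering proxy task the abstract describes, not the authors' implementation: per-clip video, audio, and text features are fused by concatenation and a classifier predicts which permutation of the shuffled clips restores the original order. The feature dimensions, fusion strategy, and fixed clip count are assumptions for illustration.

```python
# Sketch only: permutation-classification proxy task over fused multimodal clip features.
import itertools
import torch
import torch.nn as nn

N_CLIPS = 3                                  # assumed; the dataset has 2-6 clips per scene
PERMS = list(itertools.permutations(range(N_CLIPS)))

class OrderingProxy(nn.Module):
    def __init__(self, vid_dim=512, aud_dim=128, txt_dim=300, hid=256):
        super().__init__()
        self.clip_enc = nn.Sequential(nn.Linear(vid_dim + aud_dim + txt_dim, hid), nn.ReLU())
        self.order_head = nn.Linear(N_CLIPS * hid, len(PERMS))

    def forward(self, vid, aud, txt):
        # vid/aud/txt: (batch, N_CLIPS, dim) pre-extracted per-modality features
        x = torch.cat([vid, aud, txt], dim=-1)   # late fusion by concatenation
        h = self.clip_enc(x)                     # (batch, N_CLIPS, hid)
        return self.order_head(h.flatten(1))     # logits over candidate orderings

model = OrderingProxy()
vid, aud, txt = torch.randn(4, N_CLIPS, 512), torch.randn(4, N_CLIPS, 128), torch.randn(4, N_CLIPS, 300)
target = torch.randint(len(PERMS), (4,))         # index of the true ordering
loss = nn.functional.cross_entropy(model(vid, aud, txt), target)
```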
Test of Time: Instilling Video-Language Models with a Sense of Time
Modelling and understanding time remains a challenge in contemporary video
understanding models. With language emerging as a key driver towards powerful
generalization, it is imperative for foundational video-language models to have
a sense of time. In this paper, we consider a specific aspect of temporal
understanding: consistency of time order as elicited by before/after relations.
We establish that seven existing video-language models struggle to understand
even such simple temporal relations. We then question whether it is feasible to
equip these foundational models with temporal awareness without re-training
them from scratch. Towards this, we propose a temporal adaptation recipe on top
of one such model, VideoCLIP, based on post-pretraining on a small amount of
video-text data. We conduct a zero-shot evaluation of the adapted models on six
datasets for three downstream tasks which require varying degrees of time
awareness. We observe encouraging performance gains especially when the task
needs higher time awareness. Our work serves as a first step towards probing
and instilling a sense of time in existing video-language models without the
need for data- and compute-intensive training from scratch.
Comment: Accepted for publication at CVPR 2023. Project page: https://bpiyush.github.io/testoftime-website/index.htm
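A hedged sketch of a time-order objective in the spirit of the post-pretraining recipe described above, not the paper's actual adaptation of VideoCLIP: a two-clip video embedding is contrasted against captions describing the correct versus reversed before/after order. The encoders, embedding size, and temperature are placeholders.

```python
# Sketch only: 2-way contrastive loss favoring the caption with the correct before/after order.
import torch
import torch.nn.functional as F

def time_order_loss(video_emb, cap_correct_emb, cap_reversed_emb, tau=0.07):
    """video_emb: (B, D) embedding of [clip_a; clip_b] in true temporal order.
    cap_*_emb: (B, D) embeddings of the 'A before B' vs. 'B before A' captions."""
    v = F.normalize(video_emb, dim=-1)
    c_pos = F.normalize(cap_correct_emb, dim=-1)
    c_neg = F.normalize(cap_reversed_emb, dim=-1)
    pos = (v * c_pos).sum(-1) / tau                     # similarity to correct order
    neg = (v * c_neg).sum(-1) / tau                     # similarity to reversed order
    logits = torch.stack([pos, neg], dim=-1)            # per-sample 2-way choice
    labels = torch.zeros(v.size(0), dtype=torch.long)   # correct order is index 0
    return F.cross_entropy(logits, labels)

loss = time_order_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```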
Eye vs. AI: Human Gaze and Model Attention in Video Memorability
Understanding the factors that determine video memorability has important
applications in areas such as educational technology and advertising. Towards
this goal, we investigate the semantic and temporal attention mechanisms
underlying video memorability. We propose a Transformer-based model with
spatio-temporal attention that matches SoTA performance on video memorability
prediction on a large naturalistic video dataset. More importantly, the
self-attention patterns show us where the model looks to predict memorability.
We compare model attention against human gaze fixation density maps collected
through a small-scale eye-tracking experiment where humans perform a video
memory task. Quantitative saliency metrics show that the model attention and
human gaze follow similar patterns. Furthermore, while panoptic segmentation
confirms that both the model and humans attend more to thing classes, the stuff
classes that receive increased or decreased attention tend to have higher
memorability scores. We also observe that the model assigns greater importance
to the initial frames, mimicking temporal attention patterns found in humans.
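A minimal sketch of how model attention can be compared against gaze fixation density, assuming a standard saliency agreement metric (linear correlation coefficient, CC) rather than the paper's exact evaluation code; the grid size and inputs are illustrative.

```python
# Sketch only: CC between a model attention map and a human fixation density map.
import numpy as np

def correlation_coefficient(attn_map, fixation_map):
    """Both inputs: 2D arrays over the same frame grid; higher CC = closer agreement."""
    a = (attn_map - attn_map.mean()) / (attn_map.std() + 1e-8)
    f = (fixation_map - fixation_map.mean()) / (fixation_map.std() + 1e-8)
    return float((a * f).mean())

attn = np.random.rand(14, 14)   # e.g. self-attention averaged over heads, on a patch grid
gaze = np.random.rand(14, 14)   # fixation density map downsampled to the same grid
print(correlation_coefficient(attn, gaze))
```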
Generalized Cross-domain Multi-label Few-shot Learning for Chest X-rays
Real-world application of chest X-ray abnormality classification requires
dealing with several challenges: (i) limited training data; (ii) training and
evaluation sets that are derived from different domains; and (iii) classes seen
during training that only partially overlap with the classes of interest during
evaluation. To address these challenges, we present an integrated framework
called Generalized Cross-Domain Multi-Label Few-Shot Learning (GenCDML-FSL).
The framework supports partial class overlap between training and evaluation as
well as cross-domain transfer, adopts meta-learning to learn from few training
samples, and assumes that each chest X-ray image is either normal or associated
with one or more abnormalities. Furthermore, we propose Generalized Episodic
Training (GenET), a training strategy that equips models to operate with
multiple challenges observed in the GenCDML-FSL scenario. Comparisons with
well-established methods such as transfer learning, hybrid transfer learning,
and multi-label meta-learning on multiple datasets show the superiority of our
approach.
Comment: 17 pages
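A hedged sketch of multi-label episodic sampling in the spirit of the GenET setting described above; the exact episode construction in the paper may differ. Each image can carry several abnormality labels, and an episode draws support and query examples from only a subset of the label space. The dataset format and episode parameters are assumptions.

```python
# Sketch only: build one multi-label few-shot episode from (image_id, label_set) pairs.
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=3, q_queries=5):
    """dataset: list of (image_id, set_of_labels). Returns (support, query) lists."""
    all_classes = sorted({c for _, labels in dataset for c in labels})
    episode_classes = set(random.sample(all_classes, n_way))
    # keep images that share at least one label with the episode's classes
    pool = [(img, labels & episode_classes) for img, labels in dataset
            if labels & episode_classes]
    random.shuffle(pool)
    per_class, support, query = defaultdict(int), [], []
    for img, labels in pool:
        if any(per_class[c] < k_shot for c in labels):
            support.append((img, labels))          # image may add to several classes at once
            for c in labels:
                per_class[c] += 1
        elif len(query) < q_queries * n_way:
            query.append((img, labels))
    return support, query
```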