Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
Computer vision has a great potential to help our daily lives by searching
for lost keys, watering flowers or reminding us to take a pill. To succeed with
such tasks, computer vision methods need to be trained from real and diverse
examples of our daily dynamic scenes. Since most of these scenes are not
particularly exciting, they typically do not appear on YouTube, in movies, or in
TV broadcasts. So how do we collect sufficiently many diverse but boring samples
representing our lives? We propose a novel Hollywood in Homes approach to
collect such data. Instead of shooting videos in the lab, we ensure diversity
by distributing and crowdsourcing the whole process of video creation from
script writing to video recording and annotation. Following this procedure we
collect a new dataset, Charades, with hundreds of people recording videos in
their own homes, acting out casual everyday activities. The dataset is composed
of 9,848 annotated videos with an average length of 30 seconds, showing
activities of 267 people from three continents. Each video is annotated by
multiple free-text descriptions, action labels, action intervals and classes of
interacted objects. In total, Charades provides 27,847 video descriptions,
66,500 temporally localized intervals for 157 action classes and 41,104 labels
for 46 object classes. Using this rich data, we evaluate and provide baseline
results for several tasks including action recognition and automatic
description generation. We believe that the realism, diversity, and casual
nature of this dataset will present unique challenges and new opportunities for
the computer vision community.
Learning text-to-video retrieval from image captioning
We describe a protocol to study text-to-video retrieval training with
unlabeled videos, where we assume (i) no access to labels for any videos, i.e.,
no access to the set of ground-truth captions, but (ii) access to labeled
images in the form of text. Using image expert models is a realistic scenario
given that annotating images is cheaper and therefore more scalable, in contrast to
expensive video labeling schemes. Recently, zero-shot image experts such as
CLIP have established a new strong baseline for video understanding tasks. In
this paper, we make use of this progress and instantiate the image experts from
two types of models: a text-to-image retrieval model to provide an initial
backbone, and image captioning models to provide supervision signal into
unlabeled videos. We show that automatically labeling video frames with image
captioning allows text-to-video retrieval training. This process adapts the
features to the target domain at no manual annotation cost, consequently
outperforming the strong zero-shot CLIP baseline. During training, we sample
captions from multiple video frames that best match the visual content, and
perform a temporal pooling over frame representations by scoring frames
according to their relevance to each caption. We conduct extensive ablations to
provide insights and demonstrate the effectiveness of this simple framework by
outperforming the CLIP zero-shot baselines on text-to-video retrieval on three
standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
Comment: A short version of this work appeared at CVPR 2023 Workshops. Project page: https://imagine.enpc.fr/~ventural/multicaps
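As a loose illustration of the caption-based pseudo-labelling described in this abstract, the sketch below captions sampled frames with an off-the-shelf image captioner and keeps the caption that best matches the visual content under a CLIP-style similarity. The callables captioner, clip_image_enc and clip_text_enc are hypothetical stand-ins, not the authors' interface.

    # Hypothetical sketch, not the paper's code: pseudo-label a video with the
    # frame caption that best agrees with all sampled frames.
    import torch
    import torch.nn.functional as F

    def select_pseudo_caption(frames, captioner, clip_image_enc, clip_text_enc):
        captions = [captioner(f) for f in frames]                                     # one caption per frame
        img = F.normalize(torch.stack([clip_image_enc(f) for f in frames]), dim=-1)   # (T, D)
        txt = F.normalize(torch.stack([clip_text_enc(c) for c in captions]), dim=-1)  # (T, D)
        scores = (txt @ img.t()).mean(dim=1)    # mean similarity of each caption to all frames
        return captions[scores.argmax().item()]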
TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
In this paper, we present TMR, a simple yet effective approach for text-to-3D
human motion retrieval. While previous work has only treated retrieval as a
proxy evaluation metric, we tackle it as a standalone task. Our method extends
the state-of-the-art text-to-motion synthesis model TEMOS, and incorporates a
contrastive loss to better structure the cross-modal latent space. We show that
maintaining the motion generation loss, along with the contrastive training, is
crucial to obtain good performance. We introduce a benchmark for evaluation and
provide an in-depth analysis by reporting results on several protocols. Our
extensive experiments on the KIT-ML and HumanML3D datasets show that TMR
outperforms the prior work by a significant margin, for example reducing the
median rank from 54 to 19. Finally, we showcase the potential of our approach
on moment retrieval. Our code and models are publicly available.
Comment: arXiv preprint, project page: https://mathis.petrovich.fr/tmr
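For intuition on the objective described above, here is a hedged sketch: a symmetric InfoNCE term over text and motion latents is added to, rather than replacing, the motion generation (reconstruction) loss. The temperature, loss weight and function names are our assumptions, not necessarily TMR's exact formulation.

    # Illustrative sketch of a contrastive + generation objective (weights are assumptions).
    import torch
    import torch.nn.functional as F

    def tmr_style_loss(text_z, motion_z, recon_loss, temperature=0.1, lambda_nce=0.1):
        text_z = F.normalize(text_z, dim=-1)      # (B, D) text latents
        motion_z = F.normalize(motion_z, dim=-1)  # (B, D) motion latents
        logits = text_z @ motion_z.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        nce = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
        return recon_loss + lambda_nce * nce      # keep the generation loss, as stressed above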
A CLIP-Hitchhiker's guide to long video retrieval
Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning temporal aggregation that outperforms mean-pooling the image-level representations extracted per frame by CLIP. We find that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement over all prior temporal modelling attempts and mean-pooling. In doing so, we provide an improved baseline for others to compare to and demonstrate state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
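A minimal sketch of the query-scored weighted-mean baseline described above, shown next to plain mean-pooling; the softmax temperature and function names are illustrative assumptions.

    # Query-scored weighted mean of per-frame CLIP embeddings vs. plain mean-pooling.
    import torch
    import torch.nn.functional as F

    def query_scored_video_embedding(frame_emb, query_emb, temperature=0.07):
        # frame_emb: (T, D) CLIP image features per frame; query_emb: (D,) CLIP text feature.
        frame_emb = F.normalize(frame_emb, dim=-1)
        query_emb = F.normalize(query_emb, dim=-1)
        weights = F.softmax(frame_emb @ query_emb / temperature, dim=0)  # (T,)
        return F.normalize((weights.unsqueeze(1) * frame_emb).sum(dim=0), dim=-1)

    def mean_pooled_video_embedding(frame_emb):
        return F.normalize(frame_emb.mean(dim=0), dim=-1)                # the mean-pool baseline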
SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation
Our goal is to synthesize 3D human motions given textual inputs describing
simultaneous actions, for example 'waving hand' while 'walking' at the same
time. We refer to generating such simultaneous movements as performing 'spatial
compositions'. In contrast to temporal compositions that seek to transition
from one action to another, spatial compositing requires understanding which
body parts are involved in which action, to be able to move them
simultaneously. Motivated by the observation that the correspondence between
actions and body parts is encoded in powerful language models, we extract this
knowledge by prompting GPT-3 with text such as "what are the body parts
involved in the action <action name>?", while also providing the parts list and
few-shot examples. Given this action-part mapping, we combine body parts from
two motions together and establish the first automated method to spatially
compose two actions. However, training data with compositional actions is
always limited by the combinatorics. Hence, we further create synthetic data
with this approach, and use it to train a new state-of-the-art text-to-motion
generation model, called SINC ("SImultaneous actioN Compositions for 3D human
motions"). In our experiments, that training with such GPT-guided synthetic
data improves spatial composition generation over baselines. Our code is
publicly available at https://sinc.is.tue.mpg.de/.
Comment: ICCV 2023 Camera Ready
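To make the spatial-composition step concrete, here is a hedged sketch of copying the body parts assigned to a second action into the motion of the first. The part dictionary, joint indices and array layout are illustrative assumptions, not SINC's actual data format.

    # Illustrative spatial composition given a GPT-derived action-to-body-part mapping.
    import numpy as np

    PART_TO_JOINTS = {               # hypothetical mapping, e.g. obtained via GPT-3 prompts
        "right arm": [17, 19, 21],
        "left leg": [1, 4, 7, 10],
    }

    def compose_motions(motion_a, motion_b, parts_for_b):
        # motion_*: (T, J, 3) pose parameters for two actions of equal length T.
        composed = motion_a.copy()
        for part in parts_for_b:
            joints = PART_TO_JOINTS[part]
            composed[:, joints, :] = motion_b[:, joints, :]  # move these parts as in action B
        return composed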
Lost in translation, found in context: sign language translation with contextual cues
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translations of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These are automatically extracted and fed, along with the visual features, to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL, the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.
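As a simplified illustration of how such contextual cues could be assembled into the textual side of the LLM input (field names and prompt wording are our assumptions; in the described framework, projected visual features are supplied alongside this text):

    # Hypothetical prompt assembly from the three contextual cues.
    def build_translation_prompt(background_caption, previous_translations, pseudo_glosses):
        context = "\n".join([
            f"Background: {background_caption}",
            f"Previous sentences: {' '.join(previous_translations)}",
            f"Pseudo-glosses: {' '.join(pseudo_glosses)}",
        ])
        # Visual sign-recognition features would be projected to token embeddings and
        # prepended before the fine-tuned LLM generates the translation.
        return context + "\nTranslation:"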
Sign language video retrieval with free-form textual queries
Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with textual queries: given a written query (e.g. a sentence) and a large collection of sign language videos, the objective is to find the signing video that best matches the written query. We propose to tackle this task by learning cross-modal embeddings on the recently introduced large-scale How2Sign dataset of American Sign Language (ASL). We identify that a key bottleneck in the performance of the system is the quality of the sign video embedding, which suffers from a scarcity of labelled training data. We therefore propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task.
This work was supported by the project PID2020-117142GB-I00, funded by MCIN/AEI/10.13039/501100011033, ANR project CorVis ANR-21-CE23-0003-01, and gifts from Google and Adobe. AD received support from la Caixa Foundation (ID 100010434), fellowship code LCF/BQ/IN18/11660029.
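A very rough sketch of the interleaved spot-and-align loop described above; the function handles and round count are placeholders, not the SPOT-ALIGN API.

    # Each round spots more signs with the current model, then retrains the embedding
    # on the enlarged annotation pool.
    def spot_align(videos, initial_annotations, train_fn, spot_fn, num_rounds=3):
        annotations = list(initial_annotations)
        model = train_fn(videos, annotations)            # initial sign-video embedding
        for _ in range(num_rounds):
            annotations.extend(spot_fn(model, videos))   # sign spotting expands the labels
            model = train_fn(videos, annotations)        # feature alignment / re-training
        return model, annotations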
Watch, read and lookup: learning to spot signs from multiple supervisors
The focus of this work is sign spotting - given a video of an isolated sign,
our task is to identify whether and where it has been signed in a continuous,
co-articulated sign language video. To achieve this sign spotting task, we
train a model using multiple types of available supervision by: (1) watching
existing sparsely labelled footage; (2) reading associated subtitles (readily
available translations of the signed content) which provide additional
weak-supervision; (3) looking up words (for which no co-articulated labelled
examples are available) in visual sign language dictionaries to enable novel
sign spotting. These three tasks are integrated into a unified learning
framework using the principles of Noise Contrastive Estimation and Multiple
Instance Learning. We validate the effectiveness of our approach on low-shot
sign spotting benchmarks. In addition, we contribute a machine-readable British
Sign Language (BSL) dictionary dataset of isolated signs, BSLDict, to
facilitate study of this task. The dataset, models and code are available at
our project page.
Comment: Appears in: Asian Conference on Computer Vision 2020 (ACCV 2020) - Oral presentation. 29 pages.
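For intuition, a hedged sketch of a MIL-NCE-style objective of the kind the abstract alludes to: candidate positives form a bag because subtitle supervision only weakly localizes each sign. Shapes, temperature and names are our assumptions, not the paper's exact formulation.

    # MIL-NCE-style loss: sum over a bag of candidate positives in the numerator.
    import torch

    def mil_nce_loss(video_emb, pos_bag, neg_bank, temperature=0.07):
        # video_emb: (D,); pos_bag: (P, D) candidate positives; neg_bank: (N, D) negatives.
        pos = torch.exp(pos_bag @ video_emb / temperature).sum()
        neg = torch.exp(neg_bank @ video_emb / temperature).sum()
        return -torch.log(pos / (pos + neg))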
AutoAD III: The Prequel -- Back to the Pixels
Generating Audio Description (AD) for movies is a challenging task that
requires fine-grained visual understanding and an awareness of the characters
and their names. Currently, visual language models for AD generation are
limited by a lack of suitable training data, and also their evaluation is
hampered by using performance measures not specialized to the AD domain. In
this paper, we make three contributions: (i) We propose two approaches for
constructing AD datasets with aligned video data, and build training and
evaluation datasets using these. These datasets will be publicly released; (ii)
We develop a Q-former-based architecture which ingests raw video and generates
AD, using frozen pre-trained visual encoders and large language models; and
(iii) We provide new evaluation metrics to benchmark AD quality that are
well-matched to human performance. Taken together, we improve the state of the
art on AD generation.
Comment: CVPR 2024. Project page: https://www.robots.ox.ac.uk/~vgg/research/autoad
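To visualize the kind of bridge this abstract mentions, here is a heavily simplified sketch of a Q-former-style module sitting between a frozen visual encoder and a frozen LLM: learnable query tokens cross-attend to visual features and are projected into the LLM's token space as a prefix. Dimensions, head counts and the single attention layer are illustrative assumptions.

    # Simplified Q-former-style bridge (not the AutoAD III implementation).
    import torch
    import torch.nn as nn

    class QFormerBridge(nn.Module):
        def __init__(self, vis_dim=1024, hidden=768, llm_dim=4096, num_queries=32):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, hidden))
            self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, kdim=vis_dim,
                                                    vdim=vis_dim, batch_first=True)
            self.to_llm = nn.Linear(hidden, llm_dim)   # project into the LLM token space

        def forward(self, visual_feats):               # (B, num_patches, vis_dim), frozen features
            q = self.queries.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
            out, _ = self.cross_attn(q, visual_feats, visual_feats)
            return self.to_llm(out)                    # (B, num_queries, llm_dim) prefix for the LLM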
AutoAD II: the sequel -- who, when, and what in movie audio description
Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we introduce a character bank consisting of the character's name, the actor who played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when -- we investigate several models for determining whether an AD should be generated for a time interval or not, based on the visual content of the interval and its neighbours; and (iii) what -- we implement a new vision-language model for this task, that can ingest the proposals from the character bank, whilst conditioning on the visual features using cross-attention, and demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
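As a toy illustration of the 'who' component, the sketch below matches CLIP face features of faces detected in an interval against the exemplars in the character bank to propose names; the threshold and variable names are assumptions, not the paper's implementation.

    # Hypothetical character proposal from a character bank of CLIP face exemplars.
    import torch
    import torch.nn.functional as F

    def propose_characters(face_feats, char_names, char_face_feats, threshold=0.3):
        # face_feats: (F, D) detected faces; char_face_feats: (C, D) one exemplar per character.
        sims = F.normalize(face_feats, dim=-1) @ F.normalize(char_face_feats, dim=-1).t()
        best_sim, best_idx = sims.max(dim=1)
        return sorted({char_names[i] for s, i in zip(best_sim.tolist(), best_idx.tolist())
                       if s > threshold})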
