Retrieving, annotating and recognizing human activities in web videos
Recent efforts in computer vision tackle the problem of human activity understanding in video sequences. Traditionally, these algorithms require annotated video data to learn models. In this work, we introduce a novel data collection framework to take advantage of the large amount of video data available on the web. We use this new framework to retrieve videos of human activities and build training and evaluation datasets for computer vision algorithms. We rely on Amazon Mechanical Turk workers to obtain high-accuracy annotations. An agglomerative clustering technique makes it possible to achieve reliable and consistent annotations for the temporal localization of human activities in videos. Using two datasets, Olympics Sports and our novel Daily Human Activities dataset, we show that our collection/annotation framework can produce robust annotations of human activities in large amounts of video data.
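The consolidation step above can be caricatured in a few lines: multiple workers mark roughly the same temporal interval, and agglomerative merging collapses overlapping annotations into one consensus interval while leaving outliers separate. This is a minimal sketch under assumed choices (IoU as the merge criterion, endpoint averaging, a 0.3 threshold); the paper's actual clustering procedure may differ.

```python
# Hedged sketch: consolidating noisy worker annotations of one activity
# into consensus intervals via greedy agglomerative merging. The IoU
# criterion, averaging rule, and threshold are illustrative assumptions.

def iou(a, b):
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def agglomerate(intervals, min_iou=0.3):
    """Repeatedly merge the most-overlapping pair of intervals until no
    pair exceeds min_iou; each merge averages the two endpoints."""
    clusters = [list(iv) for iv in intervals]
    while len(clusters) > 1:
        best, pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                o = iou(clusters[i], clusters[j])
                if o > best:
                    best, pair = o, (i, j)
        if pair is None or best < min_iou:
            break
        a, b = clusters[pair[0]], clusters[pair[1]]
        merged = [(a[0] + b[0]) / 2, (a[1] + b[1]) / 2]
        clusters = [c for k, c in enumerate(clusters) if k not in pair]
        clusters.append(merged)
    return clusters

# Three workers roughly agree on one activity; one outlier stays separate.
workers = [(10.0, 20.0), (11.0, 21.0), (9.5, 19.0), (50.0, 55.0)]
print(agglomerate(workers))
```

Greedy pairwise merging like this is quadratic in the number of annotations, which is acceptable here because each video segment only collects a handful of worker responses.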
Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization
State-of-the-art temporal action detectors inefficiently search the entire
video for specific actions. Despite the encouraging progress these methods
achieve, it is crucial to design automated approaches that only explore parts
of the video which are the most relevant to the actions being searched for. To
address this need, we propose the new problem of action spotting in video,
which we define as finding a specific action in a video while observing a small
portion of that video. Inspired by the observation that humans are extremely
efficient and accurate in spotting and finding action instances in video, we
propose Action Search, a novel Recurrent Neural Network approach that mimics
the way humans spot actions. Moreover, to address the absence of data recording
the behavior of human annotators, we put forward the Human Searches dataset,
which compiles the search sequences employed by human annotators spotting
actions in the AVA and THUMOS14 datasets. We consider temporal action
localization as an application of the action spotting problem. Experiments on
the THUMOS14 dataset reveal that our model is not only able to explore the
video efficiently (observing on average 17.3% of the video) but it also
accurately finds human activities with 30.8% mAP.
Comment: Accepted to ECCV 201
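The spotting behaviour described above — observing only a small fraction of frames while homing in on the action — can be caricatured as a guided local search over per-frame evidence. The sketch below is a toy stand-in, not the paper's recurrent model: a hand-written hill climb plays the role of the learned policy that predicts where to look next.

```python
# Hedged caricature of action spotting: from a starting frame, repeatedly
# jump to the nearby unvisited frame with the strongest evidence, stopping
# when no neighbour improves. A learned RNN policy replaces this heuristic
# in the actual method; `scores` stands in for per-frame visual evidence.

def spot(scores, start=0, max_steps=8):
    """Return the spotted frame index and the list of frames observed."""
    t, visited = start, []
    while len(visited) < max_steps:
        visited.append(t)
        neighbours = [n for n in (t - 2, t - 1, t + 1, t + 2)
                      if 0 <= n < len(scores) and n not in visited]
        better = [n for n in neighbours if scores[n] > scores[t]]
        if not better:
            break
        t = max(better, key=lambda n: scores[n])  # move toward evidence
    return t, visited

# Evidence peaks at frame 6; the search observes far fewer than all frames.
scores = [0.1, 0.2, 0.1, 0.4, 0.5, 0.7, 0.9, 0.6, 0.3, 0.2, 0.1, 0.1]
t, visited = spot(scores)
print(t, len(visited), len(scores))
```

The point of the toy is the efficiency claim: the search terminates after visiting only a handful of frames, mirroring the abstract's observation that the model watches about 17% of the video on average.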
Transcript to Video: Efficient Clip Sequencing from Texts
Among the numerous videos shared on the web, well-edited ones attract more
attention. However, it is difficult for inexperienced users to make well-edited
videos because doing so requires professional expertise and immense manual labor. To
meet the demands for non-experts, we present Transcript-to-Video -- a
weakly-supervised framework that uses texts as input to automatically create
video sequences from an extensive collection of shots. Specifically, we propose
a Content Retrieval Module and a Temporal Coherent Module to learn
visual-language representations and model shot sequencing styles, respectively.
For fast inference, we introduce an efficient search strategy for real-time
video clip sequencing. Quantitative results and user studies demonstrate
empirically that the proposed learning framework can retrieve content-relevant
shots while creating video sequences with plausible style. In addition, the
run-time performance analysis shows that our framework can support real-world
applications.
Comment: Tech Report; Demo and project page at
http://www.xiongyu.me/projects/transcript2video
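The two criteria in the abstract — content retrieval per sentence and temporal coherence between consecutive shots — can be combined in a simple greedy sequencing loop. This sketch is purely illustrative: the feature vectors, the coherence table, the mixing weight `w`, and the greedy strategy are assumptions standing in for the paper's learned representations and efficient search.

```python
# Hedged sketch of transcript-driven clip sequencing: each sentence picks
# the shot maximizing (content similarity) + w * (coherence with the
# previously chosen shot). All vectors and scores below are toy values.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sequence_shots(sentence_feats, shot_feats, coherence, w=0.5):
    """Greedy sequencing; coherence[i][j] scores shot j following shot i."""
    chosen = []
    for s in sentence_feats:
        best, best_score = None, float("-inf")
        for j, f in enumerate(shot_feats):
            score = dot(s, f)                      # content retrieval term
            if chosen:
                score += w * coherence[chosen[-1]][j]  # sequencing term
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
    return chosen

sentences = [[1.0, 0.0], [0.0, 1.0]]              # two transcript sentences
shots = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]      # three candidate shots
coherence = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 0.0],
             [0.0, 0.0, 0.0]]
print(sequence_shots(sentences, shots, coherence))
```

A greedy pass like this is linear in sentences times shots, which hints at why a carefully designed search strategy can reach real-time sequencing speeds.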
Meta-Personalizing Vision-Language Models to Find Named Instances in Video
Large-scale vision-language models (VLM) have shown impressive results for
language-guided search applications. While these models allow category-level
queries, they currently struggle with personalized searches for moments in a
video where a specific object instance such as "My dog Biscuit" appears. We
present the following three contributions to address this problem. First, we
describe a method to meta-personalize a pre-trained VLM, i.e., learning how to
learn to personalize a VLM at test time to search in video. Our method extends
the VLM's token vocabulary by learning novel word embeddings specific to each
instance. To capture only instance-specific features, we represent each
instance embedding as a combination of shared and learned global category
features. Second, we propose to learn such personalization without explicit
human supervision. Our approach automatically identifies moments of named
visual instances in video using transcripts and vision-language similarity in
the VLM's embedding space. Finally, we introduce This-Is-My, a personal video
instance retrieval benchmark. We evaluate our approach on This-Is-My and
DeepFashion2 and show that we obtain a 15% relative improvement over the state
of the art on the latter dataset.
Comment: Accepted to CVPR 2023. Project webpage:
https://danielchyeh.github.io/metaper
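The instance-embedding idea above — a new token built from shared category features plus an instance-specific part — can be shown in a few lines. Everything here is a made-up stand-in: the category vectors, mixture weights, and residual are assumptions illustrating the composition, not the VLM's actual learned embeddings.

```python
# Hedged sketch: a personalized token ("my dog Biscuit") is represented as
# a weighted combination of shared, frozen category features plus a small
# learned instance-specific residual. All numbers below are illustrative.

category_feats = {            # shared "global category" vectors (frozen)
    "dog": [0.9, 0.1, 0.0],
    "bag": [0.0, 0.2, 0.8],
}

def instance_embedding(mix_weights, residual):
    """token = sum_k w_k * category_k + residual; the learnable parts
    are the mixture weights and the residual."""
    emb = list(residual)
    for name, w in mix_weights.items():
        feats = category_feats[name]
        for d in range(len(emb)):
            emb[d] += w * feats[d]
    return emb

# Biscuit starts from the shared "dog" features and adds a small residual
# that captures only what is specific to this one instance.
biscuit = instance_embedding({"dog": 1.0}, [0.05, -0.02, 0.0])
print(biscuit)
```

Tying most of the embedding to shared category features is what lets the residual stay small and instance-specific, which matches the abstract's stated goal of capturing only instance-specific information in the new token.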
Localizing Moments in Long Video Via Multimodal Guidance
The recent introduction of the large-scale, long-form MAD and Ego4D datasets
has enabled researchers to investigate the performance of current
state-of-the-art methods for video grounding in the long-form setup, with
interesting findings: current grounding methods alone fail to tackle this
challenging task and setup due to their inability to process long video
sequences. In this paper, we propose a method for improving the performance of
natural language grounding in long videos by identifying and pruning out
non-describable windows. We design a guided grounding framework consisting of a
Guidance Model and a base grounding model. The Guidance Model emphasizes
describable windows, while the base grounding model analyzes short temporal
windows to determine which segments accurately match a given language query. We
offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent,
which balance efficiency and accuracy. Experiments demonstrate that our
proposed method outperforms state-of-the-art models by 4.1% on MAD and 4.52% on
Ego4D (NLQ). Code, data, and MAD's audio features necessary to
reproduce our experiments are available at:
https://github.com/waybarrios/guidance-based-video-grounding
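The two-stage pipeline described above — a guidance model that prunes non-describable windows, then a base grounding model that ranks the survivors against the query — reduces to a short filter-then-rank loop. Both scorers below are hypothetical stand-ins for the learned models, and the `keep` fraction is an assumed pruning budget.

```python
# Hedged sketch of guided grounding: keep only the windows a guidance
# score deems describable, then return the survivor that best matches the
# language query. The lambda scorers and window data are toy stand-ins.

def guided_grounding(windows, guidance_score, grounding_score, query, keep=0.5):
    """Prune to the top `keep` fraction by guidance score, then rank the
    remaining windows with the base grounding model."""
    ranked = sorted(windows, key=guidance_score, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep))]   # pruning stage
    return max(kept, key=lambda w: grounding_score(w, query))

windows = [
    {"t": (0, 30),   "describable": 0.9, "match": 0.2},
    {"t": (30, 60),  "describable": 0.8, "match": 0.9},
    {"t": (60, 90),  "describable": 0.2, "match": 0.1},
    {"t": (90, 120), "describable": 0.7, "match": 0.5},
]
best = guided_grounding(
    windows,
    guidance_score=lambda w: w["describable"],
    grounding_score=lambda w, q: w["match"],
    query="person opens a door",
)
print(best["t"])
```

Because the expensive grounding model only sees the kept windows, the pruning stage is also where the abstract's Query-Agnostic vs. Query-Dependent trade-off lives: a query-agnostic guidance score can be precomputed once per video, while a query-dependent one is costlier but prunes more accurately.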