Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding
Most existing video-language pre-training methods focus on instance-level
alignment between video clips and captions via global contrastive learning but
neglect rich fine-grained local information, which is important for
downstream tasks requiring temporal localization and semantic reasoning. In
this work, we propose a simple yet effective video-language pre-training
framework, namely G-ViLM, to learn discriminative spatiotemporal features. Two
novel designs involving spatiotemporal grounding and temporal grouping promote
learning local region-noun alignment and temporal-aware features
simultaneously. Specifically, spatiotemporal grounding aggregates semantically
similar video tokens and aligns them with noun phrases extracted from the
caption to promote local region-noun correspondences. Moreover, temporal
grouping leverages cut-and-paste to manually create temporal scene changes and
then learns distinguishable features from different scenes. Comprehensive
evaluations demonstrate that G-ViLM performs favorably against existing
approaches on four representative downstream tasks, covering text-video
retrieval, video question answering, video action recognition and temporal
action localization. G-ViLM performs competitively on all evaluated tasks and
in particular achieves R@10 of 65.1 on zero-shot MSR-VTT retrieval, over 9%
higher than the state-of-the-art method.
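The cut-and-paste operation behind temporal grouping, as described in the abstract above, can be illustrated with a minimal sketch. This is not the G-ViLM implementation: the function name `cut_and_paste`, the `min_len` parameter, the equal-length assumption, and the 0/1 per-frame labels are all hypothetical choices made for illustration. The sketch pastes a contiguous segment from one clip into another, creating an artificial scene change, and returns labels that a model could be trained to separate.

```python
import random


def cut_and_paste(background, foreground, min_len=1, rng=None):
    """Paste a contiguous segment of `foreground` into `background`.

    Both clips are lists of per-frame items (frames or frame features)
    with the same length. Returns the mixed clip and a per-frame label
    list: 0 = background scene, 1 = pasted (foreground) scene.
    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    num_frames = len(background)
    if len(foreground) != num_frames:
        raise ValueError("clips must have the same number of frames")
    if not 1 <= min_len < num_frames:
        raise ValueError("min_len must satisfy 1 <= min_len < num_frames")

    # Pick a segment length that leaves at least one background frame,
    # so the mixed clip always contains a scene change.
    seg_len = rng.randint(min_len, num_frames - 1)
    start = rng.randint(0, num_frames - seg_len)
    end = start + seg_len

    mixed = background[:start] + foreground[start:end] + background[end:]
    labels = [0] * start + [1] * seg_len + [0] * (num_frames - end)
    return mixed, labels


# Usage: strings stand in for frames so the pasted segment is easy to see.
clip_a = [f"a{i}" for i in range(8)]
clip_b = [f"b{i}" for i in range(8)]
mixed, labels = cut_and_paste(clip_a, clip_b, min_len=2, rng=random.Random(0))
```

In a real pipeline the list items would be frame tensors or token features, and the labels would supervise a grouping objective; how G-ViLM defines that objective is not specified in the abstract.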
Towards Interaction-level Video Action Understanding
A huge number of videos are created, shared, and viewed every day, and human actions and activities account for a large part of them. We want machines to understand human actions in videos because this is essential to various applications, including but not limited to autonomous driving, security systems, human-robot interaction, and healthcare. For a truly intelligent system that can interact with humans, video understanding must go beyond simply answering "what is the action in the video" and become more aware of what those actions mean to humans and more in line with human thinking, which we call interactive-level action understanding. This thesis identifies three main challenges in approaching interactive-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via human natural language. For the first challenge, we select video summarization as a representative task, which aims to select informative frames that retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., which parts of an action sequence humans consider essential). For the second challenge, our work on action quality assessment uses transformer decoders to parse the input action into several sub-actions and assess the finer-grained qualities of the given action, yielding the capability of understanding actions given specific human rules (e.g., how well a dive is performed, or how well a robot performs surgery). For the third challenge, the key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language. We demonstrate the utility of this technique on the video captioning task, which takes an action video as input and outputs natural language, and on which it yields state-of-the-art performance. The research directions and methods introduced in this thesis provide fundamental components toward interactive-level action understanding.