Search CORE

2,179 research outputs found

Temporal Sentence Grounding in Videos: A Survey and Future Directions

Author: Jing Wei
Sun Aixin
Zhang Hao
Zhou Joey Tianyi
Publication venue
Publication date: 28/10/2022
Field of study

Temporal sentence grounding in videos (TSGV), \aka natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language query from an untrimmed video. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey attempts to provide a summary of fundamental concepts in TSGV and current research status, as well as future research directions. As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment. Then we review the techniques for multimodal understanding and interaction, which is the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate the methods in different categories with their strengths and weaknesses. Lastly, we discuss issues with the current TSGV research and share our insights about promising research directions.Comment: 29 pages, 32 figures, 9 table

arXiv.org e-Print Archive

DR-NTU (Digital Repository of NTU)

Boundary Proposal Network for Two-Stage Natural Language Video Localization

Author: Chen Long
Ji Wei
Shao Jian
Xiao Jun
Xiao Shaoning
Ye Lu
Zhang Songyang
Publication venue
Publication date: 18/05/2021
Field of study

We aim to address the problem of Natural Language Video Localization (NLVL)-localizing the video segment corresponding to a natural language description in a long and untrimmed video. State-of-the-art NLVL methods are almost in one-stage fashion, which can be typically grouped into two categories: 1) anchor-based approach: it first pre-defines a series of video segment candidates (e.g., by sliding window), and then does classification for each candidate; 2) anchor-free approach: it directly predicts the probabilities for each video frame as a boundary or intermediate frame inside the positive segment. However, both kinds of one-stage approaches have inherent drawbacks: the anchor-based approach is susceptible to the heuristic rules, further limiting the capability of handling videos with variant length. While the anchor-free approach fails to exploit the segment-level interaction thus achieving inferior results. In this paper, we propose a novel Boundary Proposal Network (BPNet), a universal two-stage framework that gets rid of the issues mentioned above. Specifically, in the first stage, BPNet utilizes an anchor-free model to generate a group of high-quality candidate video segments with their boundaries. In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between the candidate and the language query, followed by a matching score rating layer that outputs the alignment score for each candidate. We evaluate our BPNet on three challenging NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive experiments and ablative studies on these datasets demonstrate that the BPNet outperforms the state-of-the-art methods.Comment: AAAI 202

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Recommended from our members

Towards Segment-level Video Understanding: Detecting Activities from Untrimmed Videos

Author: Zhang Da
Publication venue: eScholarship, University of California
Publication date: 01/01/2020
Field of study

We generate massive amounts of video data every day. While most real-world videos are long and untrimmed with sparsely localized segments of interest, existing AI systems that can interpret videos today often rely on static image analysis or can only process temporal information in a short video snippet. To automatically understand the content of long video streams, this thesis mainly describes the efforts to design accurate, efficient, and intelligent deep learning algorithms for temporal activity detection in untrimmed videos. Detecting segments of interest from untrimmed videos is a key step towards segment-level video understanding. Depending on the purposes of tasks being performed, we address three different activity detection tasks: detecting activities of interest from videos without specific purposes (i.e., temporal activity detection); detecting temporal segment that best corresponds to a language query (i.e., natural language moment retrieval); and detecting activities given less supervision (i.e., weakly-supervised or few-shot activity detection).In temporal activity detection, We first propose a highly unified single-shot temporal activity detector based on fully 3D convolutional networks, by eliminating explicit temporal proposal and classification stages. Evaluations show that it achieves state-of-the-art on temporal activity detection while being super efficient to operate at 1271 FPS. We then investigate how to effectively apply a multi-scale architecture to model activities with various temporal length and frequency. We propose three novel architecture designs: (1) dynamic temporal sampling; (2) two-branch feature hierarchy; (3) multi-scale contextual feature fusion, and we combine all these components into a uniform network and achieve the state-of-the-art on a much larger temporal activity detection benchmark.In natural language moment retrieval, we aim to localize the segment that best corresponds to a given language query. We present a language-guided temporal attention module and an iterative graph adjustment network to handle the semantic and structural misalignment between video and language. The proposed model demonstrates superior capability to handle temporal relations, thus, significantly improves the state-of-the-art by a large margin.Finally, we study the problem of weakly-supervised and few-shot temporal activity detection to mitigate the drawbacks of huge amounts of supervision needed to train a temporal detection model. Namely, we answer the question if we can learn a temporal activity detector under weak supervision that is able to localize unseen activity classes. A novel meta-learning based detection method is accordingly proposed by adopting the few-shot learning technique of Relation Network. Results show that our method achieves performance superior or competitive to state-of-the-art approaches with stronger supervision.In summary, we propose a suite of algorithms and solutions to automatically detect segments of interest in long untrimmed videos. We hope our studies could provide insights for researchers to explore new deep learning paradigms for future computer vision research, especially on video-related topics

eScholarship - University of California