Spatio-temporal Video Re-localization by Warp LSTM
The need to efficiently find the video content a user wants is increasing
because of the explosion of user-generated videos on the Web. Existing
keyword-based or content-based video retrieval methods usually determine what
occurs in a video but not when and where. In this paper, we answer
the question of when and where by formulating a new task, namely
spatio-temporal video re-localization. Specifically, given a query video and a
reference video, spatio-temporal video re-localization aims to localize
tubelets in the reference video such that the tubelets semantically correspond
to the query. To accurately localize the desired tubelets in the reference
video, we propose a novel warp LSTM network, which propagates the
spatio-temporal information for a long period and thereby captures the
corresponding long-term dependencies. Another issue for spatio-temporal video
re-localization is the lack of properly labeled video datasets. Therefore, we
reorganize the videos in the AVA dataset to form a new dataset for
spatio-temporal video re-localization research. Extensive experimental results
show that the proposed model achieves superior performance over the designed
baselines on the spatio-temporal video re-localization task.
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), also known as natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate on the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.
Comment: 29 pages, 32 figures, 9 tables