32 research outputs found
Spatio-temporal Video Re-localization by Warp LSTM
The need for efficiently finding the video content a user wants is increasing
because of the erupting of user-generated videos on the Web. Existing
keyword-based or content-based video retrieval methods usually determine what
occurs in a video but not when and where. In this paper, we make an answer to
the question of when and where by formulating a new task, namely
spatio-temporal video re-localization. Specifically, given a query video and a
reference video, spatio-temporal video re-localization aims to localize
tubelets in the reference video such that the tubelets semantically correspond
to the query. To accurately localize the desired tubelets in the reference
video, we propose a novel warp LSTM network, which propagates the
spatio-temporal information for a long period and thereby captures the
corresponding long-term dependencies. Another issue for spatio-temporal video
re-localization is the lack of properly labeled video datasets. Therefore, we
reorganize the videos in the AVA dataset to form a new dataset for
spatio-temporal video re-localization research. Extensive experimental results
show that the proposed model achieves superior performances over the designed
baselines on the spatio-temporal video re-localization task
Localizing the Common Action Among a Few Videos
This paper strives to localize the temporal extent of an action in a long
untrimmed video. Where existing work leverages many examples with their start,
their ending, and/or the class of the action during training time, we propose
few-shot common action localization. The start and end of an action in a long
untrimmed video is determined based on just a hand-full of trimmed video
examples containing the same action, without knowing their common class label.
To address this task, we introduce a new 3D convolutional network architecture
able to align representations from the support videos with the relevant query
video segments. The network contains: (\textit{i}) a mutual enhancement module
to simultaneously complement the representation of the few trimmed support
videos and the untrimmed query video; (\textit{ii}) a progressive alignment
module that iteratively fuses the support videos into the query branch; and
(\textit{iii}) a pairwise matching module to weigh the importance of different
support videos. Evaluation of few-shot common action localization in untrimmed
videos containing a single or multiple action instances demonstrates the
effectiveness and general applicability of our proposal.Comment: ECCV 202
Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval
With the explosive growth of web videos in recent years, large-scale
Content-Based Video Retrieval (CBVR) becomes increasingly essential in video
filtering, recommendation, and copyright protection. Segment-level CBVR
(S-CBVR) locates the start and end time of similar segments in finer
granularity, which is beneficial for user browsing efficiency and infringement
detection especially in long video scenarios. The challenge of S-CBVR task is
how to achieve high temporal alignment accuracy with efficient computation and
low storage consumption. In this paper, we propose a Segment Similarity and
Alignment Network (SSAN) in dealing with the challenge which is firstly trained
end-to-end in S-CBVR. SSAN is based on two newly proposed modules in video
retrieval: (1) An efficient Self-supervised Keyframe Extraction (SKE) module to
reduce redundant frame features, (2) A robust Similarity Pattern Detection
(SPD) module for temporal alignment. In comparison with uniform frame
extraction, SKE not only saves feature storage and search time, but also
introduces comparable accuracy and limited extra computation time. In terms of
temporal alignment, SPD localizes similar segments with higher accuracy and
efficiency than existing deep learning methods. Furthermore, we jointly train
SSAN with SKE and SPD and achieve an end-to-end improvement. Meanwhile, the two
key modules SKE and SPD can also be effectively inserted into other video
retrieval pipelines and gain considerable performance improvements.
Experimental results on public datasets show that SSAN can obtain higher
alignment accuracy while saving storage and online query computational cost
compared to existing methods.Comment: Accepted by ACM MM 202
Fine-grained action recognition by motion saliency and mid-level patches
Effective extraction of human body parts and operated objects participating in action is the key issue of fine-grained action recognition. However, most of the existing methods require intensive manual annotation to train the detectors of these interaction components. In this paper, we represent videos by mid-level patches to avoid the manual annotation, where each patch corresponds to an action-related interaction component. In order to capture mid-level patches more exactly and rapidly, candidate motion regions are extracted by motion saliency. Firstly, the motion regions containing interaction components are segmented by a threshold adaptively calculated according to the saliency histogram of the motion saliency map. Secondly, we introduce a mid-level patch mining algorithm for interaction component detection, with object proposal generation and mid-level patch detection. The object proposal generation algorithm is used to obtain multi-granularity object proposals inspired by the idea of the Huffman algorithm. Based on these object proposals, the mid-level patch detectors are trained by K-means clustering and SVM. Finally, we build a fine-grained action recognition model using a graph structure to describe relationships between the mid-level patches. To recognize actions, the proposed model calculates the appearance and motion features of mid-level patches and the binary motion cooperation relationships between adjacent patches in the graph. Extensive experiments on the MPII cooking database demonstrate that the proposed method gains better results on fine-grained action recognition