Multilevel Language and Vision Integration for Text-to-Clip Retrieval
We address the problem of text-based activity retrieval in video. Given a
sentence describing an activity, our task is to retrieve matching clips from an
untrimmed video. To capture the inherent structures present in both text and
video, we introduce a multilevel model that integrates vision and language
features earlier and more tightly than prior work. First, we inject text
features early on when generating clip proposals, to help eliminate unlikely
clips and thus speed up processing and boost performance. Second, to learn a
fine-grained similarity metric for retrieval, we use visual features to
modulate the processing of query sentences at the word level in a recurrent
neural network. We also employ a multi-task loss that adds query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions.
Comment: AAAI 2019
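To make the word-level modulation idea concrete, here is a minimal sketch assuming a PyTorch-style setup: the candidate clip's visual feature produces a gate that scales each word embedding before the recurrent step, so the query encoding is conditioned on the clip. Module names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisuallyModulatedEncoder(nn.Module):
    """Sketch: a clip's visual feature gates word-level query processing."""

    def __init__(self, word_dim=300, vis_dim=512, hidden_dim=256):
        super().__init__()
        self.gate = nn.Linear(vis_dim, word_dim)   # visual feature -> word gate
        self.cell = nn.LSTMCell(word_dim, hidden_dim)

    def forward(self, word_embs, vis_feat):
        # word_embs: (seq_len, word_dim); vis_feat: (vis_dim,)
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        gate = torch.sigmoid(self.gate(vis_feat))  # per-dimension gate in [0, 1]
        for w in word_embs:
            # The visual gate modulates each word before the recurrent step,
            # so sentence processing is conditioned on the candidate clip.
            h, c = self.cell((gate * w).unsqueeze(0), (h, c))
        return h.squeeze(0)  # clip-conditioned query representation
```

Run with different clip features, the same encoder yields different query representations, which is what allows the learned similarity metric to be fine-grained per clip.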
Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence
Visual-semantic embedding aims to learn a joint embedding space where related
video and sentence instances are located close to each other. Most existing
methods put all instances in a single embedding space. However, such methods struggle to embed instances because matching the visual dynamics of a video to the textual features of a sentence is difficult, and a single space cannot accommodate the variety of videos and sentences. In this paper, we propose a novel framework that
maps instances into multiple individual embedding spaces so that we can capture
multiple relationships between instances, leading to compelling video
retrieval. We propose to produce a final similarity between instances by fusing
similarities measured in each embedding space using a weighted sum strategy. We
determine the weights according to the query sentence, so the framework can flexibly emphasize the embedding space best suited to it. We conducted sentence-to-video retrieval
experiments on a benchmark dataset. The proposed method achieved superior
performance, and the results are competitive with state-of-the-art methods. These experimental results demonstrated the effectiveness of the proposed multiple-embedding approach compared to existing methods.
Comment: 8 pages, 5 figures
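A minimal sketch of the fusion scheme, assuming K paired projection heads in a PyTorch-style setup; the head count, dimensions, and names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpaceSimilarity(nn.Module):
    """Sketch: K embedding spaces fused by sentence-dependent weights."""

    def __init__(self, vid_dim=2048, sent_dim=1024, emb_dim=512, k=4):
        super().__init__()
        # One paired projection head per embedding space.
        self.vid_heads = nn.ModuleList(nn.Linear(vid_dim, emb_dim) for _ in range(k))
        self.sent_heads = nn.ModuleList(nn.Linear(sent_dim, emb_dim) for _ in range(k))
        self.weighting = nn.Linear(sent_dim, k)  # fusion weights from the sentence

    def forward(self, vid_feat, sent_feat):
        # Cosine similarity measured independently in each embedding space.
        sims = torch.stack(
            [F.cosine_similarity(v(vid_feat), s(sent_feat), dim=-1)
             for v, s in zip(self.vid_heads, self.sent_heads)],
            dim=-1)                                            # (batch, k)
        w = torch.softmax(self.weighting(sent_feat), dim=-1)   # (batch, k)
        return (w * sims).sum(dim=-1)                          # fused similarity
```

The softmax over `weighting(sent_feat)` is what makes the emphasis placed on each embedding space depend on the query sentence rather than being fixed.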
Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video
Temporal language grounding in untrimmed videos is a newly introduced task in video understanding. Most existing methods suffer from poor efficiency, lack interpretability, and deviate from human perception mechanisms. Inspired by the coarse-to-fine decision-making paradigm of humans, we
formulate a novel Tree-Structured Policy based Progressive Reinforcement
Learning (TSP-PRL) framework to sequentially regulate the temporal boundary by
an iterative refinement process. Semantic concepts are explicitly represented as branches of the policy, which helps decompose complex policies efficiently into interpretable primitive actions.
Progressive reinforcement learning provides correct credit assignment via two
task-oriented rewards that encourage mutual promotion within the
tree-structured policy. We extensively evaluate TSP-PRL on the Charades-STA and
ActivityNet datasets, and experimental results show that TSP-PRL achieves
competitive performance compared with existing state-of-the-art methods.
Comment: To appear in AAAI 2020
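A toy sketch of the tree-structured, coarse-to-fine decision process: a root choice selects a semantic branch, and a leaf choice selects a primitive action that moves or scales the temporal window. Both choices below are random stand-ins for the learned policies, and the branch set and step sizes are assumptions for illustration only.

```python
import random

# Semantic branches, each holding primitive boundary actions (illustrative).
BRANCHES = {
    "shift_left":  [("start", -0.10), ("end", -0.10)],
    "shift_right": [("start", +0.10), ("end", +0.10)],
    "shrink":      [("start", +0.05), ("end", -0.05)],
    "expand":      [("start", -0.05), ("end", +0.05)],
}

def refine(start, end, steps=10):
    """Coarse-to-fine refinement of a normalized [start, end] window."""
    for _ in range(steps):
        branch = random.choice(list(BRANCHES))         # root: pick a branch
        side, delta = random.choice(BRANCHES[branch])  # leaf: primitive action
        if side == "start":
            start = min(max(start + delta, 0.0), end)
        else:
            end = max(min(end + delta, 1.0), start)
    return start, end

print(refine(0.2, 0.8))
```

In TSP-PRL itself the root and leaf choices come from trained policy networks driven by the two task-oriented rewards described above; the sketch only shows the tree-structured control flow.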