Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video
Temporally language grounding in untrimmed videos is a newly-raised task in
video understanding. Most existing methods suffer from low efficiency, lack
interpretability, and deviate from the human perception mechanism. Inspired by
the human coarse-to-fine decision-making paradigm, we
formulate a novel Tree-Structured Policy based Progressive Reinforcement
Learning (TSP-PRL) framework to sequentially regulate the temporal boundary by
an iterative refinement process. The semantic concepts are explicitly
represented as branches in the policy, which helps to efficiently
decompose complex policies into interpretable primitive actions.
Progressive reinforcement learning provides correct credit assignment via two
task-oriented rewards that encourage mutual promotion within the
tree-structured policy. We extensively evaluate TSP-PRL on the Charades-STA and
ActivityNet datasets, and experimental results show that TSP-PRL achieves
competitive performance compared with existing state-of-the-art methods.
Comment: To appear in AAAI202
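As a rough illustration of the coarse-to-fine boundary refinement described in the abstract above, the sketch below adjusts a temporal segment step by step using a two-level action tree. It is a minimal sketch under stated assumptions: the branch names, the step sizes, and the greedy selection that peeks at a ground-truth segment are all hypothetical stand-ins. The abstract does not list the action set, and the framework it describes learns its policy with reinforcement learning rather than using an oracle.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical two-level action tree: branch name -> primitive actions.
# Each primitive is (delta_start, delta_end) as a fraction of the video length.
ACTION_TREE = {
    "coarse_shift": [(-0.10, -0.10), (0.10, 0.10)],
    "fine_shift": [(-0.02, -0.02), (0.02, 0.02)],
    "scale": [(-0.05, 0.05), (0.05, -0.05)],
    "fine_edge": [(0.02, 0.0), (-0.02, 0.0), (0.0, 0.02), (0.0, -0.02)],
}


def apply_action(segment, action, duration):
    """Move the boundaries, clamp to the video, and reject empty segments."""
    start = min(max(segment[0] + action[0] * duration, 0.0), duration)
    end = min(max(segment[1] + action[1] * duration, 0.0), duration)
    return (start, end) if end > start else segment


def refine(segment, target, duration, max_steps=20):
    """Greedy oracle stand-in for a learned tree-structured policy."""
    for _ in range(max_steps):
        current_iou = temporal_iou(segment, target)
        # Leaf level: the best primitive action inside each branch.
        best_per_branch = {
            branch: max(
                (apply_action(segment, a, duration) for a in actions),
                key=lambda s: temporal_iou(s, target),
            )
            for branch, actions in ACTION_TREE.items()
        }
        # Root level: the branch whose best leaf gives the highest IoU.
        branch = max(
            best_per_branch,
            key=lambda b: temporal_iou(best_per_branch[b], target),
        )
        candidate = best_per_branch[branch]
        if temporal_iou(candidate, target) <= current_iou:
            break  # No primitive improves the boundary; stop refining.
        segment = candidate
    return segment
```

Calling `refine((7.5, 22.5), (3.0, 12.0), 30.0)` starts from a centered guess in a 30-second video and moves it toward the target segment. In a learned setting the IoU against the ground truth would only be available as a training reward, not at inference time.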
Location-aware Graph Convolutional Networks for Video Question Answering
We address the challenging task of video question answering, which requires
machines to answer questions about videos in a natural language form. Previous
state-of-the-art methods attempt to apply spatio-temporal attention mechanisms
on video frame features without explicitly modeling the locations of, and relations
among, the object interactions that occur in videos. However, the relations between
object interactions and their location information are critical for both
action recognition and question reasoning. In this work, we propose to
represent the contents in the video as a location-aware graph by incorporating
the location information of an object into the graph construction. Here, each
node is associated with an object represented by its appearance and location
features. Based on the constructed graph, we propose to use graph convolution
to infer both the category and temporal locations of an action. As the graph is
built on objects, our method is able to focus on the foreground action contents
for better video question answering. Lastly, we leverage an attention mechanism
to combine the output of graph convolution and encoded question features for
final answer reasoning. Extensive experiments demonstrate the effectiveness of
the proposed methods. Specifically, our method significantly outperforms
state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
Code and pre-trained models are publicly available at:
https://github.com/SunDoge/L-GC
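To make the location-aware graph idea in the abstract above more concrete, the sketch below builds node features from appearance and location and applies one graph convolution step. It is a minimal sketch, not the authors' implementation: the location encoding (normalized box corners plus a normalized frame index), the similarity-based softmax adjacency, and the fixed weight matrix are all assumptions made for illustration, and a real model would learn the weights.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def build_nodes(appearance, boxes, frame_ids, num_frames):
    """Concatenate each object's appearance vector with its location.

    Assumed location encoding: normalized box corners (x1, y1, x2, y2)
    plus the frame index scaled to [0, 1].
    """
    nodes = []
    for feat, (x1, y1, x2, y2), t in zip(appearance, boxes, frame_ids):
        location = [x1, y1, x2, y2, t / max(num_frames - 1, 1)]
        nodes.append(list(feat) + location)
    return nodes


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_adjacency(nodes):
    """Fully connected adjacency from dot-product similarity (an assumption)."""
    scores = matmul(nodes, list(zip(*nodes)))
    return [softmax(row) for row in scores]


def graph_conv(nodes, weight):
    """One graph convolution step: ReLU(A @ H @ W)."""
    adj = similarity_adjacency(nodes)
    aggregated = matmul(adj, nodes)
    projected = matmul(aggregated, weight)
    return [[max(0.0, v) for v in row] for row in projected]
```

With two detected objects that each have a 2-dimensional appearance vector, `build_nodes` yields 7-dimensional node features, so `weight` must have 7 rows. The output has one row per object and one column per output channel.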