Location-aware Graph Convolutional Networks for Video Question Answering
We address the challenging task of video question answering, which requires
machines to answer questions about videos in natural language. Previous
state-of-the-art methods apply spatio-temporal attention mechanisms to video
frame features without explicitly modeling the locations of, and relations
among, the object interactions occurring in videos. However, these relations
and their location information are critical for both action recognition and
question reasoning. In this work, we propose to
represent the contents of the video as a location-aware graph by incorporating
each object's location information into the graph construction. Here, each
node is associated with an object represented by its appearance and location
features. Based on the constructed graph, we propose to use graph convolution
to infer both the category and temporal locations of an action. As the graph is
built on objects, our method is able to focus on the foreground action contents
for better video question answering. Lastly, we leverage an attention mechanism
to combine the output of the graph convolution with the encoded question features for
final answer reasoning. Extensive experiments demonstrate the effectiveness of
the proposed methods. Specifically, our method significantly outperforms
state-of-the-art methods on the TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
Code and pre-trained models are publicly available at:
https://github.com/SunDoge/L-GC
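
To make the graph construction concrete, here is a minimal PyTorch sketch of the location-aware graph idea: each node concatenates an object's appearance features with its bounding-box coordinates, a similarity-based adjacency drives one step of graph convolution, and an encoded question vector attends over the node outputs. All module names, feature dimensions, and the similarity-based adjacency are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareGCN(nn.Module):
    # Hypothetical sketch; dimensions are assumptions (e.g. 2048-d CNN
    # appearance features and 4-d normalized bounding boxes per object).
    def __init__(self, app_dim=2048, loc_dim=4, hid_dim=512):
        super().__init__()
        # Fuse each object's appearance and location features into one node.
        self.node_proj = nn.Linear(app_dim + loc_dim, hid_dim)
        # One graph-convolution layer applied after neighbor aggregation.
        self.gcn = nn.Linear(hid_dim, hid_dim)

    def forward(self, app_feats, loc_feats):
        # app_feats: (N, app_dim) for N detected objects
        # loc_feats: (N, loc_dim) normalized boxes (x1, y1, x2, y2)
        nodes = self.node_proj(torch.cat([app_feats, loc_feats], dim=-1))
        # Build the adjacency from node similarity (one common choice).
        adj = F.softmax(nodes @ nodes.t(), dim=-1)          # (N, N)
        # Graph convolution: aggregate neighbors, then transform.
        return F.relu(self.gcn(adj @ nodes))                # (N, hid_dim)

def attend(question, node_feats):
    # Question-guided attention over the graph outputs: pool the N node
    # features into one vector used for final answer reasoning.
    scores = F.softmax(node_feats @ question, dim=0)        # (N,)
    return (scores.unsqueeze(-1) * node_feats).sum(dim=0)   # (hid_dim,)

Because the nodes are detected objects rather than whole frames, the convolution naturally concentrates on foreground action content, which is the property the abstract highlights.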
Generating Visually Aligned Sound from Videos
We focus on the task of generating sound from natural videos, where the sound
should be both temporally and content-wise aligned with the visual signals. This
task is extremely challenging because some sounds are generated outside the
camera's view and cannot be inferred from the video content. The model may thus
be forced to learn an incorrect mapping between visual content and these
irrelevant sounds. To
address this challenge, we propose a framework named REGNET. In this framework,
we first extract appearance and motion features from video frames to better
distinguish the object that emits sound from complex background information. We
then introduce a novel audio forwarding regularizer that takes the real sound
as input and outputs bottlenecked sound features. Using both the visual and the
bottlenecked sound features during training provides stronger supervision for
sound prediction. The audio
forwarding regularizer can control the irrelevant sound component and thus
prevent the model from learning an incorrect mapping between video frames and
sounds emitted by off-screen objects. During testing, the
audio forwarding regularizer is removed to ensure that REGNET can produce
purely aligned sound only from visual features. Extensive evaluations based on
Amazon Mechanical Turk demonstrate that our method significantly improves both
temporal and content-wise alignment. Remarkably, our generated sound can fool
human listeners with a 68.12% success rate. Code and pre-trained models are publicly
available at https://github.com/PeihaoChen/regnet

Comment: Published in IEEE Transactions on Image Processing, 2020. Code, pre-trained models and demo video: https://github.com/PeihaoChen/regnet