How to Train Your Agent to Read and Write
Reading and writing research papers is one of the essential skills that a
qualified researcher must master. However, it is difficult for new
researchers (e.g., students) to fully grasp this skill. It would be
fascinating if we could train an intelligent agent to help people read and
summarize papers, and perhaps even discover and exploit potential knowledge
clues to write novel papers. Although existing works focus on either
summarizing (i.e., reading) the knowledge in a given text or generating
(i.e., writing) a text from the given knowledge, the ability to read and
write simultaneously is still under development.
Typically, this requires an agent to fully understand the knowledge from the
given text materials and generate correct and fluent novel paragraphs, which is
very challenging in practice. In this paper, we propose a Deep ReAder-Writer
(DRAW) network, which consists of a Reader that extracts knowledge graphs
(KGs) from input paragraphs and discovers potential knowledge, a
graph-to-text Writer that generates a novel paragraph, and a Reviewer that
evaluates the generated paragraph from three different aspects. Extensive
experiments show that our DRAW network outperforms the considered baselines
and several state-of-the-art methods on the AGENDA and M-AGENDA datasets.
Our code and supplementary material are released at
https://github.com/menggehe/DRAW
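Although the abstract does not spell out the implementation, the three-module decomposition is concrete enough to sketch. Below is a minimal, hypothetical PyTorch sketch of the Reader/Writer/Reviewer split: the module names and the three review aspects come from the abstract, while all internals (GRU encoders, a bilinear relation scorer, a pooled graph-to-text decoder) are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the Reader/Writer/Reviewer decomposition (assumed
# PyTorch). Module names and the three review aspects come from the abstract;
# every internal detail below is an illustrative assumption.
import torch
import torch.nn as nn


class Reader(nn.Module):
    """Encodes a paragraph and scores (head, relation, tail) token pairs,
    from which knowledge-graph (KG) triples can be decoded."""

    def __init__(self, vocab, hidden=256, num_rel=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.enc = nn.GRU(hidden, hidden, batch_first=True)
        self.rel = nn.Bilinear(hidden, hidden, num_rel)

    def forward(self, tokens):                              # tokens: (B, T)
        h, _ = self.enc(self.embed(tokens))                 # (B, T, H)
        head = h.unsqueeze(2).expand(-1, -1, h.size(1), -1).contiguous()
        tail = h.unsqueeze(1).expand(-1, h.size(1), -1, -1).contiguous()
        return h, self.rel(head, tail)                      # (B, T, T, R)


class Writer(nn.Module):
    """Graph-to-text decoder conditioned on pooled KG node states."""

    def __init__(self, vocab, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.dec = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, node_states, prev_tokens):   # (B, N, H), (B, T)
        h0 = node_states.mean(dim=1).unsqueeze(0)  # graph context: (1, B, H)
        y, _ = self.dec(self.embed(prev_tokens), h0.contiguous())
        return self.out(y)                         # next-token logits


class Reviewer(nn.Module):
    """Scores a generated paragraph on several aspects (the paper uses 3)."""

    def __init__(self, vocab, hidden=256, aspects=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.enc = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, aspects)

    def forward(self, tokens):
        _, hn = self.enc(self.embed(tokens))       # hn: (1, B, H)
        return torch.sigmoid(self.head(hn[-1]))    # one score per aspect
```

In a full system, triples decoded from the Reader's relation scores would form the KG whose node states condition the Writer, and the Reviewer's aspect scores could act as rewards or auxiliary losses; the released repository is the authoritative reference for the actual wiring.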
Location-aware Graph Convolutional Networks for Video Question Answering
We address the challenging task of video question answering, which requires
machines to answer natural-language questions about videos. Previous
state-of-the-art methods apply spatio-temporal attention mechanisms to video
frame features without explicitly modeling the locations of, and the
relations among, the object interactions occurring in videos. However, these
relations and their location information are critical for both action
recognition and question reasoning. In this work, we propose to
represent the video content as a location-aware graph by incorporating the
location information of each object into the graph construction. Here, each
node is associated with an object represented by its appearance and location
features. Based on the constructed graph, we propose to use graph convolution
to infer both the category and the temporal location of an action. Because
the graph is built on objects, our method can focus on the foreground action
content, which benefits video question answering. Lastly, we leverage an
attention mechanism
to combine the output of graph convolution and encoded question features for
final answer reasoning. Extensive experiments demonstrate the effectiveness of
the proposed method. Specifically, it significantly outperforms
state-of-the-art methods on the TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
Code and pre-trained models are publicly available at:
https://github.com/SunDoge/L-GC
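To make the pipeline concrete, here is a minimal, hypothetical PyTorch sketch of a location-aware graph built from per-object appearance and location features, a single graph-convolution layer, and question-guided attention over the node outputs. The overall structure follows the abstract; the dimensions, the similarity-based adjacency, and the fusion by concatenation are assumptions, not the released implementation.

```python
# Hypothetical sketch of a location-aware graph for video QA: nodes carry
# object appearance + location features, a graph convolution propagates
# information over objects, and attention fuses node outputs with the
# encoded question. All dimensions and the similarity-based adjacency are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAwareGCN(nn.Module):
    def __init__(self, app_dim=2048, loc_dim=5, hidden=512):
        super().__init__()
        # Node features concatenate appearance and location (e.g., a
        # normalized bounding box plus a frame index -> loc_dim = 5).
        self.proj = nn.Linear(app_dim + loc_dim, hidden)
        self.gcn = nn.Linear(hidden, hidden)  # one graph-conv layer

    def forward(self, app, loc):  # app: (B, N, app_dim), loc: (B, N, loc_dim)
        x = self.proj(torch.cat([app, loc], dim=-1))           # (B, N, H)
        # Adjacency from pairwise node similarity (assumed), row-normalized.
        adj = F.softmax(torch.bmm(x, x.transpose(1, 2)), dim=-1)
        return F.relu(self.gcn(torch.bmm(adj, x)))             # (B, N, H)


class QuestionGuidedAttention(nn.Module):
    """Combines graph-convolved node features with the question encoding."""

    def __init__(self, hidden=512):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, nodes, question):  # nodes: (B, N, H), question: (B, H)
        attn = F.softmax(
            self.score(torch.tanh(nodes + question.unsqueeze(1))), dim=1)
        pooled = (attn * nodes).sum(dim=1)             # (B, H)
        return torch.cat([pooled, question], dim=-1)   # answer-reasoning input


# Usage with random tensors (batch of 2 videos, 8 detected objects each):
app = torch.randn(2, 8, 2048)
loc = torch.randn(2, 8, 5)
q = torch.randn(2, 512)
nodes = LocationAwareGCN()(app, loc)
fused = QuestionGuidedAttention()(nodes, q)
print(fused.shape)  # torch.Size([2, 1024])
```

Building the graph over detected objects rather than whole frames is what lets the attention concentrate on foreground action content, as the abstract argues; the fused vector would feed whatever answer classifier or decoder the task requires.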