Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks
Open-ended video question answering aims to automatically generate a
natural-language answer to a given question from the referenced video
content. Currently, most existing approaches focus on short-form video
question answering with multi-modal recurrent encoder-decoder networks.
Although these works have achieved promising performance, they may still be
ill-suited to long-form video question answering because they lack
long-range dependency modeling and incur a heavy computational cost. To tackle
these problems, we propose a fast Hierarchical Convolutional Self-Attention
encoder-decoder network (HCSA). Concretely, we first develop a
hierarchical convolutional self-attention encoder to efficiently model
long-form video content; it builds a hierarchical structure over video
sequences and captures question-aware long-range dependencies from the video
context. We then devise a multi-scale attentive decoder that incorporates
multi-layer video representations for answer generation, which avoids the
information loss that comes from using only the top encoder layer. Extensive experiments show
the effectiveness and efficiency of our method.

Comment: Accepted by IJCAI 2019 as a poster paper
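
The abstract describes the encoder and decoder only at a high level, so the following is a minimal sketch of the general idea rather than the authors' architecture. It assumes a fixed smoothing kernel in place of learned convolutions, a simple additive question-relevance bias in the self-attention scores, pairwise average pooling to build the hierarchy, and plain averaging of per-layer contexts in the decoder. All function names (`conv1d_same`, `question_aware_self_attention`, `pool_pairs`, `encode`, `multi_scale_context`) and the toy inputs are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conv1d_same(seq, kernel):
    """Depthwise 1-D convolution over time with zero padding (odd kernel width)."""
    half = len(kernel) // 2
    dim = len(seq[0])
    out = []
    for t in range(len(seq)):
        vec = [0.0] * dim
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < len(seq):
                for d in range(dim):
                    vec[d] += w * seq[s][d]
        out.append(vec)
    return out


def question_aware_self_attention(seq, question):
    """Scaled dot-product self-attention with an additive question-relevance bias (assumed form)."""
    dim = len(seq[0])
    scale = math.sqrt(dim)
    relevance = [dot(v, question) / scale for v in seq]
    out = []
    for q in seq:
        scores = [dot(q, k) / scale + r for k, r in zip(seq, relevance)]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, seq)) for d in range(dim)])
    return out


def pool_pairs(seq):
    """Halve the sequence length by averaging adjacent time steps."""
    dim = len(seq[0])
    out = []
    for i in range(0, len(seq), 2):
        chunk = seq[i:i + 2]
        out.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return out


def encode(frames, question, num_layers=3, kernel=(0.25, 0.5, 0.25)):
    """Return one representation per layer; each layer is half as long as the previous one."""
    layers = []
    seq = frames
    for _ in range(num_layers):
        seq = conv1d_same(seq, kernel)                      # local context
        seq = question_aware_self_attention(seq, question)  # long-range context
        layers.append(seq)
        seq = pool_pairs(seq)                               # coarser scale for the next layer
    return layers


def multi_scale_context(layers, decoder_state):
    """Attend over every encoder layer and average the per-layer contexts."""
    dim = len(decoder_state)
    contexts = []
    for seq in layers:
        weights = softmax([dot(v, decoder_state) / math.sqrt(dim) for v in seq])
        contexts.append([sum(w * v[d] for w, v in zip(weights, seq)) for d in range(dim)])
    return [sum(c[d] for c in contexts) / len(contexts) for d in range(dim)]


# Toy inputs: 16 frames of 4-dimensional features and a 4-dimensional question vector.
frames = [[math.sin(0.1 * t + d) for d in range(4)] for t in range(16)]
question = [0.5, -0.5, 0.25, 0.0]
layers = encode(frames, question)
context = multi_scale_context(layers, question)
print([len(layer) for layer in layers])  # → [16, 8, 4]
```

Because every layer's output is kept, the decoder step can draw on fine-grained frames as well as coarse summaries, which is the motivation the abstract gives for attending over multiple encoder layers instead of the top layer alone.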