Search CORE

2 research outputs found

Video Question Answering on Screencast Tutorials

Author: Jin Hailin
Kim Seokhwan
Xu Ning
Zhao Wentian
Publication venue: 'International Joint Conferences on Artificial Intelligence'
Publication date: 02/08/2020
Field of study

This paper presents a new video question answering task on screencast tutorials. We introduce a dataset including question, answer and context triples from the tutorial videos for a software. Unlike other video question answering works, all the answers in our dataset are grounded to the domain knowledge base. An one-shot recognition algorithm is designed to extract the visual cues, which helps enhance the performance of video question answering. We also propose several baseline neural network architectures based on various aspects of video contexts from the dataset. The experimental results demonstrate that our proposed models significantly improve the question answering performances by incorporating multi-modal contexts and domain knowledge

arXiv.org e-Print Archive

Crossref

Spatio-temporal relational reasoning for video question answering

Author: Singh Gursimran
Publication venue: University of British Columbia Press
Publication date: 01/11/2019
Field of study

Video question answering is the task of automatically answering questions about videos. Apart from direct practical interest, it provides a good way to benchmark our progress on various tasks in video understanding. A successful algorithm must ground objects of interest and model relationships among them in both the spatial and temporal domains jointly. We show that the existing state-of-the-art approaches, which are based on Convolutional Neural Networks or Recurrent Neural Networks, are not effective at joint reasoning in both spatial and temporal domains. Moreover, they are short-sighted and struggle with long-range dependencies in videos. To address these challenges, we present a novel spatio-temporal reasoning neural module that models complex multi-entity relationships in space and long-term dependencies in time. Our model captures both time-changing object interactions and action dynamics of individual objects in an effective way. We evaluate our module on two benchmark datasets which require spatio-temporal reasoning: TGIF-QA and SVQA. We achieve state-of-the-art performance on both datasets. More significantly, we achieve substantial improvements in some of the most challenging question types, like counting, which demonstrate the effectiveness of our proposed spatio-temporal relational module.Science, Faculty ofComputer Science, Department ofGraduat

University of British Columbia: cIRcle - UBC's Information Repository