The Forgettable-Watcher Model for Video Question Answering
A number of visual question answering approaches have been proposed recently,
aiming at understanding the visual scenes by answering the natural language
questions. While image question answering has drawn significant attention,
video question answering is largely unexplored.
Video-QA is different from Image-QA since the information and the events are
scattered among multiple frames. In order to better utilize the temporal
structure of the videos and the phrasal structures of the answers, we propose
two mechanisms, re-watching and re-reading, and combine them
into the forgettable-watcher model. We then propose a TGIF-QA dataset for video
question answering with the help of automatic question generation. Finally, we
evaluate the models on our dataset. The experimental results show the
effectiveness of our proposed models.
Data augmentation by morphological mixup for solving Raven's Progressive Matrices
Raven's Progressive Matrices (RPMs) are frequently used to test human
visual reasoning ability. Recent advances in RPM-like datasets and solution
models partially address the challenges of visually understanding the RPM
questions and logically reasoning about the missing answers. In view of the poor
generalization performance due to insufficient samples in RPM datasets, we
propose an effective scheme, namely Candidate Answer Morphological Mixup
(CAM-Mix). CAM-Mix serves as a data augmentation strategy based on morphological
mixup of gray-scale images, which regularizes various solution methods and overcomes
the model overfitting problem. By creating new negative candidate answers
semantically similar to the correct answers, a more accurate decision boundary
can be defined. By applying the proposed data augmentation method, a
significant and consistent performance improvement is achieved on various
RPM-like datasets compared with the state-of-the-art models.
Comment: Under review
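
The abstract does not spell out the exact mixing operation, so the following is only a minimal sketch of what a gray-scale morphological mixup of candidate answers might look like. It assumes that the mix is a pixel-wise maximum (union) or minimum (intersection) of two images, and that shapes are drawn bright on a dark background. The function names `morphological_mixup` and `augment_candidates` are hypothetical and are not taken from the paper.

```python
import numpy as np


def morphological_mixup(correct, other, mode="union"):
    """Mix two gray-scale images pixel-wise (hypothetical helper).

    Assumption: "union" is the pixel-wise maximum and "intersection" is
    the pixel-wise minimum. The paper's actual operator may differ.
    """
    if correct.shape != other.shape:
        raise ValueError("images must have the same shape")
    if mode == "union":
        return np.maximum(correct, other)
    if mode == "intersection":
        return np.minimum(correct, other)
    raise ValueError("mode must be 'union' or 'intersection'")


def augment_candidates(candidates, correct_idx, negative_idx, mode="union"):
    """Replace one negative candidate with a mix of it and the correct answer.

    The mixed image shares content with the correct answer, so it acts as a
    harder negative. A real pipeline would also need to check that the mix
    is not identical to the correct answer; that check is omitted here.
    """
    mixed = morphological_mixup(
        candidates[correct_idx], candidates[negative_idx], mode
    )
    augmented = candidates.copy()  # leave the input array untouched
    augmented[negative_idx] = mixed
    return augmented


# Eight 4x4 candidate panels; index 0 plays the role of the correct answer.
candidates = np.zeros((8, 4, 4), dtype=np.uint8)
candidates[0, 0, 0] = 200
candidates[1, 1, 1] = 100
augmented = augment_candidates(candidates, correct_idx=0, negative_idx=1)
```

In this sketch the new negative at index 1 contains the bright pixels of both the correct answer and the original negative, while the correct answer itself stays unchanged.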