5,854 research outputs found

    Learning by Asking Questions

    We introduce an interactive learning framework for the development and testing of intelligent visual systems, called learning-by-asking (LBA). We explore LBA in the context of the Visual Question Answering (VQA) task. LBA differs from standard VQA training in that most questions are not observed during training time, and the learner must ask questions it wants answers to. Thus, LBA more closely mimics natural learning and has the potential to be more data-efficient than the traditional VQA setting. We present a model that performs LBA on the CLEVR dataset, and show that it automatically discovers an easy-to-hard curriculum when learning interactively from an oracle. Our LBA-generated data consistently matches or outperforms the CLEVR train data and is more sample-efficient. We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test-time distributions.
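
    A minimal sketch of the learning-by-asking loop described above, under stated assumptions: the learner proposes a question about an image, an oracle answers it, and the acquired (image, question, answer) triples are used to update the VQA model. The proposer, oracle, and update step below are toy placeholders, not the paper's implementation.

        # Hypothetical LBA interaction loop; all functions are illustrative stand-ins.
        import random

        def propose_question(image_id, model_state):
            # Placeholder proposer: the paper learns a generator that targets
            # questions the current model is uncertain about.
            templates = ["How many red objects are there?",
                         "Is there a cube left of the sphere?"]
            return random.choice(templates)

        def oracle_answer(image_id, question):
            # Placeholder oracle: on CLEVR, an executor can evaluate a generated
            # question's functional program to return the ground-truth answer.
            return "yes" if question.startswith("Is there") else str(random.randint(0, 3))

        def update_vqa_model(model_state, batch):
            # Placeholder training step; a real system would run gradient updates.
            model_state["updates"] += 1
            return model_state

        model_state = {"updates": 0}
        acquired = []                          # interactively collected triples
        for epoch in range(3):
            batch = []
            for image_id in range(5):          # images available for asking
                q = propose_question(image_id, model_state)
                a = oracle_answer(image_id, q)
                batch.append((image_id, q, a))
            acquired.extend(batch)
            model_state = update_vqa_model(model_state, batch)
        print(f"collected {len(acquired)} triples in {model_state['updates']} updates")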

    A reinforcement learning formulation to the complex question answering problem

    We use extractive multi-document summarization techniques to perform complex question answering, formulating it as a reinforcement learning problem. Given a set of complex questions, a list of relevant documents per question, and the corresponding human-generated summaries (i.e., answers to the questions) as training data, the reinforcement learning module iteratively learns a set of feature weights in order to facilitate the automatic generation of summaries, i.e., answers to previously unseen complex questions. A reward function is used to measure the similarity between the candidate (machine-generated) summary sentences and the abstract summaries. In the training stage, the learner iteratively selects the important document sentences to be included in the candidate summary, analyzes the reward function, and updates the related feature weights accordingly. The final weights are used to generate summaries as answers to unseen complex questions in the testing stage. Evaluation results show the effectiveness of our system. We also incorporate user interaction into the reinforcement learner to guide the candidate summary sentence selection process. Experiments reveal the positive impact of the user interaction component on the reinforcement learning framework.
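
    A hedged sketch of the sentence-selection loop described above, with assumed specifics: a linear scorer over simple sentence features picks sentences for the candidate summary, a word-overlap reward against the human summary is computed, and the feature weights are nudged toward high-reward selections. The features, reward, and update rule are illustrative, not the authors' exact formulation.

        # Illustrative reward-driven feature-weight learning for summary-as-answer selection.
        def features(sentence, question):
            words_s, words_q = set(sentence.lower().split()), set(question.lower().split())
            overlap = len(words_s & words_q) / max(len(words_q), 1)   # question overlap
            return [overlap, len(words_s) / 30.0]                     # plus a length feature

        def reward(sentence, gold_summary):
            words_s, words_g = set(sentence.lower().split()), set(gold_summary.lower().split())
            return len(words_s & words_g) / max(len(words_g), 1)      # unigram-recall proxy

        def train(examples, epochs=10, lr=0.1):
            weights = [0.0, 0.0]
            for _ in range(epochs):
                for question, sentences, gold in examples:
                    scored = [(sum(w * f for w, f in zip(weights, features(s, question))), s)
                              for s in sentences]
                    _, best = max(scored)                             # greedy sentence selection
                    r = reward(best, gold)
                    for i, f in enumerate(features(best, question)):
                        weights[i] += lr * r * f                      # reward-weighted update
            return weights

        examples = [("what causes rain",
                     ["Rain forms when water vapor condenses.", "The festival starts in May."],
                     "Rain is caused by condensation of water vapor in clouds.")]
        print(train(examples))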

    Plenty is Plague: Fine-Grained Learning for Visual Question Answering

    The paper from Professor Rongrong Ji's team proposes FG-A1C, a fine-grained, reinforcement-learning-based learning strategy that analyzes the difficulty diversity of samples and the redundancy of labels in visual question answering in order to selectively pick training samples, improving training efficiency and reducing labeling cost. The work is a collaboration between postdoctoral assistant researcher Yiyi Zhou, Professor Rongrong Ji (corresponding author), Associate Professor Xiaoshuai Sun, and Associate Professor Jinsong Su of the Media Analytics and Computing Lab at Xiamen University, together with Professor Deyu Meng of Xi'an Jiaotong University, Associate Professor Yue Gao of Tsinghua University, and Professor Chunhua Shen of the University of Adelaide, Australia.

    Visual Question Answering (VQA) has attracted extensive research focus recently. Along with the ever-increasing data scale and model complexity, the enormous training cost has become an emerging challenge for VQA. In this paper, we show that such a massive training cost is indeed a plague. In contrast, a fine-grained design of the learning paradigm can be extremely beneficial in terms of both training efficiency and model accuracy. In particular, we argue that there exist two essential and unexplored issues in the existing VQA training paradigm that randomly samples data in each epoch, namely, "difficulty diversity" and "label redundancy". Concretely, "difficulty diversity" refers to the varying difficulty levels of different question types, while "label redundancy" refers to the redundant and noisy labels contained in individual question types. To tackle these two issues, we propose a fine-grained VQA learning paradigm with an actor-critic based learning agent, termed FG-A1C. Instead of using all training data from scratch, FG-A1C includes a learning agent that adaptively and intelligently schedules the most difficult question types in each training epoch. Subsequently, two curriculum-learning-based schemes are further designed to identify the most useful data to be learned within each individual question type. We conduct extensive experiments on the VQA2.0 and VQA-CP v2 datasets, which demonstrate the significant benefits of our approach. For instance, on VQA-CP v2, with less than 75% of the training data, our learning paradigm helps the model achieve better performance than using the whole dataset. Meanwhile, we also show the effectiveness of our method in guiding data labeling. Finally, the proposed paradigm can be seamlessly integrated with any cutting-edge VQA model without modifying its structure.

    This work is supported by the National Key R&D Program (No. 2017YFC0113000 and No. 2016YFB1001503), the Natural Science Foundation of China (No. U1705262, No. 61772443, and No. 61572410), the Post-Doctoral Innovative Talent Support Program under Grant BX201600094, the China Post-Doctoral Science Foundation under Grant 2017M612134, the Scientific Research Project of the National Language Committee of China (Grant No. YB135-49), and the Natural Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106). This research was also supported by Xiamen University's "Artificial Intelligence Analysis Engine" Double First-Class major project, the National Key R&D Program, and the Strait Joint Fund of the National Natural Science Foundation of China.
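
    A speculative, simplified sketch of the scheduling idea summarized above: a softmax policy over question types (the actor) chooses which type to train on, a scalar running baseline (standing in for the critic) estimates the expected reward, and the observed accuracy gain drives a policy-gradient update. The toy environment, reward, and learning rates below are placeholders, not the FG-A1C implementation.

        # Toy actor-critic-style scheduler over VQA question types.
        import math, random

        question_types = ["count", "yes/no", "color", "shape"]
        logits = [0.0] * len(question_types)   # actor parameters
        baseline = 0.0                         # running reward estimate (critic stand-in)
        accuracy = {t: random.uniform(0.3, 0.6) for t in question_types}

        def softmax(xs):
            m = max(xs)
            exps = [math.exp(x - m) for x in xs]
            z = sum(exps)
            return [e / z for e in exps]

        for epoch in range(50):
            probs = softmax(logits)
            idx = random.choices(range(len(question_types)), weights=probs)[0]
            t = question_types[idx]
            gain = (1.0 - accuracy[t]) * 0.1   # placeholder reward: harder types improve more
            accuracy[t] += gain
            advantage = gain - baseline        # advantage relative to the baseline
            for i in range(len(logits)):       # REINFORCE-with-baseline update of the actor
                grad = (1.0 if i == idx else 0.0) - probs[i]
                logits[i] += 0.5 * advantage * grad
            baseline += 0.1 * (gain - baseline)

        print({t: round(p, 2) for t, p in zip(question_types, softmax(logits))})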

    Move Forward and Tell: A Progressive Generator of Video Descriptions

    We present an efficient framework that can generate a coherent paragraph to describe a given video. Previous works on video captioning usually focus on video clips. They typically treat an entire video as a whole and generate the caption conditioned on a single embedding. On the contrary, we consider videos with rich temporal structures and aim to generate paragraph descriptions that can preserve the story flow while being coherent and concise. Towards this goal, we propose a new approach, which produces a descriptive paragraph by assembling temporally localized descriptions. Given a video, it selects a sequence of distinctive clips and generates sentences thereon in a coherent manner. Particularly, the selection of clips and the production of sentences are done jointly and progressively, driven by a recurrent network: what to describe next depends on what has been said before. Here, the recurrent network is learned via self-critical sequence training with both sentence-level and paragraph-level rewards. On the ActivityNet Captions dataset, our method demonstrates the capability of generating high-quality paragraph descriptions for videos. Compared to those by other methods, the descriptions produced by our method are often more relevant, more coherent, and more concise. Comment: Accepted by ECCV 2018.
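
    A minimal sketch of the self-critical sequence training signal mentioned above, under assumptions: the reward of a sampled caption is baselined by the reward of the greedily decoded caption, and the difference acts as the advantage in the policy-gradient update. The toy unigram-overlap reward stands in for the paper's sentence-level and paragraph-level rewards; all names are illustrative.

        # Illustrative self-critical advantage computation for caption training.
        def overlap_reward(candidate, reference):
            c, r = set(candidate.split()), set(reference.split())
            return len(c & r) / max(len(r), 1)          # crude stand-in for CIDEr-like rewards

        def scst_advantage(sampled, greedy, reference):
            # Positive when the sampled caption beats the greedy baseline; this
            # advantage scales the log-probability term in the policy-gradient update.
            return overlap_reward(sampled, reference) - overlap_reward(greedy, reference)

        reference = "a man rides a bike down the street"
        print(scst_advantage("a man rides a bike", "a person is outside", reference))  # > 0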