
    QUESTION ANSWERING, GROUNDING, AND GENERATION FOR VISION AND LANGUAGE

    One ultimate goal of AI is to develop an artificially intelligent system that can communicate with people in a natural way. Such communication includes, but is not limited to, asking humans questions, answering their questions, conducting dialogue with them, and performing actions to better serve them. Imagine a future where service robots are everywhere, and we could ask our home robot to “grab me the red cup on the table.” To carry out this command, the AI system needs to understand the spoken English sentence, perceive the visual world, navigate to the right place (“table”), recognize the right object (“the red cup”), grab it, and finally bring it back to the person who asked. This single command already involves many techniques, such as speech recognition, language understanding, scene understanding, embodied navigation, object recognition, pose estimation, and robot manipulation. None of these techniques is well solved yet, but progress is rapid. This thesis advances our knowledge of the connections between vision, language, and beyond in order to push toward this ultimate goal. We study three popular vision and language tasks: visual question answering, language grounding, and image-to-text language generation. Within each, we introduce a novel task, accompanied by a high-quality dataset and well-performing data-driven approaches. Specifically, we first introduce Visual Madlibs for image-based and region-based question answering. Then we introduce referring expressions, where we study both referring expression comprehension and generation, covering both language grounding and generation. Next, we study album summarization, which not only selects the key photos inside an album but also generates a natural language story describing the whole album. Last but not least, we describe multi-target embodied question answering, a task that is even closer to our ultimate goal and requires both language understanding and navigation ability from the AI system. Doctor of Philosoph

    Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

    Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams remains under-explored, even though it is challenging and necessary for LLMs to understand general video inputs. To this end, this paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module that enhances the capture of causal relations among the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed, which comprises six representative single-modal tasks and five cross-modal tasks reflecting audio-visual co-reasoning abilities. While achieving competitive single-modal performance on audio, speech and image tasks in AVEB, FAVOR achieved over 20% accuracy improvements on the video question-answering task when fine-grained information or temporal causal reasoning is required. In addition, FAVOR demonstrated remarkable video comprehension and reasoning abilities on tasks not previously addressed by other multimodal LLMs. An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.

    Follow-up question handling in the IMIX and Ritel systems: A comparative study

    One of the basic topics of question answering (QA) dialogue systems is how follow-up questions should be interpreted by a QA system. In this paper, we discuss our experience with the IMIX and Ritel systems, for both of which a follow-up question handling scheme has been developed and corpora have been collected. These two systems are each other's opposites in many respects: IMIX is multimodal, non-factoid, black-box QA, while Ritel is speech, factoid, keyword-based QA. Nevertheless, we show that they are quite comparable, and that it is fruitful to examine the similarities and differences. We look at how the systems are composed, and how real, non-expert users interact with them. We also provide comparisons with systems from the literature where possible, and indicate where open issues lie and in what areas existing systems may be improved. We conclude that most systems have a common architecture with a set of common subtasks, in particular detecting follow-up questions and finding referents for them. We characterise these tasks using the typical techniques used for performing them, and data from our corpora. We also identify a special type of follow-up question, the discourse question, which is asked when the user is trying to understand an answer, and propose some basic methods for handling it.

    Object Referring in Visual Scene with Spoken Language

    Object referring has important applications, especially for human-machine interaction. Although the task has received great attention, it is mainly tackled with written language (text) as input rather than spoken language (speech), which is more natural. This paper investigates Object Referring with Spoken Language (ORSpoken) by presenting two datasets and one novel approach. Objects are annotated with their locations in images, text descriptions and speech descriptions, which makes the datasets well suited to multi-modality learning. The approach is developed by carefully breaking the ORSpoken problem down into three sub-problems and introducing task-specific vision-language interactions at the corresponding levels. Experiments show that our method outperforms competing methods consistently and significantly. The approach is also evaluated in the presence of audio noise, showing the efficacy of the proposed vision-language interaction methods in counteracting background noise. Comment: 10 pages, Submitted to WACV 201

    Supervised and Unsupervised Transfer Learning for Question Answering

    Although transfer learning has been shown to be successful for tasks like object and speech recognition, its applicability to question answering (QA) has yet to be well studied. In this paper, we conduct extensive experiments to investigate the transferability of knowledge learned from a source QA dataset to a target dataset using two QA models. The performance of both models on a TOEFL listening comprehension test (Tseng et al., 2016) and MCTest (Richardson et al., 2013) is significantly improved via a simple transfer learning technique from MovieQA (Tapaswi et al., 2016). In particular, one of the models achieves the state of the art on all target datasets; for the TOEFL listening comprehension test, it outperforms the previous best model by 7%. Finally, we show that transfer learning is helpful even in unsupervised scenarios when correct answers for target QA dataset examples are not available. Comment: To appear in NAACL HLT 2018 (long paper