21,295 research outputs found

    Learning to Interpret and Apply Multimodal Descriptions

    Get PDF
    Han T. Learning to Interpret and Apply Multimodal Descriptions. Bielefeld: Universität Bielefeld; 2018. Enabling computers to understand natural human communication is a goal researchers in artificial intelligence have long aspired to. Since the concept demonstration of “Put-That-There” in the 1980s, significant achievements have been made in developing multimodal interfaces that can process human communication such as speech, eye gaze, facial emotion, co-verbal hand gestures and pen input. State-of-the-art multimodal interfaces are able to process pointing gestures, symbolic gestures with conventional meanings, as well as gesture commands with pre-defined meanings (e.g., circling for “select”). However, in natural communication, co-verbal gestures/pen input rarely convey meanings via conventions or pre-defined rules, but embody meanings relatable to the accompanying speech. For example, in route-giving tasks, people often describe landmarks verbally (e.g., two buildings) while demonstrating their relative position with two hands facing each other in space. Interestingly, when the same gesture is accompanied by the utterance a ball, it may indicate the size of the ball. Hence, the interpretation of such co-verbal hand gestures largely depends on the accompanying verbal content. Similarly, when describing objects, verbal utterances are most convenient for describing colour and category (e.g., a brown elephant), while hand-drawn sketches are often deployed to convey iconic information such as the exact shape of the elephant’s trunk, which is typically difficult to encode in language. This dissertation concerns the task of learning to interpret multimodal descriptions composed of verbal utterances and hand gestures/sketches, and to apply the corresponding interpretations to tasks such as image retrieval. Specifically, we aim to address the following research questions: 1) For co-verbal gestures that embody meanings relatable to the accompanying verbal content, how can we use natural language information to interpret the semantics of such co-verbal gestures, e.g., does a gesture indicate relative position or size? 2) As an integral system of communication, speech and gestures bear not only close semantic relations but also close temporal relations. To what degree and on which dimensions can hand gestures benefit the task of interpreting multimodal descriptions? 3) While it is obvious that iconic information in hand-drawn sketches enriches verbal content in object descriptions, how can we model the joint contributions of such multimodal descriptions, and to what degree can verbal descriptions compensate for reduced iconic detail in hand-drawn sketches? To address the above questions, we first introduce three multimodal description corpora: a spatial description corpus composed of natural language and placing gestures (also referred to as abstract deictics), a multimodal object description corpus composed of natural language and hand-drawn sketches, and an existing corpus, the Bielefeld Speech and Gesture Alignment Corpus (SAGA). We frame the problem of learning gesture semantics as a multi-label classification task using natural language information and hand gesture features, and conduct an experiment with the SAGA corpus. The results show that natural language is informative for the interpretation of hand gestures. Furthermore, we describe a system that models the interpretation and application of spatial descriptions, and explore three variants of representing the verbal content.
When the verbal content of the descriptions is represented with a set of automatically learned symbols, the system’s performance is on par with representations based on manually defined symbols (e.g., pre-defined object properties). We show that abstract deictic gestures not only lead to better understanding of spatial descriptions, but also result in earlier correct decisions by the system, which can be used to trigger immediate reactions in dialogue systems. Finally, we investigate the interplay of semantics between the symbolic (natural language) and iconic (sketches) modes in multimodal object descriptions, where natural language and sketches jointly contribute to the communication. We model the meaning of natural language and sketches with two existing models and combine the meanings from both modalities with a late fusion approach. The results show that even adding reduced sketches (30% of the full sketches) helps in the retrieval task. Moreover, in the current setup, natural language descriptions can compensate for around 30% of the reduction in sketches.
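The late-fusion idea described above can be illustrated with a small sketch. The snippet below is a hypothetical, minimal example (not the dissertation’s code): it assumes each modality has already produced a relevance score per candidate image, and fuses the normalised scores with a weighted sum before ranking.

```python
# Minimal late-fusion sketch: each modality scores the candidates on its own,
# and the fused score is a weighted combination computed afterwards ("late").
import numpy as np

def late_fusion(text_scores: np.ndarray,
                sketch_scores: np.ndarray,
                alpha: float = 0.5) -> np.ndarray:
    """Combine per-candidate scores from two modalities.

    alpha weights the language modality; (1 - alpha) weights the sketch
    modality. Both score vectors are min-max normalised first so the
    weighting is not dominated by differing score ranges.
    """
    def normalise(s: np.ndarray) -> np.ndarray:
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

    return alpha * normalise(text_scores) + (1 - alpha) * normalise(sketch_scores)

# Toy usage: scores for 4 candidate images from each modality.
text_scores = np.array([0.1, 0.7, 0.4, 0.9])
sketch_scores = np.array([0.3, 0.2, 0.8, 0.6])
fused = late_fusion(text_scores, sketch_scores, alpha=0.6)
print("Ranked candidates:", np.argsort(-fused))
```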

    MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network

    Full text link
    The inability to interpret model predictions in semantically and visually meaningful ways is a well-known shortcoming of most existing computer-aided diagnosis methods. In this paper, we propose MDNet to establish a direct multimodal mapping between medical images and diagnostic reports that can read images, generate diagnostic reports, retrieve images by symptom descriptions, and visualize attention, providing justifications of the network’s diagnosis process. MDNet includes an image model and a language model. The image model is proposed to enhance multi-scale feature ensembles and utilization efficiency. The language model, integrated with our improved attention mechanism, aims to read and explore discriminative image feature descriptions from reports to learn a direct mapping from sentence words to image pixels. The overall network is trained end-to-end using our developed optimization strategy. Based on a dataset of pathology bladder cancer images and their diagnostic reports (BCIDR), we conduct extensive experiments to demonstrate that MDNet outperforms comparative baselines. The proposed image model also obtains state-of-the-art performance on two CIFAR datasets. Comment: CVPR 2017 Oral
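As a rough illustration of the pattern the abstract describes, an image model feeding an attention-based language model, the sketch below pairs a small convolutional encoder with an LSTM decoder that attends over image regions at each word step. This is not MDNet’s actual architecture; the module names, layer sizes, and vocabulary are illustrative assumptions.

```python
# Hedged sketch of a CNN encoder + attention-based report decoder (assumed
# architecture for illustration only, not the MDNet implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Stand-in for the image model: maps an image to a grid of region features."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, images):                      # (B, 3, H, W)
        feats = self.conv(images)                   # (B, D, H', W')
        return feats.flatten(2).transpose(1, 2)     # (B, H'*W', D) region grid

class AttentionDecoder(nn.Module):
    """Stand-in for the language model: attends over image regions per word."""
    def __init__(self, vocab_size: int, feat_dim: int = 256, hid: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid)
        self.rnn = nn.LSTMCell(hid + feat_dim, hid)
        self.att = nn.Linear(hid + feat_dim, 1)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, regions, tokens):             # regions: (B, R, D)
        B, R, _ = regions.shape
        h = regions.new_zeros(B, self.rnn.hidden_size)
        c = regions.new_zeros(B, self.rnn.hidden_size)
        logits = []
        for t in range(tokens.size(1)):
            # attention weights over regions, conditioned on the hidden state
            scores = self.att(torch.cat(
                [h.unsqueeze(1).expand(B, R, -1), regions], dim=-1)).squeeze(-1)
            ctx = (F.softmax(scores, dim=-1).unsqueeze(-1) * regions).sum(1)
            h, c = self.rnn(torch.cat([self.embed(tokens[:, t]), ctx], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)            # (B, T, vocab)

# Toy usage: one 64x64 image and a 5-token report prefix.
enc, dec = ImageEncoder(), AttentionDecoder(vocab_size=1000)
regions = enc(torch.randn(1, 3, 64, 64))
logits = dec(regions, torch.randint(0, 1000, (1, 5)))
print(logits.shape)  # torch.Size([1, 5, 1000])
```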

    Multimodal Grounding for Language Processing

    Get PDF
    This survey discusses how recent developments in multimodal processing facilitate conceptual grounding of language. We categorize the information flow in multimodal processing with respect to cognitive models of human information processing and analyze different methods for combining multimodal representations. Based on this methodological inventory, we discuss the benefit of multimodal grounding for a variety of language processing tasks and the challenges that arise. We particularly focus on multimodal grounding of verbs, which play a crucial role in the compositional power of language. Comment: The paper has been published in the Proceedings of the 27th International Conference on Computational Linguistics. Please refer to that version for citations: https://www.aclweb.org/anthology/papers/C/C18/C18-1197
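To make the notion of combining multimodal representations concrete, the following sketch (illustrative only, not taken from the survey) contrasts two common fusion strategies for a text embedding and a visual embedding: plain concatenation followed by a projection, and a learned gate that mixes the two modalities per dimension. The dimensions and module names are assumptions.

```python
# Two simple ways of fusing a text vector with an image vector into one
# grounded representation: concatenation vs. a learned gated mixture.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Concatenate both embeddings and project to a shared space."""
    def __init__(self, text_dim: int, img_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(text_dim + img_dim, out_dim)

    def forward(self, text_vec, img_vec):
        return self.proj(torch.cat([text_vec, img_vec], dim=-1))

class GatedFusion(nn.Module):
    """Learn, per output dimension, how much visual information to mix in."""
    def __init__(self, text_dim: int, img_dim: int, out_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, out_dim)
        self.img_proj = nn.Linear(img_dim, out_dim)
        self.gate = nn.Linear(text_dim + img_dim, out_dim)

    def forward(self, text_vec, img_vec):
        g = torch.sigmoid(self.gate(torch.cat([text_vec, img_vec], dim=-1)))
        return g * self.text_proj(text_vec) + (1 - g) * self.img_proj(img_vec)

# Toy usage with random embeddings, e.g. for one verb in visual context.
text_vec, img_vec = torch.randn(1, 300), torch.randn(1, 512)
print(ConcatFusion(300, 512, 256)(text_vec, img_vec).shape)  # torch.Size([1, 256])
print(GatedFusion(300, 512, 256)(text_vec, img_vec).shape)   # torch.Size([1, 256])
```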

    Picturing another culture: Developing language proficiency, empathy, and visual literacy through art

    Full text link
    Integrating art (paintings, sculptures, photography, and other types of images) into second language (L2) instruction can have a positive effect on language acquisition and the development of intercultural understanding. As the instructor provides visual scaffolding, learners at all proficiency levels have the opportunity to engage more deeply with L2 course materials. Not only can students learn to interpret imagery and create their own effective combinations of visuals and texts, they also develop some familiarity with seminal artwork from the target culture. This article outlines how spiraling art through a language curriculum can aid vocabulary retention, illustrate poetic language, and raise awareness of diversity and inclusion. As a result, learners investigate and interact with the products, practices, and perspectives of the L2 culture while simultaneously developing visual analysis skills - the latter essential in an age in which both authentic and digitally manipulated imagery dominate the media and social discourse. Accepted manuscript