10,218 research outputs found

    Multimodal knowledge capture from text and diagrams

    Many information sources use multiple modalities, such as textbooks, which contain both text and diagrams. Each captures information that is hard to express in the other, and evidence suggests that multimodal information leads to better retention and transfer in human learners. This paper describes a system that captures textbook knowledge, using simplified English text and sketched versions of diagrams. We present experimental results showing it can use captured knowledge to answer questions from the textbook’s curriculum. Categories and Subject Descriptors: I.2.4 Knowledge Representation Formalisms and Methods
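
    As a rough illustration of the capture-and-answer idea in this abstract (not the paper's actual system), captured statements can be held as subject-relation-object facts and a curriculum-style question answered by lookup. Every fact, name and relation below is hypothetical.

    # Minimal sketch, assuming captured textbook knowledge is stored as
    # subject-relation-object facts; the facts themselves are illustrative.
    facts = {
        ("photosynthesis", "requires", "sunlight"),
        ("photosynthesis", "requires", "carbon dioxide"),
        ("photosynthesis", "produces", "oxygen"),
    }

    def answer(subject, relation):
        """Return all objects linked to `subject` by `relation`."""
        return [o for s, r, o in facts if s == subject and r == relation]

    print(answer("photosynthesis", "produces"))  # ['oxygen']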

    Towards a Unified Knowledge-Based Approach to Modality Choice

    This paper advances a unified knowledge-based approach to the process of choosing the most appropriate modality or combination of modalities in multimodal output generation. We propose a Modality Ontology (MO) that models the knowledge needed to support the two most fundamental processes determining modality choice – modality allocation (choosing the modality or set of modalities that can best support a particular type of information) and modality combination (selecting an optimal final combination of modalities). In the proposed ontology we model the main levels which collectively determine the characteristics of each modality, and the specific relationships between different modalities that are important for multimodal meaning-making. This ontology aims to support the automatic selection of modalities and combinations of modalities that are suitable to convey the meaning of the intended message.
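
    The two processes named in the abstract, modality allocation and modality combination, can be pictured with a small rule table. The plain-Python sketch below is only an illustration under assumed information types, modality names and compatibility rules; it is not the paper's Modality Ontology.

    # Illustrative sketch: information types, modalities and compatibility
    # rules are assumptions, not taken from the MO ontology.
    ALLOCATION = {
        "spatial_layout": ["diagram", "map"],
        "precise_values": ["table", "speech"],
        "urgent_alert": ["sound", "speech"],
    }

    COMPATIBLE = {("diagram", "speech"), ("table", "speech"), ("map", "sound")}

    def allocate(info_type):
        """Modality allocation: candidate modalities for one information type."""
        return ALLOCATION.get(info_type, [])

    def combine(modalities):
        """Modality combination: accept only pairwise-compatible selections."""
        pairs = {(a, b) for a in modalities for b in modalities if a != b}
        return all((a, b) in COMPATIBLE or (b, a) in COMPATIBLE for a, b in pairs)

    print(allocate("spatial_layout"))      # ['diagram', 'map']
    print(combine(["diagram", "speech"]))  # True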

    Visual Question Answering: A Survey of Methods and Datasets

    Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
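
    The "common feature space" approach mentioned in the abstract is often realized as a joint-embedding baseline: pooled CNN image features and an RNN encoding of the question are projected to the same dimensionality and fused before answer classification. The PyTorch sketch below is a generic illustration of that pattern; all dimensions and layer choices are assumptions, not a specific surveyed model.

    import torch
    import torch.nn as nn

    class JointEmbeddingVQA(nn.Module):
        """Generic joint-embedding VQA baseline (illustrative only)."""
        def __init__(self, vocab_size, num_answers,
                     img_dim=2048, embed_dim=300, hidden_dim=512, common_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)               # question word embeddings
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # recurrent question encoder
            self.img_proj = nn.Linear(img_dim, common_dim)                 # CNN features -> common space
            self.txt_proj = nn.Linear(hidden_dim, common_dim)              # question state -> common space
            self.classifier = nn.Linear(common_dim, num_answers)           # scores over candidate answers

        def forward(self, img_feats, question_tokens):
            # img_feats: (B, img_dim) pooled CNN features; question_tokens: (B, T) token ids
            _, (h_n, _) = self.lstm(self.embed(question_tokens))
            q = torch.tanh(self.txt_proj(h_n[-1]))    # (B, common_dim)
            v = torch.tanh(self.img_proj(img_feats))  # (B, common_dim)
            return self.classifier(q * v)             # element-wise fusion, then classify

    model = JointEmbeddingVQA(vocab_size=10000, num_answers=1000)
    scores = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))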

    Tac-tiles: multimodal pie charts for visually impaired users

    Tac-tiles is an accessible interface that allows visually impaired users to browse graphical information using tactile and audio feedback. The system uses a graphics tablet which is augmented with a tangible overlay tile to guide user exploration. Dynamic feedback is provided by a tactile pin-array at the fingertips, and through speech/non-speech audio cues. In designing the system, we seek to preserve the affordances and metaphors of traditional, low-tech teaching media for the blind, and combine this with the benefits of a digital representation. Traditional tangible media allow rapid, non-sequential access to data, promote easy and unambiguous access to resources such as axes and gridlines, allow the use of external memory, and preserve visual conventions, thus promoting collaboration with sighted colleagues. A prototype system was evaluated with visually impaired users, and recommendations for multimodal design were derived.

    Fourteenth Biennial Status Report: March 2017 - February 2019


    AI2D-RST : A multimodal corpus of 1000 primary school science diagrams

    This article introduces AI2D-RST, a multimodal corpus of 1000 English-language diagrams that represent topics in primary school natural sciences, such as food webs, life cycles, moon phases and human physiology. The corpus is based on the Allen Institute for Artificial Intelligence Diagrams (AI2D) dataset, a collection of diagrams with crowdsourced descriptions, which was originally developed to support research on automatic diagram understanding and visual question answering. Building on the segmentation of diagram layouts in AI2D, the AI2D-RST corpus presents a new multi-layer annotation schema that provides a rich description of their multimodal structure. Annotated by trained experts, the layers describe (1) the grouping of diagram elements into perceptual units, (2) the connections set up by diagrammatic elements such as arrows and lines, and (3) the discourse relations between diagram elements, which are described using Rhetorical Structure Theory (RST). Each annotation layer in AI2D-RST is represented using a graph. The corpus is freely available for research and teaching.
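
    Since each AI2D-RST annotation layer is represented as a graph, a natural way to work with such data is through a graph library. The networkx sketch below shows the general idea for a food-web diagram, with made-up node identifiers and relation labels rather than the corpus's actual schema.

    import networkx as nx

    # Connectivity layer: diagram elements joined by an arrow (ids/kinds are illustrative)
    connectivity = nx.DiGraph()
    connectivity.add_node("rabbit", kind="blob")    # pictorial element
    connectivity.add_node("fox", kind="blob")
    connectivity.add_node("arrow_1", kind="arrow")  # diagrammatic connector
    connectivity.add_edge("rabbit", "arrow_1")      # arrow runs rabbit -> fox
    connectivity.add_edge("arrow_1", "fox")

    # Discourse layer: RST-style relation between a label and a pictorial element
    discourse = nx.DiGraph()
    discourse.add_node("label_1", kind="text")
    discourse.add_node("fox", kind="blob")
    discourse.add_edge("label_1", "fox", relation="identification")

    print(connectivity.nodes(data=True))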