
    On intelligible multimodal visual analysis

    Analyzing data is becoming an important skill in an increasingly digital world. Yet many users face knowledge barriers that prevent them from independently conducting their own data analysis. To tear down some of these barriers, multimodal interaction for visual analysis has been proposed. Multimodal interaction through speech and touch enables not only experts but also novice users to interact effortlessly with this kind of technology. However, current approaches do not take user differences into account; whether a visual analysis is intelligible ultimately depends on the user. To close this research gap, this dissertation explores how multimodal visual analysis can be personalized, taking a holistic view. First, an intelligible task space of visual analysis tasks is defined with personalization potentials in mind. This task space provides an initial basis for understanding how effective personalization in visual analysis can be approached. Second, empirical analyses of speech commands in visual analysis and of visualizations used in scientific publications reveal further patterns and structures. These behavior-based findings help to better understand expectations towards multimodal visual analysis. Third, a technical prototype is designed on the basis of the previous findings. Enriching the visual analysis with a persistent dialogue and with transparency about the underlying computations, the conducted user studies show not only advantages but also underline the relevance of considering the user's characteristics. Finally, both communication channels, visualizations and dialogue, are personalized. Leveraging linguistic theory and reinforcement learning, the results highlight a positive effect of adjusting to the user: especially when the user's knowledge is exceeded, personalization helps to improve the user experience. Overall, this dissertation not only confirms the importance of considering the user's characteristics in multimodal visual analysis but also provides insights into how an intelligible analysis can be achieved. By understanding the use of input modalities, a system can focus on the user's needs; by understanding preferences for the output modalities, it can better adapt to the user. Combining both directions improves the user experience and contributes towards an intelligible multimodal visual analysis.
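
    The final step pairs reinforcement learning with user characteristics to adapt the system's output. As a rough illustration of that idea (not the dissertation's actual method), the Python sketch below uses an epsilon-greedy contextual bandit to learn which presentation style suits which knowledge level; the arm names, contexts, and simulated reward model are all invented for the example.

        import random

        # Hypothetical presentation styles the system can adapt between.
        ARMS = ["basic_chart", "annotated_chart", "advanced_chart"]

        class EpsilonGreedyPersonalizer:
            """Epsilon-greedy bandit: per user-knowledge context, learn which
            presentation style earns the best feedback."""

            def __init__(self, contexts, arms, epsilon=0.1):
                self.epsilon = epsilon
                self.arms = arms
                # Running reward estimates and pull counts per context.
                self.values = {c: {a: 0.0 for a in arms} for c in contexts}
                self.counts = {c: {a: 0 for a in arms} for c in contexts}

            def choose(self, context):
                if random.random() < self.epsilon:
                    return random.choice(self.arms)       # explore
                estimates = self.values[context]
                return max(estimates, key=estimates.get)  # exploit

            def update(self, context, arm, reward):
                self.counts[context][arm] += 1
                n = self.counts[context][arm]
                # Incremental mean update of the reward estimate.
                self.values[context][arm] += (reward - self.values[context][arm]) / n

        personalizer = EpsilonGreedyPersonalizer(["novice", "expert"], ARMS)
        for _ in range(1000):
            user = random.choice(["novice", "expert"])
            arm = personalizer.choose(user)
            # Simulated feedback: novices prefer simpler output, experts richer.
            reward = 1.0 if (user, arm) in {("novice", "basic_chart"),
                                            ("expert", "advanced_chart")} else 0.0
            personalizer.update(user, arm, reward)

        print(personalizer.values["novice"])  # basic_chart should score highest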

    A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System

    Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system, handling the tasks of understanding the user by capturing the necessary information in the form of slots and of generating an appropriate response in accordance with the extracted information. Recently, dialogue systems integrated with complementary information such as images, audio, or video have gained immense popularity. In this work, we propose an end-to-end framework with the capability to extract the necessary slot values from the utterance and generate a coherent response, thereby assisting the user to achieve their desired goals in a multimodal dialogue system containing both textual and visual information. The task of extracting the necessary information depends not only on the text but also on the visual cues present in the dialogue. Similarly, for generation, the previous dialogue context comprising multimodal information is significant for providing coherent and informative responses. We employ a multimodal hierarchical encoder using pre-trained DialoGPT and also exploit the knowledge base (Kb) to provide a stronger context for both tasks. We then design a slot attention mechanism to focus on the necessary information in a given utterance. Lastly, a decoder generates the corresponding response for the given dialogue context and the extracted slot values. Experimental results on the Multimodal Dialogue Dataset (MMD) show that the proposed framework outperforms the baseline approaches in both tasks. The code is available at https://github.com/avinashsai/slot-gpt.
    Comment: Published in the journal Multimedia Tools and Applications.
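
    To make the slot attention idea concrete, here is a minimal PyTorch sketch (not the paper's implementation): each slot type gets a learned query vector that attends over the encoder's token states, yielding one slot-specific context vector. The dimensions and the random tensor standing in for DialoGPT hidden states are illustrative assumptions.

        import torch
        import torch.nn as nn

        class SlotAttention(nn.Module):
            """Toy dot-product attention: each slot query attends over the
            encoded utterance tokens to build a slot-specific context vector."""

            def __init__(self, hidden_dim, num_slots):
                super().__init__()
                # One learned query per slot type (e.g., color, material).
                self.slot_queries = nn.Parameter(torch.randn(num_slots, hidden_dim))
                self.scale = hidden_dim ** -0.5

            def forward(self, token_states):
                # token_states: (batch, seq_len, hidden_dim) from the encoder.
                scores = torch.einsum("sd,btd->bst", self.slot_queries, token_states)
                weights = torch.softmax(scores * self.scale, dim=-1)
                # Weighted sum over tokens -> one context vector per slot.
                return torch.einsum("bst,btd->bsd", weights, token_states)

        encoder_out = torch.randn(2, 12, 256)   # stand-in for DialoGPT states
        attn = SlotAttention(hidden_dim=256, num_slots=4)
        slot_contexts = attn(encoder_out)
        print(slot_contexts.shape)              # torch.Size([2, 4, 256])

    The slot-specific context vectors would then be consumed by the decoder alongside the dialogue context, in the spirit of the framework described above.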

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them.
    Comment: Published in the Journal of Artificial Intelligence Research (JAIR), volume 61, pp. 75-170. 118 pages, 8 figures, 1 table.
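
    The core tasks such surveys cover are often introduced through the classic three-stage pipeline: content determination, microplanning, and surface realisation. The toy Python sketch below walks one invented weather record through those stages; the data, templates, and stage boundaries are simplifications for illustration, not an implementation from the survey.

        # Toy rendition of the classic NLG pipeline:
        # content determination -> microplanning -> surface realisation.

        RECORD = {"city": "Valletta", "temp_c": 29, "sky": "clear"}

        def determine_content(record):
            # Content determination: select which facts to communicate.
            return [("temperature", record["city"], record["temp_c"]),
                    ("sky", record["city"], record["sky"])]

        def microplan(messages):
            # Microplanning: aggregate both messages into one sentence plan.
            (_, city, temp), (_, _, sky) = messages
            return {"subject": city, "temp": temp, "sky": sky}

        def realise(plan):
            # Surface realisation: map the plan onto a grammatical string.
            return (f"In {plan['subject']} it is {plan['temp']} degrees "
                    f"with {plan['sky']} skies.")

        print(realise(microplan(determine_content(RECORD))))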

    Improved Instruction Ordering in Recipe-Grounded Conversation

    In this paper, we study the task of instructional dialogue and focus on the cooking domain. Analyzing the generated output of the GPT-J model, we find that the primary challenge for a recipe-grounded dialogue system is providing the instructions in the correct order. We hypothesize that this is due to the model's lack of understanding of user intent and its inability to track the instruction state (i.e., which step was last instructed). Therefore, we propose to explore two auxiliary subtasks, namely User Intent Detection and Instruction State Tracking, to support Response Generation with improved instruction grounding. Experiments with our newly collected dataset, ChattyChef, show that incorporating user intent and instruction state information helps the response generation model mitigate the incorrect-order issue. Furthermore, to investigate whether ChatGPT has completely solved this task, we analyze its outputs and find that it also makes mistakes (in 10.7% of the responses), about half of which are out-of-order instructions. We will release ChattyChef to facilitate further research in this area at: https://github.com/octaviaguo/ChattyChef.
    Comment: Accepted at ACL 2023 main conference.
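
    How the two auxiliary subtasks can feed response generation is shown in miniature below: a keyword-based intent detector and an integer instruction state jointly decide which recipe step the next response should ground. The recipe, rules, and labels are invented stand-ins; the paper trains models for both subtasks rather than using keyword rules.

        # Toy user-intent detection plus instruction-state tracking.
        RECIPE = ["Preheat the oven to 200C.",
                  "Season the chicken.",
                  "Roast for 40 minutes."]

        def detect_intent(utterance):
            text = utterance.lower()
            if any(k in text for k in ("next", "then", "first", "start")):
                return "ask_next_step"
            if "again" in text or "repeat" in text:
                return "ask_repeat"
            return "chitchat"

        def respond(utterance, state):
            # state: index of the last step instructed (-1 = none yet).
            intent = detect_intent(utterance)
            if intent == "ask_next_step" and state + 1 < len(RECIPE):
                state += 1                   # advance the instruction state
                return RECIPE[state], state
            if intent == "ask_repeat" and state >= 0:
                return RECIPE[state], state  # re-issue the current step
            return "Happy to help with the recipe!", state

        state = -1
        for turn in ["What's first?", "What next?", "Repeat that please"]:
            reply, state = respond(turn, state)
            print(f"User: {turn}\nBot:  {reply}")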

    Chatbot-Based Natural Language Interfaces for Data Visualisation: A Scoping Review

    Rapid growth in the generation of data from various sources has made data visualisation a valuable tool for analysing data. However, visual analysis can be a challenging task, not only due to intricate dashboards but also when dealing with complex and multidimensional data. In this context, advances in Natural Language Processing technologies have led to the development of Visualisation-oriented Natural Language Interfaces (V-NLIs). In this paper, we carry out a scoping review that analyses synergies between the fields of Data Visualisation and Natural Language Interaction. Specifically, we focus on chatbot-based V-NLI approaches and explore and discuss three research questions. The first two research questions study how chatbot-based V-NLIs contribute to interactions with the Data and Visual Spaces of the visualisation pipeline, while the third asks how chatbot-based V-NLIs enhance users' interaction with visualisations. Our findings show that the works in the literature put a strong focus on exploring tabular data with basic visualisations, with visual mapping primarily reliant on fixed layouts. Moreover, V-NLIs provide users with restricted guidance strategies, and few of them support high-level and follow-up queries. We identify challenges and possible research opportunities for the V-NLI community, such as supporting high-level queries with complex data; integrating V-NLIs with more advanced systems such as Augmented Reality (AR) or Virtual Reality (VR), particularly for advanced visualisations; expanding guidance strategies beyond current limitations; adopting intelligent visual mapping techniques; and incorporating more sophisticated interaction methods.
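
    The visual-mapping step these interfaces perform, from a natural-language query to a chart specification, can be sketched very simply. The Python below maps keyword cues to a Vega-Lite-style spec dictionary; the field table, rules, and example query are invented, and real V-NLIs rely on trained NLU components rather than keyword matching.

        # Toy V-NLI visual mapping: utterance -> chart spec.
        FIELDS = {"sales": "quantitative", "region": "nominal", "month": "temporal"}

        def utterance_to_spec(utterance):
            text = utterance.lower()
            mentioned = [f for f in FIELDS if f in text]
            if "trend" in text or "over time" in text:
                mark = "line"
            elif "compare" in text or "by" in text:
                mark = "bar"
            else:
                mark = "point"
            spec = {"mark": mark, "encoding": {}}
            # Bind up to two mentioned fields to the x and y channels.
            for axis, field in zip(["x", "y"], mentioned):
                spec["encoding"][axis] = {"field": field, "type": FIELDS[field]}
            return spec

        print(utterance_to_spec("show sales by region"))
        # {'mark': 'bar', 'encoding': {'x': {'field': 'sales', ...},
        #                              'y': {'field': 'region', ...}}}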

    Practical aspects of designing and developing a multimodal embodied agent

    This thesis reviews key elements that went into the design and construction of the CSU CwC embodied agent, also known as the Diana System. The Diana System has been developed over five years by a joint team of researchers at three institutions: Colorado State University, Brandeis University, and the University of Florida. Over that time, I contributed to this overall effort, and in this thesis I present a practical review of key elements involved in designing and constructing the system. Particular attention is paid to Diana's multimodal capabilities, which engage asynchronously and concurrently to support realistic interactions with the user. Diana can communicate in visual as well as auditory modalities. She can understand a variety of hand gestures for object manipulation, deixis, etc., and can gesture in return. Diana can also hold a conversation with the user in spoken and/or written English. Gestures and speech are often at play simultaneously, supplementing and complementing each other. Diana conveys her attention through several non-verbal cues, such as blinking more slowly when inattentive and keeping her gaze on the subject of her attention. Finally, her ability to express emotions with facial expressions adds another crucial human element to any user interaction with the system. Central to Diana's capabilities is a blackboard architecture coordinating a hierarchy of modular components, each controlling a part of Diana's perceptual, cognitive, and motor abilities. The modular design facilitates contributions from multiple disciplines, namely VoxSim/VoxML with text-to-speech and automatic speech recognition systems for natural language understanding, deep neural networks for gesture recognition, 3D computer animation systems, etc., all integrated within the Unity game engine to create an embodied, intelligent agent that is Diana. The primary contribution of this thesis is a detailed explanation of Diana's internal workings, along with a thorough background of the research that supports these technologies.
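
    The blackboard architecture mentioned above is a standard coordination pattern: independent modules read percepts from a shared store and post their results back for others to consume. The Python sketch below shows the pattern with three toy modules whose names and logic are invented for illustration; Diana's actual components run asynchronously rather than in a simple loop.

        class Blackboard:
            """Shared store that modules read from and post results to."""
            def __init__(self):
                self.data = {}
            def post(self, key, value):
                self.data[key] = value
            def get(self, key):
                return self.data.get(key)

        class GestureRecognizer:
            def update(self, bb):
                if bb.get("camera_frame") == "pointing_at_block":
                    bb.post("gesture", "deixis:block")

        class SpeechUnderstander:
            def update(self, bb):
                if bb.get("asr_text") == "pick that up":
                    bb.post("intent", "grasp")

        class ActionPlanner:
            def update(self, bb):
                # Fuse modalities: the deictic gesture resolves "that".
                if bb.get("intent") == "grasp" and bb.get("gesture") == "deixis:block":
                    bb.post("motor_command", "grasp(block)")

        bb = Blackboard()
        bb.post("camera_frame", "pointing_at_block")
        bb.post("asr_text", "pick that up")
        for module in (GestureRecognizer(), SpeechUnderstander(), ActionPlanner()):
            module.update(bb)
        print(bb.get("motor_command"))  # grasp(block)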

    Toward Widely-Available and Usable Multimodal Conversational Interfaces

    Ph.D. thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2009, by Alexander Gruenstein. Multimodal conversational interfaces, which allow humans to interact with a computer using a combination of spoken natural language and a graphical interface, offer the potential to transform the manner by which humans communicate with computers. While researchers have developed myriad such interfaces, none have made the transition out of the laboratory and into the hands of a significant number of users. This thesis makes progress toward overcoming two intertwined barriers preventing more widespread adoption: availability and usability. Toward addressing the problem of availability, this thesis introduces a new platform for building multimodal interfaces that makes it easy to deploy them to users via the World Wide Web. One consequence of this work is City Browser, the first multimodal conversational interface made publicly available to anyone with a web browser and a microphone. City Browser serves as a proof of concept that significant amounts of usage data can be collected in this way, allowing a glimpse of how users interact with such interfaces outside of a laboratory environment. City Browser, in turn, has served as the primary platform for deploying and evaluating three new strategies aimed at improving usability. The most pressing usability challenge for conversational interfaces is their limited ability to accurately transcribe and understand spoken natural language. The three strategies developed in this thesis (context-sensitive language modeling, response confidence scoring, and user behavior shaping) each attack the problem from a different angle, but they are linked in that each critically integrates information from the conversational context.
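
    Of the three strategies, response confidence scoring lends itself to a compact illustration. The Python sketch below combines recognizer and parser confidences into a single score and maps it to a dialogue move (answer, confirm, or reject); the weights, thresholds, and example values are invented, not taken from the thesis.

        # Hedged sketch of response confidence scoring: combine component
        # confidences and pick a dialogue move. Weights/thresholds invented.

        def response_confidence(asr_conf, parse_conf, w_asr=0.6, w_parse=0.4):
            # Weighted combination of component confidences in [0, 1].
            return w_asr * asr_conf + w_parse * parse_conf

        def choose_move(score, accept=0.75, confirm=0.45):
            if score >= accept:
                return "answer"   # act on the hypothesis directly
            if score >= confirm:
                return "confirm"  # e.g. "Did you mean restaurants in Boston?"
            return "reject"       # ask the user to rephrase

        for asr, parse in [(0.92, 0.88), (0.60, 0.55), (0.30, 0.40)]:
            score = response_confidence(asr, parse)
            print(f"asr={asr:.2f} parse={parse:.2f} -> {choose_move(score)}")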