
    Improving Audio Caption Fluency with Automatic Error Correction

    Automated audio captioning (AAC) is an important cross-modality translation task that aims to generate descriptions for audio clips. However, captions generated by previous AAC models often contain "false-repetition" errors induced by the training objective. To reduce such errors, we propose the new task of AAC error correction, which post-processes AAC outputs. To tackle this problem, we use observation-based rules to corrupt error-free captions and generate pseudo grammatically-erroneous sentences; each corrupted sentence paired with its clean counterpart then serves as a training example. We train a neural network-based model on this synthetic error dataset and apply it to correct real errors in AAC outputs. Results on two benchmark datasets indicate that our approach significantly improves fluency while maintaining semantic information. Comment: Accepted by NCMMSC 202
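    To make the data-synthesis step concrete, here is a minimal Python sketch of how an observation-based corruption rule might turn an error-free caption into a pseudo-erroneous one; the duplication rule, function name, and example caption are illustrative assumptions, not the paper's exact rule set.

        import random

        def corrupt_caption(caption, max_span=3):
            """Duplicate a short span of a clean caption to imitate a
            "false-repetition" error (an assumed, illustrative rule)."""
            tokens = caption.split()
            if len(tokens) < 3:
                return caption
            start = random.randrange(len(tokens) - 1)
            end = random.randint(start + 1, min(start + max_span, len(tokens)))
            span = tokens[start:end]
            return " ".join(tokens[:end] + span + tokens[end:])

        # Each (corrupted, clean) pair becomes one training example
        # for the neural correction model.
        clean = "a dog barks while children are playing nearby"
        print((corrupt_caption(clean), clean))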

    Supporting Stylized Language Models Using Multi-Modality Features

    As AI and machine learning systems become more common in our everyday lives, there is an increased desire to construct systems that are able to seamlessly interact and communicate with humans. This typically means creating systems that are able to communicate with humans via natural language. Given the variance of natural language, this can be a very challenging task. In this thesis, I explored the topic of humanlike language generation in the context of stylized language generation. Stylized language generation involves producing some text that exhibits a specific, desired style. In this dissertation, I specifically explored the use of multi-modality features as a means to provide sufficient information to produce high-quality stylized text output. I also explored how these multi-modality features can be used to identify and explain errors in the generated output. Finally, I constructed an automated language evaluation metric that can evaluate stylized language models

    A Mixed-Methods Study Examining the Difference Between Closed Captioning and Lexile Levels

    This experimental mixed-methods study explores what happens to student Lexile scores when students use closed captioning. Since the emergence of closed captioning tools in the 1980s, closed captioning has become more mainstream and easier to access today than at any other time in history (Rickelman et al., 1991). By harnessing this technology and bringing it into the classroom, the researcher of this study hopes to provide new approaches for educators who want to improve their students' Lexile levels, while also incorporating the SAMR model within increasingly technology-focused classrooms (Crompton & Burke, 2018). The quantitative analysis in this experimental study consisted of two-sample t-tests comparing the iReady Lexile scores of the participants [n=38] with those of the researched district students [n=810] who were not using closed captioning in this study. The researcher required participants to complete a baseline iReady test to determine their preexisting Lexile levels. Then, after the study, participants both in the researched district and in the study itself were required to complete an iReady post-test to determine their respective Lexile growth in the four areas of reading: overall growth, vocabulary, comprehension of literary text, and comprehension of informational text. The independent variable was the use of the closed captioning tool enabled on the participants' devices. The dependent variable was the Lexile scores computed using the iReady Lexile exam. The researcher collected the qualitative data using observational logs, personal interviews, and pre- and post-surveys disseminated to students through the Qualtrics system. Once these data were collected, theming and phenomenological analysis were used to identify themes and student emotions/reactions that emerged throughout the study. The themes that emerged from participants included a belief in increasing Lexile levels, no effect on vocabulary, and enjoyment of using closed captioning
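    As a concrete illustration of the quantitative procedure described above, the sketch below runs a two-sample t-test with SciPy; the score arrays are synthetic placeholders standing in for the iReady Lexile growth data, not the study's results.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        # Synthetic stand-ins for iReady Lexile growth scores (placeholder values, not the study's data)
        captioning_group = rng.normal(loc=60, scale=25, size=38)   # participants, n = 38
        district_group = rng.normal(loc=45, scale=25, size=810)    # district students, n = 810

        # Two-sample t-test on mean Lexile growth (Welch's variant shown here)
        t_stat, p_value = stats.ttest_ind(captioning_group, district_group, equal_var=False)
        print(f"t = {t_stat:.2f}, p = {p_value:.4f}")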

    Improving fairness in machine learning systems: What do industry practitioners need?

    The potential for machine learning (ML) systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. A surge of recent work has focused on the development of algorithmic tools to assess and mitigate such unfairness. If these tools are to have a positive impact on industry practice, however, it is crucial that their design be informed by an understanding of real-world needs. Through 35 semi-structured interviews and an anonymous survey of 267 ML practitioners, we conduct the first systematic investigation of commercial product teams' challenges and needs for support in developing fairer ML systems. We identify areas of alignment and disconnect between the challenges faced by industry practitioners and solutions proposed in the fair ML research literature. Based on these findings, we highlight directions for future ML and HCI research that will better address industry practitioners' needs. Comment: To appear in the 2019 ACM CHI Conference on Human Factors in Computing Systems (CHI 2019)

    Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

    We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision-and-language problems. LENS can be applied to any off-the-shelf LLM, and we find that LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
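    A rough sketch of the modular idea, assuming a simple prompt format: outputs from independent vision modules (tags, attributes, captions) are written into a text prompt that any frozen LLM can reason over. The function name and prompt layout are assumptions for illustration, not the released LENS interface.

        def build_prompt(tags, attributes, captions, question):
            """Assemble frozen vision-module outputs into one text prompt
            for an off-the-shelf LLM (hypothetical format)."""
            return (
                "Tags: " + ", ".join(tags) + "\n"
                + "Attributes: " + ", ".join(attributes) + "\n"
                + "Captions: " + " ".join(captions) + "\n"
                + "Question: " + question + "\nAnswer:"
            )

        # Hypothetical module outputs for one image
        prompt = build_prompt(
            tags=["dog", "frisbee", "park"],
            attributes=["brown dog", "green grass"],
            captions=["a dog leaps to catch a frisbee in a park"],
            question="What is the animal doing?",
        )
        # answer = any_llm.generate(prompt)  # hypothetical call to a frozen LLM
        print(prompt)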

    Knowledge Graph Extraction from Videos

    Nearly all existing techniques for automated video annotation (or captioning) describe videos using natural language sentences. However, this has several shortcomings: (i) it is very hard to further use the generated natural language annotations in automated data processing, (ii) generating natural language annotations requires solving the hard subtask of producing semantically precise and syntactically correct sentences, which is actually unrelated to the task of video annotation, (iii) it is difficult to quantitatively measure performance, as standard metrics (e.g., accuracy and F1-score) are inapplicable, and (iv) annotations are language-specific. In this paper, we propose the new task of knowledge graph extraction from videos, i.e., producing a description of the contents of a given video in the form of a knowledge graph. Since no datasets exist for this task, we also include a method to automatically generate them, starting from datasets where videos are annotated with natural language. We then describe an initial deep-learning model for knowledge graph extraction from videos, and report results on MSVD* and MSR-VTT*, two datasets obtained from MSVD and MSR-VTT using our method. Comment: 10 pages, 4 figures
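    The dataset-generation idea can be illustrated with a small sketch that turns a natural-language annotation into (subject, relation, object) triples, i.e., edges of a knowledge graph. The dependency-parsing heuristic below (using spaCy) is an assumption for illustration, not the authors' pipeline.

        import spacy

        nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

        def caption_to_triples(caption):
            """Heuristic conversion of a natural-language annotation into
            (subject, relation, object) triples, i.e., knowledge-graph edges."""
            doc = nlp(caption)
            triples = []
            for token in doc:
                if token.pos_ == "VERB":
                    subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                    objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                    for subj in subjects:
                        for obj in objects:
                            triples.append((subj.lemma_, token.lemma_, obj.lemma_))
            return triples

        # e.g. a video annotated with "a man is slicing a tomato"
        # yields roughly [("man", "slice", "tomato")]
        print(caption_to_triples("a man is slicing a tomato"))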