288 research outputs found

    Gesture in Automatic Discourse Processing

    Get PDF
    Computers cannot fully understand spoken language without access to the wide range of modalities that accompany speech. This thesis addresses the particularly expressive modality of hand gesture, and focuses on building structured statistical models at the intersection of speech, vision, and meaning. My approach is distinguished in two key respects. First, gestural patterns are leveraged to discover parallel structures in the meaning of the associated speech. This differs from prior work that attempted to interpret individual gestures directly, an approach that lacked generality across speakers. Second, I present novel, structured statistical models for multimodal language processing, which enable learning about gesture in its linguistic context rather than in the abstract. These ideas find successful application in a variety of language processing tasks: resolving ambiguous noun phrases, segmenting speech into topics, and producing keyframe summaries of spoken language. In all three cases, the addition of gestural features -- extracted automatically from video -- yields significantly improved performance over a state-of-the-art text-only alternative. This marks the first demonstration that hand gesture improves automatic discourse processing.
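
    As a rough illustration of the general idea (not the thesis' actual models), the sketch below fuses hypothetical text-derived and gesture-derived features at the feature level for a toy topic-boundary classifier; all feature names and data are invented for illustration.

```python
# Minimal sketch: feature-level fusion of hypothetical text and gesture
# features for a binary "topic boundary" classifier. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical per-sentence features: lexical-cohesion scores (text) and
# hand-motion statistics extracted from video (gesture).
text_feats = rng.normal(size=(200, 8))      # e.g., similarity to neighboring sentences
gesture_feats = rng.normal(size=(200, 4))   # e.g., hand velocity, hold duration
labels = rng.integers(0, 2, size=200)       # 1 = topic boundary

X_text = text_feats
X_multimodal = np.hstack([text_feats, gesture_feats])

text_only = LogisticRegression(max_iter=1000).fit(X_text, labels)
multimodal = LogisticRegression(max_iter=1000).fit(X_multimodal, labels)

print("text-only accuracy:   ", text_only.score(X_text, labels))
print("text+gesture accuracy:", multimodal.score(X_multimodal, labels))
```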

    Investigating the role of linguistic knowledge in vision and language tasks

    Get PDF
    Artificial Intelligence (AI) has transformed the way we interact with technology, e.g., chatbots, voice-based assistants, and smart devices. One particular area that has gained tremendous attention and importance is learning through multimodal data sources within AI systems. By incorporating multimodal learning into AI systems, we can bridge the gap between human and machine communication, enabling more intuitive and natural interactions. Multimodal learning is the integration of multiple sensory modalities, such as text, images, speech, and gestures, to enable machines to understand and interpret humans and the world around us more comprehensively. In this thesis, we develop strategies to exploit multimodal data (specifically text and images) along with linguistic knowledge, making multimodal systems more reliable and accurate for various vision and language tasks. In the first part of the thesis, we focus on developing AI systems that can understand the visual world around us and respond in a more natural and human-like manner. This task is popularly known as image captioning. Despite the significant progress in this task, the image captions generated by the models are extremely generic and template-like for visually similar images. We address this limitation and generate detailed and image-specific captions by exploiting prior and implicit linguistic knowledge, without the need for more labeled data or computational overhead. Unlike previous work, our proposed method generates captions that reflect the image in detail. To further allow AI models to better understand and interpret context, in the second part of the thesis we leverage information from multiple modalities to gather a more comprehensive understanding of the visual data by generating scene graphs. Unlike image captioning, which provides a high-level interpretation of the scene, a key question in this setting is: how do different objects/entities in the scene interact with each other? Collecting large amounts of labeled data that can capture every possible interaction is very expensive and infeasible. Hence, we propose an efficient training strategy that generates complete and informative scene graphs from incomplete and missing labels using the knowledge of label informativeness from linguistics. In the third part of the thesis, we study narrative descriptions of images generated from human speech, i.e., natural language, to enable natural interaction between humans and machines. One fundamental and challenging problem when dealing with natural language is the task of coreference resolution. For example, in the sentence “John saw a dog. He petted it,” coreference resolution determines that “he” refers to “John” and “it” refers to “the dog.” While coreference resolution may seem straightforward to humans, it poses several significant challenges for AI systems. Without proper coreference resolution, models will struggle to derive the correct meaning and produce coherent outputs. To address this important and complex problem, we propose a novel benchmark dataset for multimodal coreference resolution to evaluate coreference resolution in text and narrative grounding in images. We also propose a weakly supervised method with rule-based linguistic knowledge to address multimodal coreference resolution without a large supervised training dataset. Finally, we address the limitations of the weakly supervised learning setup in multimodal coreference resolution by proposing a semi-supervised learning strategy. By using a small labeled dataset and a large unlabeled dataset with robust self-supervised and pseudo-label loss functions, we achieve strong performance gains for coreference resolution and narrative grounding in a data-efficient way. Our work addresses important aspects of vision and language and paves the way for interesting future avenues. In the last part of the thesis, we discuss in more detail directions for the future that are important for advancing the field and unlocking its full potential. Continued research is needed to push the boundaries of multimodal learning.
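
    The semi-supervised recipe mentioned at the end of the abstract can be illustrated with a minimal sketch: a supervised loss on a small labeled set plus a pseudo-label loss on confident predictions over unlabeled data. This is a generic illustration of the idea, with an assumed confidence threshold and loss weight, not the thesis' exact loss functions.

```python
# Minimal sketch of a semi-supervised objective with pseudo-labels
# (a generic illustration, not the thesis' actual losses).
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, labeled_x, labeled_y, unlabeled_x,
                         conf_threshold=0.9, unlabeled_weight=0.5):
    # Standard supervised loss on the small labeled set.
    sup_loss = F.cross_entropy(model(labeled_x), labeled_y)

    # Pseudo-labels: keep only confident predictions on the unlabeled set.
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_x), dim=-1)
        conf, pseudo_y = probs.max(dim=-1)
        mask = conf > conf_threshold

    if mask.any():
        unsup_loss = F.cross_entropy(model(unlabeled_x)[mask], pseudo_y[mask])
    else:
        unsup_loss = torch.tensor(0.0)

    return sup_loss + unlabeled_weight * unsup_loss

# Example with a toy linear classifier.
model = torch.nn.Linear(16, 3)
loss = semi_supervised_loss(model,
                            torch.randn(8, 16), torch.randint(0, 3, (8,)),
                            torch.randn(32, 16))
loss.backward()
```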

    A history and theory of textual event detection and recognition

    Get PDF

    OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

    Full text link
    Open-vocabulary learning has emerged as a cutting-edge research area, particularly in light of the widespread adoption of vision-based foundational models. Its primary objective is to comprehend novel concepts that are not encompassed within a predefined vocabulary. One key facet of this endeavor is Visual Grounding, which entails locating a specific region within an image based on a corresponding language description. While current foundational models excel at various visual language tasks, there is a noticeable absence of models specifically tailored for open-vocabulary visual grounding. This research introduces novel and challenging OV tasks, namely Open-Vocabulary Visual Grounding and Open-Vocabulary Phrase Localization. The overarching aim is to establish connections between language descriptions and the localization of novel objects. To facilitate this, we have curated a comprehensive annotated benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images. In our pursuit of addressing these challenges, we delved into various baseline methodologies rooted in existing open-vocabulary object detection, VG, and phrase localization frameworks. Surprisingly, we discovered that state-of-the-art methods often falter in diverse scenarios. Consequently, we developed a novel framework that integrates two critical components: Text-Image Query Selection and Language-Guided Feature Attention. These modules are designed to bolster the recognition of novel categories and enhance the alignment between visual and linguistic information. Extensive experiments demonstrate the efficacy of our proposed framework, which consistently attains state-of-the-art performance on the OV-VG task. Additionally, ablation studies provide further evidence of the effectiveness of our innovative models. Codes and datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG.
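
    As a hypothetical sketch of what a language-guided feature attention block might look like, the module below lets visual region tokens attend to text tokens and re-weights them accordingly; the dimensions, residual connection, and class name are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical cross-attention block: visual tokens attend to text tokens so
# that region features are modulated by the language description.
import torch
import torch.nn as nn

class LanguageGuidedAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N_regions, dim), text_tokens: (B, N_words, dim)
        attended, _ = self.cross_attn(query=visual_tokens,
                                      key=text_tokens,
                                      value=text_tokens)
        # Residual connection keeps the original visual evidence.
        return self.norm(visual_tokens + attended)

block = LanguageGuidedAttention()
fused = block(torch.randn(2, 100, 256), torch.randn(2, 12, 256))
print(fused.shape)  # torch.Size([2, 100, 256])
```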

    Multi-modal Machine Learning in Engineering Design: A Review and Future Directions

    Full text link
    In the rapidly advancing field of multi-modal machine learning (MMML), the convergence of multiple data modalities has the potential to reshape various applications. This paper presents a comprehensive overview of the current state, advancements, and challenges of MMML within the sphere of engineering design. The review begins with a deep dive into five fundamental concepts of MMML: multi-modal information representation, fusion, alignment, translation, and co-learning. Following this, we explore the cutting-edge applications of MMML, placing a particular emphasis on tasks pertinent to engineering design, such as cross-modal synthesis, multi-modal prediction, and cross-modal information retrieval. Through this comprehensive overview, we highlight the inherent challenges in adopting MMML in engineering design, and proffer potential directions for future research. To spur the continued evolution of MMML in engineering design, we advocate for concentrated efforts to construct extensive multi-modal design datasets, develop effective data-driven MMML techniques tailored to design applications, and enhance the scalability and interpretability of MMML models. MMML models, as the next generation of intelligent design tools, hold great promise for reshaping how products are designed.
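
    To make the fusion concept concrete, here is a toy sketch in which precomputed image and text embeddings are projected into a shared space and concatenated before a task head (e.g., a design-quality score); the class name, dimensions, and task are illustrative assumptions, not drawn from the surveyed work.

```python
# Toy concatenation-based fusion of precomputed image and text embeddings.
import torch
import torch.nn as nn

class SimpleFusionModel(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, hidden=256, num_outputs=1):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # project image embedding
        self.txt_proj = nn.Linear(txt_dim, hidden)   # project text embedding
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, num_outputs))

    def forward(self, img_emb, txt_emb):
        # Fuse by concatenating the two projected modalities.
        fused = torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1)
        return self.head(fused)

model = SimpleFusionModel()
score = model(torch.randn(4, 512), torch.randn(4, 768))  # e.g., a design-quality prediction
print(score.shape)  # torch.Size([4, 1])
```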

    XTQA: Span-Level Explanations of the Textbook Question Answering

    Full text link
    Textbook Question Answering (TQA) is the task of answering a diagram or non-diagram question given a large multi-modal context consisting of abundant essays and diagrams. We argue that the explainability of this task should place students at the center of consideration. To address this issue, we devise a novel architecture for span-level eXplanations of TQA (XTQA) based on our proposed coarse-to-fine-grained algorithm, which provides students not only the answers but also the span-level evidence for choosing them. The algorithm first coarsely selects the top M paragraphs relevant to the question using TF-IDF, and then finely selects the top K evidence spans from all candidate spans within these paragraphs by computing the information gain of each span with respect to the question. Experimental results show that XTQA significantly improves state-of-the-art performance compared with baselines. The source code is available at https://github.com/keep-smile-001/opentqa.
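
    A simplified sketch of this coarse-to-fine pipeline is given below. It keeps the TF-IDF coarse step but, as a stand-in for the paper's information-gain criterion, ranks candidate spans by TF-IDF similarity to the question; the function name and span length are assumptions for illustration, not the XTQA implementation.

```python
# Coarse-to-fine retrieval sketch: TF-IDF paragraph selection, then span
# selection (here by TF-IDF similarity, a simplification of information gain).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def coarse_to_fine(question, paragraphs, top_m=2, top_k=2, span_len=8):
    vec = TfidfVectorizer().fit(paragraphs + [question])
    q = vec.transform([question])

    # Coarse step: keep the top-M paragraphs most similar to the question.
    para_scores = cosine_similarity(vec.transform(paragraphs), q).ravel()
    top_paras = [paragraphs[i] for i in para_scores.argsort()[::-1][:top_m]]

    # Fine step: enumerate fixed-length word spans and keep the top-K.
    spans = []
    for para in top_paras:
        words = para.split()
        spans += [" ".join(words[i:i + span_len])
                  for i in range(0, max(1, len(words) - span_len + 1))]
    span_scores = cosine_similarity(vec.transform(spans), q).ravel()
    return [spans[i] for i in span_scores.argsort()[::-1][:top_k]]

print(coarse_to_fine("What causes the seasons on Earth?",
                     ["The tilt of Earth's axis causes the seasons as it orbits the sun.",
                      "Photosynthesis converts sunlight into chemical energy in plants."]))
```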

    Survey of Social Bias in Vision-Language Models

    Full text link
    In recent years, the rapid advancement of machine learning (ML) models, particularly transformer-based pre-trained models, has revolutionized the Natural Language Processing (NLP) and Computer Vision (CV) fields. However, researchers have discovered that these models can inadvertently capture and reinforce social biases present in their training datasets, leading to potential social harms, such as uneven resource allocation and unfair representation of specific social groups. Addressing these biases and ensuring fairness in artificial intelligence (AI) systems has become a critical concern in the ML community. The recent introduction of pre-trained vision-and-language (VL) models in the emerging multimodal field demands attention to the potential social biases present in these models as well. Although VL models are susceptible to social bias, understanding of it remains limited compared with the extensive discussion of bias in NLP and CV. This survey aims to provide researchers with a high-level insight into the similarities and differences of social bias studies in pre-trained models across NLP, CV, and VL. By examining these perspectives, the survey aims to offer valuable guidelines on how to approach and mitigate social bias in both unimodal and multimodal settings. The findings and recommendations presented here can benefit the ML community, fostering the development of fairer and less biased AI models in various applications and research endeavors.