Gesture in Automatic Discourse Processing
Computers cannot fully understand spoken language without access to the wide range of modalities that accompany speech. This thesis addresses the particularly expressive modality of hand gesture, and focuses on building structured statistical models at the intersection of speech, vision, and meaning. My approach is distinguished in two key respects. First, gestural patterns are leveraged to discover parallel structures in the meaning of the associated speech. This differs from prior work that attempted to interpret individual gestures directly, an approach that was prone to a lack of generality across speakers. Second, I present novel, structured statistical models for multimodal language processing, which enable learning about gesture in its linguistic context, rather than in the abstract. These ideas find successful application in a variety of language processing tasks: resolving ambiguous noun phrases, segmenting speech into topics, and producing keyframe summaries of spoken language. In all three cases, the addition of gestural features -- extracted automatically from video -- yields significantly improved performance over a state-of-the-art text-only alternative. This marks the first demonstration that hand gesture improves automatic discourse processing.
Investigating the role of linguistic knowledge in vision and language tasks
Artificial Intelligence (AI) has transformed the way we interact with technology, e.g., chatbots, voice-based assistants, smart devices, and so on. One particular area that has gained tremendous attention and importance is learning through multimodal data sources within AI systems. By incorporating multimodal learning into AI systems, we can bridge the gap between human and machine communication, enabling more intuitive and natural interactions. Multimodal learning is the integration of multiple sensory modalities, such as text, images, speech, and gestures, to enable machines to understand and interpret humans and the world around us more comprehensively. In this thesis, we develop strategies to exploit multimodal data (specifically text and images) along with linguistic knowledge, making multimodal systems more reliable and accurate for various vision and language tasks. In the first part of the thesis, we focus on developing AI systems that can understand the visual world around us and respond in a more natural and human-like manner. This task is popularly known as image captioning. Despite the significant progress in this task, the image captions generated by the models are extremely generic and template-like for visually similar images. We address this limitation and generate detailed and image-specific captions by exploiting prior and implicit linguistic knowledge, without the need for more labeled data or computational overhead. Unlike previous work, our proposed method generates captions that reflect the image in detail. To further allow AI models to better understand and interpret context, in the second part of the thesis we leverage information from multiple modalities to gather a more comprehensive understanding of the visual data by generating scene graphs. Unlike image captioning, which provides a high-level interpretation of the scene, in this setting a key question is: how do different objects/entities in the scene interact with each other?
Collecting large amounts of labeled data that can capture every possible interaction is very expensive and infeasible. Hence, we propose an efficient training strategy that generates complete and informative scene graphs from incomplete and missing labels using the knowledge of label informativeness from linguistics.
In the third part of the thesis, we study the narrative descriptions of images generated from human speech, i.e., natural language, to enable natural interaction between humans and machines. One fundamental and challenging problem when dealing with natural language is the task of coreference resolution. For example, in the sentence “John saw a dog. He petted it,” coreference resolution determines that “he” refers to “John” and “it” refers to the “dog.” While coreference resolution may seem straightforward to humans, it poses several significant challenges for AI systems. Without proper coreference resolution, models will struggle to derive the correct meaning and produce coherent outputs. To address this important and complex problem, we propose a novel benchmark dataset for multimodal coreference resolution to evaluate coreference resolution in text and narrative grounding in images. We also propose a weakly supervised method with rule-based linguistic knowledge to address multimodal coreference resolution without a large supervised training dataset. Finally, we address the limitations of the weakly supervised learning setup in multimodal coreference resolution by proposing a semi-supervised learning strategy. By using a small labeled and a large unlabeled dataset with robust self-supervised and pseudo-labeled loss functions, we achieve strong performance gains for coreference resolution and narrative grounding in a data-efficient way. Our work addresses important aspects in vision and language and paves the way for interesting future avenues. In the last part of the thesis, we discuss in more detail directions for the future that are important for advancing the field and unlocking its full potential. Hence, continued research is needed to push the boundaries of multimodal learning.
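Rule-based coreference heuristics of the kind this abstract alludes to can be sketched in a few lines. The lexicon, feature set, and function names below are illustrative assumptions for the “John saw a dog” example, not the thesis's actual method:

```python
# A minimal, illustrative rule-based pronoun resolver: agreement features
# plus a recency preference. The lexicon and pronoun table are hypothetical,
# chosen only to cover the running example above.

LEXICON = {
    "John": {"gender": "male"},
    "dog": {"gender": "neuter"},
}

PRONOUNS = {
    "he": "male",
    "she": "female",
    "it": "neuter",
}

def resolve(pronoun, mentions):
    """Return the most recent mention whose gender agrees with the pronoun."""
    gender = PRONOUNS[pronoun.lower()]
    for mention in reversed(mentions):  # prefer the most recent antecedent
        if LEXICON.get(mention, {}).get("gender") == gender:
            return mention
    return None

# "John saw a dog. He petted it."
mentions = ["John", "dog"]
print(resolve("He", mentions))  # -> John
print(resolve("it", mentions))  # -> dog
```

Real systems must also handle number, animacy, syntactic constraints, and ambiguity, which is precisely why the thesis turns to weakly and semi-supervised learning rather than rules alone.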
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
Open-vocabulary learning has emerged as a cutting-edge research area,
particularly in light of the widespread adoption of vision-based foundational
models. Its primary objective is to comprehend novel concepts that are not
encompassed within a predefined vocabulary. One key facet of this endeavor is
Visual Grounding, which entails locating a specific region within an image
based on a corresponding language description. While current foundational
models excel at various visual language tasks, there's a noticeable absence of
models specifically tailored for open-vocabulary visual grounding. This
research endeavor introduces novel and challenging OV tasks, namely
Open-Vocabulary Visual Grounding and Open-Vocabulary Phrase Localization. The
overarching aim is to establish connections between language descriptions and
the localization of novel objects. To facilitate this, we have curated a
comprehensive annotated benchmark, encompassing 7,272 OV-VG images and 1,000
OV-PL images. In our pursuit of addressing these challenges, we delved into
various baseline methodologies rooted in existing open-vocabulary object
detection, VG, and phrase localization frameworks. Surprisingly, we discovered
that state-of-the-art methods often falter in diverse scenarios. Consequently,
we developed a novel framework that integrates two critical components:
Text-Image Query Selection and Language-Guided Feature Attention. These modules
are designed to bolster the recognition of novel categories and enhance the
alignment between visual and linguistic information. Extensive experiments
demonstrate the efficacy of our proposed framework, which consistently attains
SOTA performance on the OV-VG task. Additionally, ablation studies provide
further evidence of the effectiveness of our innovative models. Code and
datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG.
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
In the rapidly advancing field of multi-modal machine learning (MMML), the
convergence of multiple data modalities has the potential to reshape various
applications. This paper presents a comprehensive overview of the current
state, advancements, and challenges of MMML within the sphere of engineering
design. The review begins with a deep dive into five fundamental concepts of
MMML: multi-modal information representation, fusion, alignment, translation,
and co-learning. Following this, we explore the cutting-edge applications of
MMML, placing a particular emphasis on tasks pertinent to engineering design,
such as cross-modal synthesis, multi-modal prediction, and cross-modal
information retrieval. Through this comprehensive overview, we highlight the
inherent challenges in adopting MMML in engineering design, and proffer
potential directions for future research. To spur on the continued evolution of
MMML in engineering design, we advocate for concentrated efforts to construct
extensive multi-modal design datasets, develop effective data-driven MMML
techniques tailored to design applications, and enhance the scalability and
interpretability of MMML models. MMML models, as the next generation of
intelligent design tools, hold great promise to shape how products are
designed.
XTQA: Span-Level Explanations of the Textbook Question Answering
Textbook Question Answering (TQA) is the task of answering a diagram or
non-diagram question given a large multi-modal context consisting of abundant
essays and diagrams. We argue that explainability for this task should treat
students as a key consideration. To address this issue,
we devise a novel architecture towards span-level eXplanations of the TQA
(XTQA) based on our proposed coarse-to-fine grained algorithm, which can
provide students with not only the answers but also the span-level evidence
for choosing them. The algorithm first coarsely selects the top paragraphs
relevant to the question using TF-IDF, and then finely selects the top
evidence spans from all candidate spans within those paragraphs by computing
each span's information gain with respect to the question. Experimental
results show that XTQA significantly outperforms state-of-the-art
baselines. The source code is available at
https://github.com/keep-smile-001/opentqa
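The coarse-to-fine selection described above can be sketched as follows. This is an illustrative approximation, not the paper's implementation: the coarse step is a minimal TF-IDF ranker, and the fine step substitutes simple question-term overlap for the paper's information-gain criterion:

```python
import math
from collections import Counter

# Coarse-to-fine evidence selection in the spirit of XTQA: TF-IDF ranks
# paragraphs (coarse), then candidate spans within the top paragraph are
# scored (fine). Span scoring here uses question-term overlap as a stand-in
# for the paper's information-gain measure.

def tokenize(text):
    return text.lower().split()

def tfidf_rank(question, paragraphs, top_k=1):
    """Coarse step: rank paragraphs by TF-IDF weight of shared question terms."""
    n = len(paragraphs)
    docs = [Counter(tokenize(p)) for p in paragraphs]
    df = Counter(term for doc in docs for term in doc)  # document frequency
    q_terms = set(tokenize(question))
    scores = [
        (sum(doc[t] * math.log((1 + n) / (1 + df[t])) for t in q_terms), i)
        for i, doc in enumerate(docs)
    ]
    return [paragraphs[i] for _, i in sorted(scores, reverse=True)[:top_k]]

def best_span(question, paragraph, max_len=5):
    """Fine step: return the candidate span sharing the most question terms."""
    toks, q_terms = tokenize(paragraph), set(tokenize(question))
    spans = [toks[i:i + w] for w in range(1, max_len + 1)
             for i in range(len(toks) - w + 1)]
    return " ".join(max(spans, key=lambda s: len(set(s) & q_terms)))

question = "what is the powerhouse of the cell"
paragraphs = [
    "Plants make food by photosynthesis using sunlight",
    "The mitochondria is the powerhouse of the cell",
]
top = tfidf_rank(question, paragraphs)[0]
print(best_span(question, top))  # -> is the powerhouse of
```

Enumerating all spans up to `max_len` is quadratic in paragraph length, which is why restricting the fine step to a few TF-IDF-selected paragraphs keeps the search tractable.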
Survey of Social Bias in Vision-Language Models
In recent years, the rapid advancement of machine learning (ML) models,
particularly transformer-based pre-trained models, has revolutionized Natural
Language Processing (NLP) and Computer Vision (CV) fields. However, researchers
have discovered that these models can inadvertently capture and reinforce
social biases present in their training datasets, leading to potential social
harms, such as uneven resource allocation and unfair representation of specific
social groups. Addressing these biases and ensuring fairness in artificial
intelligence (AI) systems has become a critical concern in the ML community.
The recent introduction of pre-trained vision-and-language (VL) models in the
emerging multimodal field demands attention to the potential social biases
present in these models as well. Although VL models are susceptible to social
bias, understanding of it remains limited compared with the extensive
discussion of bias in NLP and CV. This survey aims to provide researchers with a high-level
insight into the similarities and differences of social bias studies in
pre-trained models across NLP, CV, and VL. By examining these perspectives, the
survey aims to offer valuable guidelines on how to approach and mitigate social
bias in both unimodal and multimodal settings. The findings and recommendations
presented here can benefit the ML community, fostering the development of
fairer and less biased AI models in various applications and research endeavors.