13 research outputs found

    Long Story Short: a Summarize-then-Search Method for Long Video Question Answering

    Full text link
Large language models such as GPT-3 have demonstrated an impressive capability to adapt to new tasks without requiring task-specific training data. This capability has been particularly effective in settings such as narrative question answering, where the diversity of tasks is immense but the available supervision data is scarce. In this work, we investigate whether such language models can extend their zero-shot reasoning abilities to long multimodal narratives in multimedia content such as drama, movies, and animation, where the story plays an essential role. We propose Long Story Short, a framework for narrative video QA that first summarizes the narrative of the video into a short plot and then searches for the parts of the video relevant to the question. We also propose to enhance visual matching with CLIPCheck. Our model outperforms state-of-the-art supervised models by a large margin, highlighting the potential of zero-shot QA for long videos. Comment: Published in BMVC 202
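
    A minimal sketch of the summarize-then-search flow described above, assuming hypothetical placeholder functions (llm_summarize, llm_answer, embed_text, embed_frame) in place of the GPT-3 and CLIP components the paper relies on; only the control flow mirrors the abstract, and the equal re-ranking weights are illustrative assumptions rather than the paper's CLIPCheck formulation.

        import numpy as np

        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

        # Hypothetical stand-ins for the real GPT-3 / CLIP components.
        def llm_summarize(captions):
            """Placeholder: condense per-clip captions into a short plot."""
            return " ".join(captions)

        def llm_answer(question, context):
            """Placeholder: answer the question given the retrieved context."""
            return f"Answer derived from: {context[:80]}..."

        def embed_text(text, dim=512):
            """Placeholder text embedding (random vector keyed on the text)."""
            rng = np.random.default_rng(abs(hash(text)) % (2**32))
            return rng.standard_normal(dim)

        def embed_frame(frame_id, dim=512):
            """Placeholder visual embedding for a video frame."""
            return np.random.default_rng(frame_id).standard_normal(dim)

        def long_story_short(question, clips):
            """clips: list of dicts with 'caption' and 'frame_id' keys."""
            # 1) Summarize the whole narrative into a short plot.
            plot = llm_summarize([c["caption"] for c in clips])
            # 2) Search: score each clip's caption against the question.
            q_emb = embed_text(question)
            scored = [(cosine(q_emb, embed_text(c["caption"])), c) for c in clips]
            # 3) CLIPCheck-style step: re-rank candidates with visual evidence.
            def rerank_key(item):
                text_score, clip = item
                visual_score = cosine(q_emb, embed_frame(clip["frame_id"]))
                return 0.5 * text_score + 0.5 * visual_score  # weights are assumed
            best = max(scored, key=rerank_key)[1]
            return llm_answer(question, plot + " " + best["caption"])

        clips = [{"caption": f"scene {i} of the drama", "frame_id": i} for i in range(5)]
        print(long_story_short("Why did the hero leave home?", clips))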

    Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms

    Full text link
Commonsense norms are defeasible by context: reading books is usually great, but not when driving a car. While contexts can be explicitly described in language, in embodied scenarios contexts are often provided visually. This type of visually grounded reasoning about defeasible commonsense norms is generally easy for humans, but (as we show) poses a challenge for machines, as it necessitates both visual understanding and reasoning about commonsense norms. We construct NORMLENS, a new multimodal benchmark for studying visually grounded commonsense norms. NORMLENS consists of 10K human judgments accompanied by free-form explanations covering 2K multimodal situations, and serves as a probe to address two questions: (1) to what extent can models align with average human judgment? and (2) how well can models explain their predicted judgments? We find that state-of-the-art model judgments and explanations are not well aligned with human annotation. Additionally, we present a new approach to better align models with humans by distilling social commonsense knowledge from large language models. The data and code are released at https://seungjuhan.me/normlens. Comment: Published as a conference paper at EMNLP 2023 (long paper)
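
    The first probe question (alignment with average human judgment) reduces to a simple agreement computation. The sketch below uses assumed field names and a majority-vote aggregation for illustration; it is not the NORMLENS evaluation protocol itself.

        from collections import Counter

        def aggregate_human_judgment(judgments):
            """Majority vote over per-situation human labels, e.g. 'okay' / 'not okay' / 'depends'."""
            return Counter(judgments).most_common(1)[0][0]

        def alignment_score(examples, model_predict):
            """Fraction of situations where the model matches the aggregated human judgment."""
            hits = 0
            for ex in examples:
                gold = aggregate_human_judgment(ex["human_judgments"])
                if model_predict(ex["image"], ex["action"]) == gold:
                    hits += 1
            return hits / len(examples)

        # Toy usage with a trivial model that always answers 'okay'.
        examples = [
            {"image": "driving.jpg", "action": "reading a book",
             "human_judgments": ["not okay", "not okay", "depends"]},
            {"image": "sofa.jpg", "action": "reading a book",
             "human_judgments": ["okay", "okay", "okay"]},
        ]
        print(alignment_score(examples, lambda img, act: "okay"))  # 0.5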

    Transitional adaptation of pretrained models for visual storytelling

    No full text
© 2021 IEEE. Previous models for vision-to-language generation tasks usually pretrain a visual encoder and a language generator in their respective domains and jointly finetune them on the target task. However, this direct transfer practice may suffer from the discord between visual specificity and language fluency, since the two modules are often trained separately on large corpora of visual and text data with no common ground. In this work, we claim that a transitional adaptation task is required between pretraining and finetuning to harmonize the visual encoder and the language model for challenging downstream target tasks like visual storytelling. We propose a novel approach named Transitional Adaptation of Pretrained Models (TAPM) that adapts the multi-modal modules to each other with a simpler alignment task using visual inputs only, with no need for text labels. Through extensive experiments, we show that the adaptation step significantly improves the performance of multiple language models on sequential video and image captioning tasks. We achieve new state-of-the-art performance on both language metrics and human evaluation in the multi-sentence description task of LSMDC 2019 [50] and the image storytelling task of VIST [18]. Our experiments reveal that this improvement in caption quality does not depend on the specific choice of language models.
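
    The abstract does not spell out the text-free alignment objective, so the sketch below shows only one plausible variant under that constraint: project visual features into the language model's embedding space and train with an InfoNCE-style loss whose positives are temporal neighbors, using no text labels. The module names, dimensions, and loss are assumptions for illustration, not the TAPM objective.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class VisualToLMAdapter(nn.Module):
            """Projects visual features into the language model's embedding space."""
            def __init__(self, vis_dim=2048, lm_dim=768):
                super().__init__()
                self.proj = nn.Linear(vis_dim, lm_dim)

            def forward(self, vis_feats):      # (batch, seq, vis_dim)
                return self.proj(vis_feats)    # (batch, seq, lm_dim)

        def neighbor_infonce_loss(z, temperature=0.1):
            """Label-free alignment: each clip embedding should be closest to its
            temporal neighbor (positives) and far from clips of other sequences."""
            b, s, d = z.shape
            anchors   = F.normalize(z[:, :-1].reshape(-1, d), dim=-1)
            positives = F.normalize(z[:, 1:].reshape(-1, d), dim=-1)
            logits = anchors @ positives.t() / temperature
            targets = torch.arange(logits.size(0))
            return F.cross_entropy(logits, targets)

        # Toy usage on random "visual features" for 4 videos of 5 clips each.
        adapter = VisualToLMAdapter()
        vis = torch.randn(4, 5, 2048)
        loss = neighbor_infonce_loss(adapter(vis))
        loss.backward()
        print(float(loss))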

    ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

    No full text
© 2021 IEEE. The natural association between visual observations and their corresponding sounds provides powerful self-supervisory signals for learning video representations, which makes the ever-growing volume of online videos an attractive source of training data. However, large portions of online videos contain irrelevant audio-visual signals because of edited or overdubbed audio, and models trained on such uncurated videos have been shown to learn suboptimal representations. Therefore, existing self-supervised approaches rely on datasets with predetermined taxonomies of semantic concepts, where there is a high chance of audio-visual correspondence. Unfortunately, constructing such datasets requires labor-intensive manual annotation and/or verification, which severely limits the utility of online videos for large-scale learning. In this work, we present an automatic dataset curation approach based on subset optimization, where the objective is to maximize the mutual information between the audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve competitive performance compared to models trained on existing manually curated datasets. The most significant benefit of our approach is scalability: we release ACAV100M, which contains 100 million videos with high audio-visual correspondence, ideal for self-supervised video representation learning.
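
    A greatly simplified sketch of the subset-optimization idea: cluster audio and visual features separately, then greedily grow a subset whose audio and visual cluster assignments have maximal mutual information. The synthetic features, cluster counts, and naive greedy loop are assumptions for illustration; the released pipeline operates at a far larger scale.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import mutual_info_score

        rng = np.random.default_rng(0)

        # Toy stand-ins for per-video audio and visual features.
        n_videos = 200
        audio_feats = rng.standard_normal((n_videos, 32))
        visual_feats = rng.standard_normal((n_videos, 32))

        # Cluster each modality independently; MI is measured between the label sets.
        audio_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(audio_feats)
        visual_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(visual_feats)

        def greedy_mi_subset(a_labels, v_labels, budget=50):
            """Greedily add the video that most increases MI between the audio and
            visual cluster assignments of the selected subset."""
            selected, remaining = [], list(range(len(a_labels)))
            while len(selected) < budget and remaining:
                best_gain, best_idx = -np.inf, None
                for idx in remaining:
                    cand = selected + [idx]
                    mi = mutual_info_score(a_labels[cand], v_labels[cand])
                    if mi > best_gain:
                        best_gain, best_idx = mi, idx
                selected.append(best_idx)
                remaining.remove(best_idx)
            return selected

        subset = greedy_mi_subset(audio_labels, visual_labels, budget=50)
        print(len(subset), mutual_info_score(audio_labels[subset], visual_labels[subset]))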

    Multimodal Knowledge Alignment with Reinforcement Learning

    Full text link
Large language models readily adapt to novel settings, even without task-specific training data. Can their zero-shot capacity be extended to multimodal inputs? In this work, we propose ESPER, which extends language-only zero-shot models to unseen multimodal tasks such as image and audio captioning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, in the image case, our reward optimization relies only on cosine similarity derived from CLIP and thus requires no additional explicitly paired (image, caption) data. Because the parameters of the language model are left unchanged, the model maintains its capacity for zero-shot generalization. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks; these include a new benchmark we collect and release, the ESP dataset, which tasks models with generating several diversely styled captions for each image.
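
    The reward described above (cosine similarity derived from CLIP, with no paired captions) can be sketched as a scoring function like the one below; the specific checkpoint and how rewards would feed into the policy-gradient update of the captioner are assumptions, not the ESPER training recipe.

        import torch
        from PIL import Image
        from transformers import CLIPModel, CLIPProcessor

        # Any public CLIP checkpoint works here; this one is chosen for illustration.
        model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        model.eval()

        @torch.no_grad()
        def clip_reward(image, captions):
            """Cosine similarity between the image embedding and each candidate caption,
            usable as a per-sample reward when updating the caption generator with RL."""
            img_inputs = processor(images=image, return_tensors="pt")
            txt_inputs = processor(text=captions, return_tensors="pt", padding=True, truncation=True)
            img_emb = model.get_image_features(**img_inputs)
            txt_emb = model.get_text_features(**txt_inputs)
            img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
            txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
            return (txt_emb @ img_emb.t()).squeeze(-1)  # one reward per caption

        # Toy usage on a blank image; real training would score captions sampled from the LM.
        image = Image.new("RGB", (224, 224), color="white")
        print(clip_reward(image, ["a plain white square", "a dog playing fetch"]))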

    SARS-CoV-2 hijacks neutralizing dimeric IgA for nasal infection and injury in Syrian hamsters

    No full text
    Prevention of robust severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection in the nasal turbinate (NT) requires in vivo evaluation of IgA neutralizing antibodies. Here, we report the efficacy of receptor binding domain (RBD)-specific monomeric B8-mIgA1 and B8-mIgA2, and dimeric B8-dIgA1, B8-dIgA2 and TH335-dIgA1, against intranasal SARS-CoV-2 challenge in Syrian hamsters. These antibodies exhibited comparable neutralization potency against authentic virus by competing with the human angiotensin-converting enzyme 2 (ACE2) receptor for RBD binding. While significantly reducing viral loads in the lungs, prophylactic intranasal B8-dIgA unexpectedly led to high amounts of infectious virus and extended damage in the NT compared to controls. Mechanistically, B8-dIgA failed to inhibit SARS-CoV-2 cell-to-cell transmission, but was hijacked by the virus through dendritic cell-mediated trans-infection of NT epithelia, leading to robust nasal infection. Cryo-EM further revealed B8 as a class II antibody binding trimeric RBDs in 3-up or 2-up/1-down conformation. Neutralizing dIgA, therefore, may engage an unexpected mode of SARS-CoV-2 nasal infection and injury.

    Activation and Reaction Volumes in Solution. 3

    No full text