    Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions

    Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models. Comment: TPAMI 2022. arXiv admin note: text overlap with arXiv:1812.0250

    Robustness Analysis of Video-Language Models Against Visual and Language Perturbations

    Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks when compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of video-language models against various real-world perturbations. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations. The study reveals some interesting initial findings from the studied models: 1) models are generally more susceptible when only video is perturbed as opposed to when only text is perturbed, 2) models that are pre-trained are more robust than those trained from scratch, 3) models attend more to scene and objects rather than motion and action. We hope this study will serve as a benchmark and guide future research in robust video-language learning. The benchmark introduced in this study along with the code and datasets is available at https://bit.ly/3CNOly4. Comment: NeurIPS 2022 Datasets and Benchmarks Track. This project's webpage is located at https://bit.ly/3CNOly

    Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation

    In the traditional object recognition pipeline, descriptors are densely sampled over an image, pooled into a high-dimensional non-linear representation and then passed to a classifier. In recent years, Fisher Vectors have proven empirically to be the leading representation for a large variety of applications. The Fisher Vector is typically taken as the gradients of the log-likelihood of descriptors, with respect to the parameters of a Gaussian Mixture Model (GMM). Motivated by the assumption that different distributions should be applied for different datasets, we present two other Mixture Models and derive their Expectation-Maximization and Fisher Vector expressions. The first is a Laplacian Mixture Model (LMM), which is based on the Laplacian distribution. The second Mixture Model presented is a Hybrid Gaussian-Laplacian Mixture Model (HGLMM), which is based on a weighted geometric mean of the Gaussian and Laplacian distributions. An interesting property of the Expectation-Maximization algorithm for the latter is that in the maximization step, each dimension in each component is chosen to be either a Gaussian or a Laplacian. Finally, by using the new Fisher Vectors derived from HGLMMs, we achieve state-of-the-art results for both the image annotation and the image search by a sentence tasks. Comment: new version includes text synthesis by an RNN and experiments with the COCO benchmark
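
As a rough illustration of the "gradients of the log-likelihood" idea in the abstract above, the sketch below computes the gradient with respect to the component means of a diagonal-covariance GMM. It covers only the standard Gaussian case, not the LMM or HGLMM variants, and omits the normalization usually applied to Fisher Vectors; the function names and array layout are assumptions made for this example.

```python
import numpy as np

def _log_components(x, weights, means, sigmas):
    """Per-descriptor, per-component log(w_k * N(x_n; mu_k, diag(sigma_k^2))), shape (N, K)."""
    diff = (x[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
    return (np.log(weights)[None, :]
            - 0.5 * np.sum(diff ** 2, axis=2)
            - np.sum(np.log(sigmas), axis=1)[None, :]
            - 0.5 * x.shape[1] * np.log(2.0 * np.pi))

def gmm_log_likelihood(x, weights, means, sigmas):
    """Total log-likelihood of descriptors x (N, D) under the GMM."""
    log_comp = _log_components(x, weights, means, sigmas)
    m = log_comp.max(axis=1, keepdims=True)  # stabilise the log-sum-exp
    return float(np.sum(m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def fisher_vector_means(x, weights, means, sigmas):
    """Gradient of the log-likelihood w.r.t. the component means, flattened to (K*D,)."""
    log_comp = _log_components(x, weights, means, sigmas)
    m = log_comp.max(axis=1, keepdims=True)
    gamma = np.exp(log_comp - m)
    gamma /= gamma.sum(axis=1, keepdims=True)  # soft assignments, shape (N, K)
    scaled = (x[:, None, :] - means[None, :, :]) / sigmas[None, :, :] ** 2
    return np.sum(gamma[:, :, None] * scaled, axis=0).ravel()
```

The resulting vector has one entry per (component, dimension) pair, which is why Fisher Vector representations grow with both the number of mixture components and the descriptor dimensionality.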

    CLMSM: A Multi-Task Learning Framework for Pre-training on Procedural Text

    In this paper, we propose CLMSM, a domain-specific, continual pre-training framework that learns from a large set of procedural recipes. CLMSM uses a Multi-Task Learning Framework to optimize two objectives: a) Contrastive Learning using hard triplets to learn fine-grained differences across entities in the procedures, and b) a novel Mask-Step Modelling objective to learn the step-wise context of a procedure. We test the performance of CLMSM on the downstream tasks of tracking entities and aligning actions between two procedures on three datasets, one of which is an open-domain dataset not conforming with the pre-training dataset. We show that CLMSM not only outperforms baselines on recipes (in-domain) but is also able to generalize to open-domain procedural NLP tasks. Comment: Accepted to EMNLP Findings 2023, 14 pages, 4 figures
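
To make the triplet-based contrastive objective mentioned above more concrete, here is a minimal sketch of a generic triplet margin loss. It is not taken from the CLMSM paper: the Euclidean distance, the margin value, and the function name are assumptions chosen for illustration, and the Mask-Step Modelling objective is not shown.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(0, d(anchor, positive) - d(anchor, negative) + margin) over a batch.

    All inputs have shape (batch, dim). A "hard" triplet is one whose negative
    lies close to the anchor, so the hinge is active and the loss is non-zero.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# An easy triplet contributes nothing; a hard one is penalised.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
easy_negative = np.array([[3.0, 4.0]])
hard_negative = np.array([[0.0, 1.5]])
easy_loss = triplet_margin_loss(anchor, positive, easy_negative)
hard_loss = triplet_margin_loss(anchor, positive, hard_negative)
```

In a multi-task setup such a term would typically be added to the second objective with some weighting; the weighting scheme is not stated in the abstract.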

    Neural models for stepwise text illustration

    In this thesis, we investigate the task of sequence-to-sequence (seq2seq) retrieval: given a sequence (of text passages) as the query, retrieve a sequence (of images) that best describes and aligns with the query. This is a step beyond traditional cross-modal retrieval, which treats each image-text pair independently and ignores broader context. Since this is a difficult task, we break it into steps. We start with caption generation for images in news articles. Unlike the traditional image captioning task, where a text description is generated from an image alone, here a caption is generated conditioned on both the image and the news article in which it appears. We propose a novel neural-network-based methodology that takes into account both news article content and image semantics to generate a caption best describing the image and its surrounding text context. Our results outperform existing approaches to image caption generation. We then introduce two novel datasets, GutenStories and Stepwise Recipe, for the tasks of story picturing and sequential text illustration. GutenStories consists of around 90k text paragraphs, each accompanied by an image, aligned in around 18k visual stories. It covers a wide variety of images and story content styles. Stepwise Recipe is a similar dataset of sequenced image-text pairs, but with only domain-constrained (food-related) images. It consists of 67k text paragraphs (cooking instructions), each accompanied by an image describing the step, aligned in 10k recipes. Both datasets are web-crawled and systematically filtered and cleaned. We propose a novel variational recurrent seq2seq (VRSS) retrieval model. The model encodes two streams of information at every step: the contextual information from both text and images retrieved in previous steps, and the semantic meaning of the current input (text) as a latent vector. These together guide the retrieval of a relevant image from the repository to match the semantics of the given text. The model has been evaluated on both the Stepwise Recipe and GutenStories datasets. The results on several automatic evaluation measures show that our model outperforms several competitive and relevant baselines. We also qualitatively analyse the model both using human evaluation and by visualizing the representation space to judge its semantic meaningfulness. We further discuss the challenges faced on the more difficult GutenStories dataset and outline possible solutions.

    Generative Pretraining in Multimodality

    We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance. Comment: Code and Demo: https://github.com/baaivision/Em
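
The "classify the next text token or regress the next visual embedding" objective described above can be sketched as the sum of a cross-entropy term over text positions and a regression term over visual positions. This is only an illustration under assumptions: the mean-squared-error regression loss, the equal weighting of the two terms, and all names are hypothetical and are not taken from the Emu code.

```python
import numpy as np

def unified_next_element_loss(text_logits, text_targets, pred_embeds, target_embeds):
    """Cross-entropy on text positions plus mean squared error on visual positions.

    text_logits:   (T, V) scores over the vocabulary at each text position
    text_targets:  (T,)   index of the true next token at each text position
    pred_embeds:   (M, D) predicted next visual embeddings
    target_embeds: (M, D) true next visual embeddings
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = text_logits - text_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    cross_entropy = -np.mean(log_probs[np.arange(len(text_targets)), text_targets])
    # Regression term (assumed here to be plain MSE).
    regression = np.mean((pred_embeds - target_embeds) ** 2)
    return float(cross_entropy + regression)
```

With uniform logits over a vocabulary of size V and perfectly predicted embeddings, the loss reduces to log V, which is a convenient sanity check.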

    A survey on knowledge-enhanced multimodal learning

    Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects questions the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance explainability, fairness and validity of decision making, issues of utmost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.

    Making Multimodal Generation Easier: When Diffusion Models Meet LLMs

    We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominately depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge the gap between modalities, EasyGen is built upon a bidirectional conditional diffusion model named BiDiffuser, which promotes more efficient interactions between modalities. EasyGen handles image-to-text generation by integrating BiDiffuser and an LLM via a simple projection layer. Unlike most existing multimodal models that are limited to generating text responses, EasyGen can also facilitate text-to-image generation by leveraging the LLM to create textual descriptions, which can be interpreted by BiDiffuser to generate appropriate visual responses. Extensive quantitative and qualitative experiments demonstrate the effectiveness of EasyGen, whose training can be easily achieved in a lab setting. The source code is available at https://github.com/zxy556677/EasyGen