Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions
Can we teach a robot to recognize and make predictions for activities that it
has never seen before? We tackle this problem by learning models for video from
text. This paper presents a hierarchical model that generalizes instructional
knowledge from large-scale text corpora and transfers the knowledge to video.
Given a portion of an instructional video, our model recognizes and predicts
coherent and plausible actions multiple steps into the future, all in rich
natural language. To demonstrate the capabilities of our model, we introduce
the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot
learning, recognition and anticipation. Extensive experiments with various
evaluation metrics demonstrate the potential of our method for generalization,
given limited video data for training models.
Comment: TPAMI 2022. arXiv admin note: text overlap with arXiv:1812.0250
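As a rough illustration of the hierarchical design described in the abstract, the sketch below pairs a sentence-level encoder with a recipe-level GRU that predicts the upcoming step; all names and dimensions are hypothetical, not the authors' implementation.

    import torch, torch.nn as nn

    class StepAnticipator(nn.Module):
        # Hypothetical sketch: a sentence-level GRU embeds each observed step,
        # a recipe-level GRU summarizes the steps seen so far, and a linear
        # head scores the vocabulary for the predicted next step.
        def __init__(self, vocab_size, dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.step_enc = nn.GRU(dim, dim, batch_first=True)    # words within a step
            self.recipe_rnn = nn.GRU(dim, dim, batch_first=True)  # steps within a recipe
            self.decoder = nn.Linear(dim, vocab_size)

        def forward(self, steps):                    # steps: (B, n_steps, n_words) token ids
            B, S, W = steps.shape
            _, h = self.step_enc(self.embed(steps.view(B * S, W)))
            ctx, _ = self.recipe_rnn(h[-1].view(B, S, -1))
            return self.decoder(ctx[:, -1])          # logits for the next step

In practice the predicted representation would be decoded into a full natural-language sentence rather than scored as a single token; the single linear head here only keeps the sketch short.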
Robustness Analysis of Video-Language Models Against Visual and Language Perturbations
Joint visual and language modeling on large-scale datasets has recently shown
good progress in multi-modal tasks when compared to single modal learning.
However, robustness of these approaches against real-world perturbations has
not been studied. In this work, we perform the first extensive robustness study
of video-language models against various real-world perturbations. We focus on
text-to-video retrieval and propose two large-scale benchmark datasets,
MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different
text perturbations. The study reveals some interesting initial findings from
the studied models: 1) models are generally more susceptible when only video is
perturbed as opposed to when only text is perturbed, 2) models that are
pre-trained are more robust than those trained from scratch, 3) models attend
more to scenes and objects than to motion and actions. We hope this study
will serve as a benchmark and guide future research in robust video-language
learning. The benchmark introduced in this study along with the code and
datasets is available at https://bit.ly/3CNOly4.
Comment: NeurIPS 2022 Datasets and Benchmarks Track. This project's webpage is located at https://bit.ly/3CNOly4.
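A hedged sketch of how such a robustness measurement could look: embed queries and videos once with the clean inputs and once with a perturbation applied, then compare retrieval recall. recall_at_k and the variable names are hypothetical, not the benchmark's actual API.

    import numpy as np

    def recall_at_k(text_emb, video_emb, k=5):
        # Rows of text_emb and video_emb (both L2-normalized, shape (N, D))
        # are matched pairs; recall@k asks whether the true video ranks in
        # the top k results for its query.
        ranks = (-(text_emb @ video_emb.T)).argsort(axis=1)
        return float(np.mean([i in ranks[i, :k] for i in range(len(ranks))]))

    # clean = recall_at_k(txt, vid_clean); pert = recall_at_k(txt, vid_blurred)
    # relative_drop = (clean - pert) / clean   # one possible robustness score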
Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation
In the traditional object recognition pipeline, descriptors are densely
sampled over an image, pooled into a high dimensional non-linear representation
and then passed to a classifier. In recent years, Fisher Vectors have proven
empirically to be the leading representation for a large variety of
applications. The Fisher Vector is typically taken as the gradients of the
log-likelihood of descriptors, with respect to the parameters of a Gaussian
Mixture Model (GMM). Motivated by the assumption that different distributions
should be applied for different datasets, we present two other Mixture Models
and derive their Expectation-Maximization and Fisher Vector expressions. The
first is a Laplacian Mixture Model (LMM), which is based on the Laplacian
distribution. The second Mixture Model presented is a Hybrid Gaussian-Laplacian
Mixture Model (HGLMM) which is based on a weighted geometric mean of the
Gaussian and Laplacian distribution. An interesting property of the
Expectation-Maximization algorithm for the latter is that in the maximization
step, each dimension in each component is chosen to be either a Gaussian or a
Laplacian. Finally, by using the new Fisher Vectors derived from HGLMMs, we
achieve state-of-the-art results on both the image annotation and the image
search-by-sentence tasks.
Comment: The new version includes text synthesis by an RNN and experiments with the COCO benchmark.
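Since the Fisher Vector definition above is concrete, here is a minimal numpy/scikit-learn sketch of its mean-gradient part for a diagonal GMM; the HGLMM variant would additionally swap in Laplacian terms per dimension. The function name and toy data are ours, not the paper's code.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fisher_vector_means(X, gmm):
        # Gradient of the average log-likelihood w.r.t. the component means,
        # normalized by sqrt(w_k) (the usual diagonal Fisher approximation).
        gamma = gmm.predict_proba(X)            # responsibilities, (N, K)
        w, mu = gmm.weights_, gmm.means_        # (K,), (K, D)
        sigma = np.sqrt(gmm.covariances_)       # (K, D) for diagonal covariances
        N = X.shape[0]
        parts = [(gamma[:, k:k + 1] * (X - mu[k]) / sigma[k]).sum(0)
                 / (N * np.sqrt(w[k])) for k in range(gmm.n_components)]
        return np.concatenate(parts)            # (K * D,) mean-gradient block

    X = np.random.randn(500, 64)                # toy local descriptors
    gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(X)
    print(fisher_vector_means(X, gmm).shape)    # (512,)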
CLMSM: A Multi-Task Learning Framework for Pre-training on Procedural Text
In this paper, we propose CLMSM, a domain-specific, continual pre-training
framework, that learns from a large set of procedural recipes. CLMSM uses a
Multi-Task Learning Framework to optimize two objectives - a) Contrastive
Learning using hard triplets to learn fine-grained differences across entities
in the procedures, and b) a novel Mask-Step Modelling objective to learn
step-wise context of a procedure. We test the performance of CLMSM on the
downstream tasks of tracking entities and aligning actions between two
procedures on three datasets, one of which is an open-domain dataset that does
not conform to the pre-training domain. We show that CLMSM not only
outperforms baselines on recipes (in-domain) but also generalizes to
open-domain procedural NLP tasks.
Comment: Accepted to EMNLP Findings 2023, 14 pages, 4 figures.
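To make the contrastive objective concrete, below is a hedged sketch of a hard-triplet loss of the kind the abstract describes: pull an anchor procedure toward its positive and away from the hardest in-batch negative. The function and margin value are illustrative assumptions, not CLMSM's exact formulation.

    import torch
    import torch.nn.functional as F

    def hard_triplet_loss(anchor, positive, negatives, margin=0.2):
        # anchor, positive: (B, D); negatives: (B, M, D) candidate negatives.
        pos = F.cosine_similarity(anchor, positive, dim=-1)                # (B,)
        neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)  # (B, M)
        hardest = neg.max(dim=1).values                                    # (B,)
        return F.relu(hardest - pos + margin).mean()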
Neural models for stepwise text illustration
In this thesis, we investigate the task of sequence-to-sequence (seq2seq) retrieval: given a sequence (of text passages) as the query, retrieve a sequence (of images) that best describes and aligns with the query. This is a step beyond traditional cross-modal retrieval, which treats each image-text pair independently and ignores the broader context. Since this is a difficult task, we break it into steps.
We start with caption generation for images in news articles. Unlike the traditional image captioning task, where a text description is generated from an image alone, here a caption is generated conditioned on both the image and the news article in which it appears. We propose a novel neural-network-based methodology that takes into account both the article content and the image semantics to generate a caption best describing the image and its surrounding text context. Our results outperform existing approaches to caption generation.
We then introduce two novel datasets, GutenStories and StepwiseRecipe, for the tasks of story picturing and sequential text illustration. GutenStories consists of around 90k text paragraphs, each accompanied by an image, aligned in around 18k visual stories. It covers a wide variety of images and story-content styles. StepwiseRecipe is a similar dataset of sequenced image-text pairs, but with images restricted to a single domain, namely food. It consists of 67k text paragraphs (cooking instructions), each accompanied by an image depicting the step, aligned in 10k recipes. Both datasets are web-crawled and systematically filtered and cleaned.
We propose a novel variational recurrent seq2seq (VRSS) retrieval model. The model encodes two streams of information at every step: the contextual information from the text and images retrieved in previous steps, and the semantic meaning of the current input (text) as a latent vector. Together, these guide the retrieval of a relevant image from the repository to match the semantics of the given text. The model has been evaluated on both the StepwiseRecipe and GutenStories datasets. The results on several automatic evaluation measures show that our model outperforms several competitive and relevant baselines. We also analyse the model qualitatively, both through human evaluation and by visualizing the representation space to judge its semantic meaningfulness. We further discuss the challenges posed by the more difficult GutenStories dataset and outline possible solutions.
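A minimal sketch of the step-wise retrieval loop described above, assuming precomputed embeddings: a GRU cell carries context from previously retrieved images, and each step retrieves the repository image closest to the fused state. Class and variable names are hypothetical.

    import torch, torch.nn as nn

    class StepwiseRetriever(nn.Module):
        # Context-aware retrieval: fuse the current passage embedding with the
        # previously retrieved image, update a recurrent state, and score the
        # image repository by cosine similarity against that state.
        def __init__(self, dim=256):
            super().__init__()
            self.rnn = nn.GRUCell(2 * dim, dim)

        def retrieve(self, text_embs, image_bank):   # (n_steps, D), (M, D)
            D = image_bank.size(1)
            h, img, picks = torch.zeros(D), torch.zeros(D), []
            for t in text_embs:
                h = self.rnn(torch.cat([t, img]).unsqueeze(0),
                             h.unsqueeze(0)).squeeze(0)
                scores = (image_bank @ h) / (image_bank.norm(dim=1) * h.norm() + 1e-8)
                idx = int(scores.argmax())
                picks.append(idx)
                img = image_bank[idx]
            return picks

The full VRSS model additionally treats the current text's meaning as a latent variable (the "variational" part), which this deterministic sketch omits.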
Generative Pretraining in Multimodality
We present Emu, a Transformer-based multimodal foundation model, which can
seamlessly generate images and texts in multimodal context. This omnivore model
can take in any single-modality or multimodal data input indiscriminately
(e.g., interleaved image, text and video) through a one-model-for-all
autoregressive training process. First, visual signals are encoded into
embeddings, and together with text tokens form an interleaved input sequence.
Emu is then end-to-end trained with a unified objective of classifying the next
text token or regressing the next visual embedding in the multimodal sequence.
This versatile multimodality empowers the exploration of diverse pretraining
data sources at scale, such as videos with interleaved frames and text,
webpages with interleaved images and text, as well as web-scale image-text
pairs and video-text pairs. Emu can serve as a generalist multimodal interface
for both image-to-text and text-to-image tasks, and supports in-context image
and text generation. Across a broad range of zero-shot/few-shot tasks including
image captioning, visual question answering, video question answering and
text-to-image generation, Emu demonstrates superb performance compared to
state-of-the-art large multimodal models. Extended capabilities such as
multimodal assistants via instruction tuning are also demonstrated with
impressive performance.
Comment: Code and Demo: https://github.com/baaivision/Emu
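The unified objective described above can be sketched in a few lines: positions holding text are trained with next-token classification, positions holding visual embeddings with regression. The names and the use of an MSE regression term are our assumptions, not Emu's exact loss.

    import torch
    import torch.nn.functional as F

    def unified_loss(token_logits, next_tokens, pred_vis, next_vis, is_text):
        # is_text: boolean mask over sequence positions, (T,).
        # token_logits: (T, V); next_tokens: (T,); pred_vis/next_vis: (T, D).
        text_loss = F.cross_entropy(token_logits[is_text], next_tokens[is_text])
        vis_loss = F.mse_loss(pred_vis[~is_text], next_vis[~is_text])
        return text_loss + vis_loss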
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to
combine various modalities in a single joint representation. Especially in the
area of visiolinguistic (VL) learning multiple models and techniques have been
developed, targeting a variety of tasks that involve images and text. VL models
have reached unprecedented performances by extending the idea of Transformers,
so that both modalities can learn from each other. Massive pre-training
procedures enable VL models to acquire a certain level of real-world
understanding, although many gaps remain: their limited comprehension of
commonsense, factual, temporal and other everyday knowledge calls into
question how far VL tasks can be extended. Knowledge graphs and other
knowledge sources can fill those gaps by explicitly providing the missing
information, unlocking novel capabilities of VL models. At the same time,
knowledge graphs enhance the explainability, fairness and validity of decision
making, issues of utmost importance for such complex implementations. The
current survey aims to unify the fields of VL representation learning and
knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced
VL models.
Making Multimodal Generation Easier: When Diffusion Models Meet LLMs
We present EasyGen, an efficient model designed to enhance multimodal
understanding and generation by harnessing the capabilities of diffusion models
and large language models (LLMs). Unlike existing multimodal models that
predominantly depend on encoders like CLIP or ImageBind and need ample amounts
of training data to bridge the gap between modalities, EasyGen is built upon a
bidirectional conditional diffusion model named BiDiffuser, which promotes more
efficient interactions between modalities. EasyGen handles image-to-text
generation by integrating BiDiffuser and an LLM via a simple projection layer.
Unlike most existing multimodal models that are limited to generating text
responses, EasyGen can also facilitate text-to-image generation by leveraging
the LLM to create textual descriptions, which can be interpreted by BiDiffuser
to generate appropriate visual responses. Extensive quantitative and
qualitative experiments demonstrate the effectiveness of EasyGen, whose
training can be easily achieved in a lab setting. The source code is available
at https://github.com/zxy556677/EasyGen
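As one way to picture the "simple projection layer" mentioned above, the sketch below pools diffusion-encoder features with learned queries and maps them into the LLM's embedding space as soft prompts. Dimensions and names are hypothetical, not EasyGen's released code.

    import torch, torch.nn as nn

    class DiffusionToLLMBridge(nn.Module):
        def __init__(self, diff_dim=1024, llm_dim=4096, n_query=32):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(n_query, diff_dim))
            self.proj = nn.Linear(diff_dim, llm_dim)

        def forward(self, diff_feats):               # (B, n_patches, diff_dim)
            # Attention-pool the patch features with learned queries, then
            # project into the LLM token-embedding space.
            attn = torch.softmax(self.queries @ diff_feats.transpose(1, 2), dim=-1)
            return self.proj(attn @ diff_feats)      # (B, n_query, llm_dim)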