The Role of Syntactic Planning in Compositional Image Captioning
Image captioning has focused on generalizing to images drawn from the same
distribution as the training set, rather than on the more challenging problem of
generalizing to different distributions of images. Recently, Nikolaus et al.
(2019) introduced a dataset to assess compositional generalization in image
captioning, where models are evaluated on their ability to describe images with
unseen adjective-noun and noun-verb compositions. In this work, we investigate
different methods to improve compositional generalization by planning the
syntactic structure of a caption. Our experiments show that jointly modeling
tokens and syntactic tags enhances generalization in both RNN- and
Transformer-based models, while also improving performance on standard metrics.
Comment: Accepted at EACL 202
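As a rough, illustrative sketch of what jointly modelling tokens and syntactic tags can look like at the data level (the tag inventory and interleaving scheme below are assumptions for illustration, not the paper's exact formulation):

```python
# Illustrative only: interleave syntactic tags with caption tokens so that a
# sequence model predicts the syntactic plan alongside the words.
# The tag set and pairing are assumptions, not the paper's actual scheme.
def interleave_tags_and_tokens(tagged_caption):
    """tagged_caption: list of (tag, token) pairs, e.g. [("NOUN", "dog"), ...]."""
    target = []
    for tag, token in tagged_caption:
        target.append(f"<{tag}>")  # syntactic tag predicted first
        target.append(token)       # then the surface token
    return target

print(interleave_tags_and_tokens(
    [("DET", "a"), ("ADJ", "small"), ("NOUN", "dog"), ("VERB", "runs")]))
# ['<DET>', 'a', '<ADJ>', 'small', '<NOUN>', 'dog', '<VERB>', 'runs']
```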
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Recent work in vision-and-language pretraining has investigated supervised
signals from object detection data to learn better, fine-grained multimodal
representations. In this work, we take a step further and explore how we can
tap into supervision from small-scale visual relation data. In particular, we
propose two pretraining approaches to contextualise visual entities in a
multimodal setup. With verbalised scene graphs, we transform visual relation
triplets into structured captions, and treat them as additional image
descriptions. With masked relation prediction, we further encourage relating
entities from image regions with visually masked contexts. When applied to
strong baselines pretrained on large amounts of Web data, zero-shot evaluations
on both coarse-grained and fine-grained tasks show the efficacy of our methods
in learning multimodal representations from weakly-supervised relations data.
Comment: EMNLP 202
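A minimal sketch of the verbalised scene graph idea, flattening visual relation triplets into caption-like strings; the template is a hypothetical example, not the exact verbalisation used in the paper:

```python
# Hypothetical verbalisation of visual relation triplets into structured captions.
# The template is an assumption; the actual format in the paper may differ.
def verbalise_triplets(triplets):
    """triplets: list of (subject, predicate, object) visual relations."""
    return [f"{subj} {pred} {obj}" for subj, pred, obj in triplets]

extra_captions = verbalise_triplets([("man", "riding", "horse"), ("dog", "on", "sofa")])
# -> ['man riding horse', 'dog on sofa'], usable as additional image descriptions
```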
Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Pretrained machine learning models are known to perpetuate and even amplify
existing biases in data, which can result in unfair outcomes that ultimately
impact user experience. Therefore, it is crucial to understand the mechanisms
behind those prejudicial biases to ensure that model performance does not
result in discriminatory behaviour toward certain groups or populations. In
this work, we define gender bias as our case study. We quantify bias
amplification in pretraining and after fine-tuning on three families of
vision-and-language models. We investigate the connection, if any, between the
two learning stages, and evaluate how bias amplification reflects on model
performance. Overall, we find that bias amplification in pretraining and after
fine-tuning are independent. We then examine the effect of continued
pretraining on gender-neutral data, finding that this reduces group
disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without
significantly compromising task performance.
Comment: To appear in EMNLP 202
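As a generic illustration of the kind of group-disparity check such an evaluation involves (this is not the paper's bias amplification metric, just a simple accuracy-gap sketch over hypothetical per-example records):

```python
# Generic group-disparity sketch: compare task accuracy across demographic groups.
# Not the paper's metric; field names and the gap definition are illustrative.
from collections import defaultdict

def accuracy_by_group(examples):
    """examples: iterable of dicts with 'group', 'pred', and 'label' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["group"]] += 1
        correct[ex["group"]] += int(ex["pred"] == ex["label"])
    return {g: correct[g] / total[g] for g in total}

def group_disparity(examples):
    accs = accuracy_by_group(examples)
    return max(accs.values()) - min(accs.values())  # smaller gap = less disparity
```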
On the Interplay between Fairness and Explainability
In order to build reliable and trustworthy NLP applications, models need to
be both fair across different demographics and explainable. Usually these two
objectives, fairness and explainability, are optimized and/or examined
independently of each other. Instead, we argue that forthcoming, trustworthy
NLP systems should consider both. In this work, we perform a first study to
understand how they influence each other: do fair(er) models rely on more
plausible rationales, and vice versa? To this end, we conduct experiments on
two English multi-class text classification datasets, BIOS and ECtHR, that
provide information on gender and nationality, respectively, as well as
human-annotated rationales. We fine-tune pre-trained language models with
several methods for (i) bias mitigation, which aims to improve fairness; (ii)
rationale extraction, which aims to produce plausible explanations. We find
that bias mitigation algorithms do not always lead to fairer models. Moreover,
we discover that empirical fairness and explainability are orthogonal.
Comment: 15 pages (incl Appendix), 4 figures, 8 table
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Large-scale pretraining and task-specific fine-tuning is now the standard
methodology for many tasks in computer vision and natural language processing.
Recently, a multitude of methods have been proposed for pretraining vision and
language BERTs to tackle challenges at the intersection of these two key areas
of AI. These models can be categorised into either single-stream or dual-stream
encoders. We study the differences between these two categories, and show how
they can be unified under a single theoretical framework. We then conduct
controlled experiments to discern the empirical differences between five V&L
BERTs. Our experiments show that training data and hyperparameters are
responsible for most of the differences between the reported results, but they
also reveal that the embedding layer plays a crucial role in these massive
models.
Comment: To appear in TACL 202
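A schematic, simplified contrast between the two encoder families discussed (single-stream vs. dual-stream), sketched with PyTorch modules; dimensions, layer counts, and names are arbitrary assumptions rather than any specific V&L BERT:

```python
# Schematic contrast between single-stream and dual-stream V&L encoders.
# Everything below is illustrative; real models add embeddings, masks, heads, etc.
import torch
import torch.nn as nn

D = 256  # hidden size, arbitrary for this sketch

class SingleStreamEncoder(nn.Module):
    """One Transformer over the concatenation of text and image region features."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_feats, image_feats):
        joint = torch.cat([text_feats, image_feats], dim=1)  # one joint sequence
        return self.encoder(joint)

class DualStreamEncoder(nn.Module):
    """Separate per-modality streams that interact through cross-attention."""
    def __init__(self):
        super().__init__()
        self.text_self = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.image_self = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.text_to_image = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

    def forward(self, text_feats, image_feats):
        t = self.text_self(text_feats)
        v = self.image_self(image_feats)
        t_ctx, _ = self.text_to_image(t, v, v)  # text attends to image
        v_ctx, _ = self.image_to_text(v, t, t)  # image attends to text
        return t_ctx, v_ctx
```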
Language Modelling with Pixels
Language models are defined over a finite set of inputs, which creates a
vocabulary bottleneck when we attempt to scale the number of supported
languages. Tackling this bottleneck results in a trade-off between what can be
represented in the embedding matrix and computational issues in the output
layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which
suffers from neither of these issues. PIXEL is a pretrained language model that
renders text as images, making it possible to transfer representations across
languages based on orthographic similarity or the co-activation of pixels.
PIXEL is trained to reconstruct the pixels of masked patches, instead of
predicting a distribution over tokens. We pretrain the 86M parameter PIXEL
model on the same English data as BERT and evaluate on syntactic and semantic
tasks in typologically diverse languages, including various non-Latin scripts.
We find that PIXEL substantially outperforms BERT on syntactic and semantic
processing tasks on scripts that are not found in the pretraining data, but
PIXEL is slightly weaker than BERT when working with Latin scripts.
Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT,
further confirming the benefits of modelling language with pixels.
Comment: work in progres
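A toy sketch of the text-as-pixels idea: render a string to a small grayscale image, split it into fixed-size patches, and mask a subset for reconstruction. Image size, patch size, masking ratio, and font handling here are arbitrary choices, not PIXEL's actual configuration.

```python
# Toy illustration of rendering text as pixels and masking patches for
# reconstruction. Parameters are arbitrary, not PIXEL's real settings.
import numpy as np
from PIL import Image, ImageDraw

def render_text(text, height=16, width=256):
    """Render a string onto a white grayscale canvas and return it in [0, 1]."""
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((0, 2), text, fill=0)  # default bitmap font
    return np.asarray(img, dtype=np.float32) / 255.0

def mask_patches(pixels, patch=16, ratio=0.25, seed=0):
    """Zero out a random subset of patch columns; a model would reconstruct them."""
    _, width = pixels.shape
    n_cols = width // patch
    rng = np.random.default_rng(seed)
    masked = rng.choice(n_cols, size=max(1, int(ratio * n_cols)), replace=False)
    corrupted = pixels.copy()
    for j in masked:
        corrupted[:, j * patch:(j + 1) * patch] = 0.0
    return corrupted, masked

rendered = render_text("Language modelling with pixels")
corrupted, masked_cols = mask_patches(rendered)
```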
Sustainable Urban Transformation and the Green Urban Economy
This chapter explores the connections between the concepts of sustainable urban transformation and the green urban economy, proposes a framework for understanding how these concepts “fit” together, and makes practical suggestions for local governments (and national and international policy).
StoryBench: A Multifaceted Benchmark for Continuous Story Visualization
Generating video stories from text prompts is a complex task. In addition to
having high visual quality, videos need to realistically adhere to a sequence
of text prompts whilst being consistent throughout the frames. Creating a
benchmark for video generation requires data annotated over time, which
contrasts with the single caption often used in video datasets. To fill this
gap, we collect comprehensive human annotations on three existing datasets, and
introduce StoryBench: a new, challenging multi-task benchmark to reliably
evaluate forthcoming text-to-video models. Our benchmark includes three video
generation tasks of increasing difficulty: action execution, where the next
action must be generated starting from a conditioning video; story
continuation, where a sequence of actions must be executed starting from a
conditioning video; and story generation, where a video must be generated from
only text prompts. We evaluate small yet strong text-to-video baselines, and
show the benefits of training on story-like data algorithmically generated from
existing video captions. Finally, we establish guidelines for human evaluation
of video stories, and reaffirm the need for better automatic metrics for video
generation. StoryBench aims to encourage future research efforts in this
exciting new area.
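A hypothetical sketch of what time-aligned story annotations might look like as a data structure, in contrast to a single global caption; the field names below are illustrative assumptions, not StoryBench's actual schema.

```python
# Hypothetical shape of time-aligned story annotations (illustrative only):
# a video paired with an ordered sequence of timed text prompts.
from dataclasses import dataclass

@dataclass
class TimedPrompt:
    start_s: float
    end_s: float
    text: str

@dataclass
class StoryAnnotatedVideo:
    video_id: str
    conditioning_seconds: float   # video prefix given to the model, if any
    prompts: list                 # ordered TimedPrompt actions to generate

example = StoryAnnotatedVideo(
    video_id="clip_0001",
    conditioning_seconds=2.0,
    prompts=[
        TimedPrompt(2.0, 4.5, "the person picks up a cup"),
        TimedPrompt(4.5, 7.0, "they pour water into the cup"),
    ],
)
```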
Peristaltic Pumping of Blood Through Small Vessels of Varying Cross-section
The paper is devoted to a study of the peristaltic motion of blood in the
micro-circulatory system. The vessel is considered to be of varying
cross-section. The progressive peristaltic waves are taken to be of sinusoidal
nature. Blood is considered to be a Herschel-Bulkley fluid. Of particular
concern here is to investigate the effects of amplitude ratio, mean pressure
gradient, yield stress and the power law index on the velocity distribution,
streamline pattern and wall shear stress. On the basis of the derived
analytical expression, extensive numerical calculations have been made. The
study reveals that the velocity of blood and the wall shear stress are appreciably
affected by the non-uniform geometry of blood vessels. They are also highly
sensitive to the magnitude of the amplitude ratio and the value of the fluid
index.
Comment: Accepted for publication in ASME Journal of Applied Mechanics. arXiv
admin note: text overlap with arXiv:1108.1285v
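For reference, the Herschel-Bulkley constitutive law assumed for blood in such models is commonly written as below (standard textbook form with yield stress, consistency index, and power-law index; not reproduced from the paper itself):

```latex
% Herschel-Bulkley model: shear stress \tau, shear rate \dot{\gamma},
% yield stress \tau_0, consistency index k, power-law index n.
\[
  \tau = \tau_0 + k\,\dot{\gamma}^{\,n} \quad \text{for } \tau \ge \tau_0,
  \qquad \dot{\gamma} = 0 \quad \text{for } \tau < \tau_0.
\]
```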