15 research outputs found
Can RNNs trained on harder subject-verb agreement instances still perform well on easier ones?
In English, the main subject and its associated verb must agree in grammatical number, a phenomenon known as Subject-Verb Agreement (SVA). It has been found that an intervening noun whose grammatical number differs from that of the main subject can cause speakers to produce a verb that agrees with the intervening noun rather than the main subject; such a noun acts as an agreement attractor. Attractors have also been shown to pose a challenge for RNN models without explicit hierarchical bias on SVA tasks. Previous work suggests that syntactic cues in the input can help such models choose hierarchical rules over linear rules for number agreement. In this work, we investigate the effects of the choice of training data, training algorithm, and architecture on hierarchical generalization. We observe that the models under consideration fail to perform well on sentences with no agreement attractor when trained solely on natural sentences containing at least one attractor. Even with this biased training set, an implicit hierarchical bias in the architecture (as in the Ordered Neurons LSTM) is not enough to capture syntax-sensitive dependencies. These results suggest that current RNNs do not capture the underlying hierarchical rules of natural language, but rather rely on shallower heuristics for their predictions.
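As a concrete illustration of this style of evaluation, the sketch below scores a language model's preference between the verb form that agrees with the subject and the one that does not, for prefixes with and without an agreement attractor. The `lm_log_prob` helper and the test items are hypothetical placeholders, not the paper's actual models or data.

```python
# Minimal sketch of an agreement-attractor evaluation.
# `lm_log_prob` stands in for any trained RNN language model that returns
# the log-probability of a token given a prefix.

def lm_log_prob(prefix_tokens, next_token):
    """Placeholder: return log P(next_token | prefix_tokens) under the LM."""
    raise NotImplementedError

# Each item: (prefix, verb agreeing with subject, disagreeing verb, #attractors)
TEST_ITEMS = [
    # no attractor: the verb directly follows the subject
    ("the author".split(),              "writes", "write", 0),
    # one attractor: plural "books" intervenes between subject and verb
    ("the author of the books".split(), "writes", "write", 1),
]

def accuracy(items):
    """Fraction of items where the LM prefers the subject-agreeing verb."""
    correct = 0
    for prefix, good, bad, _ in items:
        if lm_log_prob(prefix, good) > lm_log_prob(prefix, bad):
            correct += 1
    return correct / len(items)

# Comparing accuracy on the 0-attractor vs. >=1-attractor subsets shows whether
# a model trained only on attractor sentences still handles the easier cases.
```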
How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?
Text-to-image generative models have achieved unprecedented success in
generating high-quality images based on natural language descriptions. However,
it is shown that these models tend to favor specific social groups when
prompted with neutral text descriptions (e.g., 'a photo of a lawyer').
Following Zhao et al. (2021), we study the effect on the diversity of the
generated images when adding ethical intervention that supports equitable
judgment (e.g., 'if all individuals can be a lawyer irrespective of their
gender') in the input prompts. To this end, we introduce an Ethical NaTural
Language Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset
to evaluate the change in image generations conditional on ethical
interventions across three social axes -- gender, skin color, and culture.
Through the ENTIGEN framework, we find that the generations from minDALL.E,
DALL.E-mini and Stable Diffusion cover diverse social groups while preserving
image quality. Preliminary studies indicate that certain phrases in the
ethical interventions, such as 'irrespective of gender' in the context of
gender bias, trigger a large change in the model predictions. We release
code and annotated data at https://github.com/Hritikbansal/entigen_emnlp.
Comment: 13 pages, 8 figures, 6 tables. Accepted as an oral presentation at EMNLP 2022.
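A rough sketch of the intervention setup follows, assuming the Hugging Face diffusers StableDiffusionPipeline API; the checkpoint name and prompt wording are illustrative rather than the exact ENTIGEN protocol.

```python
# Generate images for a neutral prompt and for the same prompt with an
# ethical intervention appended, then compare diversity between the two sets.
import torch
from diffusers import StableDiffusionPipeline

NEUTRAL = "a photo of a lawyer"
INTERVENTION = ", if all individuals can be a lawyer irrespective of their gender"

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

def generate(prompt, n=8, seed=0):
    """Generate n images for one prompt; a fixed seed keeps the comparison controlled."""
    generator = torch.Generator(device="cuda").manual_seed(seed)
    return pipe([prompt] * n, generator=generator).images

baseline_images   = generate(NEUTRAL)
intervened_images = generate(NEUTRAL + INTERVENTION)
# Diversity over gender, skin color, and culture is then compared between the
# two sets, e.g., via attribute annotation of the generated images.
```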
VideoCon: Robust Video-Language Alignment via Contrast Captions
Despite being (pre)trained on a massive amount of data, state-of-the-art
video-language alignment models are not robust to semantically-plausible
contrastive changes in the video captions. Our work addresses this by
identifying a broad spectrum of contrast misalignments, such as replacing
entities, actions, and flipping event order, which alignment models should be
robust against. To this end, we introduce VideoCon, a video-language
alignment dataset constructed by a large language model that generates
plausible contrast video captions and explanations for differences between
original and contrast video captions. Then, a generative video-language model
is finetuned with VideoCon to assess video-language entailment and generate
explanations. Our VideoCon-based alignment model significantly outperforms
current models. It exhibits a 12-point increase in AUC for the video-language
alignment task on human-generated contrast captions. Finally, our model sets
new state-of-the-art zero-shot performance in temporally extensive
video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video
question answering (ATP-Hard). Moreover, our model shows superior performance
on novel videos and human-crafted captions and explanations. Our code and data
are available at https://github.com/Hritikbansal/videocon.
Comment: 22 pages, 19 figures, 7 tables.
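The evaluation idea can be sketched as follows: given a video, its original caption, and a contrast caption, an alignment model should assign a higher entailment score to the original caption, and AUC summarizes how well it separates the two. The `alignment_score` helper is a hypothetical placeholder for the finetuned model, not VideoCon's actual interface.

```python
# Minimal sketch of scoring robustness to contrast captions with ROC-AUC.
from sklearn.metrics import roc_auc_score

def alignment_score(video_path: str, caption: str) -> float:
    """Placeholder: return P(caption is entailed by the video)."""
    raise NotImplementedError

def contrast_auc(examples):
    """examples: list of (video_path, original_caption, contrast_caption)."""
    labels, scores = [], []
    for video, original, contrast in examples:
        labels += [1, 0]  # original caption is entailed, contrast caption is not
        scores += [alignment_score(video, original), alignment_score(video, contrast)]
    return roc_auc_score(labels, scores)
```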
Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale
Language models have been shown to perform better with an increase in scale
on a wide variety of tasks via the in-context learning paradigm. In this paper,
we investigate the hypothesis that the ability of a large language model to
perform a task via in-context learning is not uniformly spread across all of its
underlying components. Using a 66 billion parameter language model (OPT-66B)
across a diverse set of 14 downstream tasks, we find this is indeed the case:
70% of the attention heads and 20% of the feed-forward networks can be
removed with minimal decline in task performance. We find substantial overlap
in the set of attention heads (un)important for in-context learning across
tasks and number of in-context examples. We also address our hypothesis through
a task-agnostic lens, finding that a small set of attention heads in OPT-66B
score highly on their ability to perform primitive induction operations
associated with in-context learning, namely, prefix matching and copying. These
induction heads overlap with task-specific important heads, reinforcing
arguments by Olsson et al. (arXiv:2209.11895) regarding induction head
generality to more sophisticated behaviors associated with in-context learning.
Overall, our study provides several insights that indicate large language
models may be under-trained for in-context learning and opens up questions on
how to pre-train language models to more effectively perform in-context
learning.
Comment: Accepted at the Annual Meeting of the Association for Computational Linguistics (ACL) 2023, Main Proceedings.
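One of the task-agnostic measurements mentioned above, prefix matching, can be sketched roughly as follows: on a random token block repeated twice, an induction head at a position in the second copy should attend to the token that followed the same token's first occurrence. This is an illustrative approximation (assuming the head's attention matrix is available, e.g., via output_attentions=True in Hugging Face transformers); the paper's exact scoring procedure may differ.

```python
# Minimal sketch of a prefix-matching score for one attention head.
import numpy as np

def prefix_matching_score(attn: np.ndarray, seq_len: int) -> float:
    """attn: (2*seq_len, 2*seq_len) attention weights for one head on a sequence
    consisting of a random block of length seq_len repeated twice.

    For each position i in the second copy, an induction head should attend to
    the token *after* the first occurrence of the same token, i.e., position
    (i - seq_len) + 1. The score is the average attention mass placed there."""
    scores = []
    for i in range(seq_len, 2 * seq_len):  # positions in the repeated block
        target = (i - seq_len) + 1         # token following the first occurrence
        scores.append(attn[i, target])
    return float(np.mean(scores))
```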
CyCLIP: Cyclic Contrastive Language-Image Pretraining
Recent advances in contrastive representation learning over paired image-text
data have led to models such as CLIP that achieve state-of-the-art performance
for zero-shot classification and distributional robustness. Such models
typically require joint reasoning in the image and text representation spaces
for downstream inference tasks. Contrary to prior beliefs, we demonstrate that
the image and text representations learned via a standard contrastive objective
are not interchangeable and can lead to inconsistent downstream predictions. To
mitigate this issue, we formalize consistency and propose CyCLIP, a framework
for contrastive representation learning that explicitly optimizes for the
learned representations to be geometrically consistent in the image and text
space. In particular, we show that consistent representations can be learned by
explicitly symmetrizing (a) the similarity between the two mismatched
image-text pairs (cross-modal consistency); and (b) the similarity between the
image-image pair and the text-text pair (in-modal consistency). Empirically, we
show that the improved consistency in CyCLIP translates to significant gains
over CLIP, with gains ranging from 10%-24% for zero-shot classification
accuracy on standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K) and 10%-27%
for robustness to various natural distribution shifts. The code is available at
https://github.com/goel-shashank/CyCLIP.
Comment: 19 pages, 13 tables, 6 figures. Oral at NeurIPS 2022.
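The two consistency terms described in the abstract can be sketched in PyTorch as follows; the reductions and any loss weights here are illustrative rather than the paper's precise formulation.

```python
# Sketch of CyCLIP-style consistency regularizers added on top of a
# standard CLIP-style contrastive loss.
import torch
import torch.nn.functional as F

def cyclip_consistency(image_emb: torch.Tensor, text_emb: torch.Tensor):
    """image_emb, text_emb: (batch, dim) embeddings of paired images and texts."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    sim_it = image_emb @ text_emb.t()   # sim(I_j, T_k): cross-modal similarities
    sim_ii = image_emb @ image_emb.t()  # sim(I_j, I_k): in-modal (image) similarities
    sim_tt = text_emb @ text_emb.t()    # sim(T_j, T_k): in-modal (text) similarities

    # (a) cross-modal consistency: sim(I_j, T_k) should match sim(I_k, T_j)
    cross_modal = ((sim_it - sim_it.t()) ** 2).mean()
    # (b) in-modal consistency: sim(I_j, I_k) should match sim(T_j, T_k)
    in_modal = ((sim_ii - sim_tt) ** 2).mean()
    return cross_modal, in_modal

# Both terms are added (with small weights) to the usual symmetric
# image-to-text / text-to-image contrastive loss.
```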
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
Instruction tuning has emerged to enhance the capabilities of large language
models (LLMs) to comprehend instructions and generate appropriate responses.
Existing methods either manually annotate or employ LLM (e.g., GPT-series) to
generate data for instruction tuning. However, they often overlook associating
instructions with existing annotated datasets. In this paper, we propose
Dynosaur, a dynamic growth paradigm for the automatic curation of
instruction-tuning data. Based on the metadata of existing datasets, we use
LLMs to automatically construct instruction-tuning data by identifying relevant
data fields and generating appropriate instructions.
By leveraging the existing annotated datasets, Dynosaur offers several
advantages: 1) it reduces the API cost for generating instructions (e.g., it
costs less than $12 USD by calling GPT-3.5-turbo to generate 800K
instruction-tuning samples); 2) it provides high-quality data for instruction
tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform
with comparable data sizes); and 3) it supports the continuous improvement of
models by generating instruction-tuning data when a new annotated dataset
becomes available. We further investigate a continual learning scheme for
learning with the ever-growing instruction-tuning dataset, and demonstrate that
replaying tasks with diverse instruction embeddings not only helps mitigate
forgetting issues but generalizes to unseen tasks better.
Code and data are available at https://github.com/WadeYin9712/Dynosaur.
Comment: EMNLP 2023.
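A rough sketch of the metadata-driven curation idea: describe a dataset's fields to an LLM, let it propose (instruction, input fields, output field) tasks, and instantiate instruction-tuning examples from the existing annotations. The prompt wording and the `call_llm` helper are hypothetical placeholders, not Dynosaur's actual prompts or pipeline.

```python
# Sketch of turning dataset metadata plus existing annotations into
# instruction-tuning examples via an LLM.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., to gpt-3.5-turbo)."""
    raise NotImplementedError

def propose_tasks(metadata: dict) -> list:
    """metadata: e.g., {"name": ..., "description": ..., "fields": {...}}."""
    prompt = (
        "Given this dataset metadata, list tasks that map some fields to an "
        "output field, each with a natural-language instruction. Answer as a "
        'JSON list of {"instruction", "input_fields", "output_field"}.\n\n'
        + json.dumps(metadata, indent=2)
    )
    return json.loads(call_llm(prompt))

def build_examples(tasks, records):
    """Turn each annotated record into (instruction, input, output) triples."""
    for task in tasks:
        for rec in records:
            inp = {f: rec[f] for f in task["input_fields"]}
            yield task["instruction"], inp, rec[task["output_field"]]
```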