Keep it Neutral: Using Natural Language Inference to Improve Generation
We explore incorporating natural language inference (NLI) into the text
generative pipeline by using a pre-trained NLI model to assess whether a
generated sentence entails, contradicts, or is neutral to the prompt and
preceding text. First, we show that the NLI task is predictive of generation
errors made by GPT-3. We use these results to develop an NLI-informed
generation procedure for GPT-J. Then, we evaluate these generations by
obtaining human annotations on error types and overall quality. We find that an
NLI strategy of maximizing entailment improves text generation when the nucleus
sampling randomness parameter value is high, while one which maximizes
contradiction is in fact productive when the parameter value is low. Overall,
though, we demonstrate that an NLI strategy of maximizing the neutral class
provides the highest quality of generated text (significantly better than the
vanilla generations), regardless of parameter value.
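The abstract does not spell out the reranking procedure; below is a minimal sketch of the maximize-neutral strategy it describes, using GPT-2 and roberta-large-mnli as stand-ins for the paper's GPT-J and NLI model (the model choices, prompt, and candidate count are all illustrative assumptions):

```python
# Sketch: sample several continuations, score each against the prompt with
# an off-the-shelf NLI model, keep the one with the highest neutral probability.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

gen_tok = AutoTokenizer.from_pretrained("gpt2")
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")
nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def neutral_score(premise, hypothesis):
    """Probability that hypothesis is neutral with respect to premise."""
    enc = nli_tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**enc).logits.softmax(-1).squeeze(0)
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[1].item()

prompt = "The committee reviewed the proposal."
enc = gen_tok(prompt, return_tensors="pt")
candidates = gen_model.generate(**enc, do_sample=True, top_p=0.95,
                                max_new_tokens=30, num_return_sequences=5,
                                pad_token_id=gen_tok.eos_token_id)
continuations = [gen_tok.decode(c[enc["input_ids"].shape[1]:],
                                skip_special_tokens=True) for c in candidates]
best = max(continuations, key=lambda t: neutral_score(prompt, t))
print(best)
```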
When Classifying Arguments, BERT Doesn't Care About Word Order... Except When It Matters
We probe nouns in BERT contextual embedding space for grammatical role (subject vs. object of a clause), and examine how probing results vary between prototypical examples, where the role matches what we would expect from seeing that word in the context, and non-prototypical examples, where the role is mostly imparted by the context. In this way, we engage with the contrast that has arisen in the literature between studies that show contextual models as grammatically sensitive and others that show that these models are robust to changes in word order. Our experiments yield three results: 1) grammatical role is recovered in later layers for difficult non-prototypical cases, while prototypical cases are accurate without many layers of context; 2) when we switch the subject and the object of a sentence around (e.g., The chef cut the onion vs. The onion cut the chef), we see that the same word (e.g., onion) can be fluently identified as both a subject and an object; 3) subjecthood probing breaks if we ablate local word order by shuffling words locally and breaking grammaticality.
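As a rough illustration of this kind of probing (not the paper's exact setup), one can extract BERT's contextual embedding for a target noun and fit a linear subject-vs-object classifier; the sentences, layer choice, and token-lookup heuristic below are all illustrative assumptions:

```python
# Sketch of a subjecthood probe: grab the contextual embedding of a target
# noun from a chosen BERT layer, then train a linear classifier on
# subject (1) vs. object (0) labels.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def noun_embedding(sentence, noun, layer=8):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer].squeeze(0)
    # locate the first wordpiece of the target noun (toy heuristic)
    noun_id = tok(noun, add_special_tokens=False)["input_ids"][0]
    pos = (enc["input_ids"].squeeze(0) == noun_id).nonzero()[0].item()
    return hidden[pos].numpy()

examples = [("The chef cut the onion.", "chef", 1),   # prototypical subject
            ("The chef cut the onion.", "onion", 0),  # prototypical object
            ("The onion cut the chef.", "onion", 1),  # non-prototypical subject
            ("The onion cut the chef.", "chef", 0)]   # non-prototypical object
X = [noun_embedding(s, n) for s, n, _ in examples]
y = [label for _, _, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)
```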
Concord begets concord: A Bayesian model of nominal concord typology
Nominal concord is a phenomenon whereby nominal modifiers (e.g., adjectives, demonstratives, numerals) agree with their nominals along various dimensions (e.g., gender, number, case, definiteness). Here, drawing on a rich and typologically diverse database of nominal concord (Norris 2020), we build a Bayesian mixed-effects model of nominal concord. Specifically, we consider two competing hypotheses regarding the statistical relationship between different types of concord within a language: (1) concord begets concord: the presence of some type of concord in a language makes it more likely that it has other types of concord vs. (2) a little concord goes a long way: if a language has some kind of concord, it is less likely to have other types of concord. We present evidence strongly in favor of the first hypothesis, that concord begets concord. Languages with nominal concord tend to have concord in more than one place and of more than one type. Using posterior draws from our model, we also provide quantitative evidence for a number of the tendencies described by Norris (2019a). Future work will build on this model to understand the functional role of nominal concord in language systems, how it evolves, and how it co-evolves with other typological features.
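The paper's model specification is not reproduced here, but a minimal sketch of the kind of analysis described, a Bayesian logistic model with a per-family random intercept testing whether one concord type predicts another, might look like this in PyMC (the data and priors below are fabricated for illustration):

```python
# Toy model: does having adjective concord predict having demonstrative
# concord, allowing a random intercept per language family?
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n_langs, n_families = 100, 20
family = rng.integers(0, n_families, n_langs)
has_adj_concord = rng.integers(0, 2, n_langs)   # predictor (fabricated)
has_dem_concord = rng.integers(0, 2, n_langs)   # outcome (fabricated)

with pm.Model() as model:
    alpha = pm.Normal("alpha", 0.0, 1.5)            # global intercept
    beta = pm.Normal("beta", 0.0, 1.5)              # effect of adjective concord
    sigma = pm.HalfNormal("sigma", 1.0)
    z = pm.Normal("z", 0.0, 1.0, shape=n_families)  # family random intercepts
    logit_p = alpha + beta * has_adj_concord + sigma * z[family]
    pm.Bernoulli("y", logit_p=logit_p, observed=has_dem_concord)
    idata = pm.sample(1000, tune=1000, chains=2)

# "Concord begets concord" corresponds to posterior mass of beta above zero.
print((idata.posterior["beta"] > 0).mean().item())
```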
A Method for Studying Semantic Construal in Grammatical Constructions with Interpretable Contextual Embedding Spaces
We study semantic construal in grammatical constructions using large language
models. First, we project contextual word embeddings into three interpretable
semantic spaces, each defined by a different set of psycholinguistic feature
norms. We validate these interpretable spaces and then use them to
automatically derive semantic characterizations of lexical items in two
grammatical constructions: nouns in subject or object position within the same
sentence, and the AANN construction (e.g., "a beautiful three days"). We show
that a word in subject position is interpreted as more agentive than the very
same word in object position, and that the nouns in the AANN construction are
interpreted as more measurement-like than when in the canonical alternation.
Our method can probe the distributional meaning of syntactic constructions at a
templatic level, abstracted away from specific lexemes.
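A minimal sketch of the projection step, assuming the interpretable space is learned by regressing feature-norm ratings on embedding dimensions (the dimensions, penalty, and norms below are placeholders):

```python
# Fit a ridge regression from embedding space to a small feature-norm space,
# then read off predicted semantic profiles for tokens from a construction.
import numpy as np
from sklearn.linear_model import Ridge

emb_dim, n_words, n_features = 768, 500, 3  # e.g., agentive, concrete, measure-like
rng = np.random.default_rng(0)
train_embs = rng.normal(size=(n_words, emb_dim))      # embeddings with known norms
train_norms = rng.normal(size=(n_words, n_features))  # psycholinguistic ratings

projector = Ridge(alpha=10.0).fit(train_embs, train_norms)

# Predicted semantic profile of a token embedding from a target construction:
subject_token_emb = rng.normal(size=(1, emb_dim))
print(projector.predict(subject_token_emb))  # [[agentivity, concreteness, ...]]
```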
Counterfactually Probing Language Identity in Multilingual Models
Techniques in causal analysis of language models illuminate how linguistic
information is organized in LLMs. We use one such technique, AlterRep, a method
of counterfactual probing, to explore the internal structure of multilingual
models (mBERT and XLM-R). We train a linear classifier on a binary language
identity task, to classify tokens between Language X and Language Y. Applying a
counterfactual probing procedure, we use the classifier weights to project the
embeddings into the null space and push the resulting embeddings either in the
direction of Language X or Language Y. Then we evaluate on a masked language
modeling task. We find that, given a template in Language X, pushing towards
Language Y systematically increases the probability of Language Y words, above
and beyond a third-party control language. But it does not specifically push
the model towards translation-equivalent words in Language Y. Pushing towards
Language X (the same direction as the template) has a minimal effect, but
somewhat degrades these models. Overall, we take these results as further
evidence of the rich structure of massive multilingual language models, which
include both a language-specific and language-general component. And we show
that counterfactual probing can be fruitfully applied to multilingual models.
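Schematically, the push step can be thought of as removing the component of a hidden state along the probe's language-identity direction and then adding it back with a chosen sign and magnitude. The sketch below simplifies AlterRep to a single probe direction; the real method and its hyperparameters differ:

```python
# Simplified counterfactual push: null-space projection along one probe
# direction, then a push of magnitude alpha toward Language X or Y.
import numpy as np

def alter_rep(h, w, alpha=4.0, toward_y=True):
    """h: token hidden state; w: probe weights (Y positive, X negative)."""
    u = w / np.linalg.norm(w)
    h_null = h - np.dot(h, u) * u          # remove the language-identity component
    sign = 1.0 if toward_y else -1.0
    return h_null + sign * alpha * u       # push along the language direction

rng = np.random.default_rng(0)
h = rng.normal(size=768)                   # a hidden state from, say, mBERT
w = rng.normal(size=768)                   # trained probe weights (stand-in)
h_pushed = alter_rep(h, w, toward_y=True)  # then evaluate the MLM head on this
```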
SNAP judgments: A small N acceptability paradigm (SNAP) for linguistic acceptability judgments
While published linguistic judgments sometimes differ from the judgments found in large-scale formal experiments with naive participants, there is not a consensus as to how often these errors occur nor as to how often formal experiments should be used in syntax and semantics research. In this article, we first present the results of a large-scale replication of the Sprouse et al. 2013 study on 100 English contrasts randomly sampled from Linguistic Inquiry 2001–2010 and tested in both a forced-choice experiment and an acceptability rating experiment. Like Sprouse, Schütze, and Almeida, we find that the effect sizes of published linguistic acceptability judgments are not uniformly large or consistent but rather form a continuum from very large effects to small or nonexistent effects. We then use this data as a prior in a Bayesian framework to propose a small n acceptability paradigm for linguistic acceptability judgments (SNAP Judgments). This proposal makes it easier and cheaper to obtain meaningful quantitative data in syntax and semantics research. Specifically, for a contrast of linguistic interest for which a researcher is confident that sentence A is better than sentence B, we recommend that the researcher should obtain judgments from at least five unique participants, using at least five unique sentences of each type. If all participants in the sample agree that sentence A is better than sentence B, then the researcher can be confident that the result of a full forced-choice experiment would likely be 75% or more agreement in favor of sentence A (with a mean of 93%). We test this proposal by sampling from the existing data and find that it gives reliable performance.
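As a back-of-envelope illustration of the Bayesian logic (with an invented Beta prior rather than the paper's empirically derived one), one can compute the posterior over the population agreement rate after a unanimous five-participant sample:

```python
# Beta-binomial update: 5 of 5 participants prefer sentence A.
# Prior Beta(2, 2) is a placeholder, not the paper's data-derived prior.
from scipy.stats import beta

a0, b0 = 2.0, 2.0          # hypothetical prior over the agreement rate
k, n = 5, 5                # unanimous small-N sample
posterior = beta(a0 + k, b0 + n - k)
print(posterior.sf(0.75))  # P(agreement >= 0.75 | data)
print(posterior.mean())    # posterior mean agreement
```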
Elaborative Simplification as Implicit Questions Under Discussion
Automated text simplification, a technique useful for making text more
accessible to people such as children and emergent bilinguals, is often thought
of as a monolingual translation task from complex sentences to simplified
sentences using encoder-decoder models. This view fails to account for
elaborative simplification, where new information is added into the simplified
text. This paper proposes to view elaborative simplification through the lens
of the Question Under Discussion (QUD) framework, providing a robust way to
investigate what writers elaborate upon, how they elaborate, and how
elaborations fit into the discourse context by viewing elaborations as explicit
answers to implicit questions. We introduce ElabQUD, consisting of 1.3K
elaborations accompanied by implicit QUDs, to study these phenomena. We show
that explicitly modeling QUD (via question generation) not only provides
essential understanding of elaborative simplification and how the elaborations
connect with the rest of the discourse, but also substantially improves the
quality of elaboration generation.Comment: Equal contribution by Yating Wu and William Sheffield. This the EMNLP
2023 Main camera-ready versio
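One plausible way to operationalize the question-generation step, sketched here with flan-t5 as a stand-in (the paper's actual models, prompts, and data format may differ):

```python
# Given a context and an elaboration, ask a seq2seq model to recover the
# implicit question (QUD) that the elaboration answers.
from transformers import pipeline

qg = pipeline("text2text-generation", model="google/flan-t5-base")

context = "The volcano erupted last week."
elaboration = "A volcano is a mountain that can shoot out hot melted rock."
prompt = (f"Context: {context}\nElaboration: {elaboration}\n"
          "What implicit question does the elaboration answer?")
print(qg(prompt, max_new_tokens=32)[0]["generated_text"])
```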
Multilingual BERT, Ergativity, and Grammatical Subjecthood
We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a subject) is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecthood of mBERT embeddings in transitive sentences (which do not contain overt information about morphosyntactic alignment) and then evaluate them zero-shot on intransitive sentences (where subjecthood classification depends on alignment), within and across languages. We find that the resulting classifier distributions reflect the morphosyntactic alignment of their training languages. Our results demonstrate that mBERT representations are influenced by high-level grammatical features that are not manifested in any one input sentence, and that this is robust across languages. Further examining the characteristics that our classifiers rely on, we find that features such as passive voice, animacy, and case strongly correlate with classification decisions, suggesting that mBERT does not encode a purely syntactic subjecthood, but a continuous subjecthood as is proposed in much of the functional linguistics literature. Together, these results provide insight into how grammatical features manifest in contextual embedding spaces, at a level of abstraction not covered by previous work.
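The transfer setup can be sketched schematically: train a subjecthood classifier on transitive (A vs. O) argument embeddings and apply it unchanged to intransitive (S) arguments. The embeddings below are fabricated; in practice they would come from mBERT:

```python
# Zero-shot subjecthood transfer: fit on transitive arguments, then inspect
# how intransitive subjects (S) are classified.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_transitive = rng.normal(size=(200, 768))   # A and O argument embeddings
y_transitive = rng.integers(0, 2, 200)       # 1 = subject (A), 0 = object (O)
X_intransitive = rng.normal(size=(50, 768))  # S argument embeddings, unlabeled

clf = LogisticRegression(max_iter=1000).fit(X_transitive, y_transitive)
# For an ergative language, S should pattern more with O; for an
# accusative language, more with A.
print(clf.predict_proba(X_intransitive)[:, 1].mean())
```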
Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs
Are LLMs cultural technologies like photocopiers or printing presses, which transmit information but cannot create new content? A challenge for this idea, which we call "bibliotechnism", is that LLMs often do generate entirely novel text. We begin by defending bibliotechnism against this challenge, showing how novel text may be meaningful only in a derivative sense, so that the content of this generated text depends in an important sense on the content of original human text. We go on to present a different, novel challenge for bibliotechnism, stemming from examples in which LLMs generate "novel reference", using novel names to refer to novel entities. Such examples could be smoothly explained if LLMs were not cultural technologies but possessed a limited form of agency (beliefs, desires, and intentions). According to interpretationism in the philosophy of mind, a system has beliefs, desires, and intentions if and only if its behavior is well-explained by the hypothesis that it has such states. In line with this view, we argue that cases of novel reference provide evidence that LLMs do in fact have beliefs, desires, and intentions, and thus have a limited form of agency.
Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways
We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained
our masked language models with three ingredients: an initial pretraining with
music data, training on shorter sequences before training on longer ones, and
masking specific tokens to target some of the BLiMP subtasks. Overall, our
baseline models performed above chance, but far below the performance levels of
larger LLMs trained on more data. We found that training on short sequences
performed better than training on longer sequences. Pretraining on music may
help performance marginally, but, if so, the effect seems small. Our targeted
Masked Language Modeling augmentation did not seem to improve model performance
in general, but did seem to help on some of the specific BLiMP tasks that we
were targeting (e.g., Negative Polarity Items). Training performant LLMs on
small amounts of data is a difficult but potentially informative task. While
some of our techniques showed some promise, more work is needed to explore
whether they can improve performance more than the modest gains here. Our code
is available at https://github.com/venkatasg/Lil-Bevo and our models at
https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a
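The targeted-masking idea can be sketched as a masking policy that upweights tokens from a target list; the token list and probabilities below are invented for illustration:

```python
# When building MLM training batches, preferentially mask tokens from a
# target set (e.g., NPI-related words) rather than masking uniformly.
import random

TARGET_TOKENS = {"any", "ever", "yet"}  # e.g., negative polarity items
MASK, P_TARGET, P_OTHER = "[MASK]", 0.5, 0.15

def targeted_mask(tokens):
    masked, labels = [], []
    for tok in tokens:
        p = P_TARGET if tok in TARGET_TOKENS else P_OTHER
        if random.random() < p:
            masked.append(MASK)
            labels.append(tok)    # model must predict the original token
        else:
            masked.append(tok)
            labels.append(None)   # position ignored in the loss
    return masked, labels

print(targeted_mask("nobody has ever seen any results yet".split()))
```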