If beam search is the answer, what was the question?
Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural
language generators frequently leads to low-quality results. Rather, most
state-of-the-art results on language generation tasks are attained using beam
search despite its overwhelmingly high search error rate. This implies that the
MAP objective alone does not express the properties we desire in text, which
merits the question: if beam search is the answer, what was the question? We
frame beam search as the exact solution to a different decoding objective in
order to gain insights into why high probability under a model alone may not
indicate adequacy. We find that beam search enforces uniform information
density in text, a property motivated by cognitive science. We suggest a set of
decoding objectives that explicitly enforce this property and find that exact
decoding with these objectives alleviates the problems encountered when
decoding poorly calibrated language generation models. Additionally, we analyze
the text produced using various decoding strategies and see that, in our neural
machine translation experiments, the extent to which this property is adhered
to strongly correlates with BLEU.
Comment: EMNLP 202
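As a rough illustration of the idea, the sketch below scores a beam candidate by its log-probability minus a penalty on the variance of its per-token surprisals; the penalty form, its weight, and the toy candidates are illustrative assumptions, not the paper's exact decoding objective.

```python
import numpy as np

def uid_regularized_score(token_logprobs, lam=1.0):
    """Score a candidate by its log-probability minus a penalty on how
    unevenly information (surprisal) is spread across its tokens.

    token_logprobs: per-token log-probabilities of one candidate.
    lam: strength of the uniform-information-density penalty (illustrative).
    """
    surprisals = -np.asarray(token_logprobs)   # per-token information content
    log_prob = -surprisals.sum()               # standard MAP objective
    uid_penalty = surprisals.var()             # deviation from uniformity
    return log_prob - lam * uid_penalty

# Rank two hypothetical candidates with equal total probability.
candidates = {
    "even":  [-1.0, -1.1, -0.9, -1.0],  # surprisal spread evenly
    "spiky": [-0.1, -0.1, -0.1, -3.7],  # one highly surprising token
}
ranked = sorted(candidates, key=lambda c: uid_regularized_score(candidates[c]),
                reverse=True)
print(ranked)  # the "even" candidate wins under the UID-regularized score
```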
On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation
A good automatic evaluation metric for language generation ideally correlates
highly with human judgements of text quality. Yet, there is a dearth of such
metrics, which inhibits the rapid and efficient progress of language
generators. One exception is the recently proposed Mauve. In theory, Mauve
measures an information-theoretic divergence between two probability
distributions over strings: one representing the language generator under
evaluation; the other representing the true natural language distribution.
Mauve's authors argue that its success comes from the qualitative properties of
their proposed divergence. Yet in practice, as this divergence is uncomputable,
Mauve approximates it by measuring the divergence between multinomial
distributions over clusters instead, where cluster assignments are attained by
grouping strings based on a pre-trained language model's embeddings. As we
show, however, this is not a tight approximation -- in either theory or
practice. This begs the question: why does Mauve work so well? In this work, we
show that Mauve was right for the wrong reasons, and that its newly proposed
divergence is not necessary for its high performance. In fact, classical
divergences paired with its proposed cluster-based approximation may actually
serve as better evaluation metrics. We finish the paper with a probing
analysis; this analysis leads us to conclude that -- by encoding syntactic- and
coherence-level features of text, while ignoring surface-level features -- such
cluster-based substitutes to string distributions may simply be better for
evaluating state-of-the-art language generators.
Comment: Tiago Pimentel and Clara Meister contributed equally to this work
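The sketch below illustrates the cluster-based approximation under discussion: strings are embedded, jointly clustered, and a classical divergence (here KL) is computed between the resulting multinomial cluster histograms. The random `embed` stand-in, the cluster count, and the smoothing constant are assumptions made so the example runs end to end; this is not Mauve's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_histogram(embeddings, kmeans):
    """Multinomial distribution over clusters for one set of strings."""
    labels = kmeans.predict(embeddings)
    counts = np.bincount(labels, minlength=kmeans.n_clusters).astype(float)
    return (counts + 1e-6) / (counts + 1e-6).sum()  # lightly smoothed

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

# `embed` stands in for a pre-trained LM encoder; random vectors keep the
# sketch self-contained.
rng = np.random.default_rng(0)
embed = lambda texts: rng.normal(size=(len(texts), 32))

human_emb = embed(["human text"] * 500)
model_emb = embed(["model text"] * 500)

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
kmeans.fit(np.vstack([human_emb, model_emb]))  # cluster both sets jointly

p = cluster_histogram(human_emb, kmeans)
q = cluster_histogram(model_emb, kmeans)
print("KL(human || model) over clusters:", kl_divergence(p, q))
```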
Determinantal Beam Search
Beam search is a go-to strategy for decoding neural sequence models. The
algorithm can naturally be viewed as a subset optimization problem, albeit one
where the corresponding set function does not reflect interactions between
candidates. Empirically, this leads to sets often exhibiting high overlap,
e.g., strings may differ by only a single word. Yet in use-cases that call for
multiple solutions, a diverse or representative set is often desired. To
address this issue, we propose a reformulation of beam search, which we call
determinantal beam search. Determinantal beam search has a natural relationship
to determinantal point processes (DPPs), models over sets that inherently
encode intra-set interactions. By posing iterations in beam search as a series
of subdeterminant maximization problems, we can turn the algorithm into a
diverse subset selection process. In a case study, we use the string
subsequence kernel to explicitly encourage n-gram coverage in text generated
from a sequence model. We observe that our algorithm offers competitive
performance against other diverse set generation strategies in the context of
language generation, while providing a more general approach to optimizing for
diversity.
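A minimal sketch of the subdeterminant-maximization step described above, assuming a toy similarity matrix in place of the string subsequence kernel; the greedy selection and quality weighting are illustrative rather than the authors' implementation.

```python
import numpy as np

def greedy_subdeterminant_selection(log_quality, similarity, k):
    """Greedily pick k candidates maximizing the log-determinant of the
    quality-weighted kernel restricted to the chosen subset.

    log_quality: (n,) candidate scores, e.g., log-probabilities under the model.
    similarity:  (n, n) positive semi-definite similarity matrix.
    """
    quality = np.exp(log_quality)
    L = similarity * np.outer(quality, quality)  # DPP-style kernel
    chosen = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(len(quality)):
            if i in chosen:
                continue
            idx = chosen + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            val = logdet if sign > 0 else -np.inf
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return chosen

# Toy example: candidates 0 and 1 are near-duplicates; 2 is distinct.
log_q = np.array([-1.0, -1.05, -1.5])
sim = np.array([[1.0, 0.95, 0.10],
                [0.95, 1.0, 0.10],
                [0.10, 0.10, 1.0]])
print(greedy_subdeterminant_selection(log_q, sim, k=2))  # selects [0, 2]
```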
Estimating the Entropy of Linguistic Distributions
Shannon entropy is often a quantity of interest to linguists studying the
communicative capacity of human language. However, entropy must typically be
estimated from observed data because researchers do not have access to the
underlying probability distribution that gives rise to these data. While
entropy estimation is a well-studied problem in other fields, there is not yet
a comprehensive exploration of the efficacy of entropy estimators for use with
linguistic data. In this work, we fill this void, studying the empirical
effectiveness of different entropy estimators for linguistic distributions. In
a replication of two recent information-theoretic linguistic studies, we find
evidence that the reported effect size is over-estimated due to over-reliance
on poor entropy estimators. Finally, we end our paper with concrete
recommendations for entropy estimation depending on distribution type and data
availability.
Comment: 21 pages (5 pages main text). 4 figures. Accepted to ACL 202
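For concreteness, the sketch below contrasts two standard estimators on a small sample: the naive plug-in (maximum-likelihood) estimator and its Miller-Madow bias correction. These are generic examples of the estimator families studied, not necessarily the paper's recommended choices.

```python
import numpy as np

def plugin_entropy(samples):
    """Naive plug-in estimator: entropy (in bits) of the empirical distribution."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow correction (K - 1) / (2N), in bits."""
    _, counts = np.unique(samples, return_counts=True)
    n, k = counts.sum(), len(counts)
    return plugin_entropy(samples) + (k - 1) / (2 * n * np.log(2))

# Small sample from a uniform distribution over 50 types (true entropy ~5.64 bits):
rng = np.random.default_rng(0)
sample = rng.integers(0, 50, size=100)
print(plugin_entropy(sample), miller_madow_entropy(sample))
# The plug-in estimate is biased downward; Miller-Madow recovers part of the gap.
```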
On the Effect of Anticipation on Reading Times
Over the past two decades, numerous studies have demonstrated how less
predictable (i.e., higher surprisal) words take more time to read. In general,
these studies have implicitly assumed the reading process is purely responsive:
Readers observe a new word and allocate time to process it as required. We
argue that prior results are also compatible with a reading process that is at
least partially anticipatory: Readers could make predictions about a future
word and allocate time to process it based on their expectation. In this work,
we operationalize this anticipation as a word's contextual entropy. We assess
the effect of anticipation on reading by comparing how well surprisal and
contextual entropy predict reading times on four naturalistic reading datasets:
two self-paced and two eye-tracking. Experimentally, across datasets and
analyses, we find substantial evidence for effects of contextual entropy over
surprisal on a word's reading time (RT): in fact, entropy is sometimes better
than surprisal in predicting a word's RT. Spillover effects, however, are
generally not captured by entropy, but only by surprisal. Further, we
hypothesize four cognitive mechanisms through which contextual entropy could
impact RTs -- three of which we are able to design experiments to analyze.
Overall, our results support a view of reading that is not just responsive, but
also anticipatory.
Comment: This is a pre-MIT Press publication version of the paper. Code is available at https://github.com/rycolab/anticipation-on-reading-time
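The sketch below shows how the two predictors compared above could be computed at a single position, assuming access to a model's next-token distribution (here a toy distribution): the surprisal of the word that actually appears versus the contextual entropy, i.e., the expected surprisal over that distribution.

```python
import numpy as np

def surprisal_and_entropy(next_token_probs, observed_index):
    """Given a next-token distribution at some position, return
    (i) the surprisal of the observed word and
    (ii) the contextual entropy (expected surprisal), the anticipation measure."""
    p = np.asarray(next_token_probs, dtype=float)
    p = p / p.sum()
    surprisal = -np.log2(p[observed_index])
    entropy = float(-np.sum(p * np.log2(p)))
    return surprisal, entropy

# Toy next-token distribution (in practice this comes from a language model):
probs = [0.5, 0.25, 0.125, 0.125]
print(surprisal_and_entropy(probs, observed_index=2))  # (3.0, 1.75)
```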
Revisiting the Optimality of Word Lengths
Zipf (1935) posited that wordforms are optimized to minimize utterances'
communicative costs. Under the assumption that cost is given by an utterance's
length, he supported this claim by showing that words' lengths are inversely
correlated with their frequencies. Communicative cost, however, can be
operationalized in different ways. Piantadosi et al. (2011) claim that cost
should be measured as the distance between an utterance's information rate and
channel capacity, which we dub the channel capacity hypothesis (CCH) here.
Following this logic, they then proposed that a word's length should be
proportional to the expected value of its surprisal (negative log-probability
in context). In this work, we show that Piantadosi et al.'s derivation does not
minimize CCH's cost, but rather a lower bound, which we term CCH-lower. We
propose a novel derivation, suggesting an improved way to minimize CCH's cost.
Under this method, we find that a language's word lengths should instead be
proportional to the surprisal's expectation plus its variance-to-mean ratio.
Experimentally, we compare these three communicative cost functions: Zipf's,
CCH-lower, and CCH. Across 13 languages and several experimental settings, we
find that length is better predicted by frequency than either of the other
hypotheses. In fact, when surprisal's expectation, or expectation plus
variance-to-mean ratio, is estimated using better language models, it leads to
worse word length predictions. We take these results as evidence that Zipf's
longstanding hypothesis holds.
Comment: Published at EMNLP 202
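As a small illustration, the sketch below computes the two CCH-style predictors named above from a word's per-occurrence surprisals; the toy surprisal values are invented and the function name is ours.

```python
import numpy as np

def cch_predictors(surprisals):
    """Per-word length predictors from that word's in-context surprisals.

    CCH-lower: expected surprisal (Piantadosi et al.'s quantity).
    CCH:       expected surprisal plus its variance-to-mean ratio, the quantity
               the corrected derivation says word length should track.
    """
    s = np.asarray(surprisals, dtype=float)
    mean, var = s.mean(), s.var()
    return {"cch_lower": mean, "cch": mean + var / mean}

# Toy per-occurrence surprisals (in bits) for one word across a corpus:
print(cch_predictors([2.0, 3.0, 10.0, 2.5]))
```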
A Natural Bias for Language Generation Models
After just a few hundred training updates, a standard probabilistic model for
language generation has likely not yet learnt many semantic or syntactic rules
of natural language, making it difficult to estimate the probability
distribution over next tokens. Yet around this point, these models have
identified a simple, loss-minimising behaviour: to output the unigram
distribution of the target training corpus. The use of such a heuristic raises
the question: Can we initialise our models with this behaviour and save
precious compute resources and model capacity? Here we show that we can
effectively endow standard neural language generation models with a separate
module that reflects unigram frequency statistics as prior knowledge, simply by
initialising the bias term in a model's final linear layer with the log-unigram
distribution. We use neural machine translation as a test bed for this simple
technique and observe that it: (i) improves learning efficiency; (ii) achieves
better overall performance; and perhaps most importantly (iii) appears to
disentangle strong frequency effects by encouraging the model to specialise in
non-frequency-related aspects of language.
Comment: Main conference paper at ACL 202
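A minimal PyTorch sketch of the initialization described above, assuming hypothetical layer sizes and a small smoothing constant: the bias of the model's final projection is set to the log-unigram distribution of the training tokens.

```python
import torch
import torch.nn as nn

def init_output_bias_with_log_unigram(output_layer, token_ids, vocab_size, eps=1e-8):
    """Set the bias of a model's final projection to the log-unigram
    distribution estimated from the training corpus."""
    counts = torch.bincount(token_ids, minlength=vocab_size).float()
    log_unigram = torch.log(counts / counts.sum() + eps)  # eps avoids log(0)
    with torch.no_grad():
        output_layer.bias.copy_(log_unigram)

# Toy example with a hypothetical vocabulary of 1000 types:
vocab_size = 1000
projection = nn.Linear(512, vocab_size)            # final layer: hidden -> vocab
corpus = torch.randint(0, vocab_size, (100_000,))  # stand-in for training tokens
init_output_bias_with_log_unigram(projection, corpus, vocab_size)
```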
Testing the Predictions of Surprisal Theory in 11 Languages
A fundamental result in psycholinguistics is that less predictable words take
a longer time to process. One theoretical explanation for this finding is
Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's
predictability as its surprisal, i.e. its negative log-probability given a
context. While evidence supporting the predictions of Surprisal Theory has
been replicated widely, most studies have focused on a very narrow slice of data:
native English speakers reading English texts. Indeed, no comprehensive
multilingual analysis exists. We address this gap in the current literature by
investigating the relationship between surprisal and reading times in eleven
different languages, distributed across five language families. Deriving
estimates from language models trained on monolingual and multilingual corpora,
we test three predictions associated with surprisal theory: (i) whether
surprisal is predictive of reading times; (ii) whether expected surprisal, i.e.,
contextual entropy, is predictive of reading times; and (iii) whether the
linking function between surprisal and reading times is linear. We find that
all three predictions are borne out crosslinguistically. By focusing on a more
diverse set of languages, we argue that these results offer the most robust
link to date between information theory and incremental language processing
across languages.
Comment: This is a pre-MIT Press publication version of the paper
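As a simplified illustration of prediction (i), the sketch below fits reading times with and without surprisal as a predictor and reports the held-out gain from adding it. The synthetic data, plain linear model, and MSE criterion are assumptions standing in for the (generalized) regression and delta-log-likelihood analyses such studies typically use.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def heldout_gain_from_surprisal(X_base, surprisal, rt, test_frac=0.2, seed=0):
    """Fit reading times with and without surprisal and report how much
    held-out squared error drops when surprisal is added."""
    rng = np.random.default_rng(seed)
    test = rng.random(len(rt)) < test_frac
    train = ~test
    X_full = np.column_stack([X_base, surprisal])

    base = LinearRegression().fit(X_base[train], rt[train])
    full = LinearRegression().fit(X_full[train], rt[train])
    mse_base = np.mean((base.predict(X_base[test]) - rt[test]) ** 2)
    mse_full = np.mean((full.predict(X_full[test]) - rt[test]) ** 2)
    return mse_base - mse_full  # > 0 means surprisal helps predict RTs

# Synthetic data in which reading times rise linearly with surprisal:
rng = np.random.default_rng(1)
n = 2000
length = rng.integers(1, 12, n).astype(float)
log_freq = rng.normal(0, 1, n)
surprisal = rng.gamma(2.0, 2.0, n)
rt = 200 + 5 * length - 10 * log_freq + 8 * surprisal + rng.normal(0, 20, n)

X_base = np.column_stack([length, log_freq])
print(heldout_gain_from_surprisal(X_base, surprisal, rt))
```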