479 research outputs found
If beam search is the answer, what was the question?
Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural
language generators frequently leads to low-quality results. Rather, most
state-of-the-art results on language generation tasks are attained using beam
search despite its overwhelmingly high search error rate. This implies that the
MAP objective alone does not express the properties we desire in text, which
merits the question: if beam search is the answer, what was the question? We
frame beam search as the exact solution to a different decoding objective in
order to gain insights into why high probability under a model alone may not
indicate adequacy. We find that beam search enforces uniform information
density in text, a property motivated by cognitive science. We suggest a set of
decoding objectives that explicitly enforce this property and find that exact
decoding with these objectives alleviates the problems encountered when
decoding poorly calibrated language generation models. Additionally, we analyze
the text produced using various decoding strategies and see that, in our neural
machine translation experiments, the extent to which this property is adhered
to strongly correlates with BLEU.Comment: EMNLP 202
On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation
A good automatic evaluation metric for language generation ideally correlates
highly with human judgements of text quality. Yet, there is a dearth of such
metrics, which inhibits the rapid and efficient progress of language
generators. One exception is the recently proposed Mauve. In theory, Mauve
measures an information-theoretic divergence between two probability
distributions over strings: one representing the language generator under
evaluation; the other representing the true natural language distribution.
Mauve's authors argue that its success comes from the qualitative properties of
their proposed divergence. Yet in practice, as this divergence is uncomputable,
Mauve approximates it by measuring the divergence between multinomial
distributions over clusters instead, where cluster assignments are attained by
grouping strings based on a pre-trained language model's embeddings. As we
show, however, this is not a tight approximation -- in either theory or
practice. This begs the question: why does Mauve work so well? In this work, we
show that Mauve was right for the wrong reasons, and that its newly proposed
divergence is not necessary for its high performance. In fact, classical
divergences paired with its proposed cluster-based approximation may actually
serve as better evaluation metrics. We finish the paper with a probing
analysis; this analysis leads us to conclude that -- by encoding syntactic- and
coherence-level features of text, while ignoring surface-level features -- such
cluster-based substitutes to string distributions may simply be better for
evaluating state-of-the-art language generators.Comment: Tiago Pimentel and Clara Meister contributed equally to this wor
Determinantal Beam Search
Beam search is a go-to strategy for decoding neural sequence models. The
algorithm can naturally be viewed as a subset optimization problem, albeit one
where the corresponding set function does not reflect interactions between
candidates. Empirically, this leads to sets often exhibiting high overlap,
e.g., strings may differ by only a single word. Yet in use-cases that call for
multiple solutions, a diverse or representative set is often desired. To
address this issue, we propose a reformulation of beam search, which we call
determinantal beam search. Determinantal beam search has a natural relationship
to determinantal point processes (DPPs), models over sets that inherently
encode intra-set interactions. By posing iterations in beam search as a series
of subdeterminant maximization problems, we can turn the algorithm into a
diverse subset selection process. In a case study, we use the string
subsequence kernel to explicitly encourage n-gram coverage in text generated
from a sequence model. We observe that our algorithm offers competitive
performance against other diverse set generation strategies in the context of
language generation, while providing a more general approach to optimizing for
diversity
Estimating the Entropy of Linguistic Distributions
Shannon entropy is often a quantity of interest to linguists studying the
communicative capacity of human language. However, entropy must typically be
estimated from observed data because researchers do not have access to the
underlying probability distribution that gives rise to these data. While
entropy estimation is a well-studied problem in other fields, there is not yet
a comprehensive exploration of the efficacy of entropy estimators for use with
linguistic data. In this work, we fill this void, studying the empirical
effectiveness of different entropy estimators for linguistic distributions. In
a replication of two recent information-theoretic linguistic studies, we find
evidence that the reported effect size is over-estimated due to over-reliance
on poor entropy estimators. Finally, we end our paper with concrete
recommendations for entropy estimation depending on distribution type and data
availability.Comment: 21 pages (5 pages main text). 4 figures. Accepted to ACL 202
On the Effect of Anticipation on Reading Times
Over the past two decades, numerous studies have demonstrated how less
predictable (i.e., higher surprisal) words take more time to read. In general,
these studies have implicitly assumed the reading process is purely responsive:
Readers observe a new word and allocate time to process it as required. We
argue that prior results are also compatible with a reading process that is at
least partially anticipatory: Readers could make predictions about a future
word and allocate time to process it based on their expectation. In this work,
we operationalize this anticipation as a word's contextual entropy. We assess
the effect of anticipation on reading by comparing how well surprisal and
contextual entropy predict reading times on four naturalistic reading datasets:
two self-paced and two eye-tracking. Experimentally, across datasets and
analyses, we find substantial evidence for effects of contextual entropy over
surprisal on a word's reading time (RT): in fact, entropy is sometimes better
than surprisal in predicting a word's RT. Spillover effects, however, are
generally not captured by entropy, but only by surprisal. Further, we
hypothesize four cognitive mechanisms through which contextual entropy could
impact RTs -- three of which we are able to design experiments to analyze.
Overall, our results support a view of reading that is not just responsive, but
also anticipatory.Comment: This is a pre-MIT Press publication version of the paper. Code is
available in https://github.com/rycolab/anticipation-on-reading-time
A Natural Bias for Language Generation Models
After just a few hundred training updates, a standard probabilistic model for
language generation has likely not yet learnt many semantic or syntactic rules
of natural language, making it difficult to estimate the probability
distribution over next tokens. Yet around this point, these models have
identified a simple, loss-minimising behaviour: to output the unigram
distribution of the target training corpus. The use of such a heuristic raises
the question: Can we initialise our models with this behaviour and save
precious compute resources and model capacity? Here we show that we can
effectively endow standard neural language generation models with a separate
module that reflects unigram frequency statistics as prior knowledge, simply by
initialising the bias term in a model's final linear layer with the log-unigram
distribution. We use neural machine translation as a test bed for this simple
technique and observe that it: (i) improves learning efficiency; (ii) achieves
better overall performance; and perhaps most importantly (iii) appears to
disentangle strong frequency effects by encouraging the model to specialise in
non-frequency-related aspects of language.Comment: Main conference paper at ACL 202
Testing the Predictions of Surprisal Theory in 11 Languages
A fundamental result in psycholinguistics is that less predictable words take
a longer time to process. One theoretical explanation for this finding is
Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's
predictability as its surprisal, i.e. its negative log-probability given a
context. While evidence supporting the predictions of Surprisal Theory have
been replicated widely, most have focused on a very narrow slice of data:
native English speakers reading English texts. Indeed, no comprehensive
multilingual analysis exists. We address this gap in the current literature by
investigating the relationship between surprisal and reading times in eleven
different languages, distributed across five language families. Deriving
estimates from language models trained on monolingual and multilingual corpora,
we test three predictions associated with surprisal theory: (i) whether
surprisal is predictive of reading times; (ii) whether expected surprisal, i.e.
contextual entropy, is predictive of reading times; (iii) and whether the
linking function between surprisal and reading times is linear. We find that
all three predictions are borne out crosslinguistically. By focusing on a more
diverse set of languages, we argue that these results offer the most robust
link to-date between information theory and incremental language processing
across languages.Comment: This is a pre-MIT Press publication version of the pape
Revisiting the Uniform Information Density Hypothesis
The uniform information density (UID) hypothesis posits a preference among language users for utterances structured such that information is distributed uniformly across a signal. While its implications on language production have been well explored, the hypothesis potentially makes predictions about language comprehension and linguistic acceptability as well. Further, it is unclear how uniformity in a linguistic signal -- or lack thereof -- should be measured, and over which linguistic unit, e.g., the sentence or language level, this uniformity should hold. Here we investigate these facets of the UID hypothesis using reading time and acceptability data. While our reading time results are generally consistent with previous work, they are also consistent with a weakly super-linear effect of surprisal, which would be compatible with UID's predictions. For acceptability judgments, we find clearer evidence that non-uniformity in information density is predictive of lower acceptability. We then explore multiple operationalizations of UID, motivated by different interpretations of the original hypothesis, and analyze the scope over which the pressure towards uniformity is exerted. The explanatory power of a subset of the proposed operationalizations suggests that the strongest trend may be a regression towards a mean surprisal across the language, rather than the phrase, sentence, or document -- a finding that supports a typical interpretation of UID, namely that it is the byproduct of language users maximizing the use of a (hypothetical) communication channel
- …