Likelihood, although useful as a training loss, is a poor search objective
for guiding open-ended generation from language models (LMs). Existing
generation algorithms must avoid both unlikely strings, which are incoherent,
and highly likely ones, which are short and repetitive. We propose contrastive
decoding (CD), a more reliable search objective that returns the difference
between likelihood under a large LM (called the expert, e.g. OPT-13b) and a
small LM (called the amateur, e.g. OPT-125m). CD is inspired by the fact that
the failures of larger LMs (e.g., repetition, incoherence) are even more
prevalent in smaller LMs, and that this difference signals exactly which texts
should be preferred. CD requires
zero training, and produces higher quality text than decoding from the larger
LM alone. It also generalizes across model types (OPT and GPT2) and
significantly outperforms four strong decoding algorithms in automatic and
human evaluations.
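
To make the objective concrete, below is a minimal sketch of CD's per-token
scoring, assuming the Hugging Face transformers library and the OPT
checkpoints named above. It is illustrative only: it shows the core
difference-of-log-likelihoods score, while the full method also restricts
candidates to tokens the expert itself finds plausible before applying it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Expert and amateur from the same family, so they share a tokenizer/vocab.
expert = AutoModelForCausalLM.from_pretrained("facebook/opt-13b")
amateur = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-13b")

def cd_next_token_scores(prefix: str) -> torch.Tensor:
    """Contrastive score for each candidate next token:
    expert log-probability minus amateur log-probability.
    (Illustrative sketch; the paper's plausibility constraint is omitted.)"""
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        expert_logits = expert(ids).logits[0, -1]    # last-position logits
        amateur_logits = amateur(ids).logits[0, -1]
    expert_logp = torch.log_softmax(expert_logits, dim=-1)
    amateur_logp = torch.log_softmax(amateur_logits, dim=-1)
    return expert_logp - amateur_logp

# Example: pick the token that maximizes the contrastive objective.
scores = cd_next_token_scores("The city council met on Tuesday to")
next_token = tokenizer.decode(scores.argmax())
```

Intuitively, tokens that both models rate as likely (e.g., degenerate
repetition) receive a small difference and are demoted, while tokens the
expert uniquely prefers are promoted.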