Words of estimative probability (WEP) are expressions of a statement's
plausibility (probably, maybe, likely, doubt, unlikely, impossible...).
Multiple surveys demonstrate the agreement of human evaluators when assigning
numerical probability levels to WEP. For example, "highly likely" corresponds
to a median probability of 0.90±0.08 in the survey of Fagen-Ulmschneider
(2015). In this
work, we measure the ability of neural language processing models to capture
the consensual probability level associated with each WEP. Firstly, we use the
UNLI dataset (Chen et al., 2020) which associates premises and hypotheses with
their perceived joint probability p, to construct prompts, e.g. "[PREMISE].
[WEP], [HYPOTHESIS]." and assess whether language models can predict that the
WEP's consensual probability level is close to p. Secondly, we construct a
dataset of WEP-based probabilistic reasoning, to test whether language models
can reason with WEP compositions. When prompted "[EVENTA] is likely. [EVENTB]
is impossible.", a causal language model should not express that [EVENTA&B] is
likely. We show that both tasks are unsolved by off-the-shelf English language
models, but that fine-tuning leads to transferable improvements.
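The prompt-construction and labeling scheme described above can be sketched as follows. This is an illustrative sketch, not the paper's code: the WEP-to-median mapping uses rounded example values rather than exact survey figures, and the tolerance threshold is a hypothetical choice.

```python
# Illustrative sketch of the probe construction described in the abstract.
# The medians below are rounded example values, not exact survey figures.
WEP_MEDIANS = {
    "highly likely": 0.90,
    "likely": 0.70,
    "maybe": 0.50,
    "unlikely": 0.20,
    "impossible": 0.02,
}

def build_prompt(premise: str, wep: str, hypothesis: str) -> str:
    """Construct a '[PREMISE]. [WEP], [HYPOTHESIS].' probe string."""
    return f"{premise}. {wep.capitalize()}, {hypothesis}."

def wep_matches(wep: str, p: float, tol: float = 0.15) -> bool:
    """Gold label: is the WEP's consensual level close to the perceived
    probability p from a UNLI-style annotation? (tol is hypothetical.)"""
    return abs(WEP_MEDIANS[wep] - p) <= tol

print(build_prompt("A man is playing a guitar on stage",
                   "likely", "the man is a musician"))
print(wep_matches("likely", 0.75))
```

A model would then be queried with such prompts and scored against the boolean gold label.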