Masked language models conventionally use a masking rate of 15% due to the
belief that more masking would provide insufficient context to learn good
representations, and less masking would make training too expensive.
Surprisingly, we find that masking up to 40% of input tokens can outperform the
15% baseline, and even masking 80% can preserve most of the performance, as
measured by fine-tuning on downstream tasks. Increasing the masking rate has
two distinct effects, which we investigate through careful ablations: (1) a
larger proportion of input tokens is corrupted, reducing the context size and
creating a harder task, and (2) the model makes more predictions, which benefits
training. We observe that larger models, which have more capacity to tackle the
harder task, particularly favor higher masking rates. We also find that more
sophisticated masking schemes, such as span masking or PMI masking, can also
benefit from higher masking rates, albeit to a smaller extent. Our results contribute
to a better understanding of masked language modeling and shed light on more
efficient language pre-training.

The code and pre-trained models are available at
https://github.com/princeton-nlp/DinkyTrai
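
To make the masking procedure discussed above concrete, the following Python sketch corrupts a configurable fraction of token ids with a [MASK] placeholder and records which positions the model must predict. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the MASK_TOKEN_ID value, the -100 ignore index, and the omission of BERT's 80/10/10 replacement rule are all assumptions made for clarity.

import random

MASK_TOKEN_ID = 103  # hypothetical [MASK] id; the actual value depends on the tokenizer

def mask_tokens(token_ids, masking_rate=0.40, seed=None):
    # Corrupt a fraction of positions with [MASK] and return
    # (corrupted input, prediction targets). Unselected positions get
    # target -100, a common "ignore this position" convention.
    # Note: the standard BERT recipe also keeps 10% of selected tokens
    # unchanged and replaces 10% with random tokens; that is omitted here.
    rng = random.Random(seed)
    corrupted = list(token_ids)
    targets = [-100] * len(token_ids)
    n_masked = max(1, round(masking_rate * len(token_ids)))
    for i in rng.sample(range(len(token_ids)), n_masked):
        targets[i] = token_ids[i]     # the model must recover the original token
        corrupted[i] = MASK_TOKEN_ID  # higher rate: less context, more predictions
    return corrupted, targets

# Example: a 40% rate on a 10-token sequence corrupts 4 positions,
# i.e. the model sees 6 context tokens but makes 4 predictions.
ids = [7592, 1010, 2129, 2024, 2017, 2651, 1029, 1045, 2572, 2986]
x, y = mask_tokens(ids, masking_rate=0.40, seed=0)

The single masking_rate parameter controls both effects described in the abstract at once: raising it removes context (effect 1) and increases the number of predictions per sequence (effect 2), which is why the paper ablates the two separately.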