Most works on transformers trained with the Masked Language Modeling (MLM)
objective use the original BERT model's fixed masking rate of 15%. Our work
instead dynamically schedules the masking rate throughout training. We find
that linearly decreasing the masking rate from 30% to 15% over the course of
pretraining improves average GLUE accuracy for BERT-base by 0.46% compared to
the standard fixed rate of 15%. Further analyses demonstrate that the gains from
scheduling come from the model being exposed to both high and low masking rate
regimes.
Our results demonstrate that masking rate scheduling is a simple way to improve
the quality of masked language models while achieving up to a 1.89x speedup in
pretraining.
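To make the schedule concrete, the sketch below implements a linear decay of the masking rate from 30% to 15% and applies it to a toy token sequence. It is a minimal illustration, not the paper's implementation: the function names, the placeholder mask id, and the simplified corruption (every selected token is replaced by the mask id, omitting BERT's 80/10/10 rule) are assumptions made for brevity.

```python
import random

def masking_rate(step: int, total_steps: int,
                 start_rate: float = 0.30, end_rate: float = 0.15) -> float:
    """Masking rate linearly decayed from start_rate to end_rate over pretraining."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # fraction of training completed
    return start_rate + frac * (end_rate - start_rate)

def mask_tokens(token_ids, rate, mask_id):
    """Corrupt a sequence for MLM: each token becomes mask_id with probability `rate`.

    Returns (corrupted_ids, labels); labels are -100 at unmasked positions so a
    cross-entropy loss with ignore_index=-100 scores only the masked tokens.
    """
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < rate:
            corrupted.append(mask_id)   # replace with the mask token
            labels.append(tok)          # model must predict the original token
        else:
            corrupted.append(tok)
            labels.append(-100)
    return corrupted, labels

# Toy walk through pretraining: the rate shrinks from 30% down to 15%.
total_steps = 100_000
example = list(range(1000, 1032))       # 32 dummy token ids
for step in (0, 25_000, 50_000, 100_000):
    rate = masking_rate(step, total_steps)
    corrupted, labels = mask_tokens(example, rate, mask_id=103)
    n_masked = sum(l != -100 for l in labels)
    print(f"step={step:>7}  rate={rate:.4f}  masked {n_masked}/{len(example)} tokens")
```

In a real pretraining pipeline, the scheduled rate would drive whatever masking utility the data pipeline already uses, for example by updating the mlm_probability of a Hugging Face DataCollatorForLanguageModeling at each step; that integration point is one possible choice, not something prescribed by the abstract.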