Dynamic Masking Rate Schedules for MLM Pretraining

Ankner, Zachary; Blalock, Davis; Frankle, Jonathan; Leavitt, Matthew L.; Saphra, Naomi

Publication date: 24 May 2023

Abstract

Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. Our work instead dynamically schedules the masking rate throughout training. We find that linearly decreasing the masking rate from 30% to 15% over the course of pretraining improves average GLUE accuracy by 0.46% in BERT-base, compared to a standard fixed rate of 15%. Further analyses demonstrate that the gains from scheduling come from exposure to both high and low masking rate regimes. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models and to achieve up to a 1.89x speedup in pretraining.
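
As an illustration of the linear schedule described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the rate is interpolated per training step between the 30% and 15% endpoints stated above, and that each token is masked independently with the current rate. The function names `linear_masking_rate` and `sample_mask`, and all parameter names, are hypothetical.

```python
import random


def linear_masking_rate(step, total_steps, initial_rate=0.30, final_rate=0.15):
    """Linearly interpolate the masking rate from initial_rate to final_rate.

    Assumption: the schedule is indexed by optimizer step; the abstract only
    says the decrease happens "over the course of pretraining".
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_rate + (final_rate - initial_rate) * progress


def sample_mask(num_tokens, rate, rng):
    """Return a list of booleans, True where a token would be masked.

    Assumption: each position is masked independently with probability `rate`.
    """
    return [rng.random() < rate for _ in range(num_tokens)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for reproducibility of the demo only
    total_steps = 1000
    for step in (0, 500, 1000):
        rate = linear_masking_rate(step, total_steps)
        mask = sample_mask(128, rate, rng)
        print(f"step={step} rate={rate:.3f} masked={sum(mask)}/128")
```

Under these assumptions the rate starts at 0.30, passes through roughly 0.225 at the midpoint, and ends at 0.15. Setting `initial_rate` equal to `final_rate` recovers the fixed-rate baseline.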

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2305.15096

Last time updated on 26/05/2023