Difference-Masking: Choosing What to Mask in Continued Pretraining
The self-supervised objective of masking-and-predicting has led to promising
performance gains on a variety of downstream tasks. However, while most
approaches randomly mask tokens, there is strong intuition that deciding what
to mask can substantially improve learning outcomes. We investigate this in
the continued pretraining setting, in which pretrained models continue to
pretrain on domain-specific data before performing a downstream task. We introduce
Difference-Masking, a masking strategy that automatically chooses what to mask
during continued pretraining by considering what makes a task domain different
from the pretraining domain. Empirically, we find that Difference-Masking
outperforms baselines in continued pretraining settings across four diverse
language-only and multimodal video tasks.
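
The abstract does not specify how the domain difference is computed, so the following is only a minimal sketch of the general idea: score each token by how over-represented it is in the task-domain corpus relative to the pretraining corpus, then mask tokens with probability proportional to that score. The frequency-ratio score and the helper names difference_scores and difference_mask are illustrative assumptions, not the paper's actual formulation.

import random
from collections import Counter

def difference_scores(task_corpus, pretrain_corpus, smoothing=1.0):
    """Score tokens by how over-represented they are in the task domain
    relative to the pretraining domain (higher = more domain-specific).
    Both corpora are lists of token lists. The frequency-ratio score is
    an illustrative assumption, not the paper's exact method."""
    task_counts = Counter(tok for doc in task_corpus for tok in doc)
    pre_counts = Counter(tok for doc in pretrain_corpus for tok in doc)
    task_total = sum(task_counts.values())
    pre_total = sum(pre_counts.values()) or 1
    return {
        tok: (count / task_total)
             / ((pre_counts[tok] + smoothing) / (pre_total + smoothing))
        for tok, count in task_counts.items()
    }

def difference_mask(tokens, scores, mask_rate=0.15, mask_token="[MASK]"):
    """Mask roughly mask_rate of the tokens, sampled without replacement
    with probability proportional to their difference score
    (Efraimidis-Spirakis weighted sampling)."""
    n_mask = max(1, int(len(tokens) * mask_rate))
    keys = [random.random() ** (1.0 / max(scores.get(t, 1e-8), 1e-12))
            for t in tokens]
    picked = sorted(range(len(tokens)), key=keys.__getitem__, reverse=True)[:n_mask]
    masked = list(tokens)
    for i in picked:
        masked[i] = mask_token
    return masked

Under this sketch, domain-typical tokens (e.g., clinical terms when continuing pretraining on medical notes) are masked more often than generic function words, so the masking-and-predicting objective concentrates learning on what distinguishes the target domain from the pretraining domain.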