When an NLP model is trained on text data from one time period and tested or
deployed on data from another, the resulting temporal misalignment can degrade
end-task performance. In this work, we establish a suite of eight diverse tasks
across different domains (social media, science papers, news, and reviews) and
periods of time (spanning five years or more) to quantify the effects of
temporal misalignment. Our study is focused on the ubiquitous setting where a
pretrained model is optionally adapted through continued domain-specific
pretraining, followed by task-specific finetuning. We find stronger effects of temporal misalignment on task performance
than have been previously reported. We also find that, while temporal
adaptation through continued pretraining can help, these gains are small
compared to task-specific finetuning on data from the target time period. Our
findings motivate continued research to improve temporal robustness of NLP
models.