Pre-training language models (LMs) in a self-supervised setting on large
corpora and then fine-tuning them for a downstream task has helped to address
the problem of limited labeled data for supervised learning tasks such as Named
Entity Recognition (NER). Recent research in biomedical language processing has
offered a number of biomedical LMs pre-trained using different methods and
techniques that advance results on many BioNLP tasks, including NER. However,
there is still no comprehensive comparison of which pre-training approaches
work best in the biomedical domain. This paper aims to
investigate different pre-training methods, such as pre-training a biomedical
LM from scratch and continuing pre-training from an existing LM. We compare existing
methods with our proposed pre-training method of initializing weights for new
tokens by distilling existing weights from the BERT model in the context
where the tokens occur. This method speeds up the pre-training
stage and improves performance on NER. In addition, we compare how the masking
rate, corruption strategy, and masking strategy affect the performance of the
biomedical LM. Finally, using the insights from our experiments, we introduce a
new biomedical LM (BIOptimus), which is pre-trained using Curriculum Learning
(CL) and a contextualized weight distillation method. Our model sets a new state
of the art on several biomedical NER tasks. We
release our code and all pre-trained model