Recent advances in natural language processing (NLP) can be largely
attributed to the advent of pre-trained language models such as BERT and
RoBERTa. While these models demonstrate remarkable performance on general
datasets, they can struggle in specialized domains such as medicine, where
domain-specific terminology, abbreviations, and varying document structures
are common. This paper explores strategies for
adapting these models to domain-specific requirements, primarily through
continuous pre-training on domain-specific data. We pre-trained several German
medical language models on 2.4B tokens derived from translated public English
medical data and 3B tokens of German clinical data. The resulting models were
evaluated on various German downstream tasks, including named entity
recognition (NER), multi-label classification, and extractive question
answering. Our results suggest that models augmented with clinical and
translation-based pre-training typically outperform general-domain models in
medical contexts. We conclude that continuous pre-training can match or even
exceed the performance of clinical models trained from
scratch. Furthermore, both pre-training on clinical data and leveraging
translated texts have proven to be reliable methods for domain adaptation in
medical NLP
tasks.

Comment: Accepted at LREC-COLING 2024
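For readers unfamiliar with the setup, the sketch below illustrates what continuous (domain-adaptive) pre-training of this kind typically looks like with Hugging Face Transformers: masked-language-model training is simply resumed from a general-domain German checkpoint on in-domain text. The checkpoint name, corpus file, and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of continuous (domain-adaptive) masked-language-model
# pre-training. Checkpoint, corpus path, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "deepset/gbert-base"  # assumed general-domain German BERT
corpus = load_dataset("text", data_files={"train": "german_medical_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

def tokenize(batch):
    # Tokenize raw text, truncating to the model's maximum sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic token masking (15%), as in standard BERT-style MLM pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="gbert-medical-continued",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

The resulting checkpoint can then be fine-tuned on downstream tasks such as NER, multi-label classification, or extractive question answering in the usual way.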