MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain
This paper presents medBERT.de, a pre-trained German BERT model specifically
designed for the German medical domain. The model has been trained on a large
corpus of 4.7 million German medical documents and has been shown to achieve
new state-of-the-art performance on eight different medical benchmarks covering
a wide range of disciplines and medical document types. In addition to
evaluating the overall performance of the model, this paper also conducts a
more in-depth analysis of its capabilities. We investigate the impact of data
deduplication on the model's performance, as well as the potential benefits of
using more efficient tokenization methods. Our results indicate that
domain-specific models such as medBERT.de are particularly useful for longer
texts, and that deduplication of training data does not necessarily lead to
improved performance. Furthermore, we found that efficient tokenization plays
only a minor role in improving model performance, and attribute most of the
improved performance to the large amount of training data. To encourage further
research, the pre-trained model weights and new benchmarks based on
radiological data are made publicly available for use by the scientific
community.

Comment: Keno K. Bressem, Jens-Michalis Papaioannou, and Paul Grundmann
contributed equally.
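
Since the abstract states that the pre-trained weights are released publicly, a minimal usage sketch follows. It assumes the checkpoint is published on the Hugging Face Hub under a hypothetical identifier ("GerMedBERT/medbert-512"), which is not stated in the abstract; the actual name should be taken from the paper's release page.

# Hypothetical usage sketch: the model ID below is an assumption and
# should be replaced with the identifier from the paper's release page.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "GerMedBERT/medbert-512"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Encode a German clinical sentence and run a forward pass.
inputs = tokenizer("Der Patient klagt über starke Kopfschmerzen.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)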