Cross-Lingual Knowledge Transfer for Clinical Phenotyping

Author: Gers Felix
Giannakoulas George
Grundmann Paul
Kyparissidis Ilias
Löser Alexander
Papaioannou Jens-Michalis
Samaras Athanasios
van Aken Betty
Publication date: 03/08/2022

Clinical phenotyping enables the automatic extraction of clinical conditions from patient records, which can be beneficial to doctors and clinics worldwide. However, current state-of-the-art models are mostly applicable to clinical notes written in English. We therefore investigate cross-lingual knowledge transfer strategies to execute this task for clinics that do not use the English language and have a small amount of in-domain data available. We evaluate these strategies for a Greek and a Spanish clinic leveraging clinical notes from different clinical domains such as cardiology, oncology and the ICU. Our results reveal two strategies that outperform the state-of-the-art: translation-based methods in combination with domain-specific encoders, and cross-lingual encoders plus adapters. We find that these strategies perform especially well for classifying rare phenotypes and we advise on which method to prefer in which situation. Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.

Comment: LREC 2022 submission: January 202

arXiv.org e-Print Archive

MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain

Author: Adams Lisa C.
Aerts Hugo JWL.
Augustin Moritz
Borchert Florian
Bressem Keno K.
Busch Felix
Grosser Lennart
Grundmann Paul
Liu Leonhard
Loyen Jan P.
Löser Alexander
Makowski Marcus R.
Niehues Stefan M.
Papaioannou Jens-Michalis
Xu Lina
Publication date: 24/03/2023

This paper presents medBERTde, a pre-trained German BERT model specifically designed for the German medical domain. The model has been trained on a large corpus of 4.7 million German medical documents and has been shown to achieve new state-of-the-art performance on eight different medical benchmarks covering a wide range of disciplines and medical document types. In addition to evaluating the overall performance of the model, this paper also conducts a more in-depth analysis of its capabilities. We investigate the impact of data deduplication on the model's performance, as well as the potential benefits of using more efficient tokenization methods. Our results indicate that domain-specific models such as medBERTde are particularly useful for longer texts, and that deduplication of training data does not necessarily lead to improved performance. Furthermore, we found that efficient tokenization plays only a minor role in improving model performance, and attribute most of the improved performance to the large amount of training data. To encourage further research, the pre-trained model weights and new benchmarks based on radiological data are made publicly available for use by the scientific community.

Comment: Keno K. Bressem, Jens-Michalis Papaioannou, and Paul Grundmann contributed equally

arXiv.org e-Print Archive