4 research outputs found
Cross-Lingual Knowledge Transfer for Clinical Phenotyping
Clinical phenotyping enables the automatic extraction of clinical conditions
from patient records, which can be beneficial to doctors and clinics worldwide.
However, current state-of-the-art models are mostly applicable to clinical
notes written in English. We therefore investigate cross-lingual knowledge
transfer strategies to execute this task for clinics that do not use the
English language and have a small amount of in-domain data available. We
evaluate these strategies for a Greek and a Spanish clinic leveraging clinical
notes from different clinical domains such as cardiology, oncology and the ICU.
Our results reveal two strategies that outperform the state-of-the-art:
Translation-based methods in combination with domain-specific encoders and
cross-lingual encoders plus adapters. We find that these strategies perform
especially well for classifying rare phenotypes and we advise on which method
to prefer in which situation. Our results show that using multilingual data
overall improves clinical phenotyping models and can compensate for data
sparseness.Comment: LREC 2022 submmision: January 202
MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain
This paper presents medBERTde, a pre-trained German BERT model specifically
designed for the German medical domain. The model has been trained on a large
corpus of 4.7 Million German medical documents and has been shown to achieve
new state-of-the-art performance on eight different medical benchmarks covering
a wide range of disciplines and medical document types. In addition to
evaluating the overall performance of the model, this paper also conducts a
more in-depth analysis of its capabilities. We investigate the impact of data
deduplication on the model's performance, as well as the potential benefits of
using more efficient tokenization methods. Our results indicate that
domain-specific models such as medBERTde are particularly useful for longer
texts, and that deduplication of training data does not necessarily lead to
improved performance. Furthermore, we found that efficient tokenization plays
only a minor role in improving model performance, and attribute most of the
improved performance to the large amount of training data. To encourage further
research, the pre-trained model weights and new benchmarks based on
radiological data are made publicly available for use by the scientific
community.Comment: Keno K. Bressem and Jens-Michalis Papaioannou and Paul Grundmann
contributed equall