7 research outputs found

    MedRoBERTa.nl: A Language Model for Dutch Electronic Health Records

    This paper presents MedRoBERTa.nl, the first Transformer-based language model for Dutch medical language. We show that, using 13 GB of text data from Dutch hospital notes, pre-training from scratch yields a better domain-specific language model than further pre-training RobBERT. When extending pre-training on RobBERT, we use a domain-specific vocabulary and re-train the embedding look-up layer. MedRoBERTa.nl, the model trained from scratch, outperforms general language models for Dutch on a medical odd-one-out similarity task, reaching higher performance on this task after only 10k pre-training steps. When fine-tuned, MedRoBERTa.nl outperforms general language models for Dutch at classifying sentences from Dutch hospital notes that contain information about patients' mobility levels.
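    The odd-one-out evaluation mentioned in the abstract can be scored along these lines: each term is rated by its mean cosine similarity to the other terms in a set, and the least similar term is picked. This is a minimal sketch with toy vectors standing in for real model embeddings; the term set and the vectors are illustrative, not from the paper.

    ```python
    import numpy as np

    def odd_one_out(embeddings):
        """Return the term whose embedding is least similar to the others.

        Each term is scored by its mean cosine similarity to the remaining
        terms; the lowest-scoring term is the odd one out.
        """
        terms = list(embeddings)
        # Normalise so dot products equal cosine similarities.
        unit = {t: v / np.linalg.norm(v) for t, v in embeddings.items()}
        scores = {}
        for t in terms:
            others = [unit[o] for o in terms if o != t]
            scores[t] = float(np.mean([unit[t] @ o for o in others]))
        return min(scores, key=scores.get)

    # Toy vectors standing in for model embeddings (illustrative only):
    # three anatomical terms that cluster together and one outlier.
    vecs = {
        "femur":   np.array([1.0, 0.1, 0.0]),
        "tibia":   np.array([0.9, 0.2, 0.1]),
        "patella": np.array([1.0, 0.0, 0.2]),
        "invoice": np.array([0.0, 1.0, 0.9]),
    }
    print(odd_one_out(vecs))  # → invoice
    ```

    A domain-specific model would be expected to give medical terms tighter mutual similarity than a general-purpose model, which is what such a task measures.
    
    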

    Sunken Ships Shan’t Sail: Ontology Design for Reconstructing Events in the Dutch East India Company Archives

    This short paper describes ongoing work on the design of an event ontology that supports state-of-the-art event extraction in the archives of the Dutch East India Company (VOC). The ontology models Dynamic Events (actions or processes) and Static Events (states). By modelling the transition from a given state to a new state as a logical implication that can be inferred automatically from the occurrence of a Dynamic Event, the ontology supports implied information extraction. It also considers implied sub-event detection and models event arguments as coreferential between event classes where possible. In doing so, it enables the extraction of far more information than is explicitly stated in the archival texts, with minimal annotation effort. We define this complete event extraction task, which combines Natural Language Processing techniques with reasoning components, as Event Reconstruction. The Event Reconstruction module will be embedded in a search interface that facilitates historical research in the VOC archives.

    Automated recognition of functioning, activity and participation in COVID-19 from electronic patient records by natural language processing: a proof-of-concept

    PURPOSE: To address the feasibility, reliability and internal validity of natural language processing (NLP) for automated functional assessment of hospitalised COVID-19 patients in key International Classification of Functioning, Disability and Health (ICF) categories and levels, from unstructured text in electronic health records (EHR) of a large teaching hospital. MATERIALS AND METHODS: Eight human annotators assigned four ICF categories to relevant sentences: Emotional functions, Exercise tolerance, Walking and Moving, and Work and Employment, along with their ICF levels (Functional Ambulation Categories for Walking and Moving, metabolic equivalents for Exercise tolerance). A linguistic neural network-based model was trained on 80% of the annotated sentences; inter-annotator agreement (IAA, Cohen’s kappa), a weighted score of precision and recall (F1) and the RMSE for level detection were assessed on the remaining 20%. RESULTS: In total, 4112 sentences of non-COVID-19 patients and 1061 of COVID-19 patients were annotated. Average IAA was 0.81; F1 scores were 0.7 for Walking and Moving and for Emotional functions; RMSE for Walking and Moving (5-level scale) was 1.17 for COVID-19 patients. CONCLUSION: Using a limited amount of annotated EHR sentences, a proof-of-concept was obtained for automated functional assessment of COVID-19 patients in ICF categories and levels. This allows for instantaneous assessment of the functional consequences of new diseases like COVID-19 for large numbers of patients. KEY MESSAGES: 1. Hospitalised COVID-19 survivors may persistently suffer from low physical and mental functioning and a reduction in overall quality of life, requiring appropriate and personalised rehabilitation strategies. 2. For this, assessment of functioning within multiple domains and categories of the International Classification of Functioning is required, which is cumbersome using structured data. 3. We show a proof-of-concept using Natural Language Processing techniques to automatically derive the aforementioned information from free-text notes within the Electronic Health Record of a large academic teaching hospital.
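    The evaluation metrics named in this abstract (Cohen's kappa for inter-annotator agreement, RMSE for ordinal level detection) can be computed as below. This is a self-contained sketch with made-up toy label sequences; it is not the paper's data or pipeline.

    ```python
    import numpy as np

    def cohens_kappa(a, b):
        """Inter-annotator agreement between two label sequences."""
        a, b = np.asarray(a), np.asarray(b)
        labels = np.unique(np.concatenate([a, b]))
        po = np.mean(a == b)                                          # observed agreement
        pe = sum(np.mean(a == l) * np.mean(b == l) for l in labels)   # chance agreement
        return (po - pe) / (1 - pe)

    def rmse(pred, gold):
        """Root-mean-square error for ordinal level predictions."""
        pred, gold = np.asarray(pred, float), np.asarray(gold, float)
        return float(np.sqrt(np.mean((pred - gold) ** 2)))

    # Toy annotations for four sentences (illustrative only).
    ann1 = ["Walking", "Emotional", "Walking", "Exercise"]
    ann2 = ["Walking", "Emotional", "Exercise", "Exercise"]
    print(round(cohens_kappa(ann1, ann2), 2))   # → 0.64
    print(rmse([3, 2, 5, 1], [3, 3, 4, 1]))     # ≈ 0.707
    ```

    Kappa discounts the agreement two annotators would reach by chance, which is why it is preferred over raw percent agreement for tasks with skewed label distributions.
    
    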

    Efficiently and Thoroughly Anonymizing a Transformer Language Model for Dutch Electronic Health Records: a Two-Step Method

    Neural Network (NN) architectures are used more and more to model large amounts of data, such as text data available online. Transformer-based NN architectures have been shown to be very useful for language modelling. Although many researchers study how such Language Models (LMs) work, little attention has been paid to the privacy risks of training LMs on large amounts of data and publishing them online. This paper presents a new method for anonymizing a language model, demonstrated on MedRoBERTa.nl, a Dutch language model for hospital notes. The two-step method involves i) automatic anonymization of the training data and ii) semi-automatic anonymization of the LM's vocabulary. Using the fill-mask task, in which the model predicts the tokens most probable to appear in a given context, we tested how often the model predicts a name in a context where a name should appear. The model predicts a name-like token 0.2% of the time, and any name-like token it predicts is never the name originally present in the training data. By explaining how an LM trained on highly private real-world medical data can be safely published with open access, we hope that more language resources will be published openly and responsibly so the community can profit from them.
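    The fill-mask probe described here can be scored roughly as follows: for each context where a name was masked, take the model's top prediction and check it against a name lexicon. The helper name, the toy prediction lists and the name lexicon below are illustrative; a real probe would query the anonymized LM (e.g. via a fill-mask pipeline) instead of hard-coded lists.

    ```python
    def name_leak_rate(mask_predictions, known_names):
        """Fraction of masked name slots where the model's top
        prediction is a name from the given lexicon.

        mask_predictions: per masked slot, the model's predicted
        tokens ordered most probable first.
        known_names: lexicon of person names to screen against.
        """
        names = {n.lower() for n in known_names}
        hits = sum(preds[0].lower() in names for preds in mask_predictions)
        return hits / len(mask_predictions)

    # Toy top-3 predictions for contexts like "Patient <mask> was discharged."
    predictions = [
        ["patient", "she", "he"],
        ["jan", "he", "patient"],   # a name-like leak
        ["he", "she", "mr"],
        ["she", "patient", "he"],
    ]
    print(name_leak_rate(predictions, ["Jan", "Piet", "Marie"]))  # → 0.25
    ```

    A well-anonymized model should drive this rate toward zero and, as the abstract notes, should in any case never reproduce the specific name that appeared in the training data.
    
    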

    Modeling Dutch Medical Texts for Detecting Functional Categories and Levels of COVID-19 Patients

    Electronic Health Records contain a lot of information in natural language that is not expressed in the structured clinical data. Especially in the case of new diseases such as COVID-19, this information is crucial for a better understanding of patient recovery patterns and the factors that may play a role in them. However, the language in these records differs considerably from standard language, and generic natural language processing tools cannot easily be applied out of the box. In this paper, we present a fine-tuned Dutch language model specifically developed for the language in these health records that can determine the functional level of patients according to a standard coding framework from the World Health Organization. We provide evidence that our classification performs at a sufficient level (F1-score above 80% for the main categories and error rates of less than one level on a 5-point Likert scale for levels) to generate patient recovery patterns that can be used to analyse factors contributing to the rehabilitation of COVID-19 patients and to predict individual patient recovery of functioning.