Adoption of well-structured electronic health records and of digital methods
for storing medical patient data in structured formats often still lags behind
the use of traditional, unstructured text-based patient documentation. Data
mining for medical data analysis therefore often has to rely solely on
processing unstructured data to retrieve relevant information. In natural
language processing (NLP), statistical models have proven successful in
various tasks such as part-of-speech tagging, relation extraction (RE) and
named entity recognition (NER). In this work, we present GERNERMED, the first
open, neural NLP model for NER tasks dedicated to detecting medical entity
types in German text data. We resolve the conflict between protecting
sensitive patient data from training data extraction and publishing the
statistical model weights by training our model on a custom dataset that was
translated from publicly available datasets in a foreign language by a
pretrained neural machine translation model. The sample code and the
statistical model are available at:
https://github.com/frankkramer-lab/GERNERME

Frei, Johann

Kramer, Frank

English

arXiv

OPUS Augsburg

Software Impacts 11 (2022) 100212
journal homepage: www.journals.elsevier.com/software-impacts

Original software publication

GERNERMED: An open German medical NER model

Johann Frei ∗, Frank Kramer
Faculty of Applied Computer Science, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany

ARTICLE INFO

Keywords: Named entity recognition, Natural language processing, Clinical text mining, Machine learning

ABSTRACT

Recent advancements in natural language processing (NLP) have been achieved by the use of increasingly complex neural networks. In clinical context, NLP is a key technique to access highly relevant information from unstructured texts such as clinical notes. We evaluate the feasibility of training our neural model GERNERMED on annotated German training data generated by automated translation from a public English dataset. The work guides other researchers about the use of machine-translation methods for dataset acquisition. Due to the public origin of the dataset, our trained software can be used by fellow researchers without any legal access restrictions.

Code metadata

Current code version: v1.0
Permanent link to code/repository used for this code version: https://github.com/SoftwareImpacts/SIMPAC-2021-181
Permanent link to reproducible capsule: https://codeocean.com/capsule/0396930/tree/v1
Legal code license: MIT License
Code versioning system used: none
Software code languages, tools and services used: Python, C++, pytorch/fairseq, clab/fast_align, explosion/SpaCy
Compilation requirements, operating environments and dependencies: Python 3, SpaCy library
If available, link to developer documentation/manual: Readme page: https://github.com/frankkramer-lab/GERNERMED/blob/main/README.md
Support email for questions: johann.frei@informatik.uni-augsburg.de

1. Introduction

Recent advancements in natural language processing (NLP) have been achieved by the extensive use of increasingly complex neural networks.
For example, large general-purpose language models based on BERT [1]- or GPT [2,3]-inspired architectures are commonly trained on large corpora such as Common Crawl [4] or The Pile [5], which comprise 320 TiB (Common Crawl) or 825 GiB (The Pile) of raw text data. Since data at this scale is infeasible to annotate, these datasets are mainly intended for unsupervised methods such as pretraining [6]. However, for case-specific downstream tasks, well-suited datasets are used for fine-tuning in a supervised fashion [6]. In this context, the dataset must be annotated for the specific task. The dataset plays a key role, since the quality of such NLP models correlates strongly with the quantity and quality of the training dataset that governs the model's learned parameters.

The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
∗ Corresponding author. E-mail addresses: johann.frei@informatik.uni-augsburg.de (J. Frei), frank.kramer@informatik.uni-augsburg.de (F. Kramer).

While public datasets have been used for training NLP models for specific tasks, the availability of these datasets falls short when it comes to non-English text data. For instance, for clinical NLP applications, several public English datasets are accessible to the research community [7,8]. For clinical NLP in German, however, only limited data is available [9] to open research, due to GDPR and other privacy protection concerns as well as the frequent lack of gold standard annotations.

Processing of unstructured German clinical data remains an ongoing area of research.
Common tasks in NLP, such as named entity recognition (NER), are used to identify key elements in texts, such as medication information and related details like dosage and duration [6].

In this work we present the GERNERMED software component, which was trained on a custom dataset of clinical notes for German texts and can be easily deployed as an independent part or used as part of a larger NLP pipeline. As foundational work, the underlying dataset was automatically synthesized from publicly available English-based clinical data. The work also aims to guide other researchers on the use of machine-translation methods for similar silver-labeled dataset acquisition without manual quality control.

https://doi.org/10.1016/j.simpa.2021.100212
Received 10 December 2021; Received in revised form 22 December 2021; Accepted 22 December 2021
2665-9638/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Fig. 1. Pipeline illustration: workflow for data synthesis, NER training and inference.

2. Materials and methods

The n2c2 2018 [8] challenge dataset often provides the basis for English NLP work. The dataset consists of 303 annotated training documents and 202 gold standard-annotated test documents. We extracted and parsed the text and annotation labels from the dataset in order to translate the text from English into German using a pretrained neural machine translation model from Fairseq [10].

It cannot be assumed that the translated text preserves the structure of the original text, because English and German differ inherently in syntax.
For instance, the fourth word of an English sentence is not guaranteed to correspond to the fourth word of its German translation.

In order to establish a word-to-word correspondence, we build upon the FastAlign [11] software, which estimates word-to-word alignments with an expectation-maximization algorithm given pairs of input and output sentences. Because the statistical model for sentence alignment is simplified, we expect the alignment estimates to exhibit flaws for outlier samples that do not follow ordinary sentence structures in the original dataset. To filter these misalignment artifacts, we encode the assumption that a successful alignment approximately follows the word order of an English and German sentence pair: we discard a sample from the dataset if the average distance from the entries of its alignment matrix to the diagonal exceeds a certain threshold value. Given the alignment for each sentence pair, the annotation information for the English sentence can be propagated to the corresponding tokens in the German sentence.

Fig. 2. NER tagging: successful processing of a German demo sentence.

Using our synthesized dataset, we can train a custom named entity recognizer component for the clinical application use case. For the implementation of the neural component and sentence parsing, we use the SpaCy [12] software for training and inference. The workflow is illustrated in Fig. 1.

3. Results

Here, we present a named entity recognizer component that enables fellow researchers to directly integrate an annotation component into their research software systems. It was trained with the default SpaCy parameters for named entity recognition components. Our obtained dataset consists of 8599 sentences with a total of 172695 tokens.
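The filtering and label-propagation steps from Section 2 that produce this dataset can be sketched as follows. This is a minimal illustration, not the authors' implementation: the normalization used for the diagonal distance, the threshold value, the helper names and the example sentence pair are all assumptions made here. Only the input format, FastAlign's `i-j` pairs of source and target token indices, comes from the tool itself.

```python
from statistics import mean

# Hypothetical cutoff; the paper only states that "a certain threshold" is used.
MAX_MEAN_DIAGONAL_DISTANCE = 0.25


def parse_alignment(line):
    """Parse FastAlign output such as '0-0 1-2' into (source, target) index pairs."""
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]


def mean_diagonal_distance(pairs, src_len, tgt_len):
    """Average offset of the alignment points from the diagonal.

    Assumption: token positions are rescaled to [0, 1] so that sentences of
    different lengths are comparable; the paper does not specify this.
    """
    if not pairs:
        return float("inf")  # nothing aligned, treat as a failed alignment
    src_scale = max(src_len - 1, 1)
    tgt_scale = max(tgt_len - 1, 1)
    return mean(abs(i / src_scale - j / tgt_scale) for i, j in pairs)


def project_labels(src_labels, pairs, tgt_len, outside="O"):
    """Copy the tag of each labeled source token to the target tokens aligned to it."""
    tgt_labels = [outside] * tgt_len
    for i, j in pairs:
        if src_labels[i] != outside:
            tgt_labels[j] = src_labels[i]
    return tgt_labels


# Invented example pair, not taken from the n2c2 data.
english = ["The", "patient", "takes", "100", "mg", "aspirin", "daily"]
german = ["Der", "Patient", "nimmt", "täglich", "100", "mg", "Aspirin"]
english_labels = ["O", "O", "O", "Strength", "Strength", "Drug", "Frequency"]

pairs = parse_alignment("0-0 1-1 2-2 3-4 4-5 5-6 6-3")
distance = mean_diagonal_distance(pairs, len(english), len(german))
keep_sample = distance <= MAX_MEAN_DIAGONAL_DISTANCE
german_labels = project_labels(english_labels, pairs, len(german))

# A fully reversed alignment lies far from the diagonal and is filtered out.
reversed_pairs = parse_alignment("0-6 1-5 2-4 3-3 4-2 5-1 6-0")
reversed_distance = mean_diagonal_distance(reversed_pairs, 7, 7)

print(keep_sample, german_labels)
print(reversed_distance > MAX_MEAN_DIAGONAL_DISTANCE)
```

In this sketch the German adverb moves forward in the sentence, so the Frequency tag follows it to its new position while the sample stays below the assumed threshold. A stricter or looser cutoff trades dataset size against label noise, which is why the value is best tuned on a manually inspected subset.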
The dataset was conventionally split into training (80%), validation (10%) and test (10%) sets in order to measure the learning behavior as well as the final model performance.

The trained NER component is capable of detecting the medical entity tags Drug, Strength, Route, Form, Dosage, Frequency and Duration with an average F1-score of 81.54%. An example of the text annotation result is provided in Fig. 2.

Since our NER component is based on the component code of the SpaCy NLP pipeline, it can be easily installed with a single command and included in related clinical text processing research pipelines in two lines of code.

4. Impact overview

Extracting relevant information such as drugs and medications from unstructured text data is a highly relevant use case because it enables other researchers with access to hospital-internal clinical notes to process large amounts of German text data in order to study and track health-related information for further research. In general, unstructured text processing concerns not only current data collection but also the processing of historic and legacy text data. Thus, it is relevant for retrospective study designs and secondary use of health data.

GERNERMED can benefit the mining of patient records for the DIFUTURE ProVal-MS study [13] on Multiple Sclerosis in order to extract medication and drug-related information from German clinical notes at the local university hospital. Understanding the drug-disease interactions in multiple sclerosis can contribute to advancements in treatment decision and outcome. The DIFUTURE research project ("use case") on Parkinson's Disease [13] faces similar challenges; here, too, detection and extraction of medication data through our NER model can improve the quality of existing study data for statistical analysis.

Similarly, our model can be used by fellow researchers for other NLP pipelines in clinical research.
The main impact in this research field is on automated annotation of non-English clinical documents.

Because the NER model was trained on data derived from publicly available sources instead of highly sensitive internal data from hospitals, we bypass the legal regulations and restrictions on privacy-related health data and are allowed to provide the trained NER model to the public. Due to the open nature of our component, the software can be used in a broad variety of situations, including commercial applications within the domain of German clinical NLP, and for potential statistical model analysis, since the model weights are publicly accessible.

Due to the novelty of our software component, we aim to receive feedback from upcoming internal as well as external projects and users in order to provide an updated iteration of the component as part of future work.

5. Discussion

Because the dataset was automatically generated through translation and alignment, errors introduced by the translation and by the alignment estimation are expected to degrade the quality of the dataset in comparison to manually curated datasets. However, the NER performance scores indicate the capabilities and limits of such automated data synthesis and can therefore also be relevant for researchers from other domains.

We regard the in-depth analysis of the dataset and the software component as future work. The software can be considered a baseline for competing open NLP components that will potentially be published in upcoming research work.

6. Conclusion

We presented the GERNERMED software component, an open named entity recognition system for German clinical texts.
As a prerequisite for training such a component, we described means to quickly and effectively obtain a language-specific dataset from datasets in foreign languages for clinical domains.

Using public datasets allows us to provide the trained components for public use and make them easily accessible to interested users without imposing access restrictions. Furthermore, we supply example code and the performance evaluation script for our software in order to increase reproducibility in this research area.

Our results also provide other researchers with general information on the effectiveness of building NLP components through machine translation-based dataset generation as an alternative to time- and cost-intensive manual dataset acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, 2018, CoRR, abs/1810.04805.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, vol. 33, Curran Associates, Inc, 2020, pp. 1877–1901.
[3] Ben Wang, Aran Komatsuzaki, GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model, 2021, https://github.com/kingoflolz/mesh-transformer-jax.
[4] Common crawl blog, http://commoncrawl.org/connect/blog/. (Accessed: 2021-12-10).
[5] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy, The Pile: An 800 GB dataset of diverse text for language modeling, 2020, arXiv preprint arXiv:2101.00027.
[6] Bethany Percha, Modern clinical text mining: A guide and review, Annu. Rev. Biomed. Data Sci. 4 (1) (2021) 165–187, PMID: 34465177.
[7] Tom J. Pollard, Alistair E.W. Johnson, The MIMIC-III clinical database, 2016, http://dx.doi.org/10.13026/C2XW26.
[8] Sam Henry, Kevin Buchan, Michele Filannino, Amber Stubbs, Ozlem Uzuner, 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records, J. Am. Med. Inform. Assoc.: JAMIA 27 (1) (2020) 3–12.
[9] Florian Borchert, Christina Lohr, Luise Modersohn, Thomas Langer, Markus Follmann, Jan-Philipp Sachs, Udo Hahn, Matthieu-P. Schapranow, GGPONC: A corpus of German medical text with rich metadata based on clinical practice guidelines, in: Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 2020, pp. 38–48.
[10] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli, Fairseq: A fast, extensible toolkit for sequence modeling, in: Proceedings of NAACL-HLT 2019, Demonstrations, 2019.
[11] Chris Dyer, Victor Chahuneau, Noah A. Smith, A simple, fast, and effective reparameterization of IBM model 2, in: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, 2013, pp. 644–648.
[12] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd, spaCy: Industrial-strength Natural Language Processing in Python, 2020.
[13] Fabian Prasser, Oliver Kohlbacher, Ulrich Mansmann, Bernhard Bauer, Klaus A. Kuhn, Data integration for future medicine (DIFUTURE), Methods Inf. Med. 57 (S01) (2018) e57–e65.

GERNERMED: an open German medical NER model

https://opus.bibliothek.uni-augsburg.de/opus4/files/98541/1-s2.0-S2665963821000944-main.pdf
