Learning for clinical named entity recognition without manual annotations

Author: Ghiasvand Omid
Kate Rohit J.
Publication venue: UWM Digital Commons
Publication date: 30/10/2018

Background: Named entity recognition (NER) systems are commonly built using supervised methods that use machine learning to learn from corpora manually annotated with named entities. However, manually annotating corpora is very expensive and laborious. Materials and methods: In this paper, a novel method is presented for training clinical NER systems that does not require any manual annotations. It requires only a raw text corpus and a resource like UMLS that can give a list of named entities along with their semantic types. Using these two resources, annotations are automatically obtained to train machine learning methods. The method was evaluated on the NER shared-task datasets of i2b2 2010 and SemEval 2014. Results: On the SemEval 2014 dataset for recognizing diseases and disorders, the method obtained an F-measure of 0.693 for exact matching and 0.773 allowing overlaps. This is comparable to many past supervised systems that used manual annotations for training. On the i2b2 2010 dataset for recognizing problems, tests and treatments, the method obtained F-measures of 0.451, 0.338 and 0.204 respectively for exact matching and 0.721, 0.587 and 0.475 respectively allowing overlaps. These results are better than those of an existing unsupervised method. Conclusions: Experiments on standard datasets showed that the new method performed well. The method is general and could be applied to recognize entities of other types in other genres of text without needing manual annotations.
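
The abstract reports F-measures under two matching criteria, exact matching and matching that allows overlaps. The sketch below illustrates the difference on made-up spans. It is an assumption-laden illustration, not the official i2b2 or SemEval scorer: the function name `f_measure`, the `(start, end)` character-offset representation, and the example spans are all hypothetical.

```python
def f_measure(gold, predicted, allow_overlap=False):
    """Return (precision, recall, F1) over (start, end) spans, end exclusive.

    Hypothetical helper: the shared tasks' own scoring scripts may
    count matches differently.
    """
    def matches(a, b):
        if allow_overlap:
            # Two spans overlap if each starts before the other ends.
            return a[0] < b[1] and b[0] < a[1]
        return a == b

    # A predicted span is correct if it matches some gold span (precision);
    # a gold span is found if it matches some predicted span (recall).
    correct_pred = sum(1 for p in predicted if any(matches(p, g) for g in gold))
    found_gold = sum(1 for g in gold if any(matches(g, p) for p in predicted))

    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up spans: one exact hit, one boundary error, one spurious prediction.
gold = [(0, 12), (20, 35), (40, 48)]
predicted = [(0, 12), (22, 35), (60, 70)]

exact = f_measure(gold, predicted)
relaxed = f_measure(gold, predicted, allow_overlap=True)
print("exact F1:", round(exact[2], 3))
print("overlap F1:", round(relaxed[2], 3))
```

The prediction `(22, 35)` has a wrong left boundary, so it counts only under the overlap criterion. Boundary errors of this kind would explain a gap between exact and overlap scores like the one reported above.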

University of Wisconsin-Milwaukee

Unsupervised Biomedical Named Entity Recognition

Author: Ghiasvand Omid
Publication venue: UWM Digital Commons
Publication date: 01/08/2017

Named entity recognition (NER) from text is an important task for several applications, including in the biomedical domain. Supervised machine learning based systems have been the most successful on the NER task; however, they require correct annotations in large quantities for training. Annotating text manually is very labor intensive and also needs domain expertise. The purpose of this research is to reduce human annotation effort and to decrease the cost of annotation for building NER systems in the biomedical domain. The method developed in this work leverages the availability of resources like UMLS (Unified Medical Language System), which contain a list of biomedical entities, together with a large unannotated corpus, to build an unsupervised NER system that does not require any manual annotations. The method has two phases. In the first phase, a biomedical corpus is automatically annotated with some named entities using UMLS through unambiguous exact matching; we call the result weakly-labeled data. In this data, positive examples are the entities in the text that exactly match an entry in UMLS and have only one semantic type, which belongs to the desired entity class to be extracted (for example, diseases and disorders). Negative examples are the entities in the text that exactly match an entry in UMLS but are of semantic types other than those that belong to the desired entity class. These examples are then used to train a machine learning classifier using features that represent the contexts in which they appeared in the text. The trained classifier is applied back to the text to gather more examples iteratively through the process of self-training. The trained classifier is then capable of classifying mentions in unseen text as belonging to the desired entity class or not, based on the contexts in which they appear.
Although the trained named entity detector is good at detecting the presence of entities of the desired class in text, it cannot determine their correct boundaries. In the second phase of our method, called “Boundary Expansion”, the correct boundaries of the entities are determined. This phase is based on a novel idea that utilizes machine learning and UMLS. Training examples for boundary expansion are gathered directly from UMLS and do not require any manual annotations. We also developed a new WordNet based approach for boundary expansion. Our method was evaluated on three datasets: the SemEval 2014 Task 7 dataset, which has diseases and disorders as the desired entity class; the GENIA dataset, which has proteins, DNAs, RNAs, cell types, and cell lines as the desired entity classes; and the i2b2 dataset, which has problems, tests, and treatments as the desired entity classes. Our method performed well and obtained performance close to supervised methods on the SemEval dataset. On the other datasets, it outperformed an existing unsupervised method on most entity classes. The only requirements for our method to work well are a list of entity names with their semantic types and a large unannotated corpus. Given these, our method generalizes across different types of entities and different types of biomedical text. Being unsupervised, the method can be easily applied to new NER tasks without needing costly annotations.
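
The first phase described above (weak labeling through unambiguous exact matching) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `TOY_LEXICON` is a hypothetical hand-written stand-in for a UMLS lookup, its term-to-type assignments are invented for the example, and `weak_label` with its `window` context feature is an assumed design. The classifier training, self-training loop, and boundary expansion phase are not shown.

```python
import re

# Hypothetical stand-in for a UMLS lookup: term -> set of semantic types.
# The entries are illustrative only and were not taken from UMLS itself.
TOY_LEXICON = {
    "pneumonia": {"Disease or Syndrome"},
    "aspirin": {"Pharmacologic Substance"},
    "cold": {"Disease or Syndrome", "Natural Phenomenon or Process"},
    "chest x-ray": {"Diagnostic Procedure"},
}

# Semantic types that make up the desired entity class (assumed here).
TARGET_TYPES = {"Disease or Syndrome"}


def weak_label(text, lexicon=TOY_LEXICON, target_types=TARGET_TYPES, window=3):
    """Collect weakly-labeled examples from exact lexicon matches in `text`."""
    examples = []
    lowered = text.lower()
    for term, types in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, lowered):
            if len(types) == 1 and types <= target_types:
                label = 1  # positive: single semantic type, in the target class
            elif not (types & target_types):
                label = 0  # negative: matches the lexicon, no target type
            else:
                continue  # ambiguous term: skipped by this sketch
            examples.append({
                "term": term,
                "span": (m.start(), m.end()),
                "label": label,
                # Context words around the mention, usable as classifier features.
                "left": lowered[:m.start()].split()[-window:],
                "right": lowered[m.end():].split()[:window],
            })
    return sorted(examples, key=lambda e: e["span"])


note = "Patient with pneumonia was given aspirin after a chest x-ray; reports a cold."
examples = weak_label(note)
for e in examples:
    print(e["label"], e["term"], e["left"], e["right"])
```

In this toy run, "cold" is dropped because it carries two semantic types in the toy lexicon, which mirrors the "unambiguous" requirement for positive examples. The context windows are what a classifier would learn from, so that it can later judge mentions that never appear in the lexicon.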

University of Wisconsin-Milwaukee

Unsupervised extraction, labelling and clustering of segments from clinical notes

Author: Halámková Jana
Nováček Vít
Zelina Petr
Publication date: 21/11/2022

This work is motivated by the scarcity of tools for accurate, unsupervised information extraction from unstructured clinical notes in computationally underrepresented languages, such as Czech. We introduce a stepping stone to a broad array of downstream tasks such as summarisation or integration of individual patient records, extraction of structured information for national cancer registry reporting or building of semi-structured semantic patient representations for computing patient embeddings. More specifically, we present a method for unsupervised extraction of semantically-labelled textual segments from clinical notes and test it on a dataset of Czech breast cancer patients, provided by Masaryk Memorial Cancer Institute (the largest Czech hospital specialising in oncology). Our goal was to extract, classify (i.e. label) and cluster segments of the free-text notes that correspond to specific clinical features (e.g., family background, comorbidities or toxicities). The presented results demonstrate the practical relevance of the proposed approach for building more sophisticated extraction and analytical pipelines deployed on Czech clinical notes. Comment: To be published at the IEEE BIBM 2022 conference

arXiv.org e-Print Archive

Advancing Italian Biomedical Information Extraction with Large Language Models: Methodological Insights and Multicenter Practical Application

Author: Bellazzi Riccardo
Binetti Giuliano
Buonocore Tommaso Mario
Capelli Marco
Costa Alfredo
Crema Claudio
Fostinelli Silvia
Fundarò Cira
Manera Marina
Parimbelli Enea
Ramusino Matteo Cotta
Redolfi Alberto
Verde Federico
Publication date: 08/06/2023

The introduction of computerized medical records in hospitals has reduced burdensome operations like manual writing and information fetching. However, the data contained in medical records are still heavily underutilized, primarily because extracting them from unstructured textual medical records takes time and effort. Information Extraction, a subfield of Natural Language Processing, can help clinical practitioners overcome this limitation, using automated text-mining pipelines. In this work, we created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Large Language Model for this task. Moreover, we conducted several experiments with three external independent datasets to implement an effective multicenter model, with an overall F1-score of 84.77%, precision of 83.16%, and recall of 86.44%. The lessons learned are: (i) the crucial role of a consistent annotation process and (ii) a fine-tuning strategy that combines classical methods with a "few-shot" approach. This allowed us to establish methodological guidelines that pave the way for future implementations in this field and allow Italian hospitals to tap into important research opportunities.
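
As a quick consistency check on the figures quoted above, the F1-score is conventionally the harmonic mean of precision and recall. The snippet below applies that standard formula to the reported percentages. It assumes the overall figures are directly comparable (for example, that all three were averaged the same way), which the abstract does not state; the function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Percentages as reported in the abstract.
f1 = f1_score(83.16, 86.44)
print(round(f1, 2))
```

Under that assumption, the harmonic mean of the reported precision and recall agrees with the reported F1-score to two decimal places.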

arXiv.org e-Print Archive

Safeguarding Privacy Through Deep Learning Techniques

Author: Catelli Rosario
Publication date: 13/04/2021

Over the last few years, there has been a growing need to meet minimum security and privacy requirements. Both public and private companies have had to comply with increasingly stringent standards, such as the ISO 27000 family of standards, or the various laws governing the management of personal data. The huge amount of data to be managed has required a huge effort from employees who, in the absence of automatic techniques, have had to work tirelessly to achieve the certification objectives. Unfortunately, because the documentation relating to these problems contains sensitive information, it is difficult if not impossible to obtain material for research and study on which to experiment with new ideas and techniques for automating these processes, perhaps exploiting current developments in the scientific community in the fields of ontologies and artificial intelligence for data management. To bypass this problem, it was decided to examine data from the medical world, which, largely for important reasons related to the health of individuals, have gradually become more freely accessible over time. This choice does not affect the generality of the proposed methods, which can be reapplied to the most diverse fields in which there is a need to manage privacy-sensitive information.

Università degli Studi di Napoli Federico II Open Archive

Digital Twins for Patient Care via Knowledge Graphs and Closed-Form Continuous-Time Liquid Neural Networks

Author: Nye Logan
Publication date: 08/07/2023

Digital twin technology is anticipated to transform healthcare, enabling personalized medicines and support, earlier diagnoses, simulated treatment outcomes, and optimized surgical plans. Digital twins are readily gaining traction in industries like manufacturing, supply chain logistics, and civil infrastructure, but not in patient care. The challenge of modeling complex diseases with multimodal patient data and the computational complexity of analyzing it have stifled digital twin adoption in the biomedical vertical. Yet these major obstacles can potentially be handled by approaching these models in a different way. This paper proposes a novel framework for addressing the barriers to clinical twin modeling created by computational costs and modeling complexities. We propose structuring patient health data as a knowledge graph and using closed-form continuous-time liquid neural networks for real-time analytics. By synthesizing multimodal patient data and leveraging the flexibility and efficiency of closed-form continuous-time networks and knowledge graph ontologies, our approach enables real-time insights, personalized medicine, early diagnosis and intervention, and optimal surgical planning. This novel approach provides a comprehensive and adaptable view of patient health along with real-time analytics, paving the way for digital twin simulations and other anticipated benefits in healthcare. Comment: 6 pages

arXiv.org e-Print Archive