FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection
Natural language processing (NLP) applications such as named entity
recognition (NER) for low-resource corpora benefit little from recent advances
in large language models (LLMs); larger annotated datasets are still needed.
This research article introduces a methodology for generating translated
versions of annotated datasets through crosslingual annotation projection.
Leveraging a language-agnostic BERT-based approach, it is an efficient way to
grow low-resource corpora with little human effort, using only already
available open data resources. Quantitative
and qualitative evaluations are often lacking when it comes to evaluating the
quality and effectiveness of semi-automatic data generation strategies. The
evaluation of our crosslingual annotation projection approach showed both
effectiveness and high accuracy in the resulting dataset. As a practical
application of this methodology, we present the creation of French Annotated
Resource with Semantic Information for Medical Entities Detection (FRASIMED),
an annotated corpus comprising 2,051 synthetic clinical cases in French. The
corpus is now available for researchers and practitioners to develop and refine
French NLP applications in the clinical field
(https://zenodo.org/record/8355629), making it the largest open annotated
corpus with linked medical concepts in French.
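The projection step described above can be sketched as follows: each target-language token is aligned to its most similar source-language token in a shared embedding space and inherits that token's label. This is a minimal illustration of the idea, not the paper's implementation; the toy vectors stand in for language-agnostic BERT embeddings, and all names and labels are illustrative.

```python
def project_annotations(src_tokens, src_labels, tgt_tokens, embed):
    """Project BIO labels from source tokens onto target tokens by
    aligning each target token to its most similar source token.
    `embed` maps a token to a vector; in practice these vectors
    would come from a language-agnostic BERT."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    src_vecs = [embed(t) for t in src_tokens]
    projected = []
    for t in tgt_tokens:
        v = embed(t)
        best = max(range(len(src_tokens)),
                   key=lambda i: cosine(v, src_vecs[i]))
        projected.append(src_labels[best])
    return projected

# Toy shared embedding space (stand-in for multilingual BERT vectors)
vectors = {
    "fever":  [1.0, 0.0], "fièvre": [0.9, 0.1],
    "severe": [0.0, 1.0], "sévère": [0.1, 0.9],
}
labels = project_annotations(
    ["severe", "fever"], ["O", "B-SYMPTOM"],
    ["fièvre", "sévère"], lambda t: vectors[t])
# labels == ["B-SYMPTOM", "O"]: each French token inherits the label
# of its nearest English neighbour, regardless of word order.
```

A real pipeline would also have to handle subword tokenization and one-to-many alignments, which this sketch deliberately omits.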
Information extraction from Spanish radiology reports
In recent years, the amount of digitized clinical data has grown steadily due to the adoption of clinical information systems. A great deal of this data is in textual format. Extracting the information contained in these texts can support clinical tasks and decision making and is essential for improving health care. The biomedical domain uses a highly specialized and local vocabulary, with an abundance of non-standard and ambiguous abbreviations. Moreover, some types of medical reports contain ill-formed sentences and missing diacritics. Publicly accessible annotated data is scarce, for two main reasons: the difficulty of creating it and the confidential nature of the data, which demands de-identification. This situation hinders progress in information extraction in the biomedical domain. Although Spanish is the second language in the world by number of native speakers, little work has been done on information extraction from Spanish medical reports. Challenges include the absence of specific terminologies for certain medical domains in Spanish and linguistic resources that are less developed than those of high-resource languages such as English. In this thesis, we contribute to the BioNLP domain by providing methods with competitive results for applying a fragment of a medical information extraction pipeline to Spanish radiology reports. To this end, an annotated dataset of Spanish radiology reports was created for entity recognition, negation and speculation detection, and relation extraction. The annotation process followed and the annotation schema developed were shared with the community.
Two named entity recognition algorithms were implemented for the detection of anatomical entities and clinical findings. The first is based on a specialized dictionary of the radiology domain not available in Spanish and on rules built from morphosyntactic knowledge, and is designed for named entity recognition in medium- or low-resource languages. The second, based on conditional random fields, was implemented once a larger set of annotated data was available and achieves better results. We also studied and implemented different solutions for negation detection of clinical findings: an adaptation to Spanish of a popular negation detection algorithm for English medical reports, and a rule-based method that detects negations from patterns inferred by analysing paths in dependency parse trees. The first method obtained the best results and was also adapted for negation and speculation detection in German clinical notes and discharge summaries. We consider that the results obtained and the annotation guidelines provided will help further advance information extraction from Spanish medical reports.
Fil: Cotik, Viviana Erica. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales; Argentina
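The adapted negation algorithm mentioned above (a NegEx-style approach) can be illustrated with a minimal sketch: a clinical finding is treated as negated when a negation trigger appears within a fixed token window before it. The trigger list and window size below are illustrative, not the thesis's actual resources.

```python
# Minimal NegEx-style negation check: a finding is negated if a
# negation trigger occurs within a fixed window of tokens before it.
NEG_TRIGGERS = {"no", "sin", "niega"}   # illustrative Spanish triggers

def is_negated(tokens, finding_index, window=5):
    """Return True if a negation trigger precedes the finding
    within `window` tokens."""
    start = max(0, finding_index - window)
    return any(t.lower() in NEG_TRIGGERS for t in tokens[start:finding_index])

tokens = "paciente sin signos de neumonía".split()
print(is_negated(tokens, tokens.index("neumonía")))   # True: "sin" precedes it
print(is_negated("paciente presenta neumonía".split(), 2))   # False
```

The thesis's dependency-path method replaces this flat window with patterns over paths in the parse tree, which handles long-distance and scoped negations that a token window misses.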
Recognition and normalization of multilingual symptom entities using in-domain-adapted BERT models and classification layers
Due to the scarcity of available annotations in the biomedical domain, clinical natural language processing poses a substantial challenge, especially when applied to low-resource languages. This paper presents our contributions to the detection and normalization of clinical entities corresponding to symptoms, signs, and findings present in multilingual clinical texts. For this purpose, the three subtasks proposed in the SympTEMIST shared task of the BioCreative VIII conference have been addressed. For Subtask 1, named entity recognition in a Spanish corpus, an approach based on ensembles of BERT models pretrained on a proprietary oncology corpus was followed. Subtasks 2 and 3 of SympTEMIST address named entity linking (NEL) in Spanish and multilingual corpora, respectively. Our approach to these subtasks follows a classification strategy that starts from a bi-encoder trained by contrastive learning, for which several SapBERT-like models are explored. To apply this NEL approach to different languages, we trained these models by leveraging the knowledge base of domain-specific medical concepts in Spanish supplied by the organizers, which we translated into the other languages of interest using machine translation tools.
The authors acknowledge the support from the Ministerio de Ciencia e Innovación (MICINN) under project AEI/10.13039/501100011033. This work is also supported by the University of Malaga/CBUA funding for open access charge.
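At inference time, the bi-encoder NEL strategy described above reduces to nearest-neighbour search: the mention and every candidate concept are embedded with the same encoder, and the concept with the highest cosine similarity is returned. This is a minimal sketch of that retrieval step, with toy vectors standing in for SapBERT-style embeddings; the concept entries are illustrative.

```python
def link_entity(mention_vec, concept_vecs):
    """Return the concept ID whose embedding is most similar
    (cosine) to the mention embedding. `concept_vecs` maps concept
    IDs to vectors produced by the same bi-encoder."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0
    return max(concept_vecs,
               key=lambda cid: cosine(mention_vec, concept_vecs[cid]))

# Toy concept space (stand-in for bi-encoder embeddings of a
# Spanish symptom terminology; IDs are illustrative)
concepts = {"C0015967:fiebre": [0.9, 0.1], "C0011991:diarrea": [0.1, 0.9]}
print(link_entity([0.8, 0.2], concepts))   # "C0015967:fiebre"
```

In practice the concept vectors are precomputed and indexed, so linking a mention is a single approximate nearest-neighbour lookup rather than the exhaustive scan shown here.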
The Impact of Automatic Pre-annotation in Clinical Note Data Element Extraction - the CLEAN Tool
Objective. Annotation is expensive but essential for clinical note review and
clinical natural language processing (cNLP). However, the extent to which
computer-generated pre-annotation is beneficial to human annotation is still an
open question. Our study introduces CLEAN (CLinical note rEview and
ANnotation), a pre-annotation-based cNLP annotation system to improve clinical
note annotation of data elements, and comprehensively compares CLEAN with the
widely used annotation system Brat Rapid Annotation Tool (BRAT).
Materials and Methods. CLEAN includes an ensemble pipeline (CLEAN-EP) with a
newly developed annotation tool (CLEAN-AT). A domain expert and a novice
user/annotator participated in a comparative usability test by tagging 87 data
elements related to Congestive Heart Failure (CHF) and Kawasaki Disease (KD)
cohorts in 84 public notes.
Results. CLEAN achieved a higher note-level F1-score (0.896) than BRAT (0.820),
with a significant difference in correctness (P-value < 0.001), the most
influential factor being the system/software (P-value < 0.001). No significant
difference (P-value 0.188) in annotation time was observed between CLEAN (7.262
minutes/note) and BRAT (8.286 minutes/note); the variation was mostly
associated with note length (P-value < 0.001) and the system/software (P-value
0.013). The expert reported CLEAN to be useful/satisfactory, while the novice
reported slight improvements.
Discussion. CLEAN improves the correctness of annotation and increases
usefulness/satisfaction at the same level of efficiency. Limitations include
the untested impact of the pre-annotation correctness rate, the small note and
user sample sizes, and a gold standard with limited validation.
Conclusion. CLEAN with pre-annotation can benefit an expert dealing with
complex annotation tasks involving numerous and diverse target data elements.
Cohort Identification from Free-Text Clinical Notes Using SNOMED CT's Semantic Relations
In this paper, a new cohort identification framework that exploits the semantic hierarchy of SNOMED CT is proposed to overcome the limitations of supervised machine-learning-based approaches. Eligibility criteria descriptions and free-text clinical notes from the 2018 National NLP Clinical Challenge (n2c2) were processed to map them to relevant SNOMED CT concepts and to measure semantic similarity between the eligibility criteria and patients. A patient was deemed eligible if their similarity score exceeded a threshold cut-off value, established where the best F1 score could be achieved. The performance of the proposed system was evaluated for three eligibility criteria. The framework's macro-average F1 score across the three criteria was higher than the previously reported results of the 2018 n2c2 (0.933 vs. 0.889). This study demonstrated that SNOMED CT alone can be leveraged for cohort identification tasks without referring to external textual sources for training.
Doctor of Philosophy
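The threshold-selection step described above can be sketched as a sweep over candidate cut-offs, keeping the one that maximises F1 against gold eligibility labels. A minimal illustration, not the study's implementation; the scores and labels below are invented.

```python
def f1(y_true, y_pred):
    """F1 score for boolean labels/predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, y_true):
    """Pick the cut-off on patient-criterion similarity scores
    that maximises F1 against gold eligibility labels."""
    candidates = sorted(set(scores))
    return max(candidates,
               key=lambda c: f1(y_true, [s >= c for s in scores]))

# Invented similarity scores and gold eligibility for four patients
scores = [0.2, 0.4, 0.7, 0.9]
gold = [False, False, True, True]
print(best_threshold(scores, gold))   # 0.7 separates the classes perfectly
```

Note that tuning the cut-off on the evaluation data, as the description suggests, is optimistic; a held-out split would give a fairer estimate.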
Enhancing the interactivity of a clinical decision support system by using knowledge engineering and natural language processing
Mental illness is a serious health problem and it affects many people. Increasingly, Clinical Decision Support Systems (CDSS) are being used for diagnosis, and it is important to improve the reliability and performance of these systems. Missing a potential clue or a wrong diagnosis can have a detrimental effect on the patient's quality of life and could lead to a fatal outcome. The context of this research is the Galatean Risk and Safety Tool (GRiST), a mental-health-risk assessment system. Previous research has shown that the success of a CDSS depends on its ease of use, reliability and interactivity. This research addresses these concerns for GRiST by deploying data mining techniques. Clinical narratives and numerical data have both been analysed for this purpose. Clinical narratives have been processed with natural language processing (NLP) technology to extract knowledge from them. SNOMED-CT was used as a reference ontology and the performance of the different extraction algorithms has been compared. A new Ensemble Concept Mining (ECM) method has been proposed, which may eliminate the need for domain-specific phrase annotation requirements. Word embedding has been used to filter phrases semantically and to build a semantic representation of each of the GRiST ontology nodes. The Chi-square and FP-growth methods have been used to find relationships between GRiST ontology nodes. Interesting patterns have been found that could be used to provide real-time feedback to clinicians. Information gain has been used efficaciously to explain the differences between the clinicians' and the consensus risk. A new risk management strategy has been explored by analysing repeat assessments. A few novel methods have been proposed to perform automatic background analysis of patient data and to improve the interactivity and reliability of GRiST and similar systems.
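The Chi-square analysis of relationships between ontology nodes can be illustrated with the standard 2×2 contingency-table statistic over co-occurrence counts. A minimal sketch; the counts below are invented for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]], e.g. counts of assessments where two
    ontology nodes are both present / only one present / neither."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Node X present/absent vs. node Y present/absent, 100 assessments
stat = chi_square_2x2(30, 10, 10, 50)
print(round(stat, 2))   # 34.03: well above the ~3.84 critical value
                        # at p = 0.05 with 1 degree of freedom
```

A large statistic only flags an association between two nodes; which associated pairs are clinically interesting still requires expert review, as the abstract implies.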
Front-Line Physicians' Satisfaction with Information Systems in Hospitals
Day-to-day operations management in hospital units is difficult due to continuously varying situations, the several actors involved and the vast number of information systems in use. The aim of this study was to describe front-line physicians' satisfaction with the existing information systems needed to support day-to-day operations management in hospitals. A cross-sectional survey was used, and data selected by stratified random sampling were collected in nine hospitals. Data were analyzed with descriptive and inferential statistical methods. The response rate was 65% (n = 111). The physicians reported that information systems support their decision making to some extent, but they do not improve access to information, nor are they tailored for physicians. The respondents also reported that they need to use several information systems to support decision making and that they would prefer a single information system to access important information. Improved information access would better support physicians' decision making and has the potential to improve the quality of decisions and speed up the decision-making process.
Peer reviewed
Methods and Applications for Summarising Free-Text Narratives in Electronic Health Records
As medical services move towards electronic health record (EHR) systems, the breadth and depth of data stored at each patient encounter has increased. This growing wealth of data and investment in care systems has arguably put greater strain on services, as those at the forefront are pushed towards spending more time in front of computers than with their patients. To minimise the use of EHR systems, clinicians often revert to free-text data entry to circumvent the structured input fields. It has been estimated that approximately 80% of EHR data is within the free-text portion. Beyond their primary use, facilitating the direct care of the patient, secondary uses of EHR data include clinical research, clinical audits, service improvement research, population health analysis, disease and patient phenotyping, and clinical trial recruitment, to name but a few. This thesis presents a number of projects, previously published and original work, in the development, assessment and application of summarisation methods for EHR free-text. First, I introduce, define and motivate EHR free-text analysis, survey summarisation methods for open-domain text, and discuss how open-domain text compares to EHR free-text. I then introduce a subproblem in natural language processing (NLP): the recognition of named entities and their linking to pre-existing clinical knowledge bases (NER+L). This leads to the first novel contribution, the Medical Concept Annotation Toolkit (MedCAT), which provides a software library and workflow for clinical NER+L problems. I frame the outputs of MedCAT as a form of summarisation by showing the tool's contribution to published clinical research and its application to another clinical summarisation use case, 'clinical coding'. I then consider methods for the textual summarisation of portions of clinical free-text, showing how redundancy in clinical text is empirically different from open-domain text and discussing how this impacts text-to-text summarisation. I then compare methods to generate discharge summary sections from previous clinical notes using methods presented in prior chapters, via a novel 'guidance' approach. I close the thesis by discussing my contributions in the context of the state of the art and how my work fits into the wider body of clinical NLP research. I briefly describe the challenges encountered throughout, offer my perspectives on the key enablers of clinical informatics research, and finally outline the potential future work that will go towards translating research impact into real-world benefits for healthcare systems, workers and patients alike.
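The core of an NER+L pipeline such as the one MedCAT addresses can be sketched as a greedy longest-match dictionary lookup that links surface forms to concept identifiers. This is a deliberately minimal illustration, not MedCAT's actual algorithm, which adds, among other things, context-based disambiguation and learned models; the lexicon entries are illustrative.

```python
def ner_link(text, lexicon):
    """Greedy longest-match NER+L: scan tokens left to right and
    link the longest span found in `lexicon` (surface form ->
    concept ID) at each position."""
    tokens = text.lower().split()
    max_len = max(len(k.split()) for k in lexicon)
    results, i = [], 0
    while i < len(tokens):
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + span])
            if phrase in lexicon:
                results.append((phrase, lexicon[phrase]))
                i += span
                break
        else:
            i += 1   # no match starting here; advance one token
    return results

# Illustrative lexicon of surface forms to concept IDs
lexicon = {"heart failure": "C0018801", "dyspnea": "C0013404"}
print(ner_link("Patient with heart failure and dyspnea", lexicon))
# [('heart failure', 'C0018801'), ('dyspnea', 'C0013404')]
```

The longest-match rule is what lets "heart failure" win over any single-token entry for "heart"; real systems layer spelling normalisation and context models on top of this skeleton.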