75 research outputs found
Enhanced Integrated Scoring for Cleaning Dirty Texts
An increasing number of approaches to ontology engineering from text are
geared towards the use of online sources such as company intranets and the
World Wide Web. Despite this rise, little work can be found on preprocessing
and cleaning dirty texts from online sources. This paper presents an
enhancement of Integrated Scoring for Spelling error correction, Abbreviation
expansion and Case restoration (ISSAC). ISSAC is implemented as part of the
text preprocessing phase in an ontology engineering system. New evaluations
performed on the enhanced ISSAC using 700 chat records reveal an improved
accuracy of 98%, compared with 96.5% and 71% from using only the basic ISSAC
and Aspell, respectively. More information is available at
http://explorer.csse.uwa.edu.au/reference
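To make the idea of integrated scoring concrete, here is a hypothetical sketch in the spirit of ISSAC: each cleaning operation (spelling correction, abbreviation expansion, case restoration) proposes candidate replacements with a score, and the best-scoring candidate is kept. The lexicons, weights, and scoring scheme below are invented for illustration and are not the actual ISSAC model.

```python
# Hypothetical integrated-scoring cleaner in the spirit of ISSAC.
# Lexicons and weights are toy values, not the actual ISSAC resources.
ABBREVIATIONS = {"asap": "as soon as possible", "btw": "by the way"}
DICTIONARY = {"the", "meeting", "is", "on", "reply", "please"}
PROPER_NOUNS = {"monday"}

def clean_token(token):
    """Score candidates from every channel and keep the best one."""
    t = token.lower()
    cands = [(token, 0.1)]                       # leave token unchanged
    if t in DICTIONARY:
        cands.append((t, 1.0))                   # already a valid word
    if t in ABBREVIATIONS:
        cands.append((ABBREVIATIONS[t], 0.9))    # abbreviation expansion
    if t in PROPER_NOUNS:
        cands.append((t.capitalize(), 0.8))      # case restoration
    for word in DICTIONARY:                      # one-substitution spelling fix
        if len(word) == len(t) and sum(a != b for a, b in zip(word, t)) == 1:
            cands.append((word, 0.7))
    return max(cands, key=lambda c: c[1])[0]

print(" ".join(clean_token(w) for w in "reply asap btw the meetimg is on monday".split()))
# → reply as soon as possible by the way the meeting is on Monday
```

A real system would weight each channel by corpus statistics rather than fixed constants; the max-over-candidates structure is the point of the sketch.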
Word Sense Disambiguation for clinical abbreviations
Abbreviations are extensively used in patients' electronic health records (EHR) as
well as in medical documentation, accounting for 30-50% of the words in clinical narrative.
There are more than 197,000 unique medical abbreviations found in clinical text, and their
meanings vary depending on the context in which they are used. Since data in electronic
health records can be shared across health information systems (hospitals, primary care centers, etc.) and with other organizations such as insurance companies, it is essential to determine the correct meaning of the abbreviations to avoid misunderstandings.
Clinical abbreviations have the specific characteristic that they do not follow any standard
rules for their creation. This makes it complicated to find these abbreviations and their corresponding meanings. Furthermore, privacy restrictions add difficulty to working with clinical data, even though access to such data is essential for developing and testing algorithms.
Word sense disambiguation (WSD) is an essential task in natural language processing
(NLP) applications such as information extraction, chatbots and summarization systems
among others. WSD aims to identify the correct meaning of the ambiguous word which
has more than one meaning. Disambiguating clinical abbreviations is a type of lexical
sample WSD task. Previous research works adopted supervised, unsupervised and
Knowledge-based (KB) approaches to disambiguate clinical abbreviations. This thesis
aims to propose a classification model that apart from disambiguating well known abbreviations also disambiguates rare and unseen abbreviations using the most recent deep neural network architectures for language modeling.
Several resources and disambiguation models for clinical abbreviation disambiguation
were encountered in the literature. The different classification approaches used to disambiguate
clinical abbreviations were investigated in this thesis. Considering that computers do not directly
understand text, different data representations were implemented to capture the meaning of the words. Since it is also necessary to measure the performance of the algorithms, the evaluation measures used are discussed.
As solutions to clinical WSD, we have explored a static word embedding data representation for 13 English clinical abbreviations of the UMN data set (from the University of Minnesota), testing traditional supervised machine learning algorithms separately for each abbreviation. Moreover, we have fine-tuned a transformer-based pretrained model as a multi-class classifier for the whole data set (75 abbreviations of the UMN data set). The aim of implementing a single multi-class classifier is to predict rare and unseen abbreviations, which are common in clinical
narrative. Additionally, other experiments were conducted on a different type of abbreviation
(scientific abbreviations and acronyms) by defining a hybrid approach composed
of supervised and knowledge-based approaches.
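The first experiment, a separate supervised classifier per abbreviation over averaged static word embeddings, can be sketched as follows. The embeddings, senses, and contexts here are toy stand-ins (not the UMN data), and a nearest-centroid rule stands in for the traditional classifiers that were tested:

```python
# Toy sketch: disambiguating one clinical abbreviation ("RA") with averaged
# static word embeddings and a nearest-centroid classifier. Embedding values
# and training contexts are invented for illustration, not the UMN data.
import numpy as np

EMB = {  # toy 3-dimensional "static embeddings"
    "joint": np.array([1.0, 0.1, 0.0]),
    "pain": np.array([0.9, 0.2, 0.1]),
    "heart": np.array([0.0, 1.0, 0.1]),
    "atrium": np.array([0.1, 0.9, 0.2]),
}

def embed(context):
    """Average the embeddings of the context words (unknown words ignored)."""
    vecs = [EMB[w] for w in context.split() if w in EMB]
    return np.mean(vecs, axis=0)

# Labelled training contexts for the ambiguous short form "RA".
train = [
    ("joint pain", "rheumatoid arthritis"),
    ("heart atrium", "right atrium"),
]
centroids = {sense: embed(ctx) for ctx, sense in train}

def disambiguate(context):
    """Predict the sense whose centroid is closest to the context vector."""
    v = embed(context)
    return min(centroids, key=lambda s: np.linalg.norm(v - centroids[s]))

print(disambiguate("pain joint"))    # → rheumatoid arthritis
print(disambiguate("atrium heart"))  # → right atrium
```

The per-abbreviation design means each short form gets its own training set and model, which is exactly the data-hungry setup the multi-class transformer experiment tries to avoid.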
Most previous works tend to build a separate classifier for each clinical abbreviation,
leveraging different data resources to overcome the data acquisition bottleneck.
However, those models are restricted to disambiguating terms that have been
seen in the training data. Meanwhile, based on our results, transfer learning by fine-tuning a
transformer-based model could predict rare and unseen abbreviations. A remaining challenge for future work is to improve the model to automate the disambiguation of clinical abbreviations in run-time systems by implementing self-supervised learning models.
Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid. President: Israel González Carrasco. Secretary: Leonardo Campillos Llanos. Member: Ana María García Serran
Acronym-Expansion Disambiguation for Intelligent Processing of Enterprise Information
An acronym is an abbreviation of several words formed in such a way that the abbreviation itself is a pronounceable word. Acronyms occur frequently throughout various documents, especially those of a technical nature, for example, research papers and patents. While these acronyms can enhance document readability, in a variety of fields they have a negative effect on business intelligence. To resolve this problem, we propose a method of acronym-expansion disambiguation to collect high-quality enterprise information. In experimental evaluations, we demonstrate its efficiency through objective comparisons.
A system to extract abbreviation-expansion pairs from biomedical literature
We present a system to identify abbreviation-expansion pairs in scientific articles. We work with the Genomics track of the TREC collection. Authors report abbreviations in two places: an abbreviations section and the body of the article. Articles with an abbreviations section had fewer abbreviations than those without one (an average of 7.1 versus 13.2 abbreviations per article). For articles that do have an abbreviations section, authors report 98.2% of the abbreviations present in the document in that section. Inspired by Schwartz & Hearst's earlier work, our program identified 2.1 million abbreviations in 162,259 documents. A manual inspection of a randomly selected set of articles revealed that our system achieved 86.7% precision and 81.9% recall.
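The Schwartz & Hearst matching strategy such systems draw on can be sketched as follows. This is a deliberately simplified illustration (the regex and word-boundary handling are reduced), not the authors' implementation:

```python
# Simplified sketch of Schwartz & Hearst-style abbreviation-definition
# matching: scan "long form (SF)" patterns and keep the shortest suffix of
# the candidate long form that covers the short form's characters in order.
import re

def best_long_form(short, candidate):
    """Match short-form characters right-to-left inside the candidate."""
    s, c = short.lower(), candidate.lower()
    si, ci = len(s) - 1, len(c) - 1
    while si >= 0:
        if ci < 0:
            return None  # short form cannot be covered by the candidate
        # the first short-form char must begin a word in the long form
        if c[ci] == s[si] and (si > 0 or ci == 0 or c[ci - 1] == " "):
            si -= 1
        ci -= 1
    return candidate[ci + 1:].strip()

def find_pairs(text):
    """Collect (short form, long form) pairs from parenthesized patterns."""
    pairs = []
    for m in re.finditer(r"([\w \-]+?)\s*\(([A-Za-z]{2,10})\)", text):
        long_form = best_long_form(m.group(2), m.group(1))
        if long_form:
            pairs.append((m.group(2), long_form))
    return pairs

print(find_pairs("the Unified Medical Language System (UMLS) maps terms"))
# → [('UMLS', 'Unified Medical Language System')]
```

The right-to-left scan is what lets the long form be trimmed to the shortest plausible span, which is the key idea behind the original algorithm's high precision.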
Normalizing acronyms and abbreviations to aid patient understanding of clinical texts: ShARe/CLEF eHealth Challenge 2013, Task 2
Background: The ShARe/CLEF eHealth challenge lab aims to stimulate development of natural language
processing and information retrieval technologies to aid patients in understanding their clinical reports. In clinical
text, acronyms and abbreviations, also referred to as short forms, can be difficult for patients to understand. For one
of three shared tasks in 2013 (Task 2), we generated a reference standard of clinical short forms normalized to the
Unified Medical Language System. This reference standard can be used to improve patient understanding by
linking to web sources with lay descriptions of annotated short forms or by substituting short forms with a more
simplified, lay term.
Methods: In this study, we 1) evaluate the accuracy of participating systems in normalizing short forms compared to a
majority sense baseline approach, 2) evaluate the performance of participants' systems on short forms with variable
majority sense distributions, and 3) report the accuracy of participating systems in normalizing concepts shared
between the test set and the Consumer Health Vocabulary, a vocabulary of lay medical terms.
Results: The best systems submitted by the five participating teams performed with accuracies ranging from 43 to
72 %. A majority sense baseline approach achieved the second best performance. For short forms with two or more
senses, the accuracy of participating systems ranged from 52 to 78 % under low ambiguity (majority sense greater
than 80 %), from 23 to 57 % under moderate ambiguity (majority sense between 50 and 80 %), and from 2 to 45 %
under high ambiguity (majority sense less than 50 %). With respect to the ShARe test set, 69 % of short
form annotations contained common concept unique identifiers with the Consumer Health Vocabulary. For these
2594 possible annotations, the performance of participating systems ranged from 50 to 75 % accuracy.
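The majority sense baseline referred to above is simple to state in code: for each short form, always predict the sense that is most frequent in the training annotations. A toy illustration (the short forms, senses, and counts are invented, not the ShARe data):

```python
# Majority-sense baseline: always predict the most frequent training sense.
# Short forms and senses below are toy examples, not the ShARe annotations.
from collections import Counter

train = [  # (short form, annotated sense)
    ("pt", "patient"), ("pt", "patient"), ("pt", "physical therapy"),
    ("ra", "right atrium"), ("ra", "right atrium"), ("ra", "rheumatoid arthritis"),
]

# Most frequent sense per short form, computed once from the training data.
majority = {
    sf: Counter(sense for f, sense in train if f == sf).most_common(1)[0][0]
    for sf in {f for f, _ in train}
}

def baseline_predict(short_form):
    return majority.get(short_form)

test = [("pt", "patient"), ("pt", "physical therapy"), ("ra", "right atrium")]
accuracy = sum(baseline_predict(sf) == sense for sf, sense in test) / len(test)
print(round(accuracy, 2))  # → 0.67
```

Because accuracy under this baseline equals the majority-sense proportion of each short form, it is strong on low-ambiguity short forms and collapses on high-ambiguity ones, which is exactly the pattern in the reported results.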
Conclusion: Short form normalization continues to be a challenging problem. Short form normalization systems
perform with moderate to reasonable accuracies. The Consumer Health Vocabulary could enrich its knowledge base
with missed concept unique identifiers from the ShARe test set to further support patient understanding of
unfamiliar medical terms.