Disambiguation of biomedical text using diverse sources of information
Background: Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of biomedical texts. Previous approaches to resolving this problem have made use of various sources of information including linguistic features of the context in which the ambiguous term is used and domain-specific resources, such as UMLS.
Materials and methods: We compare various sources of information including ones which have been previously used and a novel one: MeSH terms. Evaluation is carried out using a standard test set (the NLM-WSD corpus).
Results: The best performance is obtained using a combination of linguistic features and MeSH terms. Performance of our system exceeds previously published results for systems evaluated using the same data set.
Conclusion: Disambiguation of biomedical terms benefits from the use of information from a variety of sources. In particular, MeSH terms have proved to be useful and should be used if available.
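A minimal sketch of how such a feature combination could be set up, assuming a scikit-learn pipeline; the instances, MeSH descriptors and sense labels below are invented for illustration and do not reproduce the NLM-WSD features:

    # Illustrative only, not the authors' system: a supervised WSD classifier that
    # combines local-context features with the MeSH terms assigned to the citation.
    import scipy.sparse as sp
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # (context around the ambiguous term, MeSH terms of the citation, gold sense)
    instances = [
        ("patient presented with a cold and mild fever", ["Common Cold", "Fever"], "common_cold"),
        ("samples were stored in cold conditions overnight", ["Cryopreservation"], "low_temperature"),
    ]
    contexts = [c for c, _, _ in instances]
    mesh_docs = [" ".join(m) for _, m, _ in instances]
    labels = [y for _, _, y in instances]

    # Two feature views, vectorised separately and concatenated.
    context_vec = TfidfVectorizer().fit(contexts)
    mesh_vec = TfidfVectorizer().fit(mesh_docs)
    X = sp.hstack([context_vec.transform(contexts), mesh_vec.transform(mesh_docs)])

    clf = LinearSVC().fit(X, labels)

    # Disambiguate a new occurrence (MeSH terms may be empty if unavailable).
    X_new = sp.hstack([context_vec.transform(["cold storage of reagents"]),
                       mesh_vec.transform([""])])
    print(clf.predict(X_new))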
Sense-based biomedical indexing and retrieval
This paper tackles the problem of term ambiguity, especially in the biomedical literature. We propose and evaluate two methods of Word Sense Disambiguation (WSD) for biomedical terms and integrate them into a sense-based document indexing and retrieval framework. Ambiguous biomedical terms in documents and queries are disambiguated using the Medical Subject Headings (MeSH) thesaurus and semantically indexed with their associated correct sense. An experimental evaluation carried out on the TREC9-FT 2000 collection shows that our approach to WSD and sense-based indexing and retrieval is promising.
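As an illustration only (not the paper's algorithm), the following sketch replaces ambiguous surface terms with MeSH sense identifiers before building an inverted index, using an invented sense inventory and a Lesk-style overlap heuristic:

    # Hypothetical sense inventory: ambiguous term -> {MeSH ID: gloss words}
    from collections import defaultdict

    SENSES = {
        "cold": {
            "D003371": {"virus", "nasal", "infection", "rhinovirus"},   # Common Cold
            "D003080": {"temperature", "low", "exposure", "freezing"},  # Cold (temperature)
        }
    }

    def disambiguate(term, context_words):
        """Pick the sense whose gloss overlaps the context most (Lesk-style stand-in)."""
        senses = SENSES.get(term)
        if not senses:
            return term  # unambiguous terms are indexed as-is
        return max(senses, key=lambda sid: len(senses[sid] & set(context_words)))

    def index(docs):
        inverted = defaultdict(set)
        for doc_id, text in docs.items():
            words = text.lower().split()
            for w in words:
                inverted[disambiguate(w, words)].add(doc_id)
        return inverted

    docs = {1: "cold exposure and low temperature", 2: "rhinovirus causes the common cold"}
    idx = index(docs)
    print(idx["D003080"], idx["D003371"])  # {1} {2}: documents are matched at the sense level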
Advancements in eHealth Data Analytics through Natural Language Processing and Deep Learning
The healthcare environment is commonly described as "information-rich" but "knowledge-poor". Healthcare systems collect huge amounts of data from various sources: lab reports, medical letters, logs of medical tools or programs, medical prescriptions, etc. These massive data sets can provide knowledge that improves medical services and the healthcare domain overall, for example disease prediction by analyzing a patient's symptoms, or disease prevention by facilitating the discovery of behavioral factors for diseases. Unfortunately, only a relatively small volume of textual eHealth data is processed and interpreted, an important factor being the difficulty of efficiently performing Big Data operations. In the medical field, detecting domain-specific multi-word terms is a crucial task, as they can define an entire concept in a few words. A term can be defined as a linguistic structure or a concept composed of one or more words with a specific meaning within a domain; together, the terms of a domain form its terminology. This chapter offers a critical study of the current best-performing solutions for analyzing unstructured (image and textual) eHealth data. It also provides a comparison of current Natural Language Processing and Deep Learning techniques in the eHealth context. Finally, we examine and discuss some of the open issues and define a set of research directions in this area.
Exploiting domain information for Word Sense Disambiguation of medical documents
OBJECTIVE: Current techniques for knowledge-based Word Sense Disambiguation (WSD) of ambiguous biomedical terms rely on relations in the Unified Medical Language System Metathesaurus but do not take into account the domain of the target documents. The authors' goal is to improve these methods by using information about the topic of the document in which the ambiguous term appears. DESIGN: The authors proposed and implemented several methods to extract lists of key terms associated with Medical Subject Heading terms. These key terms are used to represent the document topic in a knowledge-based WSD system. They are applied both alone and in combination with local context. MEASUREMENTS: A standard measure of accuracy was calculated over the set of target words in the widely used National Library of Medicine WSD dataset. RESULTS AND DISCUSSION: The authors report a significant improvement when combining these key terms with local context, showing that domain information improves the results of a WSD system based on the Unified Medical Language System Metathesaurus alone. The best results were achieved using key terms obtained by relevance feedback and weighted by inverse document frequency.
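The following sketch, with hypothetical sense profiles, key-term lists and document frequencies, illustrates one way local context and IDF-weighted MeSH-derived key terms could be combined to score candidate senses; it is not the authors' implementation:

    import math

    def idf(term, doc_freq, n_docs):
        return math.log(n_docs / (1 + doc_freq.get(term, 0)))

    def score_sense(sense_profile, context, key_terms, doc_freq, n_docs, alpha=0.5):
        """sense_profile: words describing a candidate concept (e.g. from UMLS relations)."""
        ctx_score = sum(1.0 for w in context if w in sense_profile)
        key_score = sum(idf(w, doc_freq, n_docs) for w in key_terms if w in sense_profile)
        return alpha * ctx_score + (1 - alpha) * key_score

    # Toy example for the ambiguous token "culture" (CUIs and profiles are placeholders).
    profiles = {
        "C0010453": {"society", "beliefs", "ethnic"},        # Culture (social)
        "C0430400": {"bacteria", "growth", "laboratory"},    # Laboratory culture
    }
    context = ["blood", "sample", "sent", "for", "culture"]
    key_terms = ["bacteria", "laboratory", "infection"]      # derived from the document's MeSH headings
    doc_freq, n_docs = {"bacteria": 120, "laboratory": 300, "infection": 500}, 10000

    best = max(profiles, key=lambda c: score_sense(profiles[c], context, key_terms, doc_freq, n_docs))
    print(best)  # expected: C0430400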
Acronyms as an integral part of multi-word term recognition - A token of appreciation
Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain-specific concept as a latent variable. In a previous study, we described FlexiTerm, an unsupervised method for recognition of multi-word terms from a domain-specific corpus. It uses a range of methods to normalize three types of term variation: orthographic, morphological and syntactic. Acronyms, which represent a highly productive type of term variation, were not supported. In this study, we describe how the functionality of FlexiTerm has been extended to recognize acronyms and incorporate them into the term conflation process. The main contribution of this study is not acronym recognition per se, but rather its integration with other types of term variation into the term conflation process. We evaluated the effects of term conflation in the context of information retrieval, one of its most prominent applications. On average, relative recall increased by 32 percentage points, whereas the index compression factor increased by 7 percentage points. Therefore, the evidence suggests that the integration of acronyms provides a non-trivial improvement in term conflation.
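An illustrative sketch of term conflation with acronym expansion (not the FlexiTerm implementation; the acronym table and normalisation rules are assumptions), showing how several surface variants collapse onto one index key:

    import re

    ACRONYMS = {"copd": "chronic obstructive pulmonary disease"}  # hypothetical mapping

    def normalise(term):
        term = term.lower().strip()
        term = ACRONYMS.get(term, term)            # expand known acronyms first
        term = re.sub(r"[^\w\s]", " ", term)       # drop punctuation/hyphens (orthographic variation)
        tokens = sorted(t.rstrip("s") for t in term.split())  # crude plural stripping + word order
        return " ".join(tokens)

    variants = [
        "COPD",
        "chronic obstructive pulmonary disease",
        "chronic obstructive pulmonary diseases",
        "disease, chronic obstructive pulmonary",
    ]
    keys = {normalise(v) for v in variants}
    print(len(variants), "surface variants ->", len(keys), "index key(s)")
    # Fewer index keys per concept is what drives the index compression reported above.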
Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues
BACKGROUND: Word sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems because ambiguous words negatively impact accurate access to literature containing biomolecular entities, such as genes, proteins, cells, diseases, and other important entities. Automated techniques have been developed that address the WSD problem for a number of text processing situations, but the problem is still a challenging one. Supervised WSD machine learning (ML) methods have been applied in the biomedical domain and have shown promising results, but the results typically incorporate a number of confounding factors, and it is problematic to truly understand the effectiveness and generalizability of the methods because these factors interact with each other and affect the final results. Thus, there is a need to explicitly address the factors and to systematically quantify their effects on performance. RESULTS: Experiments were designed to measure the effect of "sample size" (i.e. the size of the datasets), "sense distribution" (i.e. the distribution of the different meanings of the ambiguous word) and "degree of difficulty" (i.e. a measure of the distances between the meanings of the senses of an ambiguous word) on the performance of WSD classifiers. Support Vector Machine (SVM) classifiers were applied to an automatically generated data set containing four ambiguous biomedical abbreviations: BPD, BSA, PCA, and RSV, which were chosen because of varying degrees of differences in their respective senses. Results showed that: 1) increasing the sample size generally reduced the error rate, but this was limited mainly to well-separated senses (i.e. cases where the distances between the senses were large); in difficult cases an unusually large increase in sample size was needed to increase performance slightly, which was impractical; 2) the sense distribution did not have an effect on performance when the senses were separable; 3) when there was a majority sense of over 90%, the WSD classifier was not better than use of the simple majority sense; 4) error rates were proportional to the similarity of senses; and 5) there was no statistical difference between results when using a 5-fold or 10-fold cross-validation method. Other issues that impact performance are also enumerated. CONCLUSION: Several different independent aspects affect performance when using ML techniques for WSD. We found that combining them into one single result obscures understanding of the underlying methods. Although we studied only four abbreviations, we utilized a well-established statistical method that guarantees the results are likely to be generalizable for abbreviations with similar characteristics. The results of our experiments show that in order to understand the performance of these ML methods it is critical that papers report on the baseline performance, the distribution and sample size of the senses in the datasets, and the standard deviation or confidence intervals. In addition, papers should also characterize the difficulty of the WSD task, the WSD situations addressed and not addressed, as well as the ML methods and features used. This should lead to an improved understanding of the generalizability and the limitations of the methodology.
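A minimal sketch of this kind of experiment, with toy instances standing in for the automatically generated abbreviation data set: an SVM over bag-of-words context features evaluated with 5-fold cross-validation.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy instances of the abbreviation "PCA" with two senses (invented examples).
    texts = [
        "pca pump was started for postoperative pain control",
        "patient controlled analgesia pca settings were adjusted",
        "pca was applied to reduce the dimensionality of the data",
        "principal component analysis pca of gene expression profiles",
    ] * 5  # repeat so every fold contains both senses
    labels = ["analgesia", "analgesia", "components", "components"] * 5

    model = make_pipeline(CountVectorizer(), LinearSVC())
    scores = cross_val_score(model, texts, labels, cv=StratifiedKFold(n_splits=5))
    print(scores.mean())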
Word Sense Disambiguation for clinical abbreviations
Abbreviations are extensively used in patients' electronic health records (EHRs) as well as in other medical documentation, accounting for 30-50% of the words in clinical narrative. There are more than 197,000 unique medical abbreviations found in clinical text, and their meanings vary depending on the context in which they are used. Since data in electronic health records may be shared across health information systems (hospitals, primary care centers, etc.) as well as with other systems, such as those of insurance companies, it is essential to determine the correct meaning of the abbreviations to avoid misunderstandings.
Clinical abbreviations have the specific characteristic of not following any standard rules of creation, which makes it complicated to compile these abbreviations and their corresponding meanings. Furthermore, privacy restrictions add to the difficulty of working with clinical data, even though having such data is essential in order to develop and test algorithms.
Word sense disambiguation (WSD) is an essential task in natural language processing (NLP) applications such as information extraction, chatbots and summarization systems, among others. WSD aims to identify the correct meaning of an ambiguous word, i.e. a word with more than one meaning. Disambiguating clinical abbreviations is a type of lexical-sample WSD task. Previous research has adopted supervised, unsupervised and knowledge-based (KB) approaches to disambiguate clinical abbreviations. This thesis aims to propose a classification model that, apart from disambiguating well-known abbreviations, also disambiguates rare and unseen abbreviations using the most recent deep neural network architectures for language modeling.
Several resources and disambiguation models for clinical abbreviation disambiguation were reviewed, and the different classification approaches used to disambiguate clinical abbreviations were investigated in this thesis. Since computers do not understand text directly, different data representations were implemented to capture the meaning of words, and the evaluation measures used to assess algorithm performance are also discussed.
As solutions to clinical WSD, we explored static word-embedding representations for 13 English clinical abbreviations of the UMN data set (from the University of Minnesota), testing traditional supervised machine learning algorithms separately for each abbreviation. Moreover, we used a transformer-based pretrained model fine-tuned as a multi-class classifier for the whole data set (75 abbreviations of the UMN data set); the aim of implementing a single multi-class classifier is to predict the rare and unseen abbreviations that are common in clinical narrative. Additionally, other experiments were conducted for a different type of abbreviation (scientific abbreviations and acronyms) by defining a hybrid approach composed of supervised and knowledge-based methods.
Most previous works build a separate classifier for each clinical abbreviation, typically leveraging different data resources to overcome the data acquisition bottleneck. However, those models are restricted to disambiguating terms that have been seen in the training data. Based on our results, by contrast, transfer learning by fine-tuning a transformer-based model can predict rare and unseen abbreviations. A remaining challenge for future work is to improve the model to automate the disambiguation of clinical abbreviations in run-time systems by implementing self-supervised learning models.
Abbreviations are widely used in patients' electronic health records and in much medical documentation, accounting for 30-50% of the words used in clinical narrative. More than 197,000 unique abbreviations are used in clinical texts, and they are highly ambiguous terms: the meaning of an abbreviation varies depending on the context in which it is used. Since electronic health record data may be shared between services, hospitals and primary care centers, as well as with other organizations such as insurance companies, it is essential to determine the correct meaning of the abbreviations, not least to avoid adverse events related to patient safety. New clinical abbreviations appear constantly and have the specific characteristic of not following any standard for their creation, which makes it very difficult to maintain a resource with all abbreviations and all their meanings. Added to this is the difficulty of working with clinical data for privacy reasons, when having such data is essential in order to develop algorithms for processing them.
Word sense disambiguation (WSD) is an essential task in natural language processing (NLP) applications such as information extraction, chatbots or summarization systems, among others. WSD aims to identify the correct meaning of an ambiguous word (one with more than one meaning). This task has previously been addressed using supervised, unsupervised and knowledge-based approaches. This thesis aims to define a classification model that, in addition to disambiguating known abbreviations, also disambiguates less frequent abbreviations that have not previously appeared in the training sets, using the most recent deep neural network architectures for language modeling.
Various resources and disambiguation models are used in clinical abbreviation disambiguation. The different classification approaches used to disambiguate clinical abbreviations have been investigated. Since a computer does not understand text directly, different text representations have been implemented to capture the meaning of words. As it is also necessary to measure the performance of any algorithm, the evaluation measures used are also described.
Most previous works have been based on building a separate classifier for each clinical abbreviation, tending to leverage different data resources to overcome the data acquisition bottleneck. However, these models were limited to disambiguating the data on which the system had been trained.
We have also explored representations based on static word embeddings for 13 clinical abbreviations of the English UMN corpus (from the University of Minnesota), using traditional supervised machine learning classification algorithms (one classifier per abbreviation). A second experiment was carried out using a multi-class classifier over the whole set of 75 abbreviations of the UMN corpus, based on a pretrained Transformer model; the goal was to implement a single multi-class classifier that can also predict rare and unseen abbreviations. An additional experiment was carried out on scientific acronyms in open-domain documents by applying a hybrid approach combining supervised and knowledge-based methods.
Thus, based on the results of this thesis, transfer learning by fine-tuning a pretrained language model can predict rare and unseen abbreviations without having to train on them beforehand. A remaining challenge for future work is to improve the model to automate the disambiguation of clinical abbreviations in run-time systems by implementing self-supervised learning models.
Programa de Doctorado en Ciencia y Tecnología Informática, Universidad Carlos III de Madrid. President: Israel González Carrasco. Secretary: Leonardo Campillos Llanos. Panel member: Ana María García Serran
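A hedged sketch of the single multi-class fine-tuning setup described in the thesis abstract; the model name, sense labels and training examples below are placeholders, not the thesis configuration:

    import torch
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # One classifier over all abbreviation senses, so unseen contexts of any
    # abbreviation are routed through a single model (toy labels and texts).
    senses = ["AB_acute_bronchitis", "AB_abortion", "CA_cancer", "CA_cardiac_arrest"]
    texts = ["pt admitted with AB and productive cough",
             "history of elective AB in 2004",
             "CA of the left breast, stage II",
             "witnessed CA, CPR started immediately"]
    labels = [0, 1, 2, 3]

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(senses))

    class SenseDataset(torch.utils.data.Dataset):
        def __init__(self, texts, labels):
            self.enc = tok(texts, truncation=True, padding=True)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir="wsd_model", num_train_epochs=1),
                      train_dataset=SenseDataset(texts, labels))
    trainer.train()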