243 research outputs found

    Leveraging Unannotated Texts for Scientific Relation Extraction

    Progress and Opportunities of Foundation Models in Bioinformatics

    Bioinformatics has witnessed a paradigm shift with the increasing integration of artificial intelligence (AI), particularly through the adoption of foundation models (FMs). These AI techniques have rapidly advanced, addressing historical challenges in bioinformatics such as the scarcity of annotated data and the presence of data noise. FMs are particularly adept at handling large-scale, unlabeled data, a common scenario in biological contexts due to the time-consuming and costly nature of experimentally determining labeled data. This characteristic has allowed FMs to excel and achieve notable results in various downstream validation tasks, demonstrating their ability to represent diverse biological entities effectively. Undoubtedly, FMs have ushered in a new era in computational biology, especially in the realm of deep learning. The primary goal of this survey is to conduct a systematic investigation and summary of FMs in bioinformatics, tracing their evolution, current research status, and the methodologies employed. Central to our focus is the application of FMs to specific biological problems, aiming to guide the research community in choosing appropriate FMs for their research needs. We delve into the specifics of the problems at hand, including sequence analysis, structure prediction, function annotation, and multimodal integration, comparing the structures and advancements against traditional methods. Furthermore, the review analyses challenges and limitations faced by FMs in biology, such as data noise, model explainability, and potential biases. Finally, we outline potential development paths and strategies for FMs in future biological research, setting the stage for continued innovation and application in this rapidly evolving field. This comprehensive review serves not only as an academic resource but also as a roadmap for future explorations and applications of FMs in biology. Comment: 27 pages, 3 figures, 2 tables

    Extreme multi-label deep neural classification of Spanish health records according to the International Classification of Diseases

    111 p. This work deals with clinical text mining, a field of Natural Language Processing applied to the biomedical domain. The goal is to automate the task of medical coding. Electronic health records (EHRs) are documents that contain clinical information about a patient's health. The diagnoses and medical procedures recorded in the electronic health record are coded according to the International Classification of Diseases (ICD). Indeed, the ICD is the basis for compiling international health statistics and the standard for reporting diseases and health conditions. From a machine learning perspective, the goal is to solve an extreme multi-label text classification problem, since each health record is assigned multiple ICD codes from a set of more than 70,000 diagnostic terms. A significant amount of resources is devoted to medical coding, a laborious task that is currently performed manually. EHRs are lengthy narratives, and medical coders review the records written by physicians and assign the corresponding ICD codes. The texts are technical, since physicians use specialized medical jargon, yet they are rich in abbreviations, acronyms, and spelling errors, because physicians document the records while carrying out actual clinical practice. To address the automatic classification of health records, we investigate and develop a set of deep learning text classification techniques.

    Position-aware deep multi-task learning for drug–drug interaction extraction

    Objective: A drug–drug interaction (DDI) is a situation in which a drug affects the activity of another drug synergistically or antagonistically when the two are administered together. Information about DDIs is crucial for healthcare professionals to prevent adverse drug events. Although some known DDIs can be found in purpose-built databases such as DrugBank, most information is still buried in scientific publications. Therefore, automatic extraction of DDIs from biomedical texts is sorely needed. Methods and material: In this paper, we propose a novel position-aware deep multi-task learning approach for extracting DDIs from biomedical texts. In particular, sentences are represented as a sequence of word embeddings and position embeddings. An attention-based bidirectional long short-term memory (BiLSTM) network is used to encode each sentence. The relative position information of words with respect to the target drugs in the text is combined with the hidden states of the BiLSTM to generate the position-aware attention weights. Moreover, the tasks of predicting whether or not two drugs interact with each other and further distinguishing the types of interactions are learned jointly in a multi-task learning framework. Results: The proposed approach has been evaluated on the DDIExtraction challenge 2013 corpus. The results show that with position-aware attention only, our approach outperforms the state-of-the-art method by 0.99% for binary DDI classification, and with both position-aware attention and multi-task learning, it achieves a micro F-score of 72.99% on interaction type identification, outperforming the state-of-the-art approach by 1.51%, which demonstrates the effectiveness of the proposed approach.
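The relative position information mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that each token's position is expressed as a signed offset from each of the two target drugs, which is a common choice for position embeddings but is not confirmed by the abstract. The function name `relative_positions` and the placeholder tokens `DRUG_A` and `DRUG_B` are hypothetical.

```python
from typing import List, Tuple


def relative_positions(
    tokens: List[str], drug1_idx: int, drug2_idx: int
) -> List[Tuple[int, int]]:
    """Return, for every token, its signed offset from each target drug.

    Assumption: offsets are plain index differences. A real model would
    typically map these offsets to learned position embeddings.
    """
    return [(i - drug1_idx, i - drug2_idx) for i in range(len(tokens))]


# Hypothetical example sentence with placeholder drug mentions.
tokens = ["DRUG_A", "increases", "the", "effect", "of", "DRUG_B"]
offsets = relative_positions(tokens, drug1_idx=0, drug2_idx=5)

for token, (d1, d2) in zip(tokens, offsets):
    print(f"{token:10s} offset_to_drug1={d1:+d} offset_to_drug2={d2:+d}")
```

In a position-aware model, such offsets would usually be used as indices into an embedding table and combined with the word representations. How the paper combines them with the BiLSTM hidden states is not specified in the abstract.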

    Word Sense Disambiguation for clinical abbreviations

    Abbreviations are extensively used in electronic health records (EHR) of patients as well as medical documentation, reaching 30-50% of the words in clinical narrative. There are more than 197,000 unique medical abbreviations found in clinical text, and their meanings vary depending on the context in which they are used. Since data in electronic health records can be shared across health information systems (hospitals, primary care centers, etc.) as well as other systems such as those of insurance companies, it is essential to determine the correct meaning of abbreviations to avoid misunderstandings. Clinical abbreviations have the specific characteristic that they do not follow any standard rules for their creation. This makes it complicated to find these abbreviations and their corresponding meanings. Furthermore, working with clinical data is made more difficult by privacy restrictions, even though such data are essential for developing and testing algorithms. Word sense disambiguation (WSD) is an essential task in natural language processing (NLP) applications such as information extraction, chatbots, and summarization systems, among others. WSD aims to identify the correct meaning of an ambiguous word, that is, a word with more than one meaning. Disambiguating clinical abbreviations is a type of lexical sample WSD task. Previous research adopted supervised, unsupervised, and knowledge-based (KB) approaches to disambiguate clinical abbreviations. This thesis aims to propose a classification model that, apart from disambiguating well-known abbreviations, also disambiguates rare and unseen abbreviations using the most recent deep neural network architectures for language modeling. In clinical abbreviation disambiguation, several resources and disambiguation models were encountered. Different classification approaches used to disambiguate clinical abbreviations were investigated in this thesis.
Considering that computers do not directly understand texts, different data representations were implemented to capture the meaning of the words. Since it is also necessary to measure the performance of algorithms, the evaluation measures used are discussed. Among the different solutions proposed for clinical WSD, we explored static word embeddings as the data representation for 13 English clinical abbreviations of the UMN data set (from the University of Minnesota), testing traditional supervised machine learning algorithms separately for each abbreviation. Moreover, we used a transformer-based pretrained model that was fine-tuned as a multi-class classifier for the whole data set (75 abbreviations of the UMN data set). The aim of implementing just one multi-class classifier is to predict rare and unseen abbreviations, which are common in clinical narrative. Additionally, other experiments were conducted for a different type of abbreviation (scientific abbreviations and acronyms) by defining a hybrid approach composed of supervised and knowledge-based approaches. Most previous works tend to build a separate classifier for each clinical abbreviation, leveraging different data resources to overcome the data acquisition bottleneck. However, those models were restricted to disambiguating terms seen in the training data. Meanwhile, based on our results, transfer learning by fine-tuning a transformer-based model could predict rare and unseen abbreviations. A remaining challenge for future work is to improve the model to automate the disambiguation of clinical abbreviations in run-time systems by implementing self-supervised learning models.
Spanish abstract (translated): Abbreviations are widely used in patients' electronic health records and in much medical documentation, accounting for 30-50% of the words used in clinical narrative. There are more than 197,000 unique abbreviations used in clinical texts, and they are highly ambiguous terms. The meaning of an abbreviation varies depending on the context in which it is used. Since data in electronic health records can be shared among services, hospitals, and primary care centers, as well as other organizations such as insurance companies, it is essential to determine the correct meaning of abbreviations, which also helps avoid adverse events related to patient safety. New clinical abbreviations appear constantly, and they have the specific characteristic of not following any standard for their creation. This makes it very difficult to have a resource containing all abbreviations and all their meanings. Added to this is the difficulty of working with clinical data because of privacy concerns, even though having such data is essential for developing algorithms to process them. Word sense disambiguation (WSD) is an essential task in natural language processing (NLP) applications such as information extraction, chatbots, and summary generators, among others. WSD aims to identify the correct meaning of an ambiguous word (one that has more than one meaning). This task has previously been addressed using supervised, unsupervised, and knowledge-based approaches. This thesis aims to define a classification model that, besides disambiguating known abbreviations, also disambiguates less frequent abbreviations that have not previously appeared in the training sets, using the most recent deep neural network architectures related to language models. Various resources and disambiguation models are used in clinical abbreviation disambiguation.
The different classification approaches used to disambiguate clinical abbreviations were investigated. Since a computer does not directly understand texts, different text representations were implemented to capture the meaning of words. Since it is also necessary to measure the performance of any algorithm, the evaluation measures used are also described. Most previous work has relied on building a separate classifier for each clinical abbreviation, and thus tends to leverage different data resources to overcome the data acquisition bottleneck. However, these models were limited to disambiguating the data on which the system had been trained. In addition, representations based on static word embeddings were explored for 13 clinical abbreviations in the English UMN corpus (from the University of Minnesota) using traditional supervised machine learning classification algorithms (one classifier per abbreviation). A second experiment was carried out using a multi-class model over the full set of 75 abbreviations of the UMN corpus, based on a pretrained Transformer model. The goal was to implement a multi-class classifier that also predicts rare and unseen abbreviations. An additional experiment was conducted for scientific acronyms in open-domain documents by applying a hybrid approach that combines supervised and knowledge-based approaches. Thus, based on the results of this thesis, transfer learning by fine-tuning a pretrained language model could predict rare and unseen abbreviations without needing to train on them beforehand.
A remaining challenge for future work is to improve the model to automate the disambiguation of clinical abbreviations at run time by implementing self-supervised learning models.
Doctoral Program in Computer Science and Technology, Universidad Carlos III de Madrid. Chair: Israel González Carrasco. Secretary: Leonardo Campillos Llanos. Member: Ana María García Serran

    Unsupervised Biomedical Named Entity Recognition

    Named entity recognition (NER) from text is an important task for several applications, including in the biomedical domain. Supervised machine-learning-based systems have been the most successful on the NER task; however, they require large quantities of correct annotations for training. Annotating text manually is very labor intensive and also needs domain expertise. The purpose of this research is to reduce human annotation effort and to decrease the cost of annotation for building NER systems in the biomedical domain. The method developed in this work leverages the availability of resources like UMLS (Unified Medical Language System), which contain a list of biomedical entities, together with a large unannotated corpus to build an unsupervised NER system that does not require any manual annotations. The method has two phases. In the first phase, a biomedical corpus is automatically annotated with some named entities using UMLS through unambiguous exact matching, producing what we call weakly labeled data. In this data, positive examples are the entities in the text that exactly match in UMLS and have only one semantic type, which belongs to the desired entity class to be extracted (for example, diseases and disorders). Negative examples are the entities in the text that exactly match in UMLS but are of semantic types other than those that belong to the desired entity class. These examples are then used to train a machine learning classifier using features that represent the contexts in which they appeared in the text. The trained classifier is applied back to the text to gather more examples iteratively through the process of self-training. The trained classifier is then capable of classifying mentions in unseen text as belonging to the desired entity class or not, based on the contexts in which they appear.
Although the trained named entity detector is good at detecting the presence of entities of the desired class in text, it cannot determine their correct boundaries. In the second phase of our method, called “Boundary Expansion”, the correct boundaries of the entities are determined. This phase is based on a novel idea that utilizes machine learning and UMLS. Training examples for boundary expansion are gathered directly from UMLS and do not require any manual annotations. We also developed a new WordNet-based approach for boundary expansion. Our method was evaluated on three datasets: the SemEval 2014 Task 7 dataset, which has diseases and disorders as the desired entity class; the GENIA dataset, which has proteins, DNAs, RNAs, cell types, and cell lines as the desired entity classes; and the i2b2 dataset, which has problems, tests, and treatments as the desired entity classes. Our method performed well and obtained performance close to supervised methods on the SemEval dataset. On the other datasets, it outperformed an existing unsupervised method on most entity classes. The availability of a list of entity names with their semantic types and a large unannotated corpus are the only requirements for our method to work well. Given these, our method generalizes across different types of entities and different types of biomedical text. Being unsupervised, the method can be easily applied to new NER tasks without needing costly annotations.
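The first phase described in the abstract above (weak labeling through unambiguous exact matching) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the lexicon is a hypothetical toy dictionary rather than real UMLS content, matching is limited to single tokens, and the names `TOY_LEXICON` and `weak_label` are invented for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon mapping a term to its semantic types.
# This stands in for a resource like UMLS; it is not real UMLS data.
TOY_LEXICON: Dict[str, Set[str]] = {
    "pneumonia": {"disease"},
    "aspirin": {"drug"},
    "cold": {"disease", "temperature"},  # ambiguous: more than one type
}


def weak_label(
    tokens: List[str], lexicon: Dict[str, Set[str]], target_types: Set[str]
) -> List[Tuple[int, str, str]]:
    """Label exact lexicon matches as positive or negative examples.

    Positive: the term has exactly one semantic type and it is a target type.
    Negative: none of the term's semantic types is a target type.
    Anything else (ambiguous or not in the lexicon) is left unlabeled.
    """
    labels = []
    for i, token in enumerate(tokens):
        types = lexicon.get(token.lower())
        if types is None:
            continue  # no exact match in the lexicon
        if len(types) == 1 and types <= target_types:
            labels.append((i, token, "positive"))
        elif not (types & target_types):
            labels.append((i, token, "negative"))
    return labels


tokens = "Patient with pneumonia and cold was given aspirin".split()
labels = weak_label(tokens, TOY_LEXICON, target_types={"disease"})
print(labels)
```

Here "cold" is skipped because it has more than one semantic type, which mirrors the requirement that positive examples be unambiguous. The labeled examples would then feed a context-based classifier and the self-training loop, neither of which is shown here.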

    Mining semantics for culturomics: towards a knowledge-based approach

    The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars over recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper provides some ideas, thoughts and first steps towards a new culturomics initiative, based this time on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily, together with older texts being digitized in cultural heritage projects, grows at an accelerating rate. The volumes of text available in digital form have grown far beyond the capacity of human readers, leaving automated semantic processing of the texts as the only realistic option for accessing and using the information contained in them. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for semantic processing of big Swedish text, with a focus on theoretical and methodological advances in extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods.