
    Natural Language Processing and Graph Representation Learning for Clinical Data

    The past decade has witnessed remarkable progress in biomedical informatics and its related fields: the development of high-throughput technologies in genomics, the mass adoption of electronic health records systems, and the AI renaissance largely catalyzed by deep learning. Deep learning has played an undeniably important role in our attempts to reduce the gap between the exponentially growing amount of biomedical data and our ability to make sense of them. In particular, the two main pillars of this dissertation---natural language processing and graph representation learning---have improved our capacity to learn useful representations of language and structured data to an extent previously considered unattainable in such a short time frame. In the context of clinical data, characterized by its notorious heterogeneity and complexity, natural language processing and graph representation learning have begun to enrich our toolkits for making sense and making use of the wealth of biomedical data beyond rule-based systems or traditional regression techniques. This dissertation comes at the cusp of such a paradigm shift, detailing my journey across the fields of biomedical and clinical informatics through the lens of natural language processing and graph representation learning. The takeaway is quite optimistic: despite the many layers of inefficiencies and challenges in the healthcare ecosystem, AI for healthcare is gearing up to transform the world in new and exciting ways

    Neural Representations of Concepts and Texts for Biomedical Information Retrieval

    Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines, where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so-called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., what diseases are associated with EGFR mutations?). This refined setup is often termed question answering (QA) in the IR and natural language processing (NLP) communities. In biomedicine and healthcare, specialized corpora are often at play, including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized because the types of queries and the variations in the texts differ from those of general Web documents. For example, scientific articles are more formal with longer sentences, but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., COVID-19 symptoms) to more complex implicitly fielded queries (e.g., chemotherapy regimens for stage IV lung cancer patients with ALK mutations). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR. Representations of documents and queries are at the core of IR methods, and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword-based indexing of documents (the so-called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword-based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at the phrase/clause/sentence levels), and this has led to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods. This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications.
In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high-level concepts, with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g., factoids, lists) to a question in a QA task. The model is essentially a sentence matching neural network that learns the relevance of a candidate answer sentence to the given question, parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is to pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts.
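
    To make the bilinear matching idea above concrete, the following minimal Python sketch scores candidate answer sentences against a question as q^T W s, where W is a learned matrix. The embedding dimension, random vectors, and pooling are illustrative assumptions rather than the dissertation's actual model or training setup.

        import torch
        import torch.nn as nn

        class BilinearRelevance(nn.Module):
            """Scores a (question, sentence) pair as q^T W s with a learned W."""
            def __init__(self, dim: int):
                super().__init__()
                self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)

            def forward(self, q_vec, s_vec):
                # q_vec, s_vec: (batch, dim) pooled question / candidate-sentence embeddings
                return torch.einsum("bi,ij,bj->b", q_vec, self.W, s_vec)

        # Toy usage: rank three candidate answer sentences for one question.
        dim = 64
        model = BilinearRelevance(dim)
        question = torch.randn(1, dim)
        candidates = torch.randn(3, dim)
        scores = model(question.expand(3, -1), candidates)
        print(scores.argsort(descending=True))  # candidate indices, best first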

    Advances in monolingual and crosslingual automatic disability annotation in Spanish

    Background Unlike diseases, automatic recognition of disabilities has not received the same attention in the area of medical NLP. Progress in this direction is hampered by obstacles like the lack of annotated corpora. Neural architectures learn to translate sequences from spontaneous representations into their corresponding standard representations given a set of samples. The aim of this paper is to present the latest advances in monolingual (Spanish) and crosslingual (from English to Spanish and vice versa) automatic disability annotation. The task consists of identifying disability mentions in medical texts written in Spanish within a collection of abstracts from journal papers related to the biomedical domain. Results In order to carry out the task, we have combined deep learning models that use different embedding granularities for sequence-to-sequence tagging with a simple acronym and abbreviation detection module to boost the coverage. Conclusions Our monolingual experiments demonstrate that a good combination of different word embedding representations provides better results than single representations, significantly outperforming the state of the art in disability annotation in Spanish. Additionally, we have experimented with crosslingual (zero-shot) transfer for disability annotation between English and Spanish, with interesting results that might help overcome the data scarcity bottleneck, which is especially significant for disabilities. This work was partially funded by the Spanish Ministry of Science and Innovation (MCI/AEI/FEDER, UE, DOTT-HEALTH/PAT-MED PID2019-106942RB-C31), the Basque Government (IXA IT1570-22), MCIN/AEI/10.13039/501100011033 and European Union NextGeneration EU/PRTR (DeepR3, TED2021-130295B-C31), and the EU ERA-Net CHIST-ERA and the Spanish Research Agency (ANTIDOTE PCI2020-120717-2).
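
    As a rough illustration of the tagging setup described above, the Python sketch below concatenates word-level embeddings with a stand-in for character-level features and feeds them to a BiLSTM that emits BIO labels for disability mentions. Dimensions, the tag set, and the toy inputs are assumptions, and the acronym/abbreviation module is not shown.

        import torch
        import torch.nn as nn

        class CombinedEmbeddingTagger(nn.Module):
            """BiLSTM tagger over concatenated word- and character-level features."""
            def __init__(self, vocab_size, word_dim=100, char_dim=25, hidden=128, n_tags=3):
                super().__init__()
                self.word_emb = nn.Embedding(vocab_size, word_dim)
                # char_dim stands in for the output of a character-level encoder per token
                self.lstm = nn.LSTM(word_dim + char_dim, hidden,
                                    batch_first=True, bidirectional=True)
                self.classifier = nn.Linear(2 * hidden, n_tags)  # O, B-DIS, I-DIS

            def forward(self, word_ids, char_feats):
                x = torch.cat([self.word_emb(word_ids), char_feats], dim=-1)
                h, _ = self.lstm(x)
                return self.classifier(h)  # per-token tag logits

        # Toy forward pass: a batch of 2 sentences with 7 tokens each.
        model = CombinedEmbeddingTagger(vocab_size=1000)
        word_ids = torch.randint(0, 1000, (2, 7))
        char_feats = torch.randn(2, 7, 25)
        print(model(word_ids, char_feats).shape)  # torch.Size([2, 7, 3])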

    Applying deep learning extreme multi-label classification to the biomedical and multilingual panoramas

    Master's thesis in Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2020. Automatic document indexation is a fundamental step for data organization and information retrieval tasks. Information retrieval can be realized through text mining processes and natural language processing techniques that make natural language understandable to the computer. Nowadays, most solutions applied to these processes use machine learning algorithms. However, thanks to continuous developments in recent years, there has been increasing usage of deep learning solutions in text mining and natural language processing tasks, due to the continuous achievement of better results. One of those techniques is extreme multi-label classification, a natural language processing task that consists of indexing documents with labels from a label set that may contain thousands or even millions of possible labels. This work presents a system developed for the biomedical and multilingual panoramas, based on the adaptation of a deep learning extreme multi-label classification algorithm.
The developed system also combines named entity recognition software with the extreme multi-label classification algorithm in order to improve the label classification of biomedical documents. To test the developed system, I participated in three international challenges focused on the biomedical sciences, namely BioASQ task 8a, BioASQ task MESINESP, and the CANTEMIST CODING subtask. The common goal of these three competitions was the indexation of biomedical documents with labels belonging to a specific biomedical vocabulary. However, while the data in task 8a was in English, in task MESINESP and in CANTEMIST the biomedical data was written in Spanish. In the BioASQ competitions, the system stood out in the precision measures, surpassing most competing systems and achieving 1st place for two consecutive weeks in one evaluation measure of BioASQ task 8a. In the CANTEMIST CODING subtask, the system achieved a score of 0.506 in the most relevant measure.
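
    The sketch below, using toy documents and an invented label vocabulary, illustrates the general shape of the approach: a multi-label classifier scores every label for a document, and labels whose terms are also recognised by a dictionary/NER matcher receive a small boost. It uses a linear scikit-learn model rather than the deep extreme multi-label classifier adapted in this work, and the boost and threshold values are arbitrary.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.multiclass import OneVsRestClassifier
        from sklearn.preprocessing import MultiLabelBinarizer

        docs = [
            "lung cancer patients treated with chemotherapy",
            "influenza vaccination reduces hospital admissions",
            "chemotherapy toxicity in elderly cancer patients",
        ]
        doc_labels = [["Neoplasms", "Drug Therapy"],
                      ["Influenza", "Vaccination"],
                      ["Neoplasms", "Drug Therapy"]]

        vectorizer = TfidfVectorizer()
        mlb = MultiLabelBinarizer()
        X = vectorizer.fit_transform(docs)
        Y = mlb.fit_transform(doc_labels)
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

        def index_document(text, recognised_labels, boost=0.2, threshold=0.3):
            """Combine classifier probabilities with a boost for NER/dictionary hits."""
            probs = clf.predict_proba(vectorizer.transform([text]))[0]
            scores = {label: p + (boost if label in recognised_labels else 0.0)
                      for label, p in zip(mlb.classes_, probs)}
            return sorted((l for l, s in scores.items() if s >= threshold),
                          key=lambda l: -scores[l])

        # "Drug Therapy" pretends to come from a dictionary/NER matcher over the text.
        print(index_document("chemotherapy schedules for cancer care", {"Drug Therapy"}))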

    Harnessing Deep Learning Techniques for Text Clustering and Document Categorization

    This research paper delves into the realm of deep text clustering algorithms with the aim of enhancing the accuracy of document classification. In recent years, the fusion of deep learning techniques and text clustering has shown promise in extracting meaningful patterns and representations from textual data. This paper provides an in-depth exploration of various deep text clustering methodologies, assessing their efficacy in improving document classification accuracy. Delving into the core of deep text clustering, the paper investigates various feature representation techniques, ranging from conventional word embeddings to contextual embeddings furnished by BERT and GPT models. By critically reviewing and comparing these algorithms, we shed light on their strengths, limitations, and potential applications. Through this comprehensive study, we offer insights into the evolving landscape of document analysis and classification, driven by the power of deep text clustering algorithms. Through an original synthesis of existing literature, this research serves as a beacon for researchers and practitioners in harnessing the prowess of deep learning to enhance the accuracy of document classification endeavors.
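
    As a minimal, self-contained illustration of the clustering step discussed above, the sketch below builds dense document vectors (TF-IDF plus SVD here as a stand-in for the BERT/GPT contextual embeddings the paper surveys) and groups them with k-means. The documents and cluster count are toy choices.

        from sklearn.cluster import KMeans
        from sklearn.decomposition import TruncatedSVD
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline

        docs = [
            "deep learning improves image classification accuracy",
            "convolutional networks for object detection in images",
            "interest rates and inflation in the euro area",
            "central banks respond to rising inflation pressures",
        ]

        # Dense document vectors; contextual embeddings (e.g., from BERT) would slot in here.
        embed = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
        vectors = embed.fit_transform(docs)

        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
        for label, doc in sorted(zip(km.labels_, docs)):
            print(label, doc)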

    Enhance Representation Learning of Clinical Narrative with Neural Networks for Clinical Predictive Modeling

    Medicine is undergoing a technological revolution. Understanding human health from clinical data poses major challenges from technical and practical perspectives, thus prompting methods that can handle large, complex, and noisy data. These methods are particularly necessary for natural language data from clinical narratives/notes, which contain some of the richest information on a patient. Meanwhile, deep neural networks have achieved superior performance in a wide variety of natural language processing (NLP) tasks because of their capacity to encode meaningful but abstract representations and learn the entire task end-to-end. In this thesis, I investigate representation learning of clinical narratives with deep neural networks through a number of tasks, spanning clinical concept extraction, clinical note modeling, and patient-level language representation. I present methods utilizing representation learning with neural networks to support understanding of clinical text documents. I first introduce the notion of representation learning from natural language processing and patient data modeling. Then, I investigate word-level representation learning to improve clinical concept extraction from clinical notes. I present two works on learning word representations and evaluate them for extracting important concepts from clinical notes. The first study focuses on cancer-related information, and the second study evaluates shared-task data. The aims of these two studies are to automatically extract important entities from clinical notes. Next, I present a series of deep neural networks to encode hierarchical, longitudinal, and contextual information for modeling a series of clinical notes. I also evaluate the models by predicting clinical outcomes of interest, including mortality, length of stay, and phenotypes. Finally, I propose a novel representation learning architecture to develop a generalized and transferable language representation at the patient level. I also identify pre-training tasks appropriate for constructing a generalizable language representation. The main focus is to improve the predictive performance of phenotyping with limited data, a challenging task due to a lack of data. Overall, this dissertation addresses issues in natural language processing for medicine, including clinical text classification and modeling. These studies highlight major barriers to understanding large-scale clinical notes. It is believed that developing deep representation learning methods for distilling enormous amounts of heterogeneous data into patient-level language representations will improve evidence-based clinical understanding. The approach of solving these issues by learning representations could be used across clinical applications despite noisy data. I conclude that considering different linguistic components in natural language and sequential information between clinical events is important. Such results have implications beyond the immediate context of predictions and further suggest future directions for clinical machine learning research to improve clinical outcomes. This could be a starting point for future phenotyping methods based on natural language processing that construct patient-level language representations to improve clinical predictions. While significant progress has been made, many open questions remain, so I highlight a few works to demonstrate promising directions.
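
    The following sketch illustrates the hierarchical, longitudinal structure described above in a compressed form: each clinical note is encoded from its tokens, the sequence of notes is encoded into a patient-level representation, and an outcome such as in-hospital mortality is predicted. All dimensions and inputs are invented, and this is not the dissertation's architecture.

        import torch
        import torch.nn as nn

        class PatientOutcomeModel(nn.Module):
            """Tokens -> note vectors -> patient vector -> outcome logit."""
            def __init__(self, vocab_size=5000, emb_dim=64, note_dim=64, patient_dim=128):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
                self.note_gru = nn.GRU(emb_dim, note_dim, batch_first=True)         # words -> note
                self.patient_gru = nn.GRU(note_dim, patient_dim, batch_first=True)  # notes -> patient
                self.out = nn.Linear(patient_dim, 1)  # e.g., in-hospital mortality

            def forward(self, token_ids):
                # token_ids: (patients, notes per patient, tokens per note)
                p, n, t = token_ids.shape
                words = self.emb(token_ids.view(p * n, t))
                _, note_h = self.note_gru(words)            # (1, p*n, note_dim)
                notes = note_h.squeeze(0).view(p, n, -1)
                _, patient_h = self.patient_gru(notes)      # (1, p, patient_dim)
                return self.out(patient_h.squeeze(0)).squeeze(-1)

        model = PatientOutcomeModel()
        batch = torch.randint(1, 5000, (2, 3, 20))  # 2 patients, 3 notes, 20 tokens each
        print(torch.sigmoid(model(batch)))          # predicted outcome probabilities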

    A review on Natural Language Processing Models for COVID-19 research

    This survey paper reviews Natural Language Processing Models and their use in COVID-19 research in two main areas. Firstly, a range of transformer-based biomedical pretrained language models are evaluated using the BLURB benchmark. Secondly, models used in sentiment analysis surrounding COVID-19 vaccination are evaluated. We filtered literature curated from various repositories such as PubMed and Scopus and reviewed 27 papers. When evaluated using the BLURB benchmark, the novel T-BPLM BioLinkBERT gives groundbreaking results by incorporating document link knowledge and hyperlinking into its pretraining. Sentiment analysis of COVID-19 vaccination through various Twitter API tools has shown the public’s sentiment towards vaccination to be mostly positive. Finally, we outline some limitations and potential solutions to drive the research community to improve the models used for NLP tasks
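
    For readers unfamiliar with the sentiment-analysis setup the survey covers, the sketch below runs a general-purpose pretrained sentiment classifier over two invented vaccination-related posts via the Hugging Face pipeline API. The reviewed studies use a variety of models and Twitter API tools, so this is only indicative of the workflow.

        from transformers import pipeline

        classifier = pipeline("sentiment-analysis")  # loads a default English sentiment model
        posts = [
            "Got my second COVID-19 vaccine dose today, feeling grateful.",
            "Still worried about possible side effects of the vaccine.",
        ]
        for post, result in zip(posts, classifier(posts)):
            print(f"{result['label']:>8}  {result['score']:.3f}  {post}")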

    Development of a recommendation system for scientific literature based on deep learning

    Master's dissertation in Bioinformatics. The past few decades have seen an enormous volume of articles produced by the scientific community on the most diverse biomedical topics, making it extremely challenging for researchers to find relevant information. Methods like Machine Learning (ML) and Deep Learning (DL) have been used to create tools that can speed up this process. In that context, this work focuses on examining the performance of different ML and DL techniques when classifying biomedical documents, mainly regarding their relevance to given topics. To evaluate the different techniques, the dataset from the BioCreative VI Track 4 challenge was used. The objective of the challenge was to identify documents related to protein-protein interactions altered by mutations, an extremely important topic in precision medicine. Protein-protein interactions play a crucial role in the cellular mechanisms of all living organisms, and mutations at these interaction sites can be indicative of disease. To handle the data used in training, some text processing methods were implemented in the Omnia package from OmniumAI, the host company of this work. Several preprocessing and feature extraction methods were implemented, such as stopword removal and TF-IDF, which may be reused in other case studies with either generic or biomedical text. These methods, in conjunction with ML pipelines already developed by the Omnia team, allowed the training of several traditional ML models. We were able to achieve a small improvement in performance, compared to the challenge baseline, when applying these traditional ML models to the same dataset. Regarding DL, testing with a CNN model made it clear that the BioWordVec pre-trained embedding achieved the best performance of all the pre-trained embeddings. Additionally, we explored the application of more complex DL models, which achieved better performance than the best challenge submission: BioLinkBERT managed an improvement of 0.4 percentage points in precision, 4.9 percentage points in recall, and 2.2 percentage points in F1.
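
    A minimal sketch of the kind of traditional ML baseline described above: TF-IDF features with stop-word removal feeding a linear classifier that flags abstracts as relevant or not to mutation-altered protein-protein interactions. The example abstracts and labels are invented, and the actual Omnia pipelines and tuned models are not reproduced here.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        train_texts = [
            "The R273H mutation disrupts the interaction between p53 and MDM2.",
            "We profile gene expression changes across liver tissue samples.",
            "A point mutation in BRAF alters its binding affinity for MEK1.",
            "This survey reviews imaging techniques for cardiac diagnosis.",
        ]
        train_labels = [1, 0, 1, 0]  # 1 = relevant to mutation-altered protein-protein interactions

        model = make_pipeline(
            TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
            LogisticRegression(max_iter=1000),
        )
        model.fit(train_texts, train_labels)

        test = ["A substitution in EGFR changes its interaction with GRB2."]
        print(model.predict(test), model.predict_proba(test))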

    Detecting Entities in the Astrophysics Literature: A Comparison of Word-based and Span-based Entity Recognition Methods

    Information Extraction from scientific literature can be challenging due to the highly specialised nature of such text. We describe our entity recognition methods developed as part of the DEAL (Detecting Entities in the Astrophysics Literature) shared task. The aim of the task is to build a system that can identify Named Entities in a dataset composed of scholarly articles from the astrophysics literature. We planned our participation such that it enabled us to conduct an empirical comparison between word-based tagging and span-based classification methods. When evaluated on two hidden test sets provided by the organizer, our best-performing submission achieved F1 scores of 0.8307 (validation phase) and 0.7990 (testing phase). Comment: AACL-IJCNLP Workshop on Information Extraction from Scientific Publications (WIESP 2022).
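
    To make the word-based versus span-based distinction concrete, the sketch below decodes a BIO tag sequence into entity spans (the word-based view) and enumerates candidate spans up to a maximum width for independent classification (the span-based view). The sentence, entity types, and maximum width are illustrative and do not reflect the DEAL label set or the trained models.

        tokens = ["The", "Hubble", "Space", "Telescope", "observed", "NGC", "1300"]

        # Word-based view: one BIO label per token, decoded into typed spans.
        bio_tags = ["O", "B-Telescope", "I-Telescope", "I-Telescope",
                    "O", "B-Object", "I-Object"]

        def decode_bio(tags):
            """Turn a BIO tag sequence into (start, end, type) spans."""
            spans, start, etype = [], None, None
            for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes any open span
                if (tag == "O" or tag.startswith("B-")) and start is not None:
                    spans.append((start, i, etype))
                    start, etype = None, None
                if tag.startswith("B-"):
                    start, etype = i, tag[2:]
            return spans

        # Span-based view: enumerate candidate spans up to a maximum width and
        # classify each one independently (the span classifier itself is omitted here).
        def enumerate_spans(n_tokens, max_width=4):
            return [(i, j) for i in range(n_tokens)
                    for j in range(i + 1, min(i + max_width, n_tokens) + 1)]

        print(decode_bio(bio_tags))                         # [(1, 4, 'Telescope'), (5, 7, 'Object')]
        print(len(enumerate_spans(len(tokens))), "candidate spans to classify")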