465 research outputs found

    How to adapt a deep learning model to a new domain: the case of biomedical relation extraction

    In this article, we study the relation extraction problem in Natural Language Processing (NLP) under a domain adaptation setting that uses no external resources. We trained a Deep Learning (DL) model for Relation Extraction (RE) that extracts semantic relations in the biomedical domain. Can this model, however, be applied to other domains? Ideally, the model should adapt automatically so that the same DL network can extract relations across different domains. Fully retraining DL models on short notice is impractical: the models should adapt quickly to new datasets from several domains without delay. Adaptation is therefore crucial for intelligent systems operating in the real world, where changing factors and unanticipated perturbations are common. In this study, we present a detailed analysis of the problem, together with preliminary experiments, results, and their evaluation.
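
    The abstract argues that fully retraining a DL model for each new domain is impractical and that rapid adaptation is needed. Purely as an illustration of one common way to realize this (not necessarily the approach taken in the article), the sketch below freezes a pretrained source-domain encoder and fine-tunes only a small relation-classification head on a target-domain batch; every module, dimension, and label count here is a placeholder.

```python
# Minimal sketch of lightweight domain adaptation for relation extraction.
# Assumption: `pretrained_encoder` is any sentence encoder trained on the
# source (biomedical) domain; only the classification head is retrained.
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, n_relations: int):
        super().__init__()
        self.encoder = encoder                          # pretrained, frozen below
        self.head = nn.Linear(hidden_dim, n_relations)  # small trainable head

    def forward(self, x):
        features = self.encoder(x)                      # (batch, hidden_dim)
        return self.head(features)

# Toy stand-in for a pretrained source-domain encoder.
hidden_dim, n_relations = 128, 5
pretrained_encoder = nn.Sequential(nn.Linear(300, hidden_dim), nn.ReLU())

model = RelationClassifier(pretrained_encoder, hidden_dim, n_relations)

# Freeze the encoder so only the head adapts to the new domain.
for p in model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

# One adaptation step on a tiny, synthetic target-domain batch.
x = torch.randn(16, 300)                  # placeholder sentence features
y = torch.randint(0, n_relations, (16,))  # placeholder relation labels
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```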

    Deep Neural Architectures for End-to-End Relation Extraction

    The rapid pace of scientific and technological advancements has led to a meteoric growth in knowledge, as evidenced by a sharp increase in the number of scholarly publications in recent years. PubMed, for example, archives more than 30 million biomedical articles across various domains and covers a wide range of topics including medicine, pharmacy, biology, and healthcare. Social media and digital journalism have similarly experienced their own accelerated growth in the age of big data. Hence, there is a compelling need for ways to organize and distill the vast, fragmented body of information (often unstructured in the form of natural human language) so that it can be assimilated, reasoned about, and ultimately harnessed. Relation extraction is an important natural language task toward that end. In relation extraction, semantic relationships are extracted from natural human language in the form of (subject, object, predicate) triples such that subject and object are mentions of discrete concepts and predicate indicates the type of relation between them. The difficulty of relation extraction becomes clear when we consider the myriad ways the same relation can be expressed in natural language. Much of the current work in relation extraction assumes that entities are known at extraction time, thus treating entity recognition as an entirely separate and independent task. However, recent studies have shown that entity recognition and relation extraction, when modeled together as interdependent tasks, can lead to overall improvements in extraction accuracy. When modeled in such a manner, the task is referred to as end-to-end relation extraction. In this work, we present four studies that introduce incrementally sophisticated architectures designed to tackle the task of end-to-end relation extraction. In the first study, we present a pipeline approach for extracting protein-protein interactions as affected by particular mutations. The pipeline system makes use of recurrent neural networks for protein detection, lexicons for gene normalization, and convolutional neural networks for relation extraction. In the second study, we show that a multi-task learning framework, with parameter sharing, can achieve state-of-the-art results for drug-drug interaction extraction. At its core, the model uses graph convolutions, with a novel attention-gating mechanism, over dependency parse trees. In the third study, we present a more efficient and general-purpose end-to-end neural architecture designed around the idea of the table-filling paradigm; for an input sentence of length n, all entities and relations are extracted in a single pass of the network in an indirect fashion by populating the cells of a corresponding n-by-n table using metric-based features. We show that this approach excels in both the general English and biomedical domains with extraction times that are up to an order of magnitude faster compared to the prior best. In the fourth and last study, we present an architecture for relation extraction that, in addition to being end-to-end, is able to handle cross-sentence and N-ary relations. Overall, our work contributes to the advancement of modern information extraction by exploring end-to-end solutions that are fast, accurate, and generalizable to many high-value domains.
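
    The third study summarized above frames end-to-end extraction as table filling: for a sentence of n tokens, entities and relations are read off the cells of an n-by-n table. The sketch below is a hedged, simplified illustration of that idea (not the work's actual architecture or its metric-based features): a toy encoder produces token vectors, pairwise concatenations are scored, diagonal cells are interpreted as entity labels and off-diagonal cells as relation labels. All dimensions and label sets are invented for the example.

```python
# Simplified illustration of the table-filling idea for joint NER + RE.
# Diagonal cells (i, i) -> entity label for token i;
# off-diagonal cells (i, j) -> relation label between tokens i and j.
import torch
import torch.nn as nn

n_tokens, emb_dim, n_ent, n_rel = 6, 32, 3, 4    # toy sizes

encoder = nn.GRU(emb_dim, emb_dim, batch_first=True, bidirectional=True)
ent_scorer = nn.Linear(4 * emb_dim, n_ent)       # scores for diagonal cells
rel_scorer = nn.Linear(4 * emb_dim, n_rel)       # scores for off-diagonal cells

tokens = torch.randn(1, n_tokens, emb_dim)       # placeholder token embeddings
h, _ = encoder(tokens)                           # (1, n, 2*emb_dim)

# Build every (i, j) pair by broadcasting, then concatenate the two vectors.
hi = h.unsqueeze(2).expand(-1, -1, n_tokens, -1) # (1, n, n, 2*emb_dim)
hj = h.unsqueeze(1).expand(-1, n_tokens, -1, -1) # (1, n, n, 2*emb_dim)
pairs = torch.cat([hi, hj], dim=-1)              # (1, n, n, 4*emb_dim)

entity_table = ent_scorer(pairs).argmax(-1)      # only the diagonal is used
relation_table = rel_scorer(pairs).argmax(-1)    # off-diagonal cells used

entities = entity_table[0].diagonal()            # predicted entity label per token
print(entities.shape, relation_table.shape)      # torch.Size([6]) torch.Size([1, 6, 6])
```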

    Optimizing text mining methods for improving biomedical natural language processing

    The overwhelming amount and the increasing rate of publication in the biomedical domain make it difficult for life sciences researchers to acquire and maintain all the information that is necessary for their research. PubMed (the primary citation database for the biomedical literature) currently contains over 21 million article abstracts, and more than one million of them were published in 2020 alone. Even though existing article databases provide capable keyword search services, typical everyday queries usually return thousands of relevant articles. For instance, a cancer research scientist may need to acquire a complete list of genes that interact with the BRCA1 (breast cancer 1) gene. The PubMed keyword search for BRCA1 returns over 16,500 article abstracts, making manual inspection of the retrieved documents impractical. Missing even one of the interacting gene partners in this scenario may jeopardize the successful development of a potential new drug or vaccine. Although manually curated databases of biomolecular interactions exist, they are usually not up to date and require notable human effort to maintain. In short, new discoveries are constantly being shared within the community via scientific publishing, but the probability of missing information vital to life sciences research is increasing. In response to this problem, the biomedical natural language processing (BioNLP) research community has emerged. It strives to assist life sciences researchers by building modern language processing and text mining tools that can be applied at large scale to scan the whole publicly available literature and extract, classify, and aggregate the information found within, thus keeping life sciences researchers up to date with recent relevant discoveries and facilitating their research in numerous fields such as molecular biology, biomedical engineering, bioinformatics, genetic engineering, and biochemistry. My research has focused almost exclusively on biomedical relation and event extraction. These foundational information extraction tasks deal with the automatic detection of biological processes, interactions, and relations described in the biomedical literature. More precisely, biomedical relation and event extraction systems can scan through a vast amount of biomedical text and automatically detect and extract the semantic relations of biomedical named entities (e.g. genes, proteins, chemical compounds, and diseases). The structured outputs of such systems (i.e., the extracted relations or events) can be stored as relational databases or molecular interaction networks which can easily be queried, filtered, analyzed, visualized, and integrated with other structured data sources. Extracting biomolecular interactions has always been the primary interest of BioNLP researchers because knowledge of such interactions is crucial in various research areas, including precision medicine, drug discovery, drug repurposing, hypothesis generation, construction and curation of signaling pathways, and protein function and structure prediction. State-of-the-art relation and event extraction methods are based on supervised machine learning and require manually annotated data for training. Manual annotation for the biomedical domain requires domain expertise and is time-consuming. Hence, building information extraction systems from minimal training data is a common situation in the biomedical domain. This demands methods that can make the most of the available training data, and this thesis gathers my research efforts and contributions in that direction.
    It is worth mentioning that biomedical natural language processing has undergone a revolution since I started my research in this field almost ten years ago. As a member of the BioNLP community, I have witnessed the emergence, improvement, and in some cases the disappearance of many methods, each pushing the performance of the previous best method one step further. I can broadly divide the last ten years into three periods. When I started my research, feature-based methods that relied on heavy feature engineering were dominant and popular. Then, significant advancements in hardware technology, as well as several breakthroughs in algorithms and methods, enabled machine learning practitioners to seriously apply artificial neural networks to real-world applications. In this period, convolutional, recurrent, and attention-based neural network models became dominant and superior. Finally, the introduction of transformer-based language representation models such as BERT and GPT impacted the field and resulted in unprecedented performance improvements on many datasets. When reading this thesis, I ask the reader to take this course of history into account and to judge the methods and results by what could have been done in each particular period.
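
    The abstract notes that extracted relations can be stored as molecular interaction networks that are easy to query and filter, using BRCA1 as the running example. A minimal sketch of that downstream representation, with example triples chosen purely for illustration, might look as follows:

```python
# Store extracted (subject, predicate, object) triples as an interaction
# network and query it. The triples below are illustrative examples only.
import networkx as nx

triples = [
    ("BRCA1", "interacts_with", "BARD1"),
    ("BRCA1", "interacts_with", "RAD51"),
    ("TP53", "regulates", "CDKN1A"),
]

G = nx.MultiDiGraph()
for subj, pred, obj in triples:
    G.add_edge(subj, obj, relation=pred)

# Query: which partners does BRCA1 interact with?
partners = [
    obj
    for _, obj, data in G.out_edges("BRCA1", data=True)
    if data["relation"] == "interacts_with"
]
print(partners)  # ['BARD1', 'RAD51']
```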

    A two-stage deep learning approach for extracting entities and relationships from medical texts

    This work presents a two-stage deep learning system for named entity recognition (NER) and relation extraction (RE) from medical texts. These tasks are a crucial step in many natural language understanding applications in the biomedical domain. Automatic medical coding of electronic medical records, automated summarization of patient records, automatic cohort identification for clinical studies, text simplification of health documents for patients, early detection of adverse drug reactions, and automatic identification of risk factors are only a few examples of the many opportunities that text analysis can offer in the clinical domain. In this work, our efforts are primarily directed towards improving the pharmacovigilance process through the automatic detection of drug-drug interactions (DDI) from texts. Moreover, we deal with the semantic analysis of texts containing health information for patients. Our two-stage approach is based on deep learning architectures. Concretely, NER is performed by combining a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF), while RE applies a convolutional neural network (CNN). Since our approach uses very few language resources (only pre-trained word embeddings) and does not exploit any domain resources (such as dictionaries or ontologies), it can easily be extended to support other languages and clinical applications that require the exploitation of semantic information (concepts and relationships) from texts. This work was supported by the Research Program of the Ministry of Economy and Competitiveness, Government of Spain (DeepEMR project TIN2017-87548-C2-1-R).
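
    As a rough illustration of the two-stage design described above, the following sketch wires a Bi-LSTM tagger (the CRF layer is omitted here for brevity; the system described adds one on top) to a CNN relation classifier over pre-trained-style word embeddings. Shapes, label sets, and the plain argmax outputs are simplifications, not the authors' exact configuration.

```python
# Two-stage sketch: Bi-LSTM token tagger for NER (CRF omitted for brevity),
# then a CNN sentence classifier for relation extraction.
import torch
import torch.nn as nn

vocab, emb_dim, hidden, n_tags, n_rels, seq_len = 1000, 100, 64, 5, 4, 20
embeddings = nn.Embedding(vocab, emb_dim)        # stand-in for pre-trained vectors

# Stage 1: NER tagger.
bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
tag_head = nn.Linear(2 * hidden, n_tags)         # per-token entity tag scores

# Stage 2: RE classifier (1-D convolutions over the token sequence).
conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
rel_head = nn.Linear(hidden, n_rels)

sentence = torch.randint(0, vocab, (1, seq_len)) # placeholder token ids
emb = embeddings(sentence)                       # (1, seq_len, emb_dim)

h, _ = bilstm(emb)
tags = tag_head(h).argmax(-1)                    # (1, seq_len) entity tags

feats = torch.relu(conv(emb.transpose(1, 2)))    # (1, hidden, seq_len)
pooled = feats.max(dim=2).values                 # max-pool over the sequence
relation = rel_head(pooled).argmax(-1)           # predicted relation label

print(tags.shape, relation)
```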

    Development of a tool based on deep learning able to classify biomedical literature

    Master's dissertation in Bioinformatics. In the last decades, the scientific community has produced huge amounts of publications on the most varied biomedical topics, making the search for relevant information a difficult task for every researcher. Several approaches have been followed to develop tools that can facilitate this process. For instance, PubMed implemented in 2017 a machine learning model to sort documents by their relevance. Nevertheless, even its authors consider that the system would benefit from the implementation of a deep learning model, which for now requires further study. In this context, a package called BioTMPy was developed in this work to perform document classification of biomedical literature using the Python programming language. The package is divided into different modules that provide the user with functions to read documents in different formats, perform preprocessing and data analysis, and train, optimize, and evaluate machine and deep learning models. The package also provides intuitive pipelines that can be easily adapted to the user's needs, illustrating how to implement complex deep learning models. The developed package was applied to a dataset from a 2019 BioCreative challenge on protein-protein interactions altered by mutations, an important topic for advances in precision medicine. Using this dataset, it was possible to observe a slightly better performance of BioWordVec pre-trained embeddings over GloVe, "pubmed pmc" and "pubmed ncbi" embeddings. With the evaluation of the developed models on the test set, we managed to surpass the challenge's best submission by using a model with BioBERT and a bidirectional LSTM on top, with a difference of 7.25% in average precision, 3.22% in precision, 2.99% in recall and 3.15% in F1-score. A web server was also developed to provide access to the best deep learning model trained in this work. The overall pipeline developed here can be applied to other case studies on different topics, provided there is a set of documents annotated as relevant and non-relevant with which to train the models.
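
    For the model that surpassed the challenge's best submission, the abstract describes BioBERT with a bidirectional LSTM on top for binary relevance classification. The sketch below is a hedged approximation of that kind of stack using Hugging Face Transformers; the checkpoint name, pooling choice, and head sizes are assumptions for illustration, not the dissertation's exact setup.

```python
# BERT-style encoder + bidirectional LSTM head for document relevance
# classification. The checkpoint name below is an assumption for illustration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "dmis-lab/biobert-base-cased-v1.1"   # assumed BioBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = AutoModel.from_pretrained(CHECKPOINT)

class RelevanceClassifier(nn.Module):
    def __init__(self, encoder, lstm_dim=128, n_classes=2):
        super().__init__()
        self.encoder = encoder
        self.lstm = nn.LSTM(encoder.config.hidden_size, lstm_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * lstm_dim, n_classes)

    def forward(self, **inputs):
        hidden = self.encoder(**inputs).last_hidden_state  # (B, T, 768)
        lstm_out, _ = self.lstm(hidden)                     # (B, T, 2*lstm_dim)
        pooled = lstm_out.mean(dim=1)                       # average over tokens
        return self.head(pooled)                            # relevance logits

model = RelevanceClassifier(encoder)
batch = tokenizer(
    ["BRCA1 missense mutation disrupts BARD1 binding."],    # sample abstract text
    return_tensors="pt", truncation=True, padding=True,
)
with torch.no_grad():
    print(model(**batch))   # logits for relevant / non-relevant
```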

    Knowledge-based Biomedical Data Science 2019

    Knowledge-based biomedical data science (KBDS) involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge, often in the form of knowledge graphs. Here we survey the progress made over the last year in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as in approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing, and the expansion of knowledge-based approaches to novel domains, such as Traditional Chinese Medicine and biodiversity. (Manuscript: 43 pages with 3 tables; supplemental material: 43 pages with 3 tables.)