1,442 research outputs found

    Neural Representations of Concepts and Texts for Biomedical Information Retrieval

    Get PDF
    Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user\u27s query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., what diseases are associated with EGFR mutations? ). This refined setup is often termed as question answering (QA) in the IR and natural language processing (NLP) communities. In biomedicine and healthcare, specialized corpora are often at play including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized given the types of queries and the variations in the texts are different from that of general Web documents. For example, scientific articles are more formal with longer sentences but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., COVID-19 symptoms ) to more complex implicitly fielded queries (e.g., chemotherapy regimens for stage IV lung cancer patients with ALK mutations ). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR. Representations of documents and queries are at the core of IR methods and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword based indexing of documents (the so called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels) and this has lead to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods. This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high level concepts with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g, factoids, lists) to a question in a QA task. The model is essentially a sentence matching neural network that learns the relevance of a candidate answer sentence to the given question parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use-cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is to pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts

    Relation Prediction over Biomedical Knowledge Bases for Drug Repositioning

    Get PDF
    Identifying new potential treatment options for medical conditions that cause human disease burden is a central task of biomedical research. Since all candidate drugs cannot be tested with animal and clinical trials, in vitro approaches are first attempted to identify promising candidates. Likewise, identifying other essential relations (e.g., causation, prevention) between biomedical entities is also critical to understand biomedical processes. Hence, it is crucial to develop automated relation prediction systems that can yield plausible biomedical relations to expedite the discovery process. In this dissertation, we demonstrate three approaches to predict treatment relations between biomedical entities for the drug repositioning task using existing biomedical knowledge bases. Our approaches can be broadly labeled as link prediction or knowledge base completion in computer science literature. Specifically, first we investigate the predictive power of graph paths connecting entities in the publicly available biomedical knowledge base, SemMedDB (the entities and relations constitute a large knowledge graph as a whole). To that end, we build logistic regression models utilizing semantic graph pattern features extracted from the SemMedDB to predict treatment and causative relations in Unified Medical Language System (UMLS) Metathesaurus. Second, we study matrix and tensor factorization algorithms for predicting drug repositioning pairs in repoDB, a general purpose gold standard database of approved and failed drug–disease indications. The idea here is to predict repoDB pairs by approximating the given input matrix/tensor structure where the value of a cell represents the existence of a relation coming from SemMedDB and UMLS knowledge bases. The essential goal is to predict the test pairs that have a blank cell in the input matrix/tensor based on the shared biomedical context among existing non-blank cells. Our final approach involves graph convolutional neural networks where entities and relation types are embedded in a vector space involving neighborhood information. Basically, we minimize an objective function to guide our model to concept/relation embeddings such that distance scores for positive relation pairs are lower than those for the negative ones. Overall, our results demonstrate that recent link prediction methods applied to automatically curated, and hence imprecise, knowledge bases can nevertheless result in high accuracy drug candidate prediction with appropriate configuration of both the methods and datasets used

    Incorporating Ontological Information in Biomedical Entity Linking of Phrases in Clinical Text

    Get PDF
    Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. Translational application of BEL on clinical notes has enormous potential for augmenting discretely captured data in electronic health records, but the existing paradigm for evaluating BEL systems developed in academia is not well aligned with real-world use cases. In this work, we demonstrate a proof of concept for incorporating ontological similarity into the training and evaluation of BEL systems to begin to rectify this misalignment. This thesis has two primary components: 1) a comprehensive literature review and 2) a methodology section to propose novel BEL techniques to contribute to scientific progress in the field. In the literature review component, I survey the progression of BEL from its inception in the late 80s to present day state of the art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, and outline the technical components that vii comprise BEL systems. In the methodology component, I describe my experiments incorporating ontological information into training a BERT encoder for entity linking

    Mapping of electronic health records in Spanish to the unified medical language system metathesaurus

    Get PDF
    [EN] This work presents a preliminary approach to annotate Spanish electronic health records with concepts of the Unified Medical Language System Metathesaurus. The prototype uses Apache Lucene R to index the Metathesaurus and generate mapping candidates from input text. In addition, it relies on UKB to resolve ambiguities. The tool has been evaluated by measuring its agreement with MetaMap in two English-Spanish parallel corpora, one consisting of titles and abstracts of papers in the clinical domain, and the other of real electronic health record excerpts.[EU] Lan honetan, espainieraz idatzitako mediku-txosten elektronikoak Unified Medical Languge System Metathesaurus deituriko terminologia biomedikoarekin etiketatzeko lehen urratsak eman dira. Prototipoak Apache Lucene R erabiltzen du Metathesaurus-a indexatu eta mapatze hautagaiak sortzeko. Horrez gain, anbiguotasunak UKB bidez ebazten ditu. Ebaluazioari dagokionez, prototipoaren eta MetaMap-en arteko adostasuna neurtu da bi ingelera-gaztelania corpus paralelotan. Corpusetako bat artikulu zientifikoetako izenburu eta laburpenez osatutako dago. Beste corpusa mediku-txosten pasarte batzuez dago osatuta

    Development of a tool based on deep learning able to classify biomedical literature

    Get PDF
    Dissertação de mestrado em BioinformaticsIn the last decades, the scientific community has produced huge amounts of publications about the most varied biomedical topics, making the search for relevant information a really difficult task for every researcher. Some approaches have been followed to develop tools that can facilitate this process. For instance, PubMed implemented in 2017 a Machine Learning model to sort documents by their relevance. Nevertheless, even the authors consider that their system would benefit from the implementation of a Deep Learning model, which for now needs more studies. In this context, a package called BioTMPy1 was developed in this work, to perform document classification of biomedical literature using the Python programming language. The package is divided into different modules to provide to the user functions to read documents in different formats, perform preprocessing and data analysis and to train, optimize and evaluate Machine and Deep learning models. Our package also provides intuitive pipelines that can be easily adapted for the user needs, illustrating how to implement complex deep learning models. The developed package was applied to a dataset from a challenge of the BioCreative forum, from 2019, about protein-protein interactions altered by mutations, an important topic for the advances related to precision medicine. Using this dataset, it was possible to observe a slightly better performance of BioWordVec pre-trained embeddings over GloVe, ”pubmed pmc” and ”pubmed ncbi” embeddings. Also, with the evaluation of the developed models on the test set, we managed to overcome the challenge’s best submission, by using a model with BioBERT and a bidirectional LSTM on top, resulting in a difference of 7.25% for average precision, 3.22% for precision, 2.99% for recall and 3.15% for the f1-score. Also, a web server was developed to provide access to the best Deep Learning model trained in this work. The overall pipeline here developed can be applied to other case studies in different topics, provided there is a set of documents annotated as relevant and non-relevant, allowing to train the models.Nas últimas décadas, a comunidade científica tem produzido uma enorme quantidade de publicações sobre os mais variados tópicos biomédicos, tornando a procura de informação relevante num processo complicado para qualquer investigador. Alguma abordagem tem sido seguidas para desenvolver ferramentas que possam facilitar este processo. Por exemplo, o PubMed implementou em 2017 um modelo de aprendizagem máquina para ordenar documentos pela sua relevância. Contudo, os autores consideram que o seu sistema pode beneficiar com a implementação de um modelo de Deep Learning, o que para já necessita de mais estudos. Neste projeto, foi desenvolvida um package chamado BioTMPy para classificar documentos da literatura biomédica através da linguagem de programação Python. Este package é dividido em diferentes módulos para fornecer ao utilizador funções para ler documentos de formatos diferentes, realizar pré-processamento e análise de dados, e para treinar, otimizar e avaliar modelos de aprendizagem máquina. A plataforma também fornece pipelines intuitivas que podem ser facilmente adaptadas de acordo com as necessidades do utilizador, demonstrando como implementar modelos complexos de Deep Learning. O package desenvolvido foi aplicado a um conjunto de dados de um desafio do fórum BioCreative, de 2019, acerca de interações proteína-proteína alteradas por mutações, um tópico importante para a área da medicina de precisão. Usando este conjunto de dados, consegue-se observar um melhor desempenho dos BioWordVec embeddings pré-treinados em relação a embeddings como GloVe, ”pubmed pmc” e ”pubmed ncbi”. Com os modelos desenvolvidos, foi possível ultrapassar a melhor submissão do challenge, usando um modelo com BioBERT e uma LSTM bidirecional acima, obtendo-se diferenças de 7.25% na precisão média, 3.22% na precisão, 2.99% no recall e 3.15% para o f1 -score. Foi ainda desenvolvido um servidor web de forma a fornecer acesso ao nosso melhor modelo. A plataforma desenvolvida neste trabalho poderá ser aplicável a outros casos de estudo em diferentes tópicos, desde que exista um conjunto de documentos anotado como relevante ou não relevante, que permita treinar os modelos

    The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

    Get PDF
    BACKGROUND: Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. Therefore the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts and recording also the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.RESULTS:A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthew's Correlation Coefficient (MCC) score measured was 0.55 at an accuracy of 89 and the best AUC iP/R was 68. Most ACT teams explored machine learning methods, some of them also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53, the best MCC score 0.55. In case of competitive systems with an acceptable recall (above 35) the macro-averaged precision ranged between 50 and 80, with a maximum F-Score of 55. CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows

    From Fully-Supervised Single-Task to Semi-Supervised Multi-Task Deep Learning Architectures for Segmentation in Medical Imaging Applications

    Get PDF
    Medical imaging is routinely performed in clinics worldwide for the diagnosis and treatment of numerous medical conditions in children and adults. With the advent of these medical imaging modalities, radiologists can visualize both the structure of the body as well as the tissues within the body. However, analyzing these high-dimensional (2D/3D/4D) images demands a significant amount of time and effort from radiologists. Hence, there is an ever-growing need for medical image computing tools to extract relevant information from the image data to help radiologists perform efficiently. Image analysis based on machine learning has pivotal potential to improve the entire medical imaging pipeline, providing support for clinical decision-making and computer-aided diagnosis. To be effective in addressing challenging image analysis tasks such as classification, detection, registration, and segmentation, specifically for medical imaging applications, deep learning approaches have shown significant improvement in performance. While deep learning has shown its potential in a variety of medical image analysis problems including segmentation, motion estimation, etc., generalizability is still an unsolved problem and many of these successes are achieved at the cost of a large pool of datasets. For most practical applications, getting access to a copious dataset can be very difficult, often impossible. Annotation is tedious and time-consuming. This cost is further amplified when annotation must be done by a clinical expert in medical imaging applications. Additionally, the applications of deep learning in the real-world clinical setting are still limited due to the lack of reliability caused by the limited prediction capabilities of some deep learning models. Moreover, while using a CNN in an automated image analysis pipeline, it’s critical to understand which segmentation results are problematic and require further manual examination. To this extent, the estimation of uncertainty calibration in a semi-supervised setting for medical image segmentation is still rarely reported. This thesis focuses on developing and evaluating optimized machine learning models for a variety of medical imaging applications, ranging from fully-supervised, single-task learning to semi-supervised, multi-task learning that makes efficient use of annotated training data. The contributions of this dissertation are as follows: (1) developing a fully-supervised, single-task transfer learning for the surgical instrument segmentation from laparoscopic images; and (2) utilizing supervised, single-task, transfer learning for segmenting and digitally removing the surgical instruments from endoscopic/laparoscopic videos to allow the visualization of the anatomy being obscured by the tool. The tool removal algorithms use a tool segmentation mask and either instrument-free reference frames or previous instrument-containing frames to fill in (inpaint) the instrument segmentation mask; (3) developing fully-supervised, single-task learning via efficient weight pruning and learned group convolution for accurate left ventricle (LV), right ventricle (RV) blood pool and myocardium localization and segmentation from 4D cine cardiac MR images; (4) demonstrating the use of our fully-supervised memory-efficient model to generate dynamic patient-specific right ventricle (RV) models from cine cardiac MRI dataset via an unsupervised learning-based deformable registration field; and (5) integrating a Monte Carlo dropout into our fully-supervised memory-efficient model with inherent uncertainty estimation, with the overall goal to estimate the uncertainty associated with the obtained segmentation and error, as a means to flag regions that feature less than optimal segmentation results; (6) developing semi-supervised, single-task learning via self-training (through meta pseudo-labeling) in concert with a Teacher network that instructs the Student network by generating pseudo-labels given unlabeled input data; (7) proposing largely-unsupervised, multi-task learning to demonstrate the power of a simple combination of a disentanglement block, variational autoencoder (VAE), generative adversarial network (GAN), and a conditioning layer-based reconstructor for performing two of the foremost critical tasks in medical imaging — segmentation of cardiac structures and reconstruction of the cine cardiac MR images; (8) demonstrating the use of 3D semi-supervised, multi-task learning for jointly learning multiple tasks in a single backbone module – uncertainty estimation, geometric shape generation, and cardiac anatomical structure segmentation of the left atrial cavity from 3D Gadolinium-enhanced magnetic resonance (GE-MR) images. This dissertation summarizes the impact of the contributions of our work in terms of demonstrating the adaptation and use of deep learning architectures featuring different levels of supervision to build a variety of image segmentation tools and techniques that can be used across a wide spectrum of medical image computing applications centered on facilitating and promoting the wide-spread computer-integrated diagnosis and therapy data science
    corecore