
    Understanding Patient Safety Reports via Multi-label Text Classification and Semantic Representation

    Medical errors result from problems in health care delivery. One of the key steps toward eliminating errors and improving patient safety is patient safety event reporting. A patient safety report records a number of critical factors involved in care delivery when incidents, near misses, and unsafe conditions occur. Clinicians and risk managers can therefore generate actionable knowledge by harnessing the information in these reports. To date, efforts have been made to establish a nationwide reporting and error-analysis mechanism. The increasing volume of reports has been driving improvement in quantity measures of patient safety; for example, statistical distributions of errors across error types and health care settings have been well documented. Nevertheless, a shift toward quality measures is in high demand. In a health care system, errors are likely to occur when one or more intrinsically associated components (e.g., procedures, equipment) go wrong. However, our understanding of what these components are and how they are connected is limited, for at least two reasons. Firstly, patient safety reports are difficult to analyze in aggregate because they are large in volume and complicated in semantic representation. Secondly, an efficient and clinically valuable mechanism to identify and categorize these components is absent. I strive to make my contribution by investigating the multi-labeled nature of patient safety reports. To facilitate clinical implementation, I propose that machine learning and the semantic information in reports, e.g., semantic similarity between terms, can be used jointly to perform automated multi-label classification. My work is divided into three specific aims. In the first aim, I developed a patient safety ontology to enhance the semantic representation of patient safety reports; the ontology supports a number of applications, including automated text classification. In the second aim, I evaluated multi-label text classification algorithms on patient safety reports. The results identified a set of productive algorithms with balanced predictive power and efficiency. In the third aim, to improve classification performance, I developed a framework that combines semantic similarity with kernel-based multi-label text classification. Semantic similarity values produced by different semantic representation models were evaluated in the classification tasks. Both ontology-based and distributional semantic similarity had a positive influence on classification performance, but the latter was significantly more efficient with respect to computing semantic similarity. Our work provides insight into the nature of patient safety reports: a report can be labeled with the multiple components (e.g., different procedures, settings, error types, and contributing factors) it contains. Multi-labeled reports hold promise for disclosing system vulnerabilities, since they reveal the intrinsically correlated components of health care systems. I demonstrated the effectiveness and efficiency of automated multi-label text classification enriched with semantic similarity information on patient safety reports. The proposed solution holds potential to be incorporated into existing reporting systems, significantly reducing the workload of aggregate report analysis.
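    The joint use of machine learning and multi-label structure described above can be grounded with a minimal baseline: one-vs-rest linear SVMs over TF-IDF features, as commonly built with scikit-learn. The report texts and label names below are invented for illustration, and the sketch omits the dissertation's semantic-similarity kernel, which would enrich or replace the plain TF-IDF feature space.

```python
# Minimal multi-label classification sketch for safety reports (illustrative only):
# one-vs-rest linear SVMs over TF-IDF features. Reports and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

reports = [
    "wrong medication dose administered in the ICU",
    "patient fall near the radiology suite",
    "mislabeled specimen sent to the lab",
]
labels = [
    {"medication_error", "icu"},
    {"fall", "radiology"},
    {"specimen_error", "lab"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)              # binary indicator matrix, one column per label

X = TfidfVectorizer().fit_transform(reports)
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)

print(mlb.inverse_transform(clf.predict(X)))  # label sets predicted per report
```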

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in text, which commonly involves extracting the entire list of chemicals mentioned in a document, along with any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing system performance, in particular the CHEMDNER and CHEMDNER-patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, we also present cheminformatics approaches for mapping extracted chemical names onto chemical structures and annotating them, together with text mining applications for linking chemistry with biological information. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field. A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021, OpenMinTeD). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by the Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia) and FEDER (European Union), and by the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
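    The entity-recognition step surveyed in the Review can be approximated, at its simplest, by dictionary lookup plus suffix heuristics. The lexicon and pattern below are toy assumptions, nowhere near the sequence-labeling systems that performed best in CHEMDNER, but they show the shape of the task.

```python
# Toy chemical mention tagger: dictionary lookup plus a naive suffix heuristic.
# The lexicon and the suffix rule are illustrative assumptions only.
import re

LEXICON = {"aspirin", "ibuprofen", "ethanol"}                    # invented mini-dictionary
SUFFIX = re.compile(r"\b\w+(?:ol|ine|ide|ate|ane)\b", re.IGNORECASE)

def tag_chemicals(text: str) -> list[str]:
    tokens = re.findall(r"\w+", text.lower())
    hits = {t for t in tokens if t in LEXICON}                   # exact dictionary matches
    hits |= {m.group(0).lower() for m in SUFFIX.finditer(text)}  # suffix heuristic
    return sorted(hits)

print(tag_chemicals("Aspirin was dissolved in ethanol with sodium chloride."))
# ['aspirin', 'chloride', 'ethanol']
```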

    Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions

    Phenotyping definitions are essential for cohort identification in clinical research, but they become an obstacle when they are not readily available. Developing new definitions manually requires expert involvement that is labor-intensive, time-consuming, and unscalable. Moreover, automated approaches rely mostly on electronic health records' data, which suffer from bias, confounding, and incompleteness. Limited efforts have been made to utilize text-mining and data-driven approaches to automate the extraction and literature-based knowledge discovery of phenotyping definitions and to support their scalability. In this dissertation, we propose a text-mining pipeline combining rule-based and machine-learning methods to automate the retrieval, classification, and extraction of phenotyping-definition information from the literature. To achieve this, we first developed an annotation guideline with ten dimensions for annotating sentences with evidence of phenotyping-definition modalities, such as phenotypes and laboratory tests. Two annotators manually annotated a corpus of sentences (n=3,971) extracted from the methods sections of full-text observational studies (n=86). Percent agreement and Kappa statistics showed high inter-annotator agreement on sentence-level annotations. Second, we constructed two validated text classifiers using our annotated corpora: one at the abstract level and one at the full-text sentence level. We applied the abstract-level classifier to a large-scale biomedical literature collection of over 20 million abstracts published between 1975 and 2018 to identify positive abstracts (n=459,406). After retrieving their full texts (n=120,868), we extracted sentences from their methods sections and used the full-text sentence-level classifier to extract positive sentences (n=2,745,416). Third, we performed literature-based discovery using the positively classified sentences. Lexicon-based methods were used to recognize medical concepts in these sentences (n=19,423), and co-occurrence and association methods were used to identify and rank phenotype candidates associated with a phenotype of interest. We derived 12,616,465 associations from our large-scale corpus. Our literature-based associations and large-scale corpus contribute to building new data-driven phenotyping definitions and to expanding existing definitions with minimal expert involvement.
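    The co-occurrence and association step can be illustrated with sentence-level counts and pointwise mutual information (PMI). The sentences and concepts below are invented, and the dissertation may rank candidates with other association measures; this is only a sketch of the idea.

```python
# Rank candidate concepts by association with a target phenotype using
# sentence-level co-occurrence and PMI. All sentences and concepts are invented;
# the dissertation derives its associations from millions of classified sentences.
import math
from collections import Counter
from itertools import combinations

sentences = [
    {"diabetes", "hba1c", "metformin"},
    {"diabetes", "hba1c"},
    {"hypertension", "metformin"},
    {"diabetes", "neuropathy"},
]

n = len(sentences)
single = Counter(c for s in sentences for c in s)
pair = Counter(frozenset(p) for s in sentences for p in combinations(sorted(s), 2))

def pmi(a: str, b: str) -> float:
    joint = pair[frozenset((a, b))] / n
    return math.log2(joint / ((single[a] / n) * (single[b] / n))) if joint else float("-inf")

target = "diabetes"
for c in sorted((c for c in single if c != target), key=lambda c: pmi(target, c), reverse=True):
    print(f"{c:15s} PMI={pmi(target, c):+.2f}")
```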

    Neural Representations of Concepts and Texts for Biomedical Information Retrieval

    Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines, where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may maintain preprocessed structured knowledge derived from textual data in so-called knowledge graphs, so that certain types of queries posed as questions can be parsed as such; a response can then be one or more named entities instead of a ranked list of documents (e.g., "what diseases are associated with EGFR mutations?"). This refined setup is often termed question answering (QA) in the IR and natural language processing (NLP) communities. In biomedicine and healthcare, specialized corpora are often at play, including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., the cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized in that the query types and the variations in the texts differ from those of general Web documents. For example, scientific articles are more formal, with longer sentences, while clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., "COVID-19 symptoms") to more complex, implicitly fielded queries (e.g., "chemotherapy regimens for stage IV lung cancer patients with ALK mutations"). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR. Representations of documents and queries are at the core of IR methods, and retrieval methodology involves constructing these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword-based indexing of documents (the so-called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword-based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at the phrase/clause/sentence levels), and this has led to dimensionality-reduction methods such as latent semantic indexing, which generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods. This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications.
    In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high-level concepts, with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents likely to contain answers (e.g., factoids, lists) to a question in a QA task. The model is essentially a sentence-matching neural network that learns the relevance of a candidate answer sentence to the given question, parametrized with a bilinear map. In the fourth effort, we present another document reranking approach, tailored to precision medicine use cases, that combines neural query-document matching and faceted text summarization. The main distinction of this effort from the previous ones is the pivot from a query-manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts.
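    The bilinear relevance score named in the third effort can be written as score(q, a) = qᵀWa for question and answer-sentence embeddings q and a. Below is a minimal PyTorch sketch under assumed dimensions and a pairwise hinge loss; it mirrors only the general form, not the dissertation's exact architecture or training setup.

```python
# Minimal bilinear question-answer scorer: score(q, a) = q^T W a, trained so that
# relevant sentences outrank irrelevant ones. Dimensions and the ranking loss
# are illustrative assumptions.
import torch
import torch.nn as nn

class BilinearScorer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)  # the bilinear map

    def forward(self, q: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # q: (batch, dim) question embeddings; a: (batch, dim) sentence embeddings
        return torch.einsum("bi,ij,bj->b", q, self.W, a)

dim = 128
scorer = BilinearScorer(dim)
q = torch.randn(8, dim)                          # pretend sentence encodings
pos, neg = torch.randn(8, dim), torch.randn(8, dim)

# Pairwise hinge loss: a relevant sentence should beat an irrelevant one by a margin.
loss = torch.relu(1.0 - scorer(q, pos) + scorer(q, neg)).mean()
loss.backward()
print(float(loss))
```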

    Biomedical Language Understanding and Extraction (BLUE-TEXT): A Minimal Syntactic, Semantic Method

    Clinical text understanding (CTU) is of interest to health informatics because critical clinical information, frequently represented as unconstrained text in electronic health records, is extensively used by human experts to guide clinical practice and decision making and to document the delivery of care, yet is largely unusable by information systems for queries and computations. Recent initiatives advocating for translational research call for technologies that can integrate structured clinical data with unstructured data, provide a unified interface to all data, and contextualize clinical information for reuse in the multidisciplinary and collaborative environment envisioned by the CTSA program. This implies that technologies for the processing and interpretation of clinical text should be evaluated not only in terms of their validity and reliability in their intended environment, but also in light of their interoperability and their ability to support information integration and contextualization in a distributed and dynamic environment. This vision adds a new layer of information-representation requirements that needs to be accounted for when conceptualizing the implementation or acquisition of clinical text processing tools and technologies for multidisciplinary research. On the other hand, electronic health records frequently contain unconstrained clinical text with high variability in the use of terms and documentation practices, and without commitment to the grammatical or syntactic structure of the language (e.g., triage notes, physician and nurse notes, chief complaints). This hinders the performance of natural language processing technologies, which typically rely heavily on the syntax and grammatical structure of the text. This document introduces our method for transforming unconstrained clinical text found in electronic health information systems into a formal (computationally understandable) representation that is suitable for querying, integration, contextualization, and reuse, and that is resilient to the grammatical and syntactic irregularities of clinical text. We present our design rationale, our method, and the results of an evaluation in processing chief complaints and triage notes from eight emergency departments in Houston, Texas. Finally, we discuss the significance of our contribution in enabling the use of clinical text in a practical bio-surveillance setting.

    Semantic resources in pharmacovigilance: a corpus and an ontology for drug-drug interactions

    Nowadays, with the increasing use of several drugs for the treatment of one or more diseases (polytherapy) in large populations, the risk of drug combinations that have not been studied in pre-authorization clinical trials has increased. This provides a favourable setting for the occurrence of drug-drug interactions (DDIs), a common adverse drug reaction (ADR) that represents an important risk to patient safety and an increase in healthcare costs. Their early detection is, therefore, a main concern in the clinical setting. Although different databases support healthcare professionals in the detection of DDIs, the quality of these databases is very uneven, and the consistency of their content is limited. Furthermore, these databases do not scale well to the large and growing volume of pharmacovigilance literature of recent years. In addition, large amounts of current and valuable information are hidden in published articles, scientific journals, books, and technical reports. The large number of DDI information sources has thus overwhelmed most healthcare professionals, because it is not possible to remain up to date on everything published about DDIs. Computational methods can play a key role in the identification, explanation, and prediction of DDIs on a large scale, since they can be used to collect, analyze, and manipulate large amounts of biological and pharmacological data. Natural language processing (NLP) techniques can be used to retrieve and extract DDI information from pharmacological texts, supporting researchers and healthcare professionals in the challenging task of searching for DDI information among different and heterogeneous sources. However, these methods rely on the availability of specific resources providing the domain knowledge, such as databases, terminological vocabularies, corpora, and ontologies, which are necessary to address information extraction (IE) tasks. In this thesis, we have developed two semantic resources for the DDI domain that make an important contribution to the research and development of IE systems for DDIs. We reviewed and analyzed the existing corpora and ontologies relevant to this domain and, based on their strengths and weaknesses, developed the DDI corpus and an ontology for drug-drug interactions (named DINTO). The DDI corpus has proven to fulfil the characteristics of a high-quality gold standard, and has demonstrated its usefulness as a benchmark for the training and testing of different IE systems in the SemEval-2013 DDIExtraction shared task. Meanwhile, DINTO has been used and evaluated in two different applications. Firstly, it has been shown that the knowledge represented in the ontology can be used to infer DDIs and their different mechanisms. Secondly, we have provided a proof of concept of the contribution of DINTO to NLP by supplying the domain knowledge exploited by a pilot IE prototype. From these results, we believe that these two semantic resources will encourage further research into the application of computational methods to the early detection of DDIs.
    This work has been partially supported by the Regional Government of Madrid under the Research Network MA2VICMR [S2009/TIC-1542], by the Spanish Ministry of Education under the project MULTIMEDICA [TIN2010-20644-C03-01], and by the European Commission Seventh Framework Programme under the TrendMiner project [FP7-ICT-287863].
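    The kind of mechanism-level inference that DINTO enables can be caricatured with a single pharmacokinetic rule: if drug A inhibits an enzyme that metabolizes drug B, flag a potential DDI. The facts below are simplified textbook examples, not DINTO axioms, and real ontology reasoning would operate over OWL classes rather than Python dictionaries.

```python
# Toy rule-based DDI inference over drug-enzyme relations, mimicking the kind of
# reasoning an ontology like DINTO supports. Facts are simplified textbook examples.
inhibits = {"fluoxetine": {"CYP2D6"}, "ketoconazole": {"CYP3A4"}}
metabolized_by = {"codeine": {"CYP2D6"}, "simvastatin": {"CYP3A4"}}

def potential_ddis():
    # If A inhibits an enzyme that metabolizes B, A may raise exposure to B.
    for a, enzymes in inhibits.items():
        for b, routes in metabolized_by.items():
            for e in enzymes & routes:
                yield (a, b, f"{a} inhibits {e}, which metabolizes {b}")

for a, b, why in potential_ddis():
    print(f"potential DDI: {a} + {b} ({why})")
```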

    Preface


    Vaccine semantics : Automatic methods for recognizing, representing, and reasoning about vaccine-related information

    Post-marketing management of and decision-making about vaccines build on the early detection of safety concerns and changes in public sentiment, accurate access to established evidence, and the ability to promptly quantify effects and verify hypotheses about vaccine benefits and risks. A variety of resources provide relevant information, but they use different representations, which makes rapid evidence generation and extraction challenging. This thesis presents automatic methods for interpreting heterogeneously represented vaccine information. Part I evaluates automatic information-recognition methods for monitoring vaccine adverse events and public sentiment in social media messages. Parts II and III develop and evaluate automatic methods and res

    Network-driven strategies to integrate and exploit biomedical data

    In the quest for understanding complex biological systems, the scientific community has been delving into protein, chemical, and disease biology, populating biomedical databases with a wealth of data and knowledge. The field of biomedicine has now entered a Big Data era, in which computation-driven research can largely benefit from existing knowledge to better understand and characterize biological and chemical entities. And yet, the heterogeneity and complexity of biomedical data trigger the need for proper integration and representation of this knowledge, so that it can be exploited effectively and efficiently. In this thesis, we aim at developing new strategies to leverage current biomedical knowledge, so that meaningful information can be extracted and fused into downstream applications. To this end, we have capitalized on network analysis algorithms to integrate and exploit biomedical data in a wide variety of scenarios, providing a better understanding of pharmaco-omics experiments while helping to accelerate the drug discovery process. More specifically, we have (i) devised an approach to identify functional gene sets associated with the mechanisms of action underlying drug response, (ii) created a resource of biomedical descriptors able to anticipate cellular drug response and identify new drug repurposing opportunities, (iii) designed a tool to annotate biomedical support for a given set of experimental observations, and (iv) reviewed different chemical and biological descriptors relevant to drug discovery, illustrating how they can be used to provide solutions to current challenges in biomedicine.