186 research outputs found

    Knowledge-based Biomedical Data Science 2019

    Full text link
    Knowledge-based biomedical data science (KBDS) involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge in computer systems, often in the form of knowledge graphs. Here we survey the progress in the last year in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as on approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing, and the expansion of knowledge-based approaches to novel domains, such as Chinese Traditional Medicine and biodiversity.Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages with 3 table

    Medical Informatics

    Get PDF
    Information technology has been revolutionizing the everyday life of the common man, while medical science has been making rapid strides in understanding disease mechanisms, developing diagnostic techniques and effecting successful treatment regimen, even for those cases which would have been classified as a poor prognosis a decade earlier. The confluence of information technology and biomedicine has brought into its ambit additional dimensions of computerized databases for patient conditions, revolutionizing the way health care and patient information is recorded, processed, interpreted and utilized for improving the quality of life. This book consists of seven chapters dealing with the three primary issues of medical information acquisition from a patient's and health care professional's perspective, translational approaches from a researcher's point of view, and finally the application potential as required by the clinicians/physician. The book covers modern issues in Information Technology, Bioinformatics Methods and Clinical Applications. The chapters describe the basic process of acquisition of information in a health system, recent technological developments in biomedicine and the realistic evaluation of medical informatics

    Doctor of Philosophy

    Get PDF
    dissertationThe objective of this work is to examine the efficacy of natural language processing (NLP) in summarizing bibliographic text for multiple purposes. Researchers have noted the accelerating growth of bibliographic databases. Information seekers using traditional information retrieval techniques when searching large bibliographic databases are often overwhelmed by excessive, irrelevant data. Scientists have applied natural language processing technologies to improve retrieval. Text summarization, a natural language processing approach, simplifies bibliographic data while filtering it to address a user's need. Traditional text summarization can necessitate the use of multiple software applications to accommodate diverse processing refinements known as "points-of-view." A new, statistical approach to text summarization can transform this process. Combo, a statistical algorithm comprised of three individual metrics, determines which elements within input data are relevant to a user's specified information need, thus enabling a single software application to summarize text for many points-of-view. In this dissertation, I describe this algorithm, and the research process used in developing and testing it. Four studies comprised the research process. The goal of the first study was to create a conventional schema accommodating a genetic disease etiology point-of-view, and an evaluative reference standard. This was accomplished through simulating the task of secondary genetic database curation. The second study addressed the development iv and initial evaluation of the algorithm, comparing its performance to the conventional schema using the previously established reference standard, again within the task of secondary genetic database curation. The third and fourth studies evaluated the algorithm's performance in accommodating additional points-of-view in a simulated clinical decision support task. The third study explored prevention, while the fourth evaluated performance for prevention and drug treatment, comparing results to a conventional treatment schema's output. Both summarization methods identified data that were salient to their tasks. The conventional genetic disease etiology and treatment schemas located salient information for database curation and decision support, respectively. The Combo algorithm located salient genetic disease etiology, treatment, and prevention data, for the associated tasks. Dynamic text summarization could potentially serve additional purposes, such as consumer health information delivery, systematic review creation, and primary research. This technology may benefit many user groups

    Exploiting Latent Features of Text and Graphs

    Get PDF
    As the size and scope of online data continues to grow, new machine learning techniques become necessary to best capitalize on the wealth of available information. However, the models that help convert data into knowledge require nontrivial processes to make sense of large collections of text and massive online graphs. In both scenarios, modern machine learning pipelines produce embeddings --- semantically rich vectors of latent features --- to convert human constructs for machine understanding. In this dissertation we focus on information available within biomedical science, including human-written abstracts of scientific papers, as well as machine-generated graphs of biomedical entity relationships. We present the Moliere system, and our method for identifying new discoveries through the use of natural language processing and graph mining algorithms. We propose heuristically-based ranking criteria to augment Moliere, and leverage this ranking to identify a new gene-treatment target for HIV-associated Neurodegenerative Disorders. We additionally focus on the latent features of graphs, and propose a new bipartite graph embedding technique. Using our graph embedding, we advance the state-of-the-art in hypergraph partitioning quality. Having newfound intuition of graph embeddings, we present Agatha, a deep-learning approach to hypothesis generation. This system learns a data-driven ranking criteria derived from the embeddings of our large proposed biomedical semantic graph. To produce human-readable results, we additionally propose CBAG, a technique for conditional biomedical abstract generation

    Clinical information extraction for preterm birth risk prediction

    Get PDF
    This paper contributes to the pursuit of leveraging unstructured medical notes to structured clinical decision making. In particular, we present a pipeline for clinical information extraction from medical notes related to preterm birth, and discuss the main challenges as well as its potential for clinical practice. A large collection of medical notes, created by staff during hospitalizations of patients who were at risk of delivering preterm, was gathered and analyzed. Based on an annotated collection of notes, we trained and evaluated information extraction components to discover clinical entities such as symptoms, events, anatomical sites and procedures, as well as attributes linked to these clinical entities. In a retrospective study, we show that these are highly informative for clinical decision support models that are trained to predict whether delivery is likely to occur within specific time windows, in combination with structured information from electronic health records

    Essays on Health Information Technology: Insights from Analyses of Big Datasets

    Get PDF
    The current dissertation provides an examination of health information technology (HIT) by analyzing big datasets. It contains two separate essays focused on: (1) the evolving intellectual structure of the healthcare informatics (HI) and healthcare IT (HIT) scholarly communities, and (2) the impact of social support exchange embedded in social interactions on health promotion outcomes associated with online health community use. Overall, this dissertation extends current theories by applying a unique combination of methods (natural language processing, machine learning, social network analysis, and structural equation modeling etc.) to the analyses of primary datasets. The goal of the first study is to obtain a full understanding of the underlying dynamics of the intellectual structures of HI and its sub-discipline HIT. Using multiple statistical methods including citation and co-citation analysis, social network analysis (SNA), and latent semantic analysis (LSA), this essay shows how HIT research has emerged in IS journals and distinguished itself from the larger HI context. The research themes, intellectual leadership, cohesion of these themes and networks of researchers, and journal presence revealed in our longitudinal intellectual structure analyses foretell how, in particular, these HI and HIT fields have evolved to date and also how they could evolve in the future. Our findings identify which research streams are central (versus peripheral) and which are cohesive (as opposed to disparate). Suggestions for vibrant areas of future research emerge from our analysis. The second part of the dissertation focuses on comprehensively understanding the effect of social support exchange in online health communities on individual members’ health promotion outcomes. This study examines the effectiveness of online consumer-to-consumer social support exchange on health promotion outcomes via analyses of big health data. Based on previous research, we propose a conceptual framework which integrates social capital theory and social support theory in the context of online health communities and test it through a quantitative field study and multiple analyses of a big online health community dataset. Specifically, natural language processing and machine learning techniques are utilized to automate content analysis of digital trace data. This research not only extends current theories of social support exchange in online health communities, but also sheds light on the design and management of such communities

    Knowledge extraction from unstructured data

    Get PDF
    Data availability is becoming more essential, considering the current growth of web-based data. The data available on the web are represented as unstructured, semi-structured, or structured data. In order to make the web-based data available for several Natural Language Processing or Data Mining tasks, the data needs to be presented as machine-readable data in a structured format. Thus, techniques for addressing the problem of capturing knowledge from unstructured data sources are needed. Knowledge extraction methods are used by the research communities to address this problem; methods that are able to capture knowledge in a natural language text and map the extracted knowledge to existing knowledge presented in knowledge graphs (KGs). These knowledge extraction methods include Named-entity recognition, Named-entity Disambiguation, Relation Recognition, and Relation Linking. This thesis addresses the problem of extracting knowledge over unstructured data and discovering patterns in the extracted knowledge. We devise a rule-based approach for entity and relation recognition and linking. The defined approach effectively maps entities and relations within a text to their resources in a target KG. Additionally, it overcomes the challenges of recognizing and linking entities and relations to a specific KG by employing devised catalogs of linguistic and domain-specific rules that state the criteria to recognize entities in a sentence of a particular language, and a deductive database that encodes knowledge in community-maintained KGs. Moreover, we define a Neuro-symbolic approach for the tasks of knowledge extraction in encyclopedic and domain-specific domains; it combines symbolic and sub-symbolic components to overcome the challenges of entity recognition and linking and the limitation of the availability of training data while maintaining the accuracy of recognizing and linking entities. Additionally, we present a context-aware framework for unveiling semantically related posts in a corpus; it is a knowledge-driven framework that retrieves associated posts effectively. We cast the problem of unveiling semantically related posts in a corpus into the Vertex Coloring Problem. We evaluate the performance of our techniques on several benchmarks related to various domains for knowledge extraction tasks. Furthermore, we apply these methods in real-world scenarios from national and international projects. The outcomes show that our techniques are able to effectively extract knowledge encoded in unstructured data and discover patterns over the extracted knowledge presented as machine-readable data. More importantly, the evaluation results provide evidence to the effectiveness of combining the reasoning capacity of the symbolic frameworks with the power of pattern recognition and classification of sub-symbolic models

    Knowledge-driven entity recognition and disambiguation in biomedical text

    Get PDF
    Entity recognition and disambiguation (ERD) for the biomedical domain are notoriously difficult problems due to the variety of entities and their often long names in many variations. Existing works focus heavily on the molecular level in two ways. First, they target scientific literature as the input text genre. Second, they target single, highly specialized entity types such as chemicals, genes, and proteins. However, a wealth of biomedical information is also buried in the vast universe of Web content. In order to fully utilize all the information available, there is a need to tap into Web content as an additional input. Moreover, there is a need to cater for other entity types such as symptoms and risk factors since Web content focuses on consumer health. The goal of this thesis is to investigate ERD methods that are applicable to all entity types in scientific literature as well as Web content. In addition, we focus on under-explored aspects of the biomedical ERD problems -- scalability, long noun phrases, and out-of-knowledge base (OOKB) entities. This thesis makes four main contributions, all of which leverage knowledge in UMLS (Unified Medical Language System), the largest and most authoritative knowledge base (KB) of the biomedical domain. The first contribution is a fast dictionary lookup method for entity recognition that maximizes throughput while balancing the loss of precision and recall. The second contribution is a semantic type classification method targeting common words in long noun phrases. We develop a custom set of semantic types to capture word usages; besides biomedical usage, these types also cope with non-biomedical usage and the case of generic, non-informative usage. The third contribution is a fast heuristics method for entity disambiguation in MEDLINE abstracts, again maximizing throughput but this time maintaining accuracy. The fourth contribution is a corpus-driven entity disambiguation method that addresses OOKB entities. The method first captures the entities expressed in a corpus as latent representations that comprise in-KB and OOKB entities alike before performing entity disambiguation.Die Erkennung und Disambiguierung von EntitĂ€ten fĂŒr den biomedizinischen Bereich stellen, wegen der vielfĂ€ltigen Arten von biomedizinischen EntitĂ€ten sowie deren oft langen und variantenreichen Namen, große Herausforderungen dar. Vorhergehende Arbeiten konzentrieren sich in zweierlei Hinsicht fast ausschließlich auf molekulare EntitĂ€ten. Erstens fokussieren sie sich auf wissenschaftliche Publikationen als Genre der Eingabetexte. Zweitens fokussieren sie sich auf einzelne, sehr spezialisierte EntitĂ€tstypen wie Chemikalien, Gene und Proteine. Allerdings bietet das Internet neben diesen Quellen eine Vielzahl an Inhalten biomedizinischen Wissens, das vernachlĂ€ssigt wird. Um alle verfĂŒgbaren Informationen auszunutzen besteht der Bedarf weitere Internet-Inhalte als zusĂ€tzliche Quellen zu erschließen. Außerdem ist es auch erforderlich andere EntitĂ€tstypen wie Symptome und Risikofaktoren in Betracht zu ziehen, da diese fĂŒr zahlreiche Inhalte im Internet, wie zum Beispiel Verbraucherinformationen im Gesundheitssektor, relevant sind. Das Ziel dieser Dissertation ist es, Methoden zur Erkennung und Disambiguierung von EntitĂ€ten zu erforschen, die alle EntitĂ€tstypen in Betracht ziehen und sowohl auf wissenschaftliche Publikationen als auch auf andere Internet-Inhalte anwendbar sind. DarĂŒber hinaus setzen wir Schwerpunkte auf oft vernachlĂ€ssigte Aspekte der biomedizinischen Erkennung und Disambiguierung von EntitĂ€ten, nĂ€mlich Skalierbarkeit, lange Nominalphrasen und fehlende EntitĂ€ten in einer Wissensbank. In dieser Hinsicht leistet diese Dissertation vier HauptbeitrĂ€ge, denen allen das Wissen von UMLS (Unified Medical Language System), der grĂ¶ĂŸten und wichtigsten Wissensbank im biomedizinischen Bereich, zu Grunde liegt. Der erste Beitrag ist eine schnelle Methode zur Erkennung von EntitĂ€ten mittels Lexikonabgleich, welche den Durchsatz maximiert und gleichzeitig den Verlust in Genauigkeit und Trefferquote (precision and recall) balanciert. Der zweite Beitrag ist eine Methode zur Klassifizierung der semantischen Typen von Nomen, die sich auf gebrĂ€uchliche Nomen von langen Nominalphrasen richtet und auf einer selbstentwickelten Sammlung von semantischen Typen beruht, die die Verwendung der Nomen erfasst. Neben biomedizinischen können diese Typen auch nicht-biomedizinische und allgemeine, informationsarme Verwendungen behandeln. Der dritte Beitrag ist eine schnelle Heuristikmethode zur Disambiguierung von EntitĂ€ten in MEDLINE Kurzfassungen, welche den Durchsatz maximiert, aber auch die Genauigkeit erhĂ€lt. Der vierte Beitrag ist eine korpusgetriebene Methode zur Disambiguierung von EntitĂ€ten, die speziell fehlende EntitĂ€ten in einer Wissensbank behandelt. Die Methode wandelt erst die EntitĂ€ten, die in einem Textkorpus ausgedrĂŒckt aber nicht notwendigerweise in einer Wissensbank sind, in latente Darstellungen um und fĂŒhrt anschließend die Disambiguierung durch

    Mapping of electronic health records in Spanish to the unified medical language system metathesaurus

    Get PDF
    [EN] This work presents a preliminary approach to annotate Spanish electronic health records with concepts of the Unified Medical Language System Metathesaurus. The prototype uses Apache Lucene R to index the Metathesaurus and generate mapping candidates from input text. In addition, it relies on UKB to resolve ambiguities. The tool has been evaluated by measuring its agreement with MetaMap in two English-Spanish parallel corpora, one consisting of titles and abstracts of papers in the clinical domain, and the other of real electronic health record excerpts.[EU] Lan honetan, espainieraz idatzitako mediku-txosten elektronikoak Unified Medical Languge System Metathesaurus deituriko terminologia biomedikoarekin etiketatzeko lehen urratsak eman dira. Prototipoak Apache Lucene R erabiltzen du Metathesaurus-a indexatu eta mapatze hautagaiak sortzeko. Horrez gain, anbiguotasunak UKB bidez ebazten ditu. Ebaluazioari dagokionez, prototipoaren eta MetaMap-en arteko adostasuna neurtu da bi ingelera-gaztelania corpus paralelotan. Corpusetako bat artikulu zientifikoetako izenburu eta laburpenez osatutako dago. Beste corpusa mediku-txosten pasarte batzuez dago osatuta
    • 

    corecore