16 research outputs found

    The added value of text from Dutch general practitioner notes in predictive modeling

    Objective: This work aims to explore the value of Dutch unstructured data, in combination with structured data, for the development of prognostic prediction models in a general practitioner (GP) setting. Materials and methods: We trained and validated prediction models for 4 common clinical prediction problems using various sparse text representations, common prediction algorithms, and observational GP electronic health record (EHR) data. We trained and validated 84 models internally and externally on data from different EHR systems. Results: On average, over all the different text representations and prediction algorithms, models using only text data performed better than or similar to models using structured data alone in 2 prediction tasks. Additionally, in these 2 tasks, the combination of structured and text data outperformed models using structured or text data alone. No large performance differences were found between the different text representations and prediction algorithms. Discussion: Our findings indicate that the use of unstructured data alone can result in well-performing prediction models for some clinical prediction problems. Furthermore, the performance improvement achieved by combining structured and text data highlights the added value of text. Additionally, we demonstrate the significance of clinical natural language processing research in languages other than English and the possibility of validating text-based prediction models across various EHR systems. Conclusion: Our study highlights the potential benefits of incorporating unstructured data in clinical prediction models in a GP setting. Although the added value of unstructured data may vary depending on the specific prediction task, our findings suggest that it has the potential to enhance patient care.
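The abstract does not say which sparse text representations were used, so the following is only a minimal sketch of one common option: a bag-of-words count vector concatenated with structured features. The function names, the example notes, and the three structured values are all hypothetical and are not taken from the study.

```python
from collections import Counter


def build_vocabulary(notes):
    """Map each distinct lowercased token to a column index, in order of first appearance."""
    vocab = {}
    for note in notes:
        for token in note.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(note, vocab):
    """Sparse count vector {column index: count}; tokens outside the vocabulary are dropped."""
    counts = Counter(t for t in note.lower().split() if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}


def combine(structured, text_vector, vocab_size):
    """Append structured features after the text columns, keeping the vector sparse."""
    combined = dict(text_vector)
    for i, value in enumerate(structured):
        if value != 0:
            combined[vocab_size + i] = value
    return combined


# Hypothetical training notes (illustrative only, not real patient data).
training_notes = ["patient hoest koorts", "patient hoest"]
vocab = build_vocabulary(training_notes)

# Hypothetical new note plus three made-up structured features.
text_vector = bag_of_words("hoest hoest koorts moe", vocab)
combined = combine([67.0, 0.0, 1.0], text_vector, len(vocab))
print(combined)
```

A vector built this way can be passed to any classifier that accepts sparse input, which is one way to compare text-only, structured-only, and combined models on the same task.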

    Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation

    BACKGROUND: High-throughput experiments, such as those with DNA microarrays, typically yield hundreds of genes potentially relevant to the process under study, which makes these experiments difficult to interpret. Here, we propose and evaluate an approach to find functional associations between large numbers of genes and other biomedical concepts from free-text literature. For each gene, a profile of related concepts is constructed that summarizes the context in which the gene is mentioned in the literature. We assign a weight to each concept in the profile based on a likelihood ratio measure. Gene concept profiles can then be clustered to find related genes and other concepts. RESULTS: The experimental validation was done in two steps. We first applied our method to a controlled test set. After this proved successful, the datasets from two DNA microarray experiments were analyzed in the same way and the results were evaluated by domain experts. The first dataset was a gene-expression profile that characterizes the cancer cells of a group of acute myeloid leukemia patients. For this group of patients, the biological background of the cancer cells is largely unknown. Using our methodology, we found an association of these cells with monocytes, which agreed with other experimental evidence. The second dataset consisted of differentially expressed genes following androgen receptor stimulation in a prostate cancer cell line. Based on the analysis, we put forward a hypothesis about the biological processes induced in the studied cells: secretory lysosomes are involved in the production of prostatic fluid, and their development and/or secretion are androgen-regulated processes. CONCLUSION: Our method can be used to analyze DNA microarray datasets based on information explicitly and implicitly available in the literature. We provide a publicly available tool, dubbed Anni, for this purpose.
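The abstract names only "a likelihood ratio measure" for weighting concepts. As an illustration, and not necessarily the authors' exact formula, the sketch below uses the log-likelihood ratio statistic (G²) on a 2x2 co-occurrence table of documents. The function names, concept labels, and document IDs are hypothetical.

```python
import math


def _xlogx(k):
    """k * ln(k), with the convention 0 * ln(0) = 0."""
    return k * math.log(k) if k > 0 else 0.0


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of document counts.

    k11: documents mentioning both gene and concept
    k12: gene only, k21: concept only, k22: neither
    """
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    total = _xlogx(k11 + k12 + k21 + k22)
    return 2.0 * (cells - rows - cols + total)


def concept_profile(gene_docs, concept_to_docs, n_docs):
    """Weight every concept by how strongly it co-occurs with the gene."""
    profile = {}
    for concept, docs in concept_to_docs.items():
        k11 = len(gene_docs & docs)
        k12 = len(gene_docs - docs)
        k21 = len(docs - gene_docs)
        k22 = n_docs - k11 - k12 - k21
        profile[concept] = log_likelihood_ratio(k11, k12, k21, k22)
    return profile


# Hypothetical corpus of 8 documents, identified by integer IDs.
gene_docs = {1, 2, 3, 4}
concept_to_docs = {
    "concept_a": {1, 2, 3, 4},  # always mentioned together with the gene
    "concept_b": {2, 6},        # co-occurs exactly as often as chance predicts
}
profile = concept_profile(gene_docs, concept_to_docs, n_docs=8)
print(profile)
```

Note that G² measures departure from independence in either direction, so a real system would also need to check that the observed co-occurrence exceeds the expected count before treating a high score as a positive association.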

    SEMCARE: Multilingual Semantic Search in Semi-Structured Clinical Data.

    The vast amount of clinical data in electronic health records offers great potential for secondary use. However, most of this content consists of unstructured or semi-structured texts, which are difficult to process. Several challenges remain, including medical language idiosyncrasies in different natural languages and the large variety of medical terminology systems. In this paper we present SEMCARE, a European initiative designed to minimize these problems by providing a multilingual platform (English, German, and Dutch) that allows users to express complex queries and obtain relevant search results from clinical texts. SEMCARE is based on a selection of adapted biomedical terminologies, together with Apache UIMA and Apache Solr as open source state-of-the-art natural language pipeline and indexing technologies. SEMCARE has been deployed and is currently being tested at three medical institutions in the UK, Austria, and the Netherlands, showing promising results in a cardiology use case.

    Drug-induced acute myocardial infarction: identifying 'prime suspects' from electronic healthcare records-based surveillance system.

    BACKGROUND: Drug-related adverse events remain an important cause of morbidity and mortality and impose a huge burden on healthcare costs. Routinely collected electronic healthcare data give a good snapshot of how drugs are being used in 'real-world' settings. OBJECTIVE: To describe a strategy that identifies potentially drug-induced acute myocardial infarction (AMI) from a large international healthcare data network. METHODS: Post-marketing safety surveillance was conducted in seven population-based healthcare databases in three countries (Denmark, Italy, and the Netherlands) using anonymised demographic, clinical, and prescription/dispensing data representing 21,171,291 individuals with 154,474,063 person-years of follow-up in the period 1996-2010. Primary care physicians' medical records and administrative claims containing reimbursements for filled prescriptions, laboratory tests, and hospitalisations were evaluated using a three-tier triage system of detection, filtering, and substantiation that generated a list of drugs potentially associated with AMI. The outcome of interest was a statistically significant increased risk of AMI during drug exposure that has not been previously described in the current literature and is biologically plausible. RESULTS: Overall, 163 drugs were identified as associated with an increased risk of AMI during preliminary screening. Of these, 124 drugs were eliminated after adjustment for possible bias and confounding. With subsequent application of criteria for novelty and biological plausibility, an association with AMI remained for nine drugs ('prime suspects'): azithromycin; erythromycin; roxithromycin; metoclopramide; cisapride; domperidone; betamethasone; fluconazole; and megestrol acetate. LIMITATIONS: Although global health status, co-morbidities, and time-invariant factors were adjusted for, residual confounding cannot be ruled out. CONCLUSION: A strategy to identify potentially drug-induced AMI from electronic healthcare data has been proposed that takes into account not only statistical association, but also public health relevance, novelty, and biological plausibility. Although this strategy needs to be further evaluated using other healthcare data sources, the list of 'prime suspects' makes a good starting point for further clinical, laboratory, and epidemiologic investigation.

    Towards creating a new triple store for literature-based discovery

    Literature-based discovery (LBD) is a field of research that aims to discover new knowledge by mining scientific literature. Knowledge bases are commonly used by LBD systems. SemMedDB, created using the SemRep information extraction system, is the most frequently used database in LBD. However, new applications of LBD are emerging that go beyond the scope of SemMedDB. In this work, we propose some new discovery patterns that lie in the domain of Natural Products and that are not covered by the existing databases and tools. Our goal is thus to create a new, extended knowledge base that addresses the limitations of SemMedDB. Our proposed contribution is three-fold: 1) we add types of entities and relations that are of interest for LBD but are not covered by SemMedDB; 2) we plan to leverage full texts of scientific publications, instead of titles and abstracts only; 3) we envisage using the RDF model for our database, in accordance with Semantic Web standards. To create the new database, we plan to build a distantly supervised entity and relation extraction system, employing a neural network/deep learning architecture. We describe the methods and tools we plan to employ.
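To make the idea of a triple store for LBD concrete, here is a minimal sketch in plain Python that stores subject-predicate-object triples and runs the classic "ABC" open-discovery pattern (A relates to B, B relates to C, but A and C are not directly linked). This is an assumption for illustration: it is not the authors' planned system, it does not use an RDF library, it does not implement the new Natural Products patterns the abstract mentions, and every entity and predicate name is hypothetical.

```python
from collections import defaultdict


class TripleStore:
    """In-memory set of (subject, predicate, object) triples, indexed by subject."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))
        self._by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate=None):
        """All objects linked from `subject`, optionally restricted to one predicate."""
        return {
            o
            for (p, o) in self._by_subject.get(subject, ())
            if predicate is None or p == predicate
        }

    def __len__(self):
        return len(self._triples)


def open_discovery(store, a):
    """Return (B, C) pairs where A -> B and B -> C hold but A -> C is not stored."""
    directly_linked = store.objects(a)
    candidates = set()
    for b in directly_linked:
        for c in store.objects(b):
            if c != a and c not in directly_linked:
                candidates.add((b, c))
    return candidates


# Hypothetical triples (placeholder names, not extracted from any literature).
store = TripleStore()
store.add("ex:compoundX", "ex:inhibits", "ex:enzymeY")
store.add("ex:enzymeY", "ex:associated_with", "ex:diseaseZ")
store.add("ex:compoundX", "ex:found_in", "ex:plantW")

hypotheses = open_discovery(store, "ex:compoundX")
print(hypotheses)
```

Indexing by subject keeps the two-hop traversal cheap; a production system following Semantic Web standards would more likely express the same pattern as a SPARQL query over an RDF store.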