
    Rewriting and suppressing UMLS terms for improved biomedical term identification

    Background: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact of the different rules on the number of terms identified in a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms, together with a sample of 100 randomly selected terms, were evaluated for every rule. Results: Five of the nine rewrite rules generated additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms, and seven of the eight suppression rules suppressed only undesired terms. Using the five rewrite rules that passed our evaluation, we identified 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, an increase of 2.8% in the number of terms and of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size; 7,397 terms were suppressed in the corpus. Conclusions: We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is used for biomedical term identification in MEDLINE. A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper.
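The abstract does not spell out the individual rules, but the general shape of a term rewrite rule is easy to illustrate. The sketch below shows two hypothetical rules of the kind such a tool applies: stripping a trailing parenthetical qualifier and undoing a comma inversion. The rule details are my assumption, not the paper's actual nine rules.

```python
import re

def rewrite_term(term):
    """Generate hypothetical rewrite variants of a thesaurus term.

    Two illustrative rules (assumed, not the paper's actual rule set):
    - strip a trailing parenthetical qualifier: "Cold (disease)" -> "Cold"
    - undo a single comma inversion: "Failure, Renal" -> "Renal Failure"
    """
    variants = set()
    # Rule 1: drop a final "(...)" qualifier, if present.
    no_paren = re.sub(r"\s*\([^()]*\)\s*$", "", term)
    if no_paren != term:
        variants.add(no_paren)
    # Rule 2: invert "B, A" into "A B" when exactly one comma occurs.
    if term.count(", ") == 1 and "(" not in term:
        head, tail = term.split(", ")
        variants.add(f"{tail} {head}")
    return variants

print(rewrite_term("Cold (disease)"))   # {'Cold'}
print(rewrite_term("Failure, Renal"))   # {'Renal Failure'}
```

Each variant would then be added as an extra synonym of the original concept before term identification runs over the corpus.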

    Thesaurus-based disambiguation of gene symbols

    BACKGROUND: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. RESULTS: We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set. CONCLUSION: The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications
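A thesaurus-based disambiguator of this kind can be reduced to a simple scoring idea: each candidate sense of a symbol carries a bag of words drawn from its database and MeSH entries, and the sense sharing the most words with the surrounding abstract wins. The profile words and the scoring below are an illustrative assumption, not the paper's exact algorithm.

```python
def disambiguate(symbol, abstract_words, thesaurus):
    """Pick the sense of an ambiguous symbol whose thesaurus profile
    shares the most words with the abstract (hypothetical scoring)."""
    abstract_words = set(w.lower() for w in abstract_words)
    best_sense, best_score = None, -1
    for sense, profile_words in thesaurus.get(symbol, {}).items():
        score = len(abstract_words & set(w.lower() for w in profile_words))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Invented thesaurus entry: "PSA" as a gene vs. a non-gene meaning.
thesaurus = {
    "PSA": {
        "KLK3 (gene)": ["prostate", "antigen", "kallikrein", "serum"],
        "pressure-sensitive adhesive": ["adhesive", "tape", "polymer"],
    }
}
context = "serum prostate specific antigen levels".split()
print(disambiguate("PSA", context, thesaurus))  # KLK3 (gene)
```

Because scoring is a single set intersection per candidate sense, the approach scales to the massive text mining setting the abstract describes.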

    Identification of acute myocardial infarction from electronic healthcare records using different disease coding systems

    Objective: To evaluate positive predictive value (PPV) of different disease codes and free text in identifying acute myocardial infarction (AMI) from electronic healthcare records (EHRs). Design: Validation study of cases of AMI identified from general practitioner records and hospital discharge diagnoses using free text and codes from the International Classification of Primary Care (ICPC), International Classification of Diseases 9th revision-clinical modification (ICD9-CM) and ICD-10th revision (ICD-10). Setting: Population-based databases comprising routinely collected data from primary care in Italy and the Netherlands and from secondary care in Denmark from 1996 to 2009. Participants: A total of 4 034 232 individuals with 22 428 883 person-years of follow-up contributed to the data, from which 42 774 potential AMI cases were identified. A random sample of 800 cases was subsequently obtained for validation. Main outcome measures: PPVs were calculated overall and for each code/free text. 'Best-case scenario' and 'worst-case scenario' PPVs were calculated, the latter taking into account non-retrievable/non-assessable cases. We further assessed the effects of AMI misclassification on estimates of risk during drug exposure. Results: Records of 748 cases (93.5% of sample) were retrieved. ICD-10 codes had a 'best-case scenario' PPV of 100% while ICD9-CM codes had a PPV of 96.6% (95% CI 93.2% to 99.9%). ICPC codes had a 'best-case scenario' PPV of 75% (95% CI 67.4% to 82.6%) and free text had PPV ranging from 20% to 60%. Corresponding PPVs in the 'worst-case scenario' all decreased. Use of codes with lower PPV generally resulted in small changes in AMI risk during drug exposure, but codes with higher PPV resulted in attenuation of risk for positive associations. Conclusions: ICD9-CM and ICD-10 codes have good PPV in identifying AMI from EHRs; strategies are necessary to further optimise utility of ICPC codes and free-text search. 
Use of specific AMI disease codes in estimation of risk during drug exposure may lead to small but significant changes, at the expense of decreased precision.
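The headline metric here, positive predictive value with a 95% confidence interval, is straightforward to compute from validation counts. The sketch below uses a normal-approximation interval and invented counts, not the study's actual numbers.

```python
import math

def ppv_with_ci(true_pos, flagged, z=1.96):
    """Positive predictive value of a case definition with a
    normal-approximation 95% CI: of `flagged` records matching a
    code/free-text rule, `true_pos` were confirmed true cases."""
    p = true_pos / flagged
    se = math.sqrt(p * (1 - p) / flagged)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Illustrative counts only (not the study's data):
p, lo, hi = ppv_with_ci(true_pos=145, flagged=150)
print(f"PPV = {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

The 'worst-case scenario' PPV in the abstract corresponds to keeping non-retrievable cases in the denominator while counting none of them as true positives.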

    Medication-Wide Association Studies

    Undiscovered side effects of drugs can have a profound effect on the health of the nation, and electronic health-care databases offer opportunities to speed up the discovery of these side effects. We applied a “medication-wide association study” approach that combined multivariate analysis with exploratory visualization to study four health outcomes of interest in an administrative claims database of 46 million patients and a clinical database of 11 million patients. The technique had good predictive value, but there was no threshold high enough to eliminate false-positive findings. The visualization not only highlighted the class effects that strengthened the review of specific products but also underscored the challenges in confounding. These findings suggest that observational databases are useful for identifying potential associations that warrant further consideration but are unlikely to provide definitive evidence of causal effects

    Prenatal antidepressant use and risk of attention-deficit/hyperactivity disorder in offspring:population based cohort study

Objective To assess the potential association between prenatal use of antidepressants and the risk of attention-deficit/hyperactivity disorder (ADHD) in offspring. Design Population based cohort study. Setting Data from the Hong Kong population based electronic medical records on the Clinical Data Analysis and Reporting System. Participants 190 618 children born in Hong Kong public hospitals between January 2001 and December 2009 and followed up to December 2015. Main outcome measure Hazard ratio of maternal antidepressant use during pregnancy and ADHD in children aged 6 to 14 years, with an average follow-up time of 9.3 years (range 7.4-11.0 years). Results Among 190 618 children, 1252 had a mother who used prenatal antidepressants. 5659 children (3.0%) were given a diagnosis of ADHD or received treatment for ADHD. The crude hazard ratio of maternal antidepressant use during pregnancy was 2.26 (P<0.01) compared with non-use. After adjustment for potential confounding factors, including maternal psychiatric disorders and use of other psychiatric drugs, the adjusted hazard ratio was reduced to 1.39 (95% confidence interval 1.07 to 1.82, P=0.01). Likewise, similar results were observed when comparing children of mothers who had used antidepressants before pregnancy with those who were never users (1.76, 1.36 to 2.30, P<0.01). The risk of ADHD in the children of mothers with psychiatric disorders was higher compared with the children of mothers without psychiatric disorders even if the mothers had never used antidepressants (1.84, 1.54 to 2.18, P<0.01). All sensitivity analyses yielded similar results. Sibling matched analysis identified no significant difference in risk of ADHD in siblings exposed to antidepressants during gestation and those not exposed during gestation (0.54, 0.17 to 1.74, P=0.30).
Conclusions The findings suggest that the association between prenatal use of antidepressants and risk of ADHD in offspring can be partially explained by confounding by indication of antidepressants. If there is a causal association, the size of the effect is probably smaller than that reported previously
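The gap between the crude hazard ratio (2.26) and the adjusted one (1.39) is the signature of confounding by indication. As a rough illustration of the unadjusted comparison only, a crude incidence rate ratio on invented counts is shown below; the real adjusted estimate requires a Cox model with covariates, which this sketch does not attempt.

```python
def incidence_rate_ratio(cases_exposed, py_exposed,
                         cases_unexposed, py_unexposed):
    """Crude incidence rate ratio: a rough stand-in for an unadjusted
    hazard ratio when event rates are low. All counts are invented
    for illustration, not taken from the study."""
    rate_exposed = cases_exposed / py_exposed
    rate_unexposed = cases_unexposed / py_unexposed
    return rate_exposed / rate_unexposed

# Hypothetical counts shaped to mimic a crude contrast near 2:
print(round(incidence_rate_ratio(78, 11_600, 5_581, 1_760_000), 2))
```

Adjustment shrinks such a crude ratio when the exposed group (here, mothers using antidepressants) already carries a higher baseline risk, exactly the pattern the sibling analysis supports.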

    PatientExploreR: an extensible application for dynamic visualization of patient clinical history from electronic health records in the OMOP common data model.

Motivation: Electronic health records (EHRs) are quickly becoming omnipresent in healthcare, but interoperability issues and technical demands limit their use for biomedical and clinical research. Interactive and flexible software that interfaces directly with EHR data structured around a common data model (CDM) could accelerate more EHR-based research by making the data more accessible to researchers who lack computational expertise and/or domain knowledge. Results: We present PatientExploreR, an extensible application built on the R/Shiny framework that interfaces with a relational database of EHR data in the Observational Medical Outcomes Partnership CDM format. PatientExploreR produces patient-level interactive and dynamic reports and facilitates visualization of clinical data without any programming required. It allows researchers to easily construct and export patient cohorts from the EHR for analysis with other software. This application could enable easier exploration of patient-level data for physicians and researchers. PatientExploreR can incorporate EHR data from any institution that employs the CDM for users with approved access. The software code is free and open source under the MIT license, enabling institutions to install and users to expand and modify the application for their own purposes. Availability and implementation: PatientExploreR can be freely obtained from GitHub: https://github.com/BenGlicksberg/PatientExploreR. We provide instructions for how researchers with approved access to their institutional EHR can use this package. We also release an open sandbox server of synthesized patient data for users without EHR access to explore: http://patientexplorer.ucsf.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

    Knowledge-based biomedical word sense disambiguation: comparison of approaches

    Background: Word sense disambiguation (WSD) algorithms attempt to select the proper sense of ambiguous terms in text. Resources like the UMLS provide a reference thesaurus that can be used to annotate the biomedical literature. Statistical learning approaches have produced good results, but the size of the UMLS makes it infeasible to produce training data covering the whole domain. Methods: We present research on existing WSD approaches based on knowledge bases, which complements the studies performed on statistical learning. We compare four approaches that rely on the UMLS Metathesaurus as the source of knowledge. The first approach compares the overlap of the context of the ambiguous word with the candidate senses, based on a representation built out of the definitions, synonyms and related terms. The second approach collects training data for each of the candidate senses using queries built from monosemous synonyms and related terms; these queries are used to retrieve MEDLINE citations, and a machine learning approach is trained on this corpus. The third approach is a graph-based method that exploits the structure of the Metathesaurus network of relations to perform unsupervised WSD, ranking nodes in the graph according to their relative structural importance. The last approach uses the semantic types assigned to the concepts in the Metathesaurus: the context of the ambiguous word and the semantic types of the candidate concepts are mapped to Journal Descriptors, and these mappings are compared to decide among the candidate concepts. Results are provided estimating the accuracy of the different methods on the WSD test collection available from the NLM. Conclusions: We found that the last approach achieves better results than the other methods. The graph-based approach, which uses the structure of the Metathesaurus network to estimate the relevance of Metathesaurus concepts, does not perform well compared to the first two methods. In addition, the combination of methods improves performance over the individual approaches. On the other hand, performance is still below statistical learning trained on manually produced data and below the maximum frequency sense baseline. Finally, we propose several directions to improve the existing methods and to improve the Metathesaurus to be more effective in WSD.
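The core intuition behind the graph-based approach can be shown in miniature: score each candidate sense by how well it connects to concepts already found in the context. The toy below uses plain direct-neighbor counting rather than the PageRank-style structural ranking the real method applies, and the concept names and edges are invented.

```python
from collections import defaultdict

def rank_by_connectivity(candidates, context_concepts, edges):
    """Toy stand-in for graph-based WSD: score each candidate sense
    by how many relations link it directly to context concepts.
    Real methods rank nodes by PageRank-style structural importance."""
    graph = defaultdict(set)
    for a, b in edges:          # build an undirected relation graph
        graph[a].add(b)
        graph[b].add(a)
    scores = {c: len(graph[c] & set(context_concepts)) for c in candidates}
    return max(scores, key=scores.get)

# Invented mini-network around the ambiguous term "cold":
edges = [("cold_temperature", "weather"), ("common_cold", "virus"),
         ("common_cold", "cough"), ("common_cold", "fever")]
print(rank_by_connectivity(["cold_temperature", "common_cold"],
                           ["cough", "fever"], edges))  # common_cold
```

The appeal of this family of methods is that it needs no labeled training data, only the relation network itself, which is exactly the unsupervised setting the abstract describes.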

    A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data

    Background and objective: As a response to the ongoing COVID-19 pandemic, several prediction models in the existing literature were rapidly developed, with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code). Methods: We show step-by-step how to implement the analytics pipeline for the question: ‘In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?’. We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA. Results: Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated. Conclusion: Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable rapid development of reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers from all around the world.
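The discrimination performance compared across models above is typically the area under the ROC curve, which has a simple rank interpretation: the probability that a randomly chosen case scores higher than a randomly chosen non-case. A minimal pairwise implementation, with invented risk scores, is sketched below.

```python
def auc(scores_pos, scores_neg):
    """Discrimination (AUROC) computed pairwise: the probability that
    a randomly chosen positive outscores a randomly chosen negative,
    counting ties as half. O(n*m), fine for small illustrations."""
    wins = ties = 0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(scores_pos) * len(scores_neg))

# Invented predicted risks for patients who died vs. survived:
print(auc([0.9, 0.7, 0.4], [0.3, 0.2, 0.6]))  # 0.8888888888888888
```

External validation in the pipeline amounts to recomputing this statistic (plus calibration) on databases the model never saw, which is what exposes overfit models.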