
    Building Data-Driven Pathways From Routinely Collected Hospital Data: A Case Study on Prostate Cancer

    Background: Routinely collected data in hospitals is complex, typically heterogeneous, and scattered across multiple Hospital Information Systems (HIS). This big data, created as a byproduct of health care activities, has the potential to provide a better understanding of diseases, unearth hidden patterns, improve services, and reduce costs. The extent and uses of such data rely on its quality, which is neither consistently checked nor fully understood. Nevertheless, using routine data for the construction of data-driven clinical pathways, describing processes and trends, is a key topic receiving increasing attention in the literature. Traditional algorithms do not cope well with unstructured processes or data, and do not produce clinically meaningful visualizations. Supporting systems that provide additional information, context, and quality assurance inspection are needed.
    Objective: The objective of the study is to explore how routine hospital data can be used to develop data-driven pathways that describe the journeys patients take through care, and their potential uses in biomedical research. It proposes a framework for the construction, quality assessment, and visualization of patient pathways for clinical studies and decision support, using a case study on prostate cancer.
    Methods: Data pertaining to prostate cancer patients were extracted from eight different HIS at a large UK hospital, validated, and complemented with information from the local cancer registry. Data-driven pathways were built for each of the 1904 patients, and an expert knowledge base, containing rules on the prostate cancer biomarker, was used to assess the completeness and utility of the pathways for a specific clinical study. Software components were built to provide meaningful visualizations of the constructed pathways.
    Results: The proposed framework and pathway formalism enable the summarization, visualization, and querying of complex patient-centric clinical information, as well as the computation of quality indicators and dimensions. A novel graphical representation of the pathways allows the synthesis of such information.
    Conclusions: Clinical pathways built from routinely collected hospital data can unearth information about patients and diseases that may otherwise be unavailable or overlooked in hospitals. Data-driven clinical pathways allow heterogeneous data (ie, semistructured and unstructured data) to be collated over a unified data model and data quality dimensions to be assessed. This work has enabled further research on prostate cancer and its biomarkers, and on the development and application of methods to mine, compare, analyze, and visualize pathways constructed from routine data. This is an important development for the reuse of big data in hospitals.
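    The pathway formalism itself is not given in the abstract, but its core idea, collating time-stamped events from several HIS into one chronologically ordered, patient-centric sequence with computable quality indicators, can be sketched in a few lines. The class and field names below (PathwayEvent, PatientPathway, completeness) are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class PathwayEvent:
    """One time-stamped care event; all field names are hypothetical."""
    when: date
    source: str   # originating HIS, e.g. "pathology", "radiology"
    kind: str     # e.g. "PSA test", "biopsy", "MDT review"
    detail: str = ""

@dataclass
class PatientPathway:
    patient_id: str
    events: list = field(default_factory=list)

    def add(self, event: PathwayEvent) -> None:
        self.events.append(event)
        self.events.sort(key=lambda e: e.when)  # keep chronological order

    def completeness(self, required_kinds: set) -> float:
        """Toy quality indicator: fraction of required event kinds present."""
        present = {e.kind for e in self.events}
        return len(present & required_kinds) / len(required_kinds)

pathway = PatientPathway("anon-0001")
pathway.add(PathwayEvent(date(2014, 3, 2), "pathology", "PSA test", "4.1 ng/mL"))
pathway.add(PathwayEvent(date(2014, 1, 15), "outpatient", "urology referral"))
print(pathway.completeness({"PSA test", "biopsy"}))  # 0.5
```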

    Impact of Selective Mapping Strategies on Automated Laboratory Result Notification to Public Health Authorities

    Automated electronic laboratory reporting (ELR) for public health has many potential advantages, but requires mapping local laboratory test codes to a standard vocabulary such as LOINC. Mapping only the most frequently reported tests provides one way to prioritize the effort and mitigate the resource burden. We evaluated the implications of selective mapping on ELR for public health by comparing reportable conditions from an operational ELR system with the codes in the LOINC Top 2000. Laboratory result codes in the LOINC Top 2000 accounted for 65.3% of the reportable condition volume. However, by also including the 129 most frequent LOINC codes that identified reportable conditions in our system but were not present in the LOINC Top 2000, this set would cover 98% of the reportable condition volume. Our study highlights the ways that our approach to implementing vocabulary standards impacts secondary data uses such as public health reporting.
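    The coverage arithmetic behind these figures is straightforward to reproduce. The sketch below assumes hypothetical per-code report volumes and a stand-in Top 2000 set; a real evaluation would draw volumes from the ELR system's message logs.

```python
from collections import Counter

# Hypothetical per-code report volumes; placeholders, not study data.
volume_by_code = Counter({"600-7": 5400, "5196-1": 2100, "11475-1": 950, "664-3": 40})
loinc_top_2000 = {"600-7", "5196-1"}  # stand-in for the published Top 2000 set

total = sum(volume_by_code.values())
covered = sum(v for code, v in volume_by_code.items() if code in loinc_top_2000)
print(f"Top 2000 coverage: {covered / total:.1%}")

# Extend the map with the most frequent codes outside the Top 2000,
# mirroring the paper's strategy of adding high-volume local mappings.
extras = [c for c, _ in volume_by_code.most_common() if c not in loinc_top_2000][:129]
covered_ext = covered + sum(volume_by_code[c] for c in extras)
print(f"Extended coverage: {covered_ext / total:.1%}")
```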

    Multimodal Machine Learning for Automated ICD Coding

    This study presents a multimodal machine learning model to predict ICD-10 diagnostic codes. We developed separate machine learning models that can handle data from different modalities, including unstructured text, semi-structured text, and structured tabular data. We further employed an ensemble method to integrate all modality-specific models to generate ICD-10 codes. Key evidence was also extracted to make our predictions more convincing and explainable. We used the Medical Information Mart for Intensive Care III (MIMIC-III) dataset to validate our approach. For ICD code prediction, our best-performing model (micro-F1 = 0.7633, micro-AUC = 0.9541) significantly outperforms other baseline models, including TF-IDF (micro-F1 = 0.6721, micro-AUC = 0.7879) and a Text-CNN model (micro-F1 = 0.6569, micro-AUC = 0.9235). For interpretability, our approach achieves a Jaccard Similarity Coefficient (JSC) of 0.1806 on text data and 0.3105 on tabular data, where well-trained physicians achieve 0.2780 and 0.5002, respectively.
    Comment: Machine Learning for Healthcare 201
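    Two ingredients named in the abstract, the modality ensemble and the Jaccard-based interpretability score, reduce to a few lines each. The sketch below assumes equal-weight probability averaging and token-set evidence; both are simplifications for illustration, not the paper's actual method.

```python
import numpy as np

def ensemble(prob_text, prob_semi, prob_tab, weights=(1/3, 1/3, 1/3)):
    """Weighted average of per-modality probabilities for each ICD-10 code."""
    return np.average(np.stack([prob_text, prob_semi, prob_tab]),
                      axis=0, weights=weights)

def jaccard(a: set, b: set) -> float:
    """Jaccard Similarity Coefficient |A & B| / |A | B| over evidence sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Toy usage: two candidate ICD codes, three modality-specific models.
print(ensemble(np.array([0.9, 0.1]), np.array([0.7, 0.2]), np.array([0.8, 0.3])))
print(jaccard({"chest pain", "troponin elevated", "smoker"},
              {"chest pain", "troponin elevated", "st elevation"}))  # 0.5
```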

    Data-driven approach for creating synthetic electronic medical records

    Background: New algorithms for disease outbreak detection are being developed to take advantage of full electronic medical records (EMRs) that contain a wealth of patient information. However, due to privacy concerns, even anonymized EMRs cannot be shared among researchers, resulting in great difficulty in comparing the effectiveness of these algorithms. To bridge the gap between novel bio-surveillance algorithms operating on full EMRs and the lack of non-identifiable EMR data, a method for generating complete and synthetic EMRs was developed.
    Methods: This paper describes a novel methodology for generating complete synthetic EMRs, both for an outbreak illness of interest (tularemia) and for background records. The method has three major steps: 1) synthetic patient identity and basic information generation; 2) identification of care patterns that the synthetic patients would receive, based on the information present in real EMR data for similar health problems; 3) adaptation of these care patterns to the synthetic patient population.
    Results: We generated EMRs, including visit records, clinical activity, laboratory orders/results, and radiology orders/results, for 203 synthetic tularemia outbreak patients. Validation of the records by a medical expert revealed problems in 19% of the records; these were subsequently corrected. We also generated background EMRs for over 3000 patients in the 4-11 year age group. Validation of those records by a medical expert revealed problems in fewer than 3% of these background patient EMRs, and the errors were subsequently rectified.
    Conclusions: A data-driven method was developed for generating fully synthetic EMRs. The method is general and can be applied to any data set that has similar data elements (such as laboratory and radiology orders and results, clinical activity, and prescription orders). The pilot synthetic outbreak records were for tularemia, but our approach may be adapted to other infectious diseases. The pilot synthetic background records were in the 4-11 year age group; the adaptations that must be made to the algorithms to produce synthetic background EMRs for other age groups are indicated.
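    The three-step method might be skeletonized as follows. The care patterns here are hard-coded stand-ins for patterns that the method mines from real EMR data, and every identifier and field name is an illustrative assumption.

```python
import random
from datetime import date, timedelta

random.seed(7)  # reproducible toy example

# Step 1: synthetic patient identity and basic information (all fabricated).
def make_patient(i: int) -> dict:
    return {"id": f"SYN-{i:04d}", "age": random.randint(4, 11),
            "sex": random.choice(["F", "M"])}

# Step 2: care patterns that the real method would identify from actual
# EMRs for similar health problems; hard-coded here for illustration.
CARE_PATTERNS = [
    [("visit", 0), ("cbc_ordered", 0), ("chest_xray", 1), ("cbc_result", 1)],
]

# Step 3: adapt a pattern to the synthetic patient by shifting its day
# offsets onto a chosen onset date.
def make_record(patient: dict, onset: date) -> list:
    pattern = random.choice(CARE_PATTERNS)
    return [{"patient": patient["id"], "event": name,
             "date": onset + timedelta(days=offset)}
            for name, offset in pattern]

print(make_record(make_patient(1), date(2011, 6, 3)))
```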

    Impact of Terminology Mapping on Population Health Cohorts (IMPaCt)

    Background and Objectives: The population health care delivery model uses phenotype algorithms in the electronic health record (EHR) system to identify patient cohorts targeted for clinical interventions such as laboratory tests and procedures. The standard terminology used to identify disease cohorts may contribute to significant variation in error rates for patient inclusion or exclusion. The United States requires EHR systems to support two diagnosis terminologies, the International Classification of Diseases (ICD) and the Systematized Nomenclature of Medicine (SNOMED). Terminology mapping enables the retrieval of diagnosis data using either terminology. There are no standards of practice by which to evaluate and report the operational characteristics of ICD and SNOMED value sets used to select patient groups for population health interventions. Establishing a best practice for terminology selection is a step toward ensuring that the right patients receive the right intervention at the right time. The research question is, "How do the diagnosis retrieval terminology (ICD vs SNOMED) and terminology map maintenance impact population health cohorts?" Aims 1 and 2 explore this question, and Aim 3 informs practice and policy for population health programs.
    Methods: Aim 1, quantify the impact of terminology choice (ICD vs SNOMED): ICD and SNOMED phenotype algorithms for diabetes, chronic kidney disease (CKD), and heart failure were developed using matched sets of codes from the Value Set Authority Center. The performance of the diagnosis-only phenotypes was compared to a published reference standard that included diagnosis codes, laboratory results, procedures, and medications. Aim 2, measure the impact of terminology maintenance on SNOMED cohorts: for each disease state, the performance of a single SNOMED algorithm before and after terminology updates was evaluated against the reference standard to identify and quantify cohort changes introduced by terminology maintenance. Aim 3, recommend methods for improving population health interventions: the socio-technical model for studying health information technology was used to inform best practice for the use of population health interventions.
    Results: For Aim 1, ICD-10 value sets had better sensitivity than SNOMED for diabetes (.829 vs .662) and CKD (.242 vs .225) (N=201,713, p
    For Aim 2, following terminology maintenance, the SNOMED algorithm for diabetes increased in sensitivity from .662 to .683 (p
    For Aim 3, based on observed social and technical challenges to population health programs, including and in addition to the development and measurement of phenotypes, a practical method was proposed for population health intervention development and reporting.
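    Evaluating a diagnosis-only phenotype against a multi-modal reference standard comes down to set arithmetic over patient identifiers. The sketch below uses fabricated toy cohorts; real cohorts would come from EHR queries against the ICD or SNOMED value set and the reference standard.

```python
# Toy cohorts of patient identifiers; placeholders, not study data.
reference_cohort = {"p1", "p2", "p3", "p4", "p5"}  # reference-standard positives
icd_cohort = {"p1", "p2", "p3", "p9"}              # diagnosis-only ICD phenotype
snomed_cohort = {"p1", "p2", "p8"}                 # diagnosis-only SNOMED phenotype

def sensitivity(phenotype: set, reference: set) -> float:
    """True positives / reference positives (recall of the phenotype)."""
    return len(phenotype & reference) / len(reference)

print(f"ICD sensitivity:    {sensitivity(icd_cohort, reference_cohort):.3f}")    # 0.600
print(f"SNOMED sensitivity: {sensitivity(snomed_cohort, reference_cohort):.3f}") # 0.400
```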

    Secondary Use of Structured Electronic Health Records Data: From Observational Studies to Deep Learning-Based Predictive Modeling

    With the wide adoption of electronic health records (EHRs), researchers, as well as large healthcare organizations, governmental institutions, insurers, and pharmaceutical companies, have been interested in leveraging this rich clinical data source to extract clinical evidence and develop predictive algorithms. Large vendors have been able to compile structured EHR data from sites all over the United States, de-identify these data, and make them available to data science researchers in a more usable format. For this dissertation, we leveraged one of the earliest and largest secondary EHR data sources and conducted three studies of increasing scope. In the first, most limited study, we conducted a retrospective observational analysis to compare the effect of three drugs on a specific population of approximately 3,000 patients. Using a novel statistical method, we found evidence that the selection of phenylephrine as the primary vasopressor to induce hypertension for the management of nontraumatic subarachnoid hemorrhage is associated with better outcomes than selecting norepinephrine or dopamine. In the second study, we widened our scope, using a cohort of more than 100,000 patients to train generalizable models for the risk prediction of specific clinical events, such as heart failure in diabetes patients or pancreatic cancer. In this study, we found that recurrent neural network-based predictive models trained on expressive terminologies, which preserve a high level of granularity, are associated with better prediction performance compared with baseline methods such as logistic regression. Finally, we widened our scope again to train Med-BERT, a foundation model, on more than 20 million patients’ diagnosis data. Med-BERT was found to improve the prediction performance of downstream tasks that have a small sample size, which would otherwise limit the ability of the model to learn good representations. In conclusion, we found that we can extract useful information and train helpful deep learning-based predictive models. However, given the limitations of secondary EHR data, and considering that the data were originally collected for administrative rather than research purposes, the findings need clinical validation. Therefore, clinical trials are warranted to further validate any new evidence extracted from such data sources before updating clinical practice guidelines. The implementability of the developed predictive models, which are in an early development phase, also warrants further evaluation.
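    A minimal PyTorch sketch of the kind of recurrent risk model the second study describes, operating on integer-coded diagnosis sequences; the vocabulary size, dimensions, and architecture are assumptions for illustration, not the dissertation's actual models or Med-BERT.

```python
import torch
import torch.nn as nn

class DiagnosisRNN(nn.Module):
    """GRU over visit-level diagnosis-code sequences; sizes are illustrative."""
    def __init__(self, vocab_size: int = 10000, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, visits) integer-coded diagnoses, 0 = padding
        x = self.embed(codes)   # (batch, visits, dim)
        _, h = self.gru(x)      # h: (1, batch, dim), final hidden state
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)  # risk per patient

model = DiagnosisRNN()
batch = torch.randint(1, 10000, (4, 20))  # 4 synthetic patients, 20 coded visits
print(model(batch))  # four event-risk scores in (0, 1)
```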