38,289 research outputs found

    MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III

    Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced. In machine learning for healthcare, the community faces reproducibility challenges due to a lack of publicly accessible data and a lack of standardized data processing frameworks. We present MIMIC-Extract, an open-source pipeline for transforming raw electronic health record (EHR) data for critical care patients contained in the publicly available MIMIC-III database into dataframes that are directly usable in common machine learning pipelines. MIMIC-Extract addresses three primary challenges in making complex health records data accessible to the broader machine learning community. First, it provides standardized data processing functions, including unit conversion, outlier detection, and aggregation of semantically equivalent features, thus accounting for duplication and reducing missingness. Second, it preserves the time series nature of clinical data and can be easily integrated into clinically actionable prediction tasks in machine learning for health. Finally, it is highly extensible so that other researchers with related questions can easily use the same pipeline. We demonstrate the utility of this pipeline by showcasing several benchmark tasks and baseline results.
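
    The three processing steps named in the abstract above (unit conversion, outlier detection, aggregation of semantically equivalent features) can be illustrated with a minimal sketch. This is not MIMIC-Extract's own code or API: the label names, the plausibility bounds, the choice to drop outliers rather than clip them, and the helper names `to_canonical` and `hourly_features` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from raw item labels to one canonical feature name.
CANONICAL = {
    "Temperature F": "temperature_c",
    "Temperature C": "temperature_c",
    "Heart Rate": "heart_rate",
}

# Hypothetical plausibility bounds per canonical feature (low, high).
VALID_RANGE = {
    "temperature_c": (25.0, 45.0),
    "heart_rate": (0.0, 350.0),
}


def to_canonical(label, value):
    """Map a raw label to its canonical feature and convert units if needed."""
    if label == "Temperature F":
        value = (value - 32.0) * 5.0 / 9.0  # Fahrenheit to Celsius
    return CANONICAL[label], value


def hourly_features(events):
    """Aggregate (hour, label, value) events into per-hour feature means.

    Values outside the plausibility bounds are dropped as outliers
    (an assumption of this sketch; other policies are possible).
    """
    buckets = defaultdict(list)
    for hour, label, value in events:
        feature, value = to_canonical(label, value)
        low, high = VALID_RANGE[feature]
        if low <= value <= high:
            buckets[(hour, feature)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up events for illustration only: (hour, raw label, value).
events = [
    (0, "Temperature F", 98.6),
    (0, "Temperature C", 37.4),
    (0, "Heart Rate", 80.0),
    (0, "Heart Rate", 900.0),  # implausible value, dropped
    (1, "Heart Rate", 90.0),
]
table = hourly_features(events)
```

    Keying the result by (hour, feature) keeps the time-series structure: two differently labelled temperature readings in the same hour collapse into one value, which is how aggregation can reduce both duplication and missingness.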

    NHANES-GCP: Leveraging the Google Cloud Platform and BigQuery ML for reproducible machine learning with data from the National Health and Nutrition Examination Survey

    Summary: NHANES, the National Health and Nutrition Examination Survey, is a program of studies led by the Centers for Disease Control and Prevention (CDC) designed to assess the health and nutritional status of adults and children in the United States (U.S.). NHANES data is frequently used by biostatisticians and clinical scientists to study health trends across the U.S., but every analysis requires extensive data management and cleaning before use, and this repetitive data engineering collectively costs valuable research time and decreases the reproducibility of analyses. Here, we introduce NHANES-GCP, a Cloud Development Kit for Terraform (CDKTF) Infrastructure-as-Code (IaC) and Data Build Tool (dbt) resources built on the Google Cloud Platform (GCP) that automates the data engineering and management aspects of working with NHANES data. With current GCP pricing, NHANES-GCP costs less than $2 to run and less than $15/yr of ongoing costs for hosting the NHANES data, all while providing researchers with clean data tables that can readily be integrated for large-scale analyses. We provide examples of leveraging BigQuery ML to carry out the process of selecting data, integrating data, training machine learning and statistical models, and generating results all from a single SQL-like query. NHANES-GCP is designed to enhance the reproducibility of analyses and create a well-engineered NHANES data resource for statistics, machine learning, and fine-tuning Large Language Models (LLMs). Availability and implementation: NHANES-GCP is available at https://github.com/In-Vivo-Group/NHANES-GCP. Comment: 7 pages, 1 figure

    Development of a Composite Health Index in Children with Cystic Fibrosis: A Pipeline for Data Processing, Machine Learning, and Model Implementation using Electronic Health Records

    Cystic Fibrosis (CF) is a heterogeneous, multi-faceted genetic condition that primarily affects the lungs and digestive system. For children and young people living with CF, timely management is necessary to prevent the establishment of severe disease. Modern data capture through electronic health records (EHR) has created an opportunity to use machine learning algorithms to classify subgroups of disease to understand health status and prognosis. The overall aim of this thesis was to develop a composite health index in children with CF. An iterative approach to unsupervised cluster analysis was developed to identify homogeneous clusters of children with CF in a pre-existing encounter-based CF database from Toronto, Canada. An external validation of the model was carried out in a historical CF dataset from Great Ormond Street Hospital (GOSH) in London, UK. The clusters were also re-created and validated using EHR data from GOSH when it first became accessible in 2021. The interpretability and sensitivity of the GOSH EHR model were explored. Lastly, a scoping review was carried out to investigate common barriers to implementation of prognostic machine learning algorithms in paediatric respiratory care. A cluster model was identified that detailed four clusters associated with time to future hospitalisation, pulmonary exacerbation, and lung function. The clusters were also associated with different disease-related variables such as comorbidities, anthropometrics, microbiology infections, and treatment history. An app was developed to display individualised cluster assignment, which will be a useful way to interpret the cluster model clinically. The review of prognostic machine learning algorithms identified a lack of reproducibility and validation as the major limitation to model reporting that impairs clinical translation. EHR systems facilitate point-of-care access to individualised data and integrated machine learning models. However, there is a gap in translation to clinical implementation of machine learning models. With appropriate regulatory frameworks, the health index developed for children with CF could be implemented in CF care.

    The PLOS ONE collection on machine learning in health and biomedicine: Towards open code and open data

    Recent years have seen a surge of studies in machine learning in health and biomedicine, driven by digitalization of healthcare environments and increasingly accessible computer systems for conducting analyses. Many of us believe that these developments will lead to significant improvements in patient care. Like many academic disciplines, however, progress is hampered by lack of code and data sharing. In bringing together this PLOS ONE collection on machine learning in health and biomedicine, we sought to focus on the importance of reproducibility, making it a requirement, as far as possible, for authors to share data and code alongside their papers.

    Digital single-image smartphone assessment of total body fat and abdominal fat using machine learning

    Background: Obesity is a chronic health problem. Screening for the obesity phenotype is limited by the availability of practical methods. Methods: We determined the reproducibility and accuracy of an automated machine-learning method using smartphone camera-enabled capture and analysis of single, two-dimensional (2D) standing lateral digital images to estimate fat mass (FM) compared to dual X-ray absorptiometry (DXA) in females and males. We also report the first model to predict abdominal FM using 2D digital images. Results: Gender-specific 2D estimates of FM were significantly correlated (p 0.05). Reproducibility of FM estimates was very high (R² = 0.99) with high concordance (R² = 0.99) and low absolute pure error (0.114 to 0.116 kg) and percent error (1.3 and 3%). Bland–Altman plots revealed no proportional bias with limits of agreement of 4.9 to -4.3 kg and 3.9 to -4.9 kg for females and males, respectively. A novel 2D model to estimate abdominal (lumbar 2–5) FM produced high correlations (R² = 0.99) and concordance (R² = 0.99) compared to DXA abdominal FM values. Conclusions: A smartphone camera trained with machine learning and automated processing of 2D lateral standing digital images is an objective and valid method to estimate FM and, with proof of concept, to determine abdominal FM. It can facilitate practical identification of the obesity phenotype in adults. Peer Reviewed. Postprint (published version)
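
    For readers unfamiliar with the Bland–Altman limits of agreement quoted above, the following is a minimal sketch of the conventional calculation: the mean difference between two methods (the bias) plus or minus 1.96 sample standard deviations of the differences. The function name `bland_altman` and the input values are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    Limits of agreement are bias +/- 1.96 * sample SD of the differences.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Made-up fat-mass estimates in kg, for illustration only.
image_fm = [20.0, 25.0, 30.0, 35.0]
dxa_fm = [19.0, 26.0, 29.0, 36.0]
bias, lower, upper = bland_altman(image_fm, dxa_fm)
```

    A bias near zero with narrow limits indicates that the two methods agree closely; checking for proportional bias additionally requires relating the differences to the mean of the paired values, which this sketch does not do.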