ssROC: Semi-Supervised ROC Analysis for Reliable and Streamlined Evaluation of Phenotyping Algorithms
High-throughput phenotyping will accelerate the use of
electronic health records (EHRs) for translational research. A critical
roadblock is the extensive medical supervision required for phenotyping
algorithm (PA) estimation and evaluation. To address this challenge, numerous
weakly-supervised learning methods have been proposed to estimate PAs. However,
there is a paucity of methods for reliably evaluating the predictive
performance of PAs when a very small proportion of the data is labeled. To fill
this gap, we introduce a semi-supervised approach (ssROC) for estimation of the
receiver operating characteristic (ROC) parameters of PAs (e.g., sensitivity,
specificity).
ssROC uses a small labeled dataset to
nonparametrically impute missing labels. The imputations are then used for ROC
parameter estimation to yield more precise estimates of PA performance relative
to classical supervised ROC analysis (supROC) using only labeled data. We
evaluated ssROC through in-depth simulation studies and an extensive evaluation
of eight PAs from Mass General Brigham.
In both simulated and real data, ssROC produced ROC
parameter estimates with significantly lower variance than supROC for a given
amount of labeled data. For the eight PAs, our results illustrate that ssROC
achieves similar precision to supROC, but with approximately 60% of the amount
of labeled data on average.
ssROC enables precise evaluation of PA performance to
increase trust in observational health research without demanding large volumes
of labeled data. ssROC is also easily implementable in open-source
software.
When used in conjunction with weakly-supervised PAs,
ssROC facilitates the reliable and streamlined phenotyping necessary for
EHR-based research.
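The idea in the abstract above (impute missing labels nonparametrically from a small labeled set, then estimate ROC parameters from the imputations) can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian kernel smoother, the fixed `bandwidth`, the function names, and the toy data are all assumptions made for illustration.

```python
import math

def kernel_impute(labeled, scores, bandwidth=0.1):
    """Kernel-smoothed estimate of P(y = 1 | PA score) at each score.

    Assumption: a Nadaraya-Watson smoother with a Gaussian kernel stands in
    for the nonparametric imputation step; the abstract does not specify it.
    """
    prevalence = sum(y for _, y in labeled) / len(labeled)
    imputed = []
    for s in scores:
        weights = [math.exp(-0.5 * ((s - ls) / bandwidth) ** 2) for ls, _ in labeled]
        total = sum(weights)
        if total == 0.0:
            # No labeled score is close enough: fall back to labeled prevalence.
            imputed.append(prevalence)
        else:
            imputed.append(sum(w * y for w, (_, y) in zip(weights, labeled)) / total)
    return imputed

def sup_roc_point(labeled, cutoff):
    """Supervised (sensitivity, specificity) at a cutoff, labeled data only."""
    pos = [s for s, y in labeled if y == 1]
    neg = [s for s, y in labeled if y == 0]
    sens = sum(s >= cutoff for s in pos) / len(pos)
    spec = sum(s < cutoff for s in neg) / len(neg)
    return sens, spec

def ss_roc_point(labeled, unlabeled_scores, cutoff, bandwidth=0.1):
    """Semi-supervised (sensitivity, specificity) using imputed labels."""
    all_scores = [s for s, _ in labeled] + list(unlabeled_scores)
    m = kernel_impute(labeled, all_scores, bandwidth)
    flagged = [1.0 if s >= cutoff else 0.0 for s in all_scores]
    sens = sum(mi * f for mi, f in zip(m, flagged)) / sum(m)
    spec = sum((1 - mi) * (1 - f) for mi, f in zip(m, flagged)) / sum(1 - mi for mi in m)
    return sens, spec

# Hypothetical toy data: (PA score, chart-review label) pairs plus unlabeled scores.
labeled = [(0.10, 0), (0.20, 0), (0.35, 0), (0.50, 1),
           (0.60, 0), (0.70, 1), (0.85, 1), (0.90, 1)]
unlabeled_scores = [0.05, 0.15, 0.25, 0.30, 0.45, 0.55, 0.65, 0.75, 0.80, 0.95]

print("supROC:", sup_roc_point(labeled, 0.5))
print("ssROC: ", ss_roc_point(labeled, unlabeled_scores, 0.5))
```

In this sketch the unlabeled scores enter only through the imputed probabilities, which is where a precision gain over the labeled-only estimate would come from. A real analysis would also need a data-driven bandwidth and a variance estimate.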
Methods for enhancing the reproducibility of biomedical research findings using electronic health records.
BACKGROUND: The ability of external investigators to reproduce published scientific findings is critical for the evaluation and validation of biomedical research by the wider community. However, a substantial proportion of health research using electronic health records (EHR), data collected and generated during clinical care, is potentially not reproducible mainly due to the fact that the implementation details of most data preprocessing, cleaning, phenotyping and analysis approaches are not systematically made available or shared. With the complexity, volume and variety of electronic health record data sources made available for research steadily increasing, it is critical to ensure that scientific findings from EHR data are reproducible and replicable by researchers. Reporting guidelines, such as RECORD and STROBE, have set a solid foundation by recommending a series of items for researchers to include in their research outputs. Researchers however often lack the technical tools and methodological approaches to actuate such recommendations in an efficient and sustainable manner.
RESULTS: In this paper, we review and propose a series of methods and tools utilized in adjunct scientific disciplines that can be used to enhance the reproducibility of research using electronic health records and enable researchers to report analytical approaches in a transparent manner. Specifically, we discuss the adoption of scientific software engineering principles and best-practices such as test-driven development, source code revision control systems, literate programming and the standardization and re-use of common data management and analytical approaches.
CONCLUSION: The adoption of such approaches will enable scientists to systematically document and share EHR analytical workflows and increase the reproducibility of biomedical research using such complex data sources.
Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions
Indiana University-Purdue University Indianapolis (IUPUI)
Phenotyping definitions are essential in cohort identification when conducting
clinical research, but they become an obstacle when they are not readily available.
Developing new definitions manually requires expert involvement that is labor-intensive,
time-consuming, and unscalable. Moreover, automated approaches rely mostly on
electronic health records’ data that suffer from bias, confounding, and incompleteness.
Few efforts have used text-mining and data-driven approaches to automate
extraction and literature-based knowledge discovery of phenotyping definitions and to
support their scalability. In this dissertation, we proposed a text-mining pipeline combining
rule-based and machine-learning methods to automate retrieval, classification, and
extraction of phenotyping definitions’ information from literature. To achieve this, we first
developed an annotation guideline with ten dimensions to annotate sentences with evidence
of phenotyping definitions' modalities, such as phenotypes and laboratories. Two
annotators manually annotated a corpus of sentences (n=3,971) extracted from full-text
observational studies’ methods sections (n=86). Percent and Kappa statistics showed high
inter-annotator agreement on sentence-level annotations. Second, we constructed two
validated text classifiers using our annotated corpora: abstract-level and full-text sentence-level.
We applied the abstract-level classifier on a large-scale biomedical literature of over
20 million abstracts published between 1975 and 2018 to classify positive abstracts
(n=459,406). After retrieving their full-texts (n=120,868), we extracted sentences from
their methods sections and used the full-text sentence-level classifier to extract positive
sentences (n=2,745,416). Third, we performed a literature-based discovery utilizing the
positively classified sentences. Lexica-based methods were used to recognize medical
concepts in these sentences (n=19,423). Co-occurrence and association methods were used
to identify and rank phenotype candidates that are associated with a phenotype of interest.
We derived 12,616,465 associations from our large-scale corpus. Our literature-based
associations and large-scale corpus contribute to building new data-driven phenotyping
definitions and expanding existing definitions with minimal expert involvement.
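The co-occurrence and association step described above can be sketched as follows. The abstract does not name the association measure, so pointwise mutual information (PMI) is an assumption here, as are the function name and the toy sentences. Each sentence is represented by the set of medical concepts recognized in it.

```python
import math
from collections import Counter

def rank_associations(sentences, target):
    """Rank concepts co-occurring with `target` by sentence-level PMI.

    Assumption: PMI is used only as an illustrative association measure;
    the dissertation's actual measure is not given in the abstract.
    """
    n = len(sentences)
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in sentences:
        unique = set(concepts)
        concept_counts.update(unique)
        if target in unique:
            for concept in unique - {target}:
                pair_counts[concept] += 1
    ranked = []
    for concept, joint in pair_counts.items():
        # PMI = log2( P(target, concept) / (P(target) * P(concept)) )
        pmi = math.log2((joint * n) / (concept_counts[target] * concept_counts[concept]))
        ranked.append((concept, joint, pmi))
    # Highest PMI first; break ties by co-occurrence count, then by name.
    ranked.sort(key=lambda r: (-r[2], -r[1], r[0]))
    return ranked

# Hypothetical concept sets, one per positively classified sentence.
sentences = [
    {"type 2 diabetes", "hba1c", "metformin"},
    {"type 2 diabetes", "hba1c"},
    {"hypertension", "blood pressure"},
    {"type 2 diabetes", "metformin", "hypertension"},
    {"hypertension", "hba1c"},
]

for concept, joint, pmi in rank_associations(sentences, "type 2 diabetes"):
    print(f"{concept}: co-occurrences={joint}, PMI={pmi:.3f}")
```

PMI tends to favor rare concepts, so a corpus of this scale would likely need a minimum co-occurrence threshold before ranking.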
The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/154448/1/sim8445_am.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/154448/2/sim8445.pd
Application of Clinical Concept Embeddings for Heart Failure Prediction in UK EHR data
Electronic health records (EHR) are increasingly being used for constructing
disease risk prediction models. Feature engineering in EHR data however is
challenging due to their highly dimensional and heterogeneous nature.
Low-dimensional representations of EHR data can potentially mitigate these
challenges. In this paper, we use global vectors (GloVe) to learn word
embeddings for diagnoses and procedures recorded using 13 million ontology
terms across 2.7 million hospitalisations in national UK EHR. We demonstrate
the utility of these embeddings by evaluating their performance in identifying
patients who are at higher risk of being hospitalised for congestive heart
failure. Our findings indicate that embeddings can enable the creation of
robust EHR-derived disease risk prediction models and address some of the
limitations associated with manual clinical feature engineering.
Comment: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018
arXiv:1811.0721
Integration of Multiscale Sensing Data for Phenomics Applications
Sensing technologies can be a powerful tool for phenotyping in breeding programs. Plant phenotypes can be assessed non-invasively and repeatedly across the whole population and throughout the plant development period utilizing advanced sensors and remote sensing platforms. In this study, multiscale sensing platforms, namely satellite, unmanned aerial vehicle (UAV), proximal sensing system, and Internet of Things (IoT) based sensing systems, equipped with sensors such as visible/RGB, multispectral, and hyperspectral systems were utilized for field-based phenomics applications. The applicability of a suitable sensing technology depends on the area of study, specific phenomics application, sensor specification, and data acquisition conditions. Three main phenomics applications were explored: (i) pasture crop health status evaluation, (ii) above-ground biomass quantity and quality evaluation in the field pea, and (iii) evaluating wheat yield potential in winter and spring wheat. The first study demonstrates the reliability of using a high-resolution satellite (ground sampling distance, GSD = 3 m) and UAV imagery for pasture management. The multiscale sensing data showed that the grazing density significantly affected pasture biomass (p < 0.05) only in 2019, and the vegetation index (VI) data from the two imagery types were highly correlated (r ≥ 0.78, p < 0.001, 2019). In the second study, the above-ground biomass (AGBM) and biomass quality (12 quality traits) were evaluated using UAV-based RGB and multispectral imaging, and hyperspectral sensing, respectively, in the winter pea breeding program (2019 and 2020 seasons). Three image processing approaches were evaluated for AGBM estimation, where the best results were acquired using the 3D point cloud model at 1.5 alpha shape technique showing high correlation with harvested fresh (r = 0.78–0.81, p < 0.001) and dry (r = 0.70–0.81, p < 0.001) AGBM.
Similarly, the selected features from the normalized difference spectral indices and the ratio spectral indices extracted from hyperspectral data with the random forest model provided high predictive accuracy for all 12 biomass quality traits (0.81 < R² < 0.93; 0.05 < RMSE (%) < 1.80; 0.03 < MAE (%) < 1.32). In the wheat study, the vegetation indices were highly correlated between satellite (GSD = 0.31 m) and UAV data (0.42 ≤ r ≤ 0.99, p < 0.01) from winter and spring wheat breeding trials (2020 and 2021). The yield prediction using such VIs with the high-resolution satellite imagery (6.26 ≤ RMSE% ≤ 25.49; 5.11 ≤ MAE% ≤ 20.95; 0.17 ≤ r ≤ 0.78) and UAV imagery (5.53 ≤ RMSE% ≤ 17.20; 4.28 ≤ MAE% ≤ 14.20; 0.43 ≤ r ≤ 0.92) was also high. In addition to these two platforms, an intelligent and compact IoT-based sensor system was developed for independent and automated phenomics applications to measure and monitor plant responses in real-time. The sensor development, improvisation, and implementation encompassed three field seasons (2020, 2021, and 2022 seasons). The developed IoT-based sensor system could be successfully implemented to monitor multiple trials for timely crop management and increased resource efficiency. The system shows a high potential for supporting plant breeding programs for in-field phenotyping applications. All studies demonstrated promising results in monitoring and estimating crop performance and phenotypic traits using multiscale sensing systems.