860 research outputs found

ssROC: Semi-Supervised ROC Analysis for Reliable and Streamlined Evaluation of Phenotyping Algorithms

Author: Bonzel Clara-Lea
Gao Jianhui
Gronsbell Jessica
Hong Chuan
Varghese Paul
Zakir Karim
Publication date: 16/06/2023

Objective: High-throughput phenotyping will accelerate the use of electronic health records (EHRs) for translational research. A critical roadblock is the extensive medical supervision required for phenotyping algorithm (PA) estimation and evaluation. To address this challenge, numerous weakly-supervised learning methods have been proposed to estimate PAs. However, there is a paucity of methods for reliably evaluating the predictive performance of PAs when a very small proportion of the data is labeled. To fill this gap, we introduce a semi-supervised approach (ssROC) for estimation of the receiver operating characteristic (ROC) parameters of PAs (e.g., sensitivity, specificity).

Materials and Methods: ssROC uses a small labeled dataset to nonparametrically impute missing labels. The imputations are then used for ROC parameter estimation to yield more precise estimates of PA performance relative to classical supervised ROC analysis (supROC) using only labeled data. We evaluated ssROC through in-depth simulation studies and an extensive evaluation of eight PAs from Mass General Brigham.

Results: In both simulated and real data, ssROC produced ROC parameter estimates with significantly lower variance than supROC for a given amount of labeled data. For the eight PAs, our results illustrate that ssROC achieves similar precision to supROC, but with approximately 60% of the amount of labeled data on average.

Discussion: ssROC enables precise evaluation of PA performance to increase trust in observational health research without demanding large volumes of labeled data. ssROC is also easily implementable in open-source R software.

Conclusion: When used in conjunction with weakly-supervised PAs, ssROC facilitates the reliable and streamlined phenotyping necessary for EHR-based research.
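The idea described under Materials and Methods (impute missing labels nonparametrically from a small labeled set, then estimate ROC parameters from the imputations) can be illustrated with a minimal Python sketch. This is not the authors' ssROC implementation: the Gaussian kernel, the fixed `bandwidth`, the function names `impute_labels` and `roc_point`, and the toy scores are all assumptions chosen for illustration.

```python
import math


def impute_labels(labeled_scores, labels, unlabeled_scores, bandwidth=0.1):
    """Kernel-smoothed estimate of P(label = 1 | score) for each unlabeled score.

    Assumption: a Gaussian kernel with a fixed bandwidth stands in for the
    nonparametric imputation step; the abstract does not specify either choice.
    """
    imputed = []
    for s in unlabeled_scores:
        weights = [math.exp(-0.5 * ((s - t) / bandwidth) ** 2) for t in labeled_scores]
        total = sum(weights)
        if total == 0.0:
            # No labeled score is close enough; fall back to the labeled prevalence.
            imputed.append(sum(labels) / len(labels))
        else:
            imputed.append(sum(w * y for w, y in zip(weights, labels)) / total)
    return imputed


def roc_point(scores, soft_labels, threshold):
    """Return (TPR, FPR) at a threshold, weighting each record by its imputed label."""
    positives = sum(soft_labels)
    negatives = len(soft_labels) - positives
    if positives <= 0 or negatives <= 0:
        raise ValueError("need both positive and negative mass to form the rates")
    flagged = [m for s, m in zip(scores, soft_labels) if s >= threshold]
    tpr = sum(flagged) / positives
    fpr = sum(1.0 - m for m in flagged) / negatives
    return tpr, fpr


# Hypothetical toy data: four labeled PA scores and three unlabeled ones.
labeled_scores = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
unlabeled_scores = [0.15, 0.5, 0.85]

imputed = impute_labels(labeled_scores, labels, unlabeled_scores)
tpr, fpr = roc_point(unlabeled_scores, imputed, threshold=0.5)
print(imputed, tpr, fpr)
```

In this sketch the unlabeled records contribute to the sensitivity and specificity estimates through their imputed labels, which is the intuition behind the variance reduction the abstract reports; the actual estimator and its inference procedure are described in the paper.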

arXiv.org e-Print Archive

Methods for enhancing the reproducibility of biomedical research findings using electronic health records.

Author: Cakiroglu Aylin
Denaxas Spiros
Direk Kenan
Gonzalez-Izquierdo Arturo
Hemingway Harry
Moore Jason
Pikoula Maria
Smeeth Liam
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

BACKGROUND: The ability of external investigators to reproduce published scientific findings is critical for the evaluation and validation of biomedical research by the wider community. However, a substantial proportion of health research using electronic health records (EHR), data collected and generated during clinical care, is potentially not reproducible, mainly because the implementation details of most data preprocessing, cleaning, phenotyping and analysis approaches are not systematically made available or shared. With the complexity, volume and variety of electronic health record data sources made available for research steadily increasing, it is critical to ensure that scientific findings from EHR data are reproducible and replicable by researchers. Reporting guidelines, such as RECORD and STROBE, have set a solid foundation by recommending a series of items for researchers to include in their research outputs. Researchers, however, often lack the technical tools and methodological approaches to actuate such recommendations in an efficient and sustainable manner.

RESULTS: In this paper, we review and propose a series of methods and tools utilized in adjunct scientific disciplines that can be used to enhance the reproducibility of research using electronic health records and enable researchers to report analytical approaches in a transparent manner. Specifically, we discuss the adoption of scientific software engineering principles and best practices such as test-driven development, source code revision control systems, literate programming and the standardization and re-use of common data management and analytical approaches.

CONCLUSION: The adoption of such approaches will enable scientists to systematically document and share EHR analytical workflows and increase the reproducibility of biomedical research using such complex data sources.

Crossref

LSHTM Research Online

Directory of Open Access Journals

UCL Discovery

FigShare

Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions

Author: Binkheder Samar Hussein
Publication date: 01/07/2019

Indiana University-Purdue University Indianapolis (IUPUI)

Phenotyping definitions are essential in cohort identification when conducting clinical research, but they become an obstacle when they are not readily available. Developing new definitions manually requires expert involvement that is labor-intensive, time-consuming, and unscalable. Moreover, automated approaches rely mostly on electronic health records' data that suffer from bias, confounding, and incompleteness. Few efforts have used text-mining and data-driven approaches to automate extraction and literature-based knowledge discovery of phenotyping definitions and to support their scalability. In this dissertation, we proposed a text-mining pipeline combining rule-based and machine-learning methods to automate retrieval, classification, and extraction of phenotyping definitions' information from literature. To achieve this, we first developed an annotation guideline with ten dimensions to annotate sentences with evidence of phenotyping definitions' modalities, such as phenotypes and laboratories. Two annotators manually annotated a corpus of sentences (n=3,971) extracted from full-text observational studies' methods sections (n=86). Percent and Kappa statistics showed high inter-annotator agreement on sentence-level annotations. Second, we constructed two validated text classifiers using our annotated corpora: abstract-level and full-text sentence-level. We applied the abstract-level classifier to a large-scale biomedical literature collection of over 20 million abstracts published between 1975 and 2018 to classify positive abstracts (n=459,406). After retrieving their full-texts (n=120,868), we extracted sentences from their methods sections and used the full-text sentence-level classifier to extract positive sentences (n=2,745,416). Third, we performed a literature-based discovery utilizing the positively classified sentences. Lexica-based methods were used to recognize medical concepts in these sentences (n=19,423). Co-occurrence and association methods were used to identify and rank phenotype candidates that are associated with a phenotype of interest. We derived 12,616,465 associations from our large-scale corpus. Our literature-based associations and large-scale corpus contribute to building new data-driven phenotyping definitions and expanding existing definitions with minimal expert involvement.
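The co-occurrence and association step mentioned in this abstract can be sketched in Python as follows. The abstract does not name the association measure used, so pointwise mutual information (PMI) is an assumption here, as are the function name `rank_associations` and the toy concept sets; this is an illustration of the general technique, not the dissertation's pipeline.

```python
import math
from collections import Counter
from itertools import combinations


def rank_associations(sentences, target):
    """Rank concepts that co-occur with `target` by pointwise mutual information.

    `sentences` is a list of concept lists, one per sentence. PMI is an assumed
    stand-in for the unspecified association measure.
    """
    n = len(sentences)
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in sentences:
        unique = set(concepts)  # count each concept once per sentence
        concept_counts.update(unique)
        pair_counts.update(frozenset(p) for p in combinations(sorted(unique), 2))

    ranked = []
    for pair, joint in pair_counts.items():
        if target not in pair:
            continue
        (other,) = pair - {target}
        pmi = math.log2(joint * n / (concept_counts[target] * concept_counts[other]))
        ranked.append((other, joint, pmi))
    # Highest PMI first; ties broken by co-occurrence count, then by name.
    ranked.sort(key=lambda r: (-r[2], -r[1], r[0]))
    return ranked


# Hypothetical concept sets recognized in six sentences.
sentences = [
    ["diabetes", "hba1c"],
    ["diabetes", "hba1c", "metformin"],
    ["diabetes", "aspirin"],
    ["hypertension", "aspirin"],
    ["hypertension", "aspirin"],
    ["metformin", "obesity"],
]
for concept, joint, pmi in rank_associations(sentences, "diabetes"):
    print(concept, joint, round(pmi, 3))
```

Counting each concept once per sentence keeps a sentence that repeats a term from inflating its association score. Raw PMI favors rare concepts, so a real pipeline would likely add a minimum-count filter or a different measure.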

IUPUIScholarWorks

The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities

Author: Aczon M
Agniel D
Al‐Azwani IK
Beaulieu‐Jones BK
Beesley LJ
Bjørnland T
Caballero K
Castro V
Choi SW
Fan JW
Fritsche LG
Garg R
Ge T
Good P
Haneuse S
Harang R
Johnson KW
Kuang Z
Lloyd‐Jones LR
Long Q
Mcculloch CE
National Institutes of Health
Neale B.
Pendergrass SA
Pollard TJ
Rajkomar A
Rothman KJ
Santillana M
Shi X
Shickel B
Tang L
Thompson K
Uddin MJ
Wells BJ
West SG
Xie S
Yang J
Publication venue: John Wiley & Sons, Inc.
Publication date: 15/03/2020

Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/154448/1/sim8445_am.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/154448/2/sim8445.pd

Crossref

Deep Blue Documents at the University of Michigan

Application of Clinical Concept Embeddings for Heart Failure Prediction in UK EHR data

Author: Denaxas Spiros
Dobson Richard
Hemingway Harry
Pikoula Maria
Riedel Sebastian
Stenetorp Pontus
Publication date: 28/11/2018

Electronic health records (EHR) are increasingly being used for constructing disease risk prediction models. Feature engineering in EHR data, however, is challenging due to their high-dimensional and heterogeneous nature. Low-dimensional representations of EHR data can potentially mitigate these challenges. In this paper, we use global vectors (GloVe) to learn word embeddings for diagnoses and procedures recorded using 13 million ontology terms across 2.7 million hospitalisations in national UK EHR. We demonstrate the utility of these embeddings by evaluating their performance in identifying patients who are at higher risk of being hospitalised for congestive heart failure. Our findings indicate that embeddings can enable the creation of robust EHR-derived disease risk prediction models and address some of the limitations associated with manual clinical feature engineering.

Comment: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.0721

arXiv.org e-Print Archive

UCL Discovery


Integration of Multiscale Sensing Data for Phenomics Applications

Author: Sangjan Worasit
Publication venue: Washington State University
Publication date: 01/01/2023

Sensing technologies can be a powerful tool for phenotyping in breeding programs. Plant phenotypes can be assessed non-invasively and repeatedly across the whole population and throughout the plant development period utilizing advanced sensors and remote sensing platforms. In this study, multiscale sensing platforms (satellite, unmanned aerial vehicle (UAV), proximal sensing system, and Internet of Things (IoT) based sensing systems) equipped with sensors such as visible/RGB, multispectral, and hyperspectral systems were utilized for field-based phenomics applications. The applicability of a suitable sensing technology depends on the area of study, specific phenomics application, sensor specification, and data acquisition conditions. Three main phenomics applications were explored: (i) pasture crop health status evaluation, (ii) above-ground biomass quantity and quality evaluation in the field pea, and (iii) evaluating wheat yield potential in winter and spring wheat. The first study demonstrates the reliability of using a high-resolution satellite (ground sampling distance, GSD = 3 m) and UAV imagery for pasture management. The multiscale sensing data showed that the grazing density significantly affected pasture biomass (p < 0.05) only in 2019, and the vegetation index (VI) data from the two imagery types were highly correlated (r ≥ 0.78, p < 0.001, 2019). In the second study, the above-ground biomass (AGBM) and biomass quality (12 quality traits) were evaluated using UAV-based RGB and multispectral imaging, and hyperspectral sensing, respectively, in the winter pea breeding program (2019 and 2020 seasons). Three image processing approaches were evaluated for AGBM estimation, where the best results were acquired using the 3D point cloud model at 1.5 alpha shape technique showing high correlation with harvested fresh (r = 0.78–0.81, p < 0.001) and dry (r = 0.70–0.81, p < 0.001) AGBM. Similarly, the selected features from the normalized difference spectral indices and the ratio spectral indices extracted from hyperspectral data with the random forest model provided high predictive accuracy for all 12 biomass quality traits (0.81 < R² < 0.93; 0.05 < RMSE (%) < 1.80; 0.03 < MAE (%) < 1.32). In the wheat study, the vegetation indices were highly correlated between satellite (GSD = 0.31 m) and UAV data (0.42 ≤ r ≤ 0.99, p < 0.01) from winter and spring wheat breeding trials (2020 and 2021). The yield prediction accuracy using such VIs with the high-resolution satellite imagery (6.26 ≤ RMSE% ≤ 25.49; 5.11 ≤ MAE% ≤ 20.95; 0.17 ≤ r ≤ 0.78) and UAV imagery (5.53 ≤ RMSE% ≤ 17.20; 4.28 ≤ MAE% ≤ 14.20; 0.43 ≤ r ≤ 0.92) was also high. In addition to these two platforms, an intelligent and compact IoT-based sensor system was developed for independent and automated phenomics applications to measure and monitor plant responses in real-time. The sensor development, improvisation, and implementation encompassed three field seasons (2020, 2021, and 2022 seasons). The developed IoT-based sensor system could be successfully implemented to monitor multiple trials for timely crop management and increased resource efficiency. The system shows a high potential for supporting plant breeding programs for in-field phenotyping applications. All studies demonstrated promising results in monitoring and estimating crop performance and phenotypic traits using multiscale sensing systems.
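The abstract compares vegetation indices across platforms by correlation. A minimal Python sketch of that kind of comparison is shown here, assuming the familiar normalized difference form (a − b) / (a + b), which gives NDVI when a is near-infrared and b is red reflectance. The band choices, function names, and reflectance values are hypothetical and do not come from the dissertation.

```python
import math


def normalized_difference(band_a, band_b):
    """Normalized difference index (a - b) / (a + b); NDVI when a = NIR and b = red."""
    total = band_a + band_b
    if total == 0:
        raise ValueError("band values sum to zero; index is undefined")
    return (band_a - band_b) / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-plot (NIR, red) reflectances from two platforms.
satellite = [(0.52, 0.10), (0.45, 0.14), (0.60, 0.08), (0.38, 0.18)]
uav = [(0.55, 0.09), (0.47, 0.15), (0.63, 0.07), (0.40, 0.20)]

vi_satellite = [normalized_difference(nir, red) for nir, red in satellite]
vi_uav = [normalized_difference(nir, red) for nir, red in uav]
print(pearson_r(vi_satellite, vi_uav))
```

A per-plot correlation like this only measures agreement in ranking and linear trend between platforms; it says nothing about systematic offsets, which would need a separate bias or error measure.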

Washington State University institutional repository