ssROC: Semi-Supervised ROC Analysis for Reliable and Streamlined Evaluation of Phenotyping Algorithms
High-throughput phenotyping will accelerate the use of
electronic health records (EHRs) for translational research. A critical
roadblock is the extensive medical supervision required for phenotyping
algorithm (PA) estimation and evaluation. To address this challenge, numerous
weakly-supervised learning methods have been proposed to estimate PAs. However,
there is a paucity of methods for reliably evaluating the predictive
performance of PAs when a very small proportion of the data is labeled. To fill
this gap, we introduce a semi-supervised approach (ssROC) for estimation of the
receiver operating characteristic (ROC) parameters of PAs (e.g., sensitivity,
specificity).
ssROC uses a small labeled dataset to
nonparametrically impute missing labels. The imputations are then used for ROC
parameter estimation to yield more precise estimates of PA performance relative
to classical supervised ROC analysis (supROC) using only labeled data. We
evaluated ssROC through in-depth simulation studies and an extensive evaluation
of eight PAs from Mass General Brigham.
In both simulated and real data, ssROC produced ROC
parameter estimates with significantly lower variance than supROC for a given
amount of labeled data. For the eight PAs, ssROC achieved precision similar to
supROC with, on average, approximately 60% of the labeled data.
ssROC enables precise evaluation of PA performance to
increase trust in observational health research without demanding large volumes
of labeled data. ssROC is also easily implementable in open-source
software.
When used in conjunction with weakly-supervised PAs,
ssROC facilitates the reliable and streamlined phenotyping necessary for
EHR-based research.
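The impute-then-estimate idea behind ssROC can be sketched in a few lines. The following is an illustrative reconstruction, not the authors' implementation: the Gaussian-kernel (Nadaraya-Watson) regression of the gold-standard label on the PA score, the bandwidth, and the classification threshold are all assumptions for the sake of the example.

```python
import numpy as np

def ssroc(scores_lab, y_lab, scores_unlab, bandwidth=0.05, threshold=0.5):
    """Illustrative semi-supervised ROC estimation: nonparametrically impute
    the missing labels from a small labeled set, then estimate sensitivity
    and specificity of the PA over all records (labeled + imputed)."""
    def impute(s):
        # Gaussian kernel weights centered at the unlabeled record's PA score
        w = np.exp(-0.5 * ((scores_lab - s) / bandwidth) ** 2)
        return np.sum(w * y_lab) / np.sum(w)

    y_imp = np.array([impute(s) for s in scores_unlab])
    scores = np.concatenate([scores_lab, scores_unlab])
    y_soft = np.concatenate([y_lab, y_imp])  # labels/imputations in [0, 1]
    pred = scores >= threshold               # PA classification at this cutoff
    sens = np.sum(y_soft[pred]) / np.sum(y_soft)
    spec = np.sum((1 - y_soft)[~pred]) / np.sum(1 - y_soft)
    return sens, spec
```

Because every record contributes to the estimates (via an imputed soft label rather than a discarded missing label), the variance for a fixed labeled-set size is reduced relative to supervised ROC analysis on the labeled subset alone, which is the paper's central claim.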
Desiderata for the development of next-generation electronic health record phenotype libraries
Background
High-quality phenotype definitions are desirable to enable the extraction of patient cohorts from large electronic health record repositories and are characterized by properties such as portability, reproducibility, and validity. Phenotype libraries, where definitions are stored, have the potential to contribute significantly to the quality of the definitions they host. In this work, we present a set of desiderata for the design of a next-generation phenotype library that is able to ensure the quality of hosted definitions by combining the functionality currently offered by disparate tooling.
Methods
A group of researchers examined work to date on phenotype models, implementation, and validation, as well as contemporary phenotype libraries developed as a part of their own phenomics communities. Existing phenotype frameworks were also examined. This work was translated and refined by all the authors into a set of best practices.
Results
We present 14 library desiderata that promote high-quality phenotype definitions, in the areas of modelling, logging, validation, and sharing and warehousing.
Conclusions
There are a number of choices to be made when constructing phenotype libraries. Our considerations distil the best practices in the field and include pointers towards their further development to support portable, reproducible, and clinically valid phenotype design. The provision of high-quality phenotype definitions enables electronic health record data to be more effectively used in medical domains.
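The desiderata span modelling, logging, validation, and sharing/warehousing. A minimal library entry that records such quality metadata might look like the following sketch; the class and field names are hypothetical illustrations, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class PhenotypeDefinition:
    """Hypothetical phenotype-library entry carrying the quality metadata
    the desiderata call for: modelling, logging, validation, sharing."""
    name: str
    version: str                    # logging: immutable version history
    codelists: dict                 # modelling: vocabulary -> list of codes
    validation: dict = field(default_factory=dict)  # e.g. {"ppv": 0.92}
    license: str = "open"           # sharing and warehousing
    changelog: list = field(default_factory=list)

    def log_change(self, note: str) -> None:
        """Append an audit-trail entry (logging desideratum)."""
        self.changelog.append(note)

# Example entry with codelists in two vocabularies for portability
t2dm = PhenotypeDefinition(
    name="type-2-diabetes",
    version="1.0.0",
    codelists={"ICD-10": ["E11"], "SNOMED": ["44054006"]},
)
t2dm.log_change("initial release")
```

Storing codelists per vocabulary, validation results, and a changelog alongside the definition itself is one way a library could make portability, validity, and reproducibility first-class properties rather than afterthoughts.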
Biomedical Informatics Applications for Precision Management of Neurodegenerative Diseases
Modern medicine is in the midst of a revolution driven by "big data," rapidly advancing computing power, and broader integration of technology into healthcare. Highly detailed and individualized profiles of both health and disease states are now possible, including biomarkers, genomic profiles, cognitive and behavioral phenotypes, high-frequency assessments, and medical imaging. Although these data are incredibly complex, they can potentially be used to understand multi-determinant causal relationships, elucidate modifiable factors, and ultimately customize treatments based on individual parameters. Especially for neurodegenerative diseases, where an effective therapeutic agent has yet to be discovered, there remains a critical need for an interdisciplinary perspective on data and information management due to the number of unanswered questions. Biomedical informatics is a multidisciplinary field that falls at the intersection of information technology, computer and data science, engineering, and healthcare that will be instrumental for uncovering novel insights into neurodegenerative disease research, including both causal relationships and therapeutic targets, and maximizing the utility of both clinical and research data. The present study aims to provide a brief overview of biomedical informatics and how clinical data applications such as clinical decision support tools can be developed to derive new knowledge from the wealth of available data to advance clinical care and scientific research of neurodegenerative diseases in the era of precision medicine.
Comparative analysis, applications, and interpretation of electronic health record-based stroke phenotyping methods
Background
Accurate identification of acute ischemic stroke (AIS) patient cohorts is essential for a wide range of clinical investigations. Automated phenotyping methods that leverage electronic health records (EHRs) represent a fundamentally new approach to cohort identification, avoiding the current laborious and poorly generalizable manual construction of phenotyping algorithms. We systematically compared and evaluated the ability of machine learning algorithms and case-control combinations to phenotype acute ischemic stroke patients using data from an EHR.
Materials and methods
Using structured patient data from the EHR at a tertiary-care hospital system, we built and evaluated machine learning models to identify patients with AIS based on 75 different case-control and classifier combinations. We then estimated the prevalence of AIS patients across the EHR. Finally, we externally validated the ability of the models to detect AIS patients without AIS diagnosis codes using the UK Biobank.
Results
Across all models, we found that the mean AUROC for detecting AIS was 0.963 ± 0.0520 and the average precision score was 0.790 ± 0.196 with minimal feature processing. Classifiers trained with cases with AIS diagnosis codes and controls with no cerebrovascular disease codes had the best average F1 score (0.832 ± 0.0383). In the external validation, we found that the top probabilities from a model-predicted AIS cohort were significantly enriched for AIS patients without AIS diagnosis codes (60–150 fold over expected).
Conclusions
Our findings support machine learning algorithms as a generalizable way to accurately identify AIS patients without using process-intensive manual feature curation. When a set of AIS patients is unavailable, diagnosis codes may be used to train classifier models.
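The study's code-defined case-control design can be sketched as follows. This is a toy stand-in, not one of the paper's 75 case-control/classifier combinations: the features are synthetic Gaussians in place of structured EHR data (code counts, labs, demographics), and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for structured EHR features.
# Cases: patients carrying AIS diagnosis codes;
# controls: patients with no cerebrovascular disease codes.
n_per_class = 1000
X = np.vstack([rng.normal(1.0, 1.0, (n_per_class, 5)),   # cases
               rng.normal(0.0, 1.0, (n_per_class, 5))])  # controls
y = np.r_[np.ones(n_per_class), np.zeros(n_per_class)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# The same metrics the study reports: AUROC and F1 on held-out patients
auroc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
f1 = f1_score(y_te, clf.predict(X_te))
```

Once trained, `clf.predict_proba` can score the full EHR population; ranking patients by predicted probability is the mechanism behind the external-validation finding that top-ranked patients are enriched for AIS cases lacking diagnosis codes.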
The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities