25 research outputs found
The utility of general purpose versus specialty clinical databases for research: Warfarin dose estimation from extracted clinical variables
AbstractThere is debate about the utility of clinical data warehouses for research. Using a clinical warfarin dosing algorithm derived from research-quality data, we evaluated the data quality of both a general-purpose database and a coagulation-specific database. We evaluated the functional utility of these repositories by using data extracted from them to predict warfarin dose. We reasoned that high-quality clinical data would predict doses nearly as accurately as research data, while poor-quality clinical data would predict doses less accurately. We evaluated the Mean Absolute Error (MAE) in predicted weekly dose as a metric of data quality. The MAE was comparable between the clinical gold standard (10.1mg/wk) and the specialty database (10.4mg/wk), but the MAE for the clinical warehouse was 40% greater (14.1mg/wk). Our results indicate that the research utility of clinical data collected in focused clinical settings is greater than that of data collected during general-purpose clinical care
Automated Lung Ultrasound Pulmonary Disease Quantification Using an Unsupervised Machine Learning Technique for COVID-19
COVID-19 is an ongoing global health pandemic. Although COVID-19 can be diagnosed with various tests such as PCR, these tests do not establish pulmonary disease burden. Whereas point-of-care lung ultrasound (POCUS) can directly assess the severity of characteristic pulmonary findings of COVID-19, the advantage of using US is that it is inexpensive, portable, and widely available for use in many clinical settings. For automated assessment of pulmonary findings, we have developed an unsupervised learning technique termed the calculated lung ultrasound (CLU) index. The CLU can quantify various types of lung findings, such as A or B lines, consolidations, and pleural effusions, and it uses these findings to calculate a CLU index score, which is a quantitative measure of pulmonary disease burden. This is accomplished using an unsupervised, patient-specific approach that does not require training on a large dataset. The CLU was tested on 52 lung ultrasound examinations from several institutions. CLU demonstrated excellent concordance with radiologist findings in different pulmonary disease states. Given the global nature of COVID-19, the CLU would be useful for sonographers and physicians in resource-strapped areas with limited ultrasound training and diagnostic capacities for more accurate assessment of pulmonary status
An integrative method for scoring candidate genes from association studies: application to warfarin dosing
BackgroundA key challenge in pharmacogenomics is the identification of genes whose variants contribute to drug response phenotypes, which can include severe adverse effects. Pharmacogenomics GWAS attempt to elucidate genotypes predictive of drug response. However, the size of these studies has severely limited their power and potential application. We propose a novel knowledge integration and SNP aggregation approach for identifying genes impacting drug response. Our SNP aggregation method characterizes the degree to which uncommon alleles of a gene are associated with drug response. We first use pre-existing knowledge sources to rank pharmacogenes by their likelihood to affect drug response. We then define a summary score for each gene based on allele frequencies and train linear and logistic regression classifiers to predict drug response phenotypes.ResultsWe applied our method to a published warfarin GWAS data set comprising 181 individuals. We find that our method can increase the power of the GWAS to identify both VKORC1 and CYP2C9 as warfarin pharmacogenes, where the original analysis had only identified VKORC1. Additionally, we find that our method can be used to discriminate between low-dose (AUROC=0.886) and high-dose (AUROC=0.764) responders.ConclusionsOur method offers a new route for candidate pharmacogene discovery from pharmacogenomics GWAS, and serves as a foundation for future work in methods for predictive pharmacogenomics
Cross-Modal Data Programming Enables Rapid Medical Machine Learning
Labeling training datasets has become a key barrier to building medical
machine learning models. One strategy is to generate training labels
programmatically, for example by applying natural language processing pipelines
to text reports associated with imaging studies. We propose cross-modal data
programming, which generalizes this intuitive strategy in a
theoretically-grounded way that enables simpler, clinician-driven input,
reduces required labeling time, and improves with additional unlabeled data. In
this approach, clinicians generate training labels for models defined over a
target modality (e.g. images or time series) by writing rules over an auxiliary
modality (e.g. text reports). The resulting technical challenge consists of
estimating the accuracies and correlations of these rules; we extend a recent
unsupervised generative modeling technique to handle this cross-modal setting
in a provably consistent way. Across four applications in radiography, computed
tomography, and electroencephalography, and using only several hours of
clinician time, our approach matches or exceeds the efficacy of
physician-months of hand-labeling with statistical significance, demonstrating
a fundamentally faster and more flexible way of building machine learning
models in medicine
Recommended from our members
Pathway analysis of genome-wide data improves warfarin dose prediction
Background: Many genome-wide association studies focus on associating single loci with target phenotypes. However, in the setting of rare variation, accumulating sufficient samples to assess these associations can be difficult. Moreover, multiple variations in a gene or a set of genes within a pathway may all contribute to the phenotype, suggesting that the aggregation of variations found over the gene or pathway may be useful for improving the power to detect associations. Results: Here, we present a method for aggregating single nucleotide polymorphisms (SNPs) along biologically relevant pathways in order to seek genetic associations with phenotypes. Our method uses all available genetic variants and does not remove those in linkage disequilibrium (LD). Instead, it uses a novel SNP weighting scheme to down-weight the contributions of correlated SNPs. We apply our method to three cohorts of patients taking warfarin: two European descent cohorts and an African American cohort. Although the clinical covariates and key pharmacogenetic loci for warfarin have been characterized, our association metric identifies a significant association with mutations distributed throughout the pathway of warfarin metabolism. We improve dose prediction after using all known clinical covariates and pharmacogenetic variants in VKORC1 and CYP2C9. In particular, we find that at least 1% of the missing heritability in warfarin dose may be due to the aggregated effects of variations in the warfarin metabolic pathway, even though the SNPs do not individually show a significant association. Conclusions: Our method allows researchers to study aggregative SNP effects in an unbiased manner by not preselecting SNPs. It retains all the available information by accounting for LD-structure through weighting, which eliminates the need for LD pruning
Predictive Model for ICU Readmission Based on Discharge Summaries Using Machine Learning and Natural Language Processing
Predicting ICU readmission risk will help physicians make decisions regarding discharge. We used discharge summaries to predict ICU 30-day readmission risk using text mining and machine learning (ML) with data from the Medical Information Mart for Intensive Care III (MIMIC-III). We used Natural Language Processing (NLP) and the Bag-of-Words approach on discharge summaries to build a Document-Term-Matrix with 3000 features. We compared the performance of support vector machines with the radial basis function kernel (SVM-RBF), adaptive boosting (AdaBoost), quadratic discriminant analysis (QDA), least absolute shrinkage and selection operator (LASSO), and Ridge Regression. A total of 4000 patients were used for model training and 6000 were used for validation. Using the bag-of-words determined by NLP, the area under the receiver operating characteristic (AUROC) curve was 0.71, 0.68, 0.65, 0.69, and 0.65 correspondingly for SVM-RBF, AdaBoost, QDA, LASSO, and Ridge Regression. We then used the SVM-RBF model for feature selection by incrementally adding features to the model from 1 to 3000 bag-of-words. Through this exhaustive search approach, only 825 features (words) were dominant. Using those selected features, we trained and validated all ML models. The AUROC curve was 0.74, 0.69, 0.67, 0.70, and 0.71 respectively for SVM-RBF, AdaBoost, QDA, LASSO, and Ridge Regression. Overall, this technique could predict ICU readmission relatively well
Automated Lung Ultrasound Pulmonary Disease Quantification Using an Unsupervised Machine Learning Technique for COVID-19
COVID-19 is an ongoing global health pandemic. Although COVID-19 can be diagnosed with various tests such as PCR, these tests do not establish pulmonary disease burden. Whereas point-of-care lung ultrasound (POCUS) can directly assess the severity of characteristic pulmonary findings of COVID-19, the advantage of using US is that it is inexpensive, portable, and widely available for use in many clinical settings. For automated assessment of pulmonary findings, we have developed an unsupervised learning technique termed the calculated lung ultrasound (CLU) index. The CLU can quantify various types of lung findings, such as A or B lines, consolidations, and pleural effusions, and it uses these findings to calculate a CLU index score, which is a quantitative measure of pulmonary disease burden. This is accomplished using an unsupervised, patient-specific approach that does not require training on a large dataset. The CLU was tested on 52 lung ultrasound examinations from several institutions. CLU demonstrated excellent concordance with radiologist findings in different pulmonary disease states. Given the global nature of COVID-19, the CLU would be useful for sonographers and physicians in resource-strapped areas with limited ultrasound training and diagnostic capacities for more accurate assessment of pulmonary status
Predictive Model for ICU Readmission Based on Discharge Summaries Using Machine Learning and Natural Language Processing
Predicting ICU readmission risk will help physicians make decisions regarding discharge. We used discharge summaries to predict ICU 30-day readmission risk using text mining and machine learning (ML) with data from the Medical Information Mart for Intensive Care III (MIMIC-III). We used Natural Language Processing (NLP) and the Bag-of-Words approach on discharge summaries to build a Document-Term-Matrix with 3000 features. We compared the performance of support vector machines with the radial basis function kernel (SVM-RBF), adaptive boosting (AdaBoost), quadratic discriminant analysis (QDA), least absolute shrinkage and selection operator (LASSO), and Ridge Regression. A total of 4000 patients were used for model training and 6000 were used for validation. Using the bag-of-words determined by NLP, the area under the receiver operating characteristic (AUROC) curve was 0.71, 0.68, 0.65, 0.69, and 0.65 correspondingly for SVM-RBF, AdaBoost, QDA, LASSO, and Ridge Regression. We then used the SVM-RBF model for feature selection by incrementally adding features to the model from 1 to 3000 bag-of-words. Through this exhaustive search approach, only 825 features (words) were dominant. Using those selected features, we trained and validated all ML models. The AUROC curve was 0.74, 0.69, 0.67, 0.70, and 0.71 respectively for SVM-RBF, AdaBoost, QDA, LASSO, and Ridge Regression. Overall, this technique could predict ICU readmission relatively well