3,283 research outputs found

    Unsupervised learning methods for identifying and evaluating disease clusters in electronic health records

    Introduction Clustering algorithms are a class of algorithms that can discover groups of observations in complex data and are often used to identify subtypes of heterogeneous diseases in electronic health records (EHR). Evaluating clustering experiments for biological and clinical significance is a vital but challenging task due to the lack of consensus on best practices. As a result, the translation of findings from clustering experiments to clinical practice is limited. Aim The aim of this thesis was to investigate and evaluate approaches that enable the evaluation of clustering experiments using EHR. Methods We conducted a scoping review of clustering studies in EHR to identify common evaluation approaches. We systematically investigated the performance of the identified approaches using a cohort of Alzheimer's Disease (AD) patients as an exemplar, comparing four different clustering methods (K-means, Kernel K-means, Affinity Propagation and Latent Class Analysis). Using the same population, we developed and evaluated a method (MCHAMMER) that tested whether clusterable structures exist in EHR. To develop this method we tested several cluster validation indices and methods of generating null data to see which are best at discovering clusters. To enable the robust benchmarking of evaluation approaches, we created a tool that generates synthetic EHR data containing known cluster labels across a range of clustering scenarios. Results Across 67 EHR clustering studies, the most popular internal evaluation metric was comparing cluster results across multiple algorithms (30% of studies). We examined this approach by conducting a clustering experiment on a population of 10,065 AD patients with 21 demographic, symptom and comorbidity features. K-means found 5 clusters, Kernel K-means found 2 clusters, Affinity Propagation found 5 and Latent Class Analysis found 6. 
K-means was found to give the best clustering solution, with the highest silhouette score (0.19), and was more predictive of outcomes. The five clusters found were: typical AD (n=2026), non-typical AD (n=1640), cardiovascular disease cluster (n=686), a cancer cluster (n=1710) and a cluster of mental health issues, smoking and early disease onset (n=1528), which has been found in previous research as well as in the results of other clustering methods. We created a synthetic data generation tool which allows for the generation of realistic EHR clusters that can vary in separation and number of noise variables to alter the difficulty of the clustering problem. We found that decreasing cluster separation significantly increased clustering difficulty, whereas adding noise variables increased difficulty but not significantly. To develop the tool for assessing cluster existence, we tested different methods of null dataset generation and cluster validation indices. The best-performing null dataset method was the min-max method, and the best-performing indices were the Calinski-Harabasz index (accuracy 94%), the Davies-Bouldin index (97%), the silhouette score (93%) and the BWC index (90%). We further found that when clusters were identified using the Calinski-Harabasz index they were more likely to have significantly different outcomes between clusters. Lastly, we repeated the initial clustering experiment, comparing 10 different pre-processing methods. The three best-performing methods were RBF kernel (2 clusters), MCA (4 clusters) and MCA and PCA (6 clusters). The MCA approach gave the best results, with the highest silhouette score (0.23) and meaningful clusters, producing 4 clusters: heart and circulatory (n=1379), early onset mental health (n=1761), male cluster with memory loss (n=1823) and female with more problems (n=2244). Conclusion We have developed and tested a series of methods and tools to enable the evaluation of EHR clustering experiments. 
We developed and proposed a novel cluster evaluation metric and provided a tool for benchmarking evaluation approaches in synthetic but realistic EHR
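
The silhouette scores quoted in this abstract (0.19 and 0.23) can be made concrete with a minimal sketch of how a mean silhouette score is computed. This is an illustration only, not the thesis code: it assumes Euclidean distance, scores singleton clusters as 0 by convention, and uses a hypothetical four-point toy dataset.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across the groups
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.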

    Psoriasis and comorbid diseases: Epidemiology.

    Psoriasis is a common chronic inflammatory disease of the skin that is increasingly being recognized as a systemic inflammatory disorder. Psoriatic arthritis is a well-known comorbidity of psoriasis. A rapidly expanding body of literature in various populations and settings supports additional associations between psoriasis and cardiometabolic diseases, gastrointestinal diseases, kidney disease, malignancy, infection, and mood disorders. The pathogenesis of comorbid disease in patients with psoriasis remains unknown; however, shared inflammatory pathways, cellular mediators, genetic susceptibility, and common risk factors are hypothesized to be contributing elements. As additional psoriasis comorbidities continue to emerge, education of health care providers is essential to ensuring comprehensive medical care for patients with psoriasis

    A study assessing the characteristics of big data environments that predict high research impact: application of qualitative and quantitative methods

    BACKGROUND: Big data offers new opportunities to enhance healthcare practice. While researchers have shown increasing interest in using it, little is known about what drives research impact. We explored predictors of research impact across three major sources of healthcare big data derived from the government and the private sector. METHODS: This study was based on a mixed methods approach. Using quantitative analysis, we first clustered peer-reviewed original research that used data from government sources derived through the Veterans Health Administration (VHA), and private sources of data from IBM MarketScan and Optum, using social network analysis. We analyzed a battery of research impact measures as a function of the data sources. Other main predictors were topic clusters and authors’ social influence. Additionally, we conducted key informant interviews (KII) with a purposive sample of high impact researchers who have knowledge of the data. We then compiled the findings of the KIIs into two case studies to provide a rich understanding of drivers of research impact. RESULTS: Analysis of 1,907 peer-reviewed publications using VHA, IBM MarketScan and Optum found that the overall research enterprise was highly dynamic and growing over time. With less than 4 years of observation, research productivity, use of machine learning (ML), natural language processing (NLP), and the Journal Impact Factor showed substantial growth. Studies that used ML and NLP, however, showed limited visibility. After adjustments, VHA studies had generally higher impact (10% and 27% higher annualized Google citation rates) compared to MarketScan and Optum (p<0.001 for both). Analysis of co-authorship networks showed that no single social actor, either a community of scientists or institutions, was dominating. Other key opportunities to achieve high impact based on KIIs include methodological innovations, under-studied populations and predictive modeling based on rich clinical data. 
CONCLUSIONS: Big data for purposes of research analytics has grown within the three data sources studied between 2013 and 2016. Despite important challenges, the research community is reacting favorably to the opportunities offered both by big data and advanced analytic methods. Big data may be a logical and cost-efficient choice to emulate research initiatives where RCTs are not possible

    Persistent environmental pollutants and risk of cardiovascular disease

    Persistent chemicals emitted in the environment can have a considerable impact on ecosystems and human health, now and in the future. One notorious group of persistent organic pollutants (POPs) is the per- and polyfluoroalkyl substances (PFAS). Since their production began in the 1940s for household and consumer products, they have accumulated in the environment and in humans via consumption of contaminated drinking water and food. They are hypothesized to induce metabolic disturbances, due to chemical similarities with fatty acids. Consequently, PFAS may have high societal and economic impact by increasing risk of obesity, type 2 diabetes (T2D) and cardiovascular disease (CVD). However, reports on these associations are scarce, and the underlying molecular pathways are still unclear. Therefore, in this PhD project, we aimed to i) investigate associations between PFAS and risk of several cardiometabolic diseases and ii) explore potential underlying pathways. In Paper I, we investigated cross-sectional associations between PFAS mixtures and body mass index (BMI) in European teenagers using meta-regression. Results showed a tendency for inverse associations between PFAS and BMI and indicated a potential for diverging contributions between PFAS compounds. In Paper II, using a nested case-control study on T2D including metabolomics data in Swedish adults, we found that PFAS correlated positively with glycerophospholipids and diacylglycerols. But whilst glycerophospholipids were associated with lower T2D risk, diacylglycerols were associated with higher T2D risk. This indicates a potential for diverging effects on disease risk. In Paper III, we investigated whether genetic polymorphisms in peroxisome proliferator-activated receptor gamma coactivator-1 alpha (PPARGC1A), which encodes a master regulator of pathways potentially disrupted by PFAS exposure, were associated with secondary cardiovascular events in a large consortium study. 
However, we did not find clear evidence for such associations. In Paper IV, we assessed associations of PFAS with blood lipids and incident CVD using case-control studies nested in two Swedish adult cohorts. We observed overall null associations with stroke, but a tendency for inverse associations with myocardial infarction as well as associations with higher HDL-cholesterol and lower triglycerides, but also with higher LDL-cholesterol. In Paper V, we included OMICs data (metabolites, proteins and genes), which linked PFAS to lower myocardial infarction risk via lipid and inflammatory pathways. Likewise, a group of ‘old POPs’, the organochlorine compounds (OCs), were linked to higher myocardial infarction risk via the same pathways and to higher stroke risk via mitochondrial pathways. Thus, although we found no evidence for associations between PFAS and increased cardiometabolic disease risk, the overall findings indicate associations of PFAS with metabolic disturbances, particularly lipid metabolism. This is a potential adverse effect on human physiology and warrants further attention

    Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records

    Modern sequencing technology has revolutionized genomic research and triggered explosive growth in DNA, RNA, and protein sequences. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated and user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps (feature extraction, dimensionality reduction, feature selection, and predictive model construction based on machine learning and deep learning approaches) to analyze sequences. We used enhancer, RNA N6-methyladenosine site and protein-protein interaction datasets to evaluate the validity of the tool. The results show that the tool can effectively perform biological sequence analysis and classification tasks. Applying machine learning algorithms to electronic medical record (EMR) data analysis is also included in this dissertation. Chronic kidney disease (CKD) is prevalent across the world and well defined by an estimated glomerular filtration rate (eGFR). The progression of kidney disease can be predicted if future eGFR can be accurately estimated using predictive analytics. Thus, I present a prediction model of eGFR that was built using Random Forest regression. The dataset includes demographic, clinical and laboratory information from a regional primary health care clinic. The final model included eGFR, age, gender, body mass index (BMI), obesity, hypertension, and diabetes, and achieved a mean coefficient of determination of 0.95. The estimated eGFRs were used to classify patients into CKD stages with high macro-averaged and micro-averaged metrics
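
The final step of this abstract, turning predicted eGFR values into CKD stages and scoring them with macro- and micro-averaged metrics, can be sketched as follows. This is a hedged illustration, not the dissertation's code: it assumes the standard KDIGO eGFR categories (G1 to G5, with G3 split into G3a and G3b), uses F1 as the example metric, and the eGFR values are hypothetical.

```python
def ckd_stage(egfr):
    """Map an eGFR value (mL/min/1.73 m^2) to a KDIGO category (assumed staging)."""
    thresholds = [(90, "G1"), (60, "G2"), (45, "G3a"), (30, "G3b"), (15, "G4")]
    for cutoff, stage in thresholds:
        if egfr >= cutoff:
            return stage
    return "G5"

def macro_micro_f1(y_true, y_pred):
    """Return (macro-averaged F1, micro-averaged F1) for single-label classes."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {lab: 0 for lab in labels}
    fp = {lab: 0 for lab in labels}
    fn = {lab: 0 for lab in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every stage counts equally.
    macro = sum(f1(tp[lab], fp[lab], fn[lab]) for lab in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Hypothetical measured and model-predicted eGFR values for six patients.
measured = [95, 72, 50, 38, 20, 10]
predicted = [91, 58, 52, 36, 22, 12]
true_stages = [ckd_stage(v) for v in measured]
pred_stages = [ckd_stage(v) for v in predicted]
macro_f1, micro_f1 = macro_micro_f1(true_stages, pred_stages)
```

For single-label problems like this one, the micro-averaged F1 equals plain accuracy, while the macro average penalizes a model that misses an entire stage even when that stage is rare.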

    Discovering hidden relationships between renal diseases and regulated genes through 3D network visualizations

    Background In a recent study, two-dimensional (2D) network layouts were used to visualize and quantitatively analyze the relationship between chronic renal diseases and regulated genes. The results revealed complex relationships between disease type, gene specificity, and gene regulation type, which led to important insights about the underlying biological pathways. Here we describe an attempt to extend our understanding of these complex relationships by reanalyzing the data using three-dimensional (3D) network layouts, displayed through 2D and 3D viewing methods. Findings The 3D network layout (displayed through the 3D viewing method) revealed that genes implicated in many diseases (non-specific genes) tended to be predominantly down-regulated, whereas genes regulated in a few diseases (disease-specific genes) tended to be up-regulated. This new global relationship was quantitatively validated through comparison to 1000 random permutations of networks of the same size and distribution. Our new finding appeared to be the result of using specific features of the 3D viewing method to analyze the 3D renal network. Conclusions The global relationship between gene regulation and gene specificity is the first clue from human studies that there exist common mechanisms across several renal diseases, which suggest hypotheses for the underlying mechanisms. Furthermore, the study suggests hypotheses for why the 3D visualization helped to make salient a new regularity that was difficult to detect in 2D. Future research that tests these hypotheses should enable a more systematic understanding of when and how to use 3D network visualizations to reveal complex regularities in biological networks. http://deepblue.lib.umich.edu/bitstream/2027.42/112972/1/13104_2010_Article_700.pd
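
The validation against 1000 random permutations mentioned in this abstract follows the general logic of a permutation test. The sketch below swaps in a simpler label-shuffling variant rather than the paper's permutation of whole networks: it asks whether down-regulated genes are linked to more diseases than up-regulated genes more strongly than chance would produce. The function name, the statistic and the toy data are all assumptions made for illustration.

```python
import random

def permutation_p_value(degrees, is_down, n_permutations=1000, seed=0):
    """One-sided permutation test on the difference in mean disease count.

    degrees: number of diseases each gene is implicated in (hypothetical input).
    is_down: True if the gene is down-regulated, False if up-regulated.
    """
    rng = random.Random(seed)

    def statistic(flags):
        down = [d for d, f in zip(degrees, flags) if f]
        up = [d for d, f in zip(degrees, flags) if not f]
        return sum(down) / len(down) - sum(up) / len(up)

    observed = statistic(is_down)
    shuffled = list(is_down)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between regulation and degree
        if statistic(shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical toy data: down-regulated genes appear in many diseases.
toy_degrees = [5, 6, 7, 6, 5, 1, 1, 2, 1, 2]
toy_is_down = [True] * 5 + [False] * 5
observed_diff, p = permutation_p_value(toy_degrees, toy_is_down)
```

A small p-value means few shuffled labelings reproduce a difference as large as the observed one, which is the sense in which a pattern is "quantitatively validated" against random permutations.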

    Construction of predictive model of interstitial fibrosis and tubular atrophy after kidney transplantation with machine learning algorithms

    Background: Interstitial fibrosis and tubular atrophy (IFTA) are the histopathological manifestations of chronic kidney disease (CKD) and one of the causes of long-term renal loss in transplanted kidneys. Necroptosis, a type of programmed cell death, plays an important role in the development of IFTA, and in the late functional decline and even loss of grafts. In this study, 13 machine learning algorithms were used to construct IFTA diagnostic models based on necroptosis-related genes. Methods: We screened all 162 “kidney transplant”–related cohorts in the GEO database and obtained five data sets (training sets: GSE98320 and GSE76882, validation sets: GSE22459 and GSE53605, and survival set: GSE21374). The training set was constructed after removing batch effects of GSE98320 and GSE76882 by using the SVA package. Differentially expressed gene (DEG) analysis was used to identify necroptosis-related DEGs. A total of 13 machine learning algorithms (LASSO, Ridge, Enet, Stepglm, SVM, glmboost, LDA, plsRglm, random forest, GBM, XGBoost, Naive Bayes, and ANNs) were used to construct 114 IFTA diagnostic models, and the optimal models were screened by their AUC values. Post-transplantation patients were then grouped using consensus clustering, and the different subgroups were further explored using PCA, Kaplan–Meier (KM) survival analysis, functional enrichment analysis, CIBERSORT, and single-sample Gene Set Enrichment Analysis. Results: A total of 55 necroptosis-related DEGs were identified by taking the intersection of the DEGs and necroptosis-related gene sets. Stepglm[both]+RF was the optimal model, with an average AUC of 0.822. 
A total of four molecular subgroups of renal transplantation patients were obtained by clustering, and significant upregulation of fibrosis-related pathways and upregulation of immune response–related pathways were found in the C4 group, which had poor prognosis. Conclusion: Based on the combination of the 13 machine learning algorithms, we developed 114 IFTA classification models. Furthermore, we tested the top model using two independent data sets from GEO
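
Screening candidate models by their average AUC, as this abstract describes, can be sketched in a few lines. This is an assumption-laden illustration rather than the study's pipeline: the AUC is computed with the pairwise (Mann-Whitney) formulation, and the model names, cohort names and scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_model_by_mean_auc(predictions, cohorts):
    """Pick the model with the highest AUC averaged over all cohorts.

    predictions: {model_name: {cohort_name: predicted scores}}
    cohorts: {cohort_name: true binary labels}
    """
    mean_auc = {
        model: sum(auc(cohorts[c], s) for c, s in per_cohort.items()) / len(per_cohort)
        for model, per_cohort in predictions.items()
    }
    return max(mean_auc, key=mean_auc.get), mean_auc

# Hypothetical labels and model scores for two toy cohorts.
toy_cohorts = {"cohort_a": [1, 1, 0, 0], "cohort_b": [1, 0, 1, 0]}
toy_predictions = {
    "model_x": {"cohort_a": [0.9, 0.8, 0.2, 0.1], "cohort_b": [0.7, 0.3, 0.6, 0.4]},
    "model_y": {"cohort_a": [0.6, 0.4, 0.5, 0.3], "cohort_b": [0.2, 0.8, 0.9, 0.1]},
}
winner, averages = best_model_by_mean_auc(toy_predictions, toy_cohorts)
```

Averaging across cohorts rather than picking the best single-cohort score favors models that generalize, which matters when the final model is then tested on independent data sets.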

    The concept of justifiable healthcare and how big data can help us to achieve it

    Over the last decades, the face of health care has changed dramatically, with big improvements in what is technically feasible. However, there are indicators that the current approach to evaluating evidence in health care is not holistic and hence in the long run, health care will not be sustainable. New conceptual and normative frameworks for the evaluation of health care need to be developed and investigated. The current paper presents a novel framework of justifiable health care and explores how the use of artificial intelligence and big data can contribute to achieving the goals of this framework