31 research outputs found

    Phenotyping with Partially Labeled, Partially Observed Data

    Identifying a group of individuals that share a common set of characteristics is a conceptually simple task, which is often difficult in practice. Such phenotyping problems emerge in various settings, including the analysis of clinical data. In this setting, phenotyping is often stymied by persistent data quality issues. These include a lack of reliable labels to indicate the presence or absence of characteristics of interest, and significant missingness in observed variables. This dissertation introduces methods for learning phenotypes when the data contain missing values (partially observed) and labels are scarce (partially labeled). Aim 1 utilizes an unsupervised probabilistic graphical model to learn phenotypes from partially observed data. Aim 2 introduces a related semi-supervised probabilistic graphical model for learning phenotypes from partially labeled clinical data. Finally, Aim 3 describes a method for training deep generative models when the training data contain missing values. The algorithm is then applied in a semi-supervised setting where it accounts for partially labeled data as well.

    Data-driven approaches for predicting asthma attacks in adults in primary care

    Background Asthma attacks cause approximately 270 hospitalisations and four deaths per day in the United Kingdom (UK). Previous attempts to construct data-driven risk prediction models of asthma attacks have lacked clinical utility: either producing inaccurate predictions or requiring patient data which are not cost-effective to collect on a large scale (such as electronic monitoring device data). Electronic Health Record (EHR) use throughout the UK enables researchers to harness comprehensive and panoramic patient data, but their cleaning and pre-processing requires sophisticated empirical experimentation and data analytics approaches. My objectives were to appraise the previously utilised methods in asthma attack risk prediction modelling for feature extraction, model development, and model selection, and to train and test a model in Scottish EHRs. Methods In this thesis, I used a Scottish longitudinal primary care EHR dataset with linked secondary care records, to investigate the optimisation of an asthma attack risk prediction model. To inform the model, I refined methods for estimation of asthma medication adherence from EHRs, compared model training data enrichment procedures, and evaluated measures for validating model performance. After conducting a critical appraisal of the methods employed in the literature, I trained and tested four statistical learning algorithms for prediction in the next four weeks, i.e. logistic regression, naïve Bayes classification, random forests, and extreme gradient boosting, and validated model performance in an unseen hold-out dataset. Training data enrichment methods were compared across all algorithms to establish whether the sensitivity of estimating relatively uncommon event incidence, such as asthma attacks in the general asthma population, could be improved. Secondary event horizons were also examined, such as prediction in the next six months. 
Empirical experimentation established the balanced accuracy to be the most appropriate prediction model performance measure, and the calibration between estimated and observed risk was additionally assessed using the Area Under the Receiver Operating Characteristic Curve (AUC). Results Data were available for over 670,000 individuals, followed for up to 17 years (177,306 person-years in total). Binary prediction of asthma attacks in the following four-week period resulted in 1,203,476 data samples, of which 1% contained one or more attacks (12,193 total attacks). In the preliminary model selection phase, the random forest algorithm provided the best balance between accuracy in those with asthma attacks (sensitivity) and in those predicted to have attacks (positive predictive value) in the following four weeks. In an unseen data partition, the final random forest model, with optimised hyper-parameters, achieved an AUC of 0.91, and a balanced accuracy of 73.6% after the application of an optimised decision threshold. Accurate predictions were made for a median of 99.6% of those who did not go on to have attacks (specificity). As expected with rare event predictions, the sensitivity was lower at 47.7%, but this was well balanced with the positive predictive value of 48.9%. Furthermore, several of the secondary models, including predicting asthma attacks in the following 12 weeks, achieved state-of-the-art performance and still had high potential clinical utility. Conclusions I successfully developed an EHR-based model for predicting asthma attacks in the next four weeks. Accurately predicting asthma attack occurrence may facilitate closer monitoring to ensure that preventative therapy is adequately managing symptoms, reinforce the need to keep abreast of triggers, and allow rescue treatments to be administered quickly when necessary.
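
As a rough illustration of how the measures reported in this abstract relate to one another, the sketch below computes sensitivity, specificity, positive predictive value and balanced accuracy from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the thesis; only the 47.7% sensitivity and 99.6% specificity come from the abstract. Taking their simple mean gives about 73.65%, close to the reported 73.6%; the small gap is plausibly due to rounding and to the specificity being quoted as a median.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Balanced accuracy is the unweighted mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # correct among those who had an attack
    specificity = tn / (tn + fp)  # correct among those who had no attack
    ppv = tp / (tp + fp)          # correct among those predicted to have an attack
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "balanced_accuracy": balanced_accuracy(sensitivity, specificity),
    }


# Values quoted in the abstract (sensitivity 47.7%, specificity 99.6%).
reported = balanced_accuracy(0.477, 0.996)

# Hypothetical counts, chosen only to mimic a rare-event setting.
example = classification_metrics(tp=48, fp=50, tn=9850, fn=52)

print(f"balanced accuracy from reported values: {reported:.4f}")
print({name: round(value, 4) for name, value in example.items()})
```

Balanced accuracy is often preferred over plain accuracy for rare events because a model that never predicts an attack would score about 99% accuracy but only 50% balanced accuracy.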

    Investigating the natural history of liver disease in type 2 diabetes and predicting the risk of its progression to advanced disease

    INTRODUCTION: In people with Type 2 diabetes, chronic liver disease, particularly non-alcoholic fatty liver disease (NAFLD), is more common and has an increased risk of progression to cirrhosis and hepatocellular carcinoma. European guidelines (European Association for the Study of the Liver, European Association for the Study of Diabetes and European Association for the Study of Obesity) recommend screening for NAFLD in Type 2 diabetes, yet both the natural history of liver disease in Type 2 diabetes and the factors associated with higher risk of progression to clinically significant disease are still incompletely understood. Further, it is thought that the recommended generic NAFLD risk prediction tools may perform sub-optimally in people with Type 2 diabetes. AIMS: This study aimed to use a community cohort of over one thousand older people with Type 2 diabetes followed for 11 years to: • Define the absolute and relative cohort incidence of liver disease to date. • Determine whether current non-invasive fibrosis risk prediction tools reliably identified incident cirrhosis and hepatocellular carcinoma in people with Type 2 diabetes. • Determine whether the addition of baseline biomarkers to existing fibrosis risk prediction tools improved their ability to predict incident cirrhosis and hepatocellular carcinoma. • Identify whether potential non-invasive tests for non-alcoholic fatty liver disease (those identifying steatosis, serum liver enzymes, markers of fibrosis) are associated with incident cirrhosis, hepatocellular carcinoma or all-cause mortality. METHODS: The Edinburgh Type 2 Diabetes Study recruited men and women with Type 2 diabetes (n=1,066, aged 60–75 at baseline) in 2006. Liver markers were measured at baseline and year 1; steatosis and fibrosis markers were calculated according to independently published formulae. During follow-up, cases of cirrhosis and HCC were identified. 
Logistic regression (odds ratio) was used to determine associations between markers and outcomes, with competing risks regression used for sensitivity analyses. The predictive ability of tests was assessed using sensitivity, specificity, positive predictive value, negative predictive value, false positive and false negative rates. RESULTS: Over 11 years 43/1059 participants with no baseline cirrhosis or hepatocellular carcinoma developed incident liver disease. The 11-year incidence of liver cirrhosis was 3.92 per 1000 person years and of hepatocellular carcinoma 1.28 per 1000 person years (whole population rates). 58% of those with cirrhosis had clinical complications of varices, ascites or hepatic encephalopathy. Existing non-invasive NAFLD fibrosis risk-stratification tools (AST:ALT ratio, AST: platelet ratio index (APRI), Enhanced Liver Fibrosis panel (ELF), Fibrosis 4 index (FIB-4), NAFLD Fibrosis Score (NFS)) were significantly associated (Odds Ratios, p20% or >35% respectively for the identification of cirrhosis or HCC. A raised Fatty Liver Index was statistically associated with mortality (hazard ratio 1.45 (1.13-1.87)) but all tests showed high false positive and false negative rates (>20% or >75% respectively) for mortality. CONCLUSIONS: The increased incidence of cirrhosis and hepatocellular carcinoma in people with Type 2 diabetes was confirmed, with NAFLD the predominant aetiology. Markers of fibrosis were associated with incident cirrhosis and hepatocellular carcinoma but no non-invasive risk prediction tools reliably identified participants at increased risk of incident disease. The addition of hyaluronic acid to FIB-4 showed promise by reducing the proportion of people inappropriately identified as ‘high-risk’ but no combination of tests examined provided a ‘good balance’ between false positive and negative rates in the identification of risk for cirrhosis, HCC or mortality. 
These results need to be validated in independent cohorts but suggest that the evidence does not exist for formal liver disease screening in people with Type 2 diabetes and presently the only option for non-invasive liver disease surveillance is to use tests with a relatively low false positive rate in order to identify a proportion of those likely to develop incident cirrhosis and HCC.
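
The incidence and error-rate figures quoted in this abstract follow standard definitions, which can be sketched as below. This is a minimal illustration only: the function names and all input counts are hypothetical, since the abstract does not report per-outcome event counts, person-years at risk, or confusion-matrix cells.

```python
def incidence_rate_per_1000(events: int, person_years: float) -> float:
    """Incidence rate expressed per 1000 person-years of follow-up."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return 1000.0 * events / person_years


def error_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """False positive and false negative rates of a binary risk test."""
    return {
        # Share of people who stayed disease-free but were flagged high-risk.
        "false_positive_rate": fp / (fp + tn),
        # Share of people who developed disease but were not flagged.
        "false_negative_rate": fn / (fn + tp),
    }


# Hypothetical inputs, not taken from the study.
rate = incidence_rate_per_1000(events=10, person_years=2500)
rates = error_rates(tp=6, fp=200, tn=800, fn=4)

print(f"{rate:.2f} per 1000 person-years")
print(rates)
```

With these made-up counts the test would flag 20% of unaffected people and miss 40% of incident cases, which is the kind of trade-off the abstract describes as lacking a good balance.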

    Genetic studies of cardiometabolic traits

    Diet and lifestyle have changed dramatically in the last few decades, leading to an increase in prevalence of obesity (defined as a body mass index >30 kg/m²), dyslipidaemias (defined as abnormal lipid profiles) and type 2 diabetes (T2D). Together, these cardiometabolic traits and diseases have contributed to the increased burden of cardiovascular disease, the leading cause of death in Western societies. Complex traits and diseases, such as cardiometabolic traits, arise as a result of the interaction between an individual’s predisposing genetic makeup and a permissive environment. Since 2007, genome-wide association studies (GWAS) have been successfully applied to complex traits leading to the discovery of thousands of trait-associated variants. Nonetheless, much is still to be understood regarding the genetic architecture of these traits, as well as their underlying biology. This thesis aims to further explore the genetic architecture of cardiometabolic traits by using complementary approaches with greater genetic and phenotype resolution, ranging from studying clinically ascertained extreme phenotypes, deep molecular profiling, or sequence level data. In chapter 2, I investigated the genetic architecture of healthy human thinness (N=1,471) and contrasted it to that of severe early onset childhood obesity (N=1,456). I demonstrated that healthy human thinness, like severe obesity, is a heritable trait, with a polygenic component. I identified a novel BMI-associated locus at PKHD1, and found evidence of association at several loci that had only been discovered using large cohorts with >40,000 individuals, demonstrating the power gains in studying clinical extreme phenotypes. In chapter 3, I coupled high-resolution nuclear magnetic resonance (NMR) measurements in healthy blood donors, with next-generation sequencing to establish the role of rare coding variation in circulating metabolic biomarker biology. 
In gene-based analysis, I identified ACSL1, MYCN, FBXO36 and B4GALNT3 as novel gene-trait associations (P<2.5×10⁻⁶). I also found a novel link between loss-of-function mutations in the “regulation of the pyruvate dehydrogenase (PDH) complex” pathway and intermediate-density lipoprotein (IDL), low-density lipoprotein (LDL) and circulating cholesterol measurements. In addition, I demonstrated that rare “protective” variation in lipoprotein metabolism genes was present in the lower tails of four measurements which are CVD risk factors in this healthy population, demonstrating a role for rare coding variation and the extremes of healthy phenotypes. In chapter 4, I performed a genome-wide association study of fructosamine, a measurement of total serum protein glycation which is useful to monitor rapid changes in glycaemic levels after treatment, as it reflects average glycaemia over 2-3 weeks. In contrast to HbA1c, which reflects average glucose concentration over the life-span of the erythrocyte (~3 months), fructosamine levels are not predicted to be influenced by factors affecting the erythrocyte. Surprisingly, I found that in this dataset fructosamine had low heritability (2% vs 20% for HbA1c), and was poorly correlated with HbA1c and other glycaemic traits. Despite this, I found two loci previously associated with glycaemic or albumin traits, G6PC2 and FCGRT respectively (P<5×10⁻⁸), associated with fructosamine, suggesting shared genetic influence. Altogether my results demonstrate the utility of higher resolution genotype and phenotype data in further elucidating the genetic architecture of a range of cardiometabolic traits, and the power advantages of study designs that focus on individuals at the extremes of phenotype distribution. 
As large cohorts and national biobanks with sequencing and deep multi-dimensional phenotyping become more prevalent, we will be moving closer to understanding the multiple aetiological mechanisms leading to CVD, and subsequently improving diagnosis and treatment of these conditions. Wellcome Sanger Institute CONACy
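
The two significance thresholds quoted in this abstract match conventional Bonferroni-style corrections at a family-wise error rate of 0.05. The sketch below shows the arithmetic, assuming roughly 20,000 genes for the gene-based tests and roughly one million independent common variants for the genome-wide scan. Both test counts are widely used conventions and are assumptions here; the abstract does not state how its thresholds were derived, and the function name is hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that controls the family-wise error rate."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed number of tests; conventions, not figures from the thesis.
N_GENES = 20_000                   # approximate count of protein-coding genes
N_INDEPENDENT_VARIANTS = 1_000_000  # conventional count for common variants

gene_level = bonferroni_threshold(0.05, N_GENES)
genome_wide = bonferroni_threshold(0.05, N_INDEPENDENT_VARIANTS)

print(f"gene-based threshold:  {gene_level:.1e}")
print(f"genome-wide threshold: {genome_wide:.1e}")
```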

    Liver Transplantation

    This book covers a wide spectrum of topics including, but not limited to, the technical issues in living and deceased donor liver transplant procedures, cell and experimental liver transplantation, and the complications of liver transplantation. Some of the very important topics, such as arterial reconstruction in living donor liver transplantation, biliary complications, and post-transplant lymphoproliferative disorders (PTLD), have been covered in more than one chapter.