3 research outputs found

    Improving Type 2 Diabetes Phenotypic Classification by Combining Genetics and Conventional Risk Factors

    Type 2 Diabetes is a multifactorial disorder that involves the convergence of genetic, environmental, dietary and lifestyle risk factors. This paper investigates genetic and conventional (clinical, sociodemographic) risk factors and their predictive power in classifying Type 2 Diabetes. Six statistically significant Single Nucleotide Polymorphisms (SNPs) associated with Type 2 Diabetes are derived by conducting logistic association analysis. The derived SNPs, together with conventional risk factors, are used to train supervised machine learning algorithms to classify cases and controls in genome-wide association studies (GWAS). Models are trained using genetic variables only, genetic and conventional variables combined, and conventional variables only. The results demonstrate that, of the three models, predictive capacity is highest when genetic and conventional predictors are combined. Using a Random Forest classifier, the Area Under the Curve is 73.96%, Sensitivity is 68.42%, and Specificity is 78.67%.
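The three metrics quoted in the abstract above (AUC, sensitivity, specificity) can be computed from case/control labels and classifier scores. The sketch below is a minimal standard-library illustration and is not the paper's code: the function names, the 0.5 decision threshold and the toy data are all assumptions made for the example.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = case, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases called cases
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls called controls
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls called cases
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Toy data invented for illustration only (not from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.35, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarises ranking quality across all thresholds, which is why abstracts of this kind usually report all three.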

    RaSaR: A Novel Methodology for the Detection of Epistasis

    Complex diseases, which affect a large proportion of the population today, demand more strategic methods to produce significant association results. As it currently stands, numerous disorders and diseases have yet to be linked to a causal genetic variant, despite research evidence indicating high genetic concordance. Breast Cancer is one of the most prominent cancers in the female population, with approximately 55K new cases and approximately 11K deaths each year in the UK. The genetic component of Breast Cancer is a popular research area and has uncovered many genetic associations, from high to low penetrance. The dataset used within this research is obtained from the DRIVE project, one of five introduced under the GAME-ON initiative. The general research use DRIVE dataset contains approximately 533K single-nucleotide polymorphisms (SNPs), with more than 280K sequenced with reference to the 5 most prominent cancers: colon, breast, ovarian, prostate and lung. SNPs are sequenced for approximately 28K subjects, of which approximately 14K were diagnosed with one of three stages of Breast Cancer: unknown, in-situ and invasive. Epistasis analysis is a progressive approach that complements the ‘common disease, common variant’ hypothesis by highlighting the potential for connected networks of genetic variants to collaborate in producing a phenotypic expression. It is commonly performed either pairwise or with unlimited arity, treating variant networks as variant-versus-variant pairs or as higher-order interactions. This type of analysis increases the number of tests beyond those performed in a standard approach such as GWAS, in which the false discovery rate (FDR) was already an issue; multiplying the number of tests up to a factorial rate therefore worsens the FDR problem. Further to this, epistasis introduces its own limitations of computational complexity, which depend on the analysis performed; in the most intensive case, a multivariate analysis has a time complexity of O(n!). Throughout this thesis, approaches, methods and techniques for epistasis analysis and GWAS are discussed, as well as the limitations that exist and how to address them. This thesis proposes a novel methodology for the detection of epistasis, using interpretable methods and best practice to outline interactions through filtering processes. RaSaR refers to the process of Random Sampling Regularisation, which randomly splits the data into sample sets and applies a voting system to regularise the significance and reliability of biological markers (SNPs). In parallel, the proposed methodology adjusts for the common limitations of computational complexity and false discovery using filter selection and a novel method of association analysis. Preliminary results are promising, outlining a concise detection of interactions when benchmarked against standard approaches that apply the common corrections for multiple testing. Results for the detection of epistasis in the classification of breast cancer patients indicated nine candidate risk interactions from five variants, and a single candidate variant with a high protective association.
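To make the multiple-testing growth described in the abstract above concrete, the sketch below counts the SNP combinations that would be tested at each interaction order for a panel of roughly 533K SNPs. It is an illustration only: the function names are hypothetical, and the Bonferroni threshold is used as a simple stand-in because the abstract does not say which correction the thesis applies (Bonferroni controls the family-wise error rate rather than the FDR).

```python
import math


def n_tests(n_snps, order):
    """Number of distinct SNP combinations of a given interaction order."""
    return math.comb(n_snps, order)


def bonferroni_threshold(alpha, tests):
    """Per-test significance threshold under a Bonferroni correction (assumed here)."""
    return alpha / tests


n_snps = 533_000  # approximate SNP count quoted in the abstract

single = n_tests(n_snps, 1)   # standard single-SNP GWAS scan
pairs = n_tests(n_snps, 2)    # exhaustive pairwise epistasis scan
triples = n_tests(n_snps, 3)  # exhaustive third-order scan

single_threshold = bonferroni_threshold(0.05, single)
pair_threshold = bonferroni_threshold(0.05, pairs)

print(f"single-SNP tests: {single:,}")
print(f"pairwise tests:   {pairs:,}")
print(f"3-way tests:      {triples:,}")
print(f"pairwise Bonferroni threshold: {pair_threshold:.3e}")
```

Even the pairwise scan requires on the order of 10^11 tests, which is why filtering SNPs before interaction testing is a common way to keep both the computation and the correction burden manageable.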

    High Dimensional Analysis of Genetic Data for the Classification of Type 2 Diabetes Using Advanced Machine Learning Algorithms

    The prevalence of type 2 diabetes (T2D) has increased steadily over the last thirty years and has now reached epidemic proportions. The secondary complications associated with T2D have significant health and economic impacts worldwide, and it is now regarded as the seventh leading cause of mortality. Therefore, understanding the underlying causes of T2D is high on government agendas. The condition is a multifactorial disorder with a complex aetiology: T2D emerges from the convergence of genetics, the environment, diet and lifestyle choices. The genetic determinants remain largely elusive, with only a handful of identified candidate genes. Genome-wide association studies (GWAS) have enhanced our understanding of genetic determinants in common complex human diseases. To date, 120 single nucleotide polymorphisms (SNPs) for T2D have been identified using GWAS. Standard statistical tests for single and multi-locus analysis, such as logistic regression, have demonstrated little effect in understanding the genetic architecture of complex human diseases. Logistic regression can capture linear interactions between SNPs and traits; however, it neglects the non-linear epistatic interactions that are often present within genetic data. Complex human diseases are caused by the contributions of many interacting genetic variants. However, detecting epistatic interactions and understanding the underlying pathogenesis of complex human disorders remains a significant challenge. This thesis presents a novel framework based on deep learning to reduce the high-dimensional space in GWAS and learn non-linear epistatic interactions in T2D genetic data for binary classification tasks. The framework includes traditional GWAS quality control, association analysis, deep learning stacked autoencoders, and a multilayer perceptron for classification. Quality control procedures are conducted to exclude genetic variants and individuals that do not meet pre-specified criteria. Logistic association analysis under an additive genetic model, adjusted for the genomic control inflation factor, is also conducted. Only SNPs that pass a p-value threshold of 10^-2 are retained, resulting in 6609 SNPs (features); this removes statistically improbable SNPs and helps minimise the computational requirements of processing all SNPs. The 6609 SNPs are used for epistatic analysis through progressively smaller hidden layer units. Latent representations are extracted using stacked autoencoders to initialise a multilayer feedforward network for binary classification. The classifier is fine-tuned to discriminate between cases and controls using T2D genetic data. The performance of the deep learning stacked autoencoder model is evaluated and benchmarked against a multilayer perceptron and a random forest learning algorithm. The findings show that the best results were obtained using 2500 compressed hidden units (AUC=94.25%). However, classification performance when using 300 compressed neurons remains reasonable (AUC=80.78%). The results are promising. Using deep learning stacked autoencoders, it is possible to reduce high-dimensional features in T2D GWAS data and learn non-linear epistatic interactions between SNPs while enhancing overall model performance for binary classification purposes.
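The filtering step described in the abstract above (genomic control adjustment followed by a p-value threshold of 0.01) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the thesis pipeline: it assumes 1-degree-of-freedom chi-square association statistics, uses hypothetical function and SNP names with invented toy values, and follows the common convention of never deflating statistics when the inflation factor falls below 1.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Genomic control inflation factor: observed median over expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def chi2_1df_pvalue(x):
    """Upper-tail p-value of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(x / 2.0))


def filter_snps(chi2_by_snp, threshold=1e-2):
    """Adjust statistics by the inflation factor and keep SNPs with p < threshold."""
    # Assumption: statistics are not deflated when lambda < 1.
    lam = max(genomic_inflation(list(chi2_by_snp.values())), 1.0)
    kept = {}
    for snp, stat in chi2_by_snp.items():
        p = chi2_1df_pvalue(stat / lam)
        if p < threshold:
            kept[snp] = p
    return lam, kept


# Toy statistics invented for illustration only.
toy = {"snp_a": 0.10, "snp_b": 0.45, "snp_c": 0.50, "snp_d": 1.20, "snp_e": 12.0}
lam, kept = filter_snps(toy)
print(f"lambda = {lam:.3f}, retained SNPs: {sorted(kept)}")
```

A lenient threshold such as 0.01 is a dimensionality-reduction choice rather than a significance claim: it keeps weakly associated SNPs that may still contribute to interactions learned by the downstream model.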