7 research outputs found

    The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases

    Genetic epidemiologists have taken up the challenge of identifying genetic polymorphisms involved in the development of diseases. Many have collected data on large numbers of genetic markers but are not familiar with available methods to assess their association with complex diseases. Statistical methods have been developed for analyzing the relation of large numbers of genetic and environmental predictors to disease or disease-related variables in genetic association studies. In this commentary we discuss logistic regression analysis, neural networks, including the parameter decreasing method (PDM) and genetic programming optimized neural networks (GPNN), and several non-parametric methods, which include the set association approach, combinatorial partitioning method (CPM), restricted partitioning method (RPM), multifactor dimensionality reduction (MDR) method and the random forests approach. The relative strengths and weaknesses of these methods are highlighted. Logistic regression and neural networks can handle only a limited number of predictor variables, depending on the number of observations in the dataset. Therefore, they are less useful than the non-parametric methods for association studies with large numbers of predictor variables. GPNN, on the other hand, may be a useful approach to select and model important predictors, but its ability to select the important effects in the presence of large numbers of predictors needs to be examined. Both the set association approach and the random forests approach are able to handle a large number of predictors and are useful in reducing these predictors to a subset with an important contribution to disease. The combinatorial methods give more insight into combination patterns for sets of genetic and/or environmental predictor variables that may be related to the outcome variable.
Because the non-parametric methods have different strengths and weaknesses, we conclude that, for genetic association studies using the case-control design, applying a combination of several methods, including the set association approach, MDR and the random forests approach, is likely to be a useful strategy to find the important genes and interaction patterns involved in complex diseases.
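
As a rough illustration of the combinatorial idea behind MDR mentioned in the abstract above, the sketch below pools the genotype combinations of each SNP pair into "high-risk" and "low-risk" cells and scores the pair by balanced accuracy. It is a simplified sketch under stated assumptions, not the published procedure: genotypes are assumed to be coded as small integers, cross-validation and permutation testing are omitted, and the function name `mdr_pair_scores` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def mdr_pair_scores(genotypes, labels):
    """Score every SNP pair by the balanced accuracy of a high/low-risk rule.

    genotypes: list of rows, each a list of genotype codes (e.g. 0, 1, 2) per SNP.
    labels: list of 0 (control) or 1 (case), one entry per row.
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")

    scores = {}
    for i, j in combinations(range(len(genotypes[0])), 2):
        cases, controls = Counter(), Counter()
        for row, y in zip(genotypes, labels):
            cell = (row[i], row[j])
            if y == 1:
                cases[cell] += 1
            else:
                controls[cell] += 1
        # A cell is high-risk when its case/control ratio exceeds the overall
        # ratio; cross-multiplying avoids dividing by an empty control count.
        high_risk = {
            cell
            for cell in set(cases) | set(controls)
            if cases[cell] * n_controls > controls[cell] * n_cases
        }
        sensitivity = sum(cases[c] for c in high_risk) / n_cases
        specificity = sum(n for c, n in controls.items() if c not in high_risk) / n_controls
        scores[(i, j)] = 0.5 * (sensitivity + specificity)
    return scores
```

On toy data where disease status depends only on the combination of the first two SNPs, the pair `(0, 1)` would receive the highest score, while pairs involving an unrelated SNP stay near 0.5. Scores computed on the training data alone are optimistic, which is why a real analysis would add cross-validation.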

    Protocol for investigating genetic determinants of posttraumatic stress disorder in women from the Nurses' Health Study II

    Background: One in nine American women will meet criteria for the diagnosis of posttraumatic stress disorder (PTSD) in their lifetime. Although twin studies suggest genetic influences account for substantial variance in PTSD risk, little progress has been made in identifying variants in specific genes that influence liability to this common, debilitating disorder.
    Methods and design: We are using the unique resource of the Nurses' Health Study II, a prospective epidemiologic cohort of 68,518 women, to conduct what promises to be the largest candidate gene association study of PTSD to date. The entire cohort will be screened for trauma exposure and PTSD; 3,000 women will be selected for PTSD diagnostic interviews based on the screening data. Our nested case-control study will genotype 1000 women who developed PTSD following a history of trauma exposure; 1000 controls will be selected from women who experienced similar traumas but did not develop PTSD. The primary aim of this study is to detect genetic variants that predict the development of PTSD following trauma. We posit that inherited vulnerability to PTSD is mediated by genetic variation in three specific neurobiological systems whose alterations are implicated in PTSD etiology: the hypothalamic-pituitary-adrenal axis, the locus coeruleus/noradrenergic system, and the limbic-frontal neuro-circuitry of fear. The secondary, exploratory aim of this study is to dissect genetic influences on PTSD in the broader genetic and environmental context for the candidate genes that show significant association with PTSD in detection analyses. This will involve: conducting conditional tests to identify the causal genetic variant among multiple correlated signals; testing whether the effect of PTSD genetic risk variants is moderated by age of first trauma, trauma type, and trauma severity; and exploring gene-gene interactions using a novel gene-based statistical approach.
    Discussion: Identification of liability genes for PTSD would represent a major advance in understanding the pathophysiology of the disorder. Such understanding could advance the development of new pharmacological agents for PTSD treatment and prevention. Moreover, the addition of PTSD assessment data will make the NHSII cohort an unparalleled resource for future genetic studies of PTSD as well as provide the unique opportunity for the prospective examination of PTSD-disease associations.

    High Dimensional Analysis of Genetic Data for the Classification of Type 2 Diabetes Using Advanced Machine Learning Algorithms

    The prevalence of type 2 diabetes (T2D) has increased steadily over the last thirty years and has now reached epidemic proportions. The secondary complications associated with T2D have significant health and economic impacts worldwide, and it is now regarded as the seventh leading cause of mortality. Therefore, understanding the underlying causes of T2D is high on government agendas. The condition is a multifactorial disorder with a complex aetiology. This means that T2D emerges from the convergence of genetics, the environment, diet and lifestyle choices. The genetic determinants remain largely elusive, with only a handful of identified candidate genes. Genome-wide association studies (GWAS) have enhanced our understanding of genetic-based determinants in common complex human diseases. To date, 120 single nucleotide polymorphisms (SNPs) for T2D have been identified using GWAS. Standard statistical tests for single and multi-locus analysis, such as logistic regression, have had limited success in explaining the genetic architecture of complex human diseases. Logistic regression can capture linear interactions between SNPs and traits; however, it neglects the non-linear epistatic interactions that are often present within genetic data. Complex human diseases are caused by the contributions made by many interacting genetic variants. However, detecting epistatic interactions and understanding the underlying pathogenesis architecture of complex human disorders remains a significant challenge. This thesis presents a novel framework based on deep learning to reduce the high-dimensional space in GWAS and learn non-linear epistatic interactions in T2D genetic data for binary classification tasks. This framework includes traditional GWAS quality control, association analysis, deep learning stacked autoencoders, and a multilayer perceptron for classification.
Quality control procedures are conducted to exclude genetic variants and individuals that do not meet pre-specified criteria. Logistic association analysis under an additive genetic model, adjusted for the genomic control inflation factor, is also conducted. Only SNPs passing a p-value threshold of 10⁻² are retained, which yields 6609 SNPs (features), removes statistically improbable SNPs and helps minimise the computational requirements of processing all SNPs. The 6609 SNPs are used for epistatic analysis through progressively smaller hidden layer units. Latent representations are extracted using stacked autoencoders to initialise a multilayer feedforward network for binary classification. The classifier is fine-tuned to discriminate between cases and controls using T2D genetic data. The performance of a deep learning stacked autoencoder model is evaluated and benchmarked against a multilayer perceptron and a random forest learning algorithm. The findings show that the best results were obtained using 2500 compressed hidden units (AUC = 94.25%). However, the classification performance when using 300 compressed neurons remains reasonable (AUC = 80.78%). The results are promising. Using deep learning stacked autoencoders, it is possible to reduce high-dimensional features in T2D GWAS data and learn non-linear epistatic interactions between SNPs while enhancing overall model performance for binary classification purposes.
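
The feature-selection step described in the abstract above (genomic control adjustment followed by a p-value cut-off of 10⁻²) can be sketched roughly as follows. This is an illustration under assumptions, not the thesis code: it assumes each SNP already has a 1-degree-of-freedom chi-square association statistic, applies the standard genomic control correction (dividing each statistic by the inflation factor λ, the median statistic over 0.4549), and leaves statistics unchanged when λ is below 1. The function name and the SNP identifiers in the usage note are hypothetical.

```python
import math
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_filter(chi2_stats, p_threshold=1e-2):
    """Return (lambda, {snp: adjusted p-value}) for SNPs below the threshold.

    chi2_stats: dict mapping SNP identifier to its 1-df chi-square statistic.
    """
    lam = median(chi2_stats.values()) / CHI2_1DF_MEDIAN
    # Assumed convention: do not inflate statistics when lambda < 1.
    lam = max(lam, 1.0)

    kept = {}
    for snp, stat in chi2_stats.items():
        adjusted = stat / lam
        # Upper-tail probability of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
        p_value = math.erfc(math.sqrt(adjusted / 2.0))
        if p_value < p_threshold:
            kept[snp] = p_value
    return lam, kept
```

For example, `genomic_control_filter({"snp_a": 0.45, "snp_b": 0.10, "snp_c": 10.0})` would keep only the hypothetical `snp_c`. The retained SNPs would then form the feature columns passed to the autoencoder stage.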

    Phenotypes and genetic markers of cancer cachexia

    Cancer cachexia is a chronic wasting syndrome characterised by loss of weight, composed principally of muscle and fat. Patients with advanced cachexia demonstrate loss of appetite, early satiety, severe weight loss, weakness, anaemia and fluid retention. Affected individuals are also likely to report or experience decreased quality of life, decreased levels of physical performance, increased levels of fatigue, increased risks of treatment failure (be it chemotherapy, radiotherapy or surgery), increased risks of treatment side effects, and an increased mortality rate. Cachexia is therefore an extremely important, yet often underappreciated, cause of cancer patient morbidity and mortality which requires urgent attention. Weight loss is significantly associated with cancer morbidity and mortality. It has been observed that half of all cancer patients experience weight loss and one-third lose more than 5% of their original body weight. Skeletal muscle loss appears to be the most significant event in cachexia and is associated with a poor outcome. However, it is not known why some patients with the same tumour lose weight and muscle mass whilst others do not. The main aim of this thesis was to determine if the genetic makeup of individual patients might contribute to their propensity to lose weight or skeletal muscle. Previous studies had suggested an association between weight loss and SNPs on genes concerned with innate immunity, particularly the cell adhesion molecule P-selectin; however, the strength of any gene association study depends on the precision with which it is possible to characterise the phenotype in question. A second aim of this thesis was to explore refining the clinical phenotyping of patients to discriminate those with evidence of muscle fibre atrophy from those without. Phenotype: The conventional phenotype for cachexia is weight loss (WL), but the extent to which loss of body mass reflects loss of muscle or fat mass is unknown.
Recent progress in cross-sectional imaging analysis means that it is now possible to gain a direct measure of muscle mass from routine diagnostic CT scanning. However, in the absence of a longitudinal series of scans it is not possible to estimate whether low muscularity (LM) is longstanding or not. By combining a measure of active weight loss with low muscularity, it was hoped that such a composite measure would reflect actual muscle loss / fibre atrophy. In patients with LM or LM+>2%WL, mean muscle fibre diameter was about 25% lower than in non-cachectic cancer patients (p = 0.02 and p = 0.001 respectively). No significant difference in muscle fibre diameter was observed if patients had WL alone. Regardless of classification, there was no difference in fibre number or proportion of fibre type across all myosin heavy chain isoforms. Mean muscle protein content was reduced and the ratio of RNA/DNA decreased in patients with either >5%WL or LM+>2%WL. These findings support the use of composite measures (WL and LM) to try to identify those patients with evidence of active muscle fibre atrophy. This novel clinical phenotyping provides an accurate method to enable the conduct of candidate gene studies in the investigation of the genetics of cancer cachexia where the primary focus is on muscle wasting rather than overall weight loss. Genotype: In an ideal world it would be possible to explore the entire genome and look for associations with the different phenotypes of cachexia. However, to do so would require considerable resources in terms of the cost of genome-wide analysis and the cost of phenotyping large enough cohorts of patients (3000-10000). To address these issues I therefore adopted a candidate gene approach. A total of 154 genes associated with cancer cachexia were identified and explored for associated polymorphisms.
Of these 154 genes, 119 had a combined total of 281 polymorphisms with functional and/or clinical significance in terms of cachexia associated with them. Of these, 80 polymorphisms (in 51 genes) were replicated in more than one study, with 24 polymorphisms found to influence two or more hallmarks of cachexia (i.e. inflammation, loss of fat mass and/or lean mass and reduced survival). Such selection of candidate genes and polymorphisms is a key element of multigene study design. The systematic review provides a contemporary basis to select genes and/or polymorphisms for further association studies in cancer cachexia, and to develop their potential as susceptibility biomarkers of cachexia. Phenotype-genotype associations: A total of 1276 patients were recruited, phenotyped and genotyped. There were 545 new patients and 731 patients from a previous study. In our new cohort, and in keeping with the previous literature, patients who carried the C allele of the rs6136 SNP in the SELP gene were at a reduced risk of developing cachexia defined by WL. This association applied to all degrees of weight loss (>5%, >10% or >15%), and not just at the >10% level as described previously in the literature. When examining newly identified SNPs in a stage 1 analysis for the weight loss phenotype that included 1276 cancer patients, twelve new candidate SNPs were significant. Six of these SNPs are associated with muscle metabolism in five genes (IGF1, CPN1, FOXO1, FOXO3, and ACVR2B), three are associated with adipose tissue metabolism in two genes (LEPR and TOMM40 (APOE on the reverse strand)), two with corticosteroid signalling in one gene (IFT172 (GCKR on the reverse strand)) and one with the immune response in one gene (TLR4). Two polymorphisms (rs1935949 and rs4946935) in the gene encoding FOXO3 were consistently associated with WL of increasing severity (>5% and >10%).
On the basis that WL is a continuum in the cachectic process, the observation that both SELP and FOXO3 associate with the higher degrees of WL suggests that these genetic signatures may be of particular significance. The role of P-selectin in the genesis of cachexia remains to be determined. When examining all SNPs in a stage 1 analysis for the LM phenotype, 5 SNPs were associated significantly with the cachexia phenotype: (i) rs4291 in the angiotensin converting enzyme (ACE) gene in chromosome 17; this gene has been associated with muscle function and metabolism; (ii) rs10636 in chromosome 16 in the metallothionein 2a gene; this gene has been shown to be involved in zinc dyshomeostasis which may contribute to cancer cachexia; (iii) rs1190584 in chromosome 14 in the WDR20 gene; this gene encodes a WD repeat-containing protein that functions to preserve and regulate the activity of the USP12-UAF1 deubiquitinating enzyme complex; (iv) rs3856806 in the peroxisome proliferator-activated receptor gamma (PPARG) gene in chromosome 3, which has been demonstrated to be involved in fatty acid and glucose metabolism; and (v) rs3745012 in chromosome 18 in the lipin 2 (LPIN2) gene; this gene represents a candidate gene for human lipodystrophy, characterised by loss of body fat, fatty liver, hypertriglyceridemia, and insulin resistance. When examining all SNPs in a stage 1 analysis for the LM+>2%WL phenotype, 4 SNPs were associated significantly with the cachexia phenotype. The first, rs12409877, is in the leptin receptor (LEPR) gene located on chromosome 3; LEPR binds leptin and is involved in adipose tissue regulation. The second, rs2268757, is located in the activin receptor type-2B (ACVR2B) gene on chromosome 3; ACVR2B is a high-affinity activin type 2 receptor which mediates signalling by a subset of TGF-β family ligands including myostatin, activin, GDF11 and others. SNPs in the tumour necrosis factor (TNF) (rs1799964) and ACE (rs4291) genes were also significantly associated with the phenotype.
Whether genes demonstrating significant associations with the cachexia phenotypes had altered transcript expression in muscle from cancer patients with or without those phenotypes was also investigated. Expression of ACVR2B, FOXO1 and FOXO3, LEPR, PPARG, TLR4, and TOMM40 transcripts was significantly associated with different levels of skeletal muscle index (SMI) or WL (P<0.05). Specifically, these were all negatively correlated with muscularity. FOXO1, FOXO3 and TOMM40 were the only genes significantly correlated with WL; these were correlated negatively with WL. Of the SNPs found to be significant across the range of phenotypes the majority are exons falling within coding sequences of genes or non-coding regions of genes. Some are introns in the intergenic regions between genes. SNPs may exert differing effects on genes leading to an aberrant gene product. Polymorphisms in promoter regions potentially contribute to differential gene expression, presumably by affecting the binding of transcription factors to DNA. Sequence variation in the 5’ untranslated region (UTR) could disrupt mRNA translation; mutations in the 3’ UTR could affect mRNA through post-transcriptional mechanisms such as splicing, maturation, stability and export. Polymorphisms in intronic regions may result in cis- or trans-regulation of genes, or unmask cryptic splice sites or promoters leading to alternative transcripts. Non-synonymous SNPs in exons could alter protein function or activity, whereas synonymous SNPs may introduce codon bias that contributes to the relative abundance of the proteins. Finally, nonsense mutations halt translation of the mRNA altogether. The genomic distribution of SNPs is not homogeneous: SNPs usually occur in non-coding regions more frequently than in coding regions or, in general, where natural selection is acting and fixating the allele of the SNP that constitutes the most favourable genetic adaptation.
It has been estimated that 10% of all SNPs in the genome are functional, thereby having the potential to alter some biological process. Whether they alter function directly or indirectly, all could potentially be used as biomarkers of predisposition to develop cancer cachexia. The studies presented in this thesis identify new diagnostic criteria that identify patients with evidence of muscle atrophy. They also confirm the previously reported association whereby patients who carry the C allele of the rs6136 SNP in the SELP gene are at a reduced risk of developing cachexia defined by WL, and raise the question of the role of this molecule in cachexia. Whilst achieving these outcomes, this thesis also identifies a set of new SNPs that associate with the phenotype which is shown to correlate with actual muscle atrophy.
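
The phenotype definitions used in the abstract above (WL above 5%, 10% or 15%, LM, and the composite LM+>2%WL) can be written as a small classification sketch. The abstract does not give the CT-derived cut-off for low muscularity, so the sketch assumes that judgement has already been made and is passed in as a boolean; the function names and label strings are hypothetical.

```python
def percent_weight_loss(pre_illness_kg, current_kg):
    """Weight loss as a percentage of the original (pre-illness) body weight."""
    return 100.0 * (pre_illness_kg - current_kg) / pre_illness_kg


def cachexia_phenotypes(weight_loss_pct, low_muscularity):
    """Return the set of phenotype labels a patient meets.

    weight_loss_pct: percentage of original body weight lost.
    low_muscularity: True if the patient was classed as LM from CT imaging
        (the cut-off itself is not stated in the abstract and is assumed given).
    """
    labels = set()
    # Weight-loss severity levels examined in the association analyses.
    for cutoff in (5, 10, 15):
        if weight_loss_pct > cutoff:
            labels.add(f"WL>{cutoff}%")
    if low_muscularity:
        labels.add("LM")
        # Composite phenotype: low muscularity plus active weight loss above 2%.
        if weight_loss_pct > 2:
            labels.add("LM+>2%WL")
    return labels
```

A patient can meet several definitions at once, which is why the function returns a set rather than a single label; association tests would then be run separately for each phenotype.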