291 research outputs found

    Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Multifactor Dimensionality Reduction (MDR) has been introduced previously as a non-parametric statistical method for detecting gene-gene interactions. MDR performs a dimensional reduction by assigning multi-locus genotypes to either high- or low-risk groups and measuring the percentage of cases and controls incorrectly labelled by this classification – the classification error. The combination of variables that produces the lowest classification error is selected as the best or most fit model. The correctly and incorrectly labelled cases and controls can be expressed as a two-way contingency table. We sought to improve the ability of MDR to detect gene-gene interactions by replacing classification error with a different measure to score model quality.</p> <p>Results</p> <p>In this study, we compare the detection and power of MDR using a variety of measures for two-way contingency table analysis. We simulated 40 genetic models, varying the number of disease loci in the model (2 – 5), allele frequencies of the disease loci (.2/.8 or .4/.6) and the broad-sense heritability of the model (.05 – .3). Overall, detection using NMI was 65.36% across all models, and specific detection was 59.4% versus detection using classification error at 62% and specific detection was 52.2%.</p> <p>Conclusion</p> <p>Of the 10 measures evaluated, the likelihood ratio and normalized mutual information (NMI) are measures that consistently improve the detection and power of MDR in simulated data over using classification error. These measures also reduce the inclusion of spurious variables in a multi-locus model. Thus, MDR, which has already been demonstrated as a powerful tool for detecting gene-gene interactions, can be improved with the use of alternative fitness functions.</p

    An R package implementation of multifactor dimensionality reduction

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>A breadth of high-dimensional data is now available with unprecedented numbers of genetic markers and data-mining approaches to variable selection are increasingly being utilized to uncover associations, including potential gene-gene and gene-environment interactions. One of the most commonly used data-mining methods for case-control data is Multifactor Dimensionality Reduction (MDR), which has displayed success in both simulations and real data applications. Additional software applications in alternative programming languages can improve the availability and usefulness of the method for a broader range of users.</p> <p>Results</p> <p>We introduce a package for the R statistical language to implement the Multifactor Dimensionality Reduction (MDR) method for nonparametric variable selection of interactions. This package is designed to provide an alternative implementation for R users, with great flexibility and utility for both data analysis and research. The 'MDR' package is freely available online at <url>http://www.r-project.org/</url>. We also provide data examples to illustrate the use and functionality of the package.</p> <p>Conclusions</p> <p>MDR is a frequently-used data-mining method to identify potential gene-gene interactions, and alternative implementations will further increase this usage. We introduce a flexible software package for R users.</p

    Risk score modeling of multiple gene to gene interactions using aggregated-multifactor dimensionality reduction

    Get PDF
    BACKGROUND: Multifactor Dimensionality Reduction (MDR) has been widely applied to detect gene-gene (GxG) interactions associated with complex diseases. Existing MDR methods summarize disease risk by a dichotomous predisposing model (high-risk/low-risk) from one optimal GxG interaction, which does not take the accumulated effects from multiple GxG interactions into account. RESULTS: We propose an Aggregated-Multifactor Dimensionality Reduction (A-MDR) method that exhaustively searches for and detects significant GxG interactions to generate an epistasis enriched gene network. An aggregated epistasis enriched risk score, which takes into account multiple GxG interactions simultaneously, replaces the dichotomous predisposing risk variable and provides higher resolution in the quantification of disease susceptibility. We evaluate this new A-MDR approach in a broad range of simulations. Also, we present the results of an application of the A-MDR method to a data set derived from Juvenile Idiopathic Arthritis patients treated with methotrexate (MTX) that revealed several GxG interactions in the folate pathway that were associated with treatment response. The epistasis enriched risk score that pooled information from 82 significant GxG interactions distinguished MTX responders from non-responders with 82% accuracy. CONCLUSIONS: The proposed A-MDR is innovative in the MDR framework to investigate aggregated effects among GxG interactions. New measures (pOR, pRR and pChi) are proposed to detect multiple GxG interactions

    Discovering Higher-order SNP Interactions in High-dimensional Genomic Data

    Get PDF
    In this thesis, a multifactor dimensionality reduction based method on associative classification is employed to identify higher-order SNP interactions for enhancing the understanding of the genetic architecture of complex diseases. Further, this thesis explored the application of deep learning techniques by providing new clues into the interaction analysis. The performance of the deep learning method is maximized by unifying deep neural networks with a random forest for achieving reliable interactions in the presence of noise

    A General Framework for Formal Tests of Interaction after Exhaustive Search Methods with Applications to MDR and MDR-PDT

    Get PDF
    The initial presentation of multifactor dimensionality reduction (MDR) featured cross-validation to mitigate over-fitting, computationally efficient searches of the epistatic model space, and variable construction with constructive induction to alleviate the curse of dimensionality. However, the method was unable to differentiate association signals arising from true interactions from those due to independent main effects at individual loci. This issue leads to problems in inference and interpretability for the results from MDR and the family-based compliment the MDR-pedigree disequilibrium test (PDT). A suggestion from previous work was to fit regression models post hoc to specifically evaluate the null hypothesis of no interaction for MDR or MDR-PDT models. We demonstrate with simulation that fitting a regression model on the same data as that analyzed by MDR or MDR-PDT is not a valid test of interaction. This is likely to be true for any other procedure that searches for models, and then performs an uncorrected test for interaction. We also show with simulation that when strong main effects are present and the null hypothesis of no interaction is true, that MDR and MDR-PDT reject at far greater than the nominal rate. We also provide a valid regression-based permutation test procedure that specifically tests the null hypothesis of no interaction, and does not reject the null when only main effects are present. The regression-based permutation test implemented here conducts a valid test of interaction after a search for multilocus models, and can be applied to any method that conducts a search to find a multilocus model representing an interaction

    고차원 유전체 자료에서의 유전자-유전자 상호작용 분석

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 협동과정 생물정보학전공, 2015. 2. 박태성.With the development of high-throughput genotyping and sequencing technology, there are growing evidences of association with genetic variants and common complex traits. In spite of thousands of genetic variants discovered, such genetic markers have been shown to explain only a very small proportion of the underlying genetic variance of complex traits. Gene-gene interaction (GGI) analysis and rare variant analysis is expected to unveil a large portion of unexplained heritability of complex traits. In GGI, there are several practical issues. First, in order to conduct GGI analysis with high-dimensional genomic data, GGI methods requires the efficient computation and high accuracy. Second, it is hard to detect GGI for rare variants due to its sparsity. Third, analysing GGI using genome-wide scale suffers from a computational burden as exploring a huge search space. It requires much greater number of tests to find optimal GGI. For k variants, we have k(k-1)/2 combinations for two-order interactions, and nCk combinations for n-order interactions. The number of possible interaction models increase exponentially as the interaction order increases or the number of variant increases. Forth, though the biological interpretation of GGI is important, it is hard to interpret GGI due to its complex manner. In order to overcome these four main issues in GGI analysis with high-dimensional genomic data, the four novel methods are proposed. First, to provide efficient GGI method, we propose IGENT, Information theory-based GEnome-wide gene-gene iNTeraction method. IGENT is an efficient algorithm for identifying genome-wide GGI and gene-environment interaction (GEI). For detecting significant GGIs in genome-wide scale, it is important to reduce computational burden significantly. IGENT uses information gain (IG) and evaluates its significance without resampling. Through our simulation studies, the power of the IGENT is shown to be better than or equivalent to that of that of BOOST. The proposed method successfully detected GGI for bipolar disorder in the Wellcome Trust Case Control Consortium (WTCCC) and age-related macular degeneration (AMD). Second, for GGI analysis of rare variants, we propose a new gene-gene interaction method in the framework of the multifactor dimensionality reduction (MDR) analysis. The proposed method consists of two steps. The first step is to collapse the rare variants in a specific region such as gene. The second step is to perform MDR analysis for the collapsed rare variants. The proposed method is applied in whole exome sequencing data of Korean population to identify causal gene-gene interaction for rare variants for type 2 diabetes (T2D). Third, to increase computational performance for GGI in genome-wide scale, we developed CUDA (Compute Unified Device Architecture) based genome-wide association MDR (cuGWAM) software using efficient hardware accelerators. cuGWAM has better performance than CPU-based MDR methods and other GPU-based methods through our simulation studies. Fourth, to efficiently provide the statistical interpretation and biological evidences of gene-gene interactions, we developed the VisEpis, a tool for visualizing of gene-gene interactions in genetic association analysis and mapping of epistatic interaction to the biological evidence from public interaction databases. Using interaction network and circular plot, the VisEpis provides to explore the interaction network integrated with biological evidences in epigenetic regulation, splicing, transcription, translation and post-translation level. To aid statistical interaction in genotype level, the VisEpis provides checkerboard, pairwise checkerboard, forest, funnel and ring chart.Abstract i Contents iv List of Figures viii List of Tables xi 1 Introduction 1 1.1 Background of high-dimensional genomic data 1 1.1.1 History of genome-wide association studies (GWAS) 1 1.1.2 Association studies of massively parallel sequencing (MPS) 3 1.1.3 Missing heritability and proposed alternative methods 6 1.2 Purpose and novelty of this study 7 1.3 Outline of the thesis 8 2 Overview of gene-gene interaction 9 2.1 Definition of gene-gene interaction 9 2.2 Practical issues of gene-gene interaction 12 2.3 Overview of gene-gene interaction methods 14 2.3.1 Regression-based gene-gene interaction methods 14 2.3.2 Multifactor dimensionality reduction (MDR) 15 2.3.3 Gene-gene interaction methods using machine learning methods 18 2.3.3 Entropy-based method gene-gene interaction methods 20 3 Entropy-based Gene-gene interaction 22 3.1 Introduction 22 3.2 Methods 23 3.2.1 Entropy-based gene-gene interaction analysis 23 3.2.2 Exhaustive searching approach and Stepwise selection approach 24 3.2.3 Simulation setting 27 3.2.4 Genome-wide data for Biopolar disorder (BD) 31 3.2.5 Genome-wide data for Age-related macular degeneration (AMD) 31 3.3 Results 33 3.3.1 Simulation results 33 3.3.2 Analysis of WTCCC bipolar disorder (BD) data 43 3.3.3 Analysis of age-related macular degeneration (AMD) data 44 3.4 Discussion 47 3.5 Conclusion 47 4 Gene-gene interaction for rare variants 48 4.1 Introduction 48 4.2 Methods 50 4.2.1 Collapsing-based gene-gene interaction 50 4.2.2 Simulation setting 50 4.3 Results 55 4.3.1 Simulation study 55 4.3.2 Real data analysis of the Type 2 diabetes data 55 4.4 Discussion and Conclusion 68 5 Computation enhancement for gene-gene interaction 5.1 Introduction 69 5.2 Methods 71 5.2.1 MDR implementation 71 5.2.2 Implementation using high-performance computation based on GPU 72 5.2.3 Environment of performance comparison 75 5.3 Results 76 5.3.1 Computational improvement 76 5.4 Discussion 84 5.5 Conclusion 87 6 Visualization for gene-gene interaction interpretation 88 6.1 Introduction 88 6.2 Methods 91 6.2.1 Interaction mapping procedure 91 6.2.1 Checker board plot 91 6.2.2 Forest and funnel plot 94 6.3 Case study 100 6.3.1 Interpretation of gene-gene interaction in WTCC bipolar disorder data 100 6.3.2 Interpretation of gene-gene interaction in Age-related macular degeneration (AMD) data 101 6.4 Conclusion 102 7 Summary and Conclusion 103 Bibliography 107 Abstract (Korean) 113Docto

    In search of complex disease risk through genome wide association studies

    Get PDF
    The identification and characterisation of genomic changes (variants) that can lead to human diseases is one of the central aims of biomedical research. The generation of catalogues of genetic variants that have an impact on specific diseases is the basis of Personalised Medicine, where diagnoses and treatment protocols are selected according to each patient&rsquo;s profile. In this context, the study of complex diseases, such as Type 2 diabetes or cardiovascular alterations, is fundamental. However, these diseases result from the combination of multiple genetic and environmental factors, which makes the discovery of causal variants particularly challenging at a statistical and computational level. Genome-Wide Association Studies (GWAS), which are based on the statistical analysis of genetic variant frequencies across non-diseased and diseased individuals, have been successful in finding genetic variants that are associated to specific diseases or phenotypic traits. But GWAS methodology is limited when considering important genetic aspects of the disease and has not yet resulted in meaningful translation to clinical practice. This review presents an outlook on the study of the link between genetics and complex phenotypes. We first present an overview of the past and current statistical methods used in the field. Next, we discuss current practices and their main limitations. Finally, we describe the open challenges that remain and that might benefit greatly from further mathematical developments.L.A. was supported by grant BES-2017-081635. This publication is part of R&D and Innovation grant BES-2017-081635 funded by MCIN and by “FSE Investing in your future”I.M. was supported by grant FJCI-2017-31878. This publication is part of R&D and Innovation grant FJCI-2017-31878 funded by MCIN. C.S. received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement H2020-MSCA-COFUND-2016-754433.Peer ReviewedPostprint (published version

    Comparison of information-theoretic to statistical methods for gene-gene interactions in the presence of genetic heterogeneity

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Multifactorial diseases such as cancer and cardiovascular diseases are caused by the complex interplay between genes and environment. The detection of these interactions remains challenging due to computational limitations. Information theoretic approaches use computationally efficient directed search strategies and thus provide a feasible solution to this problem. However, the power of information theoretic methods for interaction analysis has not been systematically evaluated. In this work, we compare power and Type I error of an information-theoretic approach to existing interaction analysis methods.</p> <p>Methods</p> <p>The <it>k-</it>way interaction information (KWII) metric for identifying variable combinations involved in gene-gene interactions (GGI) was assessed using several simulated data sets under models of genetic heterogeneity driven by susceptibility increasing loci with varying allele frequency, penetrance values and heritability. The power and proportion of false positives of the KWII was compared to multifactor dimensionality reduction (MDR), restricted partitioning method (RPM) and logistic regression.</p> <p>Results</p> <p>The power of the KWII was considerably greater than MDR on all six simulation models examined. For a given disease prevalence at high values of heritability, the power of both RPM and KWII was greater than 95%. For models with low heritability and/or genetic heterogeneity, the power of the KWII was consistently greater than RPM; the improvements in power for the KWII over RPM ranged from 4.7% to 14.2% at for α = 0.001 in the three models at the lowest heritability values examined. KWII performed similar to logistic regression.</p> <p>Conclusions</p> <p>Information theoretic models are flexible and have excellent power to detect GGI under a variety of conditions that characterize complex diseases.</p

    Bioinformatics challenges for genome-wide association studies

    Get PDF
    Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship that is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods

    Detecting purely epistatic multi-locus interactions by an omnibus permutation test on ensembles of two-locus analyses

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Purely epistatic multi-locus interactions cannot generally be detected via single-locus analysis in case-control studies of complex diseases. Recently, many two-locus and multi-locus analysis techniques have been shown to be promising for the epistasis detection. However, exhaustive multi-locus analysis requires prohibitively large computational efforts when problems involve large-scale or genome-wide data. Furthermore, there is no explicit proof that a combination of multiple two-locus analyses can lead to the correct identification of multi-locus interactions.</p> <p>Results</p> <p>The proposed 2LOmb algorithm performs an omnibus permutation test on ensembles of two-locus analyses. The algorithm consists of four main steps: two-locus analysis, a permutation test, global <it>p</it>-value determination and a progressive search for the best ensemble. 2LOmb is benchmarked against an exhaustive two-locus analysis technique, a set association approach, a correlation-based feature selection (CFS) technique and a tuned ReliefF (TuRF) technique. The simulation results indicate that 2LOmb produces a low false-positive error. Moreover, 2LOmb has the best performance in terms of an ability to identify all causative single nucleotide polymorphisms (SNPs) and a low number of output SNPs in purely epistatic two-, three- and four-locus interaction problems. The interaction models constructed from the 2LOmb outputs via a multifactor dimensionality reduction (MDR) method are also included for the confirmation of epistasis detection. 2LOmb is subsequently applied to a type 2 diabetes mellitus (T2D) data set, which is obtained as a part of the UK genome-wide genetic epidemiology study by the Wellcome Trust Case Control Consortium (WTCCC). After primarily screening for SNPs that locate within or near 372 candidate genes and exhibit no marginal single-locus effects, the T2D data set is reduced to 7,065 SNPs from 370 genes. The 2LOmb search in the reduced T2D data reveals that four intronic SNPs in <it>PGM1 </it>(phosphoglucomutase 1), two intronic SNPs in <it>LMX1A </it>(LIM homeobox transcription factor 1, alpha), two intronic SNPs in <it>PARK2 </it>(Parkinson disease (autosomal recessive, juvenile) 2, parkin) and three intronic SNPs in <it>GYS2 </it>(glycogen synthase 2 (liver)) are associated with the disease. The 2LOmb result suggests that there is no interaction between each pair of the identified genes that can be described by purely epistatic two-locus interaction models. Moreover, there are no interactions between these four genes that can be described by purely epistatic multi-locus interaction models with marginal two-locus effects. The findings provide an alternative explanation for the aetiology of T2D in a UK population.</p> <p>Conclusion</p> <p>An omnibus permutation test on ensembles of two-locus analyses can detect purely epistatic multi-locus interactions with marginal two-locus effects. The study also reveals that SNPs from large-scale or genome-wide case-control data which are discarded after single-locus analysis detects no association can still be useful for genetic epidemiology studies.</p
    corecore