539 research outputs found

    Learning genetic epistasis using Bayesian network scoring criteria

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied for detecting epistasis is <it>Multifactor Dimensionality Reduction </it>(MDR). Jiang et al. created a combinatorial epistasis learning method called <it>BNMBL </it>to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, we ordinarily do not know the number of SNPs that influence phenotype. BNMBL may not perform as well if we also scored models containing more than two SNPs. Furthermore, a number of other BN scoring criteria have been developed. They may detect epistatic interactions even better than BNMBL.</p> <p>Although BNs are a promising tool for learning epistatic relationships from data, we cannot confidently use them in this domain until we determine which scoring criteria work best or even well when we try learning the correct model without knowledge of the number of SNPs in that model.</p> <p>Results</p> <p>We evaluated the performance of 22 BN scoring criteria using 28,000 simulated data sets and a real Alzheimer's GWAS data set. Our results were surprising in that the Bayesian scoring criterion with large values of a hyperparameter called α performed best. This score performed better than other BN scoring criteria and MDR at <it>recall </it>using simulated data sets, at detecting the hardest-to-detect models using simulated data sets, and at substantiating previous results using the real Alzheimer's data set.</p> <p>Conclusions</p> <p>We conclude that representing epistatic interactions using BN models and scoring them using a BN scoring criterion holds promise for identifying epistatic genetic variants in data. In particular, the Bayesian scoring criterion with large values of a hyperparameter α appears more promising than a number of alternatives.</p

    Mining Pure, Strict Epistatic Interactions from High-Dimensional Datasets: Ameliorating the Curse of Dimensionality

    Get PDF
    Background: The interaction between loci to affect phenotype is called epistasis. It is strict epistasis if no proper subset of the interacting loci exhibits a marginal effect. For many diseases, it is likely that unknown epistatic interactions affect disease susceptibility. A difficulty when mining epistatic interactions from high-dimensional datasets concerns the curse of dimensionality. There are too many combinations of SNPs to perform an exhaustive search. A method that could locate strict epistasis without an exhaustive search can be considered the brass ring of methods for analyzing high-dimensional datasets. Methodology/Findings: A SNP pattern is a Bayesian network representing SNP-disease relationships. The Bayesian score for a SNP pattern is the probability of the data given the pattern, and has been used to learn SNP patterns. We identified a bound for the score of a SNP pattern. The bound provides an upper limit on the Bayesian score of any pattern that could be obtained by expanding a given pattern. We felt that the bound might enable the data to say something about the promise of expanding a 1-SNP pattern even when there are no marginal effects. We tested the bound using simulated datasets and semi-synthetic high-dimensional datasets obtained from GWAS datasets. We found that the bound was able to dramatically reduce the search time for strict epistasis. Using an Alzheimer's dataset, we showed that it is possible to discover an interaction involving the APOE gene based on its score because of its large marginal effect, but that the bound is most effective at discovering interactions without marginal effects. Conclusions/Significance: We conclude that the bound appears to ameliorate the curse of dimensionality in high-dimensional datasets. This is a very consequential result and could be pivotal in our efforts to reveal the dark matter of genetic disease risk from high-dimensional datasets. © 2012 Jiang, Neapolitan

    Discovering causal interactions using Bayesian network scoring and information gain

    Get PDF
    Background: The problem of learning causal influences from data has recently attracted much attention. Standard statistical methods can have difficulty learning discrete causes, which interacting to affect a target, because the assumptions in these methods often do not model discrete causal relationships well. An important task then is to learn such interactions from data. Motivated by the problem of learning epistatic interactions from datasets developed in genome-wide association studies (GWAS), researchers conceived new methods for learning discrete interactions. However, many of these methods do not differentiate a model representing a true interaction from a model representing non-interacting causes with strong individual affects. The recent algorithm MBS-IGain addresses this difficulty by using Bayesian network learning and information gain to discover interactions from high-dimensional datasets. However, MBS-IGain requires marginal effects to detect interactions containing more than two causes. If the dataset is not high-dimensional, we can avoid this shortcoming by doing an exhaustive search. Results: We develop Exhaustive-IGain, which is like MBS-IGain but does an exhaustive search. We compare the performance of Exhaustive-IGain to MBS-IGain using low-dimensional simulated datasets based on interactions with marginal effects and ones based on interactions without marginal effects. Their performance is similar on the datasets based on marginal effects. However, Exhaustive-IGain compellingly outperforms MBS-IGain on the datasets based on 3 and 4-cause interactions without marginal effects. We apply Exhaustive-IGain to investigate how clinical variables interact to affect breast cancer survival, and obtain results that agree with judgements of a breast cancer oncologist. Conclusions: We conclude that the combined use of information gain and Bayesian network scoring enables us to discover higher order interactions with no marginal effects if we perform an exhaustive search. We further conclude that Exhaustive-IGain can be effective when applied to real data

    A Bayesian method for evaluating and discovering disease loci associations

    Get PDF
    Background: A genome-wide association study (GWAS) typically involves examining representative SNPs in individuals from some population. A GWAS data set can concern a million SNPs and may soon concern billions. Researchers investigate the association of each SNP individually with a disease, and it is becoming increasingly commonplace to also analyze multi-SNP associations. Techniques for handling so many hypotheses include the Bonferroni correction and recently developed Bayesian methods. These methods can encounter problems. Most importantly, they are not applicable to a complex multi-locus hypothesis which has several competing hypotheses rather than only a null hypothesis. A method that computes the posterior probability of complex hypotheses is a pressing need. Methodology/Findings: We introduce the Bayesian network posterior probability (BNPP) method which addresses the difficulties. The method represents the relationship between a disease and SNPs using a directed acyclic graph (DAG) model, and computes the likelihood of such models using a Bayesian network scoring criterion. The posterior probability of a hypothesis is computed based on the likelihoods of all competing hypotheses. The BNPP can not only be used to evaluate a hypothesis that has previously been discovered or suspected, but also to discover new disease loci associations. The results of experiments using simulated and real data sets are presented. Our results concerning simulated data sets indicate that the BNPP exhibits both better evaluation and discovery performance than does a p-value based method. For the real data sets, previous findings in the literature are confirmed and additional findings are found. Conclusions/Significance: We conclude that the BNPP resolves a pressing problem by providing a way to compute the posterior probability of complex multi-locus hypotheses. A researcher can use the BNPP to determine the expected utility of investigating a hypothesis further. Furthermore, we conclude that the BNPP is a promising method for discovering disease loci associations. © 2011 Jiang et al

    Towards knowledge-based gene expression data mining

    Get PDF
    The field of gene expression data analysis has grown in the past few years from being purely data-centric to integrative, aiming at complementing microarray analysis with data and knowledge from diverse available sources. In this review, we report on the plethora of gene expression data mining techniques and focus on their evolution toward knowledge-based data analysis approaches. In particular, we discuss recent developments in gene expression-based analysis methods used in association and classification studies, phenotyping and reverse engineering of gene networks

    Discovering Higher-order SNP Interactions in High-dimensional Genomic Data

    Get PDF
    In this thesis, a multifactor dimensionality reduction based method on associative classification is employed to identify higher-order SNP interactions for enhancing the understanding of the genetic architecture of complex diseases. Further, this thesis explored the application of deep learning techniques by providing new clues into the interaction analysis. The performance of the deep learning method is maximized by unifying deep neural networks with a random forest for achieving reliable interactions in the presence of noise

    Bayesian neural networks for detecting epistasis in genetic association studies

    Get PDF
    Background: Discovering causal genetic variants from large genetic association studies poses many difficult challenges. Assessing which genetic markers are involved in determining trait status is a computationally demanding task, especially in the presence of gene-gene interactions. Results: A non-parametric Bayesian approach in the form of a Bayesian neural network is proposed for use in analyzing genetic association studies. Demonstrations on synthetic and real data reveal they are able to efficiently and accurately determine which variants are involved in determining case-control status. By using graphics processing units (GPUs) the time needed to build these models is decreased by several orders of magnitude. In comparison with commonly used approaches for detecting interactions, Bayesian neural networks perform very well across a broad spectrum of possible genetic relationships. Conclusions: The proposed framework is shown to be a powerful method for detecting causal SNPs while being computationally efficient enough to handle large datasets. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0368-0) contains supplementary material, which is available to authorized users
    corecore