810 research outputs found

    Mining Pure, Strict Epistatic Interactions from High-Dimensional Datasets: Ameliorating the Curse of Dimensionality

    Get PDF
    Background: The interaction between loci to affect phenotype is called epistasis. It is strict epistasis if no proper subset of the interacting loci exhibits a marginal effect. For many diseases, it is likely that unknown epistatic interactions affect disease susceptibility. A difficulty when mining epistatic interactions from high-dimensional datasets concerns the curse of dimensionality. There are too many combinations of SNPs to perform an exhaustive search. A method that could locate strict epistasis without an exhaustive search can be considered the brass ring of methods for analyzing high-dimensional datasets. Methodology/Findings: A SNP pattern is a Bayesian network representing SNP-disease relationships. The Bayesian score for a SNP pattern is the probability of the data given the pattern, and has been used to learn SNP patterns. We identified a bound for the score of a SNP pattern. The bound provides an upper limit on the Bayesian score of any pattern that could be obtained by expanding a given pattern. We felt that the bound might enable the data to say something about the promise of expanding a 1-SNP pattern even when there are no marginal effects. We tested the bound using simulated datasets and semi-synthetic high-dimensional datasets obtained from GWAS datasets. We found that the bound was able to dramatically reduce the search time for strict epistasis. Using an Alzheimer's dataset, we showed that it is possible to discover an interaction involving the APOE gene based on its score because of its large marginal effect, but that the bound is most effective at discovering interactions without marginal effects. Conclusions/Significance: We conclude that the bound appears to ameliorate the curse of dimensionality in high-dimensional datasets. This is a very consequential result and could be pivotal in our efforts to reveal the dark matter of genetic disease risk from high-dimensional datasets. © 2012 Jiang, Neapolitan

    The UKB envirome of depression

    Get PDF
    Major depressive disorder is a result of the complex interplay between a large number of environmental and genetic factors but the comprehensive analysis of contributing environmental factors is still an open challenge. The primary aim of this work was to create a Bayesian dependency map of environmental factors of depression, including life stress, social and lifestyle factors, using the UK Biobank data to determine direct dependencies and to characterize mediating or interacting effects of other mental health, metabolic or pain conditions. As a complementary approach, we also investigated the non-linear, synergistic multi-factorial risk of the UKB envirome on depression using deep neural network architectures. Our results showed that a surprisingly small number of core factors mediate the effects of the envirome on lifetime depression: neuroticism, current depressive symptoms, parental depression, body fat, while life stress and household income have weak direct effects. Current depressive symptom showed strong or moderate direct relationships with life stress, pain conditions, falls, age, insomnia, weight change, satisfaction, confiding in someone, exercise, sports and Townsend index. In conclusion, the majority of envirome exerts their effects in a dynamic network via transitive, interactive and synergistic relationships explaining why environmental effects may be obscured in studies which consider them individually

    Learning genetic epistasis using Bayesian network scoring criteria

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied for detecting epistasis is <it>Multifactor Dimensionality Reduction </it>(MDR). Jiang et al. created a combinatorial epistasis learning method called <it>BNMBL </it>to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, we ordinarily do not know the number of SNPs that influence phenotype. BNMBL may not perform as well if we also scored models containing more than two SNPs. Furthermore, a number of other BN scoring criteria have been developed. They may detect epistatic interactions even better than BNMBL.</p> <p>Although BNs are a promising tool for learning epistatic relationships from data, we cannot confidently use them in this domain until we determine which scoring criteria work best or even well when we try learning the correct model without knowledge of the number of SNPs in that model.</p> <p>Results</p> <p>We evaluated the performance of 22 BN scoring criteria using 28,000 simulated data sets and a real Alzheimer's GWAS data set. Our results were surprising in that the Bayesian scoring criterion with large values of a hyperparameter called α performed best. This score performed better than other BN scoring criteria and MDR at <it>recall </it>using simulated data sets, at detecting the hardest-to-detect models using simulated data sets, and at substantiating previous results using the real Alzheimer's data set.</p> <p>Conclusions</p> <p>We conclude that representing epistatic interactions using BN models and scoring them using a BN scoring criterion holds promise for identifying epistatic genetic variants in data. In particular, the Bayesian scoring criterion with large values of a hyperparameter α appears more promising than a number of alternatives.</p

    A hybrid algorithm for Bayesian network structure learning with application to multi-label learning

    Get PDF
    We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusions that local structural learning with H2PC in the form of local neighborhood induction is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available.Comment: arXiv admin note: text overlap with arXiv:1101.5184 by other author

    Challenges in the Analysis of Mass-Throughput Data: A Technical Commentary from the Statistical Machine Learning Perspective

    Get PDF
    Sound data analysis is critical to the success of modern molecular medicine research that involves collection and interpretation of mass-throughput data. The novel nature and high-dimensionality in such datasets pose a series of nontrivial data analysis problems. This technical commentary discusses the problems of over-fitting, error estimation, curse of dimensionality, causal versus predictive modeling, integration of heterogeneous types of data, and lack of standard protocols for data analysis. We attempt to shed light on the nature and causes of these problems and to outline viable methodological approaches to overcome them

    Bayesian network learning and applications in Bioinformatics

    Get PDF
    Abstract A Bayesian network (BN) is a compact graphic representation of the probabilistic re- lationships among a set of random variables. The advantages of the BN formalism include its rigorous mathematical basis, the characteristics of locality both in knowl- edge representation and during inference, and the innate way to deal with uncertainty. Over the past decades, BNs have gained increasing interests in many areas, including bioinformatics which studies the mathematical and computing approaches to under- stand biological processes. In this thesis, I develop new methods for BN structure learning with applications to bi- ological network reconstruction and assessment. The first application is to reconstruct the genetic regulatory network (GRN), where each gene is modeled as a node and an edge indicates a regulatory relationship between two genes. In this task, we are given time-series microarray gene expression measurements for tens of thousands of genes, which can be modeled as true gene expressions mixed with noise in data generation, variability of the underlying biological systems etc. We develop a novel BN structure learning algorithm for reconstructing GRNs. The second application is to develop a BN method for protein-protein interaction (PPI) assessment. PPIs are the foundation of most biological mechanisms, and the knowl- edge on PPI provides one of the most valuable resources from which annotations of genes and proteins can be discovered. Experimentally, recently-developed high- throughput technologies have been carried out to reveal protein interactions in many organisms. However, high-throughput interaction data often contain a large number of iv spurious interactions. In this thesis, I develop a novel in silico model for PPI assess- ment. Our model is based on a BN that integrates heterogeneous data sources from different organisms. The main contributions are: 1. A new concept to depict the dynamic dependence relationships among random variables, which widely exist in biological processes, such as the relationships among genes and genes' products in regulatory networks and signaling pathways. This con- cept leads to a novel algorithm for dynamic Bayesian network learning. We apply it to time-series microarray gene expression data, and discover some missing links in a well-known regulatory pathway. Those new causal relationships between genes have been found supportive evidences in literature. 2. Discovery and theoretical proof of an asymptotic property of K2 algorithm ( a well-known efficient BN structure learning approach). This property has been used to identify Markov blankets (MB) in a Bayesian network, and further recover the BN structure. This hybrid algorithm is evaluated on a benchmark regulatory pathway, and obtains better results than some state-of-art Bayesian learning approaches. 3. A Bayesian network based integrative method which incorporates heterogeneous data sources from different organisms to predict protein-protein interactions (PPI) in a target organism. The framework is employed in human PPI prediction and in as- sessment of high-throughput PPI data. Furthermore, our experiments reveal some interesting biological results. 4. We introduce the learning of a TAN (Tree Augmented Naïve Bayes) based net- work, which has the computational simplicity and robustness to high-throughput PPI assessment. The empirical results show that our method outperforms naïve Bayes and a manual constructed Bayesian Network, additionally demonstrate sufficient informa- tion from model organisms can achieve high accuracy in PPI prediction

    Learning condition-specific networks

    Get PDF
    Condition-specific cellular networks are networks of genes and proteins that describe functional interactions among genes occurring under different environmental conditions. These networks provide a systems-level view of how the parts-list (genes and proteins) interact within the cell as it functions under changing environmental conditions and can provide insight into mechanisms of stress response, cellular differentiation and disease susceptibility. The principle challenge, however, is that cellular networks remain unknown for most conditions and must be inferred from activity levels of genes (mRNA levels) under different conditions. This dissertation aims to develop computational approaches for inferring, analyzing and validating cellular networks of genes from expression data. This dissertation first describes an unsupervised machine learning framework for inferring cellular networks using expression data from a single condition. Here cellular networks are represented as undirected probabilistic graphical models and are learned using a novel, data-driven algorithm. Then several approaches are described that can learn networks using data from multiple conditions. These approaches apply to cases where the condition may or may not be known and, therefore, must be inferred as part of the learning problem. For the latter, the condition variable is allowed to influence expression of genes at different levels of granularity: condition variable per gene to a single condition variable for all genes. Results on simulated data suggest that the algorithm performance depends greatly on the size and number of connected components of the union network of all conditions. These algorithms are also applied to microarray data from two yeast populations, quiescent and non-quiescent, isolated from glucose starved cultures. Our results suggest that by sharing information across multiple conditions, better networks can be learned for both conditions, with many more biologically meaningful dependencies, than if networks were learned for these conditions independently. In particular, processes that were shared among both cell populations were involved in response to glucose starvation, whereas the processes specific to individual populations captured characteristics unique to each population. These algorithms were also applied for learning networks across multiple species: yeast (S. cerevisiae) and fly (D. melanogaster). Preliminary analysis suggests that sharing patterns across species is much more complex than across different populations of the same species and basic metabolic processes are shared across the two species. Finally, this dissertation focuses on validation of cellular networks. This validation framework describes scores for measuring how well network learning algorithms capture higher-order dependencies. This framework also introduces a measure for evaluating the entire inferred network structure based on the extent to which similarly functioning genes are close together on the network
    corecore