14 research outputs found

    Structural and parametric uncertainties in full Bayesian and graphical lasso based approaches: beyond edge weights in psychological networks

    Uncertainty over model structures poses a challenge for many approaches that explore effect strength parameters at the system level. Monte Carlo methods for full Bayesian model averaging over model structures require considerable computational resources, whereas bootstrapped graphical lasso and its approximations offer scalable alternatives with lower complexity. Although the computational efficiency of graphical lasso based approaches has prompted a growing number of applications, the restrictive assumptions of this approach, such as its inability to cope with interactions, are frequently ignored. We demonstrate, using an artificial and a real-world example, that full Bayesian averaging using Bayesian networks provides detailed estimates through posterior distributions for structural and parametric uncertainties, and that it is a feasible alternative that is routinely applicable in mid-sized biomedical problems with hundreds of variables. We compare Bayesian estimates with corresponding frequentist quantities from bootstrapped graphical lasso using pairwise Markov Random Fields, and also discuss their interpretational differences. We present results using synthetic data from an artificial model and using the UK Biobank data set to explore a psychopathological network centered around depression (this research has been conducted using the UK Biobank Resource under Application Number 1602).

    Information Theoretically Optimal Sample Complexity of Learning Dynamical Directed Acyclic Graphs

    In this article, the optimal sample complexity of learning the underlying interactions/dependencies of a Linear Dynamical System (LDS) over a Directed Acyclic Graph (DAG) is studied. The sample complexity of learning a DAG's structure is well studied for static systems, where the samples of nodal states are independent and identically distributed (i.i.d.). However, such a study is less explored for DAGs with dynamical systems, where the nodal states are temporally correlated. We call such a DAG underlying an LDS a dynamical DAG (DDAG). In particular, we consider a DDAG where the nodal dynamics are driven by unobserved exogenous noise sources that are wide-sense stationary (WSS) in time but are mutually uncorrelated, and have the same power spectral density (PSD). Inspired by the static settings, a metric and an algorithm based on the PSD matrix of the observed time series are proposed to reconstruct the DDAG. The equal noise PSD assumption can be relaxed such that identifiability conditions for DDAG reconstruction are not violated. For the LDS with WSS (sub-)Gaussian exogenous noise sources, it is shown that the optimal sample complexity (or length of state trajectory) needed to learn the DDAG is $n=\Theta(q\log(p/q))$, where $p$ is the number of nodes and $q$ is the maximum number of parents per node. To prove the sample complexity upper bound, a concentration bound for the PSD estimation is derived under two different sampling strategies. A matching min-max lower bound using the generalized Fano's inequality is also provided, thus showing the order optimality of the proposed algorithm.
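To give a feel for the rate quoted in this abstract, the sketch below evaluates the growth term q·log(p/q) for a few hypothetical settings. The function name `ddag_sample_scale` and the example values of p and q are illustrative assumptions, not taken from the paper. Because the Θ(·) notation hides unspecified constants, only ratios between settings are meaningful, not the absolute numbers.

```python
from math import log


def ddag_sample_scale(p: int, q: int) -> float:
    """Return q * log(p / q), the growth term inside the Theta(.) bound.

    p: number of nodes; q: maximum number of parents per node.
    The true sample requirement is this value times an unknown constant.
    """
    if not (1 <= q < p):
        raise ValueError("expected 1 <= q < p")
    return q * log(p / q)


# Hypothetical settings chosen only for illustration.
base = ddag_sample_scale(p=1000, q=10)
more_nodes = ddag_sample_scale(p=2000, q=10)
more_parents = ddag_sample_scale(p=1000, q=20)

# Report ratios, since the constant factor in Theta(.) is unspecified.
print(f"doubling p scales the requirement by about {more_nodes / base:.2f}x")
print(f"doubling q scales the requirement by about {more_parents / base:.2f}x")
```

Under this rate, the dependence on the number of nodes is only logarithmic, while the dependence on the maximum in-degree is close to linear.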

    A linear programming approach for estimating the structure of a sparse linear genetic network from transcript profiling data

    Background: A genetic network can be represented as a directed graph in which a node corresponds to a gene and a directed edge specifies the direction of influence of one gene on another. The reconstruction of such networks from transcript profiling data remains an important yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graph representations of genetic networks, graphs with many thousands of nodes where an undirected edge between two nodes does not indicate the direction of influence, and the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data.
    Results: The structure learning task is cast as a sparse linear regression problem, which is then posed as a LASSO (l1-constrained fitting) problem and finally solved by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs are assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the InSilico1, InSilico2 and InSilico3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisiae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law, suggesting that its degree distribution is similar to that observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification.
    Conclusion: A statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence integrated computational-experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.

    Learning genetic epistasis using Bayesian network scoring criteria

    Background: Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied for detecting epistasis is Multifactor Dimensionality Reduction (MDR). Jiang et al. created a combinatorial epistasis learning method called BNMBL to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, we ordinarily do not know the number of SNPs that influence phenotype. BNMBL may not perform as well if we also scored models containing more than two SNPs. Furthermore, a number of other BN scoring criteria have been developed. They may detect epistatic interactions even better than BNMBL.
    Although BNs are a promising tool for learning epistatic relationships from data, we cannot confidently use them in this domain until we determine which scoring criteria work best or even well when we try learning the correct model without knowledge of the number of SNPs in that model.
    Results: We evaluated the performance of 22 BN scoring criteria using 28,000 simulated data sets and a real Alzheimer's GWAS data set. Our results were surprising in that the Bayesian scoring criterion with large values of a hyperparameter called α performed best. This score performed better than other BN scoring criteria and MDR at recall using simulated data sets, at detecting the hardest-to-detect models using simulated data sets, and at substantiating previous results using the real Alzheimer's data set.
    Conclusions: We conclude that representing epistatic interactions using BN models and scoring them using a BN scoring criterion holds promise for identifying epistatic genetic variants in data. In particular, the Bayesian scoring criterion with large values of a hyperparameter α appears more promising than a number of alternatives.
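As a rough illustration of what scoring a candidate epistatic model with a Bayesian criterion can look like, the sketch below computes a BDeu-style log marginal likelihood for one child variable (disease status) given a set of parent variables (SNPs). This is an assumption about the kind of score meant here: the abstract names only a "Bayesian scoring criterion" with hyperparameter α, and the function name `bdeu_log_score`, its arguments, and the toy data are all hypothetical rather than taken from the paper.

```python
from collections import Counter
from math import lgamma


def bdeu_log_score(child, parents, alpha=1.0, child_arity=2, parent_arities=()):
    """BDeu-style log marginal likelihood of `child` given `parents`.

    child: list of child states (e.g. disease status 0/1).
    parents: list of tuples, one per sample, holding the parent states
             (use empty tuples for a model with no parents).
    alpha: equivalent sample size, spread evenly over all table cells.
    """
    q = 1  # number of joint parent configurations
    for arity in parent_arities:
        q *= arity
    prior_j = alpha / q                   # prior mass per parent configuration
    prior_jk = alpha / (q * child_arity)  # prior mass per (configuration, state)

    counts_j = Counter(parents)
    counts_jk = Counter(zip(parents, child))

    # Unobserved configurations contribute exactly zero, so summing over
    # observed ones is sufficient.
    score = 0.0
    for n_j in counts_j.values():
        score += lgamma(prior_j) - lgamma(prior_j + n_j)
    for n_jk in counts_jk.values():
        score += lgamma(prior_jk + n_jk) - lgamma(prior_jk)
    return score


# Toy data (made up): disease status is the XOR of two binary SNPs,
# a purely epistatic pattern with no marginal effect of either SNP.
snps = [(a, b) for a in (0, 1) for b in (0, 1)] * 10
disease = [a ^ b for a, b in snps]

with_snps = bdeu_log_score(disease, snps, parent_arities=(2, 2))
no_parents = bdeu_log_score(disease, [()] * len(disease))
print(with_snps > no_parents)
```

On this toy data the two-SNP model scores higher than the empty model, which is the behaviour a scoring-based search for epistatic models relies on. Real SNP data would typically use three genotype states per SNP rather than two.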

    The accuracy of a Bayesian Network

    A Bayesian network is a construct that represents a joint probability distribution, and can be used to model a given joint probability distribution. A principal characteristic of a Bayesian network is the degree to which it models the given joint probability distribution accurately, that is, the accuracy of the Bayesian network. Although the accuracy of a Bayesian network can be well defined in theory, it is rarely possible to determine it in practice for real-world applications. Instead, alternative characteristics of a Bayesian network, which relate to and reflect the accuracy, are used to model the accuracy of a Bayesian network, and appropriate measures are devised. A popular formalism that adopts such methods to study the accuracy of a Bayesian network is the Minimum Description Length (MDL) formalism, which models the accuracy of a Bayesian network as the probability of the Bayesian network given the data set that describes the joint probability distribution the Bayesian network models. However, in the context of Bayesian networks, the MDL formalism is flawed, exhibiting several shortcomings, and is thus inappropriate for examining the accuracy of a Bayesian network. An alternative framework for Bayesian networks is proposed, which models the accuracy of a Bayesian network as the accuracy of the conditional independencies implied by the structure of the Bayesian network, and specifies an appropriate measure called the Network Conditional Independencies Mutual Information (NCIMI) measure. The proposed framework is inspired by the principles governing the field of Bayesian networks, and is based on formal theoretical foundations. Experiments have been conducted, using real-world problems, that evaluate both the MDL formalism and the proposed framework for Bayesian networks. The experimental results support the theoretical claims, and confirm the significance of the proposed framework.