6,036 research outputs found

    Markov Network Structure Learning via Ensemble-of-Forests Models

    Real-world systems typically feature a variety of dependency types and topologies that complicate model selection for probabilistic graphical models. We introduce the ensemble-of-forests model, a generalization of the ensemble-of-trees model. Our model enables structure learning of Markov random fields (MRFs) with multiple connected components and arbitrary potentials. We present two approximate inference techniques for this model and demonstrate their performance on synthetic data. Our results suggest that the ensemble-of-forests approach can accurately recover sparse, possibly disconnected MRF topologies, even in the presence of non-Gaussian dependencies and/or low sample size. We applied the ensemble-of-forests model to learn the structure of perturbed signaling networks of immune cells and found that these frequently exhibit non-Gaussian dependencies with disconnected MRF topologies. In summary, we expect that the ensemble-of-forests model will enable MRF structure learning in other high-dimensional real-world settings that are governed by non-trivial dependencies. Comment: 13 pages, 6 figures

    High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion

    We consider the problem of high-dimensional Gaussian graphical model selection. We identify a set of graphs for which an efficient estimation algorithm exists, and this algorithm is based on thresholding of empirical conditional covariances. Under a set of transparent conditions, we establish structural consistency (or sparsistency) for the proposed algorithm, when the number of samples $n=\omega(J_{\min}^{-2}\log p)$, where $p$ is the number of variables and $J_{\min}$ is the minimum (absolute) edge potential of the graphical model. The sufficient conditions for sparsistency are based on the notion of walk-summability of the model and the presence of sparse local vertex separators in the underlying graph. We also derive novel non-asymptotic necessary conditions on the number of samples required for sparsistency.
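The thresholding idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the bound `eta` on the conditioning-set size, and the threshold `xi` are assumptions chosen for illustration. An edge is kept only if the empirical conditional covariance stays above the threshold for every small conditioning set.

```python
from itertools import combinations

import numpy as np


def conditional_covariance(cov, i, j, S):
    """Cov(X_i, X_j | X_S) for a Gaussian with covariance matrix cov."""
    if not S:
        return cov[i, j]
    S = list(S)
    # Schur complement: Sigma_ij - Sigma_iS Sigma_SS^{-1} Sigma_Sj
    return cov[i, j] - cov[i, S] @ np.linalg.solve(cov[np.ix_(S, S)], cov[S, j])


def threshold_graph(samples, eta, xi):
    """Keep edge (i, j) if the smallest |conditional covariance| over all
    conditioning sets of size <= eta exceeds the threshold xi (both assumed)."""
    cov = np.cov(samples, rowvar=False)
    p = cov.shape[0]
    edges = set()
    for i, j in combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        smallest = min(
            abs(conditional_covariance(cov, i, j, S))
            for size in range(eta + 1)
            for S in combinations(others, size)
        )
        if smallest > xi:
            edges.add((i, j))
    return edges


# Toy chain X0 - X1 - X2: X0 and X2 are dependent only through X1.
rng = np.random.default_rng(0)
x0 = rng.normal(size=5000)
x1 = 0.8 * x0 + rng.normal(size=5000)
x2 = 0.8 * x1 + rng.normal(size=5000)
chain_edges = threshold_graph(np.column_stack([x0, x1, x2]), eta=1, xi=0.2)
```

The search over conditioning sets grows combinatorially in `eta`, so the sketch is only practical for small graphs and small `eta`.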

    Variable selection for BART: An application to gene regulation

    We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors that regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high-dimensional settings, where it is difficult to detect subtle individual effects and interactions between predictors. Bayesian Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a novel nonparametric alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the number of relevant predictors is sparse relative to the total number of available predictors and the fundamental relationships are nonlinear. We develop a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate informed prior information about variable importance. We present simulations demonstrating that our method compares favorably to existing parametric and nonparametric procedures in a variety of data settings. To demonstrate the potential of our approach in a biological context, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). We find that our BART-based procedure is best able to recover the subset of covariates with the largest signal compared to other variable selection methods. The methods developed in this work are readily available in the R package bartMachine. Comment: Published at http://dx.doi.org/10.1214/14-AOAS755 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
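The permutation-based idea in this abstract can be sketched in a few lines. This is not the paper's procedure and does not use BART or bartMachine: absolute Pearson correlation is swapped in as a stand-in importance score, and the per-predictor quantile cutoff, the function names, and the parameters `n_permutations` and `alpha` are all assumptions made for illustration. The response is permuted repeatedly to build a null distribution of importance scores, and a predictor is kept only if its observed score exceeds its null cutoff.

```python
import numpy as np


def abs_correlation_importance(X, y):
    """Stand-in importance score: |Pearson correlation| of each column with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))


def permutation_select(X, y, importance, n_permutations=200, alpha=0.05, seed=0):
    """Select predictors whose observed importance exceeds the (1 - alpha)
    quantile of their own null distribution, built by permuting y."""
    rng = np.random.default_rng(seed)
    observed = importance(X, y)
    # Each permutation breaks the link between X and y, giving one null draw.
    null = np.array(
        [importance(X, rng.permutation(y)) for _ in range(n_permutations)]
    )
    cutoffs = np.quantile(null, 1.0 - alpha, axis=0)
    return np.flatnonzero(observed > cutoffs)


# Toy data: only column 0 carries signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(size=200)
selected = permutation_select(X, y, abs_correlation_importance)
```

Any model-based importance measure with the same `(X, y) -> scores` signature could be passed in place of the correlation stand-in, at the cost of refitting the model once per permutation.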

    Lower Bounds for Two-Sample Structural Change Detection in Ising and Gaussian Models

    The change detection problem is to determine if the Markov network structures of two Markov random fields differ from one another given two sets of samples drawn from the respective underlying distributions. We study the trade-off between the sample sizes and the reliability of change detection, measured as a minimax risk, for the important cases of the Ising models and the Gaussian Markov random fields restricted to the models which have network structures with $p$ nodes and degree at most $d$, and obtain information-theoretic lower bounds for reliable change detection over these models. We show that for the Ising model, $\Omega\left(\frac{d^2}{(\log d)^2}\log p\right)$ samples are required from each dataset to detect even the sparsest possible changes, and that for the Gaussian, $\Omega\left(\gamma^{-2}\log(p)\right)$ samples are required from each dataset to detect change, where $\gamma$ is the smallest ratio of off-diagonal to diagonal terms in the precision matrices of the distributions. These bounds are compared to the corresponding results in structure learning, and closely match them under mild conditions on the model parameters. Thus, our change detection bounds inherit partial tightness from the structure learning schemes in previous literature, demonstrating that in certain parameter regimes, the naive structure learning based approach to change detection is minimax optimal up to constant factors. Comment: Presented at the 55th Annual Allerton Conference on Communication, Control, and Computing, Oct. 201

    Short-segment heart sound classification using an ensemble of deep convolutional neural networks

    This paper proposes a framework based on deep convolutional neural networks (CNNs) for automatic heart sound classification using short segments of individual heart beats. We design a 1D-CNN that directly learns features from raw heart-sound signals, and a 2D-CNN that takes inputs of two-dimensional time-frequency feature maps based on Mel-frequency cepstral coefficients (MFCC). We further develop a time-frequency CNN ensemble (TF-ECNN) combining the 1D-CNN and 2D-CNN based on score-level fusion of the class probabilities. On the large PhysioNet CinC challenge 2016 database, the proposed CNN models outperformed traditional classifiers based on support vector machine and hidden Markov models with various hand-crafted time- and frequency-domain features. Best classification scores with 89.22% accuracy and 89.94% sensitivity were achieved by the ECNN, and 91.55% specificity and 88.82% modified accuracy by the 2D-CNN alone on the test set. Comment: 8 pages, 1 figure, conference
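Score-level fusion of class probabilities, as mentioned in this abstract, can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the weighted average used here, the function name `fuse_scores`, and the `weight` parameter are assumptions; the toy probabilities are made up for illustration and are not results from the paper.

```python
import numpy as np


def fuse_scores(probs_1d, probs_2d, weight=0.5):
    """Weighted average of two models' class-probability arrays.

    Both inputs have shape (n_samples, n_classes). Returns the fused
    probabilities and the predicted class index per sample.
    """
    probs_1d = np.asarray(probs_1d, dtype=float)
    probs_2d = np.asarray(probs_2d, dtype=float)
    fused = weight * probs_1d + (1.0 - weight) * probs_2d
    return fused, fused.argmax(axis=1)


# Hypothetical outputs of a 1D model and a 2D model on two segments.
fused, labels = fuse_scores([[0.6, 0.4], [0.3, 0.7]],
                            [[0.2, 0.8], [0.4, 0.6]])
```

With equal weights the first segment is assigned class 1 even though the first model alone would have chosen class 0, which is the kind of disagreement fusion is meant to resolve.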

    High-dimensional structure estimation in Ising models: Local separation criterion

    We consider the problem of high-dimensional Ising (graphical) model selection. We propose a simple algorithm for structure estimation based on the thresholding of the empirical conditional variation distances. We introduce a novel criterion for tractable graph families, where this method is efficient, based on the presence of sparse local separators between node pairs in the underlying graph. For such graphs, the proposed algorithm has a sample complexity of $n=\Omega(J_{\min}^{-2}\log p)$, where $p$ is the number of variables, and $J_{\min}$ is the minimum (absolute) edge potential in the model. We also establish nonasymptotic necessary and sufficient conditions for structure estimation. Comment: Published at http://dx.doi.org/10.1214/12-AOS1009 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.