Markov Network Structure Learning via Ensemble-of-Forests Models
Real world systems typically feature a variety of different dependency types
and topologies that complicate model selection for probabilistic graphical
models. We introduce the ensemble-of-forests model, a generalization of the
ensemble-of-trees model. Our model enables structure learning of Markov random
fields (MRF) with multiple connected components and arbitrary potentials. We
present two approximate inference techniques for this model and demonstrate
their performance on synthetic data. Our results suggest that the
ensemble-of-forests approach can accurately recover sparse, possibly
disconnected MRF topologies, even in the presence of non-Gaussian dependencies
and/or low sample size. We applied the ensemble-of-forests model to learn the
structure of perturbed signaling networks of immune cells and found that these
frequently exhibit non-Gaussian dependencies with disconnected MRF topologies.
In summary, we expect that the ensemble-of-forests model will enable MRF
structure learning in other high dimensional real world settings that are
governed by non-trivial dependencies. Comment: 13 pages, 6 figures
High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
We consider the problem of high-dimensional Gaussian graphical model
selection. We identify a set of graphs for which an efficient estimation
algorithm exists, and this algorithm is based on thresholding of empirical
conditional covariances. Under a set of transparent conditions, we establish
structural consistency (or sparsistency) for the proposed algorithm, when the
number of samples $n = \omega(J_{\min}^{-2} \log p)$, where $p$ is the number of
variables and $J_{\min}$ is the minimum (absolute) edge potential of the graphical
model. The sufficient conditions for sparsistency are based on the notion of
walk-summability of the model and the presence of sparse local vertex
separators in the underlying graph. We also derive novel non-asymptotic
necessary conditions on the number of samples required for sparsistency.
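The thresholding idea in this abstract can be illustrated with a minimal sketch. It assumes the test takes, for each node pair, the smallest absolute conditional covariance over all candidate separator sets up to a given size, and keeps the pair as an edge when that minimum exceeds a threshold. The function names, the separator-size bound, and the threshold value are hypothetical choices for illustration, not taken from the paper, and the example uses an exact covariance matrix rather than an empirical one so that the result is deterministic.

```python
from itertools import combinations

import numpy as np


def conditional_covariance(cov, i, j, S):
    """Cov(X_i, X_j | X_S) for a Gaussian with covariance matrix `cov`."""
    if not S:
        return cov[i, j]
    S = list(S)
    # Schur complement: Sigma_ij - Sigma_iS Sigma_SS^{-1} Sigma_Sj
    return cov[i, j] - cov[i, S] @ np.linalg.solve(cov[np.ix_(S, S)], cov[S, j])


def threshold_graph(cov, max_sep_size, threshold):
    """Keep (i, j) as an edge if no small separator makes the pair look independent."""
    p = cov.shape[0]
    edges = set()
    for i, j in combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        min_abs = min(
            abs(conditional_covariance(cov, i, j, S))
            for size in range(max_sep_size + 1)
            for S in combinations(others, size)
        )
        if min_abs > threshold:
            edges.add((i, j))
    return edges


# Illustrative chain graph 0 - 1 - 2, specified through its precision matrix.
precision = np.array([[1.0, 0.4, 0.0],
                      [0.4, 1.0, 0.4],
                      [0.0, 0.4, 1.0]])
cov = np.linalg.inv(precision)
edges = threshold_graph(cov, max_sep_size=1, threshold=0.05)
print(sorted(edges))
```

In practice `cov` would be replaced by the sample covariance, and the threshold would have to be tuned to the sample size; the values here are placeholders.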
Variable selection for BART: An application to gene regulation
We consider the task of discovering gene regulatory networks, which are
defined as sets of genes and the corresponding transcription factors which
regulate their expression levels. This can be viewed as a variable selection
problem, potentially with high dimensionality. Variable selection is especially
challenging in high-dimensional settings, where it is difficult to detect
subtle individual effects and interactions between predictors. Bayesian
Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a
novel nonparametric alternative to parametric regression approaches, such as
the lasso or stepwise regression, especially when the number of relevant
predictors is sparse relative to the total number of available predictors and
the fundamental relationships are nonlinear. We develop a principled
permutation-based inferential approach for determining when the effect of a
selected predictor is likely to be real. Going further, we adapt the BART
procedure to incorporate informed prior information about variable importance.
We present simulations demonstrating that our method compares favorably to
existing parametric and nonparametric procedures in a variety of data settings.
To demonstrate the potential of our approach in a biological context, we apply
it to the task of inferring the gene regulatory network in yeast (Saccharomyces
cerevisiae). We find that our BART-based procedure is best able to recover the
subset of covariates with the largest signal compared to other variable
selection methods. The methods developed in this work are readily available in
the R package bartMachine. Comment: Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/14-AOAS755 by the Institute of
Mathematical Statistics (http://www.imstat.org).
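The permutation-based selection idea in this abstract can be sketched generically. The example below does not use BART or the bartMachine package: as a loudly labeled stand-in, it scores each predictor by its absolute correlation with the response, builds a null distribution of those scores by permuting the response, and keeps predictors whose observed score exceeds a per-predictor null quantile. All function names, the quantile level, and the simulated data are assumptions made for illustration.

```python
import numpy as np


def importance(X, y):
    """Stand-in importance score (NOT BART): absolute Pearson correlation per column."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))


def permutation_select(X, y, n_perm=200, alpha=0.05, seed=0):
    """Select columns whose observed score beats the (1 - alpha) null quantile."""
    rng = np.random.default_rng(seed)
    observed = importance(X, y)
    # Permuting y breaks any real link between predictors and response.
    null = np.array([importance(X, rng.permutation(y)) for _ in range(n_perm)])
    cutoff = np.quantile(null, 1 - alpha, axis=0)
    return np.flatnonzero(observed > cutoff).tolist()


# Simulated data: only column 0 carries signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(size=200)
selected = permutation_select(X, y)
print(selected)
```

Because each pure-noise column is tested at level `alpha` on its own, an occasional false positive among them is expected under this per-predictor cutoff.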
Lower Bounds for Two-Sample Structural Change Detection in Ising and Gaussian Models
The change detection problem is to determine if the Markov network structures
of two Markov random fields differ from one another given two sets of samples
drawn from the respective underlying distributions. We study the trade-off
between the sample sizes and the reliability of change detection, measured as a
minimax risk, for the important cases of the Ising models and the Gaussian
Markov random fields restricted to the models which have network structures
with nodes and degree at most , and obtain information-theoretic lower
bounds for reliable change detection over these models. We show that for the
Ising model, samples are
required from each dataset to detect even the sparsest possible changes, and
that for the Gaussian, samples are
required from each dataset to detect change, where is the smallest
ratio of off-diagonal to diagonal terms in the precision matrices of the
distributions. These bounds are compared to the corresponding results in
structure learning, and closely match them under mild conditions on the model
parameters. Thus, our change detection bounds inherit partial tightness from
the structure learning schemes in previous literature, demonstrating that in
certain parameter regimes, the naive structure learning based approach to
change detection is minimax optimal up to constant factors. Comment: Presented at the 55th Annual Allerton Conference on Communication,
Control, and Computing, Oct. 201
Short-segment heart sound classification using an ensemble of deep convolutional neural networks
This paper proposes a framework based on deep convolutional neural networks
(CNNs) for automatic heart sound classification using short segments of
individual heart beats. We design a 1D-CNN that directly learns features from
raw heart-sound signals, and a 2D-CNN that takes inputs of two-dimensional
time-frequency feature maps based on Mel-frequency cepstral coefficients
(MFCC). We further develop a time-frequency CNN ensemble (TF-ECNN) combining
the 1D-CNN and 2D-CNN based on score-level fusion of the class probabilities.
On the large PhysioNet CinC challenge 2016 database, the proposed CNN models
outperformed traditional classifiers based on support vector machine and hidden
Markov models with various hand-crafted time- and frequency-domain features.
Best classification scores with 89.22% accuracy and 89.94% sensitivity were
achieved by the ECNN, and 91.55% specificity and 88.82% modified accuracy by
the 2D-CNN alone on the test set. Comment: 8 pages, 1 figure, conference
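Score-level fusion of class probabilities, as mentioned in this abstract, can be illustrated with a small sketch. The abstract does not state the fusion rule, so the weighted average below (equal weights by default) is an assumption, and the function name, the `weight_a` parameter, and the probability values are hypothetical.

```python
def fuse_scores(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two class-probability vectors, renormalized to sum to 1."""
    fused = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]
    total = sum(fused)
    return [f / total for f in fused]


# Made-up outputs of two classifiers for classes [normal, abnormal].
probs_1d = [0.6, 0.4]   # e.g. from a model on the raw signal
probs_2d = [0.3, 0.7]   # e.g. from a model on time-frequency features
fused = fuse_scores(probs_1d, probs_2d)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

Averaging at the score level lets the two models disagree on individual segments while the combined decision follows the more confident one.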
High-dimensional structure estimation in Ising models: Local separation criterion
We consider the problem of high-dimensional Ising (graphical) model
selection. We propose a simple algorithm for structure estimation based on the
thresholding of the empirical conditional variation distances. We introduce a
novel criterion for tractable graph families, where this method is efficient,
based on the presence of sparse local separators between node pairs in the
underlying graph. For such graphs, the proposed algorithm has a sample
complexity of , where is the number of
variables, and is the minimum (absolute) edge potential in the
model. We also establish nonasymptotic necessary and sufficient conditions for
structure estimation. Comment: Published in the Annals of Statistics
(http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/12-AOS1009 by the Institute of
Mathematical Statistics (http://www.imstat.org).
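A rough sketch of thresholding empirical conditional variation distances is given below. It reflects one plausible reading of the abstract: for a node pair (i, j) and a candidate separator S, compare the empirical distribution of X_i given X_j = +1 with that given X_j = -1, for each configuration of X_S, and keep the pair as an edge only if the smallest such distance over all small separators exceeds a threshold. The function names, the handling of empty conditioning cells, the threshold, and the hand-built sample set are all assumptions.

```python
from itertools import combinations, product


def prob_plus(samples, i, cond):
    """Empirical P(X_i = +1 | X_k = v for all (k, v) in cond); None if nothing matches."""
    matched = [x for x in samples if all(x[k] == v for k, v in cond.items())]
    if not matched:
        return None
    return sum(1 for x in matched if x[i] == 1) / len(matched)


def conditional_variation_distance(samples, i, j, S):
    """Smallest total variation distance between P(X_i | X_j=+1, X_S=s) and
    P(X_i | X_j=-1, X_S=s) over the configurations s that appear in the data."""
    best = float("inf")
    for s in product((-1, 1), repeat=len(S)):
        cond = dict(zip(S, s))
        p_pos = prob_plus(samples, i, {**cond, j: 1})
        p_neg = prob_plus(samples, i, {**cond, j: -1})
        if p_pos is None or p_neg is None:
            continue  # assumption: configurations without data are skipped
        # For a binary variable, total variation distance is |P(+1) - Q(+1)|.
        best = min(best, abs(p_pos - p_neg))
    return best


def estimate_edges(samples, max_sep_size, threshold):
    p = len(samples[0])
    edges = set()
    for i, j in combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        score = min(
            conditional_variation_distance(samples, i, j, S)
            for size in range(max_sep_size + 1)
            for S in combinations(others, size)
        )
        if score > threshold:
            edges.add((i, j))
    return edges


# Hand-built samples for a chain 0 - 1 - 2: X_0 and X_2 are independent given X_1.
block = [(1, 1)] * 9 + [(1, -1)] * 3 + [(-1, 1)] * 3 + [(-1, -1)] * 1
samples = [(a, 1, c) for a, c in block] + [(-a, -1, -c) for a, c in block]
edges = estimate_edges(samples, max_sep_size=1, threshold=0.1)
print(sorted(edges))
```

The exhaustive search over separators is only practical when the separator size bound is small, which is the regime the abstract's local separation criterion describes.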
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.