2,089 research outputs found
Integrating biological knowledge into variable selection : an empirical Bayes approach with an application in cancer biology
Background:
An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used nor how it should be weighted in relation to primary data.
Results:
We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemically-motivated sparsity constraints to permit exact inference. We show an example of priors for pathway- and network-based information and illustrate our proposed method on both synthetic response data and by an application to cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalised-likelihood methods for incorporating network-based information.
Conclusions:
The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against mis-specification of priors. Empirical Bayes, together with the proposed pathway-based priors, results in an approach with a competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few user-set parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge
Robust causal structure learning with some hidden variables
We introduce a new method to estimate the Markov equivalence class of a
directed acyclic graph (DAG) in the presence of hidden variables, in settings
where the underlying DAG among the observed variables is sparse, and there are
a few hidden variables that have a direct effect on many of the observed ones.
Building on the so-called low rank plus sparse framework, we suggest a
two-stage approach which first removes the effect of the hidden variables, and
then estimates the Markov equivalence class of the underlying DAG under the
assumption that there are no remaining hidden variables. This approach is
consistent in certain high-dimensional regimes and performs favourably when
compared to the state of the art, both in terms of graphical structure recovery
and total causal effect estimation
Locally adaptive smoothing with Markov random fields and shrinkage priors
We present a locally adaptive nonparametric curve fitting method that
operates within a fully Bayesian framework. This method uses shrinkage priors
to induce sparsity in order-k differences in the latent trend function,
providing a combination of local adaptation and global control. Using a scale
mixture of normals representation of shrinkage priors, we make explicit
connections between our method and kth order Gaussian Markov random field
smoothing. We call the resulting processes shrinkage prior Markov random fields
(SPMRFs). We use Hamiltonian Monte Carlo to approximate the posterior
distribution of model parameters because this method provides superior
performance in the presence of the high dimensionality and strong parameter
correlations exhibited by our models. We compare the performance of three prior
formulations using simulated data and find the horseshoe prior provides the
best compromise between bias and precision. We apply SPMRF models to two
benchmark data examples frequently used to test nonparametric methods. We find
that this method is flexible enough to accommodate a variety of data generating
models and offers the adaptive properties and computational tractability to
make it a useful addition to the Bayesian nonparametric toolbox.Comment: 38 pages, to appear in Bayesian Analysi
Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much of recent work has focused on understanding the
computational complexity of proposed methods for "Big Data." Sample complexity
however has received relatively less attention, especially in the setting when
the sample size is fixed, and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche but only the latter regime applies to exa-scale data
dimension. We illustrate this high dimensional framework for the problem of
correlation mining, where it is the matrix of pairwise and partial correlations
among the variables that are of interest. We demonstrate various regimes of
correlation mining based on the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks
Neural Connectivity with Hidden Gaussian Graphical State-Model
The noninvasive procedures for neural connectivity are under questioning.
Theoretical models sustain that the electromagnetic field registered at
external sensors is elicited by currents at neural space. Nevertheless, what we
observe at the sensor space is a superposition of projected fields, from the
whole gray-matter. This is the reason for a major pitfall of noninvasive
Electrophysiology methods: distorted reconstruction of neural activity and its
connectivity or leakage. It has been proven that current methods produce
incorrect connectomes. Somewhat related to the incorrect connectivity
modelling, they disregard either Systems Theory and Bayesian Information
Theory. We introduce a new formalism that attains for it, Hidden Gaussian
Graphical State-Model (HIGGS). A neural Gaussian Graphical Model (GGM) hidden
by the observation equation of Magneto-encephalographic (MEEG) signals. HIGGS
is equivalent to a frequency domain Linear State Space Model (LSSM) but with
sparse connectivity prior. The mathematical contribution here is the theory for
high-dimensional and frequency-domain HIGGS solvers. We demonstrate that HIGGS
can attenuate the leakage effect in the most critical case: the distortion EEG
signal due to head volume conduction heterogeneities. Its application in EEG is
illustrated with retrieved connectivity patterns from human Steady State Visual
Evoked Potentials (SSVEP). We provide for the first time confirmatory evidence
for noninvasive procedures of neural connectivity: concurrent EEG and
Electrocorticography (ECoG) recordings on monkey. Open source packages are
freely available online, to reproduce the results presented in this paper and
to analyze external MEEG databases
- …