On the Consistency of Optimal Bayesian Feature Selection in the Presence of Correlations
Optimal Bayesian feature selection (OBFS) is a multivariate supervised
screening method designed from the ground up for biomarker discovery. In this
work, we prove that Gaussian OBFS is strongly consistent under mild conditions,
and provide rates of convergence for key posteriors in the framework. These
results are of enormous importance, since they identify precisely what features
are selected by OBFS asymptotically, characterize the relative rates of
convergence for posteriors on different types of features, provide conditions
that guarantee convergence, justify the use of OBFS when its internal
assumptions are invalid, and set the stage for understanding the asymptotic
behavior of other algorithms based on the OBFS framework.
Comment: 33 pages, 1 figure
Bayesian Variable Selection with Structure Learning: Applications in Integrative Genomics
Significant advances in biotechnology have allowed for simultaneous
measurement of molecular data points across multiple genomic and transcriptomic
levels from a single tumor/cancer sample. This has motivated systematic
approaches to integrate multi-dimensional structured datasets since cancer
development and progression is driven by numerous co-ordinated molecular
alterations and the interactions between them. We propose a novel two-step
Bayesian approach that combines a variable selection framework with integrative
structure learning between multiple sources of data. The structure learning in
the first step is accomplished through novel joint graphical models for
heterogeneous (mixed scale) data allowing for flexible incorporation of prior
knowledge. This structure learning subsequently informs the variable selection
in the second step to identify groups of molecular features within and across
platforms associated with outcomes of cancer progression. The variable
selection strategy adjusts for collinearity and multiplicity, and also has
theoretical justifications. We evaluate our methods through simulations and
apply them to motivating genomic (DNA copy number and methylation) and
transcriptomic (mRNA expression) datasets to assess important markers
associated with Glioblastoma progression.
Ranking and Selection as Stochastic Control
Under a Bayesian framework, we formulate the fully sequential sampling and
selection decision in statistical ranking and selection as a stochastic control
problem, and derive the associated Bellman equation. Using value function
approximation, we derive an approximately optimal allocation policy. We show
that this policy is not only computationally efficient but also possesses both
one-step-ahead and asymptotic optimality for independent normal sampling
distributions. Moreover, the proposed allocation policy is easily generalizable
in the approximate dynamic programming paradigm.
Comment: 15 pages, 8 figures, to appear in IEEE Transactions on Automatic Control
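The abstract does not spell out the paper's value-function-approximation policy. Purely as an illustration of one-step-ahead allocation under independent normal sampling distributions, here is a minimal knowledge-gradient-style sketch; the function names, the flat-prior posterior, and the test problem are all assumptions for illustration, not the authors' method:

```python
import numpy as np
from scipy.stats import norm

def kg_scores(mu, s2, lam):
    """One-step-ahead value of sampling each alternative once, given
    independent normal posteriors N(mu_i, s2_i) and sampling variance lam."""
    # Reduction in posterior std from one additional observation.
    sig_tilde = s2 / np.sqrt(s2 + lam)
    best_other = np.array([np.max(np.delete(mu, i)) for i in range(len(mu))])
    zeta = -np.abs(mu - best_other) / np.maximum(sig_tilde, 1e-12)
    return sig_tilde * (zeta * norm.cdf(zeta) + norm.pdf(zeta))

def run_rs(true_means, lam=1.0, n0=5, budget=200, seed=0):
    """Fully sequential ranking and selection with a one-step-ahead policy."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.full(k, n0)
    sums = np.array([rng.normal(m, np.sqrt(lam), n0).sum() for m in true_means])
    for _ in range(budget - n0 * k):
        mu = sums / counts
        s2 = lam / counts  # posterior variance under a flat prior
        i = int(np.argmax(kg_scores(mu, s2, lam)))
        sums[i] += rng.normal(true_means[i], np.sqrt(lam))
        counts[i] += 1
    return int(np.argmax(sums / counts))
```

With a clear best alternative and a modest budget, the policy concentrates samples on the leading contenders and returns the index of the best.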
Bayesian variable selection with shrinking and diffusing priors
We consider a Bayesian approach to variable selection in the presence of high
dimensional covariates based on a hierarchical model that places prior
distributions on the regression coefficients as well as on the model space. We
adopt the well-known spike and slab Gaussian priors with a distinct feature,
that is, the prior variances depend on the sample size through which
appropriate shrinkage can be achieved. We show the strong selection consistency
of the proposed method in the sense that the posterior probability of the true
model converges to one even when the number of covariates grows nearly
exponentially with the sample size. This is arguably the strongest selection
consistency result that has been available in the Bayesian variable selection
literature; yet the proposed method can be carried out through posterior
sampling with a simple Gibbs sampler. Furthermore, we argue that the proposed
method is asymptotically similar to model selection with the L0 penalty. We
also demonstrate through empirical work the strong performance of the proposed
approach relative to some state-of-the-art alternatives.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1207 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
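The hierarchical spike-and-slab model described above admits a simple Gibbs sampler that alternates between the coefficients and the inclusion indicators. The sketch below is not the authors' implementation; the fixed noise variance, hyperparameters, and a non-vanishing spike variance are simplifying assumptions:

```python
import numpy as np
from scipy.stats import norm

def spike_slab_gibbs(X, y, tau0=0.01, tau1=1.0, sigma2=1.0, q=0.2,
                     n_iter=2000, burn=500, seed=0):
    """Gibbs sampler for y = X beta + eps with spike-and-slab Gaussian priors:
    beta_j ~ N(0, tau1^2) if z_j = 1, else N(0, tau0^2); z_j ~ Bernoulli(q).
    Returns posterior inclusion probabilities."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    z = np.zeros(p, dtype=int)
    XtX, Xty = X.T @ X, X.T @ y
    incl = np.zeros(p)
    for it in range(n_iter):
        # beta | z, y: Gaussian with precision XtX/sigma2 + diag(1/tau_z^2).
        d = np.where(z == 1, tau1 ** 2, tau0 ** 2)
        cov = np.linalg.inv(XtX / sigma2 + np.diag(1.0 / d))
        beta = rng.multivariate_normal(cov @ Xty / sigma2, cov)
        # z_j | beta_j: compare slab and spike densities at the current beta_j.
        log_slab = np.log(q) + norm.logpdf(beta, 0, tau1)
        log_spike = np.log1p(-q) + norm.logpdf(beta, 0, tau0)
        z = (rng.random(p) < 1.0 / (1.0 + np.exp(log_spike - log_slab))).astype(int)
        if it >= burn:
            incl += z
    return incl / (n_iter - burn)
```

On synthetic data with a few strong signals, the inclusion probabilities separate signal from noise coefficients cleanly.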
Anchored Discrete Factor Analysis
We present a semi-supervised learning algorithm for learning discrete factor
analysis models with arbitrary structure on the latent variables. Our algorithm
assumes that every latent variable has an "anchor", an observed variable with
only that latent variable as its parent. Given such anchors, we show that it is
possible to consistently recover moments of the latent variables and use these
moments to learn complete models. We also introduce a new technique for
improving the robustness of method-of-moment algorithms by optimizing over the
marginal polytope or its relaxations. We evaluate our algorithm using two
real-world tasks, tag prediction on questions from the Stack Overflow website
and medical diagnosis in an emergency department.
Bayes Regularized Graphical Model Estimation in High Dimensions
There has been an intense development of Bayes graphical model estimation
approaches over the past decade - however, most of the existing methods are
restricted to moderate dimensions. We propose a novel approach suitable for
high dimensional settings, by decoupling model fitting and covariance
selection. First, a full model based on a complete graph is fit under novel
class of continuous shrinkage priors on the precision matrix elements, which
induces shrinkage under an equivalence with Cholesky-based regularization while
enabling conjugate updates of entire precision matrices. Subsequently, we
propose a post-fitting graphical model estimation step which proceeds using
penalized joint credible regions to perform neighborhood selection sequentially
for each node. The posterior computation proceeds using straightforward fully
Gibbs sampling, and the approach is scalable to high dimensions. The proposed
approach is shown to be asymptotically consistent in estimating the graph
structure for fixed p when the truth is a Gaussian graphical model.
Simulations show that our approach compares favorably with Bayesian competitors
both in terms of graphical model estimation and computational efficiency. We
apply our methods to high dimensional gene expression and microRNA datasets in
cancer genomics.
Comment: 42 pages, 4 figures, 5 tables
Deep Learning Interfacial Momentum Closures in Coarse-Mesh CFD Two-Phase Flow Simulation Using Validation Data
Multiphase flow phenomena are widely observed in industrial applications, yet
their modeling remains a challenging, unsolved problem. Three-dimensional
computational fluid dynamics (CFD) approaches resolve the flow fields on
finer spatial and temporal scales, which can complement dedicated experimental
study. However, closures must be introduced to reflect the underlying physics
in multiphase flow. Among them, the interfacial forces, including drag, lift,
turbulent-dispersion and wall-lubrication forces, play an important role in
bubble distribution and migration in liquid-vapor two-phase flows. The
development of these closures has traditionally relied on experimental data and
analytical derivation under simplified assumptions, which usually cannot deliver a universal
solution across a wide range of flow conditions. In this paper, a data-driven
approach, named feature-similarity measurement (FSM), is developed and
applied to improve the simulation capability of two-phase flow with coarse-mesh
CFD approach. Interfacial momentum transfer in adiabatic bubbly flow serves as
the focus of the present study. Both a mature and a simplified set of
interfacial closures are taken as the low-fidelity data. Validation data
(including relevant experimental data and validated fine-mesh CFD simulation
results) are adopted as high-fidelity data. Qualitative and quantitative
analyses performed in this paper reveal that FSM can substantially
improve the prediction of the coarse-mesh CFD model, regardless of the choice
of interfacial closures, and it provides scalability and consistency across
discontinuous flow regimes. It demonstrates that data-driven methods can aid
the multiphase flow modeling by exploring the connections between local
physical features and simulation errors.
Comment: This paper has been submitted to the International Journal of Multiphase Flow
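The abstract gives no internals for FSM, so the following is only an illustration of the general idea it describes: learn the local error of a low-fidelity model as a function of local physical features, then use that learned map to correct new coarse predictions. Every name, the nearest-neighbor choice, and the synthetic 1-D error form are invented for this sketch:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def train_error_model(features_lf, pred_lf, truth_hf, k=5):
    """Learn the low-fidelity prediction error (high-fidelity minus
    low-fidelity) as a function of local physical features."""
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(features_lf, truth_hf - pred_lf)
    return model

def correct(model, features_new, pred_new):
    """Correct new low-fidelity predictions using the learned error map."""
    return pred_new + model.predict(features_new)
```

On a synthetic problem where the low-fidelity model carries a feature-dependent bias, the corrected prediction is closer to the high-fidelity truth than the raw one.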
Variable Selection with Exponential Weights and -Penalization
In the context of a linear model with a sparse coefficient vector,
exponential weights methods have been shown to achieve oracle inequalities
for prediction. We show that such methods also succeed at variable selection
and estimation under the necessary identifiability condition on the design
matrix, instead of much stronger assumptions required by other methods such as
the Lasso or the Dantzig Selector. The same analysis yields consistency results
for Bayesian methods and BIC-type variable selection under similar conditions.Comment: 23 pages; 1 figure
Selection of a Model of Cerebral Activity for fMRI Group Data Analysis
This thesis is dedicated to the statistical analysis of multi-subject fMRI
data, with the purpose of identifying brain structures involved in certain
cognitive or sensori-motor tasks, in a reproducible way across subjects. To
overcome certain limitations of standard voxel-based testing methods, as
implemented in the Statistical Parametric Mapping (SPM) software, we introduce
a Bayesian model selection approach to this problem, meaning that the most
probable model of cerebral activity given the data is selected from a
pre-defined collection of possible models. Based on a parcellation of the brain
volume into functionally homogeneous regions, each model corresponds to a
partition of the regions into those involved in the task under study and those
inactive. This makes it possible to incorporate prior information and avoids
the dependence of the SPM-like approach on an arbitrary threshold, the
cluster-forming threshold, used to define active regions. By controlling a Bayesian
risk, our approach balances false positive and false negative risk control.
Furthermore, it is based on a generative model that accounts for the spatial
uncertainty on the localization of individual effects, due to spatial
normalization errors. On both simulated and real fMRI datasets, we show that
this new paradigm corrects several biases of the SPM-like approach, which
either swells or misses the different active regions, depending on the choice
of a cluster-forming threshold.
Comment: PhD thesis, 208 pages, Applied Statistics and Neuroimaging, University of Orsay, France
Variable Selection Using Shrinkage Priors
Variable selection has received widespread attention over the last decade as
we routinely encounter high-throughput datasets in complex biological and
environmental research. Most Bayesian variable selection methods are restricted
to mixture priors having separate components for characterizing the signal and
the noise. However, such priors encounter computational issues in high
dimensions. This has motivated continuous shrinkage priors that resemble the
two-component priors while facilitating computation and interpretability. While such
priors are widely used for estimating high-dimensional sparse vectors,
selecting a subset of variables remains a daunting task. In this article, we
propose a general approach for variable selection with shrinkage priors. The
presence of very few tuning parameters makes our method attractive in
comparison to ad hoc thresholding approaches. The applicability of the approach
is not limited to continuous shrinkage priors, but can be used along with any
shrinkage prior. Theoretical properties for near-collinear design matrices are
investigated and the method is shown to have good performance in a wide range
of synthetic data examples.
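One simple recipe in the spirit described above, not necessarily the authors' procedure, is to cluster the absolute values of shrunken coefficient estimates into two groups and declare the high-magnitude cluster the selected variables. In this sketch, ridge regression stands in for a shrinkage-prior posterior mean:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def select_by_clustering(X, y, alpha=1.0, seed=0):
    """Two-means clustering of |shrinkage estimates|: the cluster with the
    larger center is declared 'signal'. Ridge is a stand-in for the posterior
    mean under a continuous shrinkage prior."""
    beta = Ridge(alpha=alpha).fit(X, y).coef_
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(
        np.abs(beta).reshape(-1, 1))
    signal_label = np.argmax(km.cluster_centers_.ravel())
    return km.labels_ == signal_label
```

Because the decision rule depends only on the clustering, there is no per-variable threshold to tune, which mirrors the low-tuning appeal described in the abstract.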