Efficient regularized isotonic regression with application to gene--gene interaction search
Isotonic regression is a nonparametric approach for fitting monotonic models
to data that has been widely studied from both theoretical and practical
perspectives. However, this approach encounters computational and statistical
overfitting issues in higher dimensions. To address both concerns, we present
an algorithm, which we term Isotonic Recursive Partitioning (IRP), for isotonic
regression based on recursively partitioning the covariate space through
solution of progressively smaller "best cut" subproblems. This creates a
regularized sequence of isotonic models of increasing model complexity that
converges to the global isotonic regression solution. The models along the
sequence are often more accurate than the unregularized isotonic regression
model because of the complexity control they offer. We quantify this complexity
control through estimation of degrees of freedom along the path. Success of the
regularized models in prediction and IRP's favorable computational properties
are demonstrated through a series of simulated and real data experiments. We
discuss application of IRP to the problem of searching for gene--gene
interactions and epistasis, and demonstrate it on data from genome-wide
association studies of three common diseases.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS504 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
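For orientation, the unregularized one-dimensional case that the abstract takes as its starting point can be sketched with the classical pool-adjacent-violators algorithm (PAVA). This is not the IRP algorithm of the paper, which targets higher-dimensional covariate spaces. The function name `pava` and its optional weights argument are illustrative assumptions.

```python
def pava(y, w=None):
    """Least-squares non-decreasing fit to y (classical PAVA, 1-D only).

    Illustrative sketch; not the IRP method described in the abstract.
    """
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    blocks = []  # each block: [weighted mean, total weight, point count]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)  # every point in a block gets its mean
    return fit


print(pava([1, 3, 2, 4]))  # the violating pair (3, 2) is pooled to 2.5
```

Each pooled block is a constant piece of the fitted step function. In one dimension, limiting the number of blocks is a simple way to picture the kind of complexity control the abstract describes.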
Bayesian clustering of curves and the search of the partition space
This thesis is concerned with the study of a Bayesian clustering algorithm, proposed by Heard et al. (2006), used successfully for microarray experiments over time. It focuses not only on the development of new ways of setting hyperparameters so that inferences both reflect the scientific needs and contribute to the inferential stability of the search, but also on the design of new fast algorithms for the search over the partition space. First, we use the explicit forms of the associated Bayes factors to demonstrate that such methods can be unstable under common settings of the associated hyperparameters. We then prove that the regions of instability can be removed by setting the hyperparameters in an unconventional way. Moreover, we demonstrate that MAP (maximum a posteriori) search is satisfied when a utility function is defined according to the scientific interest of the clusters. We then focus on the search over the partition space. In model-based clustering, a comprehensive search for the highest scoring partition is usually impossible, due to the huge number of partitions of even a moderately sized dataset. We propose two methods for the partition search. One method encodes the clustering as a weighted MAX-SAT problem, while the other views clusterings as elements of the lattice of partitions. Finally, this thesis includes the full analysis of two microarray experiments for identifying circadian genes.
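The remark about the "huge number of partitions" can be made concrete: the number of ways to partition n items is the Bell number B_n. The sketch below computes Bell numbers with the standard Bell triangle recurrence. The helper name `bell_numbers` is an illustrative assumption and is not taken from the thesis.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], where B_n counts partitions of n items."""
    bells = [1]  # B_0 = 1 (the empty set has one partition)
    row = [1]    # current row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # each later entry adds the entry to its left and the one above-left.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


counts = bell_numbers(20)
print(counts[10])  # 115975 partitions of 10 items
print(counts[20])  # 51724158235372 partitions of 20 items
```

Even 20 items admit more than 5 × 10^13 partitions, which is why exhaustive scoring is impractical and a guided search is needed.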
Quasiconvex Programming
We define quasiconvex programming, a form of generalized linear programming
in which one seeks the point minimizing the pointwise maximum of a collection
of quasiconvex functions. We survey algorithms for solving quasiconvex programs
either numerically or via generalizations of the dual simplex method from
linear programming, and describe varied applications of this geometric
optimization technique in meshing, scientific computation, information
visualization, automated algorithm analysis, and robust statistics.
Comment: 33 pages, 14 figures
Learning and predicting with chain event graphs
Graphical models provide a very promising avenue for making sense of large,
complex datasets. The most popular graphical models in use at the moment are
Bayesian networks (BNs). This thesis shows, however, that they are not always ideal factorisations
of a system. Instead, I advocate for the use of a relatively new graphical
model, the chain event graph (CEG), that is based on event trees.
Event trees directly represent graphically the event space of a system. Chain
event graphs reduce their potentially huge dimensionality by taking into account
identical probability distributions on some of the event tree’s subtrees, with the
added benefits of showing the conditional independence relationships of the system
(one of the advantages of the Bayesian network representation that event trees
lack) and implementation of causal hypotheses that is just as easy as, and arguably
more natural than, is the case with Bayesian networks, with a larger domain of
implementation using purely graphical means.
The trade-off for this greater expressive power, however, is that model specification
and selection are much more difficult to undertake with the larger set of
possible models for a given set of variables. My thesis is the first exposition of how
to learn CEGs. I demonstrate that not only is conjugate (and hence quick) learning
of CEGs possible, but I characterise priors that imply conjugate updating based
on very reasonable assumptions that also have direct Bayesian network analogues.
By re-casting CEGs as partition models, I show how established partition learning
algorithms can be adapted for the task of learning CEGs.
I then develop a robust yet flexible prediction machine based on CEGs for
any discrete multivariate time series — the dynamic CEG model — which combines
the power of CEGs, multi-process and steady modelling, lattice theory and Occam’s
razor. This is also an exact method that produces reliable predictions without
requiring much a priori modelling. I then demonstrate how easily causal analysis
can be implemented with this model class that can express a wide variety of causal
hypotheses. I end with an application of these techniques to real educational data,
drawing inferences that would not have been possible simply using BNs.
Testing for Common Breaks in a Multiple Equations System
The issue addressed in this paper is that of testing for common breaks across
or within equations of a multivariate system. Our framework is very general and
allows integrated regressors and trends as well as stationary regressors. The
null hypothesis is that breaks in different parameters occur at common
locations and are separated by some positive fraction of the sample size unless
they occur across different equations. Under the alternative hypothesis, the
break dates across parameters are not the same and also need not be separated
by a positive fraction of the sample size whether within or across equations.
The test considered is the quasi-likelihood ratio test assuming normal errors,
though as usual the limit distribution of the test remains valid with
non-normal errors. Of independent interest, we provide results about the rate
of convergence of the estimates when searching over all possible partitions
subject only to the requirement that each regime contains at least as many
observations as some positive fraction of the sample size, allowing break dates
not separated by a positive fraction of the sample size across equations.
Simulations show that the test has good finite sample properties. We also
provide an application to issues related to level shifts and persistence for
various measures of inflation to illustrate its usefulness.
Comment: 44 pages, 2 tables and 1 figure
Quantum Annealing and Analog Quantum Computation
We review here the recent success in quantum annealing, i.e., optimization of
the cost or energy functions of complex systems utilizing quantum fluctuations.
The concept is introduced in successive steps through the studies of mapping of
such computationally hard problems to the classical spin glass problems. The
quantum spin glass problems arise with the introduction of quantum
fluctuations, and the annealing behavior of the systems as these fluctuations
are reduced slowly to zero. This provides a general framework for realizing
analog quantum computation.
Comment: 22 pages, 7 figs (color online); new references added. Reviews of
Modern Physics (in press).