2,445 research outputs found

    Efficient regularized isotonic regression with application to gene--gene interaction search

    Full text link
    Isotonic regression is a nonparametric approach for fitting monotonic models to data that has been widely studied from both theoretical and practical perspectives. However, this approach encounters computational and statistical overfitting issues in higher dimensions. To address both concerns, we present an algorithm, which we term Isotonic Recursive Partitioning (IRP), for isotonic regression based on recursively partitioning the covariate space through solution of progressively smaller "best cut" subproblems. This creates a regularized sequence of isotonic models of increasing complexity that converges to the global isotonic regression solution. The models along the sequence are often more accurate than the unregularized isotonic regression model because of the complexity control they offer. We quantify this complexity control through estimation of degrees of freedom along the path. The success of the regularized models in prediction and IRP's favorable computational properties are demonstrated through a series of simulated and real data experiments. We discuss application of IRP to the problem of searching for gene--gene interactions and epistasis, and demonstrate it on data from genome-wide association studies of three common diseases. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics; http://dx.doi.org/10.1214/11-AOAS504
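    The one-dimensional building block of isotonic regression is the classic pool-adjacent-violators algorithm (PAVA). A minimal sketch (equal weights; this is the textbook algorithm, not the paper's IRP, which recursively partitions a multidimensional covariate space):

```python
def isotonic_fit(y):
    """Least-squares fit of a nondecreasing sequence to y
    (pool-adjacent-violators, equal weights)."""
    blocks = []  # each block is [sum, count]; its fitted value is sum/count
    for v in y:
        blocks.append([v, 1])
        # Merge backwards while the monotonicity constraint is violated,
        # comparing block means by cross-multiplication.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

print(isotonic_fit([1.0, 3.0, 2.0, 4.0]))  # [1.0, 2.5, 2.5, 4.0]
```

    The 3, 2 violation is pooled to its mean 2.5, which is the least-squares monotone fit.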

    Bayesian clustering of curves and the search of the partition space

    Get PDF
    This thesis is concerned with the study of a Bayesian clustering algorithm, proposed by Heard et al. (2006), used successfully for microarray experiments over time. It focuses not only on the development of new ways of setting hyperparameters so that inferences both reflect the scientific needs and contribute to the inferential stability of the search, but also on the design of new fast algorithms for the search over the partition space. First we use the explicit forms of the associated Bayes factors to demonstrate that such methods can be unstable under common settings of the associated hyperparameters. We then prove that the regions of instability can be removed by setting the hyperparameters in an unconventional way. Moreover, we demonstrate that MAP (maximum a posteriori) search is justified when a utility function is defined according to the scientific interest of the clusters. We then focus on the search over the partition space. In model-based clustering a comprehensive search for the highest scoring partition is usually impossible, due to the huge number of partitions of even a moderately sized dataset. We propose two methods for the partition search. One method encodes the clustering as a weighted MAX-SAT problem, while the other views clusterings as elements of the lattice of partitions. Finally, this thesis includes the full analysis of two microarray experiments for identifying circadian genes.
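    The infeasibility of exhaustive partition search is easy to quantify: the number of partitions of an n-element set is the Bell number B(n), which grows super-exponentially. A short sketch using the Bell-triangle recurrence (illustrative, not part of the thesis):

```python
def bell(n):
    """Bell number B(n): the number of partitions of an n-element set,
    computed with the Bell-triangle recurrence."""
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]  # each row starts with the previous row's last entry
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]

print(bell(3))   # 5 partitions of a 3-element set
print(bell(10))  # 115975 -- already hopeless for exhaustive scoring
```

    Even a dataset of a few dozen curves therefore rules out scoring every partition, which is why heuristics such as the MAX-SAT encoding or moves on the partition lattice are needed.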

    Quasiconvex Programming

    Full text link
    We define quasiconvex programming, a form of generalized linear programming in which one seeks the point minimizing the pointwise maximum of a collection of quasiconvex functions. We survey algorithms for solving quasiconvex programs either numerically or via generalizations of the dual simplex method from linear programming, and describe varied applications of this geometric optimization technique in meshing, scientific computation, information visualization, automated algorithm analysis, and robust statistics. Comment: 33 pages, 14 figures
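    One way to see why such programs are tractable: the pointwise maximum of quasiconvex functions is itself quasiconvex, so in one dimension it is unimodal and can be minimized by simple interval search. A minimal sketch (ternary search, assuming strict unimodality; this is a toy illustration, not one of the paper's algorithms):

```python
def minimize_pointwise_max(funcs, lo, hi, tol=1e-9):
    """Minimize g(x) = max_i f_i(x) on [lo, hi] by ternary search.
    If every f_i is quasiconvex, g is quasiconvex (unimodal in 1-D)."""
    def g(x):
        return max(f(x) for f in funcs)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) < g(m2):
            hi = m2  # minimizer lies left of m2
        else:
            lo = m1  # minimizer lies right of m1
    return (lo + hi) / 2

# max((x-1)^2, |x-3|) is minimized where the two curves cross, at x = 2
x = minimize_pointwise_max([lambda x: (x - 1) ** 2, lambda x: abs(x - 3)], 0.0, 5.0)
print(round(x, 6))  # 2.0
```

    In higher dimensions the same sublevel-set structure is what the surveyed generalized-LP methods exploit.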

    Learning and predicting with chain event graphs

    Get PDF
    Graphical models provide a very promising avenue for making sense of large, complex datasets. The most popular graphical models in use at the moment are Bayesian networks (BNs). This thesis shows, however, that they are not always ideal factorisations of a system. Instead, I advocate the use of a relatively new graphical model, the chain event graph (CEG), which is based on event trees. Event trees directly represent the event space of a system graphically. Chain event graphs reduce their potentially huge dimensionality by taking into account identical probability distributions on some of the event tree's subtrees. This brings the added benefits of showing the conditional independence relationships of the system (one of the advantages of the Bayesian network representation that event trees lack) and of implementing causal hypotheses as easily as, and arguably more naturally than, Bayesian networks, with a larger domain of implementation using purely graphical means. The trade-off for this greater expressive power, however, is that model specification and selection are much more difficult to undertake with the larger set of possible models for a given set of variables. My thesis is the first exposition of how to learn CEGs. I demonstrate not only that conjugate (and hence fast) learning of CEGs is possible, but also characterise priors that imply conjugate updating based on very reasonable assumptions that have direct Bayesian network analogues. By re-casting CEGs as partition models, I show how established partition learning algorithms can be adapted to the task of learning CEGs. I then develop a robust yet flexible prediction machine based on CEGs for any discrete multivariate time series, the dynamic CEG model, which combines the power of CEGs, multi-process and steady modelling, lattice theory and Occam's razor. This is also an exact method that produces reliable predictions without requiring much a priori modelling. I then demonstrate how easily causal analysis can be implemented with this model class, which can express a wide variety of causal hypotheses. I end with an application of these techniques to real educational data, drawing inferences that would not have been possible using BNs alone.
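    The dimensionality reduction described above, merging event-tree vertices whose conditional distributions coincide into a single "stage", can be shown with a toy sketch (the tree, labels, and probabilities here are invented for illustration and are not taken from the thesis):

```python
# Toy event tree: each non-leaf vertex carries a conditional distribution
# over its outgoing edges. Vertices with identical distributions are merged
# into one "stage" -- the reduction behind a chain event graph.
tree = {
    "root":      {"treated": 0.3, "untreated": 0.7},
    "treated":   {"recover": 0.6, "relapse": 0.4},
    "untreated": {"recover": 0.6, "relapse": 0.4},  # same distribution -> same stage
}

# Group vertices by their (hashable) conditional distribution.
stages = {}
for vertex, dist in tree.items():
    key = tuple(sorted(dist.items()))
    stages.setdefault(key, []).append(vertex)

for key, members in stages.items():
    print(members, dict(key))
```

    Here "treated" and "untreated" collapse into one stage, so only two distinct conditional distributions need to be learned instead of three; on realistic trees this merging is what keeps the parameter count manageable.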

    Testing for Common Breaks in a Multiple Equations System

    Full text link
    The issue addressed in this paper is that of testing for common breaks across or within equations of a multivariate system. Our framework is very general and allows integrated regressors and trends as well as stationary regressors. The null hypothesis is that breaks in different parameters occur at common locations and are separated by some positive fraction of the sample size unless they occur across different equations. Under the alternative hypothesis, the break dates across parameters are not the same and also need not be separated by a positive fraction of the sample size whether within or across equations. The test considered is the quasi-likelihood ratio test assuming normal errors, though as usual the limit distribution of the test remains valid with non-normal errors. Of independent interest, we provide results about the rate of convergence of the estimates when searching over all possible partitions subject only to the requirement that each regime contains at least as many observations as some positive fraction of the sample size, allowing break dates not separated by a positive fraction of the sample size across equations. Simulations show that the test has good finite sample properties. We also provide an application to issues related to level shifts and persistence for various measures of inflation to illustrate its usefulness. Comment: 44 pages, 2 tables and 1 figure
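    The idea of searching over admissible break dates subject to a minimum regime length can be illustrated in the simplest single-equation, single-break case: choose the break point minimizing the total sum of squared residuals, with a trimming fraction enforcing the minimum regime size. A toy sketch (mean-shift model only; the paper's multivariate quasi-likelihood framework is far more general):

```python
def estimate_break(y, trim=0.15):
    """Estimate a single mean-shift break date by minimizing the total sum of
    squared residuals over admissible break points, where each regime must
    contain at least a `trim` fraction of the sample."""
    n = len(y)
    lo, hi = int(n * trim), n - int(n * trim)

    def ssr(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)

    # y[:k] is the first regime, y[k:] the second.
    return min(range(lo, hi), key=lambda k: ssr(y[:k]) + ssr(y[k:]))

y = [0.0] * 30 + [2.0] * 30  # mean shifts from 0 to 2 at observation 30
print(estimate_break(y))  # 30
```

    The trimming fraction plays the role of the "positive fraction of the sample size" separating regimes in the paper's assumptions.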

    Quantum Annealing and Analog Quantum Computation

    Full text link
    We review here the recent success of quantum annealing, i.e., optimization of the cost or energy functions of complex systems utilizing quantum fluctuations. The concept is introduced in successive steps through studies of the mapping of such computationally hard problems to classical spin glass problems. Quantum spin glass problems arise with the introduction of quantum fluctuations, and we study the annealing behavior of these systems as the fluctuations are slowly reduced to zero. This provides a general framework for realizing analog quantum computation. Comment: 22 pages, 7 figures (color online); new references added. Reviews of Modern Physics (in press)
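    Quantum annealing replaces the thermal fluctuations of classical simulated annealing with quantum fluctuations that are slowly switched off. For contrast, here is a minimal classical simulated-annealing sketch (Metropolis moves under a linear cooling schedule; the energy function and all parameters are invented for illustration, and this is the classical counterpart, not quantum annealing itself):

```python
import math
import random

def simulated_annealing(energy, x0, steps=20000, t0=2.0, seed=0):
    """Metropolis moves under a linearly decreasing temperature;
    returns the best state visited."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-9  # linear cooling schedule
        cand = x + rng.uniform(-0.5, 0.5)
        ec = energy(cand)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if ec < e or rng.random() < math.exp(-(ec - e) / t):
            x, e = cand, ec
            if e < best_e:
                best_x, best_e = x, e
    return best_x

def energy(x):
    # Double-well: local minimum near x = +1, global minimum near x = -1.
    return (x ** 2 - 1) ** 2 + 0.3 * x

x_best = simulated_annealing(energy, 1.0)
print(round(x_best, 2))
```

    In the quantum version reviewed here, the decreasing temperature is replaced by a decreasing transverse-field strength, and tunneling through barriers takes the place of thermal hopping over them.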