On semiparametric regression and data mining

Abstract

Semiparametric regression is playing an increasingly large role in the analysis of datasets exhibiting various complications (Ruppert, Wand & Carroll, 2003). In particular, semiparametric regression plays a prominent role in data mining, where such complications are numerous (Hastie, Tibshirani & Friedman, 2001). In this thesis we develop fast, interpretable methods addressing many of the difficulties associated with data mining applications, including model selection, missing value analysis, outliers and heteroscedastic noise. We focus on function estimation using penalised splines via mixed model methodology (Wahba, 1990; Speed, 1991; Ruppert et al., 2003). In dealing with these difficulties, many of the models we consider deviate from typical normality assumptions. These models lead to likelihoods involving analytically intractable integrals. Thus, in keeping with the aim of speed, we seek analytic approximations to such integrals, which are typically faster than numerical alternatives. These analytic approximations include not only the popular penalised quasi-likelihood (PQL) approximations (Breslow & Clayton, 1993) but also variational approximations. Originating in physics, variational approximations are a class of approximations, relatively new to statistics, which are simple, fast, flexible and effective. They have recently been applied to statistical problems in machine learning, where they are rapidly gaining popularity (Jordan, Ghahramani, Jaakkola & Saul, 1999; Corduneanu & Bishop, 2001; Ueda & Ghahramani, 2002; Bishop & Winn, 2003; Winn & Bishop, 2005). We develop variational approximations to: generalized linear mixed models (GLMMs); Bayesian GLMMs; simple missing value models; and outlier and heteroscedastic noise models, which are, to the best of our knowledge, new. These methods are quite effective and extremely fast, with fitting taking minutes if not seconds on a typical 2008 computer.
We also make a contribution to variational methods themselves. Variational approximations often underestimate the variance of posterior densities in Bayesian models (Humphreys & Titterington, 2000; Consonni & Marin, 2004; Wang & Titterington, 2005). To address this, we develop grid-based variational posterior approximations. These approximations combine a sequence of variational posterior approximations, can be extremely accurate and are reasonably fast.
