Semiparametric regression is playing an increasingly large role in the analysis of datasets
exhibiting various complications (Ruppert, Wand & Carroll, 2003). In particular, semiparametric
regression plays a prominent role in the area of data mining, where such
complications are numerous (Hastie, Tibshirani & Friedman, 2001). In this thesis we
develop fast, interpretable methods addressing many of the difficulties associated with
data mining applications including: model selection, missing value analysis, outliers and
heteroscedastic noise.
We focus on function estimation using penalised splines via mixed model methodology
(Wahba, 1990; Speed, 1991; Ruppert et al., 2003). In dealing with the difficulties
associated with data mining applications, many of the models we consider deviate from
typical normality assumptions. These models lead to likelihoods involving analytically
intractable integrals. Thus, in keeping with the aim of speed, we seek analytic approximations
to such integrals, which are typically faster than numerical alternatives.
These analytic approximations include not only the popular penalised quasi-likelihood
(PQL) approximations (Breslow & Clayton, 1993) but also variational approximations. Originating
in physics, variational approximations are a class of approximations that is relatively
new to statistics and that is simple, fast, flexible and effective. They have recently been
applied to statistical problems in machine learning where they are rapidly gaining popularity
(Jordan, Ghahramani, Jaakkola & Saul, 1999; Corduneanu & Bishop, 2001; Ueda &
Ghahramani, 2002; Bishop & Winn, 2003; Winn & Bishop, 2005).
We develop variational approximations for generalized linear mixed models
(GLMMs), Bayesian GLMMs, simple missing value models, and outlier and heteroscedastic
noise models. These approximations are, to the best of our knowledge, new. These methods
are quite effective and extremely fast, with fitting taking minutes, if not seconds, on a
typical 2008 computer.
We also make a contribution to variational methods themselves. Variational approximations
often underestimate the variance of posterior densities in Bayesian models
(Humphreys & Titterington, 2000; Consonni & Marin, 2004; Wang & Titterington, 2005).
We develop grid-based variational posterior approximations. These approximations combine
a sequence of variational posterior approximations, can be extremely accurate and are
reasonably fast.