Division of Molecular Biosciences, Imperial College London
Doi
Abstract
The modern biological sciences are fraught with statistical difficulties. Biomolecular
stochasticity, experimental noise, and the βlarge p, small nβ problem all contribute to
the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful
conclusions from observations. In this thesis, we explore methods for assessing
the effects of data variability upon downstream inference, in an attempt to quantify and
promote the stability of the inferences we make.
We start with a review of existing methods for addressing this problem, focusing upon the
bootstrap and similar methods. The key requirement for all such approaches is a statistical
model that approximates the data generating process.
We move on to consider biomarker discovery problems. We present a novel algorithm for
proposing putative biomarkers on the strength of both their predictive ability and the stability
with which they are selected. In a simulation study, we find our approach to perform
favourably in comparison to strategies that select on the basis of predictive performance
alone.
We then consider the real problem of identifying protein peak biomarkers for HAM/TSP,
an inflammatory condition of the central nervous system caused by HTLV-1 infection.
We apply our algorithm to a set of SELDI mass spectral data, and identify a number of
putative biomarkers. Additional experimental work, together with known results from the
literature, provides corroborating evidence for the validity of these putative biomarkers.
Having focused on static observations, we then make the natural progression to time
course data sets. We propose a (Bayesian) bootstrap approach for such data, and then
apply our method in the context of gene network inference and the estimation of parameters
in ordinary differential equation models. We find that the inferred gene networks
are relatively unstable, and demonstrate the importance of finding distributions of ODE
parameter estimates, rather than single point estimates