thesis

Getting the most from medical VOC data using Bayesian feature learning

Abstract

The metabolic processes in the body naturally produce a diverse set of Volatile Organic Compounds (VOCs), which are excreted in breath, urine, stool and other biological samples. The VOCs produced are odorous and influenced by disease, meaning olfaction can provide information on a person’s disease state. A variety of instruments exist for performing “artificial olfaction”: measuring a sample, such as patient breath, and producing a high-dimensional output representing the odour. Such instruments may be paired with machine learning techniques to identify properties of interest, such as the presence of a given disease. Research shows that artificial olfaction instrumentation has good disease-predictive ability. However, the statistical methods employed are typically off-the-shelf, and do not take advantage of prior knowledge of the structure of the high-dimensional data. Since sample sizes are also typically small, this can lead to suboptimal results due to a poorly learned model. In this thesis we explore ways to get more out of artificial olfaction data. We perform statistical analyses in a medical setting, investigating disease diagnosis from breath, urine and vaginal swab measurements, and illustrating both successful identification and failure cases. We then introduce two new latent variable models constructed for dimension reduction of artificial olfaction data, but which are widely applicable. These models place a Gaussian Process (GP) prior on the mapping from latent variables to observations. Specifying a covariance function for the GP prior is an intuitive way for a user to describe their prior knowledge of the data covariance structure. We also enable an approximate posterior and marginal likelihood to be computed, and introduce a sparse variant. Both models have been made available in the R package stpca hosted at https://github.com/JimSkinner/stpca. In experiments with artificial olfaction data, these models outperform standard feature learning methods in a predictive pipeline.
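
As a rough illustration of what placing a GP prior on the mapping from latent variables to observations can look like, the following NumPy sketch draws the columns of a linear loading matrix from a GP whose covariance function encodes smoothness across neighbouring sensor channels. This is not the stpca package API or the thesis's exact model: the linear form, the squared exponential covariance, and all names and settings (`squared_exponential`, `lengthscale`, `noise_sd`, the array sizes) are assumptions made for illustration only.

```python
import numpy as np


def squared_exponential(x, variance=1.0, lengthscale=5.0):
    """Hypothetical covariance function: nearby feature indices co-vary strongly."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


rng = np.random.default_rng(0)
n_samples, n_features, n_latent = 20, 100, 2  # small n, large d (assumed sizes)

# Feature "locations", e.g. positions along a sensor or time axis (assumed).
x = np.arange(n_features, dtype=float)

# Prior covariance over features; a small jitter keeps the Cholesky stable.
K = squared_exponential(x) + 1e-6 * np.eye(n_features)

# Draw each column of the loading matrix W from the GP prior N(0, K).
chol = np.linalg.cholesky(K)
W = chol @ rng.standard_normal((n_features, n_latent))

# Generate observations from latent variables: X = Z W^T + Gaussian noise.
Z = rng.standard_normal((n_samples, n_latent))
noise_sd = 0.1
X = Z @ W.T + noise_sd * rng.standard_normal((n_samples, n_features))
```

Swapping in a different covariance function changes the structure the prior favours in `W`, which is the sense in which the covariance function describes prior knowledge of the data.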
