Autoencoder-based techniques for improved classification in settings with high dimensional and small sized data

Abstract

Neural network models have been widely tested and analysed using large, high-dimensional datasets. In real-world applications, however, the available datasets are often limited in size because of the cost or difficulty of collecting the data. This limitation in the number of examples can challenge classification algorithms and degrade their performance. A motivating example of this kind of problem is predicting the health status of a tissue from its gene expression when the number of samples available to learn from is very small.

Gene expression data has distinguishing characteristics that attract the machine learning research community. The high dimensionality of the data is one of the integral features that has to be considered when building predictive models. A single sample is described by thousands of gene expression values, whereas the benchmark image and text datasets commonly used for analysing existing models have only a few hundred features. Gene expression samples are also distributed unequally among the classes; in addition, they include noisy features, which degrade the prediction accuracy of the models. These characteristics give rise to the need for effective dimensionality reduction methods, such as autoencoders, that are able to discover the complex relationships between the features.

This thesis investigates the problem of predicting from small, high-dimensional datasets by introducing novel autoencoder-based techniques to increase classification accuracy. In the first stage of the study, two autoencoder-based methods were introduced: one for generating synthetic data examples and one for generating synthetic representations of the data. Both methods are applicable to the testing phase of the autoencoder and proved successful in increasing the predictability of the data.

Enhancing the autoencoder’s ability to learn from small, imbalanced data was investigated in the second stage of the project, with the aim of developing techniques that improve the representations the autoencoder generates. The solution provided by this thesis employs the radial basis activation mechanism used in radial basis function networks, which learn in a supervised manner, to enhance the representations learned by unsupervised algorithms. This technique was later applied to stochastic variational autoencoders and showed promising results in learning discriminative representations from gene expression data.

The contributions of this thesis are a number of methods applicable to different stages (training and testing) and to different autoencoder models (deterministic and stochastic), each of which individually enhances the predictability of small, high-dimensional datasets compared with well-known baseline methods.
