Variational autoencoders for supervision, calibration and multimodal learning

Abstract

Learning representations of data has long been a central goal in machine learning. Constructing such representations enables downstream tasks such as classification or object detection to be performed efficiently. Furthermore, it is desirable for these representations to be constructed in a way that makes them interpretable, which allows for fine-grained intervention and reasoning on characteristics of the input. Other tasks may include cross-generation between modalities, or calibrating predictions such that their confidence matches their accuracy. An effective way to learn representations is through a Variational Autoencoder (VAE), which performs variational inference on the latent variables underlying the observable input. In this thesis we show how the VAE can be utilised to: incorporate label information into the learning process; learn shared representations of multimodal data; and calibrate the predictions of existing neural classifiers.

Data sources are often accompanied by additional label information, which may indicate the presence of a characteristic in the input. A question naturally arises as to whether this additional label information can be used to structure the representation such that it provides a notion of interpretability about the characteristic, such as "to what extent is the person smiling?". The first contribution of this thesis addresses this problem and proposes a method which successfully uses label information to structure the latent space. This in turn allows us to perform additional tasks such as fine-grained interventions, classification, and conditional generation. Moreover, we are also able to handle the case where label information is missing, drastically reducing the data burden when training these models.

Rather than being presented with labels, we sometimes instead observe another unstructured observation of the same object, e.g. a caption of an image. In this scenario, the objective changes slightly to one where the model learns shared representations of the data, allowing it to perform cross-generation between modalities. The second contribution of this thesis addresses this problem. Here, learning is performed by employing mutual supervision between the modalities and introducing a bi-directional objective, which faithfully ensures symmetry in the model. Furthermore, by virtue of this approach, we are able to learn these representations in situations where some of the modalities may be missing during training.

Uncertainty quantification is an important task in machine learning, and it is now well known that current deep learning models severely overestimate their confidence. The final contribution of this thesis addresses how the representations of VAEs can be used to extract reliable confidence estimates for neural classifiers. This investigation leads to a novel approach to calibrating neural classifiers, which is applied post-hoc to off-the-shelf classifiers and is very fast to train and test.
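As background for the variational inference mentioned above, the standard VAE objective (a textbook sketch, not the specific objectives developed in the thesis) trains an encoder q_\phi(z \mid x) and decoder p_\theta(x \mid z) by maximising the evidence lower bound (ELBO):

    \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right] - \mathrm{KL}\left( q_\phi(z \mid x) \,\|\, p(z) \right) \;\le\; \log p_\theta(x)

The contributions summarised here extend this basic objective to supervised, multimodal, and post-hoc calibration settings.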
