Applications of Diversity and the Self-Attention Mechanism in Neural Networks

Abstract

This thesis covers three contributions in applications of neural networks. The first relates to diversity and ensemble learning, while the other two cover novel applications of the self-attention mechanism.

An important aspect of training a neural network is the choice of objective function. Regression via Classification (RvC) is often used to tackle problems in deep learning where the target variable is continuous but standard regression objectives fail to capture the underlying distance metric of the domain. RvC can improve the performance of the trained model, but the optimal choice of the discrete classes it relies on is not well understood. In Paper 1, we introduce the concept of label diversity by generalizing the RvC method. By exploiting the fact that labels can be generated in arbitrary ways for continuous and ordinal target variables, we show that using multiple labels can improve the prediction accuracy of a neural network compared to using a single label, and we provide theoretical justification from ensemble theory. We apply our method to several tasks in computer vision and show increased performance compared to regression and RvC baselines.

The performance of a neural network is also influenced by the choice of network architecture, and in the design process it is important to consider the domain of the inputs and its symmetries. Graph neural networks (GNNs) are the family of networks that operate on graphs, where information is propagated between the graph nodes using, for example, self-attention. However, self-attention can also be used for other data domains if the inputs can be converted into graphs, which is not always trivial. In Paper 2, we do this for audio by using a complete graph over audio features extracted from different time slots. We apply this technique to the task of keyword spotting and show that a neural network based solely on self-attention is more accurate than previously considered architectures.

Finally, in Paper 3 we apply attention-based learning to point cloud processing, where permutation symmetry must be preserved. In order to make the self-attention mechanism both more efficient and more expressive, we propose a hierarchical approach that allows individual points to interact on both a local and a global scale. Through extensive experiments on several benchmarks, we show that this approach improves the descriptiveness of the learned features, while simultaneously reducing the computational complexity compared to an architecture that applies self-attention naively to all input points.
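To make the label-diversity idea concrete, the following PyTorch sketch discretizes a continuous target under several bin widths, trains one classification head per labeling on a shared backbone, and averages the decoded predictions as an ensemble. The bin widths, number of label sets, and layer sizes are illustrative assumptions, not the configuration used in Paper 1.

    # Minimal sketch of label diversity via Regression-via-Classification (RvC):
    # the continuous target is discretized into several different class labelings,
    # a shared backbone feeds one classifier head per labeling, and the decoded
    # predictions are averaged as an ensemble.
    import torch
    import torch.nn as nn

    class LabelDiversityRvC(nn.Module):
        def __init__(self, in_dim, y_min, y_max, bin_widths=(1.0, 2.0, 5.0)):
            super().__init__()
            self.y_min, self.y_max = y_min, y_max
            self.bin_widths = bin_widths
            self.backbone = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
            # one classification head per discretization of the target
            self.heads = nn.ModuleList()
            self.n_bins = []
            for w in bin_widths:
                n = int((y_max - y_min) / w) + 1
                self.n_bins.append(n)
                self.heads.append(nn.Linear(128, n))

        def labels(self, y):
            # class index of y under each bin width (the "diverse" labels)
            return [((y - self.y_min) / w).long().clamp(0, n - 1)
                    for w, n in zip(self.bin_widths, self.n_bins)]

        def forward(self, x):
            h = self.backbone(x)
            return [head(h) for head in self.heads]

        def loss(self, x, y):
            # sum of cross-entropy losses, one per labeling
            return sum(nn.functional.cross_entropy(logits, lab)
                       for logits, lab in zip(self.forward(x), self.labels(y)))

        @torch.no_grad()
        def predict(self, x):
            # decode each head to a continuous value (expectation over bin centres),
            # then average the heads as an ensemble
            preds = []
            for logits, w in zip(self.forward(x), self.bin_widths):
                p = logits.softmax(dim=-1)
                centres = self.y_min + w * (torch.arange(p.shape[-1], device=p.device) + 0.5)
                preds.append((p * centres).sum(dim=-1))
            return torch.stack(preds).mean(dim=0)

    # Example usage with illustrative dimensions (e.g. a target in [0, 100]):
    model = LabelDiversityRvC(in_dim=16, y_min=0.0, y_max=100.0)
    loss = model.loss(torch.randn(32, 16), torch.rand(32) * 100.0)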
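The complete-graph view of audio used in Paper 2 amounts to letting every time slot attend to every other time slot. A minimal sketch of that idea follows, assuming mel-spectrogram frames as per-slot features, a learned positional embedding, and a class token for the final keyword decision; these choices and all layer sizes are assumptions for illustration rather than the exact architecture of the paper.

    # Sketch: self-attention over audio time slots for keyword spotting.
    # Each spectrogram frame becomes one node of a complete graph, so the
    # standard Transformer encoder lets every frame attend to every other frame.
    import torch
    import torch.nn as nn

    class AttentionKeywordSpotter(nn.Module):
        def __init__(self, n_mels=40, d_model=64, n_heads=4, n_layers=4,
                     n_keywords=12, max_frames=101):
            super().__init__()
            self.embed = nn.Linear(n_mels, d_model)               # per-frame embedding
            self.cls = nn.Parameter(torch.zeros(1, 1, d_model))   # summary token
            self.pos = nn.Parameter(torch.zeros(1, 1 + max_frames, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=128, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.classifier = nn.Linear(d_model, n_keywords)

        def forward(self, spec):
            # spec: (batch, time, n_mels) -- each time slot becomes one graph node
            tokens = self.embed(spec)
            cls = self.cls.expand(spec.shape[0], -1, -1)
            tokens = torch.cat([cls, tokens], dim=1) + self.pos[:, :spec.shape[1] + 1]
            out = self.encoder(tokens)           # full (complete-graph) self-attention
            return self.classifier(out[:, 0])    # classify from the summary token

    # Example: a 1-second clip as 101 frames of 40 mel bins -> (8, 12) keyword logits
    logits = AttentionKeywordSpotter()(torch.randn(8, 101, 40))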
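The hierarchical attention of Paper 3 can be illustrated by a two-level toy model: points attend to each other only within small local groups, and pooled group tokens then attend globally, so the cost drops from O(N^2) toward roughly O(N*k + M^2). Grouping by random centres with k nearest neighbours and all layer sizes below are simplifying assumptions, not the method as implemented in the paper.

    # Toy sketch of hierarchical (local + global) self-attention for point clouds.
    import torch
    import torch.nn as nn

    class HierarchicalPointAttention(nn.Module):
        def __init__(self, d_model=64, n_heads=4, n_groups=32, k=16):
            super().__init__()
            self.n_groups, self.k = n_groups, k
            self.embed = nn.Linear(3, d_model)
            self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, xyz):
            # xyz: (B, N, 3) point coordinates
            B, N, _ = xyz.shape
            feat = self.embed(xyz)                                    # (B, N, d)
            # pick group centres (random here; farthest point sampling is the usual choice)
            centre_idx = torch.randperm(N, device=xyz.device)[: self.n_groups]
            centres = xyz[:, centre_idx]                              # (B, M, 3)
            # k nearest neighbours of each centre form a local group
            knn_idx = torch.cdist(centres, xyz).topk(self.k, largest=False).indices  # (B, M, k)
            groups = torch.gather(
                feat.unsqueeze(1).expand(B, self.n_groups, N, -1), 2,
                knn_idx.unsqueeze(-1).expand(-1, -1, -1, feat.shape[-1]))  # (B, M, k, d)
            # local attention: points interact only within their own group
            g = groups.reshape(B * self.n_groups, self.k, -1)
            g, _ = self.local_attn(g, g, g)
            tokens = g.mean(dim=1).reshape(B, self.n_groups, -1)      # one token per group
            # global attention: group summaries interact with each other
            out, _ = self.global_attn(tokens, tokens, tokens)
            return out                                                # (B, M, d) features

    # Example: 1024 points per cloud -> 32 group-level feature tokens per cloud
    feats = HierarchicalPointAttention()(torch.randn(2, 1024, 3))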
