New sampling and optimization methods for topic inference and text classification

Abstract

Topic modelling (TM) methods, such as latent Dirichlet allocation (LDA), are statistical models used to uncover hidden thematic structures, or topics, in unstructured text. In this context, a topic is a distribution over words, and a document is a distribution over topics. Topic models are usually unsupervised; however, supervised variants have been proposed, such as supervised LDA (SLDA), which can be used for text classification. A supervised topic model can be evaluated by its classification accuracy. Evaluating an unsupervised topic model is less straightforward and is usually done with two metrics: held-out perplexity and coherence. Held-out perplexity measures the model’s ability to generalize to unseen documents (see the sketch below); coherence measures a semantic distance between the words within each topic.

This thesis explores ideas for enhancing the performance of TM, both supervised and unsupervised. First, multi-objective topic modelling (MOEA-TM) is proposed, which uses a multi-objective evolutionary algorithm (MOEA) to optimize two objectives: coverage and coherence. MOEA-TM has two settings: ‘start from scratch’ and ‘start from an estimated topic model’. In the latter, held-out perplexity is added as a third objective. In both settings, MOEA-TM achieves highly coherent topics. Second, a genetic algorithm is developed with the LDA log-likelihood as its fitness function. This algorithm can improve log-likelihood by up to 10%; however, perplexity scores deteriorate slightly due to over-fitting.

Hyperparameters play a significant role in TM; thus, Gibbs-Newton (GN), an efficient approach to learning the parameter of a multivariate Pólya distribution, is proposed. A closer look at the LDA model reveals that it comprises two multivariate Pólya distributions: one models topics, and the other models topic proportions in documents. Consequently, a better approach to learning the multivariate Pólya parameter may enhance TM. GN is benchmarked against Minka’s fixed-point iteration, a slice sampling technique, and the method of moments. We find that GN matches the accuracy of Minka’s fixed-point iteration in less time, and is more accurate than the other approaches.

Next, LDA-GN is proposed, which uses the GN method in topic modelling. It achieves better perplexity scores than the original LDA on the three corpora tested. LDA-GN is also tested on a supervised task via SLDA-GN, the SLDA model equipped with GN to learn its hyperparameters; SLDA-GN outperforms the original SLDA, which optimizes its hyperparameters using Minka’s fixed-point iteration. Furthermore, LDA-GN is evaluated on a spam-filtering task using the Multi-corpus LDA (MC-LDA) model, where it shows more stable performance than the standard LDA.

Finally, most topic models are based on the “Bag of Words” assumption, under which a document’s word order is lost and only word frequencies are preserved. We propose the LDA-crr model, which represents word order as an observed variable. LDA-crr introduces only minor additional complexity to TM; thus, it can be applied readily to large corpora. LDA-crr is benchmarked against the original LDA using fixed hyperparameters, to isolate their influence. LDA-crr outperforms LDA in terms of perplexity and yields slightly more coherent topics as the number of topics increases.
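The held-out perplexity used in these comparisons is the standard metric: the exponentiated negative mean log-likelihood per held-out token (equivalently, the inverse geometric mean of the per-token probabilities). Below is a minimal Python sketch; the function name is illustrative, not from the thesis.

```python
import numpy as np

def held_out_perplexity(per_token_log_probs):
    """Perplexity = exp(-mean per-token log-likelihood) over all tokens
    in the held-out documents.  Lower is better: the model assigns
    higher probability to unseen text."""
    lp = np.asarray(per_token_log_probs, dtype=float)
    return float(np.exp(-lp.mean()))

# Toy example: four held-out tokens and their model probabilities.
print(held_out_perplexity(np.log([0.1, 0.05, 0.2, 0.1])))  # -> 10.0
```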
Also, LDA-crr is equipped with the GN approach and with the slice sampling technique, in the LDA-crrGN and LDA-crrGSS models respectively. LDA-crrGN generalizes slightly better to unseen documents than LDA-GN on one corpus when the number of topics is high. In general, however, LDA-crrGSS shows better coherence scores than both LDA-GN and the original LDA. Furthermore, to investigate LDA-crr’s performance on a classification task, SLDA is extended to incorporate word order in the SLDA-crr model; the GN and GSS techniques are then used to learn its hyperparameters, in the SLDA-crrGN and SLDA-crrGSS models respectively. Compared with SLDA-GN and the original SLDA, SLDA-crrGN achieves better accuracy in classifying unseen documents. This indicates that SLDA-crrGN picks up more useful information from the training corpus, which consequently helps the model perform better.
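For context on the baseline that GN is benchmarked against throughout: Minka’s fixed-point iteration for the maximum-likelihood parameter of a multivariate Pólya (Dirichlet-multinomial) distribution is a published update rule. The sketch below implements only that Minka baseline, not the GN update (which the abstract does not specify); names and defaults are illustrative.

```python
import numpy as np
from scipy.special import digamma

def minka_fixed_point(counts, n_iter=1000, tol=1e-8):
    """Maximum-likelihood parameter of a multivariate Polya
    (Dirichlet-multinomial) distribution via Minka's fixed-point
    iteration.  `counts` is a (D, K) array: D observations (e.g.
    documents), K categories (e.g. topics or vocabulary words)."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.ones(counts.shape[1])   # neutral starting point
    row_totals = counts.sum(axis=1)
    for _ in range(n_iter):
        alpha_sum = alpha.sum()
        # numerator:   sum_d [psi(n_dk + a_k) - psi(a_k)]   per component k
        num = (digamma(counts + alpha) - digamma(alpha)).sum(axis=0)
        # denominator: sum_d [psi(n_d + sum a) - psi(sum a)]  (a scalar)
        den = (digamma(row_totals + alpha_sum) - digamma(alpha_sum)).sum()
        new_alpha = alpha * num / den
        if np.max(np.abs(new_alpha - alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```

In LDA, the count rows would be per-document topic counts (to learn the document–topic hyperparameter) or per-topic word counts (to learn the topic–word hyperparameter). Each iteration scans every count once, which is why the thesis’s timing result, GN reaching the same accuracy in less time, matters on large corpora.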
