
    Fast and modular regularized topic modelling

    Topic modelling is an area of text mining that has been actively developed over the last 15 years. A probabilistic topic model extracts a set of hidden topics from a collection of text documents. It defines each topic by a probability distribution over words and describes each document with a probability distribution over topics. In applications, there are often many requirements, such as problem-specific knowledge and additional data, that must be taken into account. Therefore, it is natural to treat topic modelling as a multiobjective optimization problem. Historically, however, Bayesian learning became the most popular approach to topic modelling. In the Bayesian paradigm, all requirements are formalized in terms of a probabilistic generative process. This approach is not always convenient due to some limitations and technical difficulties. In this work, we develop a non-Bayesian multiobjective approach called the Additive Regularization of Topic Models (ARTM). It is based on regularized Maximum Likelihood Estimation (MLE), and we show that many of the well-known Bayesian topic models can be re-formulated in a much simpler way from the regularization point of view. We review some of the most important types of topic models: multimodal, multilingual, temporal, hierarchical, graph-based, and short-text. The ARTM framework makes it easy to combine different types of models to create new models with the properties an application requires. This modular “lego-style” technology for topic modelling is implemented in the open-source library BigARTM.
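
The idea of regularized MLE described in this abstract can be illustrated with a small sketch. The code below is not the BigARTM API and is not taken from the paper's text; it is a minimal pure-Python EM loop for a PLSA-style model in which a single smoothing/sparsing regularizer is added in the M-step. The function names, the parameters `beta` and `alpha`, and the uniform fallback for all-zero rows are assumptions made for illustration. A positive coefficient smooths the distributions, while a negative one pushes small probabilities to zero.

```python
import random


def normalize_nonneg(values):
    """Clip negatives to zero, then scale so the values sum to 1.

    Falls back to a uniform distribution if everything is zero
    (an illustrative choice, not prescribed by the source).
    """
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in clipped]


def fit_regularized_plsa(counts, num_topics, beta=0.0, alpha=0.0,
                         iters=50, seed=0):
    """Fit a toy topic model by EM with an additive regularizer.

    counts[d][w] is the number of times word w occurs in document d.
    Returns (phi, theta) where phi[t][w] ~ p(w | t) and
    theta[d][t] ~ p(t | d). beta and alpha are hypothetical
    regularization coefficients for phi and theta respectively.
    """
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])
    phi = [normalize_nonneg([rng.random() for _ in range(num_words)])
           for _ in range(num_topics)]
    theta = [normalize_nonneg([rng.random() for _ in range(num_topics)])
             for _ in range(num_docs)]

    for _ in range(iters):
        n_wt = [[0.0] * num_words for _ in range(num_topics)]
        n_td = [[0.0] * num_topics for _ in range(num_docs)]
        for d in range(num_docs):
            for w in range(num_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: p(t | d, w) is proportional to phi[t][w] * theta[d][t]
                joint = [phi[t][w] * theta[d][t] for t in range(num_topics)]
                z = sum(joint)
                if z == 0.0:
                    continue
                for t in range(num_topics):
                    share = n_dw * joint[t] / z
                    n_wt[t][w] += share
                    n_td[d][t] += share
        # Regularized M-step: add the regularizer term to the expected
        # counts, clip at zero, and renormalize each distribution.
        phi = [normalize_nonneg([n + beta for n in n_wt[t]])
               for t in range(num_topics)]
        theta = [normalize_nonneg([n + alpha for n in n_td[d]])
                 for d in range(num_docs)]
    return phi, theta


if __name__ == "__main__":
    toy_counts = [[5, 3, 0, 0], [4, 4, 0, 0], [0, 0, 6, 2], [0, 0, 3, 5]]
    phi, theta = fit_regularized_plsa(toy_counts, num_topics=2,
                                      beta=0.1, alpha=0.1)
    for t, row in enumerate(phi):
        print("topic", t, [round(p, 3) for p in row])
```

With `beta = alpha = 0` the loop reduces to plain maximum-likelihood PLSA. In this sketch, further requirements would enter as additional terms summed into the same M-step update, which is the sense in which the approach is additive.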

    Supervised topic models with word order structure for document classification and retrieval learning

    One limitation of most existing probabilistic latent topic models for document classification is that the topic model itself does not consider useful side-information, namely, the class labels of documents. Supervised topic models, which do take this side-information into account, do not consider the word order structure in documents. One of the motivations for considering the word order structure is to capture the semantic fabric of the document. We investigate a low-dimensional latent topic model for document classification. Class label information and word order structure are integrated into a supervised topic model, enabling a more effective interaction between the two for solving document classification. We derive a collapsed Gibbs sampler for our model. Likewise, supervised topic models with word order structure have not been explored in document retrieval learning. We propose a novel supervised topic model for document retrieval learning, which can be regarded as a pointwise model for tackling the learning-to-rank task. Available relevance assessments and word order structure are integrated into the topic model itself. We conduct extensive experiments on several publicly available benchmark datasets and show that our model improves upon the state-of-the-art models.