
    Topic-based mixture language modelling

    This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics, using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling. A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (latent semantic analysis). Test-set perplexity results using the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with a slight increase in computational cost.
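    The pipeline described above can be sketched end to end with standard tools: term weighting, SVD dimension reduction (latent semantic analysis), unsupervised clustering of the document space, and a smoothed unigram language model per cluster combined as a mixture. The sketch below is an illustrative approximation rather than the paper's exact system; the toy corpus, cluster count, smoothing and mixture weights are all placeholder assumptions.

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.cluster import KMeans

        docs = ["the cat sat on the mat", "stocks fell sharply today",
                "the dog chased the cat", "markets rallied after the report"]

        # 1. Term weighting and LSA projection of the document space.
        X = TfidfVectorizer().fit_transform(docs)
        Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

        # 2. Unsupervised clustering in the reduced space gives topic clusters.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

        # 3. One add-one-smoothed unigram language model per cluster.
        counter = CountVectorizer()
        C = counter.fit_transform(docs).toarray()
        V = C.shape[1]
        topic_lms = [(C[labels == k].sum(axis=0) + 1) / (C[labels == k].sum() + V)
                     for k in range(2)]

        # 4. Perplexity of a held-out sentence under an equal-weight mixture.
        test = counter.transform(["the cat and the dog"]).toarray()[0]
        mix = np.mean(topic_lms, axis=0)
        perplexity = np.exp(-np.sum(test * np.log(mix)) / max(test.sum(), 1))
        print(round(float(perplexity), 2))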

    MOOCs Meet Measurement Theory: A Topic-Modelling Approach

    This paper adapts topic models to the psychometric testing of MOOC students based on their online forum postings. Measurement theory from education and psychology provides statistical models for quantifying a person's attainment of intangible attributes such as attitudes, abilities or intelligence. Such models infer latent skill levels by relating them to individuals' observed responses on a series of items such as quiz questions. The set of items can be used to measure a latent skill if individuals' responses on them conform to a Guttman scale. Such well-scaled items differentiate between individuals, and the inferred levels span the entire range from the most basic to the most advanced. In practice, education researchers manually devise items (quiz questions) while optimising for well-scaled conformance. Due to the costly nature and expert requirements of this process, psychometric testing has found limited use in everyday teaching. We aim to develop usable measurement models for highly-instrumented MOOC delivery platforms by using participation in automatically-extracted online forum topics as items. The challenge is to formalise the Guttman-scale educational constraint and incorporate it into topic models. To favour topics that automatically conform to a Guttman scale, we introduce a novel regularisation into non-negative matrix factorisation-based topic modelling. We demonstrate the suitability of our approach with both quantitative experiments on three Coursera MOOCs and a qualitative survey of topic interpretability on two MOOCs via domain expert interviews.
    Comment: 12 pages, 9 figures; accepted into AAAI'201
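    As a point of reference for the modelling step described above, the following is a minimal baseline sketch: plain non-negative matrix factorisation of forum posts into topics, with each student's per-topic participation treated as an item response. The paper's novel Guttman-scale regulariser is its own contribution and is not implemented here; the posts and topic count are made-up placeholders.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import NMF

        posts = ["how do I submit quiz three", "gradient descent diverges for me",
                 "peer review deadline question", "maybe the learning rate is too high"]

        X = TfidfVectorizer(stop_words="english").fit_transform(posts)
        nmf = NMF(n_components=2, init="nndsvd", random_state=0)
        W = nmf.fit_transform(X)   # posts x topics: per-post topic participation
        H = nmf.components_        # topics x terms: the extracted forum topics

        # The paper additionally regularises the factorisation so that
        # participation across topics behaves like items on a Guttman scale;
        # that constraint is omitted in this baseline.
        print(W.round(2))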

    Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability

    Understanding the shopping motivations behind market baskets has high commercial value in the grocery retail industry. Analyzing shopping transactions demands techniques that can cope with the volume and dimensionality of grocery transactional data while keeping the outcomes interpretable. Latent Dirichlet Allocation (LDA) provides a suitable framework to process grocery transactions and to discover a broad representation of customers' shopping motivations. However, summarizing the posterior distribution of an LDA model is challenging: individual LDA draws may not be coherent and cannot capture topic uncertainty. Moreover, the evaluation of LDA models is dominated by model-fit measures, which may not adequately capture qualitative aspects such as the interpretability and stability of topics. In this paper, we introduce a clustering methodology that post-processes posterior LDA draws to summarise the entire posterior distribution and identify semantic modes represented as recurrent topics. Our approach is an alternative to standard label-switching techniques and provides a single posterior summary set of topics, as well as associated measures of uncertainty. Furthermore, we establish a more holistic definition of model evaluation, which assesses topic models based not only on their likelihood but also on their coherence, distinctiveness and stability. By means of a survey, we set thresholds for the interpretation of topic coherence and topic similarity in the domain of grocery retail data. We demonstrate that the selection of recurrent topics through our clustering methodology not only improves model likelihood but also improves on standard LDA in qualitative aspects such as interpretability and stability. We illustrate our methods on an example from a large UK supermarket chain.
    Comment: 20 pages, 9 figures
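    The core post-processing idea described above can be illustrated with a hedged sketch (not the paper's exact clustering methodology or evaluation): fit several LDA runs, pool their topic-word distributions, and cluster the pooled topics so that topics recurring across runs are summarised by a single representative, with within-cluster spread as a rough uncertainty signal. The toy baskets, topic counts and clustering choices are placeholder assumptions.

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import normalize

        baskets = ["bread milk eggs", "wine cheese crackers", "milk cereal bananas",
                   "beer crisps wine", "eggs bacon bread", "cheese wine grapes"]
        X = CountVectorizer().fit_transform(baskets)

        # Pool topic-word distributions from several independent LDA runs
        # (standing in for multiple posterior draws).
        topic_vecs = []
        for seed in range(5):
            lda = LatentDirichletAllocation(n_components=2, random_state=seed).fit(X)
            topic_vecs.extend(lda.components_ / lda.components_.sum(axis=1, keepdims=True))
        topic_vecs = np.array(topic_vecs)

        # Cluster the pooled topics (cosine-style, via length normalisation);
        # each cluster's mean is one "recurrent topic" summarising the posterior.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(normalize(topic_vecs))
        recurrent = [topic_vecs[labels == k].mean(axis=0) for k in range(2)]
        print(np.round(recurrent, 3))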

    Topic based language models for ad hoc information retrieval

    We propose a topic-based approach to language modelling for ad hoc Information Retrieval (IR). Many smoothed estimators used for the multinomial query model in IR rely upon the estimated background collection probabilities. In this paper, we propose a topic-based language modelling approach that uses a more informative prior based on the topical content of a document. In our experiments, the proposed model provides comparable IR performance to the standard models, but when combined in a two-stage language model, it outperforms all other estimated models.
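    A hedged sketch of the general idea (not necessarily the paper's exact estimator): query-likelihood retrieval with Dirichlet smoothing in which the smoothing background for each document is a topic-based distribution derived from LDA, rather than only the collection model. The corpus, the smoothing parameter mu and the topic count below are illustrative placeholders.

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = ["information retrieval language models", "topic models for retrieval",
                "neural networks for vision", "image classification with deep networks"]
        vec = CountVectorizer()
        C = vec.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(C)

        phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # p(w | topic)
        theta = lda.transform(C)                                            # p(topic | doc)
        topic_bg = theta @ phi                       # topic-based prior p(w | topics of d)

        def score(query, d, mu=100.0):
            # log p(query | d) with Dirichlet smoothing on the topic-based prior.
            q = vec.transform([query]).toarray()[0]
            tf = C[d].toarray()[0]
            p = (tf + mu * topic_bg[d]) / (tf.sum() + mu)
            return float(np.sum(q * np.log(p)))

        ranking = sorted(range(len(docs)), key=lambda d: -score("topic retrieval", d))
        print(ranking)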

    Exploring Time-Sensitive Variational Bayesian Inference LDA for Social Media Data

    There is considerable interest among both researchers and the mass public in understanding the topics of discussion on social media as they occur over time. Scholars have thoroughly analysed sampling-based topic modelling approaches for various text corpora including social media; however, another LDA topic modelling implementation, Variational Bayesian (VB) inference, has not been well studied, despite its known efficiency and its adaptability to the volume and dynamics of social media data. In this paper, we examine the performance of the VB-based topic modelling approach for producing coherent topics, and further, we extend the VB approach by proposing a novel time-sensitive Variational Bayesian implementation, denoted as TVB. Our newly proposed TVB approach incorporates time so as to increase the quality of the generated topics. Using a Twitter dataset covering 8 events, our empirical results show that the coherence of the topics in our TVB model is improved by the integration of time. In particular, through a user study, we find that our TVB approach generates less-mixed topics than state-of-the-art topic modelling approaches. Moreover, our proposed TVB approach can more accurately estimate topical trends, making it particularly suitable to assist end-users in tracking emerging topics on social media.
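    For orientation, here is a minimal sketch of plain online variational Bayes LDA fitted incrementally over time slices of a stream; the paper's TVB model integrates time directly into the variational updates, which is not reproduced here. The tweets and slice boundaries are fabricated for illustration.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        slices = [
            ["storm warning issued tonight", "heavy rain expected on the coast"],  # hour 1
            ["power outage reported downtown", "storm damage across the city"],    # hour 2
        ]
        vec = CountVectorizer()
        vec.fit([tweet for s in slices for tweet in s])

        lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                        random_state=0)
        for tweets in slices:                        # process the stream slice by slice
            lda.partial_fit(vec.transform(tweets))   # online VB update per time slice

        # Top words per topic after the stream seen so far.
        terms = vec.get_feature_names_out()
        for k, comp in enumerate(lda.components_):
            print(k, [terms[i] for i in comp.argsort()[-3:][::-1]])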

    Examining Information on Social Media: Topic Modelling, Trend Prediction and Community Classification

    In the past decade, the use of social media networks (e.g. Twitter) has increased dramatically, and they have become the main channels for the mass public to express their opinions, ideas and preferences, especially during an election or a referendum. Both researchers and the public are interested in understanding what topics are discussed during a real social event, what the trends of the discussed topics are, and what the future topical trend will be. Indeed, modelling such topics as well as their trends offers opportunities for social scientists to continue a long-standing line of research, i.e. examining the information exchange between people in different communities. We argue that computing science approaches can adequately assist social scientists to extract topics from social media data, to predict their topical trends, or to classify a social media user (e.g. a Twitter user) into a community. However, while topic modelling approaches and classification techniques have been widely used, challenges still exist, such as: 1) existing topic modelling approaches can generate topics lacking coherence for social media data; 2) it is not easy to evaluate the coherence of topics; 3) it can be challenging to generate a large training dataset for developing a social media user classifier. Hence, we identify four tasks to solve these problems and assist social scientists. Initially, we aim to propose topic coherence metrics that effectively evaluate the coherence of topics generated by topic modelling approaches. Such metrics are required to align with human judgements. Since topic modelling approaches cannot always generate useful topics, it is necessary to present users with the most coherent topics using the coherence metrics. Moreover, an effective coherence metric helps us evaluate the performance of our proposed topic modelling approaches. The second task is to propose a topic modelling approach that generates more coherent topics for social media data. We argue that using the time dimension of social media posts helps a topic modelling approach to distinguish differences in word usage over time, and thus allows it to generate topics with higher coherence as well as their trends. A more coherent topic with its trend allows social scientists to quickly identify the topic subject and to focus on analysing the connections between the extracted topics and social events, e.g. an election. Third, we aim to model and predict the topical trend. Given the timestamps of social media posts within topics, a topical trend can be modelled as a continuous distribution over time. Therefore, we argue that the future trends of topics can be predicted by estimating the density function of their continuous time distribution (sketched below). By examining the future topical trend, social scientists can ensure the timeliness of the events they focus on. Politicians and policymakers can keep abreast of the topics that remain salient over time. Finally, we aim to offer a general method that can quickly obtain a large training dataset for constructing a social media user classifier. A social media post contains hashtags and entities. These hashtags (e.g. "#YesScot" in the Scottish Independence Referendum) and entities (e.g. job titles or party names) can reflect the community affiliation of a social media user. We argue that a large and reliable training dataset can be obtained by distinguishing the usage of these hashtags and entities.
Using the obtained training dataset, a social media user community classifier can be quickly trained and then used to assist in examining the different topics discussed in communities. In conclusion, we have identified four aspects for assisting social scientists to better understand the topics discussed on social media networks. We believe that the proposed tools and approaches can help to examine the exchanges of topics among communities on social media networks.
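    The trend-modelling step mentioned above (treating a topic's post timestamps as draws from a continuous distribution and estimating its density) can be sketched with a simple kernel density estimate; the thesis's actual estimator may differ, and the timestamps below are fabricated for illustration.

        import numpy as np
        from scipy.stats import gaussian_kde

        # Hours (relative to the event start) at which posts on one topic appeared.
        timestamps = np.array([0.5, 0.7, 1.2, 1.3, 1.4, 2.0, 2.1, 3.5, 3.6, 3.8])

        # Fit a density over time; evaluating it at later times gives a rough
        # estimate of the topic's future trend.
        kde = gaussian_kde(timestamps)
        future_hours = np.linspace(4.0, 6.0, 5)
        for t, d in zip(future_hours, kde(future_hours)):
            print(f"t={t:.1f}h  estimated density={d:.4f}")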

    Topic Modelling of Everyday Sexism Project Entries

    The Everyday Sexism Project documents everyday examples of sexism reported by volunteer contributors from all around the world. It collected 100,000 entries in 13+ languages within the first 3 years of its existence. The content of reports in various languages submitted to Everyday Sexism is a valuable source of crowdsourced information with great potential for feminist and gender studies. In this paper, we take a computational approach to analyze the content of reports. We use topic-modelling techniques to extract emerging topics and concepts from the reports, and to map the semantic relations between those topics. The resulting picture closely resembles and adds to that arrived at through qualitative analysis, showing that this form of topic modelling could be useful for sifting through datasets that had not previously been subject to any analysis. More precisely, we come up with a map of topics for two different resolutions of our topic model and discuss the connections between the identified topics. In the low-resolution picture, for instance, we found Public space/Street, Online, Work related/Office, Transport, School, Media harassment, and Domestic abuse. Among these, the strongest connection is between Public space/Street harassment and Domestic abuse and sexism in personal relationships. The strength of the relationships between topics illustrates the fluid and ubiquitous nature of sexism, with no single experience being unrelated to another.
    Comment: preprint, under review
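    One way to compute a topic map of the kind described above (not necessarily the authors' exact procedure) is to fit a topic model to the entries and measure the connection strength between topics as the correlation of their per-document proportions, so that topics co-occurring in the same reports are strongly connected. The entries below are invented placeholders.

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        entries = ["shouted at on the street while walking home",
                   "repeated comments from colleagues at work",
                   "harassment on the bus this morning",
                   "abusive messages sent online again"]
        X = CountVectorizer(stop_words="english").fit_transform(entries)
        theta = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(X)

        # Off-diagonal entries of the correlation matrix of document-topic
        # proportions give the edge weights of the topic map.
        connections = np.corrcoef(theta.T)
        print(connections.round(2))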

    A Semantic Graph-Based Approach for Mining Common Topics From Multiple Asynchronous Text Streams

    In the age of Web 2.0, a substantial amount of unstructured content is distributed through multiple text streams in an asynchronous fashion, which makes it increasingly difficult to glean and distill useful information. An effective way to explore the information in text streams is topic modelling, which can further facilitate other applications such as search, information browsing and pattern mining. In this paper, we propose a semantic graph based topic modelling approach for structuring asynchronous text streams. Our model integrates topic mining and time synchronization, the two core modules for addressing the problem, into a unified model. Specifically, to handle the lexical gap issue, we use a global semantic graph for each timestamp to capture the hidden interactions among entities from all the text streams. To deal with the source asynchronism problem, local semantic graphs are employed to discover similar topics of different entities that can be potentially separated by time gaps. Our experiments on two real-world datasets show that the proposed model significantly outperforms existing approaches.
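    The graph-construction idea above can be illustrated with a hedged sketch (the paper's unified topic-mining and time-synchronisation model is not reproduced): for each timestamp, build a global semantic graph whose nodes are entities drawn from all streams and whose weighted edges count co-occurrences within the same post. The stream contents are made up for illustration.

        import itertools
        import networkx as nx

        # Entities extracted from posts across several streams, keyed by timestamp.
        streams_by_timestamp = {
            "2014-06-01": [["apple", "iphone"], ["iphone", "battery"]],
            "2014-06-02": [["apple", "wwdc"], ["wwdc", "ios"]],
        }

        global_graphs = {}
        for ts, posts in streams_by_timestamp.items():
            g = nx.Graph()
            for entities in posts:
                for u, v in itertools.combinations(entities, 2):
                    weight = g.get_edge_data(u, v, default={}).get("weight", 0)
                    g.add_edge(u, v, weight=weight + 1)   # co-occurrence count
            global_graphs[ts] = g

        print(sorted(global_graphs["2014-06-01"].edges(data="weight")))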