Topic-based mixture language modelling
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling.
A comparison is made between manual and automatic clustering in order to elucidate how global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of dimension reduction by singular value decomposition (latent semantic analysis). Test-set perplexity results on the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with only a slight increase in computational cost.
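A minimal sketch of the kind of pipeline this abstract describes, assuming TF-IDF term weighting, truncated SVD (latent semantic analysis) and k-means for the unsupervised clustering, followed by per-cluster unigram language models interpolated with a global background; the toy corpus, cluster count and interpolation weight are illustrative placeholders rather than the authors' setup.

```python
# Sketch: cluster documents in an LSA space, then build a mixture of
# topic-specific unigram language models interpolated with a background model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["stocks fell sharply today", "the team won the match",
        "markets rallied on earnings", "coach praised the players"]

# Cluster documents in the reduced (LSA) space.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# One add-one-smoothed unigram model per cluster plus a global background model.
counts = CountVectorizer(vocabulary=tfidf.vocabulary_).fit_transform(docs).toarray()
V = counts.shape[1]
global_p = (counts.sum(axis=0) + 1) / (counts.sum() + V)
topic_p = np.vstack([
    (counts[labels == k].sum(axis=0) + 1) / (counts[labels == k].sum() + V)
    for k in range(2)
])

def mixture_logprob(doc_counts, priors=np.array([0.5, 0.5]), lam=0.7):
    """Log-probability of a document under the topic mixture, with each
    topic model interpolated against the global background."""
    smoothed = lam * topic_p + (1 - lam) * global_p   # (K, V)
    word_p = priors @ smoothed                        # (V,)
    return float(doc_counts @ np.log(word_p))

print(mixture_logprob(counts[0]))
```

Test-set perplexity would then follow by exponentiating the negative average per-token log-probability over held-out text.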
MOOCs Meet Measurement Theory: A Topic-Modelling Approach
This paper adapts topic models to the psychometric testing of MOOC students
based on their online forum postings. Measurement theory from education and
psychology provides statistical models for quantifying a person's attainment of
intangible attributes such as attitudes, abilities or intelligence. Such models
infer latent skill levels by relating them to individuals' observed responses
on a series of items such as quiz questions. The set of items can be used to
measure a latent skill if individuals' responses on them conform to a Guttman
scale. Such well-scaled items differentiate between individuals, and the inferred
levels span the entire range from the most basic to the most advanced. In practice,
education researchers manually devise items (quiz questions) while optimising for
well-scaled conformance. Due to the costly nature and expert requirements of
this process, psychometric testing has found limited use in everyday teaching.
We aim to develop usable measurement models for highly-instrumented MOOC
delivery platforms, by using participation in automatically-extracted online
forum topics as items. The challenge is to formalise the Guttman scale
educational constraint and incorporate it into topic models. To favour topics
that automatically conform to a Guttman scale, we introduce a novel
regularisation into non-negative matrix factorisation-based topic modelling. We
demonstrate the suitability of our approach with both quantitative experiments
on three Coursera MOOCs, and with a qualitative survey of topic
interpretability on two MOOCs through domain-expert interviews. Comment: 12 pages, 9 figures; accepted into AAAI'201
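The paper's regulariser for non-negative matrix factorisation is specific to that work and is not reproduced here; the sketch below only illustrates the Guttman-scale property it targets, using a simple count of scale violations on a binary person-by-item matrix. The data matrix is an illustrative placeholder, not MOOC data.

```python
# Sketch of the Guttman-scale constraint: for a well-scaled item set, each
# person's responses should form a prefix of passes over items ordered from
# easiest to hardest; violations of that pattern are counted below.
import numpy as np

X = np.array([[1, 1, 1, 0],   # strong student: passes easy and harder items
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]])  # off-scale response pattern

def guttman_violations(X):
    """Count (person, item-pair) cases where a harder (less popular) item is
    passed while an easier (more popular) item is failed."""
    order = np.argsort(-X.sum(axis=0))     # items from easiest to hardest
    S = X[:, order]
    errors = 0
    for row in S:
        for i in range(len(row)):
            for j in range(i + 1, len(row)):
                if row[i] == 0 and row[j] == 1:
                    errors += 1
    return errors

print("Guttman violations:", guttman_violations(X))
```

A regulariser that penalises such violations during factorisation would favour topics whose participation patterns behave like well-scaled items, which is the intuition the abstract describes.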
Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability
Understanding the shopping motivations behind market baskets has high
commercial value in the grocery retail industry. Analyzing shopping
transactions demands techniques that can cope with the volume and
dimensionality of grocery transactional data while keeping interpretable
outcomes. Latent Dirichlet Allocation (LDA) provides a suitable framework to
process grocery transactions and to discover a broad representation of
customers' shopping motivations. However, summarizing the posterior
distribution of an LDA model is challenging; individual LDA draws may not
be coherent and cannot capture topic uncertainty. Moreover, the evaluation of
LDA models is dominated by model-fit measures, which may not adequately capture
qualitative aspects such as the interpretability and stability of topics.
In this paper, we introduce clustering methodology that post-processes
posterior LDA draws to summarise the entire posterior distribution and identify
semantic modes represented as recurrent topics. Our approach is an alternative
to standard label-switching techniques and provides a single posterior summary
set of topics, as well as associated measures of uncertainty. Furthermore, we
establish a more holistic definition for model evaluation, which assesses topic
models based not only on their likelihood but also on their coherence,
distinctiveness and stability. By means of a survey, we set thresholds for the
interpretation of topic coherence and topic similarity in the domain of grocery
retail data. We demonstrate that the selection of recurrent topics through our
clustering methodology not only improves model likelihood but also improves on
standard LDA in qualitative aspects such as interpretability and stability. We
illustrate our methods on an example from a large UK supermarket chain. Comment: 20 pages, 9 figures
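A minimal sketch of the idea of summarising many LDA draws by clustering their topic-word distributions and keeping clusters that recur across draws. It stands in for the paper's methodology rather than reproducing it: independently seeded fits are used in place of posterior draws, and the toy baskets, draw count and distance threshold are illustrative placeholders.

```python
# Sketch: cluster topics from several LDA "draws" and report recurrent topics.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

baskets = ["bread milk butter", "beer crisps", "milk butter eggs",
           "beer wine crisps", "bread eggs milk", "wine beer"]
X = CountVectorizer().fit_transform(baskets)

# Independently seeded fits stand in for posterior draws here.
draws, topics = 5, 2
topic_vecs, draw_id = [], []
for seed in range(draws):
    lda = LatentDirichletAllocation(n_components=topics, random_state=seed).fit(X)
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    topic_vecs.append(phi)
    draw_id.extend([seed] * topics)
topic_vecs = np.vstack(topic_vecs)

# Cluster topics across draws by cosine distance; call a cluster "recurrent"
# if it contains topics from more than half of the draws.
labels = fcluster(linkage(pdist(topic_vecs, metric="cosine"), method="average"),
                  t=0.3, criterion="distance")
for c in np.unique(labels):
    members = labels == c
    if len(set(np.array(draw_id)[members])) > draws / 2:
        print("recurrent topic:", topic_vecs[members].mean(axis=0).round(2))
```

Averaging the topic-word vectors within a recurrent cluster gives a single posterior summary topic, and the spread within the cluster gives an associated measure of uncertainty.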
Topic based language models for ad hoc information retrieval
We propose a topic-based approach to language
modelling for ad-hoc Information Retrieval (IR). Many smoothed estimators used for the multinomial query model in IR rely upon estimated background collection probabilities. In this paper, we propose a topic-based language modelling approach that uses a more informative prior based on the topical content of a document. In our experiments, the proposed model provides IR performance comparable to the standard models, but when combined in a two-stage language model, it outperforms all other estimated models.
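One plausible reading of this abstract, sketched below: Dirichlet-smoothed query likelihood in which the background distribution is a document-specific topic mixture rather than the collection model. The toy corpus, smoothing parameter mu and topic count are illustrative placeholders, not the authors' configuration.

```python
# Sketch: query likelihood with a topic-based background vs. the collection background.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cat sat on the mat", "dog chased the cat",
        "stocks rose on earnings", "the market fell after earnings"]
vec = CountVectorizer()
C = vec.fit_transform(docs).toarray()               # (docs, vocab) term counts
p_coll = C.sum(axis=0) / C.sum()                    # collection background p(w|C)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(C)
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)   # p(w|z)
theta = lda.transform(C)                                             # p(z|d)
p_topic = theta @ phi                               # topic-based background per doc

def query_loglik(query, d, background, mu=100.0):
    """Dirichlet-smoothed log P(query | document d) under a given background."""
    q = vec.transform([query]).toarray()[0]
    p = (C[d] + mu * background) / (C[d].sum() + mu)
    return float(q @ np.log(p))

d = 2
print("collection prior :", query_loglik("earnings market", d, p_coll))
print("topic-based prior:", query_loglik("earnings market", d, p_topic[d]))
```

The topic-based background assigns more mass to words that are thematically related to the document, which is the sense in which the prior is "more informative" than the flat collection model.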
Exploring Time-Sensitive Variational Bayesian Inference LDA for Social Media Data
There is considerable interest among both researchers and the mass public in understanding the topics of discussion on social media as they occur over time. Scholars have thoroughly analysed sampling-based topic modelling approaches for various text corpora including social media; however, another LDA topic modelling implementation, Variational Bayesian (VB) inference, has not been well studied, despite its known efficiency and its adaptability to the volume and dynamics of social media data. In this paper, we examine the performance of the VB-based topic modelling approach for producing coherent topics, and further, we extend the VB approach by proposing a novel time-sensitive Variational Bayesian implementation, denoted TVB. Our newly proposed TVB approach incorporates time so as to increase the quality of the generated topics. Using a Twitter dataset covering 8 events, our empirical results show that the coherence of the topics in our TVB model is improved by the integration of time. In particular, through a user study, we find that our TVB approach generates fewer mixed topics than state-of-the-art topic modelling approaches. Moreover, our proposed TVB approach can more accurately estimate topical trends, making it particularly suitable for assisting end-users in tracking emerging topics on social media.
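The TVB update itself is specific to the paper and is not reproduced here. The sketch below only shows the baseline it extends: variational Bayes LDA (scikit-learn's LatentDirichletAllocation implements VB) fitted to timestamped posts, with topical trends read off per time bin. The toy tweets, timestamps and bin size are illustrative placeholders.

```python
# Sketch: VB LDA on timestamped posts, with per-bin topic prevalence as a trend.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["storm hits the coast", "flooding after the storm",
          "election debate tonight", "polls open for the election",
          "storm damage reported", "election results due"]
days = np.array([0, 0, 1, 1, 2, 2])   # posting day of each tweet

X = CountVectorizer().fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0).fit(X)
theta = lda.transform(X)               # per-tweet topic mixture

# Topical trend: average topic prevalence within each time bin.
for day in np.unique(days):
    print(f"day {day}: topic prevalence {theta[days == day].mean(axis=0).round(2)}")
```

A time-sensitive variant would let the time information influence the variational updates themselves rather than only the post-hoc trend computation, which is the direction the abstract describes.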
Examining Information on Social Media: Topic Modelling, Trend Prediction and Community Classification
In the past decade, the use of social media networks (e.g. Twitter) has increased dramatically, and they have become the main channels for the public to express their opinions, ideas and preferences, especially during an election or a referendum. Both researchers and the public are interested in understanding what topics are discussed during a real social event, what the trends of the discussed topics are, and how those trends will develop in the future. Indeed, modelling such topics and trends offers opportunities for social scientists to continue a long-standing line of research, namely examining the information exchange between people in different communities. We argue that computing science approaches can adequately assist social scientists in extracting topics from social media data, predicting their topical trends, and classifying a social media user (e.g. a Twitter user) into a community. However, while topic modelling approaches and classification techniques have been widely used, challenges still exist: 1) existing topic modelling approaches can generate topics that lack coherence for social media data; 2) the coherence of topics is not easy to evaluate; and 3) generating a large training dataset for a social media user classifier can be challenging. Hence, we identify four tasks to solve these problems and assist social scientists. First, we aim to propose topic coherence metrics that effectively evaluate the coherence of topics generated by topic modelling approaches and that align with human judgements. Since topic modelling approaches cannot always generate useful topics, such metrics are needed to present users with the most coherent topics; an effective coherence metric also helps us evaluate the performance of our proposed topic modelling approaches. Second, we aim to propose a topic modelling approach that generates more coherent topics for social media data. We argue that using the time dimension of social media posts helps a topic modelling approach distinguish differences in word usage over time, and thus allows it to generate topics with higher coherence together with their trends. A more coherent topic and its trend allow social scientists to quickly identify the topic's subject and to focus on analysing the connections between the extracted topics and social events, e.g. an election. Third, we aim to model and predict topical trends. Given the timestamps of the social media posts within a topic, its trend can be modelled as a continuous distribution over time; we therefore argue that the future trends of topics can be predicted by estimating the density function of this continuous time distribution. By examining future topical trends, social scientists can ensure the timeliness of the events they study, and politicians and policymakers can keep abreast of the topics that remain salient over time. Finally, we aim to offer a general method for quickly obtaining a large training dataset for constructing a social media user classifier. A social media post may contain hashtags and entities, and these hashtags (e.g. "#YesScot" in the Scottish Independence Referendum) and entities (e.g. job titles or party names) can reflect the community affiliation of a social media user. We argue that a large and reliable training dataset can be obtained by distinguishing the usage of these hashtags and entities. Using the obtained training dataset, a social media user community classifier can be quickly trained and then used to assist in examining the different topics discussed in communities. In conclusion, we have identified four aspects in which social scientists can be assisted to better understand the topics discussed on social media networks. We believe that the proposed tools and approaches can help to examine the exchange of topics among communities on social media networks.
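A minimal sketch of the trend-prediction idea in this abstract: a topic's trend modelled as a continuous distribution over post timestamps, with a short-range forecast read off the estimated density. A Gaussian kernel density estimate stands in here; the thesis's actual estimator may differ, and the timestamps are illustrative placeholders (hours since the event started).

```python
# Sketch: model a topic's trend as a density over post timestamps and
# evaluate it just beyond the last observation as a simple trend forecast.
import numpy as np
from scipy.stats import gaussian_kde

topic_post_times = np.array([1.0, 1.5, 2.0, 2.2, 3.0, 3.1, 3.4, 3.8, 4.0, 4.1])

kde = gaussian_kde(topic_post_times)
grid = np.linspace(0, 6, 61)

# Density over past hours describes the observed trend; the tail beyond the
# last observed post gives an extrapolated estimate of the near-future trend.
future = grid[grid > topic_post_times.max()]
print("estimated future trend:", kde(future).round(3)[:5])
```

In practice the density would be estimated per topic and compared across topics to decide which ones are likely to remain salient.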
Topic Modelling of Everyday Sexism Project Entries
The Everyday Sexism Project documents everyday examples of sexism reported by
volunteer contributors from all around the world. It collected 100,000 entries
in 13+ languages within the first 3 years of its existence. The content of
reports in various languages submitted to Everyday Sexism is a valuable source
of crowdsourced information with great potential for feminist and gender
studies. In this paper, we take a computational approach to analyze the content
of reports. We use topic-modelling techniques to extract emerging topics and
concepts from the reports, and to map the semantic relations between those
topics. The resulting picture closely resembles and adds to that arrived at
through qualitative analysis, showing that this form of topic modeling could be
useful for sifting through datasets that had not previously been subject to any
analysis. More precisely, we come up with a map of topics for two different
resolutions of our topic model and discuss the connection between the
identified topics. In the low resolution picture, for instance, we found Public
space/Street, Online, Work related/Office, Transport, School, Media harassment,
and Domestic abuse. Among these, the strongest connection is between Public
space/Street harassment and Domestic abuse and sexism in personal
relationships. The strength of the relationships between topics illustrates the
fluid and ubiquitous nature of sexism, with no single experience being
unrelated to another. Comment: preprint, under review
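A minimal sketch of the two-resolution topic map idea, assuming NMF at a coarse and a finer topic count and cosine similarity between topic-word vectors as the connection strength between topics; the toy reports and topic counts are placeholders, and this is not the project's data or the authors' pipeline.

```python
# Sketch: topics at two resolutions, linked by similarity of their word profiles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

reports = ["catcalled on the street walking home",
           "colleague commented on my clothes at work",
           "harassed on the bus last night",
           "boss ignored my idea then praised a man's",
           "whistled at near the station",
           "online abuse after posting a photo"]
X = TfidfVectorizer().fit_transform(reports)

coarse = NMF(n_components=2, random_state=0).fit(X)   # low-resolution topics
fine = NMF(n_components=3, random_state=0).fit(X)     # higher-resolution topics

# Connection strength between low- and high-resolution topics.
links = cosine_similarity(coarse.components_, fine.components_).round(2)
print(links)
```

Thresholding these similarities yields a map in which each coarse topic is connected to the finer topics it subsumes, which is the kind of structure the abstract discusses.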
A Semantic Graph-Based Approach for Mining Common Topics From Multiple Asynchronous Text Streams
In the age of Web 2.0, a substantial amount of unstructured
content is distributed through multiple text streams in an
asynchronous fashion, which makes it increasingly difficult
to glean and distill useful information. An effective way to
explore the information in text streams is topic modelling,
which can further facilitate other applications such as search,
information browsing, and pattern mining. In this paper, we
propose a semantic graph-based topic modelling approach
for structuring asynchronous text streams. Our approach
integrates topic mining and time synchronization, the two
core modules for addressing the problem, into a unified
model. Specifically, to handle lexical gap issues, we use a
global semantic graph at each timestamp to capture the
hidden interactions among entities from all the text streams.
To deal with the asynchronism among sources, local semantic
graphs are employed to discover similar topics of different
entities that may be separated by time gaps. Our experiments
on two real-world datasets show that the proposed model
significantly outperforms existing ones.
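A minimal sketch of the "global semantic graph" ingredient only: for each timestamp, entities mentioned across all streams are linked when they co-occur in the same post, pooling the streams so that interactions are visible even when individual streams use different vocabulary. The streams, entities and timestamps are illustrative placeholders, and the paper's full model (local graphs, time synchronization, topic mining) is not reproduced.

```python
# Sketch: build one entity co-occurrence graph per timestamp across all streams.
import itertools
import networkx as nx

streams = {
    "news":  [(1, {"apple", "iphone"}), (2, {"apple", "lawsuit"})],
    "blogs": [(1, {"iphone", "battery"}), (2, {"lawsuit", "patent"})],
}

global_graphs = {}
for stream_posts in streams.values():
    for t, entities in stream_posts:
        g = global_graphs.setdefault(t, nx.Graph())
        for u, v in itertools.combinations(sorted(entities), 2):
            w = g[u][v]["weight"] + 1 if g.has_edge(u, v) else 1
            g.add_edge(u, v, weight=w)

for t, g in sorted(global_graphs.items()):
    print(t, sorted(g.edges(data="weight")))
```

Topics could then be mined from these graphs (for example, as densely connected entity groups), while local per-stream graphs would be compared across time gaps to align asynchronous sources.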
