11,541 research outputs found
Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability
Understanding the shopping motivations behind market baskets has high
commercial value in the grocery retail industry. Analyzing shopping
transactions demands techniques that can cope with the volume and
dimensionality of grocery transactional data while keeping interpretable
outcomes. Latent Dirichlet Allocation (LDA) provides a suitable framework to
process grocery transactions and to discover a broad representation of
customers' shopping motivations. However, summarizing the posterior
distribution of an LDA model is challenging, while individual LDA draws may not
be coherent and cannot capture topic uncertainty. Moreover, the evaluation of
LDA models is dominated by model-fit measures which may not adequately capture
the qualitative aspects such as interpretability and stability of topics.
In this paper, we introduce clustering methodology that post-processes
posterior LDA draws to summarise the entire posterior distribution and identify
semantic modes represented as recurrent topics. Our approach is an alternative
to standard label-switching techniques and provides a single posterior summary
set of topics, as well as associated measures of uncertainty. Furthermore, we
establish a more holistic definition for model evaluation, which assesses topic
models based not only on their likelihood but also on their coherence,
distinctiveness and stability. By means of a survey, we set thresholds for the
interpretation of topic coherence and topic similarity in the domain of grocery
retail data. We demonstrate that the selection of recurrent topics through our
clustering methodology not only improves model likelihood but also outperforms
the qualitative aspects of LDA such as interpretability and stability. We
illustrate our methods on an example from a large UK supermarket chain.Comment: 20 pages, 9 figure
DivGraphPointer: A Graph Pointer Network for Extracting Diverse Keyphrases
Keyphrase extraction from documents is useful to a variety of applications
such as information retrieval and document summarization. This paper presents
an end-to-end method called DivGraphPointer for extracting a set of diversified
keyphrases from a document. DivGraphPointer combines the advantages of
traditional graph-based ranking methods and recent neural network-based
approaches. Specifically, given a document, a word graph is constructed from
the document based on word proximity and is encoded with graph convolutional
networks, which effectively capture document-level word salience by modeling
long-range dependency between words in the document and aggregating multiple
appearances of identical words into one node. Furthermore, we propose a
diversified point network to generate a set of diverse keyphrases out of the
word graph in the decoding process. Experimental results on five benchmark data
sets show that our proposed method significantly outperforms the existing
state-of-the-art approaches.Comment: Accepted to SIGIR 201
Search strategies of Wikipedia readers
The quest for information is one of the most common activity of human beings. Despite the the impressive progress of search engines, not to miss the needed piece of information could be still very tough, as well as to acquire specific competences and knowledge by shaping and following the proper learning paths. Indeed, the need to find sensible paths in information networks is one of the biggest challenges of our societies and, to effectively address it, it is important to investigate the strategies adopted by human users to cope with the cognitive bottleneck of finding their way in a growing sea of information. Here we focus on the case of Wikipedia and investigate a recently released dataset about users’ click on the English Wikipedia, namely the English Wikipedia Clickstream. We perform a semantically charged analysis to uncover the general patterns followed by information seekers in the multi-dimensional space of Wikipedia topics/categories. We discover the existence of well defined strategies in which users tend to start from very general, i.e., semantically broad, pages and progressively narrow down the scope of their navigation, while keeping a growing semantic coherence. This is unlike strategies associated to tasks with predefined search goals, namely the case of the Wikispeedia game. In this case users first move from the ‘particular’ to the ‘universal’ before focusing down again to the required target. The clear picture offered here represents a very important stepping stone towards a better design of information networks and recommendation strategies, as well as the construction of radically new learning paths
Quantifying Priorities in Business Cycle Reports: Analysis of Recurring Textual Patterns around Peaks and Troughs
I propose a novel approach to uncover business cycle reports’ priorities and relate them to economic fluctuations. To this end, I leverage quantitative business-cycle forecasts published by leading German economic research institutes since 1970 to estimate the proportions of latent topics in associated business cycle reports. I then employ a supervised approach to aggregate topics with similar themes, thus revealing the proportions of broader macroeconomic subjects. I obtain measures of forecasters’ subject priorities by extracting the subject proportions’ cyclic components. Correlating these priorities with key macroeconomic variables reveals consistent priority patterns throughout economic peaks and troughs. The forecasters prioritize inflation-related matters over recession-related considerations around peaks. This finding suggests that forecasters underestimate growth and overestimate
inflation risks during contractive monetary policies, which might explain their failure to predict recessions. Around troughs, forecasters prioritize investment matters, potentially suggesting a better understanding of macroeconomic developments during those periods compared to peaks
- …