The Design of an Interactive Topic Modeling Application for Media Content
Topic modeling is widely used by data scientists to analyze the growing volume of text documents. Techniques such as LDA and NMF assign each document a distribution over topics; they are related to unsupervised soft clustering but take text semantics into account. More recently, Interactive Topic Modeling (ITM) has been introduced to incorporate human expertise into the modeling process. This enables real-time hyperparameter optimization and topic manipulation at the document and keyword level. However, current ITM applications are mostly accessible only to experienced data scientists, who often lack domain knowledge. Domain experts, on the other hand, usually lack the data science expertise to build and use ITM applications.
This thesis presents an Interactive Topic Modeling application accessible to non-technical data analysts in the broadcasting domain. The application allows domain experts, such as journalists, to explore themes across their produced media content in a dynamic, intuitive and efficient manner. An interactive interface with an embedded NMF topic model enables users to filter on various data sources, configure and refine the topic model, interpret and evaluate the output through visualizations, and analyze the data in a wider context. The application was designed in collaboration with domain experts in focus group sessions, following human-centered design principles.
An evaluation study with ten participants shows that journalists and data analysts without any natural language processing knowledge find the application not only usable, but also very user-friendly, effective and efficient. The application received a SUS score of 81, and the user experience and perceived-control questionnaires each averaged 4.1 on a five-point Likert scale. The ITM application thus enables this specific user group to extract meaningful topics from their produced media content and to use these results in a broader perspective for exploratory data analysis.
The success of the final application design presented in this thesis shows that the knowledge gap between data scientists and domain experts in the broadcasting field has been bridged. From a broader perspective, machine learning applications can be made more accessible by translating the hidden low-level details of complex models into high-level model interactions presented in a user interface.
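As an illustration of the idea in the abstract above (assigning each document a distribution over topics with NMF), here is a minimal sketch. It is not the thesis's implementation: the toy vocabulary, the matrix `V`, and the helper names `nmf_topics` and `topic_distribution` are hypothetical, and the factorization uses the standard Lee-Seung multiplicative updates for the Frobenius objective rather than any particular library.

```python
import numpy as np

def nmf_topics(V, n_topics, n_iter=500, seed=0, eps=1e-9):
    """Factor a nonnegative document-term matrix V (docs x terms) as W @ H.

    Uses Lee-Seung multiplicative updates for the squared Frobenius error.
    W holds document-topic weights, H holds topic-term weights.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics)) + eps
    H = rng.random((n_topics, n_terms)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def topic_distribution(W, eps=1e-9):
    """Normalize each row of W so it reads as a soft topic assignment."""
    return W / (W.sum(axis=1, keepdims=True) + eps)

# Hypothetical toy corpus: rows are documents, columns are term counts.
vocab = ["election", "vote", "goal", "match"]
V = np.array([
    [3, 2, 0, 0],   # politics-heavy document
    [2, 3, 0, 0],   # politics-heavy document
    [0, 0, 4, 2],   # sports-heavy document
    [0, 0, 2, 3],   # sports-heavy document
    [1, 1, 1, 1],   # mixed document
], dtype=float)

W, H = nmf_topics(V, n_topics=2)
doc_topics = topic_distribution(W)

for row in doc_topics:
    print(np.round(row, 2))
```

Each printed row is a soft assignment: rather than receiving a single cluster label, a mixed document spreads its weight over both topics. In an interactive setting, parameters such as `n_topics` are the kind of setting a user interface could expose for refinement.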
Iterative Seed Word Generation for Interactive Topic Modelling: a Mixed Text Processing and Qualitative Content Analysis Approach
Topic models have great potential for helping researchers and practitioners understand electronic word of mouth (eWoM). This potential is thwarted by their purely unsupervised nature, which often leads to topics that are not entirely explainable. We develop a novel method that iteratively generates seed words to guide interactive topic models. We assess the validity and applicability of the proposed method by investigating the critical phenomenon of post-adoption of Contact Tracing Mobile Applications (CTMAs) during the COVID-19 pandemic. The results show that constructs developed through our interactive topic modeling can capture primary research variables related to the phenomenon. Compared to existing topic modeling methods, our approach shows superior performance in explaining users’ satisfaction with CTMAs.
Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations
We investigate the pertinence of methods from algebraic topology for text
data analysis. These methods enable the development of
mathematically-principled isometric-invariant mappings from a set of vectors to
a document embedding, which is stable with respect to the geometry of the
document in the selected metric space. In this work, we evaluate the utility of
these topology-based document representations in traditional NLP tasks,
specifically document clustering and sentiment classification. We find that the
embeddings do not benefit text analysis. In fact, performance is worse than
that of simple techniques, indicating that the geometry of the
document does not provide enough variability for classification on the basis of
topic or sentiment in the chosen datasets.
Comment: 5 pages, 3 figures. Rep4NLP workshop at ACL 201
Fast Parallel Randomized Algorithm for Nonnegative Matrix Factorization with KL Divergence for Large Sparse Datasets
Nonnegative Matrix Factorization (NMF) with Kullback-Leibler Divergence
(NMF-KL) is one of the most significant NMF problems and equivalent to
Probabilistic Latent Semantic Indexing (PLSI), which has been successfully
applied in many applications. For sparse count data, a Poisson distribution and
KL divergence provide sparse models and sparse representation, which describe
the random variation better than a normal distribution and Frobenius norm.
Specifically, sparse models provide a more concise understanding of how
attributes appear across latent components, while sparse representation provides
concise interpretability of the contribution of latent components to
instances. However, minimizing NMF with KL divergence is much more difficult
than minimizing NMF with the Frobenius norm, and sparse models, sparse
representation, and fast algorithms for large sparse datasets remain
challenges for NMF with KL divergence. In this paper, we propose a fast
parallel randomized coordinate descent algorithm with fast convergence on
large sparse datasets to achieve sparse models and sparse representation. In
experiments, the proposed algorithm outperforms existing methods on this
problem.
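For reference, the objective discussed in the abstract above can be written out as follows. The abstract itself gives no formula; this is the standard generalized KL form of NMF, with the symbols $V$ (the nonnegative data matrix), $W$ and $H$ (the nonnegative factors) chosen here as assumptions for illustration. The second line shows why the KL objective matches a Poisson noise model for count data, and the third gives the Frobenius objective it is contrasted with.

```latex
% NMF with generalized KL divergence (standard form; notation assumed)
\min_{W \ge 0,\; H \ge 0} \; D_{\mathrm{KL}}(V \,\|\, WH)
  = \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)

% Poisson model V_{ij} \sim \mathrm{Poisson}\big((WH)_{ij}\big):
% the negative log-likelihood differs from D_KL only by terms constant in W and H
-\log p(V \mid W, H)
  = \sum_{i,j} \left( (WH)_{ij} - V_{ij} \log (WH)_{ij} + \log V_{ij}! \right)
  = D_{\mathrm{KL}}(V \,\|\, WH) + \mathrm{const}(V)

% Frobenius objective, which corresponds to Gaussian noise
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2} \, \| V - WH \|_F^2
```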
Labeled Interactive Topic Models
Topic models are valuable for understanding extensive document collections,
but they don't always identify the most relevant topics. Classical
probabilistic and anchor-based topic models offer interactive versions that
allow users to guide the models towards more pertinent topics. However, such
interactive features have been lacking in neural topic models. To correct this
lacuna, we introduce a user-friendly interaction for neural topic models. This
interaction permits users to assign a word label to a topic, leading to an
update in the topic model where the words in the topic become closely aligned
with the given label. Our approach encompasses two distinct kinds of neural
topic models. The first includes models where topic embeddings are trainable
and evolve during the training process. The second kind involves models where
topic embeddings are integrated post-training, offering a different approach to
topic refinement. To facilitate user interaction with these neural topic
models, we have developed an interactive interface. This interface enables
users to engage with and re-label topics as desired. We evaluate our method
through a human study, where users can relabel topics to find relevant
documents. Using our method, user labeling improves document rank scores,
helping to find more documents relevant to a given query than without
user labeling.