36,633 research outputs found
Recommended from our members
New topic detection in microblogs and topic model evaluation using topical alignment
textThis thesis deals with topic model evaluation and new topic detection in microblogs. Microblogs are short and thus may not carry any contextual clues. Hence it becomes challenging to apply traditional natural language processing algorithms on such data. Graphical models have been traditionally used for topic discovery and text clustering on sets of text-based documents. Their unsupervised nature allows topic models to be trained easily on datasets meant for specific domains. However the advantage of not requiring annotated data comes with a drawback with respect to evaluation difficulties. The problem aggravates when the data comprises microblogs which are unstructured and noisy.
We demonstrate the application of three types of such models to microblogs - the Latent Dirichlet Allocation, the Author-Topic and the Author-Recipient-Topic model. We extensively evaluate these models under different settings, and our results show that the Author-Recipient-Topic model extracts the most coherent topics. We also addressed the problem of topic modeling on short text by using clustering techniques. This technique helps in boosting the performance of our models.
Topical alignment is used for large scale assessment of topical relevance by comparing topics to manually generated domain specific concepts. In this thesis we use this idea to evaluate topic models by measuring misalignments between topics. Our study on comparing topic models reveals interesting traits about Twitter messages, users and their interactions and establishes that joint modeling on author-recipient pairs and on the content of tweet leads to qualitatively better topic discovery.
This thesis gives a new direction to the well known problem of topic discovery in microblogs. Trend prediction or topic discovery for microblogs is an extensive research area. We propose the idea of using topical alignment to detect new topics by comparing topics from the current week to those of the previous week. We measure correspondence between a set of topics from the current week and a set of topics from the previous week to quantify five types of misalignments: \textit{junk, fused, missing} and \textit{repeated}. Our analysis compares three types of topic models under different settings and demonstrates how our framework can detect new topics from topical misalignments. In particular so-called \textit{junk} topics are more likely to be new topics and the \textit{missing} topics are likely to have died or die out.
To get more insights into the nature of microblogs we apply topical alignment to hashtags. Comparing topics to hashtags enables us to make interesting inferences about Twitter messages and their content. Our study revealed that although a very small proportion of Twitter messages explicitly contain hashtags, the proportion of tweets that discuss topics related to hashtags is much higher.Computer Science
Data Sets: Word Embeddings Learned from Tweets and General Data
A word embedding is a low-dimensional, dense and real- valued vector
representation of a word. Word embeddings have been used in many NLP tasks.
They are usually gener- ated from a large text corpus. The embedding of a word
cap- tures both its syntactic and semantic aspects. Tweets are short, noisy and
have unique lexical and semantic features that are different from other types
of text. Therefore, it is necessary to have word embeddings learned
specifically from tweets. In this paper, we present ten word embedding data
sets. In addition to the data sets learned from just tweet data, we also built
embedding sets from the general data and the combination of tweets with the
general data. The general data consist of news articles, Wikipedia data and
other web data. These ten embedding models were learned from about 400 million
tweets and 7 billion words from the general text. In this paper, we also
present two experiments demonstrating how to use the data sets in some NLP
tasks, such as tweet sentiment analysis and tweet topic classification tasks
A Survey on Bayesian Deep Learning
A comprehensive artificial intelligence system needs to not only perceive the
environment with different `senses' (e.g., seeing and hearing) but also infer
the world's conditional (or even causal) relations and corresponding
uncertainty. The past decade has seen major advances in many perception tasks
such as visual object recognition and speech recognition using deep learning
models. For higher-level inference, however, probabilistic graphical models
with their Bayesian nature are still more powerful and flexible. In recent
years, Bayesian deep learning has emerged as a unified probabilistic framework
to tightly integrate deep learning and Bayesian models. In this general
framework, the perception of text or images using deep learning can boost the
performance of higher-level inference and in turn, the feedback from the
inference process is able to enhance the perception of text or images. This
survey provides a comprehensive introduction to Bayesian deep learning and
reviews its recent applications on recommender systems, topic models, control,
etc. Besides, we also discuss the relationship and differences between Bayesian
deep learning and other related topics such as Bayesian treatment of neural
networks.Comment: To appear in ACM Computing Surveys (CSUR) 202
Transfer Meets Hybrid: A Synthetic Approach for Cross-Domain Collaborative Filtering with Text
Collaborative filtering (CF) is the key technique for recommender systems
(RSs). CF exploits user-item behavior interactions (e.g., clicks) only and
hence suffers from the data sparsity issue. One research thread is to integrate
auxiliary information such as product reviews and news titles, leading to
hybrid filtering methods. Another thread is to transfer knowledge from other
source domains such as improving the movie recommendation with the knowledge
from the book domain, leading to transfer learning methods. In real-world life,
no single service can satisfy a user's all information needs. Thus it motivates
us to exploit both auxiliary and source information for RSs in this paper. We
propose a novel neural model to smoothly enable Transfer Meeting Hybrid (TMH)
methods for cross-domain recommendation with unstructured text in an end-to-end
manner. TMH attentively extracts useful content from unstructured text via a
memory module and selectively transfers knowledge from a source domain via a
transfer network. On two real-world datasets, TMH shows better performance in
terms of three ranking metrics by comparing with various baselines. We conduct
thorough analyses to understand how the text content and transferred knowledge
help the proposed model.Comment: 11 pages, 7 figures, a full version for the WWW 2019 short pape
LRMM: Learning to Recommend with Missing Modalities
Multimodal learning has shown promising performance in content-based
recommendation due to the auxiliary user and item information of multiple
modalities such as text and images. However, the problem of incomplete and
missing modality is rarely explored and most existing methods fail in learning
a recommendation model with missing or corrupted modalities. In this paper, we
propose LRMM, a novel framework that mitigates not only the problem of missing
modalities but also more generally the cold-start problem of recommender
systems. We propose modality dropout (m-drop) and a multimodal sequential
autoencoder (m-auto) to learn multimodal representations for complementing and
imputing missing modalities. Extensive experiments on real-world Amazon data
show that LRMM achieves state-of-the-art performance on rating prediction
tasks. More importantly, LRMM is more robust to previous methods in alleviating
data-sparsity and the cold-start problem.Comment: 11 pages, EMNLP 201
On the Impact of Entity Linking in Microblog Real-Time Filtering
Microblogging is a model of content sharing in which the temporal locality of
posts with respect to important events, either of foreseeable or unforeseeable
nature, makes applica- tions of real-time filtering of great practical
interest. We propose the use of Entity Linking (EL) in order to improve the
retrieval effectiveness, by enriching the representation of microblog posts and
filtering queries. EL is the process of recognizing in an unstructured text the
mention of relevant entities described in a knowledge base. EL of short pieces
of text is a difficult task, but it is also a scenario in which the information
EL adds to the text can have a substantial impact on the retrieval process. We
implement a start-of-the-art filtering method, based on the best systems from
the TREC Microblog track realtime adhoc retrieval and filtering tasks , and
extend it with a Wikipedia-based EL method. Results show that the use of EL
significantly improves over non-EL based versions of the filtering methods.Comment: 6 pages, 1 figure, 1 table. SAC 2015, Salamanca, Spain - April 13 -
17, 201
- …