232 research outputs found
A Topic Modeling Toolbox Using Belief Propagation
Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model
for probabilistic topic modeling, which attracts worldwide interests and
touches on many important applications in text mining, computer vision and
computational biology. This paper introduces a topic modeling toolbox (TMBP)
based on the belief propagation (BP) algorithms. TMBP toolbox is implemented by
MEX C++/Matlab/Octave for either Windows 7 or Linux. Compared with existing
topic modeling packages, the novelty of this toolbox lies in the BP algorithms
for learning LDA-based topic models. The current version includes BP algorithms
for latent Dirichlet allocation (LDA), author-topic models (ATM), relational
topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project
and more BP-based algorithms for various topic models will be added in the near
future. Interested users may also extend BP algorithms for learning more
complicated topic models. The source codes are freely available under the GNU
General Public Licence, Version 1.0 at https://mloss.org/software/view/399/.Comment: 4 page
Technology Assisted Reviews: Finding the Last Few Relevant Documents by Asking Yes/No Questions to Reviewers
The goal of a technology-assisted review is to achieve high recall with low
human effort. Continuous active learning algorithms have demonstrated good
performance in locating the majority of relevant documents in a collection,
however their performance is reaching a plateau when 80\%-90\% of them has been
found. Finding the last few relevant documents typically requires exhaustively
reviewing the collection. In this paper, we propose a novel method to identify
these last few, but significant, documents efficiently. Our method makes the
hypothesis that entities carry vital information in documents, and that
reviewers can answer questions about the presence or absence of an entity in
the missing relevance documents. Based on this we devise a sequential Bayesian
search method that selects the optimal sequence of questions to ask. The
experimental results show that our proposed method can greatly improve
performance requiring less reviewing effort.Comment: This paper is accepted by SIGIR 201
Ordering-sensitive and Semantic-aware Topic Modeling
Topic modeling of textual corpora is an important and challenging problem. In
most previous work, the "bag-of-words" assumption is usually made which ignores
the ordering of words. This assumption simplifies the computation, but it
unrealistically loses the ordering information and the semantic of words in the
context. In this paper, we present a Gaussian Mixture Neural Topic Model
(GMNTM) which incorporates both the ordering of words and the semantic meaning
of sentences into topic modeling. Specifically, we represent each topic as a
cluster of multi-dimensional vectors and embed the corpus into a collection of
vectors generated by the Gaussian mixture model. Each word is affected not only
by its topic, but also by the embedding vector of its surrounding words and the
context. The Gaussian mixture components and the topic of documents, sentences
and words can be learnt jointly. Extensive experiments show that our model can
learn better topics and more accurate word distributions for each topic.
Quantitatively, comparing to state-of-the-art topic modeling approaches, GMNTM
obtains significantly better performance in terms of perplexity, retrieval
accuracy and classification accuracy.Comment: To appear in proceedings of AAAI 201
Exploiting Social Annotation for Automatic Resource Discovery
Information integration applications, such as mediators or mashups, that
require access to information resources currently rely on users manually
discovering and integrating them in the application. Manual resource discovery
is a slow process, requiring the user to sift through results obtained via
keyword-based search. Although search methods have advanced to include evidence
from document contents, its metadata and the contents and link structure of the
referring pages, they still do not adequately cover information sources --
often called ``the hidden Web''-- that dynamically generate documents in
response to a query. The recently popular social bookmarking sites, which allow
users to annotate and share metadata about various information sources, provide
rich evidence for resource discovery. In this paper, we describe a
probabilistic model of the user annotation process in a social bookmarking
system del.icio.us. We then use the model to automatically find resources
relevant to a particular information domain. Our experimental results on data
obtained from \emph{del.icio.us} show this approach as a promising method for
helping automate the resource discovery task.Comment: 6 pages, submitted to AAAI07 workshop on Information Integration on
the We
- …