Computing Web-scale Topic Models using an Asynchronous Parameter Server
Topic models such as Latent Dirichlet Allocation (LDA) have been widely used
in information retrieval for tasks ranging from smoothing and feedback methods
to tools for exploratory search and discovery. However, classical methods for
inferring topic models do not scale up to the massive size of today's publicly
available Web-scale data sets. The state-of-the-art approaches rely on custom
strategies, implementations and hardware to facilitate their asynchronous,
communication-intensive workloads.
We present APS-LDA, which integrates state-of-the-art topic modeling with
cluster computing frameworks such as Spark using a novel asynchronous parameter
server. Advantages of this integration include convenient usage of existing
data processing pipelines and eliminating the need for disk writes as data can
be kept in memory from start to finish. Our goal is not to outperform highly
customized implementations, but to propose a general high-performance topic
modeling framework that can easily be used in today's data processing
pipelines. We compare APS-LDA to the existing Spark LDA implementations and
show that our system can, on a 480-core cluster, process up to 135 times more
data and 10 times more topics without sacrificing model quality.
Comment: To appear in SIGIR 201
Query Expansion with Locally-Trained Word Embeddings
Continuous space word embeddings have received a great deal of attention in
the natural language processing and machine learning communities for their
ability to model term similarity and other relationships. We study the use of
term relatedness in the context of query expansion for ad hoc information
retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when
trained globally, underperform corpus- and query-specific embeddings for
retrieval tasks. These results suggest that other tasks benefiting from global
embeddings may also benefit from local embeddings.
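As a rough illustration of embedding-based query expansion in general, the sketch below picks the vocabulary terms whose vectors lie closest to the centroid of the query terms. It is not the method from the paper: the function names, the centroid heuristic, and the toy two-dimensional vectors are all assumptions made for this example. In a local setting, the `embeddings` table would presumably come from a model trained on documents related to the query rather than on a global corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_query(query_terms, embeddings, k=2):
    """Return the k non-query terms closest to the query centroid.

    `embeddings` maps each term to its vector (hypothetical input format).
    """
    vectors = [embeddings[t] for t in query_terms if t in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    scored = [
        (cosine(centroid, vec), term)
        for term, vec in embeddings.items()
        if term not in query_terms
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [term for _, term in scored[:k]]


# Toy vectors, invented purely for illustration.
TOY_EMBEDDINGS = {
    "cut": [1.0, 0.0],
    "tax": [0.9, 0.1],
    "deficit": [0.8, 0.3],
    "scissors": [0.0, 1.0],
}

expansion_terms = expand_query(["cut"], TOY_EMBEDDINGS, k=2)
```

With these toy vectors, the query "cut" is expanded with "tax" and "deficit", while "scissors" is left out. A different embedding table could rank the candidates differently, which is the kind of effect the abstract attributes to local training.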
Entity Query Feature Expansion Using Knowledge Base Links
Recent advances in automatic entity linking and knowledge base
construction have resulted in entity annotations for document and
query collections, for example annotations of entities from large
general-purpose knowledge bases such as Freebase and the Google
Knowledge Graph. Understanding how to leverage these entity
annotations of text to improve ad hoc document retrieval is an open
research area. Query expansion is a commonly used technique to
improve retrieval effectiveness. Most previous query expansion
approaches focus on text, mainly using unigram concepts. In this
paper, we propose a new technique, called entity query feature
expansion (EQFE), which enriches the query with features from
entities and their links to knowledge bases, including structured
attributes and text. We experiment using both explicit query entity
annotations and latent entities. We evaluate our technique on TREC
text collections automatically annotated with knowledge base entity
links, including the Google Freebase Annotations (FACC1) data.
We find that entity-based feature expansion results in significant
improvements in retrieval effectiveness over state-of-the-art text
expansion approaches.
Modeling Documents as Mixtures of Persons for Expert Finding
In this paper we address the problem of searching for knowledgeable
persons within the enterprise, known as the expert finding (or
expert search) task. We present a probabilistic algorithm based on the assumption
that terms in documents are produced by the people who are mentioned
in them. We represent documents retrieved for a query as mixtures
of candidate experts' language models. We propose two methods for extracting
personal language models, as well as a way of combining
them with other evidence of expertise. Experiments conducted with the
TREC Enterprise collection demonstrate the superiority of our approach
over the best existing solution.
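One way to read the mixture assumption in this abstract is sketched below. The notation is assumed here, not taken from the paper: $E_d$ is the set of candidate experts mentioned in document $d$, $\theta_e$ is the personal language model of expert $e$, and $p(e \mid d)$ is the mixture weight of that expert in the document.

```latex
p(t \mid d) \;=\; \sum_{e \in E_d} p(e \mid d)\, p(t \mid \theta_e),
\qquad \sum_{e \in E_d} p(e \mid d) = 1
```

Under this reading, each term $t$ in a document is generated by first choosing one of the mentioned persons and then drawing the term from that person's language model.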
Language Models
Contains fulltext: 227630.pdf (preprint version) (Open Access)