Pyndri: a Python Interface to the Indri Search Engine
We introduce pyndri, a Python interface to the Indri search engine. Pyndri
provides access to Indri indexes from Python at two levels: (1) the dictionary and
tokenized document collection, and (2) query evaluation on the index. We hope
that with the release of pyndri, we will stimulate reproducible, open and
fast-paced IR research.Comment: ECIR2017. Proceedings of the 39th European Conference on Information
Retrieval. 2017. The final publication will be available at Springe
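To make the two access levels concrete, here is a minimal sketch in plain Python. This is an illustrative in-memory mock, not pyndri's actual API: the class name, scoring, and data are all assumptions.

```python
# Illustrative mock (NOT pyndri's real API) of the two access levels the
# abstract describes: (1) dictionary + tokenized collection, (2) query
# evaluation over the index.

class ToyIndex:
    def __init__(self, docs):
        # Level 1: a dictionary mapping terms to integer ids, plus the
        # collection stored as sequences of those ids.
        self.token2id = {}
        self.documents = []
        for doc_id, text in docs.items():
            token_ids = [self.token2id.setdefault(t, len(self.token2id) + 1)
                         for t in text.lower().split()]
            self.documents.append((doc_id, token_ids))

    def query(self, q):
        # Level 2: score documents by query-term frequency (a crude
        # stand-in for Indri's query-likelihood scoring) and rank them.
        q_ids = {self.token2id.get(t) for t in q.lower().split()}
        scored = [(doc_id, sum(1 for t in tokens if t in q_ids))
                  for doc_id, tokens in self.documents]
        return sorted(scored, key=lambda x: -x[1])

index = ToyIndex({
    'd1': 'python interface to the indri search engine',
    'd2': 'reproducible open and fast paced ir research',
})
print(index.query('indri python'))  # 'd1' ranks first with score 2
```

The real pyndri exposes the same two levels against an on-disk Indri index rather than an in-memory toy.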
Neural Vector Spaces for Unsupervised Information Retrieval
We propose the Neural Vector Space Model (NVSM), a method that learns
representations of documents in an unsupervised manner for news article
retrieval. In the NVSM paradigm, we learn low-dimensional representations of
words and documents from scratch using gradient descent and rank documents
according to their similarity with query representations that are composed from
word representations. We show that NVSM performs better at document ranking
than existing latent semantic vector space methods. The addition of NVSM to a
mixture of lexical language models and a state-of-the-art baseline vector space
model yields a statistically significant increase in retrieval effectiveness.
Consequently, NVSM adds a complementary relevance signal. Next to semantic
matching, we find that NVSM performs well in cases where lexical matching is
needed.
NVSM learns a notion of term specificity directly from the document
collection without feature engineering. We also show that NVSM learns
regularities related to Luhn significance. Finally, we give advice on how to
deploy NVSM in situations where model selection (e.g., cross-validation) is
infeasible. We find that an unsupervised ensemble of multiple models trained
with different hyperparameter values performs better than a single
cross-validated model. Therefore, NVSM can safely be used for ranking documents
without supervised relevance judgments.
Comment: TOIS 2018.
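The ranking step described above can be sketched in a few lines, under simplifying assumptions: NVSM learns its word and document vectors from scratch with gradient descent, but here they are hand-made toy values; the query representation is composed as the mean of its word vectors, and documents are ranked by cosine similarity.

```python
# Minimal sketch of NVSM-style ranking. The vectors are made-up toy
# values standing in for representations NVSM would learn from scratch.
import math

word_vecs = {            # assumed "learned" word representations
    'neural':  [0.9, 0.1, 0.0],
    'ranking': [0.8, 0.2, 0.1],
    'cooking': [0.0, 0.1, 0.9],
}
doc_vecs = {             # assumed "learned" document representations
    'ir-paper': [0.85, 0.15, 0.05],
    'recipes':  [0.05, 0.10, 0.95],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank(query):
    # Compose the query representation from word representations (mean),
    # then rank documents by similarity to it.
    vecs = [word_vecs[w] for w in query.split() if w in word_vecs]
    q = [sum(c) / len(vecs) for c in zip(*vecs)]
    return sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)

print(rank('neural ranking'))  # 'ir-paper' first
```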
Lexical Query Modeling in Session Search
Lexical query modeling has been the leading paradigm for session search. In
this paper, we analyze TREC session query logs and compare the performance of
different lexical matching approaches for session search. Naive methods based
on term frequency weighting perform on par with specialized session models. In
addition, we investigate the viability of lexical query models in the setting
of session search. We give important insights into the potential and
limitations of lexical query modeling for session search and propose future
directions for the field of session search.
Comment: ICTIR2016. Proceedings of the 2nd ACM International Conference on the Theory of Information Retrieval. 2016.
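A "naive method based on term frequency weighting" of the kind the abstract finds competitive can be sketched as follows. The session queries, documents, and scoring are illustrative assumptions, not the paper's exact models.

```python
# Toy sketch of naive lexical query modeling for session search: merge
# all queries in the session into one term-frequency-weighted query
# model, then score documents by weighted term overlap.
from collections import Counter

session = ['jaguar speed', 'jaguar animal speed', 'how fast is a jaguar']

# Session query model: term frequencies over all queries in the session.
query_model = Counter(t for q in session for t in q.split())

def score(doc_tokens):
    # Weighted overlap between the document and the session query model.
    return sum(query_model[t] for t in doc_tokens)

docs = {
    'd1': 'the jaguar is the fastest cat its speed tops 80 km h'.split(),
    'd2': 'the jaguar car brand'.split(),
}
ranking = sorted(docs, key=lambda d: score(docs[d]), reverse=True)
print(ranking)  # 'd1' (about jaguar speed) outranks 'd2'
```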
Structural Regularities in Text-based Entity Vector Spaces
Entity retrieval is the task of finding entities such as people or products
in response to a query, based solely on the textual documents they are
associated with. Recent semantic entity retrieval algorithms represent queries
and entities in finite-dimensional vector spaces, where both are constructed
from text sequences.
We investigate entity vector spaces and the degree to which they capture
structural regularities. Such vector spaces are constructed in an unsupervised
manner without explicit information about structural aspects. For concreteness,
we address these questions for a specific type of entity: experts in the
context of expert finding. We discover how clusterings of experts correspond to
committees in organizations, the ability of expert representations to encode
the co-author graph, and the degree to which they encode academic rank. We
compare latent, continuous representations created using methods based on
distributional semantics (LSI), topic models (LDA) and neural networks
(word2vec, doc2vec, SERT). Vector spaces created using neural methods, such as
doc2vec and SERT, systematically perform better at clustering than LSI, LDA and
word2vec. When it comes to encoding entity relations, SERT performs best.
Comment: ICTIR2017. Proceedings of the 3rd ACM International Conference on the Theory of Information Retrieval. 2017.
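The kind of structural probe the abstract describes can be illustrated with a tiny check: do unsupervised expert vectors sit closer to colleagues on the same committee than to others? The vectors and committee labels below are made-up assumptions, not data from the paper.

```python
# Sketch of a structural-regularity probe: compare average within-committee
# cosine similarity to average cross-committee similarity on toy vectors.
import math

expert_vecs = {
    'alice': [0.9, 0.1], 'bob':  [0.8, 0.2],   # assumed committee A
    'carol': [0.1, 0.9], 'dave': [0.2, 0.8],   # assumed committee B
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def mean_sim(pairs):
    sims = [cosine(expert_vecs[a], expert_vecs[b]) for a, b in pairs]
    return sum(sims) / len(sims)

within = mean_sim([('alice', 'bob'), ('carol', 'dave')])
across = mean_sim([('alice', 'carol'), ('bob', 'dave')])
print(within > across)  # committee structure is reflected in the space
```

In the paper this comparison is done at scale (clustering quality against organizational ground truth) rather than on pairs.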
Semantic Entity Retrieval Toolkit
Unsupervised learning of low-dimensional, semantic representations of words
and entities has recently gained attention. In this paper we describe the
Semantic Entity Retrieval Toolkit (SERT) that provides implementations of our
previously published entity representation models. The toolkit provides a
unified interface to different representation learning algorithms, fine-grained
parsing configuration and can be used transparently with GPUs. In addition,
users can easily modify existing models or implement their own models in the
framework. After model training, SERT can be used to rank entities according to
a textual query and extract the learned entity/word representation for use in
downstream algorithms, such as clustering or recommendation.
Comment: SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR'17). 2017.
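After training, the two uses named above, ranking entities for a textual query and extracting representations for downstream use, look roughly like this. This is a hedged sketch: the representations are toy values and the scoring is a simple dot product, not SERT's actual output or API.

```python
# Illustrative post-training usage of a SERT-style model (toy values,
# not SERT's real interface or learned representations).

word_repr   = {'search': [1.0, 0.0], 'biology': [0.0, 1.0]}      # assumed
entity_repr = {'ir-group': [0.9, 0.1], 'bio-lab': [0.1, 0.9]}    # assumed

def rank_entities(query):
    # Score each entity by its dot product with the summed query word
    # representations (a stand-in for the model's scoring function).
    q = [sum(word_repr[w][i] for w in query.split() if w in word_repr)
         for i in range(2)]
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return sorted(entity_repr, key=lambda e: dot(q, entity_repr[e]),
                  reverse=True)

print(rank_entities('search'))        # 'ir-group' first

# Extract a learned entity representation for downstream algorithms,
# e.g. clustering or recommendation.
vec = entity_repr['ir-group']
```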
ViTOR: Learning to Rank Webpages Based on Visual Features
The visual appearance of a webpage carries valuable information about its
quality and can be used to improve the performance of learning to rank (LTR).
We introduce the Visual learning TO Rank (ViTOR) model that integrates
state-of-the-art visual feature extraction methods by (i) transfer learning
from a pre-trained image classification model, and (ii) synthetic saliency heat
maps generated from webpage snapshots. Since there is currently no public
dataset for the task of LTR with visual features, we also introduce and release
the ViTOR dataset, containing visually rich and diverse webpages. The ViTOR
dataset consists of visual snapshots, non-visual features and relevance
judgments for ClueWeb12 webpages and TREC Web Track queries. We experiment with
the proposed ViTOR model on the ViTOR dataset and show that it significantly
improves the performance of LTR with visual features.
Comment: In Proceedings of the 2019 World Wide Web Conference (WWW 2019), May 2019, San Francisco.
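The setup ViTOR targets can be sketched as a scorer over two feature groups per page: conventional non-visual LTR features plus visual features (which ViTOR derives from a pre-trained CNN and synthetic saliency maps). Everything below, features, weights, and the linear scorer, is a toy assumption, not the ViTOR model itself.

```python
# Toy sketch of LTR with an added visual feature group. In ViTOR the
# visual features come from webpage snapshots; here they are made up.

pages = {
    # page: (non_visual_features, visual_features) -- assumed toy values
    'p1': ([0.7, 0.4], [0.9]),
    'p2': ([0.6, 0.5], [0.2]),
}
w_text, w_vis = [0.5, 0.5], [1.0]   # pretend-learned weights

def ltr_score(non_visual, visual):
    # Combine both feature groups with a learned linear scorer.
    dot = lambda w, x: sum(a * b for a, b in zip(w, x))
    return dot(w_text, non_visual) + dot(w_vis, visual)

ranking = sorted(pages, key=lambda p: ltr_score(*pages[p]), reverse=True)
print(ranking)  # the visually richer 'p1' outranks 'p2'
```

The paper's point is that the visual group adds signal the non-visual features alone miss; in this toy example the two pages tie on non-visual score and are separated only by the visual features.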
Broad expertise retrieval in sparse data environments
Expertise retrieval has been largely unexplored on data other than the W3C collection. At the same time, many intranets of universities and other knowledge-intensive organisations offer examples of relatively small but clean multilingual expertise data, covering broad ranges of expertise areas. We first present two main expertise retrieval tasks, along with a set of baseline approaches based on generative language modeling, aimed at finding expertise relations between topics and people. For our experimental evaluation, we introduce (and release) a new test set based on a crawl of a university site. Using this test set, we conduct two series of experiments. The first is aimed at determining the effectiveness of baseline expertise retrieval methods applied to the new test set. The second is aimed at assessing refined models that exploit characteristic features of the new test set, such as the organizational structure of the university, and the hierarchical structure of the topics in the test set. Expertise retrieval models are shown to be robust with respect to environments smaller than the W3C collection, and current techniques appear to be generalizable to other settings.
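The generative language-modeling baseline mentioned above scores a candidate for a topic by how likely the topic's terms are under the documents associated with that candidate. The sketch below is a deliberately simplified document-centric version: smoothing is omitted and the documents, associations, and names are all made up.

```python
# Toy sketch of generative language modeling for expertise retrieval:
# p(topic | candidate) estimated from the candidate's associated
# documents. No smoothing; all data is illustrative.

docs = {
    'd1': 'information retrieval evaluation'.split(),
    'd2': 'protein folding simulation'.split(),
}
author_of = {'d1': 'ann', 'd2': 'ben'}   # assumed document-person links

def p_term_given_doc(t, d):
    # Maximum-likelihood term probability within one document.
    return docs[d].count(t) / len(docs[d])

def score(candidate, topic):
    # Average, over the candidate's documents, of the product of topic
    # term probabilities (a minimal document-centric expert model).
    cand_docs = [d for d, a in author_of.items() if a == candidate]
    total = 0.0
    for d in cand_docs:
        p = 1.0
        for t in topic.split():
            p *= p_term_given_doc(t, d)
        total += p / len(cand_docs)
    return total

ranking = sorted({'ann', 'ben'},
                 key=lambda c: score(c, 'information retrieval'),
                 reverse=True)
print(ranking)  # 'ann' first: her document generates the topic terms
```

In practice such models smooth term probabilities with collection statistics so that a single unseen term does not zero out the score.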