Learning Term Weights for Ad-hoc Retrieval
Most Information Retrieval models compute the relevance score of a document
for a given query by summing term weights specific to a document or a query.
Heuristic approaches, like TF-IDF, or probabilistic models, like BM25, are used
to specify how a term weight is computed. In this paper, we propose to leverage
learning-to-rank principles to learn how to compute a term weight for a given
document based on the term occurrence pattern.
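As a concrete illustration of the summed-term-weight scoring the abstract describes, here is a minimal BM25 sketch. The `k1` and `b` defaults are conventional values, not parameters taken from the paper, and the function names are illustrative:

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """BM25 weight of one term in one document.
    tf: term frequency in the document, df: document frequency in the corpus."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Saturating tf component with document-length normalization.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

def score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs):
    """Relevance score = sum of per-term weights, as in the abstract."""
    return sum(
        bm25_weight(doc_tf.get(t, 0), doc_len, avg_doc_len, df[t], n_docs)
        for t in query_terms if t in df
    )
```

A learning-to-rank approach, as proposed in the paper, would replace this hand-crafted weight function with a learned one while keeping the same summed-score structure.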
Reply With: Proactive Recommendation of Email Attachments
Email responses often contain items, such as a file or a hyperlink to an
external document, that are attached to or included inline in the body of the
message. Analysis of an enterprise email corpus reveals that 35% of the time
when users include these items as part of their response, the attachable item
is already present in their inbox or sent folder. A modern email client can
proactively retrieve relevant attachable items from the user's past emails
based on the context of the current conversation, and recommend them for
inclusion, to reduce the time and effort involved in composing the response. In
this paper, we propose a weakly supervised learning framework for recommending
attachable items to the user. As email search systems are commonly available,
we constrain the recommendation task to formulating effective search queries
from the context of the conversations. The query is submitted to an existing IR
system to retrieve relevant items for attachment. We also present a novel
strategy for generating labels from an email corpus---without the need for
manual annotations---that can be used to train and evaluate the query
formulation model. In addition, we describe a deep convolutional neural network
that demonstrates satisfactory performance on this query formulation task when
evaluated on the publicly available Avocado dataset and a proprietary dataset
of internal emails obtained through an employee participation program.
Comment: CIKM2017. Proceedings of the 26th ACM International Conference on
Information and Knowledge Management, 2017.
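The query-formulation step described above can be sketched with a simple frequency-based heuristic: pick the most discriminative terms from the conversation context and use them as the search query. This is an illustrative stand-in, not the paper's learned convolutional model, and all names below are hypothetical:

```python
from collections import Counter
import math

def formulate_query(context_tokens, df, n_docs, k=5):
    """Select the k most discriminative context terms as the query.
    df maps term -> document frequency in the user's mailbox.
    A stand-in for the paper's learned CNN query-formulation model."""
    tf = Counter(context_tokens)
    # TF-IDF-style weighting: frequent-in-context, rare-in-corpus terms win.
    weight = {t: c * math.log(n_docs / (1 + df.get(t, 0)))
              for t, c in tf.items()}
    return [t for t, _ in sorted(weight.items(), key=lambda x: -x[1])[:k]]
```

The resulting query would then be submitted to the existing email search system, matching the paper's constraint that recommendation reduces to query formulation over an off-the-shelf IR backend.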
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection are available online.
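The exact brute-force k-NN baseline that the abstract contrasts with approximate search can be sketched as follows. Cosine similarity over dense vectors is an assumption for illustration; the paper's similarity function is more elaborate, and an approximate graph-based index would replace the full scan for speed:

```python
import numpy as np

def knn_search(query_vec, doc_vecs, k=10):
    """Exact brute-force k-NN under cosine similarity.
    Returns indices of the k most similar rows of doc_vecs,
    ordered from most to least similar."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    # Partial sort: find the top-k without fully sorting every score.
    top = np.argpartition(-sims, min(k, len(sims) - 1))[:k]
    return top[np.argsort(-sims[top])]
```

An approximate method trades the guaranteed exactness of this scan for sub-linear query time, which is the roughly two-orders-of-magnitude speedup the abstract reports.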
Towards Better Text Understanding and Retrieval through Kernel Entity Salience Modeling
This paper presents a Kernel Entity Salience Model (KESM) that improves text
understanding and retrieval by better estimating entity salience (importance)
in documents. KESM represents entities by knowledge enriched distributed
representations, models the interactions between entities and words by kernels,
and combines the kernel scores to estimate entity salience. The whole model is
learned end-to-end using entity salience labels. The salience model also
improves ad hoc search accuracy, providing effective ranking features by
modeling the salience of query entities in candidate documents. Our experiments
on two entity salience corpora and two TREC ad hoc search datasets demonstrate
the effectiveness of KESM over frequency-based and feature-based methods. We
also provide examples showing how KESM conveys its text understanding ability
learned from entity salience to search.
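KESM's kernel interaction layer can be illustrated with RBF kernel pooling over entity-word similarities, in the style of kernel-based ranking models. The kernel means, width, and uniform fallback weights below are illustrative defaults, not the paper's learned parameters:

```python
import numpy as np

def kernel_scores(similarities, mus=(-0.5, 0.0, 0.5, 1.0), sigma=0.1):
    """Soft-count entity-word similarities into RBF kernel bins.
    Each kernel counts how many similarities fall near its mean mu."""
    sims = np.asarray(similarities, dtype=float)
    return np.array([np.exp(-(sims - mu) ** 2 / (2 * sigma ** 2)).sum()
                     for mu in mus])

def salience(similarities, weights=None):
    """Combine kernel scores linearly into a salience estimate.
    In KESM the combination weights are learned end-to-end from
    entity salience labels; uniform weights here are a placeholder."""
    k = kernel_scores(similarities)
    w = np.ones_like(k) if weights is None else np.asarray(weights, dtype=float)
    return float(w @ k)
```

An entity whose embedding sits close to many document words accumulates mass in the high-similarity kernels, which (with suitable learned weights) yields a higher salience score.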
Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies
Recently, a new paradigm called Differentiable Search Index (DSI) has been
proposed for document retrieval, wherein a sequence-to-sequence model is
learned to directly map queries to relevant document identifiers. The key idea
behind DSI is to fully parameterize traditional "index-retrieve" pipelines
within a single neural model, by encoding all documents in the corpus into the
model parameters. In essence, DSI needs to resolve two major questions: (1) how
to assign an identifier to each document, and (2) how to learn the associations
between a document and its identifier. In this work, we propose a
Semantic-Enhanced DSI model (SE-DSI) motivated by Learning Strategies in the
area of Cognitive Psychology. Our approach advances original DSI in two ways:
(1) For the document identifier, we take inspiration from Elaboration
Strategies in human learning. Specifically, we assign each document an
Elaborative Description based on the query generation technique, which is more
meaningful than a string of integers in the original DSI; and (2) For the
associations between a document and its identifier, we take inspiration from
Rehearsal Strategies in human learning. Specifically, we select fine-grained
semantic features from a document as Rehearsal Contents to improve document
memorization. Both offline and online experiments show improved retrieval
performance over prevailing baselines.
Comment: Accepted by KDD 202
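One practical requirement for any DSI-style model is that the decoder emit only identifiers that actually exist in the corpus. A common way to enforce this is trie-constrained decoding; the sketch below is a generic illustration with greedy decoding over a hypothetical `score_fn`, not SE-DSI's actual implementation:

```python
def build_trie(identifiers):
    """Prefix trie over whitespace-tokenized document identifiers
    (e.g. the elaborative descriptions SE-DSI assigns to documents)."""
    root = {}
    for ident in identifiers:
        node = root
        for tok in ident.split():
            node = node.setdefault(tok, {})
        node["<eos>"] = {}  # marks a complete identifier
    return root

def constrained_decode(score_fn, trie):
    """Greedy decoding restricted to valid identifier prefixes.
    score_fn(prefix, token) stands in for the seq2seq model's
    next-token score given the query and decoded prefix."""
    prefix, node = [], trie
    while True:
        tok = max(node, key=lambda t: score_fn(prefix, t))
        if tok == "<eos>":
            return " ".join(prefix)
        prefix.append(tok)
        node = node[tok]
```

At each step the model may only choose among tokens that extend some real identifier, so the decoder can never hallucinate a document that is not in the index.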