Adaptive Document Retrieval for Deep Question Answering
State-of-the-art systems in deep question answering proceed as follows: (1)
an initial document retrieval selects relevant documents, which (2) are then
processed by a neural network in order to extract the final answer. Yet the
exact interplay between both components is poorly understood, especially
concerning the number of candidate documents that should be retrieved. We show
that choosing a static number of documents -- as used in prior research --
suffers from a noise-information trade-off and yields suboptimal results. As a
remedy, we propose an adaptive document retrieval model. This learns the
optimal candidate number for document retrieval, conditional on the size of the
corpus and the query. We report extensive experimental results showing that our
adaptive approach outperforms state-of-the-art methods on multiple benchmark
datasets, as well as in the context of corpora with variable sizes.
Comment: EMNLP 201
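The core idea above is to replace a static candidate count with one predicted per query. As a rough illustration only, the heuristic below mimics what such a learned model might capture (more candidates for large corpora and short, ambiguous queries); the features, weights, and cutoffs are invented, not the paper's model.

```python
import math

def adaptive_candidate_count(corpus_size, query_terms, k_min=1, k_max=50):
    """Heuristic stand-in for a learned model: retrieve more candidate
    documents for large corpora and short (ambiguous) queries."""
    # Larger corpus -> more near-duplicates and noise in the top ranks.
    size_factor = math.log10(max(corpus_size, 10))
    # Shorter queries are more ambiguous -> widen the candidate pool.
    specificity = 1.0 / max(len(query_terms), 1)
    score = min(1.0, 0.15 * size_factor + specificity)
    k = round(k_min + (k_max - k_min) * score)
    return max(k_min, min(k, k_max))

k = adaptive_candidate_count(corpus_size=5_000_000, query_terms=["einstein"])
```

In the paper the mapping from (corpus size, query) to a candidate count is learned rather than hand-set, but the interface is the same: the retriever asks the model for k before fetching documents.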
Transfer Meets Hybrid: A Synthetic Approach for Cross-Domain Collaborative Filtering with Text
Collaborative filtering (CF) is the key technique for recommender systems
(RSs). CF exploits user-item behavior interactions (e.g., clicks) only and
hence suffers from the data sparsity issue. One research thread is to integrate
auxiliary information such as product reviews and news titles, leading to
hybrid filtering methods. Another thread is to transfer knowledge from other
source domains such as improving the movie recommendation with the knowledge
from the book domain, leading to transfer learning methods. In real life, no
single service can satisfy all of a user's information needs. This motivates us
to exploit both auxiliary and source information for RSs in this paper. We
propose a novel neural model to smoothly enable Transfer Meeting Hybrid (TMH)
methods for cross-domain recommendation with unstructured text in an end-to-end
manner. TMH attentively extracts useful content from unstructured text via a
memory module and selectively transfers knowledge from a source domain via a
transfer network. On two real-world datasets, TMH shows better performance in
terms of three ranking metrics by comparing with various baselines. We conduct
thorough analyses to understand how the text content and transferred knowledge
help the proposed model.
Comment: 11 pages, 7 figures, a full version for the WWW 2019 short pape
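The "attentively extracts useful content from unstructured text via a memory module" step can be pictured as a standard attention read over text embeddings. The sketch below is a generic attentive read, not TMH's actual architecture; shapes and data are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_read(query_vec, memory):
    """Read a text memory by attention: weight each memory slot (e.g., a
    review-word embedding) by its match with the query vector (e.g., a
    user-item interaction embedding), then return the weighted content."""
    scores = memory @ query_vec        # (num_slots,) relevance scores
    weights = softmax(scores)          # attention distribution over slots
    return weights @ memory, weights   # blended content vector + weights

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 4))       # 8 word slots, embedding dim 4
q = rng.normal(size=4)
content, weights = attentive_read(q, memory)
```

The attended `content` vector would then be combined with the CF signal and, in TMH's case, with knowledge selectively transferred from the source domain.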
Selective Query Processing: a Risk-Sensitive Selection of System Configurations
In information retrieval systems, search parameters are optimized to ensure
high effectiveness based on a set of past searches and these optimized
parameters are then used as the system configuration for all subsequent
queries. A better approach, however, would be to adapt the parameters to fit
the query at hand. Selective query expansion is one such approach, in which
the system decides automatically whether or not to expand the query, resulting
in two possible system configurations. This approach was extended recently to
include many other parameters, leading to many possible system configurations
where the system automatically selects the best configuration on a per-query
basis. To determine the ideal configurations to use on a per-query basis in
real-world systems we developed a method in which a restricted number of
possible configurations is pre-selected and then used in a meta-search engine
that decides the best search configuration on a per query basis. We define a
risk-sensitive approach for configuration pre-selection that considers the
risk-reward trade-off between the number of configurations kept, and system
effectiveness. For final configuration selection, the decision is based on
query feature similarities. We find that a relatively small number of
configurations (20) selected by our risk-sensitive model is sufficient to
increase effectiveness by about 15% (P@10, nDCG@10) when compared to
traditional grid search using a single configuration and by about 20% when
compared to learning to rank documents. Our risk-sensitive approach works for
both diversity- and ad hoc-oriented searches. Moreover, the similarity-based
selection method outperforms the more sophisticated approaches. Thus, we
demonstrate the feasibility of developing per-query information retrieval
systems, which will guide future research in this direction.
Comment: 30 pages, 5 figures, 8 tables; submitted to TOIS ACM journa
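The final step described above, choosing a configuration per query by feature similarity, can be sketched as a nearest-neighbor lookup over past queries. Everything here (features, configuration names, the cosine similarity choice) is an invented illustration of the general idea, not the paper's implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_configuration(query_features, past_queries):
    """past_queries: list of (features, best_config_id) pairs observed on
    training queries. Return the configuration that worked best on the
    most similar past query."""
    return max(past_queries, key=lambda p: cosine(query_features, p[0]))[1]

history = [([1.0, 0.2, 0.0], "config-expansion-on"),
           ([0.1, 0.9, 0.5], "config-expansion-off")]
chosen = select_configuration([0.9, 0.3, 0.1], history)  # -> "config-expansion-on"
```

The risk-sensitive part of the paper happens upstream: it prunes the full configuration space down to a small pre-selected set (around 20) before any per-query decision is made.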
Classification-Aware Hidden-Web Text Database Selection
Many valuable text databases on the web have noncrawlable contents that are “hidden” behind
search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web”
text databases at once through a unified query interface. An important step in the metasearching
process is database selection, or determining which databases are the most relevant for a given
user query. The state-of-the-art database selection techniques rely on statistical summaries of the
database contents, generally including the database vocabulary and associated word frequencies.
Unfortunately, hidden-web text databases typically do not export such summaries, so previous research
has developed algorithms for constructing approximate content summaries from document
samples extracted from the databases via querying. We present a novel “focused-probing” sampling
algorithm that detects the topics covered in a database and adaptively extracts documents that
are representative of the topic coverage of the database. Our algorithm is the first to construct
content summaries that include the frequencies of the words in the database. Unfortunately, Zipf’s
law practically guarantees that for any relatively large database, content summaries built from
moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete
content summaries might negatively affect the database selection process, especially for short
queries with infrequent words. To enhance the sparse document samples and improve the database
selection decisions, we exploit the fact that topically similar databases tend to have similar
vocabularies, so samples extracted from databases with a similar topical focus can complement
each other. We have developed two database selection algorithms that exploit this observation.
The first algorithm proceeds hierarchically and selects the best categories for a query, and then
sends the query to the appropriate databases in the chosen categories. The second algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data,
to enhance the database content summaries with category-specific words. We describe how to modify
existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is
beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web databases
as well as TREC data, suggests that the proposed sampling methods generate high-quality
content summaries and that the database selection algorithms produce significantly more relevant
database selection decisions and overall search results than existing algorithms.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
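The "shrinkage" step described above, smoothing a sparse per-database word estimate toward a category-level estimate, can be sketched as a simple linear interpolation. The mixing weight and the toy frequencies below are illustrative assumptions; the paper decides adaptively at runtime whether (and how much) shrinkage helps a query.

```python
def shrink_summary(db_probs, category_probs, lambda_=0.7):
    """Interpolate a sparse per-database word-probability summary with its
    category's summary: lambda_*db + (1-lambda_)*category. Words missing
    from the document sample get nonzero probability from the category."""
    vocab = set(db_probs) | set(category_probs)
    return {w: lambda_ * db_probs.get(w, 0.0)
               + (1 - lambda_) * category_probs.get(w, 0.0)
            for w in vocab}

db = {"cancer": 0.02, "therapy": 0.01}       # sampled summary (sparse)
cat = {"cancer": 0.015, "oncology": 0.005}   # hypothetical "health" category
summary = shrink_summary(db, cat)
```

Note how "oncology", absent from the sampled summary, now receives a small probability, which is exactly what protects short queries with infrequent words from being misrouted.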
Personalized Web Search via Query Expansion based on User’s Local Hierarchically-Organized Files
Users of Web search engines generally express information needs with short and ambiguous queries, leading to irrelevant results. Personalized search methods improve users’ experience by automatically reformulating queries before sending them to the search engine, or by rearranging received results, according to their specific interests. A user profile is often built from previous queries, clicked results or, in general, from the user’s browsing history; different topics must be distinguished in order to obtain an accurate profile. It is quite common that a set of user files, locally stored in sub-directories, is organized by the user into a coherent taxonomy corresponding to the user’s own topics of interest, but only a few methods leverage this potentially useful source of knowledge. We propose a novel method where a user profile is built from those files, specifically considering their consistent arrangement in directories. A bag of keywords is extracted for each directory from the text documents within it. We can infer the topic of each query and expand it by adding the corresponding keywords, in order to obtain a more targeted formulation. Experiments are carried out using benchmark data through a repeatable systematic process, in order to evaluate objectively how much our method can improve the relevance of query results when applied on a third-party search engin
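The expansion mechanism described above can be sketched in a few lines: match the query against per-directory keyword bags and append keywords from the best-matching directory. The directory names, keywords, and the overlap scoring are invented for illustration, not the paper's exact method.

```python
def expand_query(query, directory_keywords):
    """directory_keywords: {directory_name: set_of_keywords}, one bag per
    local sub-directory. Pick the directory whose keywords overlap the
    query most, then append a few of its remaining keywords."""
    terms = set(query.lower().split())
    best_dir = max(directory_keywords,
                   key=lambda d: len(terms & directory_keywords[d]))
    extra = sorted(directory_keywords[best_dir] - terms)[:3]
    return query + " " + " ".join(extra) if extra else query

profile = {"photography": {"camera", "lens", "aperture"},
           "cooking": {"recipe", "oven", "flour"}}
q = expand_query("best lens", profile)  # -> "best lens aperture camera"
```

The expanded query is then sent unchanged to the third-party search engine, so the method needs no access to the engine's internals.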
MergeDTS: A Method for Effective Large-Scale Online Ranker Evaluation
Online ranker evaluation is one of the key challenges in information
retrieval. While the preferences of rankers can be inferred by interleaving
methods, the problem of how to effectively choose the ranker pair that
generates the interleaved list without degrading the user experience too much
is still challenging. On the one hand, if two rankers have not been compared
enough, the inferred preference can be noisy and inaccurate. On the other hand, if
two rankers are compared too many times, the interleaving process inevitably
hurts the user experience too much. This dilemma is known as the exploration
versus exploitation tradeoff. It is captured by the K-armed dueling bandit
problem, which is a variant of the K-armed bandit problem, where the feedback
comes in the form of pairwise preferences. Today's deployed search systems can
evaluate a large number of rankers concurrently, and scaling effectively in the
presence of numerous rankers is a critical aspect of K-armed dueling bandit
problems.
In this paper, we focus on solving the large-scale online ranker evaluation
problem under the so-called Condorcet assumption, where there exists an optimal
ranker that is preferred to all other rankers. We propose Merge Double Thompson
Sampling (MergeDTS), which first utilizes a divide-and-conquer strategy that
localizes the comparisons carried out by the algorithm to small batches of
rankers, and then employs Thompson Sampling (TS) to reduce the comparisons
between suboptimal rankers inside these small batches. The effectiveness
(regret) and efficiency (time complexity) of MergeDTS are extensively evaluated
using examples from the domain of online evaluation for web search. Our main
finding is that for large-scale Condorcet ranker evaluation problems, MergeDTS
outperforms the state-of-the-art dueling bandit algorithms.
Comment: Accepted at TOI
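The Thompson-sampling step inside a batch can be pictured as follows: each ranker pair keeps a Beta posterior over "champion beats challenger", and the next interleaving duel goes to the pair the algorithm is least certain about. This is a deliberately simplified single-step illustration, not the full MergeDTS algorithm (it omits the merge/divide-and-conquer logic and the exact dueling rule).

```python
import random

def pick_challenger(champion, rankers, wins, losses, rng):
    """For each other ranker j, sample P(champion beats j) from
    Beta(wins+1, losses+1) and duel the ranker the champion is
    least certain to beat (lowest sampled win probability)."""
    samples = {
        j: rng.betavariate(wins[(champion, j)] + 1, losses[(champion, j)] + 1)
        for j in rankers if j != champion
    }
    return min(samples, key=samples.get)

rng = random.Random(42)
rankers = ["A", "B", "C"]
wins = {("A", "B"): 8, ("A", "C"): 1}    # past interleaving outcomes
losses = {("A", "B"): 2, ("A", "C"): 1}
challenger = pick_challenger("A", rankers, wins, losses, rng)
```

Restricting such comparisons to small batches of rankers is what lets the method scale to the large ranker pools mentioned above: suboptimal rankers are eliminated locally before competing globally.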