1,345 research outputs found

    Adaptive Document Retrieval for Deep Question Answering

    Full text link
    State-of-the-art systems in deep question answering proceed as follows: (1) an initial document retrieval selects relevant documents, which (2) are then processed by a neural network in order to extract the final answer. Yet the exact interplay between both components is poorly understood, especially concerning the number of candidate documents that should be retrieved. We show that choosing a static number of documents -- as used in prior research -- suffers from a noise-information trade-off and yields suboptimal results. As a remedy, we propose an adaptive document retrieval model. This learns the optimal candidate number for document retrieval, conditional on the size of the corpus and the query. We report extensive experimental results showing that our adaptive approach outperforms state-of-the-art methods on multiple benchmark datasets, as well as in the context of corpora with variable sizes.Comment: EMNLP 201

    Transfer Meets Hybrid: A Synthetic Approach for Cross-Domain Collaborative Filtering with Text

    Full text link
    Collaborative filtering (CF) is the key technique for recommender systems (RSs). CF exploits user-item behavior interactions (e.g., clicks) only and hence suffers from the data sparsity issue. One research thread is to integrate auxiliary information such as product reviews and news titles, leading to hybrid filtering methods. Another thread is to transfer knowledge from other source domains such as improving the movie recommendation with the knowledge from the book domain, leading to transfer learning methods. In real-world life, no single service can satisfy a user's all information needs. Thus it motivates us to exploit both auxiliary and source information for RSs in this paper. We propose a novel neural model to smoothly enable Transfer Meeting Hybrid (TMH) methods for cross-domain recommendation with unstructured text in an end-to-end manner. TMH attentively extracts useful content from unstructured text via a memory module and selectively transfers knowledge from a source domain via a transfer network. On two real-world datasets, TMH shows better performance in terms of three ranking metrics by comparing with various baselines. We conduct thorough analyses to understand how the text content and transferred knowledge help the proposed model.Comment: 11 pages, 7 figures, a full version for the WWW 2019 short pape

    Selective Query Processing: a Risk-Sensitive Selection of System Configurations

    Full text link
    In information retrieval systems, search parameters are optimized to ensure high effectiveness based on a set of past searches and these optimized parameters are then used as the system configuration for all subsequent queries. A better approach, however, would be to adapt the parameters to fit the query at hand. Selective query expansion is one such an approach, in which the system decides automatically whether or not to expand the query, resulting in two possible system configurations. This approach was extended recently to include many other parameters, leading to many possible system configurations where the system automatically selects the best configuration on a per-query basis. To determine the ideal configurations to use on a per-query basis in real-world systems we developed a method in which a restricted number of possible configurations is pre-selected and then used in a meta-search engine that decides the best search configuration on a per query basis. We define a risk-sensitive approach for configuration pre-selection that considers the risk-reward trade-off between the number of configurations kept, and system effectiveness. For final configuration selection, the decision is based on query feature similarities. We find that a relatively small number of configurations (20) selected by our risk-sensitive model is sufficient to increase effectiveness by about 15% according(P@10, nDCG@10) when compared to traditional grid search using a single configuration and by about 20% when compared to learning to rank documents. Our risk-sensitive approach works for both diversity- and ad hoc-oriented searches. Moreover, the similarity-based selection method outperforms the more sophisticated approaches. Thus, we demonstrate the feasibility of developing per-query information retrieval systems, which will guide future research in this direction.Comment: 30 pages, 5 figures, 8 tables; submitted to TOIS ACM journa

    Classification-Aware Hidden-Web Text Database Selection,

    Get PDF
    Many valuable text databases on the web have noncrawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web” text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. The state-of-the-art database selection techniques rely on statistical summaries of the database contents, generally including the database vocabulary and associated word frequencies. Unfortunately, hidden-web text databases typically do not export such summaries, so previous research has developed algorithms for constructing approximate content summaries from document samples extracted from the databases via querying.We present a novel “focused-probing” sampling algorithm that detects the topics covered in a database and adaptively extracts documents that are representative of the topic coverage of the database. Our algorithm is the first to construct content summaries that include the frequencies of the words in the database. Unfortunately, Zipf’s law practically guarantees that for any relatively large database, content summaries built from moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To enhance the sparse document samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with a similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically and selects the best categories for a query, and then sends the query to the appropriate databases in the chosen categories. The second algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words.We describe how to modify existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web databases as well as TREC data, suggests that the proposed sampling methods generate high-quality content summaries and that the database selection algorithms produce significantly more relevant database selection decisions and overall search results than existing algorithms.NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc

    Personalized Web Search via Query Expansion based on User’s Local Hierarchically-Organized Files

    Get PDF
    Users of Web search engines generally express information needs with short and ambiguous queries, leading to irrelevant results. Personalized search methods improve users’ experience by automatically reformulating queries before sending them to the search engine or rearranging received results, according to their specific interests. A user profile is often built from previous queries, clicked results or in general from the user’s browsing history; different topics must be distinguished in order to obtain an accurate profile. It is quite common that a set of user files, locally stored in sub-directory, are organized by the user into a coherent taxonomy corresponding to own topics of interest, but only a few methods leverage on this potentially useful source of knowledge. We propose a novel method where a user profile is built from those files, specifically considering their consistent arrangement in directories. A bag of keywords is extracted for each directory from text documents with in it. We can infer the topic of each query and expand it by adding the corresponding keywords, in order to obtain a more targeted formulation. Experiments are carried out using benchmark data through a repeatable systematic process, in order to evaluate objectively how much our method can improve relevance of query results when applied upon a third-party search engin

    MergeDTS: A Method for Effective Large-Scale Online Ranker Evaluation

    Full text link
    Online ranker evaluation is one of the key challenges in information retrieval. While the preferences of rankers can be inferred by interleaving methods, the problem of how to effectively choose the ranker pair that generates the interleaved list without degrading the user experience too much is still challenging. On the one hand, if two rankers have not been compared enough, the inferred preference can be noisy and inaccurate. On the other, if two rankers are compared too many times, the interleaving process inevitably hurts the user experience too much. This dilemma is known as the exploration versus exploitation tradeoff. It is captured by the KK-armed dueling bandit problem, which is a variant of the KK-armed bandit problem, where the feedback comes in the form of pairwise preferences. Today's deployed search systems can evaluate a large number of rankers concurrently, and scaling effectively in the presence of numerous rankers is a critical aspect of KK-armed dueling bandit problems. In this paper, we focus on solving the large-scale online ranker evaluation problem under the so-called Condorcet assumption, where there exists an optimal ranker that is preferred to all other rankers. We propose Merge Double Thompson Sampling (MergeDTS), which first utilizes a divide-and-conquer strategy that localizes the comparisons carried out by the algorithm to small batches of rankers, and then employs Thompson Sampling (TS) to reduce the comparisons between suboptimal rankers inside these small batches. The effectiveness (regret) and efficiency (time complexity) of MergeDTS are extensively evaluated using examples from the domain of online evaluation for web search. Our main finding is that for large-scale Condorcet ranker evaluation problems, MergeDTS outperforms the state-of-the-art dueling bandit algorithms.Comment: Accepted at TOI
    • …
    corecore