12,945 research outputs found

Adaptive query-based sampling of distributed collections

Author: C. Zhai
J.P. Callan
M. Baillie
M. Shokouhi
M.H. Degroot
N.J. Belkin
R. Baeza-Yates
R.O. Duda
S. Kullback
T. Hofmann
Publication date: 01/01/2006

As part of a Distributed Information Retrieval system a description of each remote information resource, archive or repository is usually stored centrally in order to facilitate resource selection. The acquisition of precise resource descriptions is therefore an important phase in Distributed Information Retrieval, as the quality of such representations will impact on selection accuracy, and ultimately retrieval performance. While Query-Based Sampling is currently used for content discovery of uncooperative resources, the application of this technique is dependent upon heuristic guidelines to determine when a sufficiently accurate representation of each remote resource has been obtained. In this paper we address this shortcoming by using the Predictive Likelihood to provide both an indication of the quality of an acquired resource description estimate, and when a sufficiently good representation of a resource has been obtained during Query-Based Sampling.

CiteSeerX

Crossref

University of Strathclyde Institutional Repository

Enlighten

RERO DOC Digital Library

Adaptive query-based sampling for distributed IR

Author: Azzopardi L.
Baillie M.
Crestani F.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2006

No abstract available

CiteSeerX

University of Strathclyde Institutional Repository

Enlighten

Distributed Information Retrieval using Keyword Auctions

Author: Hiemstra D.
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2008

This report motivates the need for large-scale distributed approaches to information retrieval, and proposes solutions based on keyword auctions

CiteSeerX

Radboud Repository

University of Twente Research Information

Query-Based Sampling using Snippets

Author: Hiemstra D.
Tigelaar Almer S.
Publication venue: ACM
Publication date: 01/01/2010

Query-based sampling is a commonly used approach to model the content of servers. Conventionally, queries are sent to a server and the documents in the search results returned are downloaded in full as representation of the server’s content. We present an approach that uses the document snippets in the search results as samples instead of downloading the entire documents. We show this yields equal or better modeling performance for the same bandwidth consumption depending on collection characteristics, like document length distribution and homogeneity. Query-based sampling using snippets is a useful approach for real-world systems, since it requires no extra operations beyond exchanging queries and search results

Radboud Repository

University of Twente Research Information

Noisy Submodular Maximization via Adaptive Sampling with Applications to Crowdsourced Image Collection Summarization

Author: Krause Andreas
Singla Adish
Tschiatschek Sebastian
Publication date: 01/12/2015

We address the problem of maximizing an unknown submodular function that can only be accessed via noisy evaluations. Our work is motivated by the task of summarizing content, e.g., image collections, by leveraging users' feedback in the form of clicks or ratings. For summarization tasks with the goal of maximizing coverage and diversity, submodular set functions are a natural choice. When the underlying submodular function is unknown, users' feedback can provide noisy evaluations of the function that we seek to maximize. We provide a generic algorithm -- \submM{} -- for maximizing an unknown submodular function under cardinality constraints. This algorithm makes use of a novel exploration module -- \blbox{} -- that proposes good elements based on adaptively sampling noisy function evaluations. \blbox{} is able to accommodate different kinds of observation models such as value queries and pairwise comparisons. We provide PAC-style guarantees on the quality and sampling cost of the solution obtained by \submM{}. We demonstrate the effectiveness of our approach in an interactive, crowdsourced image collection summarization application. Comment: Extended version of AAAI'16 paper.
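To make the setting in this abstract concrete, the sketch below shows the classic noiseless greedy heuristic for a coverage objective under a cardinality constraint. It is not the algorithm proposed in the paper, which deals with noisy evaluations; the function name `greedy_max_coverage` and the toy data are assumptions made purely for illustration.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k named sets, each time taking the largest marginal coverage gain.

    Classic greedy heuristic for a monotone submodular coverage objective.
    It assumes exact (noise-free) evaluations of the objective.
    """
    chosen, covered = [], set()
    for _ in range(min(k, len(sets))):
        candidates = [name for name in sets if name not in chosen]
        # Marginal gain = number of elements not covered so far.
        best = max(candidates, key=lambda name: len(sets[name] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


# Hypothetical toy data: each "image" covers a set of topic ids.
images = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}
summary, topics = greedy_max_coverage(images, k=2)
```

With this toy input the first pick is the set with four uncovered topics, and the second pick is whichever set adds the most topics after that.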

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

PENG: integrated search of distributed news archives

Author: Baillie M.
Crestani F.
Landoni M.
Publication venue: 'American College of Medical Physics (ACMP)'
Publication date: 01/01/2006

The PENG system is intended to provide an integrated and personalized environment for news professionals, providing functionalities for filtering, distributed retrieval, and a flexible interface environment for the display and manipulation of news materials. In this paper we review the progress and results of the PENG system to date, and describe in detail the document filtering part of the system, which is designed to gather and filter documents to user profiles. The current architecture will be described, along with some of the main issues which have so far been found in its development.

University of Strathclyde Institutional Repository

Query-Based Sampling using Only Snippets

Author: Hiemstra Djoerd
Tigelaar Almer S.
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2009

Query-based sampling is a popular approach to model the content of an uncooperative server. It works by sending queries to the server and downloading the returned documents in the search results in full. This sample of documents then represents the server’s content. We present an approach that uses the document snippets as samples instead of downloading entire documents. This yields more stable results at the same amount of bandwidth usage as the full document approach. Additionally, we show that using snippets does not necessarily incur more latency, but can actually save time
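The sampling loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyServer`, its `search` and `fetch` methods, the five-word snippet window, and the rule of picking the next query term at random from the terms seen so far are all hypothetical choices.

```python
import random
from collections import Counter


class ToyServer:
    """Hypothetical in-memory stand-in for an uncooperative search server."""

    def __init__(self, documents):
        self.documents = documents

    def search(self, term, limit=4):
        """Return up to `limit` (doc_id, snippet) pairs for documents containing `term`."""
        hits = []
        for doc_id, text in enumerate(self.documents):
            words = text.split()
            if term in words:
                i = words.index(term)
                hits.append((doc_id, " ".join(words[max(0, i - 2):i + 3])))
        return hits[:limit]

    def fetch(self, doc_id):
        """Download the full text of one document."""
        return self.documents[doc_id]


def query_based_sample(server, seed_term, rounds=10, use_snippets=False, seed=0):
    """Build a term-frequency model of the server from query results."""
    rng = random.Random(seed)
    model = Counter()
    seen_docs, seen_snippets = set(), set()
    term = seed_term
    for _ in range(rounds):
        for doc_id, snippet in server.search(term):
            if use_snippets:
                # Snippet variant: learn only from the text already in the result list.
                if (doc_id, snippet) not in seen_snippets:
                    seen_snippets.add((doc_id, snippet))
                    model.update(snippet.split())
            elif doc_id not in seen_docs:
                # Full-document variant: one extra download per new document.
                model.update(server.fetch(doc_id).split())
            seen_docs.add(doc_id)
        if not model:
            break  # the seed term matched nothing
        term = rng.choice(sorted(model))  # next query: a term learned so far
    return model, seen_docs


docs = [
    "distributed information retrieval needs resource descriptions",
    "query based sampling builds resource descriptions from search results",
    "snippets are short summaries shown in search results",
    "bandwidth is saved when only snippets are downloaded",
]
server = ToyServer(docs)
full_model, full_docs = query_based_sample(server, "resource")
snippet_model, snippet_docs = query_based_sample(server, "resource", use_snippets=True)
```

In the snippet variant `fetch` is never called, which mirrors the bandwidth argument made in the abstract.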

CiteSeerX

Radboud Repository

University of Twente Research Information

Towards better measures: evaluation of estimated resource description quality for distributed IR

Author: Azzopardi L.
Baillie M.
Crestani F.
Publication date: 01/01/2006

An open problem for Distributed Information Retrieval systems (DIR) is how to represent large document repositories, also known as resources, both accurately and efficiently. Obtaining resource description estimates is an important phase in DIR, especially in non-cooperative environments. Measuring the quality of an estimated resource description is a contentious issue as current measures do not provide an adequate indication of quality. In this paper, we provide an overview of these currently applied measures of resource description quality, before proposing the Kullback-Leibler (KL) divergence as an alternative. Through experimentation we illustrate the shortcomings of these past measures, whilst providing evidence that KL is a more appropriate measure of quality. When applying KL to compare different QBS algorithms, our experiments provide strong evidence in favour of a previously unsupported hypothesis originally posited in the initial Query-Based Sampling work
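As a rough illustration of using KL divergence to score an estimated resource description against the actual collection, consider the sketch below. The additive smoothing constant `alpha`, the helper names, and the toy token lists are assumptions; the abstract does not specify how the paper smooths or estimates the distributions.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, alpha=0.01):
    """Additively smoothed unigram distribution over a fixed vocabulary (alpha is an assumption)."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in bits; p and q must share keys and q must be strictly positive."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


# Hypothetical toy data: the full collection versus a sampled description.
actual = "the cat sat on the mat the dog sat".split()
sample = "the cat sat the dog".split()
vocab = set(actual) | set(sample)

p = term_distribution(actual, vocab)   # actual resource
q = term_distribution(sample, vocab)   # estimated resource description
score = kl_divergence(p, q)            # lower means a closer estimate
```

Smoothing matters here because a term that is missing from the sample would otherwise make the divergence infinite.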

Crossref

University of Strathclyde Institutional Repository

Enlighten

Non-monotone Submodular Maximization with Nearly Optimal Adaptivity and Query Complexity

Author: Fahrbach Matthew
Mirrokni Vahab
Zadimoghaddam Morteza
Publication date: 28/05/2019

Submodular maximization is a general optimization problem with a wide range of applications in machine learning (e.g., active learning, clustering, and feature selection). In large-scale optimization, the parallel running time of an algorithm is governed by its adaptivity, which measures the number of sequential rounds needed if the algorithm can execute polynomially-many independent oracle queries in parallel. While low adaptivity is ideal, it is not sufficient for an algorithm to be efficient in practice---there are many applications of distributed submodular optimization where the number of function evaluations becomes prohibitively expensive. Motivated by these applications, we study the adaptivity and query complexity of submodular maximization. In this paper, we give the first constant-factor approximation algorithm for maximizing a non-monotone submodular function subject to a cardinality constraint $k$ that runs in $O(\log(n))$ adaptive rounds and makes $O(n \log(k))$ oracle queries in expectation. In our empirical study, we use three real-world applications to compare our algorithm with several benchmarks for non-monotone submodular maximization. The results demonstrate that our algorithm finds competitive solutions using significantly fewer rounds and queries. Comment: 12 pages, 8 figures.

arXiv.org e-Print Archive

Unbiased Comparative Evaluation of Ranking Functions

Author: Owen A. B.
Pavlu V.
Peng Ye D. D.
Sparck-Jones K.
Voorhees E. M.
Yuan C.
Zhao P.
Publication date: 25/04/2016

Eliciting relevance judgments for ranking evaluation is labor-intensive and costly, motivating careful selection of which documents to judge. Unlike traditional approaches that make this selection deterministically, probabilistic sampling has shown intriguing promise since it enables the design of estimators that are provably unbiased even when reusing data with missing judgments. In this paper, we first unify and extend these sampling approaches by viewing the evaluation problem as a Monte Carlo estimation task that applies to a large number of common IR metrics. Drawing on the theoretical clarity that this view offers, we tackle three practical evaluation scenarios: comparing two systems, comparing $k$ systems against a baseline, and ranking $k$ systems. For each scenario, we derive an estimator and a variance-optimizing sampling distribution while retaining the strengths of sampling-based evaluation, including unbiasedness, reusability despite missing data, and ease of use in practice. In addition to the theoretical contribution, we empirically evaluate our methods against previously used sampling heuristics and find that they generally cut the number of required relevance judgments at least in half. Comment: Under review; 10 pages.
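The idea of treating evaluation as Monte Carlo estimation can be illustrated with a generic importance-sampling estimator of precision over a ranked list. This is a sketch under assumptions, not the estimator or the variance-optimizing distribution derived in the paper: the rank-biased sampling distribution `q`, the function name, and the toy relevance labels are all hypothetical.

```python
import random


def estimate_precision(relevance, q, num_judgments, seed=0):
    """Importance-sampling estimate of precision over a ranked list.

    Documents are drawn with replacement from the sampling distribution q.
    Each judged document contributes relevance / (n * q[i]), so the
    expectation equals the true mean relevance for any q that is positive
    wherever relevance is positive.
    """
    rng = random.Random(seed)
    n = len(relevance)
    sampled = rng.choices(range(n), weights=q, k=num_judgments)
    # In practice only the sampled documents would need a human judgment;
    # the full label list is used here only to simulate the assessor.
    return sum(relevance[i] / (n * q[i]) for i in sampled) / num_judgments


# Hypothetical ground truth for a ranking of ten documents.
relevance = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

# Assumed sampling distribution: favour documents near the top of the ranking.
raw = [1.0 / (rank + 1) for rank in range(len(relevance))]
q = [w / sum(raw) for w in raw]

true_precision = sum(relevance) / len(relevance)
estimate = estimate_precision(relevance, q, num_judgments=20000)
```

The reweighting by `1 / (n * q[i])` is what keeps the estimate unbiased even though highly ranked documents are judged more often than the rest.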

arXiv.org e-Print Archive

Crossref