9 research outputs found

    How much data resides in a web collection: how to estimate size of a web collection

    With the increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. For the algorithms applied to this purpose, it is knowledge of a data source's size that enables accurate decisions about when to stop crawling or sampling, processes which can be costly in some cases [4]. The demand to know the sizes of data sources is further increased by competition among businesses on the Web, where data coverage is critical. This information is also helpful for quality assessment of search engines [2], for search engine selection in federated search, and for resource/collection selection in distributed search [6]. In addition, it can provide useful statistics for public sectors such as governments. In all of these scenarios, when facing a non-cooperative collection that does not publish its own information, the size has to be estimated [5]. In this paper, the approaches in the literature are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. One of the modifications improves the estimates of other approaches by 35 to 65 percent.
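
    A classic way to obtain such an estimate for a non-cooperative collection is capture-recapture: draw two independent, roughly uniform samples (for example via random queries) and infer the total size from their overlap. Below is a minimal sketch of the Lincoln-Petersen estimator, demonstrated on a synthetic collection of known size rather than a real web source.

```python
import random

def lincoln_petersen(sample_a, sample_b):
    """Estimate collection size from two independent samples.

    If a collection of N documents is sampled roughly uniformly twice,
    the expected overlap is |A| * |B| / N, so N ~= |A| * |B| / overlap.
    """
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        raise ValueError("no overlap between samples; draw larger samples")
    return len(sample_a) * len(sample_b) / overlap

# Toy demonstration on a synthetic collection of known size.
collection = [f"doc-{i}" for i in range(10_000)]
sample_a = random.sample(collection, 500)
sample_b = random.sample(collection, 500)
print(f"estimated size: {lincoln_petersen(sample_a, sample_b):.0f}")  # close to 10,000
```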

    A Coherent Measurement of Web-Search Relevance

    We present a metric for quantitatively assessing the quality of Web searches. The relevance-of-searching-on-target index measures how relevant a search result is with respect to the searcher's interest and intention. The measurement is established on the basis of the cognitive characteristics of common users' online Web-browsing behavior and processes. We evaluated the accuracy of the index function against a set of surveys conducted on several groups of our college students. While the index is primarily intended to compare Web-search results and tell which is more relevant, it can be extended to other applications. For example, it can be used to evaluate the techniques people apply to improve Web-search quality (including the quality of search engines), as well as other factors such as the expressiveness of search queries and the effectiveness of result-filtering processes.
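
    The abstract does not spell out the formula behind the relevance-of-searching-on-target index, so the sketch below uses a standard rank-discounted measure (normalized DCG) purely as an illustration of how a quantitative search-quality score can be computed from graded relevance judgments; it is not the paper's index.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance discounted by rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize by the ideal ordering so scores are comparable across queries."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded judgments (0 = irrelevant .. 3 = highly relevant) for the top five
# results of a hypothetical query, as a user survey might provide them.
print(round(ndcg([3, 2, 0, 1, 0]), 3))
```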

    Optimal Algorithms for Crawling a Hidden Database in the Web

    A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data acquisition from such a source does not proceed by following static hyperlinks; instead, data are obtained by querying the interface and reading the dynamically generated result pages. This, together with other facts, such as the interface answering a query only partially, has prevented hidden databases from being crawled effectively by existing search engines. This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient, namely, they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal, i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm that the proposed techniques work very well on all the real datasets examined.
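
    The obstacle described here is that a form interface typically returns only the top-k results for any query, so one broad query cannot enumerate the database. A common workaround, sketched below on a toy in-memory "database" with a single numeric attribute (an illustrative simplification, not the paper's exact algorithm), is to split the query range recursively until every sub-query fits within one result page.

```python
import bisect

K = 100  # the interface returns at most K tuples per query

# Toy stand-in for a hidden database: tuples keyed by one numeric attribute.
DATABASE = sorted(range(0, 50_000, 7))

def query_range(low, high):
    """Hypothetical top-K interface: tuples with attribute in [low, high),
    truncated to K results, plus a flag indicating truncation."""
    lo = bisect.bisect_left(DATABASE, low)
    hi = bisect.bisect_left(DATABASE, high)
    matches = DATABASE[lo:hi]
    return matches[:K], len(matches) > K

def crawl(low, high, out):
    tuples, truncated = query_range(low, high)
    if not truncated:
        out.extend(tuples)       # this range fits within one result page
        return
    mid = (low + high) // 2      # otherwise split the range and recurse
    crawl(low, mid, out)
    crawl(mid, high, out)

results = []
crawl(0, 50_000, results)
print(len(results) == len(DATABASE))  # True: every tuple was extracted
```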

    Interactive Proofs for Social Graphs

    We consider interactive proofs for social graphs, where the verifier has only oracle access to the graph and can query for the i-th neighbor of a vertex v, given i and v. In this model, we construct a doubly-efficient public-coin two-message interactive protocol for estimating the size of the graph to within a multiplicative factor ε > 0. The verifier performs Õ(1/ε² · τ_mix · Δ) queries to the graph, where τ_mix is the mixing time of the graph and Δ is the average degree of the graph. The prover runs in quasi-linear time in the number of nodes in the graph. Furthermore, we develop a framework for computing the quantiles of essentially any (reasonable) function f of the vertices/edges of the graph. Using this framework, we can estimate many health measures of social graphs, such as the clustering coefficients and the average degree, where the verifier performs only a small number of queries to the graph. Using the Fiat-Shamir paradigm, we are able to transform the above protocols into a non-interactive argument in the random oracle model. The result is that social media companies (e.g., Facebook, Twitter, etc.) can publish, once and for all, a short proof for the size or health of their social network. This proof can be publicly verified by any single user using a small number of queries to the graph.
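
    One well-known building block for estimating a graph's size from few neighbor queries is collision counting over random-walk samples: nodes visited by a long random walk appear with probability proportional to their degree, and the number of repeated nodes among the samples reveals the total node count. The sketch below shows that estimator on synthetic degree-proportional samples; it is only an illustration of the underlying estimation idea, not this paper's verifier or interactive protocol.

```python
import random
from collections import Counter

def estimate_graph_size(samples, degree):
    """Collision-based size estimate from degree-proportional node samples.

    `samples` are node ids drawn (approximately) from the stationary
    distribution of a random walk, so node v appears with probability
    proportional to degree(v); `degree(v)` returns v's degree.
    """
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        raise ValueError("too few samples: no repeated nodes observed")
    sum_deg = sum(degree(v) for v in samples)
    sum_inv_deg = sum(1.0 / degree(v) for v in samples)
    return sum_deg * sum_inv_deg / (2.0 * collisions)

# Toy check: 20,000 nodes with degrees 1..5, sampled proportionally to
# degree as a stand-in for random-walk sampling.
degrees = {v: (v % 5) + 1 for v in range(20_000)}
nodes, weights = zip(*degrees.items())
samples = random.choices(nodes, weights=weights, k=3_000)
print(round(estimate_graph_size(samples, degrees.get)))  # roughly 20,000
```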

    Discovering the size of a deep web data source by coverage

    The deep web is a part of the web that can only be accessed via query interfaces. Discovering the size of a deep web data source has been an important and challenging problem ever since the web emerged. The size plays an important role in crawling and extracting data from a deep web source. This thesis proposes a new coverage-based method for estimating the size. The method relies on the construction of a query pool that can cover most of the data source. We propose two approaches to constructing a query pool so that the document frequency variance is small and most of the documents can be covered. Our experiments on four data collections show that using a query pool built from a sample of the collection results in lower bias and variance. We compared the new method with three existing methods on the corpora we collected.
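
    The coverage idea can be made concrete as follows (a simplified sketch of the general approach with single-term queries, not necessarily the thesis's exact estimator): if the query pool covers essentially every document, the sum of the pool's document frequencies counts each document once per matching query, so dividing that sum by the average number of pool queries matching a document, estimated from a small document sample, yields the collection size.

```python
def estimate_size_by_coverage(pool_df, sampled_docs, matches):
    """Coverage-based size estimate.

    pool_df      : {query: document frequency reported by the source}
    sampled_docs : a (roughly uniform) sample of documents from the source
    matches      : matches(query, doc) -> True if the query hits the document

    Assuming the pool covers (nearly) every document,
        sum of document frequencies = size * average multiplicity,
    where multiplicity is the number of pool queries matching a document.
    """
    total_df = sum(pool_df.values())
    multiplicities = [sum(1 for q in pool_df if matches(q, d)) for d in sampled_docs]
    return total_df / (sum(multiplicities) / len(multiplicities))

# Toy usage: documents as term sets, single-term queries.
docs = [{"deep", "web"}, {"web", "size"}, {"query", "pool"}, {"deep", "pool"}]
pool_df = {q: sum(1 for d in docs if q in d) for q in ["deep", "web", "size", "query", "pool"]}
print(estimate_size_by_coverage(pool_df, docs, lambda q, d: q in d))  # 4.0
```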

    Efficient search engine measurements

    We address the problem of externally measuring aggregate functions over documents indexed by search engines, such as corpus size, index freshness, and the density of duplicates in the corpus. State-of-the-art estimators for such quantities [5, 10] are biased due to inaccurate approximation of the so-called "document degrees". In addition, the estimators in [5] are quite costly, due to their reliance on rejection sampling. We present new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias, even in situations where document degrees are poorly approximated. By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than the estimators proposed in [5]. Furthermore, building on an idea from [10], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in performance improvements without compromising accuracy.
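
    The degree-based idea behind such estimators can be illustrated with a minimal importance-sampling sketch that assumes exact document degrees; the paper's contribution lies in the harder setting where degrees are only approximated, which this sketch does not address. Documents are sampled with probability proportional to their degree (the number of pool queries matching them) and weighted by the inverse degree.

```python
import random

def estimate_indexed_size(result_sets, degree, num_samples=1_000):
    """Importance-sampling estimate of the number of documents the pool reaches.

    result_sets : {query: list of matching document ids}
    degree(d)   : number of pool queries that match document d

    Picking a query proportionally to its result size and then a uniform
    document from its results samples document d with probability
    degree(d) / T, where T is the total number of (query, document) matches.
    Weighting each sample by 1 / degree(d) estimates N / T, so multiplying
    by T recovers the document count N.
    """
    queries = list(result_sets)
    weights = [len(result_sets[q]) for q in queries]
    total_matches = sum(weights)
    inv_degrees = []
    for q in random.choices(queries, weights=weights, k=num_samples):
        d = random.choice(result_sets[q])
        inv_degrees.append(1.0 / degree(d))
    return total_matches * sum(inv_degrees) / num_samples

# Toy usage: documents as term sets, three single-term pool queries.
docs = {i: {"a", "b"} if i % 2 else {"a", "c"} for i in range(1_000)}
result_sets = {t: [i for i, terms in docs.items() if t in terms] for t in "abc"}
degree = lambda d: sum(1 for t in "abc" if t in docs[d])
print(round(estimate_indexed_size(result_sets, degree)))  # the true count, 1,000
```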