9 research outputs found

    How much data resides in a web collection: how to estimate size of a web collection

    With the increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. For the algorithms applied to this purpose, it is knowledge of a data source's size that enables accurate decisions about when to stop crawling or sampling, processes which can be costly in some cases [4]. The demand to know the sizes of data sources is further increased by competition among businesses on the Web, where data coverage is critical. This information is also helpful for quality assessment of search engines [2], for search engine selection in federated search, and for resource/collection selection in distributed search [6]. In addition, it can provide useful statistics for public sectors such as governments. In all of these scenarios, when facing a non-cooperative collection that does not publish its own information, the size has to be estimated [5]. In this paper, the approaches in the literature are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. One of the modifications improves the estimates of other approaches by 35 to 65 percent.
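
    A classic way to obtain such an estimate for a non-cooperative collection is capture-recapture: draw two independent, roughly uniform samples (for example via random queries) and infer the total size from their overlap. Below is a minimal sketch of the Lincoln-Petersen estimator, demonstrated on a synthetic collection of known size rather than a real web source.

```python
import random

def lincoln_petersen(sample_a, sample_b):
    """Estimate collection size from two independent samples.

    If a collection of N documents is sampled roughly uniformly twice,
    the expected overlap is |A| * |B| / N, so N ~= |A| * |B| / overlap.
    """
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        raise ValueError("no overlap between samples; draw larger samples")
    return len(sample_a) * len(sample_b) / overlap

# Toy demonstration on a synthetic collection of known size.
collection = [f"doc-{i}" for i in range(10_000)]
sample_a = random.sample(collection, 500)
sample_b = random.sample(collection, 500)
print(f"estimated size: {lincoln_petersen(sample_a, sample_b):.0f}")  # close to 10,000
```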

    A Coherent Measurement of Web-Search Relevance

    We present a metric for quantitatively assessing the quality of Web searches. The relevance-of-searching-on-target index measures how relevant a search result is with respect to the searcher's interest and intention. The measurement is established on the basis of the cognitive characteristics of common users' online Web-browsing behavior and processes. We evaluated the accuracy of the index function against a set of surveys conducted on several groups of our college students. While the index is primarily intended to compare Web-search results and tell which is more relevant, it can be extended to other applications. For example, it can be used to evaluate the techniques people apply to improve Web-search quality (including the quality of search engines), as well as other factors such as the expressiveness of search queries and the effectiveness of result-filtering processes.
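
    The abstract does not spell out the formula behind the relevance-of-searching-on-target index, so the sketch below uses a standard rank-discounted measure (normalized DCG) purely as an illustration of how a quantitative search-quality score can be computed from graded relevance judgments; it is not the paper's index.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance discounted by rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize by the ideal ordering so scores are comparable across queries."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded judgments (0 = irrelevant .. 3 = highly relevant) for the top five
# results of a hypothetical query, as a user survey might provide them.
print(round(ndcg([3, 2, 0, 1, 0]), 3))
```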

    Optimal Algorithms for Crawling a Hidden Database in the Web

    A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data acquisition from such a source does not proceed by following static hyperlinks; instead, data are obtained by querying the interface and reading the dynamically generated result pages. This, together with other facts, such as the interface answering a query only partially, has prevented hidden databases from being crawled effectively by existing search engines. This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient, namely, they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal, i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm that the proposed techniques work very well on all the real datasets examined.
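
    The obstacle described here is that a form interface typically returns only the top-k results for any query, so one broad query cannot enumerate the database. A common workaround, sketched below on a toy in-memory "database" with a single numeric attribute (an illustrative simplification, not the paper's exact algorithm), is to split the query range recursively until every sub-query fits within one result page.

```python
import bisect

K = 100  # the interface returns at most K tuples per query

# Toy stand-in for a hidden database: tuples keyed by one numeric attribute.
DATABASE = sorted(range(0, 50_000, 7))

def query_range(low, high):
    """Hypothetical top-K interface: tuples with attribute in [low, high),
    truncated to K results, plus a flag indicating truncation."""
    lo = bisect.bisect_left(DATABASE, low)
    hi = bisect.bisect_left(DATABASE, high)
    matches = DATABASE[lo:hi]
    return matches[:K], len(matches) > K

def crawl(low, high, out):
    tuples, truncated = query_range(low, high)
    if not truncated:
        out.extend(tuples)       # this range fits within one result page
        return
    mid = (low + high) // 2      # otherwise split the range and recurse
    crawl(low, mid, out)
    crawl(mid, high, out)

results = []
crawl(0, 50_000, results)
print(len(results) == len(DATABASE))  # True: every tuple was extracted
```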

    Interactive Proofs for Social Graphs

    We consider interactive proofs for social graphs, where the verifier has only oracle access to the graph and can query for the i-th neighbor of a vertex v, given i and v. In this model, we construct a doubly-efficient public-coin two-message interactive protocol for estimating the size of the graph to within a multiplicative factor ε > 0. The verifier performs Õ(1/ε² · τ_mix · Δ) queries to the graph, where τ_mix is the mixing time of the graph and Δ is the average degree of the graph. The prover runs in quasi-linear time in the number of nodes in the graph. Furthermore, we develop a framework for computing the quantiles of essentially any (reasonable) function f of the vertices/edges of the graph. Using this framework, we can estimate many health measures of social graphs, such as the clustering coefficients and the average degree, where the verifier performs only a small number of queries to the graph. Using the Fiat-Shamir paradigm, we are able to transform the above protocols into a non-interactive argument in the random oracle model. The result is that social media companies (e.g., Facebook, Twitter, etc.) can publish, once and for all, a short proof for the size or health of their social network. This proof can be publicly verified by any single user using a small number of queries to the graph.
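
    One well-known building block for estimating a graph's size from few neighbor queries is collision counting over random-walk samples: nodes visited by a long random walk appear with probability proportional to their degree, and the number of repeated nodes among the samples reveals the total node count. The sketch below shows that estimator on synthetic degree-proportional samples; it is only an illustration of the underlying estimation idea, not this paper's verifier or interactive protocol.

```python
import random
from collections import Counter

def estimate_graph_size(samples, degree):
    """Collision-based size estimate from degree-proportional node samples.

    `samples` are node ids drawn (approximately) from the stationary
    distribution of a random walk, so node v appears with probability
    proportional to degree(v); `degree(v)` returns v's degree.
    """
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        raise ValueError("too few samples: no repeated nodes observed")
    sum_deg = sum(degree(v) for v in samples)
    sum_inv_deg = sum(1.0 / degree(v) for v in samples)
    return sum_deg * sum_inv_deg / (2.0 * collisions)

# Toy check: 20,000 nodes with degrees 1..5, sampled proportionally to
# degree as a stand-in for random-walk sampling.
degrees = {v: (v % 5) + 1 for v in range(20_000)}
nodes, weights = zip(*degrees.items())
samples = random.choices(nodes, weights=weights, k=3_000)
print(round(estimate_graph_size(samples, degrees.get)))  # roughly 20,000
```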

    Discovering the size of a deep web data source by coverage

    The deep web is a part of the web that can only be accessed via query interfaces. Discovering the size of a deep web data source has been an important and challenging problem ever since the web emerged. The size plays an important role in crawling and extracting data from a deep web source. This thesis proposes a new coverage-based method for estimating the size. The method relies on the construction of a query pool that can cover most of the data source. We propose two approaches to constructing a query pool so that the document frequency variance is small and most of the documents can be covered. Our experiments on four data collections show that using a query pool built from a sample of the collection results in lower bias and variance. We compared the new method with three existing methods on the corpora we collected.
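
    The coverage idea can be made concrete as follows (a simplified sketch of the general approach with single-term queries, not necessarily the thesis's exact estimator): if the query pool covers essentially every document, the sum of the pool's document frequencies counts each document once per matching query, so dividing that sum by the average number of pool queries matching a document, estimated from a small document sample, yields the collection size.

```python
def estimate_size_by_coverage(pool_df, sampled_docs, matches):
    """Coverage-based size estimate.

    pool_df      : {query: document frequency reported by the source}
    sampled_docs : a (roughly uniform) sample of documents from the source
    matches      : matches(query, doc) -> True if the query hits the document

    Assuming the pool covers (nearly) every document,
        sum of document frequencies = size * average multiplicity,
    where multiplicity is the number of pool queries matching a document.
    """
    total_df = sum(pool_df.values())
    multiplicities = [sum(1 for q in pool_df if matches(q, d)) for d in sampled_docs]
    return total_df / (sum(multiplicities) / len(multiplicities))

# Toy usage: documents as term sets, single-term queries.
docs = [{"deep", "web"}, {"web", "size"}, {"query", "pool"}, {"deep", "pool"}]
pool_df = {q: sum(1 for d in docs if q in d) for q in ["deep", "web", "size", "query", "pool"]}
print(estimate_size_by_coverage(pool_df, docs, lambda q, d: q in d))  # 4.0
```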

    Efficient search engine measurements

    We address the problem of externally measuring aggregate functions over documents indexed by search engines, such as corpus size, index freshness, and the density of duplicates in the corpus. State-of-the-art estimators for such quantities [5, 10] are biased due to inaccurate approximation of the so-called "document degrees". In addition, the estimators in [5] are quite costly, due to their reliance on rejection sampling. We present new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias, even in situations where document degrees are poorly approximated. By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than the estimators proposed in [5]. Furthermore, building on an idea from [10], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in performance improvements without compromising accuracy.
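
    The degree-based idea behind such estimators can be illustrated with a minimal importance-sampling sketch that assumes exact document degrees; the paper's contribution lies in the harder setting where degrees are only approximated, which this sketch does not address. Documents are sampled with probability proportional to their degree (the number of pool queries matching them) and weighted by the inverse degree.

```python
import random

def estimate_indexed_size(result_sets, degree, num_samples=1_000):
    """Importance-sampling estimate of the number of documents the pool reaches.

    result_sets : {query: list of matching document ids}
    degree(d)   : number of pool queries that match document d

    Picking a query proportionally to its result size and then a uniform
    document from its results samples document d with probability
    degree(d) / T, where T is the total number of (query, document) matches.
    Weighting each sample by 1 / degree(d) estimates N / T, so multiplying
    by T recovers the document count N.
    """
    queries = list(result_sets)
    weights = [len(result_sets[q]) for q in queries]
    total_matches = sum(weights)
    inv_degrees = []
    for q in random.choices(queries, weights=weights, k=num_samples):
        d = random.choice(result_sets[q])
        inv_degrees.append(1.0 / degree(d))
    return total_matches * sum(inv_degrees) / num_samples

# Toy usage: documents as term sets, three single-term pool queries.
docs = {i: {"a", "b"} if i % 2 else {"a", "c"} for i in range(1_000)}
result_sets = {t: [i for i, terms in docs.items() if t in terms] for t in "abc"}
degree = lambda d: sum(1 for t in "abc" if t in docs[d])
print(round(estimate_indexed_size(result_sets, degree)))  # the true count, 1,000
```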