
    A Comparison of Techniques for Sampling Web Pages

    As the World Wide Web grows rapidly, it is becoming increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively, one has to resort to other techniques, such as sampling, to determine the properties of the web. A uniform random sample of the web would be useful for determining the percentage of web pages in a specific language, on a given topic, or in a top-level domain. Unfortunately, no approach has been shown to sample web pages in an unbiased way. Three promising web sampling algorithms are based on random walks. Each has been evaluated individually, but the evaluations used different data sets, which makes a direct comparison impossible. In this paper we compare these algorithms directly: we performed three random walks on the web under the same conditions and analyzed their outcomes in detail. We discuss the strengths and weaknesses of each algorithm and propose improvements based on experimental results.
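    As an illustration of the random-walk idea described in this abstract, the sketch below runs a degree-corrected (Metropolis-Hastings style) walk over a small in-memory link graph. The toy graph, the walk length, and the acceptance rule are assumptions made for the example; they are not the three algorithms compared in the paper.

```python
import random

def metropolis_hastings_walk(graph, start, steps, rng=random.Random(0)):
    """Random walk with a Metropolis-Hastings acceptance step so that,
    in the limit, pages are visited (approximately) uniformly rather
    than proportionally to their degree.  `graph` maps a page to the
    list of pages it links to (treated here as an undirected graph)."""
    current = start
    samples = []
    for _ in range(steps):
        neighbour = rng.choice(graph[current])
        # Accept the move with probability min(1, deg(current)/deg(neighbour)),
        # which removes the bias toward highly linked pages.
        if rng.random() < min(1.0, len(graph[current]) / len(graph[neighbour])):
            current = neighbour
        samples.append(current)
    return samples

# Toy link graph standing in for crawled web pages (hypothetical hosts).
toy_graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "c.example", "d.example"],
    "c.example": ["a.example", "b.example"],
    "d.example": ["b.example"],
}

visits = metropolis_hastings_walk(toy_graph, "a.example", steps=10_000)
for page in sorted(toy_graph):
    print(page, visits.count(page) / len(visits))
```

    With the acceptance step the visit frequencies should come out roughly equal despite the pages having different degrees; removing the acceptance step reproduces the degree bias that uniform web sampling has to correct for.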

    How much is involved in DB publishing?

    XML has been intensively investigated lately, accompanied by the claim that "XML is (or has become) the standard form for data publishing", especially in the database area. That is, there is an assumption that newly published data mostly take the form of XML documents, particularly when databases are involved. This presumption appears to be the reason for the heavy investment in research on handling, querying and compressing XML documents. We check these assumptions by investigating the documents accessible on the Internet, possibly going under the surface into the "deep Web". The investigation involves analyzing large scientific databases, but commercial data stored in the "deep Web" are handled as well. We used the technique of randomly generated IP addresses to investigate the "deep Web", i.e. the part of the Internet not indexed by search engines. For the part of the Web that is accessed (indexed) by the large search engines, we used the random walk technique to collect uniformly distributed samples. We found that XML has not (yet) become the standard for Web publishing, but it is strongly represented on the Web. We add a simple new evaluation method to the known uniform sampling processes. These investigations can be repeated in the future to obtain a dynamic picture of the rate at which the number of XML documents on the Web is growing.
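    The randomly-generated-IP technique mentioned above can be sketched roughly as follows. This is a hypothetical probe, not the authors' tool: it draws random public IPv4 addresses, requests the root document over HTTP, and counts how many responses advertise an XML content type. The reserved-range filter, the time-out, and the sample size are simplifications for illustration only.

```python
import random
import urllib.error
import urllib.request

def random_public_ipv4(rng=random.Random()):
    """Draw a random IPv4 address, retrying if it falls in a few
    obviously reserved ranges (private, loopback, multicast)."""
    while True:
        octets = [rng.randint(1, 254) for _ in range(4)]
        first = octets[0]
        if first in (10, 127) or first >= 224:
            continue
        if first == 192 and octets[1] == 168:
            continue
        if first == 172 and 16 <= octets[1] <= 31:
            continue
        return ".".join(map(str, octets))

def probe(ip, timeout=3):
    """Return the Content-Type of the root document served at `ip`,
    or None if nothing usable answers on port 80 within the time-out."""
    try:
        with urllib.request.urlopen(f"http://{ip}/", timeout=timeout) as resp:
            return resp.headers.get("Content-Type", "")
    except (urllib.error.URLError, OSError):
        return None

# Count how many random hosts respond, and how many serve XML-ish content.
responders, xmlish = 0, 0
for _ in range(100):                      # tiny sample, for illustration only
    ctype = probe(random_public_ipv4())
    if ctype is not None:
        responders += 1
        if "xml" in ctype.lower():
            xmlish += 1
print(responders, xmlish)
```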

    Network Sampling: From Static to Streaming Graphs

    Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms.
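    As a concrete, assumed instance of one of the three classes named above, the sketch below implements node-based sampling with graph induction on a static graph using networkx. The generated test graph, the sampling fraction, and the clustering-coefficient comparison are illustrative choices, not the paper's experimental setup, and the streaming variants are not shown.

```python
import random
import networkx as nx

def induced_node_sample(graph, fraction, seed=0):
    """Node-based sampling with graph induction: pick a random subset of
    nodes, then keep every edge of the original graph whose endpoints
    both fall in that subset (the induced subgraph)."""
    rng = random.Random(seed)
    k = max(1, int(fraction * graph.number_of_nodes()))
    nodes = rng.sample(list(graph.nodes()), k)
    return graph.subgraph(nodes).copy()

# Compare one topological property of the sample against the full graph.
full = nx.barabasi_albert_graph(n=2_000, m=3, seed=1)
sample = induced_node_sample(full, fraction=0.2)
print("full graph avg clustering:   ", nx.average_clustering(full))
print("sampled graph avg clustering:", nx.average_clustering(sample))
```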

    Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS'09)

    The Symposium on Theoretical Aspects of Computer Science (STACS) is held alternately in France and in Germany. The conference of February 26-28, 2009, held in Freiburg, is the 26th in this series. Previous meetings took place in Paris (1984), Saarbrücken (1985), Orsay (1986), Passau (1987), Bordeaux (1988), Paderborn (1989), Rouen (1990), Hamburg (1991), Cachan (1992), Würzburg (1993), Caen (1994), München (1995), Grenoble (1996), Lübeck (1997), Paris (1998), Trier (1999), Lille (2000), Dresden (2001), Antibes (2002), Berlin (2003), Montpellier (2004), Stuttgart (2005), Marseille (2006), Aachen (2007), and Bordeaux (2008). ...

    The Quality of Probabilistic Search in Unstructured Distributed Information Retrieval Systems

    Searching the web is critical to the Web's success. However, the frequency of searches, together with the size of the index, prohibits a single computer from coping with the computational load. Consequently, a variety of distributed architectures have been proposed. Commercial search engines such as Google usually use an architecture where the index is distributed over a number of disjoint partitions but centrally managed. This centralized architecture has a high capital and operating cost that presents a significant barrier preventing any new competitor from entering the search market. The dominance of a few Web search giants raises concerns about the objectivity of search results and the privacy of the user. A promising solution to eliminate the high cost of entry is to conduct the search on a peer-to-peer (P2P) architecture. Peer-to-peer architectures offer a more geographically dispersed arrangement of machines that are not centrally managed. This has the benefit of not requiring an expensive centralized server facility. However, the lack of centralized management can complicate the communication process, and the storage and computational capabilities of peers may be much lower than those of nodes in a commercial search engine. P2P architectures are commonly categorized into two broad classes, structured and unstructured. Structured architectures guarantee that the entire index is searched for a query, but suffer high communication cost during retrieval and maintenance. In comparison, unstructured architectures do not guarantee that the entire index is searched, but require lower maintenance cost and are more robust to attacks. In this thesis we study the quality of probabilistic search in an unstructured distributed network, since such a network has the potential for developing a low-cost and robust large-scale information retrieval system. Search in an unstructured distributed network is a challenge, since a single machine can normally store only a subset of documents, and a query is sent to only a subset of machines, due to limitations on computational and communication resources. Thus, IR systems built on such a network do not guarantee that a query finds the required documents in the collection, and the search has to be probabilistic and non-deterministic. The search quality is measured by a new metric called accuracy, defined as the fraction of documents retrieved by a constrained, probabilistic search compared with those that would have been retrieved by an exhaustive search. We propose a mathematical framework for modeling search in an unstructured distributed network, and present a non-deterministic distributed search architecture called Probably Approximately Correct (PAC) search. We provide formulas to estimate the search quality based on different system parameters, and show that PAC can achieve good performance when using the same amount of resources as a centrally managed deterministic distributed information retrieval system. We also study the effects of node selection in a centralized PAC architecture. We theoretically and empirically analyze the search performance across query iterations, and show that the search accuracy can be improved by caching good-performing nodes in a centralized PAC architecture. Experiments on a real document collection and query log support our analysis. We then investigate the effects of different document replication policies in a PAC IR system.
We show that the traditional square-root replication policy is not optimal for maximizing accuracy, and give an optimality criterion for accuracy. A non-uniform distribution of documents improves the retrieval performance of popular documents at the expense of less popular documents. To compensate, we propose a hybrid replication policy consisting of a combination of uniform and non-uniform distributions. Theoretical and experimental results show that such an arrangement significantly improves the accuracy of less popular documents at the expense of only a small degradation in accuracy averaged over all queries. We finally explore the effects of query caching in the PAC architecture. We empirically analyze the search performance of queries issued from a query log, and show that the search accuracy can be improved by caching the top-k documents on each node. Simulations on a real document collection and query log support our analysis.
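    The accuracy metric defined in this abstract lends itself to a small Monte-Carlo illustration. The sketch below assumes a simplified PAC-style setting, not the thesis's exact formulas: each node stores a uniform random subset of the collection, a query reaches a uniform random subset of nodes, and accuracy is the fraction of the exhaustive top-k that the contacted nodes collectively hold. All parameter values are placeholders.

```python
import random

def simulate_pac_accuracy(n_docs, n_nodes, docs_per_node, nodes_queried,
                          top_k, trials=100, seed=0):
    """Monte-Carlo estimate of PAC-style accuracy: the expected fraction
    of the exhaustive top-k result list that is present on at least one
    of the nodes a query actually reaches."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each node independently stores a uniform random subset of documents.
        nodes = [set(rng.sample(range(n_docs), docs_per_node))
                 for _ in range(n_nodes)]
        # The exhaustive top-k for this query (document identities are arbitrary here).
        exhaustive_top_k = set(rng.sample(range(n_docs), top_k))
        # The query is sent to a random subset of nodes.
        reached = rng.sample(nodes, nodes_queried)
        found = set().union(*reached) & exhaustive_top_k
        total += len(found) / top_k
    return total / trials

# Toy numbers: 10,000 documents, 100 nodes each holding 5% of the collection,
# a query sent to 20 nodes, accuracy measured on the top 10 documents.
print(simulate_pac_accuracy(n_docs=10_000, n_nodes=100,
                            docs_per_node=500, nodes_queried=20,
                            top_k=10))
```

    Under these toy assumptions the chance that any given top-k document sits on at least one contacted node is 1 - (1 - 500/10000)^20, roughly 0.64, which the simulated accuracy should approximate.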