
    Page-to-processor assignment techniques for parallel crawlers

    In less than a decade, the World Wide Web has evolved from a research project into a cultural phenomenon that touches almost every facet of our society. The growing popularity and usage of the Web has demanded corresponding improvements in the efficiency of the information retrieval techniques used over the net. Crawling is one such technique, used by search engines, web portals, and web caches. A crawler is a program that downloads and stores web pages, generally to feed a search engine or a web repository. To be of use for its target applications, a crawler must download huge amounts of data in a reasonable amount of time. Generally, the high download rates required for efficient crawling cannot be achieved by single-processor systems; thus, existing large-scale applications use multiple parallel processors to solve the crawling problem. Apart from classical parallelization issues such as load balancing and minimization of communication overhead, parallel crawling poses problems such as overlap avoidance and early retrieval of high-quality pages. This thesis addresses parallelization of the crawling task, and its major contribution lies in the partitioning/page-to-processor assignment techniques applied in parallel crawlers. We propose two new page-to-processor assignment techniques based on graph and hypergraph partitioning, which respectively minimize the total communication volume and the number of messages, while balancing the storage load and page download requests of processors. We implemented the proposed models, and our theoretical approaches are supported by empirical findings. We also implemented an efficient parallel crawler that uses the proposed models. (Türk, Ata; M.S. thesis)
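    As a rough illustration of the partitioning idea behind these assignment techniques, the sketch below compares a hash-like assignment of pages to processors with a locality-aware one by counting the inter-processor links each induces. The toy graph, processor count, and function names are illustrative assumptions, not the thesis' actual model or data.

```python
# Minimal sketch: pages are vertices, hyperlinks are edges, and every edge
# whose endpoints land on different processors implies inter-processor
# communication. The graph and assignments below are toy examples.

def evaluate_assignment(links, assignment, num_procs):
    """Return (cut link count, per-processor page counts) for an assignment."""
    cut = 0
    load = [0] * num_procs
    for page, proc in assignment.items():
        load[proc] += 1
    for src, dst in links:
        if assignment[src] != assignment[dst]:
            cut += 1  # this link would require a message between processors
    return cut, load

if __name__ == "__main__":
    # Toy link graph: (source page, target page)
    links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
    pages = sorted({p for edge in links for p in edge})

    # Baseline: hash-like assignment ignores the link structure.
    hashed = {p: ord(p[0]) % 2 for p in pages}

    # Locality-aware assignment keeps tightly linked pages together.
    partitioned = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}

    print("hash-like   :", evaluate_assignment(links, hashed, 2))
    print("partitioned :", evaluate_assignment(links, partitioned, 2))
```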

    Geographical partition for distributed web crawling

    This paper evaluates scalable distributed crawling by means of a geographical partition of the Web. The approach is based on multiple distributed crawlers, each responsible for the pages belonging to one or more previously identified geographical zones. The work considers a distributed crawler in which the assignment of pages to visit is based on the geographical scope of page content. For the initial assignment of a page to a partition, we use a simple heuristic that places a page within the same scope as the geographical location of its hosting web server. During download, if the analysis of a page's contents indicates a different geographical scope, the page is forwarded to the appropriately located server. A sample of Portuguese Web pages, extracted during 2005, was used to evaluate a) page download communication times and b) the overhead of page exchanges among servers. The evaluation results allow our approach to be compared with conventional hash-partitioning strategies. Fundação para a Ciência e a Tecnologia (FCT) - POSI/CHS/41739/200
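    The assignment heuristic described above can be sketched as follows; the zone table, the keyword-based scope classifier, and the host names are stand-ins for a real GeoIP lookup and content analysis, used here only to illustrate the initial-assignment-then-forward behaviour.

```python
# Sketch of the geographical assignment heuristic: a page first inherits the
# zone of its hosting server; if content analysis disagrees, it is forwarded.

from urllib.parse import urlparse

# Stand-in for a GeoIP lookup: hosting location of each web server.
HOST_ZONE = {
    "www.example.pt": "lisbon",
    "www.example-norte.pt": "porto",
}

def initial_zone(url):
    """Initial heuristic: a page inherits the zone of its hosting server."""
    host = urlparse(url).hostname
    return HOST_ZONE.get(host, "unknown")

def content_zone(page_text):
    """Stand-in for content analysis of a page's geographical scope."""
    return "porto" if "porto" in page_text.lower() else "lisbon"

def assign(url, page_text):
    """Return (final_zone, forwarded?) for a downloaded page."""
    guess = initial_zone(url)
    scope = content_zone(page_text)
    # If the content's scope disagrees with the hosting location,
    # forward the page to the crawler responsible for that zone.
    return scope, scope != guess

if __name__ == "__main__":
    print(assign("http://www.example.pt/eventos", "Agenda cultural do Porto"))
    # -> ('porto', True): the page is forwarded to the Porto crawler.
```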

    Managing query and update transactions under quality contracts in Web-databases

    In modern Web-database systems, users typically perform read-only queries, whereas all write-only data updates are performed in the background, concurrently with queries. For most of these services to be successful and their users to be kept satisfied, two criteria need to be met: user requests must be answered in a timely fashion and must return fresh data. This is relatively easy when the system is lightly loaded, since both queries and updates can then be executed quickly. However, this goal becomes hard to achieve in real systems due to the high volumes of queries and updates, especially during periods of flash crowds. In this work, we argue that it is beneficial to allow users to specify their preferences and let the system optimize towards satisfying those preferences, instead of simply improving the average case. We believe that this user-centric approach will enable the system to gracefully handle a broader spectrum of workloads. Towards user-centric Web-databases, we propose a Quality Contracts framework that helps users express their preferences over multiple quality specifications. Moreover, we propose a suite of algorithms to effectively perform load balancing and scheduling for both queries and updates according to user preferences. We evaluate the proposed framework and algorithms through a simulation with real traces from disk accesses and from a stock information website. Finally, to increase the applicability of Quality Contracts-enhanced Web-database systems, we propose an algorithm that helps users adapt to the Web-database system's behavior and maximize their query success ratio.
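    As a rough sketch of contract-driven scheduling in this spirit, the snippet below orders pending queries and updates by a greedy payoff-per-unit-of-work rule; the contract fields and the greedy policy are illustrative assumptions rather than the dissertation's actual algorithms.

```python
# Hedged sketch: each transaction carries a contract-like payoff for
# timeliness and freshness; the scheduler greedily serves the transaction
# with the highest payoff per unit of estimated work.

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Transaction:
    priority: float                               # lower value = served earlier
    name: str = field(compare=False)
    work: float = field(compare=False)            # estimated service time (s)
    timeliness_pay: float = field(compare=False)  # reward for a timely answer
    freshness_pay: float = field(compare=False)   # reward for fresh data

def make_txn(name, work, timeliness_pay, freshness_pay):
    payoff = timeliness_pay + freshness_pay
    # Greedy heuristic: highest payoff per unit of work goes first.
    return Transaction(-payoff / work, name, work, timeliness_pay, freshness_pay)

if __name__ == "__main__":
    queue = [
        make_txn("query:user_A", work=0.2, timeliness_pay=5.0, freshness_pay=1.0),
        make_txn("update:stock_X", work=0.1, timeliness_pay=0.0, freshness_pay=4.0),
        make_txn("query:user_B", work=0.5, timeliness_pay=2.0, freshness_pay=2.0),
    ]
    heapq.heapify(queue)
    while queue:
        txn = heapq.heappop(queue)
        print(f"serving {txn.name} (payoff/work = {-txn.priority:.1f})")
```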

    Peer to Peer Information Retrieval: An Overview

    Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these have seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralised solutions. In this paper we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralised client-server solutions in terms of scalability, performance, user satisfaction and freedom.

    Focused crawling of Web resources as a means of reducing search on the Web

    This paper examines the problem of building a system for monitoring topic-specific Web resources in a corporate environment. A classification of the basic resource traversal algorithms is proposed. A classification of site ranking metrics is developed according to the type of objects on which the evaluation is based. A preliminary assessment of the suitability of focused crawling for this task is carried out.
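    A minimal sketch of the focused-crawl idea referred to above is given below: URLs predicted to be on topic are fetched first. The keyword-overlap relevance score and the topic term set are illustrative stand-ins for the ranking metrics the paper classifies.

```python
# Sketch of a focused-crawl frontier: a priority queue that pops the URL
# whose anchor text looks most relevant to the monitored topic.

import heapq

TOPIC = {"monitoring", "corporate", "web", "resources"}

def relevance(anchor_text):
    words = set(anchor_text.lower().split())
    return len(words & TOPIC) / len(TOPIC)

class FocusedFrontier:
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, anchor_text):
        if url not in self._seen:
            self._seen.add(url)
            # Negate the score so the most relevant URL is popped first.
            heapq.heappush(self._heap, (-relevance(anchor_text), url))

    def pop(self):
        score, url = heapq.heappop(self._heap)
        return url, -score

if __name__ == "__main__":
    frontier = FocusedFrontier()
    frontier.push("http://example.com/a", "corporate web monitoring tools")
    frontier.push("http://example.com/b", "holiday photo gallery")
    print(frontier.pop())  # the on-topic page comes out first
```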

    Models and algorithms for parallel text retrieval

    In the last decade, search engines have become an integral part of our lives. The current state of the art in search engine technology relies on parallel text retrieval. Basically, a parallel text retrieval system is composed of three components: a crawler, an indexer, and a query processor. The crawler component locates, fetches, and stores Web pages in a local document repository. The indexer component converts the stored, unstructured text into a queryable form, most often an inverted index. Finally, the query processing component performs the search over the indexed content. In this thesis, we present models and algorithms for efficient Web crawling and query processing. First, for parallel Web crawling, we propose a hybrid model that aims to minimize the communication overhead among the processors while balancing the number of page download requests and the storage loads of processors. Second, we propose models for document- and term-based inverted index partitioning. In the document-based partitioning model, the number of disk accesses incurred during query processing is minimized while the posting storage is balanced. In the term-based partitioning model, the total amount of communication is minimized while, again, the posting storage is balanced. Finally, we develop and evaluate a large number of algorithms for query processing in ranking-based text retrieval systems. We test the proposed algorithms on our experimental parallel text retrieval system, Skynet, currently running on a 48-node PC cluster. In the thesis, we also discuss the design and implementation details of another, somewhat untraditional, grid-enabled search engine, SE4SEE. Among our practical work, we present the Harbinger text classification system, used in SE4SEE for Web page classification, and the K-PaToH hypergraph partitioning toolkit, to be used in the proposed models. (Cambazoğlu, Berkant Barla; Ph.D. thesis)
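    To make the contrast between the two index partitioning schemes concrete, the sketch below builds a tiny inverted index and splits it document-wise and term-wise; the round-robin splits are illustrative placeholders for the optimized partitioning models the thesis proposes.

```python
# Sketch: document-based partitioning gives each processor its own documents'
# index (a query is broadcast to all), while term-based partitioning gives
# each processor the full posting lists of a subset of terms (a query only
# touches the owners of its terms). The tiny collection is illustrative.

from collections import defaultdict

DOCS = {
    0: "parallel web crawling",
    1: "inverted index partitioning",
    2: "parallel query processing",
}

def build_index(doc_ids):
    index = defaultdict(list)
    for d in doc_ids:
        for term in DOCS[d].split():
            index[term].append(d)
    return dict(index)

def document_based(num_procs):
    # Each processor indexes a disjoint set of documents.
    return [build_index([d for d in DOCS if d % num_procs == p])
            for p in range(num_procs)]

def term_based(num_procs):
    # The global index is split by term.
    full = build_index(DOCS)
    parts = [dict() for _ in range(num_procs)]
    for i, (term, postings) in enumerate(sorted(full.items())):
        parts[i % num_procs][term] = postings
    return parts

if __name__ == "__main__":
    print("document-based:", document_based(2))
    print("term-based    :", term_based(2))
```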

    Hypergraph-theoretic partitioning models for parallel web crawling

    Parallel web crawling is an important technique employed by large-scale search engines for content acquisition. A commonly used inter-processor coordination scheme in parallel crawling systems is the link exchange scheme, where discovered links are communicated between processors. This scheme can attain the coverage and quality level of a serial crawler while avoiding redundant crawling of pages by different processors. The main problem in the exchange scheme is the high inter-processor communication overhead. In this work, we propose a hypergraph model that reduces the communication overhead associated with link exchange operations in parallel web crawling systems by intelligent assignment of sites to processors. Our hypergraph model can correctly capture and minimize the number of network messages exchanged between crawlers. We evaluate the performance of our models on four benchmark datasets. Compared to the traditional hash-based assignment approach, significant performance improvements are observed in reducing the inter-processor communication overhead. © 2012 Springer-Verlag London Limited
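    The message-count objective mentioned in the abstract can be illustrated with the connectivity-1 cut metric used in hypergraph partitioning: a net spanning k processors costs k-1 messages. The simplified net definition and the toy site names below are assumptions for illustration and may differ from the paper's exact model.

```python
# Sketch: each net groups a linking site with the sites it links to; under
# the connectivity-1 metric, a net spanning k processors costs k-1 messages
# (the link source notifies every other processor once).

def messages(nets, assignment):
    """Total connectivity-1 cost of the nets under a site-to-processor map."""
    total = 0
    for net in nets:
        parts = {assignment[site] for site in net}
        total += len(parts) - 1
    return total

if __name__ == "__main__":
    # Inter-site link structure: each set holds a source site and its targets.
    nets = [
        {"news.ex", "sports.ex", "weather.ex"},
        {"blog.ex", "news.ex"},
        {"shop.ex", "blog.ex"},
    ]
    hash_like = {"news.ex": 0, "sports.ex": 1, "weather.ex": 0,
                 "blog.ex": 1, "shop.ex": 0}
    clustered = {"news.ex": 0, "sports.ex": 0, "weather.ex": 0,
                 "blog.ex": 1, "shop.ex": 1}
    print("hash-like assignment :", messages(nets, hash_like), "messages")
    print("clustered assignment :", messages(nets, clustered), "messages")
```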

    Effective web crawlers

    Web crawlers are the component of a search engine that traverses the Web, gathering documents into a local repository for indexing so that they can be ranked by their relevance to user queries. Whenever data is replicated in an autonomously updated environment, there are issues with maintaining up-to-date copies of documents. When documents retrieved by a crawler are subsequently altered on the Web, the effect is an inconsistency in user search results. While the impact depends on the type and volume of change, many existing algorithms do not take the degree of change into consideration, instead using simple measures that treat any change as significant. Furthermore, many crawler evaluation metrics do not consider index freshness or the impact that crawling algorithms have on user results. Most of the existing work makes assumptions about the change rate of documents on the Web, or relies on the availability of a long history of change. Our work investigates approaches to improving index consistency: detecting meaningful change, measuring the impact of a crawl on collection freshness from a user perspective, developing a framework for evaluating crawler performance, determining the effectiveness of stateless crawl ordering schemes, and proposing and evaluating the effectiveness of a dynamic crawl approach. Our work is concerned specifically with cases where there are few or no past change statistics from which predictions can be made. Our work analyses different measures of change and introduces a novel approach to measuring the impact of recrawl schemes on search engine users. Our schemes detect important changes that affect user results; other well-known and widely used schemes have to retrieve around twice as much data to achieve the same effectiveness. Furthermore, while many studies have assumed that the Web changes according to a model, our experimental results are based on real web documents. We analyse various stateless crawl ordering schemes that have no past change statistics with which to predict which documents will change, none of which, to our knowledge, had previously been tested for effectiveness in crawling changed documents. We empirically show that the effectiveness of these schemes depends on the topology and dynamics of the domain crawled and that no single static crawl ordering scheme can effectively maintain freshness, motivating our work on dynamic approaches. We present a novel approach to maintaining freshness, which uses the anchor text linking documents to determine the likelihood of a document changing, based on statistics gathered during the current crawl. We show that this scheme is highly effective when combined with existing stateless schemes. When we combine our scheme with PageRank, the crawler can improve both the freshness and the quality of a collection. Our scheme improves freshness regardless of which stateless scheme it is used in conjunction with, since it uses both positive and negative reinforcement to determine which document to retrieve. Finally, we present the design and implementation of Lara, our own distributed crawler, which we used to develop our testbed.
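    A hedged sketch of how anchor text could serve as a change signal for recrawl ordering appears below: terms that frequently point at documents observed to have changed raise the priority of other documents described by the same terms. The counting scheme is an illustrative simplification, not the thesis' actual algorithm.

```python
# Sketch: per-term change statistics gathered during the current crawl are
# used to estimate how likely a document described by those anchor terms is
# to have changed, and hence how urgently it should be recrawled.

from collections import defaultdict

class AnchorChangeModel:
    def __init__(self):
        self.changed = defaultdict(int)   # term -> fetched targets seen changed
        self.seen = defaultdict(int)      # term -> fetched targets in total

    def reinforce(self, anchor_terms, has_changed):
        # Positive reinforcement when the fetched target changed,
        # negative (denominator only) when it did not.
        for term in anchor_terms:
            self.seen[term] += 1
            if has_changed:
                self.changed[term] += 1

    def change_likelihood(self, anchor_terms):
        rates = [self.changed[t] / self.seen[t]
                 for t in anchor_terms if self.seen[t]]
        return sum(rates) / len(rates) if rates else 0.5  # prior for unseen terms

if __name__ == "__main__":
    model = AnchorChangeModel()
    model.reinforce(["news", "today"], has_changed=True)
    model.reinforce(["news", "latest"], has_changed=True)
    model.reinforce(["about", "contact"], has_changed=False)
    print(model.change_likelihood(["latest", "news"]))   # high -> recrawl soon
    print(model.change_likelihood(["contact"]))          # low  -> defer
```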