1,224 research outputs found

    Document replication strategies for geographically distributed web search engines

    Get PDF
    Large-scale web search engines are composed of multiple data centers that are geographically distant from each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times, as the network latencies between users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on the index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine.
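
    To make the selective-replication idea concrete, here is a minimal sketch of one possible strategy: a greedy rule in which each data center indexes the documents its own region requests most often, up to a fixed capacity. The names (access_count, capacity) and the top-k rule are illustrative assumptions, not the paper's three optimized strategies.

        # Greedy, capacity-constrained replication: each data center keeps the
        # documents its own region requests most often. Illustrative only.
        def replicate(access_count, capacity):
            """access_count: data center -> {doc_id: regional request count};
            capacity: data center -> max number of documents it may index."""
            placement = {}
            for dc, counts in access_count.items():
                # Rank documents by regional interest; keep the top-k that fit.
                ranked = sorted(counts, key=counts.get, reverse=True)
                placement[dc] = set(ranked[:capacity[dc]])
            return placement

        # Example: two data centers, each allowed to index two documents.
        counts = {"eu": {"d1": 90, "d2": 40, "d3": 5},
                  "us": {"d1": 10, "d2": 70, "d3": 60}}
        print(replicate(counts, {"eu": 2, "us": 2}))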

    Towards Distributed Web Mining in Net-Enabled Enterprises

    Get PDF

    A Guide to Distributed Digital Preservation

    Get PDF
    "This volume is devoted to the broad topic of distributed digital preservation, a still-emerging field of practice for the cultural memory arena. Replication and distribution hold out the promise of indefinite preservation of materials without degradation, but establishing effective organizational and technical processes to enable this form of digital preservation is daunting. Institutions need practical examples of how this task can be accomplished in manageable, low-cost ways." -- P. [4] of cover.

    Lumbricus webis: a parallel and distributed crawling architecture for the Italian web

    Get PDF
    Web crawlers have become popular tools for gathering large portions of the web, which can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of the tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (L.webis for short), a modular crawling infrastructure built to mine data from the web domain ccTLD .it and portions of the web reachable from this domain. Its purpose is to support the gathering of advanced statistics and advanced analytic tools on the content of the Italian web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as ".it" in about one week.
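
    As an illustration of the frontier/fetcher split behind such an architecture, here is a minimal, single-process sketch; only the .it scope filter is taken from the abstract, while crawl, fetch, and extract_links are hypothetical names, and a real system would run them as separate distributed modules.

        from collections import deque
        from urllib.parse import urlparse

        def in_scope(url):
            # Restrict the crawl to the .it ccTLD, as L.webis does.
            host = urlparse(url).hostname or ""
            return host.endswith(".it")

        def crawl(seeds, fetch, extract_links, limit=1000):
            """fetch(url) -> page body; extract_links(body) -> iterable of URLs.
            Both are kept abstract here; a real crawler would run them as
            separate, distributable modules."""
            frontier = deque(seeds)
            seen, pages = set(seeds), {}
            while frontier and len(pages) < limit:
                url = frontier.popleft()
                pages[url] = body = fetch(url)
                for link in extract_links(body):
                    if in_scope(link) and link not in seen:
                        seen.add(link)
                        frontier.append(link)
            return pages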

    Independent task assignment for heterogeneous systems

    Get PDF
    Ankara: Department of Computer Engineering and Graduate School of Engineering and Science, Bilkent University, 2013. Thesis (Ph.D.), Bilkent University, 2013. Includes bibliographical references (leaves 136-150). We study the problem of assigning nonuniform tasks onto heterogeneous systems. We investigate two distinct problems in this context. The first problem is the one-dimensional partitioning of nonuniform workload arrays with optimal load balancing. The second problem is the assignment of nonuniform independent tasks onto heterogeneous systems. For one-dimensional partitioning of nonuniform workload arrays, we investigate two cases: chain-on-chain partitioning (CCP), where the order of the processors is specified, and chain partitioning (CP), where processor permutation is allowed. We present polynomial-time algorithms to solve the CCP problem optimally, while we prove that the CP problem is NP-complete. Our empirical studies show that our proposed exact algorithms for the CCP problem produce substantially better results than the state-of-the-art heuristics while the solution times remain comparable. For the independent task assignment problem, we investigate improving the performance of the well-known and widely used constructive heuristics MinMin, MaxMin, and Sufferage. All three heuristics are known to run in O(KN²) time in assigning N tasks to K processors. In this thesis, we present our work on an algorithmic improvement that asymptotically decreases the running time complexity of MinMin to O(KN log N) without affecting its solution quality. Furthermore, we combine the newly proposed MinMin algorithm with MaxMin as well as Sufferage, obtaining two hybrid algorithms. The motivation behind the former hybrid algorithm is to address the drawback of MaxMin in solving problem instances with highly skewed cost distributions while also improving the running time performance of MaxMin. The latter hybrid algorithm improves the running time performance of Sufferage without degrading its solution quality. The proposed algorithms are easy to implement and we illustrate them through detailed pseudocodes. The experimental results over a large number of real-life datasets show that the proposed fast MinMin algorithm and the proposed hybrid algorithms perform significantly better than their traditional counterparts as well as more recent state-of-the-art assignment heuristics. For the large datasets used in the experiments, MinMin, MaxMin, and Sufferage, as well as recent state-of-the-art heuristics, require days, weeks, or even months to produce a solution, whereas all of the proposed algorithms produce solutions within only two or three minutes. For the independent task assignment problem, we also investigate adopting the multi-level framework, which has been successfully utilized in several applications, including graph and hypergraph partitioning. For the coarsening phase of the multi-level framework, we present an efficient matching algorithm which runs in O(KN) time in most cases. For the uncoarsening phase, we present two refinement algorithms: an efficient O(KN)-time move-based refinement and an efficient O(K²N log N)-time swap-based refinement. Our results indicate that the multi-level approach improves the quality of task assignments, while also improving the running time performance, especially for large datasets.
As a realistic distributed application of the independent task assignment problem, we introduce the site-to-crawler assignment problem, where a large number of geographically distributed web servers are crawled by a multi-site distributed crawling system and the objective is to minimize the duration of the crawl. We show that this problem can be modeled as an independent task assignment problem. As a solution to the problem, we evaluate a large number of state-of-the-art task assignment heuristics selected from the literature, as well as our improved versions and the newly developed multi-level task assignment algorithm. We compare the performance of different approaches through simulations on very large, real-life web datasets. Our results indicate that multi-site web crawling efficiency can be considerably improved using the independent task assignment approach, when compared to relatively easy-to-implement, yet naive baselines.
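
    As a reference point for the heuristics discussed above, here is a sketch of the classic O(KN²) MinMin loop; the cost-matrix layout and the name min_min are illustrative, and the thesis's O(KN log N) variant and the hybrid algorithms are not reproduced here.

        # Classic MinMin: repeatedly assign the task whose best completion
        # time over all processors is smallest, then update that processor.
        def min_min(cost):
            """cost[t][p]: execution time of task t on processor p.
            Returns (task -> processor assignment, per-processor finish times)."""
            n, k = len(cost), len(cost[0])
            ready = [0.0] * k                # current load of each processor
            unassigned = set(range(n))
            assignment = [None] * n
            while unassigned:
                # Scan all (task, processor) pairs: O(K * |unassigned|) per step.
                finish, task, proc = min(
                    (ready[p] + cost[t][p], t, p)
                    for t in unassigned for p in range(k))
                assignment[task] = proc
                ready[proc] = finish
                unassigned.remove(task)
            return assignment, ready

        # Example: 3 tasks on 2 heterogeneous processors.
        print(min_min([[3, 5], [2, 9], [4, 4]]))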

    Review Analysis of Automated Mobile Application Testing

    Get PDF
    Software testing is an essential task for validating and verifying software correctness and completeness. Before a product is released to customers, a set of activities is carried out with the intent of finding errors. Testing a mobile application involves additional testing from the viewpoint of its usability and consistency. Mobile app testing demands the stepwise, orderly detection of specific classes of errors with minimal time and effort. Choosing the best-suited testing techniques for an individual mobile application is an art. As testing is also used to evaluate software quality, choosing a test strategy for a mobile app becomes significant. This paper reviews various aspects of mobile app testing, covering automated testing, testing tools, and challenges. It also provides direction for selecting the best strategy for mobile app testing.

    Reducing Electricity Demand Charge for Data Centers with Partial Execution

    Full text link
    Data centers consume a large amount of energy and incur substantial electricity costs. In this paper, we study the familiar problem of reducing data center energy cost from two new perspectives. First, we find, through an empirical study of contracts from the electric utilities powering Google data centers, that the demand charge per kW for the maximum power used is a major component of the total cost. Second, many services such as Web search tolerate partial execution of requests because response quality is a concave function of processing time. Data from the Microsoft Bing search engine confirms this observation. We propose the simple idea of using partial execution to reduce the peak power demand and energy cost of data centers. We systematically study the problem of scheduling partial execution with stringent SLAs on response quality. For a single data center, we derive an optimal algorithm to solve the workload scheduling problem. In the case of multiple geo-distributed data centers, the demand of each data center is controlled by the request routing algorithm, which makes the problem much more involved. We decouple the two aspects and develop a distributed optimization algorithm to solve the large-scale request routing problem. Trace-driven simulations show that partial execution reduces cost by 3%-10.5% for one data center, and by 15.5% for geo-distributed data centers together with request routing.
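
    A toy sketch of the partial-execution idea: when each request's quality is concave in processing time, greedily spending small time slices where the marginal quality gain is largest maximizes total quality under a capacity budget (used here as a stand-in for the peak-power cap). The function names, the shared quality curve, and the slice granularity are assumptions; this is not the paper's optimal algorithm.

        import heapq

        def allocate(n, quality, budget, t_max, step=1):
            """Allocate processing time to n requests, each with the same
            concave quality(t) curve, keeping total time within budget."""
            alloc = [0] * n
            # Max-heap (negated gains) of each request's next-slice gain.
            heap = [(-(quality(step) - quality(0)), i) for i in range(n)]
            heapq.heapify(heap)
            while budget >= step and heap:
                gain, i = heapq.heappop(heap)
                if -gain <= 0:
                    break                       # nothing left worth scheduling
                alloc[i] += step
                budget -= step
                if alloc[i] + step <= t_max:    # offer this request's next slice
                    nxt = quality(alloc[i] + step) - quality(alloc[i])
                    heapq.heappush(heap, (-nxt, i))
            return alloc

        # Example: 3 requests, diminishing-returns quality, budget of 10 slices.
        print(allocate(3, lambda t: t ** 0.5, budget=10, t_max=8))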

    A bimodal accessibility analysis of Australia using web-based resources

    Get PDF
    A range of potentially disruptive changes to research strategies have been taking root in the field of transport research. Many of these relate to the emergence of data sources and travel applications reshaping how we conduct accessibility analyses. This paper, based on Meire et al. (in press) and Meire and Derudder (under review), aims to explore the potential of some of these data sources by focusing on a concrete example: we introduce a framework for (road and air) transport data extraction and processing using publicly available web-based resources that can be accessed via web Application Programming Interfaces (APIs), illustrated by a case study evaluating the combined land- and airside accessibility of Australia at the level of statistical units. Given that car and air travel (or a combination thereof) are so dominant in the production of Australia’s accessibility landscape, a systematic bimodal accessibility analysis based on the automated extraction of web-based data shows the practical value of our research framework. With regard to our case study, results show a largely expected accessibility pattern centred on major agglomerations, supplemented by a number of idiosyncratic and perhaps less expected geographical patterns. Beyond the lessons learned from our case study, we show some of the major strengths and limitations of web-based data accessed via web APIs for transport-related research topics.
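
    The extraction step of such a framework reduces to parameterized HTTP requests against a routing web API. A hedged sketch follows; the endpoint URL, query parameters, and response field below are hypothetical placeholders rather than the actual services used in the study.

        import requests

        def travel_time(origin, destination, api_url, api_key):
            """Query a (hypothetical) routing web API for the driving time
            between two (lat, lon) points."""
            resp = requests.get(api_url, params={
                "origin": f"{origin[0]},{origin[1]}",
                "destination": f"{destination[0]},{destination[1]}",
                "key": api_key,
            }, timeout=30)
            resp.raise_for_status()
            return resp.json()["duration_seconds"]   # assumed response field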

    Preserving Our Collections, Preserving Our Missions

    Get PDF
    A Guide to Distributed Digital Preservation is intentionally structured such that every chapter can stand on its own or be paired with other segments of the book at will, allowing readers to pick their own pathway through the guide as best suits their needs. This approach has necessitated that the authors and editors include some level of repetition of basic principles across chapters, and has also made the Glossary (included at the back of this guide) an essential reference resource for all readers. This guide is written with a broad audience in mind that includes librarians, curators, archivists, scholars, technologists, lawyers, and administrators. Any resourceful reader should be able to use this guide to gain both a philosophical and practical understanding of the emerging field of distributed digital preservation (DDP), including how to establish or join a Private LOCKSS Network (PLN).