Document replication strategies for geographically distributed web search engines
Large-scale web search engines are composed of multiple data centers that are geographically distant from each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times, as the network latencies between users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on the index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine. © 2012 Elsevier Ltd. All rights reserved.
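The selective-replication idea can be illustrated with a toy greedy sketch: each data center keeps only the documents most demanded by its own region, up to an index-size cap. This is an assumed simplification for illustration (a single uniform capacity and raw hit counts as the interest signal), not one of the paper's three strategies.

```python
from collections import Counter

def replicate(regional_hits, capacity):
    """Greedily pick, per data center, the documents its own users
    request most often, up to a fixed index-size cap (assumed model).

    regional_hits: {center: Counter({doc_id: hit_count})}
    capacity: max number of documents replicated per center
    """
    replicas = {}
    for center, hits in regional_hits.items():
        # most_common sorts documents by regional demand, highest first
        replicas[center] = {doc for doc, _ in hits.most_common(capacity)}
    return replicas

# Hypothetical regional query statistics for two data centers
hits = {
    "us-east": Counter({"d1": 90, "d2": 40, "d3": 5}),
    "eu-west": Counter({"d3": 70, "d1": 10, "d4": 60}),
}
print(replicate(hits, 2))
```

A real strategy would also account for query forwarding between centers and result caching, which the paper evaluates separately.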
A Guide to Distributed Digital Preservation
This volume is devoted to the broad topic of distributed digital preservation, a still-emerging field of practice for the cultural memory arena. Replication and distribution hold out the promise of indefinite preservation of materials without degradation, but establishing effective organizational and technical processes to enable this form of digital preservation is daunting. Institutions need practical examples of how this task can be accomplished in manageable, low-cost ways.
Lumbricus webis: a parallel and distributed crawling architecture for the Italian web
Web crawlers have become popular tools for gathering large portions of the web, which can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper, we describe Lumbricus webis (L.webis for short), a modular crawling infrastructure built to mine data from the web domain of the ccTLD .it and the portions of the web reachable from this domain. Its purpose is to support the gathering of advanced statistics and analytic tools on the content of the Italian Web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as ".it" in about one week.
Independent task assignment for heterogeneous systems
Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2013. Thesis (Ph.D.) -- Bilkent University, 2013. Includes bibliographical references (leaves 136-150).

We study the problem of assigning nonuniform tasks onto heterogeneous systems.
We investigate two distinct problems in this context. The first problem is the
one-dimensional partitioning of nonuniform workload arrays with optimal load
balancing. The second problem is the assignment of nonuniform independent
tasks onto heterogeneous systems.
For one-dimensional partitioning of nonuniform workload arrays, we investigate
two cases: chain-on-chain partitioning (CCP), where the order of the processors
is specified, and chain partitioning (CP), where processor permutation
is allowed. We present polynomial-time algorithms to solve the CCP problem
optimally, while we prove that the CP problem is NP-complete. Our empirical
studies show that our proposed exact algorithms for the CCP problem produce
substantially better results than the state-of-the-art heuristics while the solution
times remain comparable.
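For readers unfamiliar with CCP, the standard probe-plus-search approach can be sketched for the simpler homogeneous case. Uniform processors are an assumption made here for brevity; the thesis targets heterogeneous systems, and its exact algorithms differ.

```python
def probe(w, k, b):
    """Decide whether the chain w can be split into at most k
    consecutive parts, each with total load <= b (greedy fill)."""
    parts, load = 1, 0
    for x in w:
        if x > b:
            return False          # a single task already exceeds b
        if load + x <= b:
            load += x             # task fits in the current part
        else:
            parts += 1            # start a new part with this task
            load = x
    return parts <= k

def ccp_bottleneck(w, k):
    """Minimal bottleneck load for chain-on-chain partitioning,
    found by binary search over the feasible bottleneck values."""
    lo, hi = max(w), sum(w)
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(w, k, mid):
            hi = mid              # mid is achievable; try smaller
        else:
            lo = mid + 1          # mid is too tight
    return lo

print(ccp_bottleneck([3, 1, 4, 1, 5, 9, 2, 6], 3))  # → 14
```

The probe runs in O(N) time, so the whole search costs O(N log(sum of loads)); the exact algorithms in the thesis remove the dependence on the load values.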
For the independent task assignment problem, we investigate improving the
performance of the well-known and widely used constructive heuristics MinMin,
MaxMin, and Sufferage. All three heuristics are known to run in O(KN^2) time in
assigning N tasks to K processors. In this thesis, we present our work on an algorithmic
improvement that asymptotically decreases the running time complexity
of MinMin to O(KN log N) without affecting its solution quality. Furthermore,
we combine the newly proposed MinMin algorithm with MaxMin as well as Sufferage,
obtaining two hybrid algorithms. The motivation behind the former hybrid
algorithm is to address the drawback of MaxMin in solving problem instances
with highly skewed cost distributions while also improving the running time performance
of MaxMin. The latter hybrid algorithm improves the running time
performance of Sufferage without degrading its solution quality. The proposed
algorithms are easy to implement and we illustrate them through detailed pseudocodes.
The experimental results over a large number of real-life datasets show
that the proposed fast MinMin algorithm and the proposed hybrid algorithms
perform significantly better than their traditional counterparts as well as more
recent state-of-the-art assignment heuristics. For the large datasets used in the
experiments, MinMin, MaxMin, and Sufferage, as well as recent state-of-the-art
heuristics, require days, weeks, or even months to produce a solution, whereas all
of the proposed algorithms produce solutions within only two or three minutes.
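As context for the improvement, the classic MinMin heuristic that the thesis accelerates can be sketched in its textbook O(KN^2) form. The cost matrix below is illustrative, not taken from the thesis.

```python
def min_min(cost):
    """Classic MinMin heuristic, O(K * N^2).

    cost[i][j]: execution time of task i on processor j.
    Repeatedly commit the task whose earliest possible completion
    time is smallest. Returns the assignment and the makespan.
    """
    n, k = len(cost), len(cost[0])
    ready = [0.0] * k                 # current finish time per processor
    unassigned = set(range(n))
    assignment = {}
    while unassigned:
        # Earliest completion time over all (task, processor) pairs
        finish, i, j = min(
            (ready[j] + cost[i][j], i, j)
            for i in unassigned for j in range(k)
        )
        assignment[i] = j
        ready[j] = finish
        unassigned.remove(i)
    return assignment, max(ready)

# Three tasks, two processors (hypothetical expected-time-to-compute matrix)
cost = [[3, 5], [2, 4], [6, 1]]
print(min_min(cost))
```

The thesis's contribution is to reach the same assignments in O(KN log N) by avoiding the full rescan of remaining (task, processor) pairs at every step.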
For the independent task assignment problem, we also investigate adopting
the multi-level framework which was successfully utilized in several applications
including graph and hypergraph partitioning. For the coarsening phase of the
multi-level framework, we present an efficient matching algorithm which runs in
O(KN) time in most cases. For the uncoarsening phase, we present two refinement
algorithms: an efficient O(KN)-time move-based refinement and an efficient
O(K^2 N log N)-time swap-based refinement. Our results indicate that the multi-level
approach improves the quality of task assignments, while also improving the running
time performance, especially for large datasets.
As a realistic distributed application of the independent task assignment problem,
we introduce the site-to-crawler assignment problem, where a large number
of geographically distributed web servers are crawled by a multi-site distributed
crawling system and the objective is to minimize the duration of the crawl. We
show that this problem can be modeled as an independent task assignment problem.
As a solution to the problem, we evaluate a large number of state-of-the-art
task assignment heuristics selected from the literature as well as the improved
versions and the newly developed multi-level task assignment algorithm. We
compare the performance of different approaches through simulations on very
large, real-life web datasets. Our results indicate that multi-site web crawling
efficiency can be considerably improved using the independent task assignment
approach, when compared to relatively easy-to-implement, yet naive baselines.

Tabak, E. Kartal, Ph.D.
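A minimal sketch of the modeling step: under assumed inputs (per-site page counts and per-pair download rates), an expected-time-to-crawl matrix pairs naturally with a greedy baseline of the easy-to-implement kind the thesis compares against. None of this reproduces the thesis's own algorithms.

```python
def site_to_crawler(pages, rate):
    """Model multi-site crawling as independent task assignment:
    crawling site i on crawler j takes pages[i] / rate[i][j] time.
    Greedy baseline: take sites largest-first and give each to the
    crawler that would finish it earliest; minimize the makespan.
    """
    k = len(rate[0])
    finish = [0.0] * k                       # committed work per crawler
    assign = {}
    for i in sorted(range(len(pages)), key=lambda i: -pages[i]):
        j = min(range(k), key=lambda j: finish[j] + pages[i] / rate[i][j])
        assign[i] = j
        finish[j] += pages[i] / rate[i][j]
    return assign, max(finish)

# Hypothetical inputs: three sites, two crawl centers
pages = [100, 50, 80]                        # pages hosted per site
rate = [[10, 5], [5, 10], [8, 8]]            # pages/sec per (site, crawler)
print(site_to_crawler(pages, rate))
```

The makespan objective here is exactly the crawl duration the abstract says the thesis minimizes; the improved MinMin variants and the multi-level algorithm replace this naive greedy step.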
Review Analysis of Automated Mobile Application Testing
Software testing is an essential task to validate and verify software correctness and completeness. Before a product is released to the customer, a set of activities is carried out with the intent of finding errors. Testing a mobile application involves additional testing from the viewpoint of its usability and consistency. Mobile app testing demands stepwise and orderly detection of specific classes of errors with less time and effort. Choosing the best-suited testing techniques for individual mobile applications is an art. As testing is also used to evaluate software quality, choosing a test strategy for a mobile app becomes significant. This paper reviews various aspects of mobile app testing, covering automated testing, testing tools, and challenges. It also provides direction for selecting the best strategy for mobile app testing.
Reducing Electricity Demand Charge for Data Centers with Partial Execution
Data centers consume a large amount of energy and incur substantial
electricity cost. In this paper, we study the familiar problem of reducing data
center energy cost with two new perspectives. First, we find, through an
empirical study of contracts from electric utilities powering Google data
centers, that demand charge per kW for the maximum power used is a major
component of the total cost. Second, many services such as Web search tolerate
partial execution of the requests because the response quality is a concave
function of processing time. Data from the Microsoft Bing search engine confirms
this observation.
We propose a simple idea of using partial execution to reduce the peak power
demand and energy cost of data centers. We systematically study the problem of
scheduling partial execution with stringent SLAs on response quality. For a
single data center, we derive an optimal algorithm to solve the workload
scheduling problem. In the case of multiple geo-distributed data centers, the
demand of each data center is controlled by the request routing algorithm,
which makes the problem much more involved. We decouple the two aspects, and
develop a distributed optimization algorithm to solve the large-scale request
routing problem. Trace-driven simulations show that partial execution reduces
cost for a single data center, and for geo-distributed data centers together
with request routing.
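The optimal algorithm itself is not given in the abstract; the following toy sketch only illustrates the partial-execution trade-off, assuming a square-root quality curve and an equal split of a per-interval processing budget. Both assumptions are the author's here, not the paper's.

```python
import math

def allocate(n_requests, budget, t_full, q_min):
    """Split a per-interval processing budget equally across requests,
    trimming each request's processing time (partial execution) while
    keeping response quality above the SLA floor.

    Assumed concave quality model: q(t) = sqrt(t / t_full).
    Returns (time per request, quality per request), or None when
    the SLA cannot be met within the budget.
    """
    t_min = q_min ** 2 * t_full          # least time that still meets q_min
    t = min(t_full, budget / n_requests) # equal split, capped at full time
    if t < t_min:
        return None                      # budget too tight for the SLA
    return t, math.sqrt(t / t_full)

print(allocate(n_requests=100, budget=50.0, t_full=1.0, q_min=0.6))
```

Because the quality curve is concave, trimming every request a little costs far less quality than fully dropping a few requests, which is why partial execution can shave the peak power demand that drives the demand charge.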
A bimodal accessibility analysis of Australia using web-based resources
A range of potentially disruptive changes to research strategies have been taking root
in the field of transport research. Many of these relate to the emergence of data sources and
travel applications reshaping how we conduct accessibility analyses. This paper, based on
Meire et al. (in press) and Meire and Derudder (under review), aims to explore the potential of
some of these data sources by focusing on a concrete example: we introduce a framework for
(road and air) transport data extraction and processing using publicly available web-based
resources that can be accessed via web Application Programming Interfaces (APIs), illustrated
by a case study evaluating the combined land- and airside accessibility of Australia at the level
of statistical units. Given that car and air travel (or a combination thereof) are so dominant in
the production of Australia’s accessibility landscape, a systematic bimodal accessibility
analysis based on the automated extraction of web-based data shows the practical value of our
research framework. With regard to our case study, the results show a largely expected
accessibility pattern centred on major agglomerations, supplemented by a number of
idiosyncratic and perhaps less expected geographical patterns. Beyond the lessons learned
from our case study, we show some of the major strengths and limitations of web-based data
accessed via web APIs for transport-related research topics.
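A sketch of the kind of computation such API-extracted travel times feed: a gravity-style bimodal accessibility index, where each origin scores destinations by the faster of road and air travel time. The index form and all figures below are assumptions for illustration, not the paper's exact formulation.

```python
def accessibility(times_road, times_air, population):
    """Bimodal potential accessibility per origin: for each
    destination, take the faster of road and air travel time,
    then sum population weighted by inverse travel time.
    """
    index = {}
    for origin in times_road:
        total = 0.0
        for dest, pop in population.items():
            if dest == origin:
                continue
            # bimodal: the traveler picks the faster mode
            t = min(times_road[origin][dest], times_air[origin][dest])
            total += pop / t
        index[origin] = total
    return index

# Hypothetical travel times in hours and populations in millions
times_road = {"SYD": {"MEL": 9, "PER": 41}, "PER": {"MEL": 34, "SYD": 41}}
times_air = {"SYD": {"MEL": 1.5, "PER": 5}, "PER": {"MEL": 3.5, "SYD": 5}}
population = {"SYD": 5.3, "MEL": 5.0, "PER": 2.1}
print(accessibility(times_road, times_air, population))
```

In the paper's setting, the two travel-time dictionaries would be populated automatically from web APIs rather than entered by hand.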
Preserving Our Collections, Preserving Our Missions
A Guide to Distributed Digital Preservation is intentionally structured such that every chapter can stand on its own or be paired with other segments of the book at will, allowing readers to pick their own pathway through the guide as best suits their needs. This approach has necessitated that the authors and editors include some level of repetition of basic principles across chapters, and has also made the Glossary (included at the back of this guide) an essential reference resource for all readers. This guide is written with a broad audience in mind that includes librarians, curators, archivists, scholars, technologists, lawyers, and administrators. Any resourceful reader should be able to use this guide to gain both a philosophical and practical understanding of the emerging field of distributed digital preservation (DDP), including how to establish or join a Private LOCKSS Network (PLN).