    i2MapReduce: Incremental MapReduce for Mining Evolving Big Data

    As new data and updates are constantly arriving, the results of data mining applications become stale and obsolete over time. Incremental processing is a promising approach to refreshing mining results. It utilizes previously saved states to avoid the expense of re-computation from scratch. In this paper, we propose i2MapReduce, a novel incremental processing extension to MapReduce, the most widely used framework for mining big data. Compared with the state-of-the-art system Incoop, i2MapReduce (i) performs key-value pair level incremental processing rather than task level re-computation, (ii) supports not only one-step computation but also more sophisticated iterative computation, which is widely used in data mining applications, and (iii) incorporates a set of novel techniques to reduce I/O overhead for accessing preserved fine-grain computation states. We evaluate i2MapReduce using a one-step algorithm and three iterative algorithms with diverse computation characteristics. Experimental results on Amazon EC2 show significant performance improvements of i2MapReduce compared to both plain and iterative MapReduce performing re-computation.
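    The key-value pair level idea can be illustrated with a minimal Python sketch. This is not the authors' implementation; it only assumes that the framework preserves the grouped intermediate values per key, so that an update batch triggers re-reduction only of the keys it touches.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user map function and group intermediate pairs by key."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return grouped

def incremental_reduce(state, delta_records, map_fn, reduce_fn):
    """Re-reduce only the keys touched by the delta batch, reusing the
    preserved per-key intermediate values instead of recomputing all keys."""
    delta = map_phase(delta_records, map_fn)
    for key, new_values in delta.items():
        state[key].extend(new_values)          # merge fine-grain state
    return {key: reduce_fn(key, state[key]) for key in delta}

if __name__ == "__main__":
    word_map = lambda line: [(w, 1) for w in line.split()]
    word_reduce = lambda key, values: sum(values)

    state = map_phase(["big data mining", "incremental mapreduce"], word_map)
    print({k: word_reduce(k, v) for k, v in state.items()})   # initial run
    # only 'incremental', 'big' and 'data' are re-reduced for the update batch
    print(incremental_reduce(state, ["incremental big data"], word_map, word_reduce))
```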

    Change Rate Estimation and Optimal Freshness in Web Page Crawling

    For providing quick and accurate results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler for tracking changes across various web pages. However, finite bandwidth availability and server restrictions impose some constraints on the crawling frequency. Consequently, the ideal crawling rates are the ones that maximise the freshness of the local cache and also respect the above constraints. Azar et al. (2018) recently proposed a tractable algorithm to solve this optimisation problem. However, they assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide two novel schemes for online estimation of page change rates. Both schemes only need partial information about the page change process, i.e., they only need to know whether the page has changed or not since the last crawled instance. For both these schemes, we prove convergence and also derive their convergence rates. Finally, we provide some numerical experiments to compare the performance of our proposed estimators with the existing ones (e.g., MLE). Comment: This paper has been accepted to the 13th EAI International Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS'20, May 18-20, 2020, Tsukuba, Japan. This is the author version of the paper.
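    For context, the classical baseline mentioned at the end of the abstract (the MLE) has a closed form under the common assumptions of Poisson page changes and a fixed crawl interval. The sketch below implements that baseline, not the two new online schemes proposed in the paper, and the Poisson/equal-interval assumptions are ours.

```python
import math

def estimate_change_rate(change_indicators, crawl_interval):
    """MLE of a Poisson page-change rate from binary 'changed since the last
    crawl' observations taken at a fixed crawl interval.

    P(no change within one interval) = exp(-rate * interval), so with k
    observed changes out of n crawls the MLE is -ln(1 - k/n) / interval.
    """
    n = len(change_indicators)
    k = sum(change_indicators)
    if k == n:                      # every crawl saw a change: MLE diverges
        return float("inf")
    return -math.log(1.0 - k / n) / crawl_interval

# example: 100 crawls one day apart, page found changed on 40 of them
print(estimate_change_rate([1] * 40 + [0] * 60, crawl_interval=1.0))  # ~0.51 changes/day
```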

    An Optimal Trade-off between Content Freshness and Refresh Cost

    Caching is an effective mechanism for reducing bandwidth usage and alleviating server load. However, the use of caching entails a compromise between content freshness and refresh cost. An excessive refresh rate yields a high degree of content freshness but at a greater cost in system resources. Conversely, a deficient refresh rate inhibits content freshness but saves resource usage. To address the freshness-cost problem, we formulate the refresh scheduling problem with a generic cost model and use this cost model to determine an optimal refresh frequency that gives the best trade-off between refresh cost and content freshness. We prove the existence and uniqueness of an optimal refresh frequency under the assumptions that content updates arrive according to a Poisson process and that the age-related cost monotonically increases with decreasing freshness. In addition, we provide an analytic comparison of system performance under fixed refresh scheduling and random refresh scheduling, showing that with the same average refresh frequency the two refresh schedulings are mathematically equivalent in terms of the long-run average cost.
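    As an illustration of how such an optimal refresh frequency can be located numerically, the sketch below assumes one concrete instance of the cost model: Poisson content updates at a known rate, a fixed refresh interval T, a constant cost per refresh, and an age-related cost proportional to the fraction of time the cached copy is stale. The function names and the grid search are ours; the paper works with a more generic cost model and proves existence and uniqueness analytically.

```python
import math

def long_run_cost(T, update_rate, refresh_cost, staleness_cost):
    """Average cost per unit time for a fixed refresh interval T, assuming
    Poisson(update_rate) content updates and a cost linear in the fraction
    of time the cached copy is stale during a refresh cycle."""
    fresh_fraction = (1.0 - math.exp(-update_rate * T)) / (update_rate * T)
    return refresh_cost / T + staleness_cost * (1.0 - fresh_fraction)

def optimal_interval(update_rate, refresh_cost, staleness_cost,
                     t_min=1e-3, t_max=100.0, steps=100_000):
    """Grid search for the refresh interval minimising the long-run cost."""
    grid = (t_min + i * (t_max - t_min) / steps for i in range(steps + 1))
    return min(grid, key=lambda T: long_run_cost(T, update_rate,
                                                 refresh_cost, staleness_cost))

# example: updates arrive once per hour on average; one refresh costs 1 unit,
# a fully stale hour costs 5 units
T_star = optimal_interval(update_rate=1.0, refresh_cost=1.0, staleness_cost=5.0)
print(T_star, long_run_cost(T_star, 1.0, 1.0, 5.0))
```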

    Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation

    A fundamental problem arising in many applications in Web science and social network analysis is, given an arbitrary approximation factor c > 1, to output a set S of nodes that with high probability contains all nodes of PageRank at least Δ, and no node of PageRank smaller than Δ/c. We call this problem SignificantPageRanks. We develop a nearly optimal, local algorithm for the problem with runtime complexity Õ(n/Δ) on networks with n nodes. We show that any algorithm for solving this problem must have runtime of Ω(n/Δ), rendering our algorithm optimal up to logarithmic factors. Our algorithm comes with two main technical contributions. The first is a multi-scale sampling scheme for a basic matrix problem that could be of interest on its own. In the abstract matrix problem it is assumed that one can access an unknown right-stochastic matrix by querying its rows, where the cost of a query and the accuracy of the answers depend on a precision parameter ε. At a cost proportional to 1/ε, the query will return a list of O(1/ε) entries and their indices that provide an ε-precision approximation of the row. Our task is to find a set that contains all columns whose sum is at least Δ, and omits any column whose sum is less than Δ/c. Our multi-scale sampling scheme solves this problem with cost Õ(n/Δ), while traditional sampling algorithms would take time Θ((n/Δ)²). Our second main technical contribution is a new local algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed in [JehW03, AndersenCL06] and is highly efficient particularly for networks with large in-degrees or out-degrees. Together with our multi-scale sampling scheme we are able to optimally solve the SignificantPageRanks problem. Comment: Accepted to Internet Mathematics journal for publication. An extended abstract of this paper appeared in WAW 2012 under the title "A Sublinear Time Algorithm for PageRank Computations".
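    For readers unfamiliar with local PageRank approximation, the sketch below shows a push-style scheme in the spirit of [AndersenCL06], i.e., one of the earlier algorithms that the paper's more robust local algorithm improves upon. The undirected adjacency representation, the parameter values, and the queue bookkeeping are our simplifications, not the paper's construction.

```python
from collections import defaultdict, deque

def approximate_ppr(graph, seed, alpha=0.15, eps=1e-4):
    """Local 'push' approximation of personalized PageRank from a seed node.

    graph: dict mapping every node to a list of its neighbours. Returns a
    sparse dict of PPR estimates; only nodes near the seed are ever touched,
    which is what makes the computation local rather than whole-graph.
    """
    p = defaultdict(float)               # current PPR estimate
    r = defaultdict(float, {seed: 1.0})  # residual probability mass
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg = len(graph[u])
        if deg == 0 or r[u] < eps * deg:
            continue                     # nothing (or too little) left to push
        push_mass, r[u] = r[u], 0.0
        p[u] += alpha * push_mass
        share = (1.0 - alpha) * push_mass / deg
        for v in graph[u]:
            old = r[v]
            r[v] += share
            if old < eps * len(graph[v]) <= r[v]:   # v newly exceeds threshold
                queue.append(v)
    return dict(p)

# tiny example graph
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(approximate_ppr(g, seed=0))
```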

    Test Generation and Dependency Analysis for Web Applications

    In web application testing, existing model-based web test generators derive test paths from a navigation model of the web application, completed with either manually or randomly generated inputs. Test path extraction and input generation are handled separately, ignoring the fact that generating inputs for test paths is difficult or even impossible if such paths are infeasible. In this thesis, we propose three directions to mitigate the path infeasibility problem. The first direction uses a search-based approach defining a novel set of genetic operators that support the joint generation of test inputs and feasible test paths. Results show that such a search-based approach can achieve a higher level of model coverage than existing approaches. Secondly, we propose a novel web test generation algorithm that pre-selects the most promising candidate test cases based on their diversity from previously generated tests. Results of our empirical evaluation show that promoting diversity is beneficial not only to a thorough exploration of the web application behaviours, but also to the feasibility of automatically generated test cases. Moreover, the diversity-based approach achieves higher coverage of the navigation model significantly faster than crawling-based and search-based approaches. The third approach we propose uses a web crawler as a test generator. As such, the generated tests are concrete, hence their navigations among the web application states are feasible by construction. However, the crawling trace cannot be easily turned into a minimal test suite that achieves the same coverage, due to test dependencies. Indeed, test dependencies are undesirable in the context of regression testing, preventing the adoption of testing optimization techniques that assume tests to be independent. In this thesis, we propose the first approach to detect test dependencies in a given web test suite by leveraging the information available both in the web test code and on the client side of the web application. Results of our empirical validation show that our approach can effectively and efficiently detect test dependencies, and it enables dependency-aware formulations of test parallelization and test minimization.
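    The diversity idea in the second direction can be sketched in a few lines: among the candidate test cases, greedily prefer the one whose minimum distance to the already generated tests is largest. The Jaccard distance over the actions exercised by a test is an assumption made here for illustration; the thesis may use a different test representation and distance measure.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of actions covered by two tests."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 if not union else 1.0 - len(a & b) / len(union)

def pick_most_diverse(candidates, already_generated):
    """Greedy diversity-based selection: prefer the candidate test whose
    minimum distance to the previously generated tests is largest."""
    if not already_generated:
        return candidates[0]
    return max(candidates,
               key=lambda c: min(jaccard_distance(c, t) for t in already_generated))

# candidate tests as sequences of page actions (illustrative)
generated = [["login", "open_cart"], ["login", "search"]]
candidates = [["login", "search", "filter"], ["signup", "edit_profile"]]
print(pick_most_diverse(candidates, generated))   # -> ['signup', 'edit_profile']
```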

    Digital archives: comparative study and interoperability framework

    Internship carried out at ParadigmaXis, supervised by Eng. Filipe Correia. Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 200

    Model-Driven Engineering in the Large: Refactoring Techniques for Models and Model Transformation Systems

    Model-Driven Engineering (MDE) is a software engineering paradigm that aims to increase the productivity of developers by raising the abstraction level of software development. It envisions the use of models as key artifacts during design, implementation and deployment. From the recent arrival of MDE in large-scale industrial software development – a trend we refer to as MDE in the large – a set of challenges emerges: First, models are now developed at distributed locations, by teams of teams. In such highly collaborative settings, the presence of large monolithic models gives rise to certain issues, such as their proneness to editing conflicts. Second, in large-scale system development, models are created using various domain-specific modeling languages. Combining these models in a disciplined manner calls for adequate modularization mechanisms. Third, the development of models is handled systematically by expressing the involved operations using model transformation rules. Such rules are often created by cloning, a practice related to performance and maintainability issues. In this thesis, we contribute three refactoring techniques, each aiming to tackle one of these challenges. First, we propose a technique to split a large monolithic model into a set of sub-models. The aim of this technique is to enable a separation of concerns within models, promoting a concern-based collaboration style: Collaborators operate on the sub-models relevant for their task at hand. Second, we suggest a technique to encapsulate model components by introducing modular interfaces in a set of related models. The goal of this technique is to establish modularity in these models. Third, we introduce a refactoring to merge a set of model transformation rules exhibiting a high degree of similarity. The aim of this technique is to improve maintainability and performance by eliminating the drawbacks associated with cloning. The refactoring creates variability-based rules, a novel type of rule that captures variability by using annotations. The refactoring techniques contributed in this work substantially reduce the manual effort of refactoring models and transformation rules. As indicated in a series of realistic case studies, the output produced by the techniques is comparable or, in the case of transformation rules, partly even preferable to the result of manual refactoring, yielding a promising outlook on the applicability in real-world settings.
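    To make the notion of variability-based rules more tangible, the sketch below models a merged rule whose elements are annotated with the variants (presence conditions) they belong to, and derives a flat rule for one variant by filtering on those annotations. The data structure and example are our own simplification, not the thesis's tooling or rule language.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedElement:
    """One rule element (e.g., a node or edge pattern) plus the set of
    variants in which it is present."""
    element: str
    variants: frozenset

@dataclass
class VariabilityRule:
    name: str
    elements: list

    def configure(self, variant):
        """Derive the flat rule for one variant by keeping only the elements
        whose presence condition mentions that variant."""
        return [e.element for e in self.elements if variant in e.variants]

# two similar transformation rules merged into one variability-based rule
rule = VariabilityRule("moveFeature", [
    AnnotatedElement("match: Class c1, Class c2", frozenset({"pull_up", "push_down"})),
    AnnotatedElement("move feature from c2 to c1", frozenset({"pull_up"})),
    AnnotatedElement("move feature from c1 to c2", frozenset({"push_down"})),
])
print(rule.configure("pull_up"))
```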