    Change Rate Estimation and Optimal Freshness in Web Page Crawling

    To provide quick and accurate results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler that tracks changes across various web pages. However, finite bandwidth availability and server restrictions impose constraints on the crawling frequency. Consequently, the ideal crawling rates are those that maximise the freshness of the local cache while respecting these constraints. Azar et al. (2018) recently proposed a tractable algorithm to solve this optimisation problem. However, they assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide two novel schemes for online estimation of page change rates. Both schemes need only partial information about the page change process, i.e., they only need to know whether the page has changed since the last crawl instance. For both schemes, we prove convergence and derive convergence rates. Finally, we provide numerical experiments comparing the performance of our proposed estimators with existing ones (e.g., MLE).
    Comment: This paper has been accepted to the 13th EAI International Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS'20, May 18–20, 2020, Tsukuba, Japan. This is the author version of the paper.
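    The single-bit feedback model above admits a very simple moment-style estimator, sketched below in Python (hypothetical names; an illustrative sketch, not necessarily either of the authors' two schemes): if a page changes according to a Poisson process with rate λ and is crawled every τ seconds, the probability that a crawl finds the page unchanged is e^(−λτ), so the empirical fraction of unchanged crawls yields an estimate of λ.

```python
import math

def change_rate_estimate(changed_bits, tau):
    """Estimate a Poisson page-change rate from single-bit feedback.

    changed_bits -- 0/1 per crawl: did the page change since the last crawl?
    tau          -- fixed inter-crawl interval (seconds)

    Under a Poisson(rate) change process, P(unchanged over tau) = exp(-rate*tau),
    so rate is approximately -ln(fraction unchanged) / tau.
    """
    n = len(changed_bits)
    unchanged = changed_bits.count(0)
    if unchanged == 0:
        # Every crawl saw a change: the rate is too high to identify
        # at this crawl frequency; signal that to the caller.
        return float("inf")
    return -math.log(unchanged / n) / tau

# Example: a page crawled every 60 s, unchanged in 70 of 100 crawls.
bits = [0] * 70 + [1] * 30
print(change_rate_estimate(bits, tau=60.0))  # about 0.0059 changes/s
```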

    LiveRank: How to Refresh Old Datasets

    This paper considers the problem of refreshing a dataset. More precisely, given a collection of nodes gathered at some time (Web pages, users from an online social network) along with some structure (hyperlinks, social relationships), we want to identify a significant fraction of the nodes that still exist at present time. The liveness of an old node can be tested through an online query at present time. We call LiveRank a ranking of the old pages such that active nodes are more likely to appear first. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the active nodes when using the LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made, to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on PageRank can lead to efficient LiveRanks, for Web graphs as well as for online social networks.
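    The quality measure used here is straightforward to state in code. Below is a minimal sketch (my own formulation of the metric, with hypothetical names) that counts how many liveness queries are spent, probing nodes in LiveRank order, before a fraction alpha of the active nodes has been found:

```python
import math

def liverank_cost(ranking, live, alpha):
    """Queries needed, probing nodes in `ranking` order, to identify
    a fraction `alpha` of the nodes in the set `live`."""
    target = math.ceil(alpha * len(live))  # live nodes we must find
    found = 0
    for queries, node in enumerate(ranking, start=1):
        found += node in live              # a query reveals liveness
        if found >= target:
            return queries
    return None  # the ranking does not reach the requested fraction
```

    Lower is better: a perfect LiveRank needs exactly ceil(alpha * len(live)) queries, while a uniformly random order needs roughly that number scaled by len(ranking) / len(live).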

    Learning to Crawl

    Web crawling is the problem of keeping a cache of webpages fresh, i.e., having the most recent copy available when a page is requested. This problem is usually coupled with the natural restriction that the bandwidth available to the web crawler is limited. The corresponding optimization problem was solved optimally by Azar et al. [2018] under the assumption that, for each webpage, both the elapsed time between two changes and the elapsed time between two requests follow an exponential distribution (i.e., changes and requests occur as Poisson processes) with known parameters. In this paper, we study the same control problem but under the assumption that the change rates are unknown a priori, and thus need to be estimated in an online fashion using only partial observations (i.e., single-bit signals indicating whether the page has changed since the last refresh). As a point of departure, we characterise the conditions under which one can solve the problem with such partial observability. Next, we propose a practical estimator and compute confidence intervals for it in terms of the elapsed time between observations. Finally, we show that the explore-and-commit algorithm achieves an $\mathcal{O}(\sqrt{T})$ regret with a carefully chosen exploration horizon. Our simulation study shows that our online policy scales well and achieves close to optimal performance for a wide range of the parameters.
    Comment: Published at AAAI 2020.
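    The explore-and-commit template mentioned above can be sketched schematically (an illustration under stated assumptions, not the paper's algorithm: `crawl`, `estimate_rate`, and `solve_rates` are hypothetical stand-ins, and the exploration horizon below is a placeholder, whereas the paper tunes it carefully to obtain the $\mathcal{O}(\sqrt{T})$ regret):

```python
import math

def explore_and_commit(pages, crawl, estimate_rate, solve_rates, T):
    """Schematic explore-then-commit loop for crawl scheduling.

    crawl(p)        -- probe page p, return (gap, bit): the elapsed time
                       and a 0/1 signal of whether p changed during it
    estimate_rate   -- any change-rate estimator taking (gaps, bits),
                       e.g. one built from single-bit feedback
    solve_rates     -- stand-in for the bandwidth-constrained freshness
                       optimisation of Azar et al. [2018] (not shown)
    """
    horizon = max(1, int(math.sqrt(T)))   # placeholder exploration budget
    obs = {p: ([], []) for p in pages}
    for _ in range(horizon):              # explore: crawl pages uniformly
        for p in pages:
            gap, bit = crawl(p)
            obs[p][0].append(gap)
            obs[p][1].append(bit)
    rates = {p: estimate_rate(gaps, bits) for p, (gaps, bits) in obs.items()}
    return solve_rates(rates)             # commit for the remaining time
```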

    Online algorithms for estimating change rates of web pages

    A search engine maintains local copies of different web pages to provide quick search results. This local cache is kept up to date by a web crawler that frequently visits these pages to track changes in them. Ideally, the local copy should be updated as soon as a page changes on the web. However, finite bandwidth availability and server restrictions limit how frequently different pages can be crawled. This brings forth the following optimization problem: maximize the freshness of the local cache subject to the crawling frequencies being within prescribed bounds. While tractable algorithms do exist to solve this problem, they either assume knowledge of the exact page change rates or use inefficient methods such as MLE to estimate them. We address this issue here. We provide three novel schemes for online estimation of page change rates, all of which have extremely low running times per iteration. The first is based on the law of large numbers, the second on stochastic approximation, and the third extends the second with a heavy-ball momentum term. All these schemes need only partial information about the page change process, i.e., they only need to know whether the page has changed since the last crawl instance. Our main theoretical results concern the asymptotic convergence and convergence rates of these three schemes. In fact, our work is the first to show convergence of the original stochastic heavy-ball method when neither the gradient nor the noise variance is uniformly bounded. We also provide numerical experiments (based on real and synthetic data) to demonstrate the superiority of our proposed estimators over existing ones such as MLE. We emphasize that our algorithms are also readily applicable to the synchronization of databases and network inventory management.
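    The stochastic-approximation idea behind the second and third schemes can be sketched generically (an illustrative Robbins–Monro update of my own, not the authors' exact recursion): drive the estimate λ toward the value at which the predicted change probability 1 − e^(−λτ_k) matches the observed single-bit signals, optionally adding a heavy-ball momentum term.

```python
import math

def sa_change_rate(gaps, bits, lam0=1.0, beta=0.0):
    """Generic stochastic-approximation sketch for a page change rate.

    gaps -- inter-crawl intervals tau_k
    bits -- 0/1 signals: did the page change during gap k?
    beta -- heavy-ball momentum weight (beta = 0 disables momentum)

    Robbins-Monro step toward E[bit_k] = 1 - exp(-lam * tau_k).
    """
    lam = prev = lam0
    for k, (tau, bit) in enumerate(zip(gaps, bits), start=1):
        step = 1.0 / k                                 # diminishing step size
        residual = bit - (1.0 - math.exp(-lam * tau))  # observed - predicted
        nxt = lam + step * residual + beta * (lam - prev)
        prev, lam = lam, max(nxt, 1e-9)                # keep the rate positive
    return lam
```

    With beta = 0 this is a plain stochastic-approximation recursion; a positive beta reuses the previous displacement as momentum, which is the heavy-ball idea the abstract refers to.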