Whittle Index Policy for Crawling Ephemeral Content
We consider the task of scheduling a crawler to retrieve content from several
sites with ephemeral content. A user typically loses interest in ephemeral
content, such as news or posts in social network groups, after several days or
hours. The development of a timely crawling policy for such ephemeral
information sources is therefore very important. We first formulate this
problem as an optimal control problem with average reward, where the reward can
be measured in the number of clicks or relevant search requests. The problem in
its initial formulation suffers from the curse of dimensionality and quickly
becomes intractable even with a moderate number of information sources.
Fortunately, the problem admits a Whittle index, which leads to problem
decomposition and to a very simple and efficient crawling policy. We derive the
Whittle index and provide its theoretical justification.
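As a rough illustration of how an index policy decomposes the scheduling problem, the sketch below implements a generic index-based crawler. The index function here is a deliberately simple placeholder (reward rate times staleness), not the Whittle index derived in the paper, and the parameter names (`mus`, `taus`) are illustrative assumptions: each source is scored independently and the crawler greedily visits the highest-scoring source.

```python
# Illustrative sketch of an index-based crawling policy. ASSUMPTIONS: the
# per-source index below is a placeholder, not the Whittle index from the
# paper. Each source j has a staleness tau_j (time since last crawl) and a
# reward rate mu_j (e.g., clicks per unit time). Scoring sources
# independently is what makes the policy simple: the joint scheduling
# problem decomposes into per-source index computations.

def index(mu, tau):
    """Hypothetical per-source index: reward rate times staleness."""
    return mu * tau

def crawl_schedule(mus, horizon):
    """Crawl one source per step, always picking the max-index source."""
    taus = [0.0] * len(mus)
    order = []
    for _ in range(horizon):
        # every source ages by one time step
        taus = [t + 1.0 for t in taus]
        # crawl the source with the largest index and reset its staleness
        i = max(range(len(mus)), key=lambda j: index(mus[j], taus[j]))
        order.append(i)
        taus[i] = 0.0
    return order
```

For example, with two sources of reward rates 1.0 and 2.0, the policy crawls the high-reward source first and then alternates, since crawling resets its staleness while the other source keeps aging.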
Whittle Index Policy for Crawling Ephemeral Content
We consider the task of scheduling a crawler to retrieve ephemeral content from several sites. This is content, such as news or posts in social network groups, in which a user typically loses interest after some days or hours. The development of a timely crawling policy for ephemeral information sources is therefore very important. We first formulate this problem as an optimal control problem with average reward, where the reward can be measured in terms of the number of clicks or relevant search requests. The problem in its exact formulation suffers from the curse of dimensionality and quickly becomes intractable even with a moderate number of information sources. Fortunately, the problem admits a Whittle index, a celebrated heuristic that leads to problem decomposition and to a very simple and efficient crawling policy. We derive the Whittle index for a simple deterministic model and provide its theoretical justification. We also outline an extension to a fully stochastic model.
Change Rate Estimation and Optimal Freshness in Web Page Crawling
To provide quick and accurate results, a search engine maintains a local
snapshot of the entire web. To keep this local cache fresh, it employs a
crawler to track changes across various web pages. However, finite bandwidth
and server restrictions impose constraints on the crawling frequency.
Consequently, the ideal crawling rates are those that maximise the freshness of
the local cache while respecting these constraints. Azar et al. (2018) recently
proposed a tractable algorithm to solve this optimisation problem. However,
they assume knowledge of the exact page change rates, which is unrealistic in
practice. We address this issue here. Specifically, we provide two novel
schemes for online estimation of page change rates. Both schemes need only
partial information about the page change process, i.e., they only need to know
whether the page has changed since the last crawled instance. For both schemes,
we prove convergence and also derive their convergence rates. Finally, we
provide numerical experiments comparing the performance of our proposed
estimators with existing ones (e.g., the MLE).

Comment: This paper has been accepted to the 13th EAI International Conference
on Performance Evaluation Methodologies and Tools, VALUETOOLS'20, May 18--20,
2020, Tsukuba, Japan. This is the author version of the paper.