Whittle Index Policy for Crawling Ephemeral Content
We consider the task of scheduling a crawler to retrieve content from several
sites with ephemeral content. A user typically loses interest in ephemeral
content, such as news or posts in social network groups, after several days or
hours. The development of a timely crawling policy for such ephemeral
information sources is therefore very important. We first formulate this
problem as an optimal control problem with average reward, where the reward can
be measured in the number of clicks or relevant search requests. The problem in
its initial formulation suffers from the curse of dimensionality and quickly
becomes intractable even with a moderate number of information sources.
Fortunately, the problem admits a Whittle index, which leads to problem
decomposition and to a very simple and efficient crawling policy. We derive the
Whittle index and provide its theoretical justification.
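As a rough illustration of how an index policy decomposes the scheduling problem, the sketch below implements a generic index-based crawler. The index function here is a deliberately simple placeholder (reward rate times staleness), not the Whittle index derived in the paper, and the parameter names (`mus`, `taus`) are illustrative assumptions: each source is scored independently and the crawler greedily visits the highest-scoring source.

```python
# Illustrative sketch of an index-based crawling policy. ASSUMPTIONS: the
# per-source index below is a placeholder, not the Whittle index from the
# paper. Each source j has a staleness tau_j (time since last crawl) and a
# reward rate mu_j (e.g., clicks per unit time). Scoring sources
# independently is what makes the policy simple: the joint scheduling
# problem decomposes into per-source index computations.

def index(mu, tau):
    """Hypothetical per-source index: reward rate times staleness."""
    return mu * tau

def crawl_schedule(mus, horizon):
    """Crawl one source per step, always picking the max-index source."""
    taus = [0.0] * len(mus)
    order = []
    for _ in range(horizon):
        # every source ages by one time step
        taus = [t + 1.0 for t in taus]
        # crawl the source with the largest index and reset its staleness
        i = max(range(len(mus)), key=lambda j: index(mus[j], taus[j]))
        order.append(i)
        taus[i] = 0.0
    return order
```

For example, with two sources of reward rates 1.0 and 2.0, the policy crawls the high-reward source first and then alternates, since crawling resets its staleness while the other source keeps aging.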
Whittle Index Policy for Crawling Ephemeral Content
We consider the task of scheduling a crawler to retrieve ephemeral content from several sites. This is content, such as news or posts in social network groups, in which a user typically loses interest after some days or hours. The development of a timely crawling policy for ephemeral information sources is therefore very important. We first formulate this problem as an optimal control problem with average reward, where the reward can be measured in terms of the number of clicks or relevant search requests. The problem in its exact formulation suffers from the curse of dimensionality and quickly becomes intractable even with a moderate number of information sources. Fortunately, the problem admits a Whittle index, a celebrated heuristic that leads to problem decomposition and to a very simple and efficient crawling policy. We derive the Whittle index for a simple deterministic model and provide its theoretical justification. We also outline an extension to a fully stochastic model.
Change Rate Estimation and Optimal Freshness in Web Page Crawling
To provide quick and accurate results, a search engine maintains a local
snapshot of the entire web. To keep this local cache fresh, it employs a
crawler to track changes across various web pages. However, finite bandwidth
and server restrictions impose constraints on the crawling frequency.
Consequently, the ideal crawling rates are those that maximise the freshness of
the local cache while respecting these constraints. Azar et al. (2018) recently
proposed a tractable algorithm to solve this optimisation problem. However,
they assume knowledge of the exact page change rates, which is unrealistic in
practice. We address this issue here. Specifically, we provide two novel
schemes for online estimation of page change rates. Both schemes need only
partial information about the page change process, i.e., they only need to know
whether the page has changed since the last crawled instance. For both schemes,
we prove convergence and also derive their convergence rates. Finally, we
provide numerical experiments comparing the performance of our proposed
estimators with existing ones (e.g., the MLE).

Comment: This paper has been accepted to the 13th EAI International Conference
on Performance Evaluation Methodologies and Tools, VALUETOOLS'20, May 18--20,
2020, Tsukuba, Japan. This is the author version of the paper.