
    Learning to Crawl

    Web crawling is the problem of keeping a cache of webpages fresh, i.e., having the most recent copy available when a page is requested. This problem is usually coupled with the natural restriction that the bandwidth available to the web crawler is limited. The corresponding optimization problem was solved optimally by Azar et al. [2018] under the assumption that, for each webpage, both the changes and the requests occur according to Poisson processes with known rates. In this paper, we study the same control problem but under the assumption that the change rates are unknown a priori, and thus we need to estimate them in an online fashion using only partial observations (i.e., single-bit signals indicating whether the page has changed since the last refresh). As a point of departure, we characterise the conditions under which one can solve the problem with such partial observability. Next, we propose a practical estimator and compute confidence intervals for it in terms of the elapsed time between the observations. Finally, we show that the explore-and-commit algorithm achieves an $\mathcal{O}(\sqrt{T})$ regret with a carefully chosen exploration horizon. Our simulation study shows that our online policy scales well and achieves close to optimal performance for a wide range of the parameters. Comment: Published at AAAI 2020.
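    A minimal sketch (not the paper's actual estimator; all function names and numeric tolerances are assumptions) of how a change rate could be estimated by maximum likelihood from the single-bit signals the abstract describes. If changes arrive as a Poisson process with rate $r$, a refresh after an elapsed time $t$ observes a change with probability $1 - e^{-rt}$; the resulting log-likelihood is concave in $r$, so its derivative can be bisected:

```python
import math
import random

def simulate_observations(rate, intervals, rng):
    """For each refresh interval t, emit a single-bit signal: did the page
    change at least once during t? (Poisson changes => prob 1 - e^{-rate*t})."""
    return [1 if rng.random() < 1.0 - math.exp(-rate * t) else 0 for t in intervals]

def estimate_rate(intervals, signals, lo=1e-6, hi=1e3, iters=60):
    """MLE of the change rate from partial (single-bit) observations,
    found by bisecting the score (derivative of the log-likelihood):
    L(r) = sum_i [ s_i * log(1 - e^{-r t_i}) - (1 - s_i) * r t_i ]."""
    def score(r):
        s = 0.0
        for t, changed in zip(intervals, signals):
            if changed:
                e = math.exp(-r * t)
                s += t * e / (1.0 - e)   # d/dr of log(1 - e^{-rt})
            else:
                s -= t                   # d/dr of -r*t
        return s
    if score(hi) > 0:      # every signal was 1: MLE diverges, return the cap
        return hi
    if score(lo) < 0:      # every signal was 0: MLE is essentially zero
        return lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
intervals = [rng.expovariate(1.0) for _ in range(500)]  # refresh gaps
signals = simulate_observations(0.7, intervals, rng)    # true rate = 0.7
print(round(estimate_rate(intervals, signals), 3))      # close to 0.7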

    A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound

    Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate bounds for this suite of algorithms. Algorithms such as these have two iterates, $\theta_n$ and $w_n$, which are updated using two distinct stepsize sequences, $\alpha_n$ and $\beta_n$, respectively. Assuming $\alpha_n = n^{-\alpha}$ and $\beta_n = n^{-\beta}$ with $1 > \alpha > \beta > 0$, we show that, with high probability, the two iterates converge to their respective solutions $\theta^*$ and $w^*$ at rates given by $\|\theta_n - \theta^*\| = \tilde{O}(n^{-\alpha/2})$ and $\|w_n - w^*\| = \tilde{O}(n^{-\beta/2})$; here, $\tilde{O}$ hides logarithmic terms. Via comparable lower bounds, we show that these bounds are, in fact, tight. To the best of our knowledge, ours is the first finite-time analysis which achieves these rates. While it was known that the two timescale components decouple asymptotically, our results depict this phenomenon more explicitly by showing that it in fact happens from some finite time onwards. Lastly, compared to existing works, our result applies to a broader family of stepsizes, including non-square-summable ones.
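    As an illustration of the update structure the abstract analyzes, here is a minimal sketch of TDC, one member of the GTD family, with the two stepsize sequences $\alpha_n = n^{-\alpha}$ (slow iterate $\theta_n$) and $\beta_n = n^{-\beta}$ (fast iterate $w_n$). The feature/reward interface, default exponents, and toy data are assumptions for demonstrating the mechanics, not the paper's setup:

```python
import numpy as np

def tdc_two_timescale(samples, d, alpha_exp=0.7, beta_exp=0.4, gamma=0.99):
    """Sketch of the TDC updates with two stepsize sequences
    alpha_n = n^{-alpha_exp} and beta_n = n^{-beta_exp},
    where 1 > alpha_exp > beta_exp > 0 as in the abstract."""
    theta = np.zeros(d)  # main weight vector (slow timescale)
    w = np.zeros(d)      # auxiliary correction weights (fast timescale)
    for n, (phi, r, phi_next) in enumerate(samples, start=1):
        alpha_n = n ** -alpha_exp
        beta_n = n ** -beta_exp
        delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
        # Gradient-corrected update for theta, plain LMS-style update for w.
        theta += alpha_n * (delta * phi - gamma * (phi @ w) * phi_next)
        w += beta_n * (delta - phi @ w) * phi
    return theta, w

# Toy run on synthetic (feature, reward, next-feature) transitions.
rng = np.random.default_rng(0)
samples = [(rng.standard_normal(4), rng.standard_normal(), rng.standard_normal(4))
           for _ in range(10_000)]
theta, w = tdc_two_timescale(samples, d=4)
print(theta, w)
```

    The point of the sketch is the coupling the bounds describe: $w_n$ tracks a moving target that depends on $\theta_n$, and its larger stepsize lets it equilibrate faster, which is the decoupling phenomenon the paper quantifies in finite time.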

    Qualitative Multi-Armed Bandits: A Quantile-Based Approach

    We formalize and study the multi-armed bandit (MAB) problem in a generalized stochastic setting, in which rewards are not assumed to be numerical. Instead, rewards are measured on a qualitative scale that allows for comparison but invalidates arithmetic operations such as averaging. Correspondingly, instead of characterizing an arm in terms of the mean of the underlying distribution, we opt for using a quantile of that distribution as a representative value. We address the problem of quantile-based online learning both for the case of a finite time horizon (pure exploration) and an infinite time horizon (cumulative regret minimization). For both cases, we propose suitable algorithms and analyze their properties. These properties are also illustrated by means of first experimental studies.
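    To make the quantile-based idea concrete, below is a hedged sketch of an optimistic-quantile arm-selection rule in the spirit of the abstract; the confidence padding, the function name, and the encoding of qualitative rewards as sortable ranks are illustrative assumptions, not the paper's algorithm:

```python
import math
import random

def optimistic_quantile_select(histories, tau, t):
    """Pick an arm by an optimistic empirical tau-quantile: read off the
    order statistic at rank tau plus a confidence padding. Rewards only
    need to be comparable (sortable); they are never summed or averaged."""
    best_arm, best_val = None, None
    for arm, obs in enumerate(histories):
        n = len(obs)
        if n == 0:
            return arm                       # play every arm once first
        pad = math.sqrt(math.log(t + 1) / (2 * n))  # illustrative padding
        rank = int(math.ceil((tau + pad) * n)) - 1
        rank = min(max(rank, 0), n - 1)
        val = sorted(obs)[rank]              # optimistic quantile estimate
        if best_val is None or val > best_val:
            best_arm, best_val = arm, val
    return best_arm

# Toy run: qualitative rewards encoded as ranks 0 < 1 < 2 (bad/ok/good).
rng = random.Random(1)
dists = [(0.5, 0.4, 0.1), (0.2, 0.5, 0.3)]  # arm -> P(bad, ok, good)
hist = [[], []]
for t in range(2000):
    a = optimistic_quantile_select(hist, tau=0.5, t=t)
    hist[a].append(rng.choices([0, 1, 2], weights=dists[a])[0])
print([len(h) for h in hist])  # arm 1 (higher median) should dominate
```

    Note how the rule compares arms via their median (the $\tau = 0.5$ quantile) rather than a mean, which is exactly what makes it well defined on a purely ordinal reward scale.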