Learning to Crawl
Web crawling is the problem of keeping a cache of webpages fresh, i.e.,
having the most recent copy available when a page is requested. This problem is
usually coupled with the natural restriction that the bandwidth available to
the web crawler is limited. The corresponding optimization problem was solved
optimally by Azar et al. [2018] under the assumption that, for each webpage,
both the elapsed time between two changes and the elapsed time between two
requests follow a Poisson distribution with known parameters. In this paper, we
study the same control problem but under the assumption that the change rates
are unknown a priori, and thus we need to estimate them in an online fashion
using only partial observations (i.e., single-bit signals indicating whether
the page has changed since the last refresh). As a point of departure, we
characterise the conditions under which one can solve the problem with such
partial observability. Next, we propose a practical estimator and compute
confidence intervals for it in terms of the elapsed time between the
observations. Finally, we show that the explore-and-commit algorithm achieves
sublinear regret with a carefully chosen exploration horizon.
Our simulation study shows that our online policy scales well and achieves
close-to-optimal performance for a wide range of parameters.
Comment: Published at AAAI 202
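The one-bit feedback model described above can be illustrated with a small sketch (hypothetical code, not the authors' estimator; the names `simulate_observations` and `estimate_rate` are invented): if a page changes according to a Poisson process with rate `lam`, the probability that at least one change occurred during a refresh interval of length `t` is `1 - exp(-lam * t)`, so `lam` can be recovered from the binary change signals by maximizing the resulting Bernoulli likelihood.

```python
import math
import random

def simulate_observations(lam, intervals, rng):
    """For each refresh interval t, emit the single-bit signal: did the
    page change at least once since the last refresh?
    Under a Poisson change process, P(change) = 1 - exp(-lam * t)."""
    return [1 if rng.random() < 1 - math.exp(-lam * t) else 0 for t in intervals]

def estimate_rate(intervals, bits, lo=1e-6, hi=100.0, iters=60):
    """Maximum-likelihood estimate of lam from (interval, bit) pairs,
    found by bisection on the derivative of the Bernoulli log-likelihood
    (which is monotone decreasing in lam)."""
    def score(lam):
        s = 0.0
        for t, b in zip(intervals, bits):
            p = 1 - math.exp(-lam * t)
            # derivative of the per-observation log-likelihood w.r.t. lam
            s += t * math.exp(-lam * t) / p if b else -t
        return s
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
intervals = [rng.uniform(0.1, 2.0) for _ in range(5000)]
bits = simulate_observations(1.5, intervals, rng)
print(round(estimate_rate(intervals, bits), 2))  # close to the true rate 1.5
```

The estimator never observes how many times the page changed, only whether it changed at all, which is exactly the partial-observability constraint the paper works under.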
A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound
Policy evaluation in reinforcement learning is often conducted using
two-timescale stochastic approximation, which results in various gradient
temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide
convergence rate bounds for this suite of algorithms. Algorithms such as these
have two iterates, $\theta_n$ and $w_n$, which are updated using two distinct
stepsize sequences, $\alpha_n$ and $\beta_n$, respectively. Assuming
$\alpha_n = n^{-\alpha}$ and $\beta_n = n^{-\beta}$ with $1 > \alpha > \beta > 0$, we
show that, with high probability, the two iterates converge to their respective
solutions $\theta^*$ and $w^*$ at rates given by $\tilde{O}(n^{-\alpha/2})$ and
$\tilde{O}(n^{-\beta/2})$, respectively; here, $\tilde{O}$
hides logarithmic terms. Via comparable lower bounds, we show that
these bounds are, in fact, tight. To the best of our knowledge, ours is the
first finite-time analysis which achieves these rates. While it was known that
the two timescale components decouple asymptotically, our results depict this
phenomenon more explicitly by showing that it in fact happens from some finite
time onwards. Lastly, compared to existing works, our result applies to a
broader family of stepsizes, including non-square-summable ones.
Qualitative Multi-Armed Bandits: A Quantile-Based Approach
We formalize and study the multi-armed bandit (MAB) problem in a generalized stochastic setting, in which rewards are not assumed to be numerical. Instead, rewards are measured on a qualitative scale that allows for comparison but invalidates arithmetic operations such as averaging. Correspondingly, instead of characterizing an arm in terms of the mean of the underlying distribution, we opt for using a quantile of that distribution as a representative value. We address the problem of quantile-based online learning both for the case of a finite time horizon (pure exploration) and an infinite time horizon (cumulative regret minimization). For both cases, we propose suitable algorithms and analyze their properties. These properties are also illustrated by means of first experimental studies.
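A minimal sketch of the quantile-based idea (a hypothetical uniform-exploration baseline, not the paper's algorithms; `explore_then_pick` and the two arms are invented): rewards are ordinal levels that are only ever sorted and compared, never averaged, and each arm is summarized by its empirical tau-quantile.

```python
import random

def empirical_quantile(samples, tau):
    """tau-quantile of the observed ordinal rewards; requires only
    sorting, never averaging, so any totally ordered scale works."""
    s = sorted(samples)
    return s[min(int(tau * len(s)), len(s) - 1)]

def explore_then_pick(arms, tau, budget, rng):
    """Uniform pure exploration: pull every arm equally often, then
    return the arm whose empirical tau-quantile is largest."""
    samples = {a: [] for a in range(len(arms))}
    for t in range(budget):
        a = t % len(arms)
        samples[a].append(arms[a](rng))
    return max(samples, key=lambda a: empirical_quantile(samples[a], tau))

# Ordinal reward levels 0 < 1 < 2 (think "bad" < "ok" < "good").
arm0 = lambda r: r.choices([0, 1, 2], weights=[6, 3, 1])[0]
arm1 = lambda r: r.choices([0, 1, 2], weights=[1, 3, 6])[0]
rng = random.Random(0)
best = explore_then_pick([arm0, arm1], tau=0.5, budget=2000, rng=rng)
print(best)  # arm 1 has the clearly higher median level
```

With tau = 0.5 the representative value is the median, which is well defined on a purely qualitative scale even though the mean is not.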