
    Learning to Crawl

    Web crawling is the problem of keeping a cache of webpages fresh, i.e., having the most recent copy available when a page is requested. This problem is usually coupled with the natural restriction that the bandwidth available to the web crawler is limited. The corresponding optimization problem was solved optimally by Azar et al. [2018] under the assumption that, for each webpage, both the changes and the requests occur according to Poisson processes with known rates. In this paper, we study the same control problem but under the assumption that the change rates are unknown a priori, and thus we need to estimate them in an online fashion using only partial observations (i.e., single-bit signals indicating whether the page has changed since the last refresh). As a point of departure, we characterise the conditions under which one can solve the problem with such partial observability. Next, we propose a practical estimator and compute confidence intervals for it in terms of the elapsed time between the observations. Finally, we show that the explore-and-commit algorithm achieves an $\mathcal{O}(\sqrt{T})$ regret with a carefully chosen exploration horizon. Our simulation study shows that our online policy scales well and achieves close to optimal performance for a wide range of the parameters. Comment: Published at AAAI 2020.
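    A minimal sketch (not the paper's actual estimator; all function names and numeric tolerances are assumptions) of how a change rate could be estimated by maximum likelihood from the single-bit signals the abstract describes. If changes arrive as a Poisson process with rate $r$, a refresh after an elapsed time $t$ observes a change with probability $1 - e^{-rt}$; the resulting log-likelihood is concave in $r$, so its derivative can be bisected:

```python
import math
import random

def simulate_observations(rate, intervals, rng):
    """For each refresh interval t, emit a single-bit signal: did the page
    change at least once during t? (Poisson changes => prob 1 - e^{-rate*t})."""
    return [1 if rng.random() < 1.0 - math.exp(-rate * t) else 0 for t in intervals]

def estimate_rate(intervals, signals, lo=1e-6, hi=1e3, iters=60):
    """MLE of the change rate from partial (single-bit) observations,
    found by bisecting the score (derivative of the log-likelihood):
    L(r) = sum_i [ s_i * log(1 - e^{-r t_i}) - (1 - s_i) * r t_i ]."""
    def score(r):
        s = 0.0
        for t, changed in zip(intervals, signals):
            if changed:
                e = math.exp(-r * t)
                s += t * e / (1.0 - e)   # d/dr of log(1 - e^{-rt})
            else:
                s -= t                   # d/dr of -r*t
        return s
    if score(hi) > 0:      # every signal was 1: MLE diverges, return the cap
        return hi
    if score(lo) < 0:      # every signal was 0: MLE is essentially zero
        return lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
intervals = [rng.expovariate(1.0) for _ in range(500)]  # refresh gaps
signals = simulate_observations(0.7, intervals, rng)    # true rate = 0.7
print(round(estimate_rate(intervals, signals), 3))      # close to 0.7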

    A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound

    Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate bounds for this suite of algorithms. Algorithms such as these have two iterates, $\theta_n$ and $w_n$, which are updated using two distinct stepsize sequences, $\alpha_n$ and $\beta_n$, respectively. Assuming $\alpha_n = n^{-\alpha}$ and $\beta_n = n^{-\beta}$ with $1 > \alpha > \beta > 0$, we show that, with high probability, the two iterates converge to their respective solutions $\theta^*$ and $w^*$ at rates given by $\|\theta_n - \theta^*\| = \tilde{O}(n^{-\alpha/2})$ and $\|w_n - w^*\| = \tilde{O}(n^{-\beta/2})$; here, $\tilde{O}$ hides logarithmic terms. Via comparable lower bounds, we show that these bounds are, in fact, tight. To the best of our knowledge, ours is the first finite-time analysis which achieves these rates. While it was known that the two timescale components decouple asymptotically, our results depict this phenomenon more explicitly by showing that it in fact happens from some finite time onwards. Lastly, compared to existing works, our result applies to a broader family of stepsizes, including non-square-summable ones.
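    As an illustration of the update structure the abstract analyzes, here is a minimal sketch of TDC, one member of the GTD family, with the two stepsize sequences $\alpha_n = n^{-\alpha}$ (slow iterate $\theta_n$) and $\beta_n = n^{-\beta}$ (fast iterate $w_n$). The feature/reward interface, default exponents, and toy data are assumptions for demonstrating the mechanics, not the paper's setup:

```python
import numpy as np

def tdc_two_timescale(samples, d, alpha_exp=0.7, beta_exp=0.4, gamma=0.99):
    """Sketch of the TDC updates with two stepsize sequences
    alpha_n = n^{-alpha_exp} and beta_n = n^{-beta_exp},
    where 1 > alpha_exp > beta_exp > 0 as in the abstract."""
    theta = np.zeros(d)  # main weight vector (slow timescale)
    w = np.zeros(d)      # auxiliary correction weights (fast timescale)
    for n, (phi, r, phi_next) in enumerate(samples, start=1):
        alpha_n = n ** -alpha_exp
        beta_n = n ** -beta_exp
        delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
        # Gradient-corrected update for theta, plain LMS-style update for w.
        theta += alpha_n * (delta * phi - gamma * (phi @ w) * phi_next)
        w += beta_n * (delta - phi @ w) * phi
    return theta, w

# Toy run on synthetic (feature, reward, next-feature) transitions.
rng = np.random.default_rng(0)
samples = [(rng.standard_normal(4), rng.standard_normal(), rng.standard_normal(4))
           for _ in range(10_000)]
theta, w = tdc_two_timescale(samples, d=4)
print(theta, w)
```

    The point of the sketch is the coupling the bounds describe: $w_n$ tracks a moving target that depends on $\theta_n$, and its larger stepsize lets it equilibrate faster, which is the decoupling phenomenon the paper quantifies in finite time.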

    Qualitative Multi-Armed Bandits: A Quantile-Based Approach

    We formalize and study the multi-armed bandit (MAB) problem in a generalized stochastic setting, in which rewards are not assumed to be numerical. Instead, rewards are measured on a qualitative scale that allows for comparison but invalidates arithmetic operations such as averaging. Correspondingly, instead of characterizing an arm in terms of the mean of the underlying distribution, we opt for using a quantile of that distribution as a representative value. We address the problem of quantile-based online learning both for the case of a finite time horizon (pure exploration) and an infinite time horizon (cumulative regret minimization). For both cases, we propose suitable algorithms and analyze their properties. These properties are also illustrated by means of first experimental studies.
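    To make the quantile-based idea concrete, below is a hedged sketch of an optimistic-quantile arm-selection rule in the spirit of the abstract; the confidence padding, the function name, and the encoding of qualitative rewards as sortable ranks are illustrative assumptions, not the paper's algorithm:

```python
import math
import random

def optimistic_quantile_select(histories, tau, t):
    """Pick an arm by an optimistic empirical tau-quantile: read off the
    order statistic at rank tau plus a confidence padding. Rewards only
    need to be comparable (sortable); they are never summed or averaged."""
    best_arm, best_val = None, None
    for arm, obs in enumerate(histories):
        n = len(obs)
        if n == 0:
            return arm                       # play every arm once first
        pad = math.sqrt(math.log(t + 1) / (2 * n))  # illustrative padding
        rank = int(math.ceil((tau + pad) * n)) - 1
        rank = min(max(rank, 0), n - 1)
        val = sorted(obs)[rank]              # optimistic quantile estimate
        if best_val is None or val > best_val:
            best_arm, best_val = arm, val
    return best_arm

# Toy run: qualitative rewards encoded as ranks 0 < 1 < 2 (bad/ok/good).
rng = random.Random(1)
dists = [(0.5, 0.4, 0.1), (0.2, 0.5, 0.3)]  # arm -> P(bad, ok, good)
hist = [[], []]
for t in range(2000):
    a = optimistic_quantile_select(hist, tau=0.5, t=t)
    hist[a].append(rng.choices([0, 1, 2], weights=dists[a])[0])
print([len(h) for h in hist])  # arm 1 (higher median) should dominate
```

    Note how the rule compares arms via their median (the $\tau = 0.5$ quantile) rather than a mean, which is exactly what makes it well defined on a purely ordinal reward scale.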