
    Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

    We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs (CMDPs). The algorithm operates under the minimal assumptions of a realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple, and robust to approximation errors. It enjoys an $\widetilde{O}(H^{2.5}\sqrt{T|S||A|(\mathcal{R}(\mathcal{O}) + H\log(\delta^{-1}))})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, and $\mathcal{R}(\mathcal{O}) = \mathcal{R}(\mathcal{O}_{\mathrm{sq}}^{\mathcal{F}}) + \mathcal{R}(\mathcal{O}_{\mathrm{log}}^{\mathcal{P}})$ the sum of the regrets of the regression oracles used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient rate-optimal regret minimization algorithm for adversarial CMDPs that operates under the minimal standard assumption of online function approximation.
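    For intuition, here is a minimal sketch of the kind of online least squares regression oracle the abstract assumes, instantiated as recursive ridge regression over a linear function class; the name OnlineRidgeOracle and the linear instantiation are illustrative assumptions, not details from the paper.

        import numpy as np

        class OnlineRidgeOracle:
            # A concrete online least-squares oracle: ridge regression over
            # linear predictors f(x) = w @ x, updated one observation at a
            # time. Against a bounded linear class its cumulative square-loss
            # regret grows only logarithmically in T, which is the kind of
            # quantity the R(O_sq) term in the bound measures.
            def __init__(self, dim, lam=1.0):
                self.A = lam * np.eye(dim)   # regularized Gram matrix
                self.b = np.zeros(dim)       # accumulated y * x sums

            def predict(self, x):
                # current ridge estimate w_t = A^{-1} b, evaluated at x
                w = np.linalg.solve(self.A, self.b)
                return float(w @ x)

            def update(self, x, y):
                # fold the observed pair (x, y) into the sufficient statistics
                self.A += np.outer(x, x)
                self.b += y * x

    In the algorithm's setting, such an oracle would be queried to predict context-dependent rewards and then updated with each realized outcome.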

    Counterfactual Optimism: Rate Optimal Regret for Stochastic Contextual MDPs

    We present the $\mathrm{UC}^3\mathrm{RL}$ algorithm for regret minimization in Stochastic Contextual MDPs (CMDPs). The algorithm operates under the minimal assumptions of a realizable function class and access to offline least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys an $\widetilde{O}(H^3\sqrt{T|S||A|(\log(|\mathcal{F}|/\delta) + \log(|\mathcal{P}|/\delta))})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, and $\mathcal{P}$ and $\mathcal{F}$ finite function classes used to approximate the context-dependent dynamics and rewards, respectively. To the best of our knowledge, our algorithm is the first efficient and rate-optimal regret minimization algorithm for CMDPs that operates under the general offline function approximation setting.
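    For intuition, a minimal sketch of the two offline oracles the abstract assumes, instantiated for finite classes by empirical risk minimization; the function names are hypothetical, not from the paper.

        import numpy as np

        def offline_least_squares(F, data):
            # Offline square-loss oracle over a finite class F of predictors:
            # return the empirical risk minimizer
            # argmin_{f in F} sum_i (f(x_i) - y_i)^2.
            return min(F, key=lambda f: sum((f(x) - y) ** 2 for x, y in data))

        def offline_log_loss(P, data):
            # Offline log-loss oracle over a finite class P of conditional
            # distributions (p(x) is an indexable distribution over outcomes):
            # maximum likelihood over the class.
            return min(P, key=lambda p: -sum(np.log(p(x)[y]) for x, y in data))

    For finite classes, the $\log(|\mathcal{F}|/\delta)$ and $\log(|\mathcal{P}|/\delta)$ terms in the bound are the usual price of uniform convergence over the class.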

    Rate-Optimal Online Convex Optimization in Adaptive Linear Control

    We consider the problem of controlling an unknown linear dynamical system under adversarially changing convex costs, with full feedback of both the state and the cost function. We present the first computationally efficient algorithm that attains an optimal $\sqrt{T}$-regret rate compared to the best stabilizing linear controller in hindsight, while avoiding stringent assumptions on the costs such as strong convexity. Our approach is based on a careful design of non-convex lower confidence bounds for the online costs, and uses a novel technique for computationally efficient regret minimization of these bounds that leverages their particular non-convex structure.
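    Spelled out, the benchmark in this guarantee is the standard policy-regret comparison against stabilizing linear state-feedback controllers; the following formulation (the notation $\mathcal{K}$ and $x_t^K$ is ours, not the paper's) is a sketch of that notion:

        \mathrm{Regret}_T \;=\; \sum_{t=1}^{T} c_t(x_t, u_t) \;-\; \min_{K \in \mathcal{K}} \sum_{t=1}^{T} c_t\!\left(x_t^K, K x_t^K\right)

    where $\mathcal{K}$ is a set of stabilizing linear controllers $u_t = K x_t$, $x_t^K$ is the state sequence induced by playing $K$ throughout, and the guarantee is that this quantity grows as $\sqrt{T}$ up to problem-dependent factors.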