Existing online learning algorithms for adversarial Markov Decision Processes
achieve O(T) regret after T rounds of interactions even if the
loss functions are chosen arbitrarily by an adversary, with the caveat that the
transition function has to be fixed. This is because it has been shown that
adversarial transition functions make no-regret learning impossible. Despite
such impossibility results, in this work, we develop algorithms that can handle
both adversarial losses and adversarial transitions, with regret increasing
smoothly in the degree of maliciousness of the adversary. More concretely, we
first propose an algorithm that enjoys O(T+CP) regret where CP measures how adversarial the
transition functions are and can be at most O(T). While this algorithm
itself requires knowledge of CP, we further develop a black-box
reduction approach that removes this requirement. Moreover, we also show that
further refinements of the algorithm not only maintains the same regret bound,
but also simultaneously adapts to easier environments (where losses are
generated in a certain stochastically constrained manner as in Jin et al.
[2021]) and achieves O(U+UCL+CP) regret, where U is some standard gap-dependent coefficient
and CL is the amount of corruption on losses.Comment: 66 page