Autoregressive processes naturally arise in a large variety of real-world
scenarios, including e.g., stock markets, sell forecasting, weather prediction,
advertising, and pricing. When addressing a sequential decision-making problem
in such a context, the temporal dependence between consecutive observations
should be properly accounted for converge to the optimal decision policy. In
this work, we propose a novel online learning setting, named Autoregressive
Bandits (ARBs), in which the observed reward follows an autoregressive process
of order k, whose parameters depend on the action the agent chooses, within a
finite set of n actions. Then, we devise an optimistic regret minimization
algorithm AutoRegressive Upper Confidence Bounds (AR-UCB) that suffers regret
of order O((1−Γ)2(k+1)3/2nT​​), being T the optimization
horizon and Γ<1 an index of the stability of the system. Finally, we
present a numerical validation in several synthetic and one real-world setting,
in comparison with general and specific purpose bandit baselines showing the
advantages of the proposed approach