Corrupted Contextual Bandits with Action Order Constraints
We consider a variant of the novel contextual bandit problem with corrupted
context, which we call the contextual bandit problem with corrupted context and
action correlation, where actions exhibit a relationship structure that can be
exploited to guide the exploration of viable next decisions. Our setting is
primarily motivated by adaptive mobile health interventions and related
applications, where users might transition through different stages requiring
more targeted action selection approaches. In such settings, keeping user
engagement is paramount for the success of interventions and therefore it is
vital to provide relevant recommendations in a timely manner. The context
provided by users might not always be informative at every decision point and
standard contextual approaches to action selection will incur high regret. We
propose a meta-algorithm using a referee that dynamically combines the policies
of a contextual bandit and multi-armed bandit, similar to previous work, as
well as a simple correlation mechanism that captures action-to-action
transition probabilities allowing for more efficient exploration of
time-correlated actions. We empirically evaluate the performance of this
algorithm on a simulation where the sequence of best actions is determined by a
hidden state that evolves in a Markovian manner. We show that the proposed
meta-algorithm improves upon regret in situations where the performance of both
policies varies such that one is strictly superior to the other for a given
time period. To demonstrate that our setting has relevant practical
applicability, we evaluate our method on several real-world data sets, clearly
showing better empirical performance compared to a set of simple algorithms.
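The two mechanisms named in this abstract, a referee that arbitrates between two base policies and a table of action-to-action transition probabilities, can be illustrated with a minimal sketch. The abstract does not give the update rules, so everything below is an assumption: the referee uses an EXP3-style exponential-weights update, the transition model uses smoothed counts, and the names `Referee`, `TransitionModel`, `eta`, and `prior` are hypothetical.

```python
import math
import random


class Referee:
    """Hypothetical exponential-weights referee over base policies.

    Assumption: an EXP3-style update; the abstract does not specify the rule.
    """

    def __init__(self, n_policies=2, eta=0.5):
        self.eta = eta
        self.weights = [1.0] * n_policies

    def probabilities(self):
        total = sum(self.weights)
        return [w / total for w in self.weights]

    def choose(self, rng=random):
        # Sample the index of the base policy to follow this round.
        return rng.choices(range(len(self.weights)), weights=self.weights)[0]

    def update(self, policy_index, reward):
        # Importance-weighted reward: only the policy that was followed is updated.
        p = self.probabilities()[policy_index]
        self.weights[policy_index] *= math.exp(self.eta * reward / p)
        # Rescale so the largest weight is 1, which avoids overflow.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


class TransitionModel:
    """Hypothetical smoothed counts of which action was best after which."""

    def __init__(self, n_actions, prior=1.0):
        self.counts = [[prior] * n_actions for _ in range(n_actions)]

    def update(self, prev_action, next_action):
        self.counts[prev_action][next_action] += 1.0

    def probabilities(self, prev_action):
        row = self.counts[prev_action]
        total = sum(row)
        return [c / total for c in row]
```

In a full loop, the referee would pick the contextual or the multi-armed policy each round, and the transition probabilities for the previous action could bias which actions are explored next. How the two are combined is not stated in the abstract.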
Off-Policy Evaluation for Action-Dependent Non-Stationary Environments
Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary. This limits the application of such methods because real-world problems are often subject to changes due to external factors (passive non-stationarity), changes induced by interactions with the system itself (active non-stationarity), or both (hybrid non-stationarity). In this work, we take the first steps towards the fundamental challenge of on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity. Towards this goal, we make a higher-order stationarity assumption such that non-stationarity results in changes over time, but the way changes happen is fixed. We propose OPEN, an algorithm that uses a double application of counterfactual reasoning and a novel importance-weighted instrument-variable regression to obtain an estimate, with both lower bias and lower variance, of the structure in the changes of a policy’s past performances. Finally, we show promising results on how OPEN can be used to predict future performances for several domains inspired by real-world applications that exhibit non-stationarity.
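For orientation, the counterfactual reasoning mentioned above builds on importance weighting. The sketch below is plain ordinary importance sampling for off-policy evaluation, not OPEN itself, whose regression step the abstract does not detail. The function name, the episode format, and the two probability callables are assumptions made for illustration.

```python
def importance_sampling_estimate(episodes, target_prob, behavior_prob):
    """Ordinary importance sampling estimate of a target policy's expected return.

    Assumed (hypothetical) data format: `episodes` is a list of (steps, ret)
    pairs, where `steps` is a list of (state, action) pairs collected under the
    behavior policy and `ret` is the episode's return.
    """
    total = 0.0
    for steps, ret in episodes:
        ratio = 1.0
        for state, action in steps:
            # Reweight by how much more likely the target policy is to act this way.
            ratio *= target_prob(state, action) / behavior_prob(state, action)
        total += ratio * ret
    return total / len(episodes)
```

Applied to episodes grouped by time period, an estimator of this kind yields a sequence of past performance estimates for the target policy, which is the kind of quantity whose changes the abstract describes modelling.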
Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding
A prominent challenge of offline reinforcement learning (RL) is the issue of
hidden confounding: unobserved variables may influence both the actions taken
by the agent and the observed outcomes. Hidden confounding can compromise the
validity of any causal conclusion drawn from data and presents a major obstacle
to effective offline RL. In the present paper, we tackle the problem of hidden
confounding in the nonidentifiable setting. We propose a definition of
uncertainty due to hidden confounding bias, termed delphic uncertainty, which
uses variation over world models compatible with the observations, and
differentiate it from the well-known epistemic and aleatoric uncertainties. We
derive a practical method for estimating the three types of uncertainties, and
construct a pessimistic offline RL algorithm to account for them. Our method
does not assume identifiability of the unobserved confounders, and attempts to
reduce the amount of confounding bias. We demonstrate through extensive
experiments and ablations the efficacy of our approach on a sepsis management
benchmark, as well as on electronic health records. Our results suggest that
nonidentifiable hidden confounding bias can be mitigated to improve offline RL
solutions in practice.
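The idea of pessimism under disagreement between compatible world models can be sketched roughly as follows. This is an assumption-laden illustration rather than the paper's method: it takes the spread of value estimates across models as a stand-in for delphic uncertainty and subtracts it from the mean value. The function name `pessimistic_action` and the `penalty` parameter are hypothetical.

```python
import statistics


def pessimistic_action(q_by_model, penalty=1.0):
    """Pick the action with the best uncertainty-penalized value.

    Assumption: `q_by_model[m][a]` is the value of action `a` under world
    model `m`, and every model is compatible with the observed data.
    """
    n_actions = len(q_by_model[0])
    scores = []
    for a in range(n_actions):
        values = [q[a] for q in q_by_model]
        mean = statistics.fmean(values)
        # Disagreement across models, used here as a proxy for delphic uncertainty.
        spread = statistics.pstdev(values)
        scores.append(mean - penalty * spread)
    best = max(range(n_actions), key=scores.__getitem__)
    return best, scores
```

With two models that agree on one action and disagree sharply on another, the penalized score favors the action on which they agree, even when the mean values are equal.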