We study policy evaluation of offline contextual bandits subject to
unobserved confounders. Sensitivity analysis methods are commonly used to
estimate the policy value under worst-case confounding over a given
uncertainty set. However, existing work often resorts to a coarse relaxation
of the uncertainty set for the sake of tractability, leading to overly
conservative estimates of the policy value. In this paper, we propose a
general estimator that provides a sharp lower bound of the policy value using
convex programming. The generality of our estimator enables various extensions
such as sensitivity analysis with f-divergence, model selection with
cross-validation and information criteria, and robust policy learning with the sharp
lower bound. Furthermore, thanks to strong duality, our estimation method can be
reformulated as an empirical risk minimization problem, which enables
us to provide strong theoretical guarantees for the proposed estimator using
techniques from M-estimation theory.

Comment: This is an extension of the following work:
https://proceedings.mlr.press/v206/ishikawa23a.html. arXiv admin note: text
overlap with arXiv:2302.1334
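To make the convex-programming idea concrete, the following is a minimal sketch (not the paper's exact formulation) of a sharp lower bound under a marginal-sensitivity-style uncertainty set: each normalized importance weight is allowed to deviate from its nominal value by a factor of at most `gamma`, subject to the weights summing to one, and the worst case reduces to a small linear program. The function name `sharp_lower_bound` and the specific constraint set are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def sharp_lower_bound(y, w, gamma):
    """Sharp lower bound on the policy value when each nominal weight w_i
    may vary within [w_i / gamma, w_i * gamma] subject to sum(lam) = 1.
    This is an illustrative sketch, not the estimator from the paper.

    y     : observed outcomes (n,)
    w     : nominal importance weights, assumed to sum to 1 (n,)
    gamma : sensitivity parameter, gamma >= 1
    """
    lo = w / gamma
    hi = w * gamma
    # Worst case: minimize sum_i lam_i * y_i over the uncertainty set,
    # a linear program in the weights lam.
    res = linprog(
        c=y,
        A_eq=np.ones((1, len(y))), b_eq=[1.0],
        bounds=list(zip(lo, hi)),
        method="highs",
    )
    return res.fun

# With gamma = 1 the set collapses to the nominal weights and the bound
# equals the usual weighted estimate; larger gamma shifts mass toward
# small outcomes and the bound decreases.
y = np.array([1.0, 2.0, 3.0])
w = np.array([1 / 3, 1 / 3, 1 / 3])
print(sharp_lower_bound(y, w, 1.0))  # nominal value: 2.0
print(sharp_lower_bound(y, w, 2.0))  # strictly smaller worst-case value
```

Because the inner worst case is a finite-dimensional convex (here linear) program, it can be solved exactly rather than via a conservative relaxation, which is the source of the sharpness claimed in the abstract.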