We address a practical problem ubiquitous in modern marketing campaigns, in
which a central agent tries to learn a policy for allocating strategic
financial incentives to customers and observes only bandit feedback. In
contrast to traditional policy optimization frameworks, we take into account
the additional reward structure and budget constraints common in this setting,
and develop a new two-step method for solving this constrained counterfactual
policy optimization problem. Our method first casts the reward estimation
problem as a domain adaptation problem with supplementary structure, and
subsequently uses the resulting estimators to optimize the policy subject to
the constraints. We
also establish theoretical error bounds for our estimation procedure, and we
empirically show that the approach yields significant improvements on both
synthetic and real datasets.