We consider finite horizon Markov decision processes under performance
measures that involve both the mean and the variance of the cumulative reward.
We show that either randomized or history-based policies can improve
performance relative to deterministic Markovian policies. We prove that the
complexity of computing a policy that maximizes
the mean reward under a variance constraint is NP-hard for some cases, and
strongly NP-hard for others. Finally, we offer pseudopolynomial exact and
approximation algorithms.

Comment: A full version of an ICML 2011 paper.
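For concreteness, the variance-constrained problem referred to above can be sketched as follows; the horizon T, per-step rewards r_t, cumulative reward R, variance budget b, and policy symbol \pi are our own notation and are not fixed by the abstract.

% Sketch of the variance-constrained mean-maximization problem described in the
% abstract; all symbols below are illustrative notation, not the paper's own.
\[
  \max_{\pi}\ \mathbb{E}^{\pi}[R]
  \quad \text{subject to} \quad
  \operatorname{Var}^{\pi}(R) \le b,
  \qquad
  R = \sum_{t=0}^{T-1} r_t,
\]
where the maximization ranges over policies \(\pi\) that may be randomized and history-dependent, consistent with the abstract's first result.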