We develop methodology for a multistage decision problem with a flexible number
of stages in which the rewards are survival times that are subject to
censoring. We present a novel Q-learning algorithm that is adjusted for
censored data and allows a flexible number of stages. We provide finite sample
bounds on the generalization error of the policy learned by the algorithm, and
show that when the optimal Q-function belongs to the approximation space, the
expected survival time for policies obtained by the algorithm converges to that
of the optimal policy. We simulate a multistage clinical trial with a flexible
number of stages and apply the proposed censored-Q-learning algorithm to find
individualized treatment regimens. The methodology presented in this paper has
implications for the design of personalized medicine trials in cancer and in
other life-threatening diseases.

Comment: Published at http://dx.doi.org/10.1214/12-AOS968 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)