This supplementary material contains the detailed proofs and analysis of the theoretical results presented in the paper.

Additional Notation: We first introduce additional notation, not used in the paper, that is useful in some of the proofs. In particular, we define $d^t_{\omega,\pi}$ as the distribution of states at time $t$ if we execute $\pi$ from time step 1 to $t-1$, starting from distribution $\omega$ at time 1, and $d_{\omega,\pi} = (1-\gamma) \sum_{t=1}^{\infty} \gamma^{t-1} d^t_{\omega,\pi}$ as the discounted distribution of states over the infinite horizon if we follow $\pi$, starting in $\omega$ at time 1.

1.1. Relating Performance to Error in Model

This subsection presents a number of useful lemmas relating the performance (in terms of expected total cost) of a policy in the real system to the predictive error of the learned model from which the policy was computed.

Lemma 1.1. Suppose we learned an approximate model $\hat{T}$ instead of the true model $T$, and let $\hat{V}^\pi$ denote the value function of $\pi$ under $\hat{T}$. Then for any state distribution $\omega$:
$$\mathbb{E}_{s \sim \omega}[V^\pi(s) - \hat{V}^\pi(s)] = \frac{\gamma}{1-\gamma} \, \mathbb{E}_{s \sim d_{\omega,\pi},\, a \sim \pi(\cdot|s)}\Big[\mathbb{E}_{s' \sim T(\cdot|s,a)}[\hat{V}^\pi(s')] - \mathbb{E}_{s' \sim \hat{T}(\cdot|s,a)}[\hat{V}^\pi(s')]\Big].$$

Proof
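The relation in Lemma 1.1 is an exact identity between the value gap $\mathbb{E}_{\omega}[V^\pi - \hat{V}^\pi]$ and the one-step model error on $\hat{V}^\pi$, averaged over the discounted distribution $d_{\omega,\pi}$. It can be checked numerically on a small finite MDP, as the sketch below does; the random instance, the state/action sizes, and all variable names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9  # illustrative sizes and discount

def random_stochastic(shape):
    """Random nonnegative array normalized along its last axis."""
    M = rng.random(shape)
    return M / M.sum(axis=-1, keepdims=True)

T = random_stochastic((S, A, S))      # true model: T[s, a, s']
T_hat = random_stochastic((S, A, S))  # learned (approximate) model
c = rng.random((S, A))                # cost function c(s, a)
pi = random_stochastic((S, A))        # stochastic policy pi(a | s)

# Policy-induced transition matrices and cost: P[s, s'] = sum_a pi(a|s) T[s, a, s'].
P = np.einsum('sa,sax->sx', pi, T)
P_hat = np.einsum('sa,sax->sx', pi, T_hat)
c_pi = np.einsum('sa,sa->s', pi, c)

# Value functions solve the Bellman equation: V = c_pi + gamma P V.
I = np.eye(S)
V = np.linalg.solve(I - gamma * P, c_pi)          # V^pi under T
V_hat = np.linalg.solve(I - gamma * P_hat, c_pi)  # V^pi under T_hat

# Discounted state distribution: d_{omega,pi} = (1-gamma) omega^T (I - gamma P)^{-1}.
omega = random_stochastic(S)
d = (1 - gamma) * np.linalg.solve((I - gamma * P).T, omega)

# Lemma 1.1: E_omega[V - V_hat] = gamma/(1-gamma) E_d[(P - P_hat) V_hat].
lhs = omega @ (V - V_hat)
rhs = gamma / (1 - gamma) * (d @ ((P - P_hat) @ V_hat))
print(abs(lhs - rhs))  # agrees up to floating-point round-off
```

The two sides agree exactly (up to floating-point error) because the identity follows from $(I - \gamma P)(V^\pi - \hat{V}^\pi) = \gamma (P - \hat{P}) \hat{V}^\pi$, with no inequality or approximation involved.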