A major challenge in reinforcement learning is to determine which
state-action pairs are responsible for delayed future rewards. Return
Decomposition offers a solution by redistributing rewards from observed
sequences while preserving policy invariance. While the majority of current
approaches construct the reward redistribution in an uninterpretable manner, we
propose to explicitly model the contributions of state and action from a causal
perspective, resulting in an interpretable return decomposition. In this paper,
we first study the role of causal generative models in return decomposition by
characterizing the generation of Markovian rewards and the trajectory-wise
long-term return, and we then propose a framework, called Generative Return
Decomposition (GRD), for policy optimization in delayed
reward scenarios. Specifically, GRD first identifies the unobservable Markovian
rewards and causal relations in the generative process. Then, GRD makes use of
the identified causal generative model to form a compact representation and
trains the policy over the most favorable subspace of the agent's state space.
Theoretically, we show that the unobservable Markovian reward function, as well
as the underlying causal structure and causal models, is identifiable.
Experimental results show that our method outperforms state-of-the-art
approaches, and the provided visualizations further demonstrate the
interpretability of our method.
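
For concreteness, the return-decomposition setting referred to above can be
summarized as follows; the notation is illustrative rather than taken from the
body of the paper. The agent observes only the trajectory-wise return, which is
generated by unobservable Markovian rewards, and a reward redistribution that
preserves this return for every trajectory leaves the optimal policy unchanged:
\[
  R(\tau) = \sum_{t=0}^{T} r(s_t, a_t),
  \qquad
  \sum_{t=0}^{T} \hat{r}(s_t, a_t) = R(\tau) \ \text{ for every trajectory } \tau,
\]
where $\tau = (s_0, a_0, \dots, s_T, a_T)$ denotes a trajectory, $r$ the
unobservable Markovian reward function, and $\hat{r}$ the redistributed proxy
reward used for policy optimization.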