4 research outputs found

    Deep Variational Reinforcement Learning for POMDPs

    Full text link
    Many real-world sequential decision-making problems are partially observable by nature, and the environment model is typically unknown. Consequently, there is a great need for reinforcement learning methods that can tackle such problems given only a stream of incomplete and noisy observations. In this paper, we propose deep variational reinforcement learning (DVRL), which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information. We develop an n-step approximation to the evidence lower bound (ELBO), allowing the model to be trained jointly with the policy. This ensures that the latent state representation is suitable for the control task. In experiments on Mountain Hike and flickering Atari we show that our method outperforms previous approaches relying on recurrent neural networks to encode the past.
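
    A minimal sketch of the core idea in the abstract, assuming standard PyTorch modules: an ELBO accumulated over an n-step window and optimised jointly with a policy loss, so both objectives shape the latent state. The module names, dimensions, and environment stand-ins below are illustrative assumptions, not the DVRL reference implementation (which maintains a particle-filter belief over latent states).

    # Illustrative n-step ELBO trained jointly with a policy loss (assumed interfaces).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    obs_dim, act_dim, latent_dim = 8, 4, 16
    encoder = nn.Linear(obs_dim + latent_dim, 2 * latent_dim)  # q(z_t | o_t, z_{t-1})
    transition = nn.Linear(latent_dim, 2 * latent_dim)         # prior p(z_t | z_{t-1})
    decoder = nn.Linear(latent_dim, obs_dim)                   # p(o_t | z_t)
    policy = nn.Linear(latent_dim, act_dim)                    # pi(a_t | z_t)

    def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
        # KL(N(mu_q, var_q) || N(mu_p, var_p)), summed over latent dimensions.
        return 0.5 * (logvar_p - logvar_q
                      + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                      - 1.0).sum(-1)

    def n_step_elbo(observations, z_prev):
        # observations: (n, obs_dim) window; returns the ELBO summed over the window
        # and the final latent state, which the policy conditions on.
        elbo, z = 0.0, z_prev
        for o in observations:
            mu_q, logvar_q = encoder(torch.cat([o, z])).chunk(2)
            mu_p, logvar_p = transition(z).chunk(2)
            z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterised sample
            log_lik = -F.mse_loss(decoder(z), o, reduction="sum")       # Gaussian likelihood up to a constant
            elbo = elbo + log_lik - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
        return elbo, z

    obs_window = torch.randn(5, obs_dim)   # stand-in for n observations from the environment
    ret = torch.tensor(1.0)                # stand-in for an empirical return
    elbo, z = n_step_elbo(obs_window, torch.zeros(latent_dim))
    log_prob = F.log_softmax(policy(z), dim=-1)[0]   # log pi(a=0 | z), placeholder action
    loss = -elbo - ret * log_prob                    # joint model-and-policy objective
    loss.backward()

    Because the policy is fed the latent state produced by the inference network, the gradient of the policy term flows back into the model parameters, which is what ties the learned representation to the control task.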

    Fast exploration and learning of latent graphs with aliased observations

    Full text link
    We consider the problem of recovering a latent graph where the observations at each node are aliased and transitions are stochastic. Observations are gathered by an agent traversing the graph. Aliasing means that multiple nodes emit the same observation, so the agent cannot know in which node it is located. The agent needs to uncover the hidden topology as accurately as possible and in as few steps as possible. This is equivalent to efficient recovery of the transition probabilities of a partially observable Markov decision process (POMDP) in which the observation probabilities are known. An algorithm for efficiently exploring (and ultimately recovering) the latent graph is provided. Our approach is exponentially faster than naive exploration in a variety of challenging topologies with aliased observations, while remaining competitive with existing baselines in the unaliased regime.
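
    A toy illustration of why aliasing makes the problem hard, assuming a small hand-built graph (this is the naive observation-level baseline, not the paper's algorithm): two latent nodes emit the same observation, so an explorer that only counts observation-to-observation transitions cannot recover the hidden topology.

    # Toy latent graph with aliased observations and a naive random-walk explorer.
    import random
    from collections import defaultdict

    # Latent graph: node -> list of (next_node, probability); nodes 1 and 3 are aliased.
    transitions = {
        0: [(1, 0.5), (2, 0.5)],
        1: [(3, 1.0)],
        2: [(0, 1.0)],
        3: [(0, 1.0)],
    }
    observation_of = {0: "A", 1: "B", 2: "C", 3: "B"}   # two distinct nodes both emit "B"

    def step(node):
        nxt, probs = zip(*transitions[node])
        return random.choices(nxt, weights=probs, k=1)[0]

    # Naive exploration: random walk, counting transitions between observations only.
    counts = defaultdict(lambda: defaultdict(int))
    node = 0
    for _ in range(10_000):
        nxt = step(node)
        counts[observation_of[node]][observation_of[nxt]] += 1
        node = nxt

    for obs, row in sorted(counts.items()):
        total = sum(row.values())
        print(obs, {o: round(c / total, 2) for o, c in sorted(row.items())})

    The statistics for the two "B" nodes are merged, so the latent structure behind observation "B" is unrecoverable this way; the paper's contribution is an exploration strategy that disambiguates aliased nodes while keeping the number of steps small.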

    Dynamic-depth context tree weighting

    No full text
    Reinforcement learning (RL) in partially observable settings is challenging because the agent’s immediate observations are not Markov. Recently proposed methods can learn variable-order Markov models of the underlying process but have steep memory requirements and are sensitive to aliasing between observation histories due to sensor noise. This paper proposes utile context tree weighting (UCTW), a model-learning method that addresses these limitations. UCTW dynamically expands a suffix tree while ensuring that the total size of the model, but not its depth, remains bounded. We show that UCTW approximately matches the performance of state-of-the-art alternatives at stochastic time-series prediction while using at least an order of magnitude less memory. We also apply UCTW to model-based RL, showing that, on tasks that require memory of past observations, UCTW can learn without prior knowledge of a good state representation, or even the length of history upon which such a representation should depend.
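
    A simplified sketch of the underlying data-structure idea, under assumptions of our own: a context (suffix) tree that grows deeper contexts on demand but enforces a budget on the total number of nodes rather than on depth. It predicts from the deepest matching context with a KT-style estimator and omits CTW's Bayesian mixture over depths, so it is not the UCTW algorithm itself; MAX_NODES, update, and predict are illustrative names.

    # Context tree with a bound on total size, not depth (simplified sketch).
    from collections import defaultdict

    MAX_NODES = 64   # assumed budget on total model size

    class ContextNode:
        def __init__(self):
            self.counts = defaultdict(int)   # next-symbol counts at this context
            self.children = {}               # symbol -> longer-context child node

    root = ContextNode()
    num_nodes = 1

    def update(history, next_symbol, expand=True):
        # Walk the tree along the reversed history, updating counts and growing
        # deeper contexts only while the node budget allows.
        global num_nodes
        node = root
        node.counts[next_symbol] += 1
        for sym in reversed(history):
            if sym not in node.children:
                if not expand or num_nodes >= MAX_NODES:
                    break                     # stop growing: size, not depth, is bounded
                node.children[sym] = ContextNode()
                num_nodes += 1
            node = node.children[sym]
            node.counts[next_symbol] += 1

    def predict(history, symbol):
        # Probability of `symbol` from the deepest matching context (KT estimator, binary alphabet).
        node = root
        for sym in reversed(history):
            if sym not in node.children:
                break
            node = node.children[sym]
        total = sum(node.counts.values())
        return (node.counts[symbol] + 0.5) / (total + 1.0)

    # Example: learn a simple alternating sequence 0,1,0,1,...
    seq = [0, 1] * 200
    for t in range(2, len(seq)):
        update(seq[max(0, t - 8):t], seq[t])
    print(round(predict([0, 1, 0, 1, 0, 1], 0), 3))   # close to 1.0 after training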
