
    Q-learning with Experience Replay in a Dynamic Environment

    Most research in reinforcement learning has focused on stationary environments. In this paper, we propose several adaptations of Q-learning for a dynamic environment, for both single and multiple agents. The environment consists of a grid of random rewards, where every reward is removed after it is visited. We focus on experience replay, a technique that has recently received considerable attention, and combine this method with Q-learning. We compare two variations of experience replay, in which experiences are reused based either on time or on the obtained reward. For multi-agent reinforcement learning we compare two variations of policy representation: in the first the agents share a Q-function, while in the second each agent has a separate Q-function. Furthermore, in both variations we test the effect of reward sharing between the agents. This leads to four different multi-agent reinforcement learning algorithms, of which sharing a Q-function and sharing the rewards is the most cooperative method. The results show that in the single-agent environment both experience replay algorithms significantly outperform standard Q-learning and a greedy benchmark agent. In the multi-agent environment the highest maximum reward sum in a trial is achieved with one shared Q-function and reward sharing, while the highest mean reward sum is obtained with separate Q-functions and separate rewards.
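
    As an illustration of the single-agent method summarized above, the sketch below combines tabular Q-learning with a replay buffer whose stored experiences are reused either by recency (time) or by obtained reward. It is a minimal sketch under assumed details: the class name, parameters, and the "time"/"reward" selection strategies are illustrative and not taken from the paper.

```python
# Minimal sketch: tabular Q-learning with experience replay.
# All names and hyperparameters here are illustrative assumptions.
import random
from collections import defaultdict, deque

class ReplayQLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1,
                 buffer_size=1000, replay_by="time"):
        self.q = defaultdict(float)          # Q[(state, action)] -> value
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.buffer = deque(maxlen=buffer_size)
        self.replay_by = replay_by           # "time" or "reward"

    def act(self, state):
        # Epsilon-greedy action selection.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def replay(self, n=10):
        # Reuse stored experiences: the most recent ones ("time")
        # or the ones with the highest obtained reward ("reward").
        if not self.buffer:
            return
        if self.replay_by == "time":
            batch = list(self.buffer)[-n:]
        else:
            batch = sorted(self.buffer, key=lambda e: e[2], reverse=True)[:n]
        for s, a, r, s_next in batch:
            self.update(s, a, r, s_next)
```

    In a training loop one would call act, store, and update on every environment step, and call replay periodically to reuse past experiences; the multi-agent variants described in the abstract would differ only in whether the agents index into one shared Q-table or separate ones, and in whether rewards are summed across agents before the update.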

    Reinforcement Learning in Dynamic Environments using Instantiated Information

    We study the use of reinforcement learning in dynamic environments. Such environments may contain many dynamic objects, which makes optimal planning hard. One way of using information about all dynamic objects is to expand the state description, but this results in a high-dimensional policy space. Our approach is to instantiate information about dynamic objects in the model of the environment and to replan using model-based reinforcement learning whenever this information changes. Furthermore, our approach is combined with an a priori model of the changing parts of the environment, which enables the agent to plan an optimal course of action. Results on a navigation task with multiple dynamic hostile agents show that our system learns good solutions that minimize the risk of hitting hostile agents.
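
    The replanning idea in this abstract can be sketched as model-based planning that is rerun whenever the instantiated information about dynamic objects changes. The example below assumes a small deterministic grid, value iteration as the planner, and toy hostile-agent movement; all names and numbers (GRID, HOSTILE_PENALTY, plan, policy, etc.) are illustrative assumptions, not the paper's actual model.

```python
# Minimal sketch: replan with value iteration whenever the instantiated
# hostile-agent positions change. All constants are illustrative assumptions.
import itertools

GRID = 5                      # assumed 5x5 grid
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GOAL = (4, 4)
GOAL_REWARD = 10.0
HOSTILE_PENALTY = -10.0       # assumed cost of entering a hostile cell
STEP_COST = -0.1
GAMMA = 0.95

def reward(cell, hostiles):
    if cell == GOAL:
        return GOAL_REWARD
    if cell in hostiles:
        return HOSTILE_PENALTY
    return STEP_COST

def step(cell, action):
    # Deterministic move, clamped to the grid boundaries.
    x, y = cell
    dx, dy = ACTIONS[action]
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))

def plan(hostiles, sweeps=50):
    # Value iteration on the instantiated model; the goal cell is absorbing.
    cells = list(itertools.product(range(GRID), repeat=2))
    v = {c: 0.0 for c in cells}
    for _ in range(sweeps):
        for c in cells:
            if c == GOAL:
                continue
            v[c] = max(reward(step(c, a), hostiles) + GAMMA * v[step(c, a)]
                       for a in ACTIONS)
    return v

def policy(cell, v, hostiles):
    # Greedy action with respect to the planned value function.
    return max(ACTIONS, key=lambda a: reward(step(cell, a), hostiles)
               + GAMMA * v[step(cell, a)])

# Replan whenever the instantiated information (hostile positions) changes.
hostiles = {(2, 2)}
v = plan(hostiles)
state = (0, 0)
for t in range(20):
    new_hostiles = {(2, 2 + (t % 2))}      # toy movement of a hostile agent
    if new_hostiles != hostiles:
        hostiles, v = new_hostiles, plan(new_hostiles)   # replan on change
    state = step(state, policy(state, v, hostiles))
    if state == GOAL:
        break
```

    The design point the sketch tries to convey is that only the model is updated when a dynamic object moves; the planner is rerun on the updated model rather than the state space being expanded to encode every object's position.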