23 research outputs found

    Comparing Exploration Strategies for Q-learning in Random Stochastic Mazes

    Get PDF
    Balancing exploration and exploitation is an important problem in reinforcement learning. This paper evaluates four exploration strategies combined with Q-learning on random stochastic mazes to compare their performance: UCB-1, softmax, epsilon-greedy, and pursuit. For this purpose, we adapted the UCB-1 and pursuit strategies for use in the Q-learning algorithm. The mazes consist of a single optimal goal state and two suboptimal goal states that lie closer to the agent's starting position, which makes efficient exploration essential for the learning agent. Furthermore, we evaluate two kinds of reward functions: a normalized one with rewards between 0 and 1, and an unnormalized one that penalizes the agent with a negative reward for each step. We performed an extensive grid search to find the best parameters for each method and then applied those parameters to novel, randomly generated maze problems of different sizes. The results show that softmax exploration outperforms the other strategies, although its temperature parameter is harder to tune. The worst-performing exploration strategy is epsilon-greedy.
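
    For context, a minimal sketch (not the authors' code) of how three of the compared action-selection rules can be implemented over a row of Q-values; the parameter names epsilon, tau, and the bonus coefficient c are hypothetical:

        import numpy as np

        def epsilon_greedy(q_values, epsilon=0.1):
            # With probability epsilon pick a uniformly random action, else the greedy one.
            if np.random.rand() < epsilon:
                return np.random.randint(len(q_values))
            return int(np.argmax(q_values))

        def softmax(q_values, tau=1.0):
            # Boltzmann exploration: sample an action with probability proportional to exp(Q / tau).
            prefs = np.asarray(q_values, dtype=float) / tau
            prefs -= prefs.max()            # improves numerical stability
            probs = np.exp(prefs)
            probs /= probs.sum()
            return int(np.random.choice(len(q_values), p=probs))

        def ucb1(q_values, counts, t, c=2.0):
            # UCB-1 adapted to Q-values: add an exploration bonus that shrinks with visit counts.
            bonus = c * np.sqrt(np.log(t + 1) / (np.asarray(counts, dtype=float) + 1e-8))
            return int(np.argmax(np.asarray(q_values) + bonus))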

    Energy Regularized RNNs for Solving Non-Stationary Bandit Problems

    Full text link
    We consider a Multi-Armed Bandit problem in which the rewards are non-stationary and are dependent on past actions and potentially on past contexts. At the heart of our method, we employ a recurrent neural network, which models these sequences. In order to balance between exploration and exploitation, we present an energy minimization term that prevents the neural network from becoming too confident in support of a certain action. This term provably limits the gap between the maximal and minimal probabilities assigned by the network. In a diverse set of experiments, we demonstrate that our method is at least as effective as methods suggested to solve the sub-problem of Rotting Bandits, and can solve intuitive extensions of various benchmark problems. We share our implementation at https://github.com/rotmanmi/Energy-Regularized-RNN
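
    To illustrate the idea of an energy term that limits the gap between the maximal and minimal action probabilities, here is a hedged sketch of a confidence-limiting regularizer in that spirit; it is not the authors' exact formulation, and the function and variable names are hypothetical:

        import torch

        def confidence_gap_penalty(logits, coef=0.1):
            # Sketch of a confidence-limiting regularizer (not the paper's exact energy term):
            # penalize the gap between the largest and smallest action probabilities so the
            # RNN policy cannot commit fully to a single arm.
            probs = torch.softmax(logits, dim=-1)
            gap = probs.max(dim=-1).values - probs.min(dim=-1).values
            return coef * gap.mean()

        # Hypothetical usage inside a training step:
        #   loss = task_loss + confidence_gap_penalty(rnn_logits)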

    Proposta de uma ferramenta de ensino de inteligência artificial utilizando aprendizado por reforço aplicado a solução de labirintos dinâmicos / Proposal for a teaching tool of artificial intelligence using learning by enhancement applied to the solution of dynamic mazes

    Get PDF
    This work presents the development of an artificial intelligence (AI) system applied to the navigation of autonomous robots. In particular, the AI system developed here is based on the reinforcement learning (RL) technique, applied to the solution of dynamic mazes. Bringing together different research areas, such as AI, signal processing, and control and automation, allows important engineering topics to be investigated. In this context, this work provides an RL framework for robotics. The results obtained with the RL strategies allow inferences about the quality of the implemented AI system and confirm the effectiveness of the framework developed in this article.

    Continuous-action Reinforcement Learning for Playing Racing Games: Comparing SPG to PPO

    Get PDF
    In this paper, a novel racing environment for OpenAI Gym is introduced. This environment operates with continuous action and state spaces and requires agents to learn to control the acceleration and steering of a car while navigating a randomly generated racetrack. Different versions of two actor-critic learning algorithms are tested in this environment: Sampled Policy Gradient (SPG) and Proximal Policy Optimization (PPO). An extension of SPG is introduced that aims to improve learning performance by weighting action samples during the policy update step. The effect of using experience replay (ER) is also investigated. To this end, a modification to PPO is introduced that allows training on old action samples by optimizing the actor in log space. Finally, a new technique for performing ER is tested that aims to improve learning speed without sacrificing performance by splitting training into two parts: networks are first trained using state transitions from the replay buffer, and then using only recent experiences. The results indicate that experience replay is not beneficial to PPO in continuous action spaces. The training of SPG appears to be more stable when actions are weighted. All versions of SPG outperform PPO when ER is used. The ER trick is effective at improving training speed on a computationally less intensive version of SPG.
    Comment: 12 pages, 9 figures. Code is available at https://github.com/mario-holubar/RacingR
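
    As an illustration of weighting action samples in an SPG-style actor update, here is a minimal sketch under assumed interfaces; it is not the authors' implementation, and critic (a callable returning a scalar value estimate), n_samples, and sigma are hypothetical names:

        import numpy as np

        def weighted_spg_target(critic, state, policy_action, n_samples=8, sigma=0.1):
            # Perturb the policy's action, score the samples with the critic, and return a
            # value-weighted average action for the actor to regress toward.
            policy_action = np.asarray(policy_action, dtype=float)
            samples = policy_action + sigma * np.random.randn(n_samples, policy_action.shape[-1])
            values = np.array([critic(state, a) for a in samples])
            weights = np.exp(values - values.max())      # softmax weights over sampled actions
            weights /= weights.sum()
            return (weights[:, None] * samples).sum(axis=0)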