Comparing Exploration Strategies for Q-learning in Random Stochastic Mazes
Balancing the ratio between exploration and exploitation is an important problem in reinforcement learning. This paper evaluates four different exploration strategies combined with Q-learning on random stochastic mazes to investigate their performance. We compare UCB-1, softmax, epsilon-greedy, and pursuit; for this purpose we adapted the UCB-1 and pursuit strategies for use in the Q-learning algorithm. The mazes consist of a single optimal goal state and two suboptimal goal states that lie closer to the agent's starting position, which makes efficient exploration an important aspect of the learning problem. Furthermore, we evaluate two different kinds of reward functions: a normalized one with rewards between 0 and 1, and an unnormalized reward function that penalizes the agent for each step with a negative reward. We performed an extensive grid search to find the best parameters for each method and used the best parameters on novel, randomly generated maze problems of different sizes. The results show that softmax exploration outperforms the other strategies, although its temperature parameter is harder to tune. The worst performing exploration strategy is epsilon-greedy.
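As a point of reference, the softmax (Boltzmann) and epsilon-greedy strategies compared above can be sketched in a few lines. This is a minimal illustration assuming a tabular Q-table; the function names are ours, and tau and epsilon correspond to the parameters tuned in the paper's grid search.

    import numpy as np

    def softmax_action(q_values, tau):
        """Boltzmann (softmax) exploration: sample action a with
        probability proportional to exp(Q(s, a) / tau)."""
        prefs = (q_values - np.max(q_values)) / tau  # shift for numerical stability
        probs = np.exp(prefs)
        probs /= probs.sum()
        return np.random.choice(len(q_values), p=probs)

    def epsilon_greedy_action(q_values, epsilon):
        """Epsilon-greedy: explore uniformly with probability epsilon,
        otherwise take the greedy action."""
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

A low tau makes softmax nearly greedy while a high tau approaches uniform exploration, which is consistent with the paper's observation that the temperature is sensitive to tune.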
Energy Regularized RNNs for Solving Non-Stationary Bandit Problems
We consider a Multi-Armed Bandit problem in which the rewards are non-stationary and are dependent on past actions and potentially on past contexts. At the heart of our method, we employ a recurrent neural network, which models these sequences. In order to balance between exploration and exploitation, we present an energy minimization term that prevents the neural network from becoming too confident in support of a certain action. This term provably limits the gap between the maximal and minimal probabilities assigned by the network. In a diverse set of experiments, we demonstrate that our method is at least as effective as methods suggested to solve the sub-problem of Rotting Bandits, and can solve intuitive extensions of various benchmark problems. We share our implementation at https://github.com/rotmanmi/Energy-Regularized-RNN
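The abstract does not give the exact form of the energy term (that is in the paper and the linked repository), but its stated effect, limiting the gap between the largest and smallest action probabilities, suggests a penalty of roughly the following shape. This is an illustrative sketch only; energy_regularized_loss and lam are hypothetical names, not the authors' code.

    import torch

    def energy_regularized_loss(task_loss, action_probs, lam):
        """Illustrative penalty on over-confident policies: the larger the
        gap between the maximal and minimal probabilities the RNN assigns
        to the arms, the larger the added term. `lam` weights the penalty
        (hypothetical; the paper's actual energy term may differ)."""
        gap = action_probs.max(dim=-1).values - action_probs.min(dim=-1).values
        return task_loss + lam * gap.mean()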
Proposta de uma ferramenta de ensino de inteligência artificial utilizando aprendizado por reforço aplicado a solução de labirintos dinâmicos / Proposal for an artificial intelligence teaching tool using reinforcement learning applied to solving dynamic mazes
This work presents the development of an artificial intelligence (AI) system applied to the navigation of autonomous robots. In particular, the AI system developed here is based on the reinforcement learning (RL) technique applied to solving dynamic mazes. Bringing together different research areas, such as AI, signal processing, and control and automation, allows important engineering topics to be investigated. In this context, this work provides an RL framework for robotics. The results obtained with the RL strategies allow the quality of the implemented AI system to be assessed and confirm the effectiveness of the framework developed in this article.
Continuous-action Reinforcement Learning for Playing Racing Games: Comparing SPG to PPO
In this paper, a novel racing environment for OpenAI Gym is introduced. This environment operates with continuous action and state spaces and requires agents to learn to control the acceleration and steering of a car while navigating a randomly generated racetrack. Different versions of two actor-critic learning algorithms are tested on this environment: Sampled Policy Gradient (SPG) and Proximal Policy Optimization (PPO). An extension of SPG is introduced that aims to improve learning performance by weighting action samples during the policy update step. The effect of using experience replay (ER) is also investigated. To this end, a modification to PPO is introduced that allows for training using old action samples by optimizing the actor in log space. Finally, a new technique for performing ER is tested that aims to improve learning speed without sacrificing performance by splitting the training into two parts, whereby networks are first trained using state transitions from the replay buffer, and then using only recent experiences. The results indicate that experience replay is not beneficial to PPO in continuous action spaces. The training of SPG seems to be more stable when actions are weighted. All versions of SPG outperform PPO when ER is used. The ER trick is effective at improving training speed on a computationally less intensive version of SPG.
Comment: 12 pages, 9 figures. Code is available at https://github.com/mario-holubar/RacingR
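The split-ER schedule described above lends itself to a short sketch: train first on samples from the full replay buffer, then only on recent experiences. This is our reading of the abstract, not the authors' code (see the linked repository); agent.update, the buffer arguments, and the step counts are hypothetical placeholders.

    import random

    def split_er_training(agent, replay_buffer, recent_experiences,
                          n_replay_steps, n_recent_steps, batch_size):
        """Two-part training: phase 1 uses off-policy transitions from the
        replay buffer, phase 2 uses only recent (near on-policy) data."""
        # Phase 1: cheap, sample-efficient updates from stored transitions.
        for _ in range(n_replay_steps):
            batch = random.sample(replay_buffer, batch_size)
            agent.update(batch)  # hypothetical update step (e.g. an SPG update)
        # Phase 2: restrict updates to recent experiences so the final
        # policy is trained on data close to its current behavior.
        for _ in range(n_recent_steps):
            batch = random.sample(recent_experiences,
                                  min(batch_size, len(recent_experiences)))
            agent.update(batch)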