Comparing Exploration Strategies for Q-learning in Random Stochastic Mazes
Balancing the ratio between exploration and exploitation is an important problem in reinforcement learning. This paper evaluates four different exploration strategies combined with Q-learning on random stochastic mazes to investigate their performance. We compare UCB-1, softmax, epsilon-greedy, and pursuit; for this purpose we adapted the UCB-1 and pursuit strategies for use in the Q-learning algorithm. The mazes consist of a single optimal goal state and two suboptimal goal states that lie closer to the agent's starting position, which makes efficient exploration an important aspect of the learning problem. Furthermore, we evaluate two different kinds of reward functions: a normalized one with rewards between 0 and 1, and an unnormalized reward function that penalizes the agent for each step with a negative reward. We performed an extensive grid search to find the best parameters for each method and used the best parameters on novel, randomly generated maze problems of different sizes. The results show that softmax exploration outperforms the other strategies, although its temperature parameter is harder to tune. The worst performing exploration strategy is epsilon-greedy.
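As a point of reference, the softmax (Boltzmann) and epsilon-greedy strategies compared above can be sketched in a few lines. This is a minimal illustration assuming a tabular Q-table; the function names are ours, and tau and epsilon correspond to the parameters tuned in the paper's grid search.

    import numpy as np

    def softmax_action(q_values, tau):
        """Boltzmann (softmax) exploration: sample action a with
        probability proportional to exp(Q(s, a) / tau)."""
        prefs = (q_values - np.max(q_values)) / tau  # shift for numerical stability
        probs = np.exp(prefs)
        probs /= probs.sum()
        return np.random.choice(len(q_values), p=probs)

    def epsilon_greedy_action(q_values, epsilon):
        """Epsilon-greedy: explore uniformly with probability epsilon,
        otherwise take the greedy action."""
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

A low tau makes softmax nearly greedy while a high tau approaches uniform exploration, which is consistent with the paper's observation that the temperature is sensitive to tune.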
Energy Regularized RNNs for Solving Non-Stationary Bandit Problems
We consider a Multi-Armed Bandit problem in which the rewards are non-stationary and are dependent on past actions and potentially on past contexts. At the heart of our method, we employ a recurrent neural network, which models these sequences. In order to balance between exploration and exploitation, we present an energy minimization term that prevents the neural network from becoming too confident in support of a certain action. This term provably limits the gap between the maximal and minimal probabilities assigned by the network. In a diverse set of experiments, we demonstrate that our method is at least as effective as methods suggested to solve the sub-problem of Rotting Bandits, and can solve intuitive extensions of various benchmark problems. We share our implementation at https://github.com/rotmanmi/Energy-Regularized-RNN
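The abstract does not give the exact form of the energy term (that is in the paper and the linked repository), but its stated effect, limiting the gap between the largest and smallest action probabilities, suggests a penalty of roughly the following shape. This is an illustrative sketch only; energy_regularized_loss and lam are hypothetical names, not the authors' code.

    import torch

    def energy_regularized_loss(task_loss, action_probs, lam):
        """Illustrative penalty on over-confident policies: the larger the
        gap between the maximal and minimal probabilities the RNN assigns
        to the arms, the larger the added term. `lam` weights the penalty
        (hypothetical; the paper's actual energy term may differ)."""
        gap = action_probs.max(dim=-1).values - action_probs.min(dim=-1).values
        return task_loss + lam * gap.mean()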
Proposta de uma ferramenta de ensino de inteligência artificial utilizando aprendizado por reforço aplicado a solução de labirintos dinâmicos / Proposal for an artificial intelligence teaching tool using reinforcement learning applied to solving dynamic mazes
This work presents the development of an artificial intelligence (AI) system applied to the navigation of autonomous robots. In particular, the AI system developed here is based on the reinforcement learning (RL) technique applied to solving dynamic mazes. Bringing together different research areas, such as AI, signal processing, and control and automation, allows important engineering topics to be investigated. In this context, this work provides an RL framework for robotics. The results obtained with the RL strategies allow the quality of the implemented AI system to be assessed and confirm the effectiveness of the framework developed in this article.
Continuous-action Reinforcement Learning for Playing Racing Games: Comparing SPG to PPO
In this paper, a novel racing environment for OpenAI Gym is introduced. This environment operates with continuous action and state spaces and requires agents to learn to control the acceleration and steering of a car while navigating a randomly generated racetrack. Different versions of two actor-critic learning algorithms are tested on this environment: Sampled Policy Gradient (SPG) and Proximal Policy Optimization (PPO). An extension of SPG is introduced that aims to improve learning performance by weighting action samples during the policy update step. The effect of using experience replay (ER) is also investigated. To this end, a modification to PPO is introduced that allows for training using old action samples by optimizing the actor in log space. Finally, a new technique for performing ER is tested that aims to improve learning speed without sacrificing performance by splitting the training into two parts, whereby networks are first trained using state transitions from the replay buffer, and then using only recent experiences. The results indicate that experience replay is not beneficial to PPO in continuous action spaces. The training of SPG seems to be more stable when actions are weighted. All versions of SPG outperform PPO when ER is used. The ER trick is effective at improving training speed on a computationally less intensive version of SPG.
Comment: 12 pages, 9 figures. Code is available at https://github.com/mario-holubar/RacingR
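The split-ER schedule described above lends itself to a short sketch: train first on samples from the full replay buffer, then only on recent experiences. This is our reading of the abstract, not the authors' code (see the linked repository); agent.update, the buffer arguments, and the step counts are hypothetical placeholders.

    import random

    def split_er_training(agent, replay_buffer, recent_experiences,
                          n_replay_steps, n_recent_steps, batch_size):
        """Two-part training: phase 1 uses off-policy transitions from the
        replay buffer, phase 2 uses only recent (near on-policy) data."""
        # Phase 1: cheap, sample-efficient updates from stored transitions.
        for _ in range(n_replay_steps):
            batch = random.sample(replay_buffer, batch_size)
            agent.update(batch)  # hypothetical update step (e.g. an SPG update)
        # Phase 2: restrict updates to recent experiences so the final
        # policy is trained on data close to its current behavior.
        for _ in range(n_recent_steps):
            batch = random.sample(recent_experiences,
                                  min(batch_size, len(recent_experiences)))
            agent.update(batch)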