5 research outputs found

    On the Convergence of Techniques that Improve Value Iteration

    Prioritising Bellman backups and updating only a small subset of actions are important techniques for speeding up planning in MDPs. The recent literature has introduced efficient approaches that exploit these directions: backward value iteration and backing up only the best actions were shown to reduce planning time significantly. This paper conducts a theoretical and empirical analysis of these techniques and provides several new proofs. In particular, it (1) identifies weaker requirements for the convergence of backups based on best actions only, (2) presents a new method for evaluating the Bellman error of the update that backs up a single best action at a time, (3) gives a theoretical proof for backward value iteration and establishes the required initialisation, and (4) shows that the default state ordering of backups in standard value iteration can significantly influence its performance. Additionally, (5) since the existing literature has not compared these methods, either empirically or analytically, against policy iteration, this paper provides such a comparison. The rigorous empirical and novel theoretical parts of the paper reveal important relationships and allow drawing guidelines on which type of value or policy iteration is suitable for a given domain. Finally, our chief message is that standard value iteration can be made far more efficient by the simple modifications shown in the paper.
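    The techniques the abstract names admit a compact illustration. Below is a minimal sketch, under illustrative assumptions (a made-up 3-state MDP), of in-place Gauss-Seidel value iteration; the state_order argument only shows how changing the backup ordering, e.g. sweeping states backwards, can affect convergence. It is a toy, not the paper's prioritised or best-actions-only algorithms.

import numpy as np

# Toy 3-state, 2-action MDP (illustrative assumption, not from the paper).
n_states, n_actions, gamma = 3, 2, 0.95
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] transition probabilities
P[0, 0, 1] = 1.0; P[0, 1, 2] = 1.0
P[1, 0, 2] = 1.0; P[1, 1, 0] = 1.0
P[2, :, 2] = 1.0                                 # state 2 is absorbing
R = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])   # R[s, a] expected rewards

def value_iteration(V, sweeps=1000, tol=1e-8, state_order=None):
    """In-place (Gauss-Seidel) value iteration; state_order controls backup ordering."""
    order = list(state_order) if state_order is not None else list(range(n_states))
    for _ in range(sweeps):
        bellman_error = 0.0
        for s in order:
            q = R[s] + gamma * (P[s] @ V)        # Q(s, a) for every action (full backup)
            new_v = q.max()
            bellman_error = max(bellman_error, abs(new_v - V[s]))
            V[s] = new_v
        if bellman_error < tol:                  # stop once the largest update is tiny
            break
    return V

print(value_iteration(np.zeros(n_states)))                                         # default ordering
print(value_iteration(np.zeros(n_states), state_order=reversed(range(n_states))))  # backward ordering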

    Reinforcement Learning in Robotic Task Domains with Deictic Descriptor Representation

    In the field of reinforcement learning, robot task learning in a specific environment modeled as a Markov decision process has seen much success. Extending these results to learning a task over an entire environment domain, however, has not been as fruitful, even for advanced methodologies such as relational reinforcement learning. In our research into robot learning in environment domains, we use a form of deictic representation for the robot’s description of the task environment. However, the non-Markovian nature of the deictic representation leads to perceptual aliasing and conflicting actions, invalidating standard reinforcement learning algorithms. To circumvent this difficulty, several past studies have modified and extended the Q-learning algorithm to the deictic-representation case, with mixed results. Taking a different tack, we introduce a learning algorithm that searches deictic policy space directly, abandoning indirect value-based methods. We apply the policy learning algorithm to several different tasks in environment domains. The results compare favorably with value-based learners and with existing literature results.
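    To make the contrast between value-based learning and direct policy search concrete, here is a minimal sketch under illustrative assumptions: hill climbing over a tabular policy on a toy corridor task, scored by Monte Carlo rollouts rather than learned Q-values. The environment, scoring, and mutation operator are stand-ins, not the deictic policy-search algorithm from the thesis.

import random

# Toy 1-D corridor (illustrative assumption): move left/right, reach the goal cell.
ACTIONS = ["left", "right"]
N_CELLS, GOAL, SLIP, EPISODE_LEN, ROLLOUTS = 5, 4, 0.1, 20, 10

def rollout(policy):
    """Total reward of one episode under a deterministic policy with action slip."""
    pos, total = 0, 0.0
    for _ in range(EPISODE_LEN):
        a = policy[pos] if random.random() > SLIP else random.choice(ACTIONS)
        pos = max(0, pos - 1) if a == "left" else min(N_CELLS - 1, pos + 1)
        if pos == GOAL:
            return total + 1.0
        total -= 0.01                            # small per-step cost
    return total

def score(policy):
    """Monte Carlo estimate of a policy's return -- no value function is learned."""
    return sum(rollout(policy) for _ in range(ROLLOUTS)) / ROLLOUTS

policy = {s: random.choice(ACTIONS) for s in range(N_CELLS)}
for _ in range(200):                             # hill climbing directly in policy space
    s = random.randrange(N_CELLS)
    candidate = dict(policy)
    candidate[s] = random.choice(ACTIONS)
    if score(candidate) >= score(policy):        # keep mutations that do not hurt
        policy = candidate
print(policy)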

    Combining Interaction Models to Improve Coordination in Multiagent Systems

    The main contribution of this article is the implementation of a hybrid coordination method built by combining previously developed interaction models. The interaction models are based on reward sharing for multiagent learning, with the aim of interactively discovering good-quality policies. Exchanging rewards among agents during the interaction is a complex task; if done inadequately, it can delay learning or even cause unexpected behaviour, making cooperation inefficient and causing convergence to an unsatisfactory policy. Building on these concepts, the hybrid method exploits the particularities of each model, reducing possible conflicts between actions rewarded under different policies and improving agent coordination in reinforcement learning problems. Experimental results show that the hybrid method is able to accelerate convergence, quickly reaching optimal policies even in large state spaces and outperforming classical reinforcement learning approaches.
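    The reward-sharing idea behind the interaction models can be illustrated with a minimal sketch, under toy assumptions: two independent Q-learners on a 1-D task, where each agent's update uses its own reward plus a fixed fraction of its partner's. The environment, the sharing rule, and all parameter values are illustrative stand-ins, not the article's interaction models or hybrid method.

import random
from collections import defaultdict

# Two agents on a line of cells try to reach a shared target (toy assumption).
ACTIONS = [-1, 0, 1]
N_POS, TARGET = 10, 5
ALPHA, GAMMA, EPS, SHARE = 0.1, 0.9, 0.1, 0.5    # SHARE = fraction of partner reward received

Q = [defaultdict(float), defaultdict(float)]     # one independent Q-table per agent

def choose(agent, state):
    """Epsilon-greedy action selection from the agent's own Q-table."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[agent][(state, a)])

def update(agent, state, action, reward, next_state):
    """Standard Q-learning update applied to the (possibly shared) reward."""
    best_next = max(Q[agent][(next_state, a)] for a in ACTIONS)
    Q[agent][(state, action)] += ALPHA * (reward + GAMMA * best_next
                                          - Q[agent][(state, action)])

for episode in range(500):
    positions = [0, N_POS]
    for step in range(30):
        transitions = []
        for i in range(2):
            s = positions[i]
            a = choose(i, s)
            positions[i] = min(N_POS, max(0, s + a))
            r = 1.0 if positions[i] == TARGET else -0.01
            transitions.append((s, a, r))
        # interaction model: each agent adds a share of its partner's reward to its own
        for i in range(2):
            s, a, r = transitions[i]
            update(i, s, a, r + SHARE * transitions[1 - i][2], positions[i])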

    A spiking neural network of state transition probabilities in model-based reinforcement learning

    The development of the field of reinforcement learning was based on psychological studies of the instrumental conditioning of humans and other animals. Recently, reinforcement learning algorithms have been applied in neuroscience to help characterize neural activity and animal behaviour in instrumental conditioning tasks. A specific example is the hybrid learner developed to match human behaviour on a two-stage decision task. This hybrid learner is composed of a model-free and a model-based system. The model presented in this thesis is an implementation of that model-based system in which the state transition probabilities and Q-value calculations use biologically plausible spiking neurons. Two variants of the model demonstrate its behaviour when the state transition probabilities are encoded in the network at the beginning of the task, and when these probabilities are learned over the course of the task. Various parameters that affect the behaviour of the model are explored, and ranges of these parameters that produce characteristically model-based behaviour are identified. This work provides an important first step toward understanding how a model-based system could be implemented in the human brain, and how such a system contributes to human behaviour.
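    The model-based computation the thesis implements with spiking neurons can be summarised, in conventional non-spiking code and under illustrative assumptions, as combining a learned first-stage transition model with second-stage values to produce model-based Q-values for the two-stage task. The update rules and parameter values below are toy stand-ins, not the thesis's neural implementation.

import numpy as np

# Two first-stage actions, two second-stage states, two second-stage actions
# (the structure of a two-stage decision task; all values are illustrative).
n_a1, n_s2, n_a2 = 2, 2, 2
trans_counts = np.ones((n_a1, n_s2))   # pseudo-counts for the transition model P(s2 | a1)
Q2 = np.zeros((n_s2, n_a2))            # second-stage action values
alpha = 0.1                            # learning rate (illustrative)

def model_based_q1():
    """Q_MB(a1) = sum over s2 of P(s2 | a1) * max over a2 of Q2(s2, a2)."""
    P = trans_counts / trans_counts.sum(axis=1, keepdims=True)
    return P @ Q2.max(axis=1)

def observe(a1, s2, a2, reward):
    """Update the transition model and second-stage values from one trial."""
    trans_counts[a1, s2] += 1
    Q2[s2, a2] += alpha * (reward - Q2[s2, a2])

# One illustrative trial: first-stage action 0 led to state 1, where action 1 paid off.
observe(0, 1, 1, 1.0)
print(model_based_q1())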