    Multiagent reactive plan application learning in dynamic environments

    Generic Reinforcement Learning Beyond Small MDPs

    Feature reinforcement learning (FRL) is a framework within which an agent can automatically reduce a complex environment to a Markov Decision Process (MDP) by finding a map which aggregates similar histories into the states of an MDP. The primary motivation behind this thesis is to build FRL agents that work in practice, both for larger environments and larger classes of environments. We focus on empirical work targeted at practitioners in the field of general reinforcement learning, with theoretical results wherever necessary. The current state-of-the-art in FRL uses suffix trees, which have issues with large observation spaces and long-term dependencies. We start by addressing the issue of long-term dependency using a class of maps known as looping suffix trees, which have previously been used to represent deterministic POMDPs. We show the best existing results on the TMaze domain and good results on larger domains that require long-term memory. We introduce a new value-based cost function that can be evaluated model-free. The value-based cost allows for smaller representations, and its model-free nature allows for its extension to the function approximation setting, which has computational and representational advantages for large state spaces. We evaluate the performance of this new cost in both the tabular and function approximation settings on a variety of domains, and show performance better than the state-of-the-art algorithm MC-AIXI-CTW on the domain POCMAN. When the environment is very large, an FRL agent needs to explore systematically in order to find a good representation. However, it needs a good representation in order to perform this systematic exploration. We decouple the two by considering a different setting, one where the agent has access to the value of any state-action pair from an oracle in a training phase. The agent must learn an approximate representation of the optimal value function. 
We formulate a regression-based solution based on online learning methods to build such an agent. We test this agent on the Arcade Learning Environment using a simple class of linear function approximators. While we made progress on the issue of scalability, two major issues with the FRL framework remain: the need for a stochastic search method to minimise the objective function and the need to store an uncompressed history, both of which can be very computationally demanding.

    Learning coordination in multi-agent systems (Aprendizagem de coordenação em sistemas multi-agente)

    The ability of an agent to coordinate with others within a system is a valuable property in multi-agent systems. Agents either cooperate as a team to accomplish a common goal, or adapt to opponents to complete different goals without being exploited. Research has shown that learning multi-agent coordination is significantly more complex than learning policies in single-agent environments, and requires a variety of techniques to deal with the properties of a system where agents learn concurrently. This thesis aims to determine how machine learning can be used to achieve coordination within a multi-agent system. It asks what techniques can be used to tackle the increased complexity of such systems and their credit assignment challenges, how to achieve coordination, and how to use communication to improve the behavior of a team. Many algorithms for competitive environments are tabular, preventing their use with high-dimensional or continuous state spaces, and may be biased against specific equilibrium strategies. This thesis proposes multiple deep learning extensions for competitive environments, allowing algorithms to reach equilibrium strategies in complex and partially observable environments, relying only on local information. A tabular algorithm is also extended with a new update rule that eliminates its bias against deterministic strategies. Current state-of-the-art approaches for cooperative environments rely on deep learning to handle the environment’s complexity and benefit from a centralized learning phase. Solutions that incorporate communication between agents often prevent agents from being executed in a distributed manner. This thesis proposes a multi-agent algorithm where agents learn communication protocols to compensate for local partial observability, and remain independently executed. 
A centralized learning phase can incorporate additional environment information to increase the robustness and speed with which a team converges to successful policies. The algorithm outperforms current state-of-the-art approaches in a wide variety of multi-agent environments. A permutation-invariant network architecture is also proposed to increase the scalability of the algorithm to large team sizes. Further research is needed to identify how the techniques proposed in this thesis, for cooperative and competitive environments, can be used in unison for mixed environments, and whether they are adequate for general artificial intelligence. Financial support from FCT and FSE under the III Quadro Comunitário de Apoio (Third Community Support Framework). Doctoral Programme in Informatics.

    Reinforcement learning in a multi-agent framework for pedestrian simulation

    The aim of this thesis is to use reinforcement learning to generate plausible simulations of pedestrians in different environments. Methodology: A multi-agent framework has been developed in which each virtual agent learns a navigation behaviour by interacting with the virtual world it shares with the other agents. The virtual world is simulated with a physics engine (ODE) calibrated with human pedestrian parameters taken from the literature. The framework is flexible and allows different learning algorithms (specifically Q-Learning and Sarsa(lambda)) to be combined with different state-space generalisation techniques (specifically vector quantization and tile coding). The learned behaviours are analysed using fundamental diagrams (speed/density relationship), density maps, chronograms and performance measures (the percentage of agents that reach the goal). Conclusions: After a battery of experiments in different scenarios (6 distinct scenarios in total) and the corresponding analysis of the results, the conclusions are as follows: - Plausible pedestrian behaviours were obtained. - The behaviours are robust to scaling and show abstraction capabilities (behaviours at the tactical and planning levels). - The learned behaviours are able to generate emergent collective behaviours. - The comparison with another standard pedestrian model (the Helbing model) and the analyses based on fundamental diagrams indicate that the learned dynamics are coherent and similar to pedestrian dynamics.
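The combination of Q-Learning with tile coding named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the one-dimensional state, the `TileCoder` class, the function names and all parameter values are hypothetical, and the ODE-based pedestrian world is not modelled.

```python
class TileCoder:
    """Maps a scalar state in [low, high] to one active tile per tiling (hypothetical helper)."""

    def __init__(self, low, high, n_tilings=4, tiles_per_tiling=8):
        self.low, self.high = low, high
        self.n_tilings = n_tilings
        self.tiles = tiles_per_tiling
        self.width = (high - low) / tiles_per_tiling

    @property
    def n_features(self):
        # One extra tile per tiling absorbs the shift introduced by the offsets.
        return self.n_tilings * (self.tiles + 1)

    def active(self, x):
        """Return the index of the active binary feature in each tiling."""
        x = min(max(x, self.low), self.high)
        indices = []
        for t in range(self.n_tilings):
            offset = t * self.width / self.n_tilings
            tile = min(int((x - self.low + offset) / self.width), self.tiles)
            indices.append(t * (self.tiles + 1) + tile)
        return indices


def q_value(w, coder, s, a):
    """Linear action value: sum of the weights of the active tiles."""
    return sum(w[a][i] for i in coder.active(s))


def q_learning_step(w, coder, s, a, r, s_next, done, n_actions,
                    alpha=0.1, gamma=0.95):
    """One Q-learning update on the tile-coded weights; returns the TD error."""
    if done:
        target = r
    else:
        target = r + gamma * max(q_value(w, coder, s_next, b)
                                 for b in range(n_actions))
    delta = target - q_value(w, coder, s, a)
    # Divide the step size by the number of tilings so Q(s, a) moves by alpha * delta.
    for i in coder.active(s):
        w[a][i] += alpha / coder.n_tilings * delta
    return delta
```

A caller would keep one weight list per action, for example `w = [[0.0] * coder.n_features for _ in range(n_actions)]`, and call `q_learning_step` after every simulated transition. Nearby states share tiles, which is what gives the generalisation over the state space.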

    Novel approaches to cooperative coevolution of heterogeneous multiagent systems

    Doctoral thesis, Informatics (Informatics Engineering), Universidade de Lisboa, Faculdade de Ciências, 2017. Heterogeneous multirobot systems are characterised by the morphological and/or behavioural heterogeneity of their constituent robots. These systems have a number of advantages over the more common homogeneous multirobot systems: they can leverage specialisation for increased efficiency, and they can solve tasks that are beyond the reach of any single type of robot, by combining the capabilities of different robots. Manually designing control for heterogeneous systems is a challenging endeavour, since the desired system behaviour has to be decomposed into behavioural rules for the individual robots, in such a way that the team as a whole cooperates and takes advantage of specialisation. Evolutionary robotics is a promising alternative that can be used to automate the synthesis of controllers for multirobot systems, but so far, research in the field has been mostly focused on homogeneous systems, such as swarm robotics systems. Cooperative coevolutionary algorithms (CCEAs) are a type of evolutionary algorithm that facilitate the evolution of control for heterogeneous systems, by working over a decomposition of the problem. In a typical CCEA application, each agent evolves in a separate population, with the evaluation of each agent depending on the cooperation with agents from the other coevolving populations. A CCEA is thus capable of projecting the large search space into multiple smaller, and more manageable, search spaces. Unfortunately, the use of cooperative coevolutionary algorithms is associated with a number of challenges. Previous works have shown that CCEAs are not necessarily attracted to the global optimum, but often converge to mediocre stable states; they can be inefficient when applied to large teams; and they have not yet been demonstrated in real robotic systems, nor in morphologically heterogeneous multirobot systems. 
In this thesis, we propose novel methods for overcoming the fundamental challenges in cooperative coevolutionary algorithms mentioned above, and study them in multirobot domains: we propose novelty-driven cooperative coevolution, in which premature convergence is avoided by encouraging behavioural novelty; and we propose Hyb-CCEA, an extension of CCEAs that places the team heterogeneity under evolutionary control, significantly improving its scalability with respect to the team size. These two approaches have in common that they take into account the exploration of the behaviour space by the evolutionary process. Besides relying on the fitness function for the evaluation of the candidate solutions, the evolutionary process analyses the behaviour of the evolving agents to improve the effectiveness of the evolutionary search. The ultimate goal of our research is to achieve general methods that can effectively synthesise controllers for heterogeneous multirobot systems, and therefore help to realise the full potential of this type of system. To this end, we demonstrate the proposed approaches in a variety of multirobot domains used in previous works, and we study the application of CCEAs to new robotics domains, including a morphologically heterogeneous system and a real robotic system. Fundação para a Ciência e a Tecnologia (FCT, PEst-OE/EEI/LA0008/2011
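The "typical CCEA application" described in the abstract, where each agent evolves in its own population and is evaluated together with collaborators from the other populations, might look roughly like the sketch below. It is a simplified illustration and not the thesis's novelty-driven method or Hyb-CCEA: the scalar "controllers", the `team_fitness` task, the best-collaborator evaluation scheme and every parameter value are assumptions made for the example.

```python
import random


def team_fitness(team):
    """Hypothetical joint task: agent 0 should reach 1.0 and agent 1 should reach -2.0."""
    x, y = team
    return -((x - 1.0) ** 2) - ((y + 2.0) ** 2)


def ccea(fitness, n_pops=2, pop_size=10, generations=50, sigma=0.3, seed=0):
    """Minimal cooperative coevolution with one population per agent."""
    rng = random.Random(seed)
    pops = [[rng.uniform(-5, 5) for _ in range(pop_size)] for _ in range(n_pops)]
    # Each population is represented by one individual when the others are evaluated.
    reps = [pop[0] for pop in pops]
    history = [fitness(reps)]
    for _ in range(generations):
        for p in range(n_pops):
            def score(ind):
                # Evaluate an individual inside a team made of the other representatives.
                team = list(reps)
                team[p] = ind
                return fitness(team)

            ranked = sorted(pops[p], key=score, reverse=True)
            survivors = ranked[: pop_size // 2]          # elitist truncation selection
            children = [s + rng.gauss(0, sigma) for s in survivors]
            pops[p] = survivors + children
            reps[p] = survivors[0]                       # best collaborator so far
        history.append(fitness(reps))
    return reps, history
```

Calling `reps, history = ccea(team_fitness)` returns the final representatives and the team fitness per generation. Because the current representative always survives, `history` never decreases, but that greedy scheme is also the kind of setup in which the convergence to mediocre stable states mentioned above can occur on less separable tasks.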

    Practical reinforcement learning using representation learning and safe exploration for large scale Markov decision processes

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 157-168). While creating intelligent agents that can solve stochastic sequential decision-making problems through interacting with the environment is the promise of Reinforcement Learning (RL), scaling existing RL methods to realistic domains such as planning for multiple unmanned aerial vehicles (UAVs) has remained a challenge due to three main factors: 1) RL methods often require a plethora of data to find reasonable policies, 2) the agent has limited computation time between interactions, and 3) while exploration is necessary to avoid convergence to local optima, in sensitive domains visiting all parts of the planning space may lead to catastrophic outcomes. To address the first two challenges, this thesis introduces incremental Feature Dependency Discovery (iFDD) as a representation expansion method with cheap per-timestep computational complexity that can be combined with any online, value-based reinforcement learning method using binary features. In addition to convergence and computational complexity guarantees, when coupled with SARSA, iFDD achieves much faster learning (i.e., requires far fewer data samples) in planning domains including two multi-UAV mission planning scenarios with hundreds of millions of state-action pairs. In particular, in a UAV mission planning domain, iFDD performed more than 12 times better than the best competitor given the same number of samples. The third challenge is addressed through a constructive relationship between a planner and a learner in order to mitigate the learning risk while boosting the asymptotic performance and safety of an agent's behavior. The framework is an instance of the intelligent cooperative control architecture where a learner initially follows a safe policy generated by a planner. 
The learner incrementally improves this baseline policy through interaction, while avoiding behaviors believed to be risky. The new approach is demonstrated to be superior in two multi-UAV task assignment scenarios. For example, in one case, the proposed method reduced the risk by 8%, while improving the performance of the planner by up to 30%. By Alborz Geramifard. Ph.D.

    Specifying User Preferences for Autonomous Robots through Interactive Learning

    This thesis studies a central problem in human-robot interaction (HRI): How can non-expert users specify complex behaviours for autonomous robots? A common technique for robot task specification that does not require expert knowledge is active preference learning. The desired behaviour of a robot is learned by iteratively presenting the user with alternative behaviours of the robot. The user then chooses the alternative they prefer. It is assumed that they make this decision based on an internal, hidden cost function. From the user's choice among the alternatives, the robot learns the hidden user cost function. We use an interactive framework allowing users to create robot task specifications. The behaviour of an autonomous robot can be specified by defining constraints on allowable robot states and actions. For instance, for a mobile robot a user can define traffic rules such as roads, slow zones or areas of avoidance. These constraints form the user-specified terms of the cost function. However, inexperienced users might be oblivious to the impact such constraints have on the robot task performance. Employing an active preference learning framework, we present users with the behaviour of the robot following their specification, i.e., the constraints, together with an alternative behaviour where some constraints might be violated. A user cost function trades off the importance of constraints against the performance of the robot. From the user feedback, the robot learns about the importance of constraints, i.e., parameters in the cost function. We first introduce an algorithm for specification revision that is based on a deterministic user model: We assume that the user always follows the proposed cost function. This allows for dividing the set of possible weights for the user constraints into infeasible and feasible weights whenever user feedback is obtained. 
In each iteration we again present the path the user preferred previously, together with an alternative path that is optimal for a weight that is feasible with respect to all previous iterations. This path is found with a local search, iterating over the feasible weights until a new path is found. As the number of paths is finite for any discrete motion planner, the algorithm is guaranteed to find the optimal solution within a finite number of iterations. Simulation results show that this approach can effectively revise user specifications within a few iterations. The practicality of the framework is investigated in a user study. The algorithm is extended to learn about multiple tasks for the robot simultaneously, which allows for more realistic scenarios and another active learning component: the choice of task for which the user is presented with two alternative solutions. Through the study we show that nearly all users accept alternative solutions and thus obtain a revised specification through the learning process, leading to a substantial improvement in robot performance. Also, the users whose initial specifications had the largest impact on performance benefit the most from the interactive learning. Next, we weaken the assumptions about the user: In a probabilistic model we do not require the user to always follow our cost function. Based on the sensitivity of a motion planning problem, we show that different values in the user cost function, i.e., weights for the user constraints, do not necessarily lead to different robot behaviour. From the implied discretization of the space of possible parameters we derive an algorithm for efficiently learning a specification revision and demonstrate the performance and robustness in simulations. We build on the notion of sensitivity to develop an active preference learning technique based on maximum regret, i.e., the maximum error ratio over all possible solutions. 
We show that active preference learning based on regret substantially outperforms other state-of-the-art approaches. Further, regret-based preference learning can be used as a heuristic for both discrete and continuous state and action spaces. An emerging technique for real-time motion planning is the state lattice planner, based on a regular discrete set of robot states and pre-computed motions connecting the states, called motion primitives. We study how learning from demonstrations can be used to learn global preferences for robot movement, such as the trade-off between time and jerkiness of the motions. We show how to compute a user-optimal set of motion primitives of a given size, based on an estimate of the user preferences. We demonstrate that by learning about the motion primitives of a lattice planner, we can shape the robot's behaviour to follow the global user preferences while ensuring good computation time of the motion planner. Furthermore, we study how a robot can simultaneously learn about user preferences on both motions of a lattice planner and parts of the environment when a user is iteratively correcting the robot behaviour. We demonstrate in simulations that this approach is suitable to adapt to user preferences even when the features of the environment that a user considers are not given.
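The deterministic user model described in the abstract, where each answer splits the weights into feasible and infeasible ones and the next alternative is optimal for some still-feasible weight, can be illustrated with a toy one-constraint example. Everything below is an assumption made for illustration: paths are reduced to hypothetical `(violation, time)` pairs, the cost is `time + w * violation` with a single scalar weight `w`, and the user is simulated by a hidden weight `user_w`. The thesis itself works with motion planners and several constraints.

```python
def cost(path, w):
    """Assumed cost model: travel time plus weighted constraint violation."""
    violation, time = path
    return time + w * violation


def best_path(paths, w):
    return min(paths, key=lambda p: cost(p, w))


def candidate_weights(paths, lo, hi):
    """Feasible weights worth testing: interval ends, cost crossings, and midpoints."""
    pts = {lo, hi}
    for v1, t1 in paths:
        for v2, t2 in paths:
            if v1 != v2:
                w = (t2 - t1) / (v1 - v2)  # weight at which the two paths cost the same
                if lo < w < hi:
                    pts.add(w)
    pts = sorted(pts)
    return pts + [(a + b) / 2 for a, b in zip(pts, pts[1:])]


def revise(paths, user_w, w_max=10.0):
    """Query a simulated deterministic user until no new alternative path remains."""
    lo, hi = 0.0, w_max                     # feasible interval for the hidden weight
    current = best_path(paths, w_max)       # initial specification: obey the constraint
    rejected = set()
    queries = 0
    while True:
        alt = None
        for w in candidate_weights(paths, lo, hi):
            cand = best_path(paths, w)
            if cand != current and cand not in rejected:
                alt = cand
                break
        if alt is None:
            return current, (lo, hi), queries
        queries += 1
        # The simulated user picks the path that is cheaper under the hidden weight.
        if cost(alt, user_w) < cost(current, user_w):
            preferred, other = alt, current
        else:
            preferred, other = current, alt
        # preferred is no worse than other, so w * dv <= dt must hold.
        dv = preferred[0] - other[0]
        dt = other[1] - preferred[1]
        if dv > 0:
            hi = min(hi, dt / dv)
        elif dv < 0:
            lo = max(lo, dt / dv)
        rejected.add(other)
        current = preferred
```

For example, with `paths = [(0, 10), (1, 7), (3, 4)]` and `user_w = 2.5`, `revise` asks two questions and returns `((1, 7), (2.0, 3.0), 2)`: the path with one violation, and the feasible weight interval that remains. The loop stops after at most one query per path, which mirrors the finiteness argument in the abstract.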

    Reinforcement Learning

    Brains rule the world, and brain-like computation is increasingly used in computers and electronic devices. Brain-like computation is about processing and interpreting data or directly putting forward and performing actions. Learning is a very important aspect. This book is on reinforcement learning, which involves performing actions to achieve a goal. The first 11 chapters of this book describe and extend the scope of reinforcement learning. The remaining 11 chapters show that there is already wide usage in numerous fields. Reinforcement learning can tackle control tasks that are too complex for traditional, hand-designed, non-learning controllers. As learning computers can deal with technical complexities, the task of human operators becomes specifying goals on increasingly higher levels. This book shows that reinforcement learning is a very dynamic area in terms of theory and applications, and it is intended to stimulate and encourage new research in this field.

    Policy Search Based Relational Reinforcement Learning using the Cross-Entropy Method

    Relational Reinforcement Learning (RRL) is a subfield of machine learning in which a learning agent seeks to maximise a numerical reward within an environment, represented as collections of objects and relations, by performing actions that interact with the environment. The relational representation allows more dynamic environment states than an attribute-based representation of reinforcement learning, but this flexibility also creates new problems such as a potentially infinite number of states. This thesis describes an RRL algorithm named Cerrla that creates policies directly from a set of learned relational “condition-action” rules using the Cross-Entropy Method (CEM) to control policy creation. The CEM assigns each rule a sampling probability and gradually modifies these probabilities such that the randomly sampled policies consist of ‘better’ rules, resulting in larger rewards received. Rule creation is guided by an inferred partial model of the environment that defines: the minimal conditions needed to take an action, the possible specialisation conditions per rule, and a set of simplification rules to remove redundant and illegal rule conditions, resulting in compact, efficient, and comprehensible policies. Cerrla is evaluated on four separate environments, where each environment has several different goals. Results show that compared to existing RRL algorithms, Cerrla is able to learn equal or better behaviour in less time on the standard RRL environment. On other larger, more complex environments, it can learn behaviour that is competitive with specialised approaches. The simplified rules and CEM’s bias towards compact policies result in comprehensive and effective relational policies created in a relatively short amount of time.
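The Cross-Entropy Method step described in the abstract, where each rule has a sampling probability that is shifted towards the rules found in high-reward policies, can be sketched as below. This is a generic CEM over independent rule-inclusion probabilities and not Cerrla itself: rules are plain integer indices, a policy is simply a set of included rules, and `sample_policy`, `cem_update`, `cem_rule_search` and all parameter values are hypothetical. Relational rule creation and specialisation are not modelled.

```python
import random


def sample_policy(probs, rng):
    """Sample a policy as the set of rule indices included under the current probabilities."""
    return frozenset(i for i, p in enumerate(probs) if rng.random() < p)


def cem_update(probs, elites, alpha):
    """Move each rule's probability towards its inclusion frequency among elite policies."""
    n = len(elites)
    updated = []
    for i, p in enumerate(probs):
        freq = sum(1 for policy in elites if i in policy) / n
        updated.append((1 - alpha) * p + alpha * freq)
    return updated


def cem_rule_search(reward, n_rules, iters=30, n_samples=50,
                    elite_frac=0.2, alpha=0.6, seed=0):
    """Repeatedly sample policies, keep the best fraction, and update the probabilities."""
    rng = random.Random(seed)
    probs = [0.5] * n_rules
    n_elite = max(1, int(n_samples * elite_frac))
    for _ in range(iters):
        policies = [sample_policy(probs, rng) for _ in range(n_samples)]
        policies.sort(key=reward, reverse=True)
        probs = cem_update(probs, policies[:n_elite], alpha)
    return probs
```

A caller supplies `reward`, a function that scores a sampled policy, for instance by running it in an environment and returning the episode reward. The smoothing factor `alpha` keeps probabilities from collapsing after a single lucky sample.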

    Proceedings of The Multi-Agent Logics, Languages, and Organisations Federated Workshops (MALLOW 2010)

    http://ceur-ws.org/Vol-627/allproceedings.pdf MALLOW-2010 is the third edition of a series initiated in 2007 in Durham, and pursued in 2009 in Turin. The objective, as initially stated, is to "provide a venue where: the cost of participation was minimum; participants were able to attend various workshops, so fostering collaboration and cross-fertilization; there was a friendly atmosphere and plenty of time for networking, by maximizing the time participants spent together".