497 research outputs found

    Spatial representation for planning and executing robot behaviors in complex environments

    Robots are already improving our well-being and productivity in different applications such as industry, healthcare and indoor service applications. However, we are still far from developing (and releasing) a fully functional robotic agent that can autonomously survive in tasks that require human-level cognitive capabilities. Robotic systems on the market, in fact, are designed to address specific applications, and can only run pre-defined behaviors to robustly repeat a few tasks (e.g., assembling object parts, vacuum cleaning). Their internal representation of the world is usually constrained to the task they are performing, and does not allow for generalization to other scenarios. Unfortunately, such a paradigm applies only to a very limited set of domains, where the environment can be assumed to be static, and its dynamics can be handled before deployment. Additionally, robots configured in this way will eventually fail if their "handcrafted" representation of the environment does not match the external world. Hence, to enable more sophisticated cognitive skills, we investigate how to design robots to properly represent the environment and behave accordingly. To this end, we formalize a representation of the environment that enhances the robot's spatial knowledge to explicitly include a representation of its own actions. Spatial knowledge constitutes the core of the robot's understanding of the environment; however, it is not sufficient to represent what the robot is capable of doing in it. To overcome this limitation, we formalize SK4R, a spatial knowledge representation for robots which enhances spatial knowledge with a novel and "functional" point of view that explicitly models robot actions. To this end, we exploit the concept of affordances, introduced to express opportunities (actions) that objects offer to an agent. 
To encode affordances within SK4R, we define the "affordance semantics" of actions, which is used to annotate an environment and to represent to what extent robot actions support goal-oriented behaviors. We demonstrate the benefits of a functional representation of the environment in multiple robotic scenarios that traverse and contribute to different research topics: robot knowledge representations, social robotics, multi-robot systems, and robot learning and planning. We show how a domain-specific representation that explicitly encodes affordance semantics provides the robot with a more concrete understanding of the environment and of the effects that its actions have on it. The goal of our work is to design an agent that will no longer execute an action out of mere pre-defined routine; rather, it will execute an action because it "knows" that the resulting state leads one step closer to success in its task.

    Reinforcement Learning from Self-Play in Imperfect-Information Games

    This thesis investigates artificial agents learning to make strategic decisions in imperfect-information games. In particular, we introduce a novel approach to reinforcement learning from self-play. We introduce Smooth UCT, which combines the game-theoretic notion of fictitious play with Monte Carlo Tree Search (MCTS). Smooth UCT outperformed a classic MCTS method in several imperfect-information poker games and won three silver medals in the 2014 Annual Computer Poker Competition. We develop Extensive-Form Fictitious Play (XFP), which is implemented entirely in sequential strategies, thus extending this prominent game-theoretic model of learning to sequential games. XFP provides a principled foundation for self-play reinforcement learning in imperfect-information games. We introduce Fictitious Self-Play (FSP), a class of sample-based reinforcement learning algorithms that approximate XFP. We instantiate FSP with neural network function approximation and deep learning techniques, producing Neural FSP (NFSP). We demonstrate that (approximate) Nash equilibria and their representations (abstractions) can be learned using NFSP end to end, i.e. interfacing with the raw inputs and outputs of the domain. NFSP approached the performance of state-of-the-art, superhuman algorithms in Limit Texas Hold’em, an imperfect-information game at the absolute limit of tractability using massive computational resources. This is the first time that any reinforcement learning algorithm, learning solely from game outcomes without prior domain knowledge, achieved such a feat.
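As background for the fictitious-play idea this abstract builds on, here is a minimal sketch of classical (normal-form) fictitious play in a two-player zero-sum matrix game. It is not the thesis's XFP, FSP, NFSP or Smooth UCT; the function names, the matching-pennies payoff matrix and the initial beliefs are illustrative assumptions.

```python
def best_response(payoff_rows, opponent_counts):
    """Index of the action with the highest payoff against the opponent's empirical play."""
    values = [
        sum(p * c for p, c in zip(row, opponent_counts))
        for row in payoff_rows
    ]
    return max(range(len(values)), key=values.__getitem__)


def fictitious_play(payoff, iterations):
    """Classical fictitious play for a zero-sum game (hypothetical helper).

    payoff[r][c] is the row player's payoff; the column player receives the negative.
    Returns the empirical action frequencies of both players.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Column player's payoffs, indexed [column action][row action].
    col_payoff = [[-payoff[r][c] for r in range(n_rows)] for c in range(n_cols)]
    # Arbitrary initial beliefs: each player is assumed to have played action 0 once.
    row_counts = [1] + [0] * (n_rows - 1)
    col_counts = [1] + [0] * (n_cols - 1)
    for _ in range(iterations):
        # Both players best-respond to the opponent's empirical action counts so far.
        a = best_response(payoff, col_counts)
        b = best_response(col_payoff, row_counts)
        row_counts[a] += 1
        col_counts[b] += 1
    row_freq = [c / sum(row_counts) for c in row_counts]
    col_freq = [c / sum(col_counts) for c in col_counts]
    return row_freq, col_freq


if __name__ == "__main__":
    matching_pennies = [[1, -1], [-1, 1]]
    print(fictitious_play(matching_pennies, 10000))
```

In matching pennies the unique Nash equilibrium mixes both actions uniformly, so the empirical frequencies returned here should drift toward one half for each action as the number of iterations grows.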

    An Ordinal Agent Framework

    In this thesis, we introduce algorithms to solve ordinal multi-armed bandit problems, Monte-Carlo tree search, and reinforcement learning problems. With ordinal problems, an agent does not receive numerical rewards, but ordinal rewards, which come without any distance measure. For humans, it is often hard to define or to determine exact numerical feedback signals but simpler to come up with an ordering over possibilities. For instance, when looking at medical treatment, the ordering patient death < patient ill < patient cured is easy to come up with, but it is hard to assign numerical values to these outcomes. As most state-of-the-art algorithms rely on numerical operations, they cannot be applied in the presence of ordinal rewards. We present a preference-based approach that applies dueling bandits to sequential decision problems and discuss its disadvantages in terms of sample efficiency and scalability. Following another idea, our final approach to identifying optimal arms is based on the comparison of reward distributions using the Borda method. We test this approach on multi-armed bandits, extend it to Monte-Carlo tree search, and also apply it to reinforcement learning. To do so, we introduce a framework that encapsulates the similarities of the different problem definitions. We test our ordinal algorithms on frameworks like the General Video Game Framework (GVGAI), OpenAI, or synthetic data and compare them to ordinal, numerical, or domain-specific algorithms. Since the running time of our algorithms depends on the number of perceived ordinal rewards, we introduce a binning method that artificially reduces the number of rewards.
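To make the idea of comparing ordinal reward distributions concrete, the following sketch computes a Borda-style score per arm: the average probability that a reward drawn from this arm beats a reward drawn from another arm, counting ties as half a win. This is one common formulation and an assumption here, not necessarily the exact definition used in the thesis; the arm names and sample data are made up for illustration.

```python
from itertools import product

# Ordinal ranks for the example ordering in the abstract. Only the order
# matters; the integers are never added or averaged as reward values.
DEATH, ILL, CURED = 0, 1, 2


def win_probability(samples_a, samples_b):
    """Empirical probability that a reward from A beats one from B (ties count 0.5)."""
    wins = 0.0
    for ra, rb in product(samples_a, samples_b):
        if ra > rb:
            wins += 1.0
        elif ra == rb:
            wins += 0.5
    return wins / (len(samples_a) * len(samples_b))


def borda_scores(samples_per_arm):
    """Average win probability of each arm against every other arm."""
    scores = {}
    for arm, samples in samples_per_arm.items():
        others = [s for other, s in samples_per_arm.items() if other != arm]
        scores[arm] = sum(win_probability(samples, o) for o in others) / len(others)
    return scores


if __name__ == "__main__":
    observed = {  # hypothetical observations per treatment
        "treatment_a": [CURED, CURED, CURED, DEATH],
        "treatment_b": [ILL, ILL, ILL, CURED],
        "treatment_c": [DEATH, ILL, DEATH, ILL],
    }
    scores = borda_scores(observed)
    print(max(scores, key=scores.get), scores)
```

Because the score is built from pairwise comparisons alone, it stays well defined when no distance between "ill" and "cured" is available. Its cost grows with the number of stored samples per arm, which is the kind of dependence the binning method in the abstract is meant to reduce.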

    Efficient Preference-based Reinforcement Learning

    Common reinforcement learning algorithms assume access to a numeric feedback signal. The numeric feedback contains a high amount of information and can be maximized efficiently. However, the definition of a numeric feedback signal can be difficult in practice due to several limitations, and badly defined values may lead to an unintended outcome. For humans, it is usually easier to define qualitative feedback signals than quantitative ones. Hence, we want to solve reinforcement learning problems with a qualitative signal, potentially capable of overcoming several of the limitations of numeric feedback. Preferences have several advantages over other qualitative settings, like ordinal feedback or advice. Preferences are scale-free and do not require assumptions about the optimal outcome. However, preferences are difficult to use for solving sequential decision problems, because it is unknown which decisions are responsible for the observed preference. Hence, we analyze different approaches for learning from preferences and show the design principles that can be used, as well as the advantages and problems that occur. We also survey the field of preference-based reinforcement learning and categorize the algorithms according to the design principles. Efficiency is of special interest in this setting, as it is important to keep the number of required preferences low, because they depend on human evaluation. Hence, our focus is on efficient use of the preferences. Being able to generalize the obtained preferences is important, as this keeps the number of required preferences low. Therefore, we consider methods that are able to generalize the obtained preferences to models not yet evaluated. However, this introduces uncertain feedback, and the exploration/exploitation problem already known from classical reinforcement learning has to be considered with the preferences in mind. 
We show how to efficiently solve this dual exploration problem by interleaving both tasks in an undirected manner. We use undirected exploration methods because they scale better to high-dimensional spaces. Furthermore, human feedback has to be assumed to be error-prone, and we analyze the problems that arise when using human evaluation. We show that noise is the most substantial problem when dealing with human preferences and present a solution to this problem.

    Opponent awareness at all levels of the multiagent reinforcement learning stack

    Multiagent Reinforcement Learning (MARL) has experienced numerous high-profile successes in recent years in terms of generating superhuman gameplaying agents for a wide variety of videogames. Despite these successes, game developers have not adopted MARL techniques as a tool for developing their games, often citing the high computational cost associated with training agents and the difficulty of understanding and evaluating MARL methods as the two main obstacles. This thesis attempts to close this gap by introducing an informative modular abstraction under which any Reinforcement Learning (RL) training pipeline can be studied. This is defined as the MARL stack, which explicitly expresses any MARL pipeline as an environment where agents equipped with learning algorithms train via simulated experience as orchestrated by a training scheme. Within the context of 2-player zero-sum games, different approaches to granting opponent awareness at all levels of the proposed MARL stack are explored in a broad study of the field. At the level of training schemes, a grouping generalization over many modern MARL training schemes is introduced under a unified framework. Empirical results are shown which demonstrate that the decision over which sequence of opponents a learning agent will face during training greatly affects learning dynamics. At the agent level, the introduction of opponent modelling in state-of-the-art algorithms is explored as a way of generating targeted best responses towards opponents encountered during training, improving upon the sample efficiency of these methods. At the environment level, the use of MARL as a game design tool is explored by using MARL-trained agents as metagame evaluators inside an automated process of game balancing.

    A Survey on Compiler Autotuning using Machine Learning

    Since the mid-1990s, researchers have been trying to use machine-learning based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for the compiler optimization field, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering of optimizations. The survey highlights the approaches taken so far, the obtained results, the fine-grain classification among different approaches and finally, the influential papers of the field. Comment: version 5.0 (updated on September 2018) - Preprint Version For our Accepted Journal @ ACM CSUR 2018 (42 pages) - This survey will be updated quarterly here (Send me your new published papers to be added in the subsequent version) History: Received November 2016; Revised August 2017; Revised February 2018; Accepted March 2018

    Reinforcement Learning and Bandits for Speech and Language Processing: Tutorial, Review and Outlook

    In recent years, reinforcement learning and bandits have transformed a wide range of real-world applications including healthcare, finance, recommendation systems, robotics, and, last but not least, speech and natural language processing. While most speech and language applications of reinforcement learning algorithms are centered around improving the training of deep neural networks with its flexible optimization properties, there is still much ground to explore in exploiting the benefits of reinforcement learning, such as its reward-driven adaptability, state representations, temporal structures and generalizability. In this survey, we present an overview of recent advancements in reinforcement learning and bandits, and discuss how they can be effectively employed to solve speech and natural language processing problems with models that are adaptive, interactive and scalable. Comment: To appear in Expert Systems with Applications. Accompanying INTERSPEECH 2022 Tutorial on the same topic. Including latest advancements in large language models (LLMs)

    Learning Augmented Optimization for Network Softwarization in 5G

    The rapid uptake of mobile devices and applications is posing unprecedented traffic burdens on the existing networking infrastructures. In order to maximize both user experience and investment return, networking and communications systems are evolving to the next generation, 5G, which is expected to support more flexibility, agility, and intelligence towards provisioned services and infrastructure management. Fulfilling these tasks is challenging, as today's networks are increasingly heterogeneous, dynamic and large in scale. Network softwarization is one of the critical enabling technologies to implement these requirements in 5G. In addition to the problems investigated in preliminary research on this technology, many newly emerging application requirements and advanced optimization and learning technologies are introducing more challenges and opportunities for its full application in practical production environments. This motivates this thesis to develop a new learning augmented optimization technology, which merges advanced optimization and learning techniques to meet the distinct characteristics of the new application environment. To be more specific, the key contents of this thesis are summarized as follows:
    • We first develop a stochastic solution to augment the optimization of Network Function Virtualization (NFV) services in dynamical networks. In contrast to the dominant NFV solutions applied to deterministic networking environments, the inherent network dynamics and uncertainties of the 5G infrastructure are impeding the rollout of NFV in many emerging networking applications. Therefore, Chapter 3 investigates the issues of network utility degradation when implementing NFV in dynamical networks, and proposes a robust NFV solution with full respect to the underlying stochastic features. By exploiting the hierarchical decision structures in this problem, a distributed computing framework with two-level decomposition is designed to facilitate a distributed implementation of the proposed model in large-scale networks.
    • Next, Chapter 4 aims to intertwine traditional optimization and learning technologies. In order to reap the merits of both optimization and learning technologies while avoiding their limitations, promising integrative approaches are investigated to combine traditional optimization theories with advanced learning methods. Subsequently, an online optimization process is designed to learn the system dynamics for the network slicing problem, another critical challenge for network softwarization. Specifically, we first present a two-stage slicing optimization model with time-averaged constraints and objective to safeguard the network slicing operations in time-varying networks. Directly computing an offline solution to this problem is intractable, since the future system realizations are unknown when decisions are made. To address this, we combine historical learning and Lyapunov stability theories, and develop a learning augmented online optimization approach. This enables the system to learn a safe slicing solution from both historical records and real-time observations. We prove that the proposed solution is always feasible and nearly optimal, up to a constant additive factor. Finally, simulation experiments are provided to demonstrate the considerable improvement of the proposals.
    • The success of traditional solutions for optimizing stochastic systems often requires solving a base optimization program repeatedly until convergence. In each iteration, the base program has the same model structure and differs only in its input data. Such properties of stochastic optimization systems encourage the work of Chapter 5, in which we apply the latest deep learning technologies to abstract the core structures of an optimization model and then use the learned deep learning model to directly generate the solutions to the equivalent optimization model. In this respect, an encoder-decoder based learning model is developed in Chapter 5 to improve the optimization of network slices. In order to facilitate the solving of the constrained combinatorial optimization program in a deep learning manner, we design a problem-specific decoding process by integrating program constraints and problem context information into the training process. The deep learning model, once trained, can be used to directly generate the solution to any specific problem instance. This avoids the extensive computation in traditional approaches, which re-solve the whole combinatorial optimization problem for every instance from scratch. With the help of the REINFORCE gradient estimator, the obtained deep learning model in the experiments achieves significantly reduced computation time and optimality loss.
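The REINFORCE gradient estimator mentioned at the end of this abstract can be illustrated on a toy problem. The sketch below uses a softmax policy over three discrete actions with fixed rewards and compares the sampled score-function estimate with the exact gradient of the expected reward. It is a textbook illustration under assumed names and values, not the encoder-decoder model or training setup of the thesis.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exact_gradient(logits, rewards):
    """Exact gradient of E[reward] w.r.t. the logits of a softmax policy."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


def reinforce_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo estimate: average of reward * grad log pi(action)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(range(len(probs)), weights=probs)[0]
        for k in range(len(logits)):
            # d/d logit_k of log softmax(action) = 1[k == action] - probs[k]
            indicator = 1.0 if k == action else 0.0
            grad[k] += rewards[action] * (indicator - probs[k])
    return [g / num_samples for g in grad]


if __name__ == "__main__":
    logits = [0.0, 0.5, -0.5]   # hypothetical policy parameters
    rewards = [1.0, 0.0, 2.0]   # hypothetical reward per action
    rng = random.Random(0)
    print("exact    :", exact_gradient(logits, rewards))
    print("estimated:", reinforce_gradient(logits, rewards, 50000, rng))
```

The estimator needs only sampled actions and their rewards, not a differentiable objective, which is why it suits settings where the decoded solution is discrete. In practice a baseline is usually subtracted from the reward to reduce variance; that refinement is omitted here.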