23 research outputs found

    Heuristic Search Value Iteration for zero-sum Stochastic Games

    Get PDF
    International audienceIn sequential decision-making, heuristic search algorithms allow exploiting both the initial situation and an admissible heuristic to efficiently search for an optimal solution, often for planning purposes. Such algorithms exist for problems with uncertain dynamics, partial observability, multiple criteria, or multiple collaborating agents. Here we look at two-player zero-sum stochastic games with discounted criterion, in a view to propose a solution tailored to the fully observable case, while solutions have been proposed for particular, though still more general, partially observable cases. This setting induces reasoning on both a lower and an upper bound of the value function, which leads us to proposing zsSG-HSVI, an algorithm based on Heuristic Search Value Iteration (HSVI), and which thus relies on generating trajectories. We demonstrate that, each player acting optimistically, and employing simple heuristic initializations, HSVI's convergence in finite time to an ϵ-optimal solution is preserved. An empirical study of the resulting approach is conducted on benchmark problems of various sizes

    A Practical Guide to Multi-Objective Reinforcement Learning and Planning

    Get PDF
    Real-world decision-making tasks are generally complex, requiring trade-offs between multiple, often conflicting, objectives. Despite this, the majority of research in reinforcement learning and decision-theoretic planning either assumes only a single objective, or that multiple objectives can be adequately handled via a simple linear combination. Such approaches may oversimplify the underlying problem and hence produce suboptimal results. This paper serves as a guide to the application of multi-objective methods to difficult problems, and is aimed at researchers who are already familiar with single-objective reinforcement learning and planning methods who wish to adopt a multi-objective perspective on their research, as well as practitioners who encounter multi-objective decision problems in practice. It identifies the factors that may influence the nature of the desired solution, and illustrates by example how these influence the design of multi-objective decision-making systems for complex problems

    A practical guide to multi-objective reinforcement learning and planning

    Get PDF
    Real-world sequential decision-making tasks are generally complex, requiring trade-offs between multiple, often conflicting, objectives. Despite this, the majority of research in reinforcement learning and decision-theoretic planning either assumes only a single objective, or that multiple objectives can be adequately handled via a simple linear combination. Such approaches may oversimplify the underlying problem and hence produce suboptimal results. This paper serves as a guide to the application of multi-objective methods to difficult problems, and is aimed at researchers who are already familiar with single-objective reinforcement learning and planning methods who wish to adopt a multi-objective perspective on their research, as well as practitioners who encounter multi-objective decision problems in practice. It identifies the factors that may influence the nature of the desired solution, and illustrates by example how these influence the design of multi-objective decision-making systems for complex problems. © 2022, The Author(s)

    Neural combinatorial optimization as an enabler technology to design real-time virtual network function placement decision systems

    Get PDF
    158 p.The Fifth Generation of the mobile network (5G) represents a breakthrough technology for thetelecommunications industry. 5G provides a unified infrastructure capable of integrating over thesame physical network heterogeneous services with different requirements. This is achieved thanksto the recent advances in network virtualization, specifically in Network Function Virtualization(NFV) and Software Defining Networks (SDN) technologies. This cloud-based architecture not onlybrings new possibilities to vertical sectors but also entails new challenges that have to be solvedaccordingly. In this sense, it enables to automate operations within the infrastructure, allowing toperform network optimization at operational time (e.g., spectrum optimization, service optimization,traffic optimization). Nevertheless, designing optimization algorithms for this purpose entails somedifficulties. Solving the underlying Combinatorial Optimization (CO) problems that these problemspresent is usually intractable due to their NP-Hard nature. In addition, solutions to these problems arerequired in close to real-time due to the tight time requirements on this dynamic environment. Forthis reason, handwritten heuristic algorithms have been widely used in the literature for achievingfast approximate solutions on this context.However, particularizing heuristics to address CO problems can be a daunting task that requiresexpertise. The ability to automate this resolution processes would be of utmost importance forachieving an intelligent network orchestration. In this sense, Artificial Intelligence (AI) is envisionedas the key technology for autonomously inferring intelligent solutions to these problems. Combining AI with network virtualization can truly transform this industry. Particularly, this Thesis aims at using Neural Combinatorial Optimization (NCO) for inferring endsolutions on CO problems. NCO has proven to be able to learn near optimal solutions on classicalcombinatorial problems (e.g., the Traveler Salesman Problem (TSP), Bin Packing Problem (BPP),Vehicle Routing Problem (VRP)). Specifically, NCO relies on Reinforcement Learning (RL) toestimate a Neural Network (NN) model that describes the relation between the space of instances ofthe problem and the solutions for each of them. In other words, this model for a new instance is ableto infer a solution generalizing from the problem space where it has been trained. To this end, duringthe learning process the model takes instances from the learning space, and uses the reward obtainedfrom evaluating the solution to improve its accuracy.The work here presented, contributes to the NCO theory in two main directions. First, this workargues that the performance obtained by sequence-to-sequence models used for NCO in the literatureis improved presenting combinatorial problems as Constrained Markov Decision Processes (CMDP).Such property can be exploited for building a Markovian model that constructs solutionsincrementally based on interactions with the problem. And second, this formulation enables toaddress general constrained combinatorial problems under this framework. In this context, the modelin addition to the reward signal, relies on penalty signals generated from constraint dissatisfactionthat direct the model toward a competitive policy even in highly constrained environments. Thisstrategy allows to extend the number of problems that can be addressed using this technology.The presented approach is validated in the scope of intelligent network management, specifically inthe Virtual Network Function (VNF) placement problem. This problem consists of efficientlymapping a set of network service requests on top of the physical network infrastructure. Particularly,we seek to obtain the optimal placement for a network service chain considering the state of thevirtual environment, so that a specific resource objective is accomplished, in this case theminimization of the overall power consumption. Conducted experiments prove the capability of theproposal for learning competitive solutions when compared to classical heuristic, metaheuristic, andConstraint Programming (CP) solvers

    Neural combinatorial optimization as an enabler technology to design real-time virtual network function placement decision systems

    Get PDF
    158 p.The Fifth Generation of the mobile network (5G) represents a breakthrough technology for thetelecommunications industry. 5G provides a unified infrastructure capable of integrating over thesame physical network heterogeneous services with different requirements. This is achieved thanksto the recent advances in network virtualization, specifically in Network Function Virtualization(NFV) and Software Defining Networks (SDN) technologies. This cloud-based architecture not onlybrings new possibilities to vertical sectors but also entails new challenges that have to be solvedaccordingly. In this sense, it enables to automate operations within the infrastructure, allowing toperform network optimization at operational time (e.g., spectrum optimization, service optimization,traffic optimization). Nevertheless, designing optimization algorithms for this purpose entails somedifficulties. Solving the underlying Combinatorial Optimization (CO) problems that these problemspresent is usually intractable due to their NP-Hard nature. In addition, solutions to these problems arerequired in close to real-time due to the tight time requirements on this dynamic environment. Forthis reason, handwritten heuristic algorithms have been widely used in the literature for achievingfast approximate solutions on this context.However, particularizing heuristics to address CO problems can be a daunting task that requiresexpertise. The ability to automate this resolution processes would be of utmost importance forachieving an intelligent network orchestration. In this sense, Artificial Intelligence (AI) is envisionedas the key technology for autonomously inferring intelligent solutions to these problems. Combining AI with network virtualization can truly transform this industry. Particularly, this Thesis aims at using Neural Combinatorial Optimization (NCO) for inferring endsolutions on CO problems. NCO has proven to be able to learn near optimal solutions on classicalcombinatorial problems (e.g., the Traveler Salesman Problem (TSP), Bin Packing Problem (BPP),Vehicle Routing Problem (VRP)). Specifically, NCO relies on Reinforcement Learning (RL) toestimate a Neural Network (NN) model that describes the relation between the space of instances ofthe problem and the solutions for each of them. In other words, this model for a new instance is ableto infer a solution generalizing from the problem space where it has been trained. To this end, duringthe learning process the model takes instances from the learning space, and uses the reward obtainedfrom evaluating the solution to improve its accuracy.The work here presented, contributes to the NCO theory in two main directions. First, this workargues that the performance obtained by sequence-to-sequence models used for NCO in the literatureis improved presenting combinatorial problems as Constrained Markov Decision Processes (CMDP).Such property can be exploited for building a Markovian model that constructs solutionsincrementally based on interactions with the problem. And second, this formulation enables toaddress general constrained combinatorial problems under this framework. In this context, the modelin addition to the reward signal, relies on penalty signals generated from constraint dissatisfactionthat direct the model toward a competitive policy even in highly constrained environments. Thisstrategy allows to extend the number of problems that can be addressed using this technology.The presented approach is validated in the scope of intelligent network management, specifically inthe Virtual Network Function (VNF) placement problem. This problem consists of efficientlymapping a set of network service requests on top of the physical network infrastructure. Particularly,we seek to obtain the optimal placement for a network service chain considering the state of thevirtual environment, so that a specific resource objective is accomplished, in this case theminimization of the overall power consumption. Conducted experiments prove the capability of theproposal for learning competitive solutions when compared to classical heuristic, metaheuristic, andConstraint Programming (CP) solvers

    Stochastic Dynamic Optimization Under Ambiguity

    Full text link
    Stochastic dynamic optimization methods are powerful mathematical tools for informing sequential decision-making in environments where the outcomes of decisions are uncertain. For instance, the Markov decision process (MDP) has found success in many application areas, including the evaluation and design of treatment and screening protocols for medical decision making. However, the usefulness of these models is only as good as the data used to parameterize them, and multiple competing data sources are common in many application areas. Unfortunately, the recommendations that result from the optimization process can be sensitive to the data used and thus, susceptible to the impacts of ambiguity in the choices regarding the model's construction. To address the issue of ambiguity in MDPs, we introduce the Multi-model MDP (MMDP) which generalizes a standard MDP by allowing for multiple models of the rewards and transition probabilities. The solution of the MMDP is a policy that considers the performance with respect to the different models and allows for the decision-maker (DM) to explicitly trade-off conflicting sources of data. In this thesis, we study this problem in three parts. In the first part, we study the weighted value problem (WVP) in which the DM’s objective is to find a single policy that maximizes the weighted value of expected rewards in each model. We identify two important variants of this problem: the non-adaptive WVP in which the DM must specify the decision-making strategy before the outcome of ambiguity is observed and the adaptive WVP in which the DM is allowed to adapt to the outcomes of ambiguity. To solve these problems, we develop exact methods and fast approximation methods supported by error bounds. Finally, we illustrate the effectiveness and the scalability of our approach using a case study in preventative blood pressure and cholesterol management that accounts for conflicting published cardiovascular risk models. In the second part, we leverage the special structure of the non-adaptive WVP to design exact decomposition methods for solving MMDPs with a larger number of models. We present a branch-and-cut approach to solve a mixed-integer programming formulation of the problem and a custom branch-and-bound approach. Numerical experiments show that a customized implementation of branch-and-bound significantly outperforms branch-and-cut and allows for the solution of MMDPs with larger numbers of models. In the third part, we extend the MMDP beyond the WVP to consider other objective functions that are sensitive to the ambiguity arising from the existence of multiple models. We modify the branch-and-bound procedure to solve these alternate formulations and compare the resulting policies to policies found using tractable heuristics. Using two case studies, we show that the solution to the mean value problem, wherein all parameters take on their mean values, can perform quite well with respect to several measures of performance under ambiguity. In summary, in this dissertation, we present new methods for stochastic dynamic optimization under ambiguity. We represent ambiguity through multiple plausible models of the MDP. We analyze alternative forms of the problems, develop solution methods, and identify properties of the optimal solutions that provide insight into the effects of ambiguity on optimal policies. Although we illustrate our methods on decision-making for medical treatment and machine maintenance, the methods we present in this thesis can be applied to other domains in which optimal sequential decision-making uncertainty is clouded by ambiguity.PHDIndustrial & Operations EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/149947/1/steimle_1.pd

    Many-agent Reinforcement Learning

    Get PDF
    Multi-agent reinforcement learning (RL) solves the problem of how each agent should behave optimally in a stochastic environment in which multiple agents are learning simultaneously. It is an interdisciplinary domain with a long history that lies in the joint area of psychology, control theory, game theory, reinforcement learning, and deep learning. Following the remarkable success of the AlphaGO series in single-agent RL, 2019 was a booming year that witnessed significant advances in multi-agent RL techniques; impressive breakthroughs have been made on developing AIs that outperform humans on many challenging tasks, especially multi-player video games. Nonetheless, one of the key challenges of multi-agent RL techniques is the scalability; it is still non-trivial to design efficient learning algorithms that can solve tasks including far more than two agents (N2N \gg 2), which I name by \emph{many-agent reinforcement learning} (MARL\footnote{I use the world of ``MARL" to denote multi-agent reinforcement learning with a particular focus on the cases of many agents; otherwise, it is denoted as ``Multi-Agent RL" by default.}) problems. In this thesis, I contribute to tackling MARL problems from four aspects. Firstly, I offer a self-contained overview of multi-agent RL techniques from a game-theoretical perspective. This overview fills the research gap that most of the existing work either fails to cover the recent advances since 2010 or does not pay adequate attention to game theory, which I believe is the cornerstone to solving many-agent learning problems. Secondly, I develop a tractable policy evaluation algorithm -- αα\alpha^\alpha-Rank -- in many-agent systems. The critical advantage of αα\alpha^\alpha-Rank is that it can compute the solution concept of α\alpha-Rank tractably in multi-player general-sum games with no need to store the entire pay-off matrix. This is in contrast to classic solution concepts such as Nash equilibrium which is known to be PPADPPAD-hard in even two-player cases. αα\alpha^\alpha-Rank allows us, for the first time, to practically conduct large-scale multi-agent evaluations. Thirdly, I introduce a scalable policy learning algorithm -- mean-field MARL -- in many-agent systems. The mean-field MARL method takes advantage of the mean-field approximation from physics, and it is the first provably convergent algorithm that tries to break the curse of dimensionality for MARL tasks. With the proposed algorithm, I report the first result of solving the Ising model and multi-agent battle games through a MARL approach. Fourthly, I investigate the many-agent learning problem in open-ended meta-games (i.e., the game of a game in the policy space). Specifically, I focus on modelling the behavioural diversity in meta-games, and developing algorithms that guarantee to enlarge diversity during training. The proposed metric based on determinantal point processes serves as the first mathematically rigorous definition for diversity. Importantly, the diversity-aware learning algorithms beat the existing state-of-the-art game solvers in terms of exploitability by a large margin. On top of the algorithmic developments, I also contribute two real-world applications of MARL techniques. Specifically, I demonstrate the great potential of applying MARL to study the emergent population dynamics in nature, and model diverse and realistic interactions in autonomous driving. Both applications embody the prospect that MARL techniques could achieve huge impacts in the real physical world, outside of purely video games

    A Survey on Quantum Reinforcement Learning

    Full text link
    Quantum reinforcement learning is an emerging field at the intersection of quantum computing and machine learning. While we intend to provide a broad overview of the literature on quantum reinforcement learning (our interpretation of this term will be clarified below), we put particular emphasis on recent developments. With a focus on already available noisy intermediate-scale quantum devices, these include variational quantum circuits acting as function approximators in an otherwise classical reinforcement learning setting. In addition, we survey quantum reinforcement learning algorithms based on future fault-tolerant hardware, some of which come with a provable quantum advantage. We provide both a birds-eye-view of the field, as well as summaries and reviews for selected parts of the literature.Comment: 62 pages, 16 figure

    Automatic Algorithm Selection for Complex Simulation Problems

    Get PDF
    To select the most suitable simulation algorithm for a given task is often difficult. This is due to intricate interactions between model features, implementation details, and runtime environment, which may strongly affect the overall performance. The thesis consists of three parts. The first part surveys existing approaches to solve the algorithm selection problem and discusses techniques to analyze simulation algorithm performance.The second part introduces a software framework for automatic simulation algorithm selection, which is evaluated in the third part.Die Auswahl des passendsten Simulationsalgorithmus für eine bestimmte Aufgabe ist oftmals schwierig. Dies liegt an der komplexen Interaktion zwischen Modelleigenschaften, Implementierungsdetails und Laufzeitumgebung. Die Arbeit ist in drei Teile gegliedert. Der erste Teil befasst sich eingehend mit Vorarbeiten zur automatischen Algorithmenauswahl, sowie mit der Leistungsanalyse von Simulationsalgorithmen. Der zweite Teil der Arbeit stellt ein Rahmenwerk zur automatischen Auswahl von Simulationsalgorithmen vor, welches dann im dritten Teil evaluiert wird
    corecore