120 research outputs found

    Population-Based Reinforcement Learning for Combinatorial Optimization

    Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam-search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple theoretically grounded training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on three popular NP-hard problems: the traveling salesman (TSP), the capacitated vehicle routing (CVRP), and 0-1 knapsack (KP) problems. On TSP specifically, Poppy outperforms the previous state-of-the-art, dividing the optimality gap by 5 while reducing the inference time by more than an order of magnitude.
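
The idea of rolling out a population of policies on the same instance and keeping the best result can be sketched as follows. This is only an illustration of the inference step, not Poppy's training procedure: the "policies" are hand-written nearest-neighbour heuristics with different start cities (an assumption standing in for learned policies), and the names `tour_length`, `nearest_neighbor_from`, and `best_of_population` are hypothetical.

```python
import math

def tour_length(tour, pts):
    """Length of the closed tour visiting pts in the given order."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))

def nearest_neighbor_from(start):
    """Stand-in 'policy' (assumption): nearest-neighbour tour from a fixed start city."""
    def policy(pts):
        tour, left = [start], set(range(len(pts))) - {start}
        while left:
            # Break distance ties by city index so the result is deterministic.
            nxt = min(left, key=lambda j: (math.dist(pts[tour[-1]], pts[j]), j))
            tour.append(nxt)
            left.remove(nxt)
        return tour
    return policy

def best_of_population(policies, pts):
    """Roll out every policy on the same instance and keep the shortest tour."""
    tours = [policy(pts) for policy in policies]
    return min(tours, key=lambda t: tour_length(t, pts))

pts = [(0, 0), (3, 0), (3, 1), (0, 1), (1.5, 5)]
population = [nearest_neighbor_from(s) for s in range(len(pts))]
best = best_of_population(population, pts)
```

By construction, the population's result is never worse than that of any single member, which is why complementary members are useful.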

    Multi-Period Stochastic Resource Planning: Models, Algorithms and Applications

    This research addresses the problem of sequential decision making under uncertainty in the professional service industry. Specifically, it considers the problem of dynamically assigning resources to tasks in a stochastic environment with uncertainty both in resource availability, due to attrition, and in job availability, due to unknown project bid outcomes. The problem is motivated by the resource planning application at Hewlett Packard (HP) Enterprises. The challenge is to provide resource planning support over a time horizon under internal resource attrition and demand uncertainty. To ensure demand is satisfied, external contingent resources can be engaged to make up for internal resource attrition. The objective is to maximize profitability by identifying the optimal mix of internal and contingent resources and their assignments to project tasks under explicit uncertainty. While sequential decision problems under uncertainty can often be modeled as a Markov decision process (MDP), the classical dynamic programming (DP) method based on Bellman's equation suffers from the well-known curses of dimensionality and only works for small instances. To tackle this challenge, this research focuses on developing computationally tractable closed-loop Approximate Dynamic Programming (ADP) algorithms that obtain near-optimal solutions in reasonable computational time. Various schemes are developed to approximate the cost-to-go function. A comprehensive computational experiment is conducted to investigate the performance and behavior of the ADP algorithm, which is also compared against a rolling horizon approach as a benchmark. Computational results show that the optimization model and algorithm developed in this thesis offer solutions with higher profitability and higher utilization of internal resources for companies in the professional service industry.

    On a Vehicle Routing Problem with Customer Costs and Multi Depots

    The Vehicle Routing Problem with Customer Costs (VRPCC) was developed for railway maintenance scheduling. Corrective maintenance jobs for unexpected failures are planned over a short time horizon. These jobs are geographically distributed across the railway network. Depending on the severity of a failure, it can be necessary to reduce the top speed on the track section in order to avoid safety risks or overly fast deterioration. For fatal failures, it can even be necessary to close the track section. The resulting limitations on railway service lead to penalty costs for the maintenance operator, which must be paid until the track is repaired and the restrictions are removed. Scheduling the corresponding maintenance tasks earlier reduces these penalty costs, but may in return increase the costs of moving the maintenance machines and crews. The VRPCC was developed for this scheduling problem. For each maintenance vehicle and crew, it defines a route that gives the order in which maintenance tasks are processed. Two kinds of costs are considered: first, travel costs for machinery and crew; and second, penalty costs for an unsafe track condition, which have to be paid for each day from failure detection to maintenance completion. To model the penalties, novel customer costs are defined: each maintenance activity is given a customer cost coefficient, which is incurred for each day between failure detection and failure repair. The objective function is the sum of travel costs and time-dependent customer costs, so the priority of customers can be taken into account without losing sight of travel costs. This new vehicle routing problem is introduced in the thesis by a non-linear partition and permutation model. In this model, a feasible solution is defined by a partition of the job set into subsets, representing the allocation of jobs to vehicles, and a permutation of each subset, representing the order in which the jobs are processed. The start times of the jobs are then calculated from the order given by the permutations, taking into account that work can only be done in eight-hour shifts during the night. Based on the start times, the customer cost value of each job is computed, which equals the penalty costs paid. The cost of a schedule is then the sum of travel costs and customer costs. To solve the VRPCC with a commercial linear programming solver, different formulations of the VRPCC as a mixed-integer linear program were developed, in which the start times become decision variables. Including customer costs turned out to make the problems harder to solve than vehicle routing problems in which only travel costs are minimized. Several construction heuristics for the VRPCC were also designed and investigated, and two local search algorithms, first improvement and best improvement, were applied. The computational experiments showed that the solutions generated by local search were much better than those of the construction heuristics. The main part of the thesis was the design of a Branch-and-Bound algorithm for the VRPCC. For this purpose, new lower bounds for the customer cost part of the objective function were formulated. The computational experiments showed that a lower bound computed from the LP relaxation of a specific bin packing problem offered the best trade-off between computational effort and bound quality. For the travel cost part of the objective function, several known lower bounds from the TSP were compared. Besides efficient lower bounds, a Branch-and-Bound algorithm also needs suitable branching strategies to split the problem space into smaller subspaces. Two branching strategies were developed, both based on the non-linear partition and permutation model in order to take advantage of the problem structure. More precisely, new branches are generated by appending a job to, or including a job in, an incomplete schedule. Consequently, the start times can be computed directly from the jobs planned so far, and tighter lower bounds can be computed for the jobs not yet planned. In computational experiments, the developed Branch-and-Bound algorithms were compared with the classical approach of solving a mixed-integer linear program of the VRPCC with a commercial solver. The results showed that both Branch-and-Bound algorithms solved the small instances faster than the classical approach.
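
The partition and permutation view of a schedule lends itself to a short cost evaluation. The sketch below is a simplified illustration, not the thesis model: it assumes a single depot, that travel time equals travel cost, that every failure is detected at time 0, and it ignores the eight-hour night shifts. The function name `schedule_cost` and all data are hypothetical.

```python
def schedule_cost(routes, dist, duration, coeff, depot=0):
    """Evaluate a partition-and-permutation solution.

    routes:   one ordered job list per vehicle (the partition and its permutations)
    dist:     matrix of travel costs; also used as travel time (assumption)
    duration: processing time of each job
    coeff:    customer cost coefficient per job, charged per time unit until repair
    Returns (travel_cost, customer_cost).
    """
    travel, customer = 0.0, 0.0
    for route in routes:
        t, pos = 0.0, depot
        for job in route:
            travel += dist[pos][job]
            t += dist[pos][job] + duration[job]  # completion time of this job
            customer += coeff[job] * t           # penalty accrues until the repair is done
            pos = job
        if route:
            travel += dist[pos][depot]           # vehicle returns to the depot
    return travel, customer

dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]  # node 0 is the depot
duration = [0, 1, 1]
coeff = [0, 5, 1]                         # job 1 is the urgent one

urgent_first = schedule_cost([[1, 2]], dist, duration, coeff)
urgent_last = schedule_cost([[2, 1]], dist, duration, coeff)
two_vehicles = schedule_cost([[1], [2]], dist, duration, coeff)
```

In this toy instance both single-vehicle orders have the same travel cost, so the customer cost alone decides between them, which is the prioritisation effect the abstract describes.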

    A novel class of scheduling policies for the stochastic resource-constrained project scheduling problem.

    We study the resource-constrained project scheduling problem with stochastic activity durations. We introduce a new class of scheduling policies for this problem, which make a number of a-priori sequencing decisions in a pre-processing phase, while the remaining decisions are made dynamically during project execution. The pre-processing decisions entail the addition of precedence constraints to the scheduling instance, thereby resolving some potential resource conflicts. We compare the performance of this new class with existing scheduling policies for the stochastic resource-constrained project scheduling problem, and we observe that the new class is significantly better when the variability in the activity durations is medium to high.
    Keywords: Project scheduling; Uncertainty; Stochastic activity durations; Scheduling policies

    A priori and on-line route optimization for unmanned underwater vehicles

    Thesis (S.M.)--Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 155-156).
    The U.S. military considers Unmanned Underwater Vehicles (UUVs) a critical component of the future for two primary reasons: they are effective force multipliers and a significant risk-reducing agent. As the military's technology improves and UUVs become a reliable mission asset, the vehicle's ability to make intelligent decisions will be crucial to future operations. The thesis develops various algorithms to solve the UUV Mission-Planning Problem (UUVMPP), where the UUV must choose which tasks to perform in which sequence in a stochastic mission environment. The objective is to find the most profitable way to execute tasks subject to restrictions on total mission time, energy, time-restricted areas, and weather conditions. Since the UUV accumulates navigation error over time while maneuvering underwater, it must occasionally halt operations to re-orient itself via a navigation fix. While a navigation fix takes time and increases the likelihood of exposing the vehicle's position to potential adversaries, a reduction in navigation error allows the UUV to perform tasks and navigate with greater certainty. The algorithms presented in this thesis successfully incorporate navigation fixes into the mission-planning process. The thesis considers Mixed-Integer Programming, Exact Dynamic Programming, and an Approximate Dynamic Programming technique known as Rollout to determine the optimal a priori route that meets operational constraints with a specified probability. The thesis then shows how these formulations can solve and re-solve the UUVMPP on-line. In particular, the Rollout Algorithm finds task route solutions that reach on average 96% of the optimal solution a priori and 98% of the optimal solution on-line, compared to exact algorithms; with a significant reduction in computation run time, the Rollout Algorithm permits the solving of increasingly complex mission scenarios.
    by Brian A. Crimmel. S.M
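
A rollout step can be sketched as follows: each candidate next task is scored by its own reward plus what a simple base heuristic would collect afterwards, and the best-scoring candidate is chosen. This is a deterministic toy under loudly stated assumptions (tasks on a line, a single time budget, a nearest-task base heuristic, no navigation fixes or uncertainty); it is not the thesis formulation, and the names `reachable`, `greedy_value`, and `rollout_route` are hypothetical.

```python
def reachable(pos, remaining, time_left, tasks):
    """Tasks (position, reward) that can still be reached within the time budget."""
    return [j for j in sorted(remaining) if abs(tasks[j][0] - pos) <= time_left]

def greedy_value(pos, remaining, time_left, tasks):
    """Base heuristic (assumption): always go to the nearest reachable task."""
    total, remaining = 0.0, set(remaining)
    while True:
        cand = reachable(pos, remaining, time_left, tasks)
        if not cand:
            return total
        j = min(cand, key=lambda k: abs(tasks[k][0] - pos))
        time_left -= abs(tasks[j][0] - pos)
        pos = tasks[j][0]
        total += tasks[j][1]
        remaining.remove(j)

def rollout_route(tasks, budget, start=0.0):
    """Choose each next task by simulating the base heuristic after it."""
    pos, time_left = start, budget
    remaining = set(range(len(tasks)))
    route, total = [], 0.0
    while True:
        cand = reachable(pos, remaining, time_left, tasks)
        if not cand:
            return route, total

        def score(j):
            # Immediate reward plus the heuristic's reward from the resulting state.
            d = abs(tasks[j][0] - pos)
            return tasks[j][1] + greedy_value(tasks[j][0], remaining - {j}, time_left - d, tasks)

        j = max(cand, key=score)
        time_left -= abs(tasks[j][0] - pos)
        pos = tasks[j][0]
        total += tasks[j][1]
        route.append(j)
        remaining.remove(j)

tasks = [(1, 1), (-2, 10), (2, 1)]  # (position, reward); hypothetical data
base = greedy_value(0.0, {0, 1, 2}, 4, tasks)
route, value = rollout_route(tasks, 4)
```

On this instance the base heuristic alone is lured towards the two nearby low-reward tasks, while the one-step lookahead picks the distant high-reward task.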

    Adaptive Information Gathering via Imitation Learning

    In the adaptive information gathering problem, a policy is required to select an informative sensing location using the history of measurements acquired thus far. While there is an extensive amount of prior work investigating effective practical approximations using variants of Shannon's entropy, the efficacy of such policies heavily depends on the geometric distribution of objects in the world. On the other hand, the principled approach of employing online POMDP solvers is rendered impractical by the need to explicitly sample online from a posterior distribution of world maps. We present a novel data-driven imitation learning framework to efficiently train information gathering policies. The policy imitates a clairvoyant oracle - an oracle that at train time has full knowledge about the world map and can compute maximally informative sensing locations. We analyze the learnt policy by showing that offline imitation of a clairvoyant oracle is implicitly equivalent to online oracle execution in conjunction with posterior sampling. This observation allows us to obtain powerful near-optimality guarantees for information gathering problems possessing an adaptive sub-modularity property. As demonstrated on a spectrum of 2D and 3D exploration problems, the trained policies enjoy the best of both worlds - they adapt to different world map distributions while being computationally inexpensive to evaluate.
    Comment: Robotics Science and Systems, 201

    BQ-NCO: Bisimulation Quotienting for Efficient Neural Combinatorial Optimization

    Despite the success of neural-based combinatorial optimization methods for end-to-end heuristic learning, out-of-distribution generalization remains a challenge. In this paper, we present a novel formulation of Combinatorial Optimization Problems (COPs) as Markov Decision Processes (MDPs) that effectively leverages common symmetries of COPs to improve out-of-distribution robustness. Starting from a direct MDP formulation of a constructive method, we introduce a generic way to reduce the state space, based on Bisimulation Quotienting (BQ) in MDPs. Then, for COPs with a recursive nature, we specialize the bisimulation and show how the reduced state exploits the symmetries of these problems and facilitates MDP solving. Our approach is principled and we prove that an optimal policy for the proposed BQ-MDP actually solves the associated COPs. We illustrate our approach on five classical problems: the Euclidean and Asymmetric Traveling Salesman, Capacitated Vehicle Routing, Orienteering and Knapsack Problems. Furthermore, for each problem, we introduce a simple attention-based policy network for the BQ-MDPs, which we train by imitation of (near) optimal solutions of small instances from a single distribution. We obtain new state-of-the-art results for the five COPs on both synthetic and realistic benchmarks. Notably, in contrast to most existing neural approaches, our learned policies show excellent generalization performance to much larger instances than seen during training, without any additional search procedure

    The multilevel critical node problem : theoretical intractability and a curriculum learning approach

    Evaluating the vulnerability of networks is a problem that has gained momentum in recent decades. In this work, we focus on a multilevel programming approach to study the defense of critical infrastructures against malicious attacks. We analyze a three-stage sequential game played on a graph called the Multilevel Critical Node problem (MCN). This game sees two players competing with each other: a defender and an attacker. The defender starts by preventively interdicting nodes from being attacked during what is called a vaccination phase. Then, the attacker infects a subset of non-vaccinated nodes and, finally, the defender reacts with a protection strategy. We provide the first computational complexity results associated with MCN and its subgames. Moreover, by considering unitary, weighted, undirected and directed graphs, we clarify how the theoretical tractability or intractability of those problems varies. Our findings contribute new NP-complete, $\Sigma_2^p$-complete and $\Sigma_3^p$-complete problems. Motivated by the intrinsic intractability of the MCN, we then design efficient heuristics for the game by building upon recent approaches that seek to learn heuristics for combinatorial optimization problems through graph neural networks and reinforcement learning. Contrary to previous work, however, we tackle situations with multiple players taking decisions sequentially. By framing them in a multi-agent reinforcement learning setting, we devise a value-based method to learn to solve multilevel budgeted combinatorial problems involving two players in a zero-sum game over a graph. Our framework is based on a simple curriculum: if an agent knows how to estimate the value of instances with budgets up to B, then solving instances with budget B+1 can be done in polynomial time regardless of the direction of the optimization by checking the value of every possible afterstate. Thus, in a bottom-up approach, we generate datasets of heuristically solved instances with increasingly larger budgets to train our agent. We report results close to optimality on graphs of up to 100 nodes and a 185x speedup on average compared to the quickest exact solver known for the MCN.
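
The curriculum step (use a value estimate for budgets up to B to solve budget B+1 by scanning afterstates) can be sketched as a one-step lookahead. The game below is a deliberately tiny stand-in, not the MCN: a state is a tuple of node weights, an action removes one node, and the toy `value` function is exact rather than learned. All names (`lookahead_action`, `actions`, `step`, `value`) are hypothetical.

```python
def lookahead_action(state, budget, actions, step, value, maximize):
    """Solve a budget-`budget` instance with a value estimate for budget - 1.

    Every afterstate is scored once, so the cost is linear in the number of
    actions; `maximize` selects the direction of the optimization.
    """
    choose = max if maximize else min
    return choose(actions(state), key=lambda a: value(step(state, a), budget - 1))

# Toy game (assumption): removing node `a` costs one unit of budget.
def actions(state):
    return range(len(state))

def step(state, a):
    return state[:a] + state[a + 1:]

def value(state, budget):
    """Surviving weight if a minimizing player removes the `budget` heaviest nodes."""
    return sum(sorted(state)[:max(len(state) - budget, 0)])

state = (3, 7, 5)
attack = lookahead_action(state, 1, actions, step, value, maximize=False)
defend = lookahead_action(state, 1, actions, step, value, maximize=True)
```

Here the minimizing player removes the heaviest node (index 1) and a maximizing player would give up the lightest one (index 0); in the abstract's setting the `value` function would be the trained agent's estimate rather than a closed form.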